seqdata.read_flat_fasta
- seqdata.read_flat_fasta(name, out, fasta, batch_size, fixed_length, n_threads=1, overwrite=False)
Reads sequences from a “flat” FASTA file into an xarray Dataset.
We differentiate between “flat” and “genome” FASTA files. A flat FASTA file is one where each contig in the FASTA file is a sequence in our dataset. A genome FASTA file is one where we may pull out multiple subsequences from a given contig.
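The distinction can be sketched in plain Python. This is only an illustration of the two conventions, not seqdata's implementation; the helper `iter_fasta_records` and the example data are hypothetical.

```python
import io


def iter_fasta_records(handle):
    """Yield (name, sequence) pairs, one per FASTA record (hypothetical helper)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Flat FASTA: every contig is itself one sequence in the dataset.
flat = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nTTGA\n")
flat_sequences = list(iter_fasta_records(flat))

# Genome FASTA: a contig is a source from which several
# subsequences (here, two overlapping windows) may be extracted.
genome = io.StringIO(">chr1\nACGTACGTAC\n")
contigs = dict(iter_fasta_records(genome))
windows = [contigs["chr1"][0:4], contigs["chr1"][2:6]]
```

In the flat case the number of dataset sequences equals the number of records in the file; in the genome case it depends on how many regions are requested.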
- Parameters:
name (str) – Name of the sequence variable in the output dataset.
out (PathType) – Path to the output Zarr store where the data will be saved. Usually something like /path/to/dataset_name.zarr.
fasta (PathType) – Path to the input FASTA file.
batch_size (int) – Number of sequences to read at a time. Use the largest value that fits in memory.
fixed_length (bool) – Whether all sequences have the same length. If they do, the data will be stored in a 2D array as bytes; otherwise it will be stored as unicode strings.
n_threads (int) – Number of threads to use for reading the FASTA file.
overwrite (bool) – Whether to overwrite the output Zarr store if it already exists.
- Returns:
The output dataset.
- Return type:
xr.Dataset
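To make the effect of fixed_length concrete, the following NumPy sketch contrasts the two layouts described above. It is an assumption-laden illustration: the exact dtypes and dimension names that seqdata writes are not stated here, and the variable names are hypothetical.

```python
import numpy as np

# Sequences of equal length can be packed into a 2D array of single bytes:
# one row per sequence, one column per position.
equal_length = ["ACGT", "TTGA", "CCAT"]
as_bytes = np.frombuffer("".join(equal_length).encode("ascii"), dtype="S1")
fixed = as_bytes.reshape(len(equal_length), -1)

# Sequences of differing length cannot form a rectangular array,
# so each one is kept whole as a string in a 1D array.
# (dtype=object is an assumption made for this sketch.)
variable_length = ["ACGT", "TT", "CCATGG"]
ragged = np.array(variable_length, dtype=object)
```

The 2D byte layout allows slicing by position across all sequences at once (for example `fixed[:, 0]` for the first base of every sequence), which the string layout does not.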