seqdata.read_genome_fasta

seqdata.read_genome_fasta(name, out, fasta, bed, batch_size, fixed_length, n_threads=1, alphabet=None, max_jitter=0, overwrite=False)

Reads sequences from a “genome” FASTA file into an xarray Dataset.

We differentiate between “flat” and “genome” FASTA files. A flat FASTA file is one where each contig in the FASTA file is a sequence in our dataset. A genome FASTA file is one where we may pull out multiple subsequences from a given contig.
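The distinction can be illustrated with a toy sketch in plain Python. It does not call seqdata. The FASTA text, the `parse_fasta` helper, and the region tuples are all hypothetical, and the regions are assumed to use 0-based, half-open coordinates as in the BED convention.

```python
# Toy illustration only: plain Python, no seqdata calls.
FASTA_TEXT = """>chr1
ACGTACGTAC
>chr2
GGGTTTAAAC
"""


def parse_fasta(text):
    """Return {contig_name: sequence} from FASTA-formatted text (hypothetical helper)."""
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first token after ">".
            name = line[1:].split()[0]
            contigs[name] = []
        else:
            contigs[name].append(line)
    return {key: "".join(parts) for key, parts in contigs.items()}


contigs = parse_fasta(FASTA_TEXT)

# "Flat" reading: every contig is one sequence in the dataset.
flat_sequences = list(contigs.values())

# "Genome" reading: several subsequences may come from the same contig.
# Regions are (contig, start, end), assumed 0-based and half-open.
regions = [("chr1", 0, 4), ("chr1", 4, 8), ("chr2", 3, 9)]
genome_sequences = [contigs[contig][start:end] for contig, start, end in regions]
```

In the flat reading the dataset has two entries, one per contig. In the genome reading it has three, two of which come from `chr1`.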

Parameters:
  • name (str) – Name of the sequence variable in the output dataset.

  • out (PathType) – Path to the output Zarr store where the data will be saved. Usually something like /path/to/dataset_name.zarr.

  • fasta (PathType) – Path to the input FASTA file.

  • bed (Union[PathType, pd.DataFrame]) – Path to the input BED file or a pandas DataFrame with the BED data. Used to define the regions of the genome to pull out. TODO: what does the BED have to have?

  • batch_size (int) – Number of sequences to read at a time. Use as many as you can fit in memory.

  • fixed_length (Union[int, bool]) – Whether your sequences have a fixed length or not. If they do, the data is stored in a 2D array of bytes; otherwise it is stored as unicode strings.

  • n_threads (int) – Number of threads to use for reading the FASTA file.

  • alphabet (Optional[Union[str, sp.NucleotideAlphabet]]) – Alphabet to use for reading sequences.

  • max_jitter (int) – Maximum amount of jitter anticipated. This reads max_jitter/2 extra bases of sequence on either side of each region defined by the BED file. This is useful for training models on coverage data.

  • overwrite (bool) – Whether to overwrite the output Zarr store if it already exists.
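The effect of max_jitter on the region that gets read can be sketched as simple interval arithmetic. This is not seqdata's implementation. The `pad_for_jitter` helper is hypothetical, and both the integer division and the clamping to contig boundaries are assumptions made for illustration.

```python
def pad_for_jitter(start, end, max_jitter, contig_length):
    """Widen a 0-based, half-open region by max_jitter/2 on each side.

    Hypothetical helper: integer division and clamping to
    [0, contig_length] are assumptions, not documented behavior.
    """
    pad = max_jitter // 2
    padded_start = max(0, start - pad)
    padded_end = min(contig_length, end + pad)
    return padded_start, padded_end


# A region well inside the contig grows by 16 on each side.
interior = pad_for_jitter(100, 200, max_jitter=32, contig_length=1000)

# A region near the contig edges is clamped under this sketch's assumption.
near_edge = pad_for_jitter(5, 105, max_jitter=32, contig_length=110)
```

With max_jitter=32, a 100-base region is read as 132 bases when it lies away from the contig edges, which leaves room to shift the window by up to 16 positions in either direction.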

Return type:

Dataset
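A call might look like the following sketch, which uses only the parameters listed in the signature above. The wrapper `build_dataset`, the variable name "seq", and the chosen values for batch_size and n_threads are illustrative assumptions. The import is placed inside the function so that the definition itself does not require seqdata to be installed.

```python
def build_dataset(fasta_path, bed_path, out_path):
    """Hypothetical wrapper showing one way to call read_genome_fasta."""
    import seqdata  # imported lazily; requires seqdata to be installed

    return seqdata.read_genome_fasta(
        name="seq",          # name of the sequence variable (illustrative)
        out=out_path,        # e.g. /path/to/dataset_name.zarr
        fasta=fasta_path,    # genome FASTA file
        bed=bed_path,        # BED file defining the regions to pull out
        batch_size=1024,     # illustrative; size this to available memory
        fixed_length=False,  # variable-length regions, stored as unicode strings
        n_threads=4,         # illustrative thread count
        max_jitter=0,
        overwrite=False,
    )
```

Passing every argument by keyword keeps the call readable, since the function takes six required parameters in a fixed order.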