seqdata.read_bam

seqdata.read_bam(seq_name, cov_name, out, fasta, bams, samples, bed, batch_size, fixed_length, n_jobs=1, threads_per_job=1, alphabet=None, dtype=<class 'numpy.uint16'>, max_jitter=0, overwrite=False)

Read in sequences with coverage from one or more BAM files.

Parameters:

seq_name (str) – Name of the sequence variable in the output dataset.
cov_name (str) – Name of the coverage variable in the output dataset.
out (PathType) – Path to the output Zarr store where the data will be saved. Usually something like /path/to/dataset_name.zarr.
fasta (PathType) – Path to the reference genome.
bams (ListPathType) – List of paths to BAM files. Can be a single file or a list of files.
samples (List[str]) – List of sample names to include. Should be the same length as bams.
bed (Union[PathType, pd.DataFrame]) – Path to a BED file or a DataFrame with columns “chrom”, “start”, and “end”.
batch_size (int) – Number of regions to read at once. Use as many as you can fit in memory.
fixed_length (Union[int, bool]) – Whether your sequences have a fixed length or not. If they do, the data will be stored in a 2D array as bytes, otherwise it will be stored as unicode strings.
n_jobs (int) – Number of parallel jobs. Use if you have multiple BAM files.
threads_per_job (int) – Number of threads per job.
alphabet (Optional[Union[str, sp.NucleotideAlphabet]]) – Alphabet of the sequences.
dtype (Union[str, Type[np.number]]) – Data type to use for coverage.
max_jitter (int) – Maximum jitter to use when sampling regions. An extra max_jitter/2 of sequence is read on either side of each region defined in the BED file, which is useful for training models on coverage data.
overwrite (bool) – Whether to overwrite an existing dataset.
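
The max_jitter description above amounts to simple interval arithmetic, sketched below. This is only an illustration of that arithmetic, not SeqData's implementation: padded_interval is a hypothetical helper, and rounding odd values down and clamping at position 0 are assumptions that the documentation does not state.

```python
def padded_interval(start: int, end: int, max_jitter: int) -> tuple[int, int]:
    """Widen a BED interval by max_jitter/2 on each side (illustrative only)."""
    if max_jitter < 0:
        raise ValueError("max_jitter must be non-negative")
    pad = max_jitter // 2  # assumption: odd values round down
    # Assumption: the left edge is clamped so it never goes below position 0.
    return max(0, start - pad), end + pad


# A 1,000 bp region with max_jitter=128 gains 64 bp on each side.
start, end = padded_interval(5_000, 6_000, max_jitter=128)
print(start, end, end - start)
```

A training loop could then crop a randomly offset window of the original width out of the padded region, which is the usual reason for reading the extra sequence.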

Returns:

Dataset with dimensions “_sequence” TODO: what are the dimensions?

Return type:

xr.Dataset
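
A call to this function might be assembled as in the sketch below. All file paths, sample names, and variable names are placeholders, and read_bam_kwargs is a hypothetical helper (not part of SeqData) that only checks the documented requirement that bams and samples have the same length. The call itself is left as a comment because it needs SeqData installed and real input files.

```python
from pathlib import Path


def read_bam_kwargs(bams, samples, **rest):
    """Hypothetical helper: collect read_bam arguments and check lengths."""
    # Accept a single path as well as a list of paths.
    bams = [bams] if isinstance(bams, (str, Path)) else list(bams)
    samples = list(samples)
    if len(bams) != len(samples):
        raise ValueError(
            f"got {len(bams)} BAM files but {len(samples)} sample names"
        )
    return {"bams": bams, "samples": samples, **rest}


kwargs = read_bam_kwargs(
    bams=["sample_a.bam", "sample_b.bam"],  # placeholder paths
    samples=["sample_a", "sample_b"],       # one name per BAM file
    seq_name="seq",
    cov_name="cov",
    out="dataset_name.zarr",                # placeholder Zarr store
    fasta="genome.fa",                      # placeholder reference genome
    bed="regions.bed",                      # placeholder BED file
    batch_size=512,
    fixed_length=True,
    n_jobs=2,                               # one job per BAM file here
    max_jitter=128,
)

# With SeqData installed and the files above present, the call would be:
#   import seqdata
#   sdata = seqdata.read_bam(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to validate them before starting a read that may take a long time.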