seqdata.read_vcf

seqdata.read_vcf(name, out, vcf, fasta, samples, bed, batch_size, fixed_length, n_threads=1, samples_per_chunk=10, alphabet=None, max_jitter=0, overwrite=False, splice=False)

Read a VCF file, save the data to a Zarr store, and return it as a Dataset.

Parameters:

name (str) – Name of the sequence variable in the output dataset.
out (PathType) – Path to the output Zarr store where the data will be saved. Usually something like /path/to/dataset_name.zarr.
vcf (PathType) – Path to the VCF file.
fasta (PathType) – Path to the reference genome.
samples (List[str]) – List of sample names to include.
bed (Union[PathType, pd.DataFrame]) – Path to a BED file or a DataFrame with columns “chrom”, “start”, and “end”.
batch_size (int) – Number of regions to read at once. Use the largest value that fits in memory.
fixed_length (Union[int, bool]) – Whether the sequences have a fixed length. If they do, the data is stored in a 2D array as bytes; otherwise it is stored as unicode strings.
n_threads (int) – Number of threads to use for reading the VCF file.
samples_per_chunk (int) – Number of samples to read at a time.
alphabet (Optional[Union[str, sp.NucleotideAlphabet]]) – Alphabet of the sequences.
max_jitter (int) – Maximum jitter to use for sampling regions.
overwrite (bool) – Whether to overwrite an existing dataset.
splice (bool) – TODO

Returns:

xarray dataset

Return type:

xr.Dataset
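
The following is a minimal sketch of how the arguments above might be assembled, not an official example. It writes a small BED file with the standard library and collects the keyword arguments for read_vcf. The file names, sample names, region coordinates, and the helpers write_bed and load_dataset are all hypothetical, and the VCF and FASTA paths are placeholders that are not created here. The actual call is kept inside load_dataset so the rest of the script runs without seqdata installed.

```python
import tempfile
from pathlib import Path


def write_bed(path, regions):
    """Write (chrom, start, end) tuples to a tab-separated BED file."""
    path = Path(path)
    with path.open("w") as fh:
        for chrom, start, end in regions:
            fh.write(f"{chrom}\t{start}\t{end}\n")
    return path


def load_dataset(kwargs):
    """Call seqdata.read_vcf; needs seqdata installed and real input files."""
    import seqdata as sd  # imported lazily so the setup code runs without it

    return sd.read_vcf(**kwargs)


# Hypothetical working directory and regions, for illustration only.
workdir = Path(tempfile.mkdtemp())
bed_path = write_bed(
    workdir / "regions.bed",
    [("chr1", 100, 300), ("chr2", 500, 700)],
)

kwargs = dict(
    name="seq",                                # name of the sequence variable
    out=str(workdir / "dataset_name.zarr"),    # output Zarr store
    vcf=str(workdir / "variants.vcf.gz"),      # placeholder path (assumption)
    fasta=str(workdir / "reference.fa"),       # placeholder path (assumption)
    samples=["sample_1", "sample_2"],          # hypothetical sample names
    bed=str(bed_path),                         # a path; a DataFrame is also documented
    batch_size=2,                              # regions read at once
    fixed_length=False,                        # store as variable-length strings
    n_threads=1,
    samples_per_chunk=10,
    overwrite=False,
)
```

Once the placeholder VCF and FASTA paths point at real files, calling load_dataset(kwargs) should return the xr.Dataset described under Returns. The optional arguments alphabet, max_jitter, and splice are left at their defaults here.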