seqdata.label_overlapping_regions

seqdata.label_overlapping_regions(sdata, targets, mode, label_dim=None, fraction_overlap=None)

Label regions for binary or multitask classification based on whether they overlap with another set of regions.

Parameters:
  • sdata (xr.Dataset) – Dataset containing the regions to label.

  • targets (Union[str, Path, pd.DataFrame, List[str]]) – The target regions to test for overlap. Either a DataFrame (or a path to one) with at least the columns [‘chrom’, ‘chromStart’, ‘chromEnd’], or a list of variable names in sdata that correspond to the [‘chrom’, ‘chromStart’, ‘chromEnd’] columns, in that order. The list form is useful if, for example, another set of regions is already stored in the sdata object under different column names. Multitask classification additionally requires a ‘name’ column (i.e. binary requires BED3 format, multitask requires BED4).

  • mode (Literal["binary", "multitask"]) – Whether to label regions for binary classification (does the region intersect any of the target regions?) or multitask classification (which target region does it intersect?).

  • label_dim (str, optional) – Name of the label dimension. Only needed for multitask classification.

  • fraction_overlap (float, optional) – Minimum fraction of each region's length that must overlap for the pair to count as overlapping. This is the “reciprocal minimal overlap fraction” as described in the [bedtools documentation](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html#r-and-f-requiring-reciprocal-minimal-overlap-fraction).
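
The reciprocal criterion behind fraction_overlap can be illustrated with a small pure-Python sketch. This is not seqdata's implementation: the helper name `reciprocal_overlap` is hypothetical, intervals are assumed to be half-open BED-style coordinates, and the use of `>=` at the threshold is an assumption.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, fraction):
    """Return True if intervals A and B overlap by at least `fraction`
    of the length of A AND at least `fraction` of the length of B.

    Hypothetical helper for illustration; assumes half-open
    [start, end) coordinates as in BED files.
    """
    # Number of bases shared by the two intervals (<= 0 means no overlap)
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    # "Reciprocal": the threshold must hold for both intervals
    return (overlap >= fraction * (a_end - a_start)
            and overlap >= fraction * (b_end - b_start))


# 50 shared bases is half of each 100 bp interval
same_size = reciprocal_overlap(100, 200, 150, 250, 0.5)
# 50 shared bases is half of A but only 5% of the 1000 bp interval B
long_target = reciprocal_overlap(100, 200, 150, 1150, 0.5)
# Adjacent intervals share no bases
adjacent = reciprocal_overlap(100, 200, 200, 300, 0.5)
```

With a reciprocal threshold, a short region that merely touches a very long target is not counted as overlapping, which is why `long_target` differs from `same_size` here.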

Return type:

DataArray
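
The difference between the two modes can be sketched with plain Python lists. The toy regions, target names, and the helper `overlaps` are hypothetical, and the layout shown (one value per region for binary, one row per region and one column per target name for multitask) is an assumption for illustration; the exact dimensions and dtype of the returned DataArray are not specified here.

```python
# Hypothetical query regions: (chrom, chromStart, chromEnd)
regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]

# Hypothetical BED4-style targets: (chrom, chromStart, chromEnd, name)
targets = [
    ("chr1", 150, 250, "CTCF"),
    ("chr2", 50, 120, "GATA1"),
    ("chr2", 180, 300, "CTCF"),
]


def overlaps(region, target):
    """True if two half-open intervals on the same chromosome share a base."""
    return (region[0] == target[0]
            and region[1] < target[2]
            and target[1] < region[2])


# Binary: does the region intersect ANY target? Only BED3 columns are used.
binary = [any(overlaps(r, t) for t in targets) for r in regions]

# Multitask: one label per target name, so the 'name' column is required.
names = sorted({t[3] for t in targets})
multitask = [
    [any(overlaps(r, t) for t in targets if t[3] == name) for name in names]
    for r in regions
]
```

In this sketch the multitask result has one extra axis indexed by target name, which plausibly corresponds to the dimension named by label_dim.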