API

Introduction

This section describes the application programming interface (API) for the fuc package.

Below is the list of submodules available in the fuc API:

  • common : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.

  • pybam : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the pysam package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.

  • pybed : The pybed submodule is designed for working with BED files. It implements pybed.BedFrame which stores BED data as pandas.DataFrame via the pyranges package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard BED specification.

  • pycov : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements pycov.CovFrame which stores read depth data as pandas.DataFrame via the pysam package to allow fast computation and easy manipulation. The pycov.CovFrame class also contains many useful plotting methods such as CovFrame.plot_region and CovFrame.plot_uniformity.

  • pyfq : The pyfq submodule is designed for working with FASTQ files. It implements pyfq.FqFrame which stores FASTQ data as pandas.DataFrame to allow fast computation and easy manipulation.

  • pygff : The pygff submodule is designed for working with GFF/GTF files. It implements pygff.GffFrame which stores GFF/GTF data as pandas.DataFrame to allow fast computation and easy manipulation. The submodule strictly adheres to the standard GFF specification.

  • pykallisto : The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements pykallisto.KallistoFrame which stores Kallisto’s output data as pandas.DataFrame to allow fast computation and easy manipulation. The pykallisto.KallistoFrame class also contains many useful plotting methods such as KallistoFrame.plot_differential_abundance.

  • pymaf : The pymaf submodule is designed for working with MAF files. It implements pymaf.MafFrame which stores MAF data as pandas.DataFrame to allow fast computation and easy manipulation. The pymaf.MafFrame class also contains many useful plotting methods such as MafFrame.plot_oncoplot and MafFrame.plot_summary. The submodule strictly adheres to the standard MAF specification.

  • pysnpeff : The pysnpeff submodule is designed for parsing VCF annotation data from the SnpEff program. It should be used with pyvcf.VcfFrame.

  • pyvcf : The pyvcf submodule is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. The pyvcf.VcfFrame class also contains many useful plotting methods such as VcfFrame.plot_comparison and VcfFrame.plot_tmb. The submodule strictly adheres to the standard VCF specification.

  • pyvep : The pyvep submodule is designed for parsing VCF annotation data from the Ensembl VEP program. It should be used with pyvcf.VcfFrame.

To get help on a specific submodule (e.g. pyvcf):

from fuc import pyvcf
help(pyvcf)

fuc.common

The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.

Classes:

AnnFrame(df)

Class for storing sample annotation data.

Functions:

color_print(s[, color, bold])

Print colored text.

conda_env()

str : Name of the current conda environment.

convert_file2list(fn)

Convert a text file to a list of filenames.

convert_num2cat(s[, n, decimals])

Convert numeric values to categorical variables.

extract_sequence(fasta, region)

Extract the region's DNA sequence from the FASTA file.

get_most_similar(a, l)

Return the most similar string in a list.

get_similarity(a, b)

Return a value from 0 to 1 representing how similar two strings are.

is_similar(a, b[, threshold])

Return True if the similarity is equal to or greater than threshold.

legend_handles(labels[, colors])

Create custom legend handles.

load_dataset(name[, force])

Load an example dataset from the online repository (requires internet).

parse_list_or_file(obj[, extensions])

Parse the input variable and then return a list of items.

parse_region(region)

Parse specified genomic region.

parse_variant(variant)

Parse specified genomic variant.

plot_cytobands(cytoband, bed[, ax, figsize])

Create chromosome ideograms along with BED data.

plot_exons(starts, ends[, name, offset, ...])

Create a gene model where exons are drawn as boxes.

rename(original, names[, indicies])

Rename sample names flexibly.

sort_regions(regions)

Return sorted list of regions.

sort_variants(variants)

Return sorted list of variants.

sumstat(fp, fn, tp, tn)

Return various summary statistics from (FP, FN, TP, TN).

update_chr_prefix(regions[, mode])

Add or remove the (annoying) 'chr' string from specified regions.

class fuc.api.common.AnnFrame(df)[source]

Class for storing sample annotation data.

This class stores sample annotation data as pandas.DataFrame with sample names as index.

Note that an AnnFrame can have a different set of samples than its accompanying pymaf.MafFrame, pyvcf.VcfFrame, etc.

Parameters

df (pandas.DataFrame) – DataFrame containing sample annotation data. The index must be unique sample names.

See also

AnnFrame.from_dict

Construct AnnFrame from dict of array-like or dicts.

AnnFrame.from_file

Construct AnnFrame from a delimited text file.

Examples

>>> import pandas as pd
>>> from fuc import common
>>> data = {
...     'SampleID': ['A', 'B', 'C', 'D'],
...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
...     'Tissue': ['Normal', 'Tissue', 'Normal', 'Tumor'],
...     'Age': [30, 30, 57, 57]
... }
>>> df = pd.DataFrame(data)
>>> df = df.set_index('SampleID')
>>> af = common.AnnFrame(df)
>>> af.df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
B               P1  Tissue   30
C               P2  Normal   57
D               P2   Tumor   57

Attributes:

df

DataFrame containing sample annotation data.

samples

List of the sample names.

shape

Dimensionality of AnnFrame (samples, annotations).

Methods:

from_dict(data, sample_col)

Construct AnnFrame from dict of array-like or dicts.

from_file(fn, sample_col[, sep])

Construct AnnFrame from a delimited text file.

plot_annot(group_col[, group_order, ...])

Create a categorical heatmap for the selected column using unmatched samples.

plot_annot_matched(patient_col, group_col, ...)

Create a categorical heatmap for the selected column using matched samples.

sorted_samples(by[, mf, keep_empty, nonsyn])

Return a sorted list of sample names.

subset(samples[, exclude])

Subset AnnFrame for specified samples.

property df

DataFrame containing sample annotation data.

Type

pandas.DataFrame

classmethod from_dict(data, sample_col)[source]

Construct AnnFrame from dict of array-like or dicts.

The dictionary must contain a column that represents sample names.

Parameters
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • sample_col (str or int) – Column containing unique sample names, either given as string name or column index.

Returns

AnnFrame object.

Return type

AnnFrame

See also

AnnFrame

AnnFrame object creation using constructor.

AnnFrame.from_file

Construct AnnFrame from a delimited text file.

Examples

>>> from fuc import common
>>> data = {
...     'SampleID': ['A', 'B', 'C', 'D'],
...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
...     'Tissue': ['Normal', 'Tissue', 'Normal', 'Tumor'],
...     'Age': [30, 30, 57, 57]
... }
>>> af = common.AnnFrame.from_dict(data, sample_col='SampleID') # or sample_col=0
>>> af.df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
B               P1  Tissue   30
C               P2  Normal   57
D               P2   Tumor   57
classmethod from_file(fn, sample_col, sep='\t')[source]

Construct AnnFrame from a delimited text file.

The file must contain a column that represents sample names.

Parameters
  • fn (str) – Text file (compressed or uncompressed).

  • sample_col (str or int) – Column containing unique sample names, either given as string name or column index.

  • sep (str, default: ‘\t’) – Delimiter to use.

Returns

AnnFrame object.

Return type

AnnFrame

See also

AnnFrame

AnnFrame object creation using constructor.

AnnFrame.from_dict

Construct AnnFrame from dict of array-like or dicts.

Examples

>>> from fuc import common
>>> af = common.AnnFrame.from_file('sample-annot.tsv', sample_col='SampleID')
>>> af = common.AnnFrame.from_file('sample-annot.csv', sample_col=0, sep=',')
plot_annot(group_col, group_order=None, samples=None, colors='tab10', sequential=False, xticklabels=True, ax=None, figsize=None)[source]

Create a categorical heatmap for the selected column using unmatched samples.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • group_col (str) – AnnFrame column containing sample group information. If the column has NaN values, they will be converted to ‘N/A’ string.

  • group_order (list, optional) – List of sample group names (in that order too). You can use this to subset samples belonging to specified groups only. You must include all relevant groups when also using samples.

  • samples (list, optional) – Display only specified samples (in that order too).

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

  • sequential (bool, default: False) – Whether the column is sequential data.

  • xticklabels (bool, default: True) – If True, plot the sample names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • list – Legend handles.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> ax, handles = af.plot_annot('FAB_classification', samples=af.samples[:10])
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()
_images/api-1.png

We can display only selected groups:

>>> ax, handles = af.plot_annot('FAB_classification', group_order=['M7', 'M6'])
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()
_images/api-2.png

We can also display sequential data in the following way:

>>> ax, handles = af.plot_annot('FAB_classification',
...                             samples=af.samples[:10],
...                             colors='viridis',
...                             sequential=True)
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()
_images/api-3.png
plot_annot_matched(patient_col, group_col, annot_col, patient_order=None, group_order=None, annot_order=None, colors='tab10', sequential=False, xticklabels=True, ax=None, figsize=None)[source]

Create a categorical heatmap for the selected column using matched samples.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • annot_col (str) – Column to plot.

  • patient_order (list, optional) – Plot only specified patients (in that order too).

  • group_order (list, optional) – List of sample group names.

  • annot_order (list, optional) – Plot only specified annotations (in that order too).

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

  • sequential (bool, default: False) – Whether the column is sequential data.

  • xticklabels (bool, default: True) – If True, plot the sample names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • list – Legend handles.

property samples

List of the sample names.

Type

list

property shape

Dimensionality of AnnFrame (samples, annotations).

Type

tuple

sorted_samples(by, mf=None, keep_empty=False, nonsyn=False)[source]

Return a sorted list of sample names.

Parameters

by (str or list) – Column or list of columns to sort by.

subset(samples, exclude=False)[source]

Subset AnnFrame for specified samples.

Parameters
  • samples (str or list) – Sample name or list of names (the order matters).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted AnnFrame.

Return type

AnnFrame

Examples

>>> from fuc import common
>>> data = {
...     'SampleID': ['A', 'B', 'C', 'D'],
...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
...     'Tissue': ['Normal', 'Tumor', 'Normal', 'Tumor'],
...     'Age': [30, 30, 57, 57]
... }
>>> af = common.AnnFrame.from_dict(data, sample_col='SampleID') # or sample_col=0
>>> af.df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
B               P1   Tumor   30
C               P2  Normal   57
D               P2   Tumor   57

We can subset the AnnFrame for the normal samples A and C:

>>> af.subset(['A', 'C']).df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
C               P2  Normal   57

Alternatively, we can exclude those samples:

>>> af.subset(['A', 'C'], exclude=True).df
         PatientID Tissue  Age
SampleID
B               P1  Tumor   30
D               P2  Tumor   57
fuc.api.common.color_print(s, color='green', bold=False)[source]

Print colored text.

fuc.api.common.conda_env()[source]

str : Name of the current conda environment.

fuc.api.common.convert_file2list(fn)[source]

Convert a text file to a list of filenames.

Parameters

fn (str) – File containing one filename per line.

Returns

List of filenames.

Return type

list

Examples

>>> from fuc import common
>>> common.convert_file2list('bam.list')
['1.bam', '2.bam', '3.bam']
fuc.api.common.convert_num2cat(s, n=5, decimals=0)[source]

Convert numeric values to categorical variables.

Parameters
  • s (pandas.Series) – Series object containing numeric values.

  • n (int, default: 5) – Number of variables to output.

Returns

Series object containing categorical variables.

Return type

pandas.Series

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> s = af.df.days_to_last_followup
>>> s[:10]
Tumor_Sample_Barcode
TCGA-AB-2802     365.0
TCGA-AB-2803     792.0
TCGA-AB-2804    2557.0
TCGA-AB-2805     577.0
TCGA-AB-2806     945.0
TCGA-AB-2807     181.0
TCGA-AB-2808    2861.0
TCGA-AB-2809      62.0
TCGA-AB-2810      31.0
TCGA-AB-2811     243.0
Name: days_to_last_followup, dtype: float64
>>> s = common.convert_num2cat(s)
>>> s.unique()
array([ 572.2, 1144.4, 2861. , 2288.8, 1716.6,    nan])
>>> s[:10]
Tumor_Sample_Barcode
TCGA-AB-2802     572.2
TCGA-AB-2803    1144.4
TCGA-AB-2804    2861.0
TCGA-AB-2805    1144.4
TCGA-AB-2806    1144.4
TCGA-AB-2807     572.2
TCGA-AB-2808    2861.0
TCGA-AB-2809     572.2
TCGA-AB-2810     572.2
TCGA-AB-2811     572.2
Name: days_to_last_followup, dtype: float64
fuc.api.common.extract_sequence(fasta, region)[source]

Extract the region’s DNA sequence from the FASTA file.

Parameters
  • fasta (str) – FASTA file.

  • region (str) – Region (‘chrom:start-end’).

Returns

DNA sequence. Empty string if there is no matching sequence.

Return type

str

fuc.api.common.get_most_similar(a, l)[source]

Return the most similar string in a list.

fuc.api.common.get_similarity(a, b)[source]

Return a value from 0 to 1 representing how similar two strings are.

fuc.api.common.is_similar(a, b, threshold=0.9)[source]

Return True if the similarity is equal to or greater than threshold.
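
The three string-similarity helpers above (get_similarity, get_most_similar, is_similar) can be approximated with the standard library. The sketch below is illustrative only: it assumes a difflib.SequenceMatcher ratio as the similarity score, which this reference does not confirm, and the *_sketch names are hypothetical stand-ins rather than fuc functions.

```python
from difflib import SequenceMatcher


def get_similarity_sketch(a, b):
    """Return a score from 0 to 1 (assumption: SequenceMatcher ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def is_similar_sketch(a, b, threshold=0.9):
    """Return True if the score is equal to or greater than threshold."""
    return get_similarity_sketch(a, b) >= threshold


def get_most_similar_sketch(a, candidates):
    """Return the candidate string with the highest score against a."""
    return max(candidates, key=lambda c: get_similarity_sketch(a, c))


best = get_most_similar_sketch('tumor', ['Tumour', 'Normal'])
identical = is_similar_sketch('NA19920', 'NA19920')
```

Here best is the candidate that shares the longest matching blocks with the query, and identical strings always score 1.0, so they pass any threshold up to 1.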

fuc.api.common.legend_handles(labels, colors='tab10')[source]

Create custom legend handles.

Parameters
  • labels (list) – List of labels.

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

Returns

List of legend handles.

Return type

list

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> fig, ax = plt.subplots()
>>> handles1 = common.legend_handles(['A', 'B'], colors='tab10')
>>> handles2 = common.legend_handles(['C', 'D'], colors=['yellow', 'green'])
>>> legend1 = ax.legend(handles=handles1, loc='center left')
>>> legend2 = ax.legend(handles=handles2)
>>> ax.add_artist(legend1)
>>> ax.add_artist(legend2)
>>> plt.tight_layout()
_images/api-4.png
fuc.api.common.load_dataset(name, force=False)[source]

Load an example dataset from the online repository (requires internet).

Parameters
fuc.api.common.parse_list_or_file(obj, extensions=['txt', 'tsv', 'csv', 'list'])[source]

Parse the input variable and then return a list of items.

This method is useful when parsing a command line argument that accepts either a list of items or a text file containing one item per line.

Parameters
  • obj (str or list) – Object to be tested. Must be non-empty.

  • extensions (list, default: [‘txt’, ‘tsv’, ‘csv’, ‘list’]) – Recognized file extensions.

Returns

List of items.

Return type

list

Examples

>>> from fuc import common
>>> common.parse_list_or_file(['A', 'B', 'C'])
['A', 'B', 'C']
>>> common.parse_list_or_file('A')
['A']
>>> common.parse_list_or_file('example.txt')
['A', 'B', 'C']
>>> common.parse_list_or_file(['example.txt'])
['A', 'B', 'C']
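
The behavior in the examples above can be sketched as follows. This is a minimal approximation, not the fuc implementation: it assumes that a single item ending in a recognized extension is treated as a file with one item per line, and that anything else is returned as a list. The name parse_list_or_file_sketch is hypothetical.

```python
import os
import tempfile


def parse_list_or_file_sketch(obj, extensions=('txt', 'tsv', 'csv', 'list')):
    """Return a list of items from a string, a list, or a one-item-per-line file."""
    items = [obj] if isinstance(obj, str) else list(obj)
    if not items:
        raise ValueError('Input must be non-empty')
    suffixes = tuple('.' + ext for ext in extensions)
    # Assumption: only a lone item with a recognized extension is read as a file.
    if len(items) == 1 and items[0].endswith(suffixes):
        with open(items[0]) as f:
            return [line.strip() for line in f if line.strip()]
    return items


# Demonstrate both code paths with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w') as f:
        f.write('A\nB\nC\n')
    from_file = parse_list_or_file_sketch(path)
    from_single_item_list = parse_list_or_file_sketch([path])

from_list = parse_list_or_file_sketch(['A', 'B', 'C'])
from_string = parse_list_or_file_sketch('A')
```
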
fuc.api.common.parse_region(region)[source]

Parse specified genomic region.

The method will return parsed region as a tuple with a shape of (chrom, start, end) which has data types of (str, int, int).

Note that only chrom is required when specifying a region. If start and end are omitted, the method will return NaN in their respective positions in the output tuple.

Parameters

region (str) – Region (‘chrom:start-end’).

Returns

Parsed region.

Return type

tuple

Examples

>>> from fuc import common
>>> common.parse_region('chr1:100-150')
('chr1', 100, 150)
>>> common.parse_region('chr1')
('chr1', nan, nan)
>>> common.parse_region('chr1:100')
('chr1', 100, nan)
>>> common.parse_region('chr1:100-')
('chr1', 100, nan)
>>> common.parse_region('chr1:-100')
('chr1', nan, 100)
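
The parsing rules shown in these examples can be reproduced with plain string splitting. The sketch below is an assumption-laden illustration rather than the fuc source: parse_region_sketch is a hypothetical name, and it does not handle contig names that themselves contain a colon (e.g. 'HLA-A*02:10').

```python
import math


def parse_region_sketch(region):
    """Split 'chrom:start-end' into (chrom, start, end), using NaN for gaps."""
    chrom, _, interval = region.partition(':')
    start_text, _, end_text = interval.partition('-')
    # Missing coordinates become NaN, mirroring the documented output.
    start = int(start_text) if start_text else math.nan
    end = int(end_text) if end_text else math.nan
    return (chrom, start, end)


full = parse_region_sketch('chr1:100-150')
chrom_only = parse_region_sketch('chr1')
end_only = parse_region_sketch('chr1:-100')
```

Because NaN never compares equal to itself, check missing coordinates with math.isnan() rather than with ==.
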
fuc.api.common.parse_variant(variant)[source]

Parse specified genomic variant.

Generally speaking, the input string should consist of chromosome, position, reference allele, and alternative allele separated by any one or combination of the following delimiters: -, :, > (e.g. ‘22-42127941-G-A’). The method will return parsed variant as a tuple with a shape of (chrom, pos, ref, alt) which has data types of (str, int, str, str).

Note that it’s possible to omit reference allele and alternative allele from the input string to indicate position-only data (e.g. ‘22-42127941’). In this case, the method will return empty string for the alleles – i.e. (str, int, '', '') if both are omitted and (str, int, str, '') if only alternative allele is omitted.

Parameters

variant (str) – Genomic variant.

Returns

Parsed variant.

Return type

tuple

Examples

>>> from fuc import common
>>> common.parse_variant('22-42127941-G-A')
('22', 42127941, 'G', 'A')
>>> common.parse_variant('22:42127941-G>A')
('22', 42127941, 'G', 'A')
>>> common.parse_variant('22-42127941')
('22', 42127941, '', '')
>>> common.parse_variant('22-42127941-G')
('22', 42127941, 'G', '')
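
A minimal sketch of the delimiter handling described above, assuming that splitting on any of '-', ':' and '>' is sufficient. The function name parse_variant_sketch is hypothetical, and the real method may validate its input differently.

```python
import re


def parse_variant_sketch(variant):
    """Split a variant string into (chrom, pos, ref, alt)."""
    fields = re.split(r'[-:>]', variant)
    chrom = fields[0]
    pos = int(fields[1])
    # Omitted alleles are reported as empty strings.
    ref = fields[2] if len(fields) > 2 else ''
    alt = fields[3] if len(fields) > 3 else ''
    return (chrom, pos, ref, alt)


mixed_delimiters = parse_variant_sketch('22:42127941-G>A')
position_only = parse_variant_sketch('22-42127941')
```
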
fuc.api.common.plot_cytobands(cytoband, bed, ax=None, figsize=None)[source]

Create chromosome ideograms along with BED data.

The method’s source code is derived from a Python script (ideograms.py) written by Ryan Dale. The original script can be found at: https://gist.github.com/daler/c98fc410282d7570efc3#file-ideograms-py

Parameters
  • cytoband (str) – Text file containing cytoband ideogram information.

  • bed (str) – BED file to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> common.load_dataset('cytoband')
>>> cytoband_file = '~/fuc-data/cytoband/cytoBandIdeo.txt.gz'
>>> bed_file = '~/fuc-data/cytoband/ucsc_genes.bed.gz'
>>> common.plot_cytobands(cytoband_file, bed_file, figsize=(10, 8))
_images/api-5.png
fuc.api.common.plot_exons(starts, ends, name=None, offset=1, fontsize=None, color='black', y=0, height=1, ax=None, figsize=None)[source]

Create a gene model where exons are drawn as boxes.

Parameters
  • starts (list) – List of exon start positions.

  • ends (list) – List of exon end positions.

  • name (str, optional) – Gene name. Use name='$text$' to italicize the text.

  • offset (float, default: 1) – How far gene name should be plotted from the gene model.

  • color (str, default: ‘black’) – Box color.

  • y (float, default: 0) – Y position of the backbone.

  • height (float, default: 1) – Height of the gene model.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> cyp2d6_starts = [42522500, 42522852, 42523448, 42523843, 42524175, 42524785, 42525034, 42525739, 42526613]
>>> cyp2d6_ends = [42522754, 42522994, 42523636, 42523985, 42524352, 42524946, 42525187, 42525911, 42526883]
>>> ax = common.plot_exons(cyp2d6_starts, cyp2d6_ends, name='CYP2D6', fontsize=20)
>>> ax.set_ylim([-2, 2])
>>> plt.tight_layout()
_images/api-6.png
fuc.api.common.rename(original, names, indicies=None)[source]

Rename sample names flexibly.

Parameters
  • original (list) – List of original names.

  • names (dict or list) – Dict of old names to new names or list of new names.

  • indicies (list or tuple, optional) – List of 0-based sample indices. Alternatively, a tuple (int, int) can be used to specify an index range.

Returns

List of updated names.

Return type

list

Examples

>>> from fuc import common
>>> original = ['A', 'B', 'C', 'D']
>>> common.rename(original, ['1', '2', '3', '4'])
['1', '2', '3', '4']
>>> common.rename(original, {'B': '2', 'C': '3'})
['A', '2', '3', 'D']
>>> common.rename(original, ['2', '4'], indicies=[1, 3])
['A', '2', 'C', '4']
>>> common.rename(original, ['2', '3'], indicies=(1, 3))
['A', '2', '3', 'D']
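
The four calling patterns above can be summarized in a short sketch. It is not the fuc implementation: rename_sketch is a hypothetical name, and the treatment of the tuple as a half-open range (start included, end excluded) is inferred from the last example rather than stated in this reference.

```python
def rename_sketch(original, names, indicies=None):
    """Return a copy of original with selected names replaced."""
    samples = list(original)
    # A dict maps old names to new names; unmapped names are kept.
    if isinstance(names, dict):
        return [names.get(name, name) for name in samples]
    if indicies is None:
        targets = list(range(len(samples)))
    elif isinstance(indicies, tuple):
        # Assumption: (start, end) selects start, start + 1, ..., end - 1.
        targets = list(range(indicies[0], indicies[1]))
    else:
        targets = list(indicies)
    if len(targets) != len(names):
        raise ValueError('Number of new names does not match the selection')
    for i, name in zip(targets, names):
        samples[i] = name
    return samples


by_range = rename_sketch(['A', 'B', 'C', 'D'], ['2', '3'], indicies=(1, 3))
```

The parameter keeps the spelling indicies so that the sketch matches the documented signature.
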
fuc.api.common.sort_regions(regions)[source]

Return sorted list of regions.

Parameters

regions (list) – List of regions.

Returns

Sorted list.

Return type

list

Examples

>>> from fuc import common
>>> regions = ['chr22:1000-1500', 'chr16:100-200', 'chr22:200-300', 'chr16_KI270854v1_alt', 'chr3_GL000221v1_random', 'HLA-A*02:10']
>>> sorted(regions) # Lexicographic sorting (not what we want)
['HLA-A*02:10', 'chr16:100-200', 'chr16_KI270854v1_alt', 'chr22:1000-1500', 'chr22:200-300', 'chr3_GL000221v1_random']
>>> common.sort_regions(regions)
['chr16:100-200', 'chr22:200-300', 'chr22:1000-1500', 'chr16_KI270854v1_alt', 'chr3_GL000221v1_random', 'HLA-A*02:10']
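
One way to obtain the ordering shown above is a sort key that ranks canonical contigs numerically and pushes everything else to the end. The sketch below is a guess at the rules based only on this example (canonical contigs first, remaining names kept in input order); sort_regions_sketch and its helpers are hypothetical names.

```python
import re

# Assumed canonical contig order: 1-22, then X, Y, M.
CANONICAL = [str(i) for i in range(1, 23)] + ['X', 'Y', 'M']
PATTERN = re.compile(r'^(?:chr)?([0-9]+|X|Y|M)(?::(\d+)-(\d+))?$')


def region_key(region):
    """Return (contig rank, start, end); unrecognized names rank last."""
    m = PATTERN.match(region)
    if m is None or m.group(1) not in CANONICAL:
        return (len(CANONICAL), 0, 0)
    start = int(m.group(2)) if m.group(2) else 0
    end = int(m.group(3)) if m.group(3) else 0
    return (CANONICAL.index(m.group(1)), start, end)


def sort_regions_sketch(regions):
    # sorted() is stable, so unrecognized names keep their input order.
    return sorted(regions, key=region_key)


regions = ['chr22:1000-1500', 'chr16:100-200', 'chr22:200-300',
           'chr16_KI270854v1_alt', 'chr3_GL000221v1_random', 'HLA-A*02:10']
ordered = sort_regions_sketch(regions)
```
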
fuc.api.common.sort_variants(variants)[source]

Return sorted list of variants.

Parameters

variants (list) – List of variants.

Returns

Sorted list.

Return type

list

Examples

>>> from fuc import common
>>> variants = ['5-200-G-T', '5:100:T:C', '1:100:A>C', '10-100-G-C']
>>> sorted(variants) # Lexicographic sorting (not what we want)
['10-100-G-C', '1:100:A>C', '5-200-G-T', '5:100:T:C']
>>> common.sort_variants(variants)
['1:100:A>C', '5:100:T:C', '5-200-G-T', '10-100-G-C']
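
The same idea applies to variants: compare chromosomes numerically where possible, then positions as integers. This is a sketch under the assumption that variants use the '-', ':' and '>' delimiters; sort_variants_sketch and variant_key are hypothetical names, not fuc functions.

```python
import re


def variant_key(variant):
    """Return a key that orders by contig, then position, then alleles."""
    chrom, pos, *alleles = re.split(r'[-:>]', variant)
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    # Numeric contigs sort before named ones such as X or Y.
    contig_rank = (0, int(chrom), '') if chrom.isdigit() else (1, 0, chrom)
    return (contig_rank, int(pos), alleles)


def sort_variants_sketch(variants):
    return sorted(variants, key=variant_key)


variants = ['5-200-G-T', '5:100:T:C', '1:100:A>C', '10-100-G-C']
ordered_variants = sort_variants_sketch(variants)
```
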
fuc.api.common.sumstat(fp, fn, tp, tn)[source]

Return various summary statistics from (FP, FN, TP, TN).

This method will return the following statistics:

Terminology                                                 Derivation
sensitivity, recall, hit rate, or true positive rate (TPR)  \(TPR = TP / P = TP / (TP + FN) = 1 - FNR\)
specificity, selectivity or true negative rate (TNR)        \(TNR = TN / N = TN / (TN + FP) = 1 - FPR\)
precision or positive predictive value (PPV)                \(PPV = TP / (TP + FP) = 1 - FDR\)
negative predictive value (NPV)                             \(NPV = TN / (TN + FN) = 1 - FOR\)
miss rate or false negative rate (FNR)                      \(FNR = FN / P = FN / (FN + TP) = 1 - TPR\)
fall-out or false positive rate (FPR)                       \(FPR = FP / N = FP / (FP + TN) = 1 - TNR\)
false discovery rate (FDR)                                  \(FDR = FP / (FP + TP) = 1 - PPV\)
false omission rate (FOR)                                   \(FOR = FN / (FN + TN) = 1 - NPV\)
accuracy (ACC)                                              \(ACC = (TP + TN)/(TP + TN + FP + FN)\)

Parameters

fp, fn, tp, tn (int) – Input statistics.

Returns

Dictionary containing summary statistics.

Return type

dict

Examples

This example is taken directly from the Wikipedia article "Sensitivity and specificity".

>>> from fuc import common
>>> results = common.sumstat(180, 10, 20, 1820)
>>> for k, v in results.items():
...     print(k, f'{v:.3f}')
...
tpr 0.667
tnr 0.910
ppv 0.100
npv 0.995
fnr 0.333
fpr 0.090
fdr 0.900
for 0.005
acc 0.906
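
The derivations listed for this function translate directly into code. The sketch below simply applies those formulas to the four counts; sumstat_sketch is a hypothetical name, the dictionary keys mirror the ones printed in the example, and no guard against zero denominators is included.

```python
def sumstat_sketch(fp, fn, tp, tn):
    """Compute summary statistics from confusion-matrix counts."""
    return {
        'tpr': tp / (tp + fn),  # sensitivity
        'tnr': tn / (tn + fp),  # specificity
        'ppv': tp / (tp + fp),  # precision
        'npv': tn / (tn + fn),
        'fnr': fn / (fn + tp),  # miss rate
        'fpr': fp / (fp + tn),  # fall-out
        'fdr': fp / (fp + tp),
        'for': fn / (fn + tn),
        'acc': (tp + tn) / (tp + tn + fp + fn),
    }


results = sumstat_sketch(180, 10, 20, 1820)
```

With the counts from the example (FP=180, FN=10, TP=20, TN=1820), sensitivity is 20/30 and specificity is 1820/2000, which matches the printed values of 0.667 and 0.910.
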
fuc.api.common.update_chr_prefix(regions, mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from specified regions.

The method will automatically detect regions that don’t need to be updated and will return them unchanged.

Parameters
  • regions (str or list) – One or more regions to be updated.

  • mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated region or list of regions, matching the type of the input.

Return type

str or list

Examples

>>> from fuc import common
>>> common.update_chr_prefix(['chr1:100-200', '2:300-400'], mode='remove')
['1:100-200', '2:300-400']
>>> common.update_chr_prefix(['chr1:100-200', '2:300-400'], mode='add')
['chr1:100-200', 'chr2:300-400']
>>> common.update_chr_prefix('chr1:100-200', mode='remove')
'1:100-200'
>>> common.update_chr_prefix('chr1:100-200', mode='add')
'chr1:100-200'
>>> common.update_chr_prefix('2:300-400', mode='add')
'chr2:300-400'
>>> common.update_chr_prefix('2:300-400', mode='remove')
'2:300-400'
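
The add/remove logic, including the rule that regions already in the desired form are returned unchanged, can be sketched in a few lines. This is an illustration only, with update_chr_prefix_sketch as a hypothetical name; it assumes a region has the prefix exactly when the string starts with 'chr'.

```python
def update_chr_prefix_sketch(regions, mode='remove'):
    """Add or remove the 'chr' prefix from one region or a list of regions."""
    if mode not in ('add', 'remove'):
        raise ValueError(f'Incorrect mode: {mode}')

    def update(region):
        has_prefix = region.startswith('chr')
        if mode == 'remove':
            return region[3:] if has_prefix else region
        return region if has_prefix else 'chr' + region

    # A single string in gives a single string out; a list gives a list.
    if isinstance(regions, str):
        return update(regions)
    return [update(region) for region in regions]


added = update_chr_prefix_sketch(['chr1:100-200', '2:300-400'], mode='add')
removed = update_chr_prefix_sketch(['chr1:100-200', '2:300-400'], mode='remove')
```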

fuc.pybam

The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the pysam package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.

Functions:

count_allelic_depth(bam, sites)

Count allelic depth for specified sites.

has_chr_prefix(fn)

Return True if contigs have the (annoying) 'chr' string.

index(fn)

Index a BAM file.

slice(bam, regions[, format, path, fasta])

Slice a BAM file for specified regions.

tag_sm(fn)

Extract SM tags (sample names) from a BAM file.

tag_sn(fn)

Extract SN tags (contig names) from a BAM file.

fuc.api.pybam.count_allelic_depth(bam, sites)[source]

Count allelic depth for specified sites.

Parameters
  • bam (str) – BAM file.

  • sites (str or list) – Genomic site or list of sites. Each site should consist of chromosome and 1-based position in the format that can be recognized by common.parse_variant() (e.g. ‘22-42127941’).

Returns

DataFrame containing allelic depth.

Return type

pandas.DataFrame

Examples

>>> from fuc import pybam
>>> pybam.count_allelic_depth('in.bam', ['19-41510048', '19-41510053', '19-41510062'])
  Chromosome  Position  Total    A  C    G    T  N  DEL  INS
0         19  41510048    119  106  7    4    0  0    2    0
1         19  41510053    120    1  2    0  116  0    0    1
2         19  41510062    115    0  0  115    0  0    0    0
fuc.api.pybam.has_chr_prefix(fn)[source]

Return True if contigs have the (annoying) ‘chr’ string.

Parameters

fn (str) – BAM file.

Returns

Whether the ‘chr’ string is found.

Return type

bool

fuc.api.pybam.index(fn)[source]

Index a BAM file.

This simply wraps the pysam.index() method.

Parameters

fn (str) – BAM file.

fuc.api.pybam.slice(bam, regions, format='BAM', path=None, fasta=None)[source]

Slice a BAM file for specified regions.

Parameters
  • bam (str) – Input BAM file. It must be already indexed to allow random access. You can index a BAM file with the pybam.index() method.

  • regions (str, list, or pybed.BedFrame) – One or more regions to be sliced. Each region must have the format chrom:start-end and be a half-open interval with (start, end]. This means, for example, chr1:100-103 will extract positions 101, 102, and 103. Alternatively, you can provide a BED file (compressed or uncompressed) to specify regions. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the input BED’s contig names.

  • path (str, optional) – Output BAM file. Writes to stdout when path='-'. If None is provided the result is returned as a string.

  • format ({‘BAM’, ‘SAM’, ‘CRAM’}, default: ‘BAM’) – Output file format.

  • fasta (str, optional) – FASTA file. Required when format is ‘CRAM’.

Returns

If path is None, returns the resulting BAM format as a string. Otherwise returns None.

Return type

None or str
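
The half-open (start, end] convention described for the regions parameter is easy to get wrong by one, so here is a small worked illustration. The helper positions_covered is hypothetical and exists only to enumerate the 1-based positions that a region string selects under that convention.

```python
def positions_covered(region):
    """List the 1-based positions selected by 'chrom:start-end' as (start, end]."""
    chrom, interval = region.split(':')
    start, end = (int(x) for x in interval.split('-'))
    # The start itself is excluded and the end is included.
    return list(range(start + 1, end + 1))


covered = positions_covered('chr1:100-103')
```

For 'chr1:100-103' this enumerates positions 101, 102, and 103, as stated in the parameter description.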

fuc.api.pybam.tag_sm(fn)[source]

Extract SM tags (sample names) from a BAM file.

Parameters

fn (str) – BAM file.

Returns

List of SM tags.

Return type

list

Examples

>>> from fuc import pybam
>>> pybam.tag_sm('NA19920.bam')
['NA19920']
fuc.api.pybam.tag_sn(fn)[source]

Extract SN tags (contig names) from a BAM file.

Parameters

fn (str) – BAM file.

Returns

List of SN tags.

Return type

list

Examples

>>> from fuc import pybam
>>> pybam.tag_sn('NA19920.bam')
['chr3', 'chr15', 'chrY', 'chr19', 'chr22', 'chr5', 'chr18', 'chr14', 'chr11', 'chr20', 'chr21', 'chr16', 'chr10', 'chr13', 'chr9', 'chr2', 'chr17', 'chr12', 'chr6', 'chrM', 'chrX', 'chr4', 'chr8', 'chr1', 'chr7']

fuc.pybed

The pybed submodule is designed for working with BED files. It implements pybed.BedFrame which stores BED data as pandas.DataFrame via the pyranges package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard BED specification.

BED lines can have the following fields (the first three are required):

No.  Name         Description                        Examples
1    Chromosome   Chromosome                         ‘chr2’, ‘2’
2    Start        Start position                     10041, 23042
3    End          End position                       10041, 23042
4    Name         Feature name                       ‘TP53’
5    Score        Score for color density (0, 1000)  342, 544
6    Strand       ‘+’ or ‘-’ (‘.’ for no strand)     ‘+’, ‘-’
7    ThickStart   Start position for thick drawing   10041, 23042
8    ThickEnd     End position for thick drawing     10041, 23042
9    ItemRGB      RGB value                          ‘255,0,0’
10   BlockCount   Number of blocks (e.g. exons)      12, 8
11   BlockSizes   ‘,’-separated block sizes          ‘224,423’
12   BlockStarts  ‘,’-separated block starts         ‘2345,5245’
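
To make the field layout concrete, the sketch below maps one tab-separated BED line onto the twelve field names listed above. It is an illustration, not how pybed.BedFrame reads files: parse_bed_line is a hypothetical helper, and the choice of which fields to convert to integers is an assumption based on the example values.

```python
BED_FIELDS = [
    'Chromosome', 'Start', 'End', 'Name', 'Score', 'Strand',
    'ThickStart', 'ThickEnd', 'ItemRGB', 'BlockCount',
    'BlockSizes', 'BlockStarts',
]
# Assumption: these fields hold single integers; the rest stay as strings.
INT_FIELDS = {'Start', 'End', 'Score', 'ThickStart', 'ThickEnd', 'BlockCount'}


def parse_bed_line(line):
    """Return a dict of the fields present in one BED line."""
    values = line.rstrip('\n').split('\t')
    if len(values) < 3:
        raise ValueError('A BED line needs at least Chromosome, Start, and End')
    record = {}
    # zip() stops at the shorter sequence, so optional fields may be absent.
    for name, value in zip(BED_FIELDS, values):
        record[name] = int(value) if name in INT_FIELDS else value
    return record


record = parse_bed_line('chr2\t10041\t23042\tTP53\t342\t+\n')
```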

Classes:

BedFrame(meta, gr)

Class for storing BED data.

class fuc.api.pybed.BedFrame(meta, gr)[source]

Class for storing BED data.

Parameters
  • meta (list) – Metadata lines.

  • gr (pyranges.PyRanges) – PyRanges object containing BED data.

See also

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> import pandas as pd
>>> import pyranges as pr
>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> df = pd.DataFrame(data)
>>> gr = pr.PyRanges(df)
>>> bf = pybed.BedFrame([], gr)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200

Attributes:

contigs

List of contig names.

gr

Two-dimensional representation of genomic intervals and their annotations.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

meta

Metadata lines.

shape

Dimensionality of BedFrame (intervals, columns).

Methods:

copy_meta()

Return a copy of the metadata.

from_dict(meta, data)

Construct BedFrame from a dict of array-like or dicts.

from_file(fn)

Construct BedFrame from a BED file.

from_frame(meta, data)

Construct BedFrame from a dataframe.

from_regions(meta, regions)

Construct BedFrame from a list of regions.

intersect(other)

Find intersection between the BedFrames.

merge()

Merge overlapping intervals within BedFrame.

sort()

Sort the BedFrame by chromosome and position.

to_file(fn)

Write the BedFrame to a BED file.

to_regions([merge])

Return a list of regions from BedFrame.

to_string()

Render the BedFrame to a console-friendly tabular output.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the Chromosome column.

property contigs

List of contig names.

Type

list

copy_meta()[source]

Return a copy of the metadata.

classmethod from_dict(meta, data)[source]

Construct BedFrame from a dict of array-like or dicts.

Parameters
  • meta (list) – Metadata lines.

  • data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
classmethod from_file(fn)[source]

Construct BedFrame from a BED file.

Parameters

fn (str) – BED file path.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> from fuc import pybed
>>> bf = pybed.BedFrame.from_file('example.bed')
classmethod from_frame(meta, data)[source]

Construct BedFrame from a dataframe.

Parameters
  • meta (list) – Metadata lines.

  • data (pandas.DataFrame) – DataFrame containing BED data.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> import pandas as pd
>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> df = pd.DataFrame(data)
>>> bf = pybed.BedFrame.from_frame([], df)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
classmethod from_regions(meta, regions)[source]

Construct BedFrame from a list of regions.

Parameters
  • meta (list) – Metadata lines.

  • regions (str or list) – Region or list of regions.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

Examples

>>> from fuc import pybed
>>> data = ['chr1:100-200', 'chr2:100-200', 'chr3:100-200']
>>> bf = pybed.BedFrame.from_regions([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    100  200
2       chr3    100  200
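
As a rough illustration of what the conversion above involves, the following plain-Python sketch turns ‘chrom:start-end’ strings into the three required BED columns. The function name regions_to_columns is hypothetical and the parsing is deliberately simple; it is not pybed's own code.

```python
def regions_to_columns(regions):
    """Hypothetical helper: convert 'chrom:start-end' strings into columns."""
    if isinstance(regions, str):
        # A single region is accepted as well as a list of regions.
        regions = [regions]
    columns = {"Chromosome": [], "Start": [], "End": []}
    for region in regions:
        chrom, _, span = region.partition(":")
        start, _, end = span.partition("-")
        columns["Chromosome"].append(chrom)
        columns["Start"].append(int(start))
        columns["End"].append(int(end))
    return columns


columns = regions_to_columns(["chr1:100-200", "chr2:100-200", "chr3:100-200"])
print(columns)
```

The resulting dict has the same shape as the data argument shown in the from_dict example.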
property gr

Two-dimensional representation of genomic intervals and their annotations.

Type

pyranges.PyRanges

property has_chr_prefix

Whether the (annoying) ‘chr’ string is found.

Type

bool

intersect(other)[source]

Find intersection between the BedFrames.

merge()[source]

Merge overlapping intervals within BedFrame.

Returns

Merged BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr1', 'chr2', 'chr2', 'chr3', 'chr3'],
...     'Start': [10, 30, 15, 25, 50, 61],
...     'End': [40, 50, 25, 35, 60, 80]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1     10   40
1       chr1     30   50
2       chr2     15   25
3       chr2     25   35
4       chr3     50   60
5       chr3     61   80
>>> bf.merge().gr.df
  Chromosome  Start  End
0       chr1     10   50
1       chr2     15   35
2       chr3     50   60
3       chr3     61   80
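
The merging behaviour shown above can be sketched in plain Python as a sort followed by a single sweep. This is an assumption about the general technique, not pybed's implementation (which relies on pyranges); the rule that touching intervals such as 15-25 and 25-35 are joined is inferred from the example output.

```python
def merge_intervals(rows):
    """Merge overlapping or touching (chrom, start, end) tuples per chromosome."""
    merged = []
    # Sorting groups rows by chromosome (lexicographically) and orders by start.
    for chrom, start, end in sorted(rows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


rows = [
    ("chr1", 10, 40), ("chr1", 30, 50),
    ("chr2", 15, 25), ("chr2", 25, 35),
    ("chr3", 50, 60), ("chr3", 61, 80),
]
merged = merge_intervals(rows)
print(merged)
```

The two chr3 intervals stay separate because 61 is greater than the previous end of 60.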
property meta

Metadata lines.

Type

list

property shape

Dimensionality of BedFrame (intervals, columns).

Type

tuple

sort()[source]

Sort the BedFrame by chromosome and position.

Returns

Sorted BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr3', 'chr1'],
...     'Start': [400, 100, 100],
...     'End': [500, 200, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    400  500
1       chr1    100  200
2       chr3    100  200
>>> bf.sort().gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr1    400  500
2       chr3    100  200
to_file(fn)[source]

Write the BedFrame to a BED file.

to_regions(merge=True)[source]

Return a list of regions from BedFrame.

Parameters

merge (bool, default: True) – Whether to merge overlapping intervals.

Returns

List of regions.

Return type

list

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr1', 'chr2', 'chr2', 'chr3', 'chr3'],
...     'Start': [10, 30, 15, 25, 50, 61],
...     'End': [40, 50, 25, 35, 60, 80]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.to_regions()
['chr1:10-50', 'chr2:15-35', 'chr3:50-60', 'chr3:61-80']
>>> bf.to_regions(merge=False)
['chr1:10-40', 'chr1:30-50', 'chr2:15-25', 'chr2:25-35', 'chr3:50-60', 'chr3:61-80']
to_string()[source]

Render the BedFrame to a console-friendly tabular output.

update_chr_prefix(mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from the Chromosome column.

Parameters

mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['1', '1', 'chr2', 'chr2'],
...     'Start': [100, 400, 100, 200],
...     'End': [200, 500, 200, 300]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0          1    100  200
1          1    400  500
2       chr2    100  200
3       chr2    200  300
>>> bf.update_chr_prefix(mode='remove').gr.df
  Chromosome  Start  End
0          1    100  200
1          1    400  500
2          2    100  200
3          2    200  300
>>> bf.update_chr_prefix(mode='add').gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr1    400  500
2       chr2    100  200
3       chr2    200  300
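
The effect of the two modes can be mimicked on a plain list of contig names. The function below is a hypothetical stand-in written for illustration; it only reproduces the behaviour visible in the example, where names that already have (or lack) the prefix are left unchanged.

```python
def update_names(names, mode="remove"):
    """Hypothetical helper: add or remove the 'chr' prefix from contig names."""
    if mode == "remove":
        return [n[3:] if n.startswith("chr") else n for n in names]
    if mode == "add":
        return [n if n.startswith("chr") else "chr" + n for n in names]
    raise ValueError("mode must be 'add' or 'remove'")


names = ["1", "1", "chr2", "chr2"]
removed = update_names(names, mode="remove")
added = update_names(names, mode="add")
print(removed, added)
```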

fuc.pycov

The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements pycov.CovFrame which stores read depth data as pandas.DataFrame via the pysam package to allow fast computation and easy manipulation. The pycov.CovFrame class also contains many useful plotting methods such as CovFrame.plot_region and CovFrame.plot_uniformity.

Classes:

CovFrame(df)

Class for storing read depth data from one or more SAM/BAM/CRAM files.

Functions:

concat(cfs[, axis])

Concatenate CovFrame objects along a particular axis.

merge(cfs[, how])

Merge CovFrame objects.

simulate([mode, loc, scale, size])

Simulate read depth data for a single sample.

class fuc.api.pycov.CovFrame(df)[source]

Class for storing read depth data from one or more SAM/BAM/CRAM files.

Parameters

df (pandas.DataFrame) – DataFrame containing read depth data.

See also

CovFrame.from_bam

Construct CovFrame from BAM files.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> df = pd.DataFrame(data)
>>> cf = pycov.CovFrame(df)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  22  23
1       chr1      1001  34  30
2       chr1      1002  33  27
3       chr1      1003  32  21
4       chr1      1004  32  15

Attributes:

contigs

List of contig names.

df

DataFrame containing read depth data.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

samples

List of the sample names.

shape

Dimensionality of CovFrame (positions, samples).

Methods:

copy()

Return a copy of the CovFrame.

copy_df()

Return a copy of the dataframe.

from_bam(bams[, regions, zero, map_qual, names])

Construct CovFrame from BAM files.

from_dict(data)

Construct CovFrame from dict of array-like or dicts.

from_file(fn[, compression])

Construct CovFrame from a TSV file containing read depth data.

mask_bed(bed[, opposite])

Mask rows that overlap with BED data.

matrix_uniformity([frac, n, m])

Compute a matrix of fraction of sampled bases >= coverage with a shape of (coverages, samples).

merge(other[, how])

Merge with the other CovFrame.

plot_distribution([mode, frac, ax, figsize])

Create a line plot visualizing the distribution of per-base read depth.

plot_region(sample[, region, samples, ...])

Create read depth profile for specified region.

plot_uniformity([mode, frac, n, m, marker, ...])

Create a line plot visualizing the uniformity in read depth.

rename(names[, indicies])

Rename the samples.

slice(region)

Slice the CovFrame for the region.

subset(samples[, exclude])

Subset CovFrame for specified samples.

to_file(fn[, compression])

Write the CovFrame to a TSV file.

to_string()

Render the CovFrame to a console-friendly tabular output.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the Chromosome column.

property contigs

List of contig names.

Type

list

copy()[source]

Return a copy of the CovFrame.

copy_df()[source]

Return a copy of the dataframe.

property df

DataFrame containing read depth data.

Type

pandas.DataFrame

classmethod from_bam(bams, regions=None, zero=False, map_qual=None, names=None)[source]

Construct CovFrame from BAM files.

Under the hood, the method computes read depth using the samtools depth command.

Parameters
  • bams (str or list) – One or more input BAM files. Alternatively, you can provide a text file (.txt, .tsv, .csv, or .list) containing one BAM file per line.

  • regions (str, list, or pybed.BedFrame, optional) – By default (regions=None), the method counts all reads in BAM files, which can be excruciatingly slow for large files (e.g. whole genome sequencing). Therefore, use this argument to only output positions in given regions. Each region must have the format chrom:start-end and be a half-open interval with (start, end]. This means, for example, chr1:100-103 will extract positions 101, 102, and 103. Alternatively, you can provide a BED file (compressed or uncompressed) or a pybed.BedFrame object to specify regions. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the input BAM’s contig names.

  • zero (bool, default: False) – If True, output all positions including those with zero depth.

  • map_qual (int, optional) – Only count reads with mapping quality greater than or equal to this number.

  • names (list, optional) – By default (names=None), sample name is extracted using SM tag in BAM files. If the tag is missing, the method will set the filename as sample name. Use this argument to manually provide sample names.
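
The half-open (start, end] convention described for the regions argument can be checked with a few lines of plain Python. The helper name positions_in_region is hypothetical and exists only to illustrate the interval arithmetic.

```python
def positions_in_region(region):
    """Hypothetical helper: list the positions covered by 'chrom:start-end'.

    The interval is treated as (start, end]: start is excluded, end is included.
    """
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, list(range(int(start) + 1, int(end) + 1))


chrom, positions = positions_in_region("chr1:100-103")
print(chrom, positions)
```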

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_bam(bam)
>>> cf = pycov.CovFrame.from_bam([bam1, bam2])
>>> cf = pycov.CovFrame.from_bam(bam, regions='19:41497204-41524301')
classmethod from_dict(data)[source]

Construct CovFrame from dict of array-like or dicts.

Parameters

data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_bam

Construct CovFrame from BAM files.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  36  22
1       chr1      1001  39  35
2       chr1      1002  33  19
3       chr1      1003  36  20
4       chr1      1004  31  24
classmethod from_file(fn, compression=False)[source]

Construct CovFrame from a TSV file containing read depth data.

Parameters
  • fn (str or file-like object) – TSV file (compressed or uncompressed). By file-like object, we refer to objects with a read() method, such as a file handle.

  • compression (bool, default: False) – If True, use GZIP decompression regardless of filename.

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_bam

Construct CovFrame from BAM files.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

Examples

>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_file('unzipped.tsv')
>>> cf = pycov.CovFrame.from_file('zipped.tsv.gz')
>>> cf = pycov.CovFrame.from_file('zipped.tsv', compression=True)
property has_chr_prefix

Whether the (annoying) ‘chr’ string is found.

Type

bool

mask_bed(bed, opposite=False)[source]

Mask rows that overlap with BED data.

Parameters
  • bed (pybed.BedFrame or str) – BedFrame object or BED file.

  • opposite (bool, default: False) – If True, mask rows that don’t overlap with BED data.

Returns

Masked CovFrame.

Return type

CovFrame

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov, pybed
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  34  31
1       chr1      1001  31  20
2       chr1      1002  41  22
3       chr1      1003  28  41
4       chr1      1004  34  23
>>> data = {
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1000, 1003],
...     'End': [1002, 1004]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start   End
0       chr1   1000  1002
1       chr1   1003  1004

We can mask rows that overlap with the BED data:

>>> cf.mask_bed(bf).df.head()
  Chromosome  Position     A     B
0       chr1      1000   NaN   NaN
1       chr1      1001   NaN   NaN
2       chr1      1002  41.0  22.0
3       chr1      1003   NaN   NaN
4       chr1      1004  34.0  23.0

We can also do the opposite:

>>> cf.mask_bed(bf, opposite=True).df.head()
  Chromosome  Position     A     B
0       chr1      1000  34.0  31.0
1       chr1      1001  31.0  20.0
2       chr1      1002   NaN   NaN
3       chr1      1003  28.0  41.0
4       chr1      1004   NaN   NaN
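
The masking rule can be sketched without pandas. In the example above, the interval 1000-1002 masks positions 1000 and 1001 but not 1002, so this sketch assumes a position overlaps an interval when start <= position < end. That rule is inferred from the output shown here and is not taken from the pycov source; None stands in for NaN.

```python
def mask_positions(positions, depths, intervals, opposite=False):
    """Hypothetical helper: replace depths with None where the mask applies."""
    masked = []
    for pos, depth in zip(positions, depths):
        overlaps = any(start <= pos < end for start, end in intervals)
        # With opposite=True, hide the rows that do NOT overlap instead.
        hide = overlaps != opposite
        masked.append(None if hide else depth)
    return masked


positions = [1000, 1001, 1002, 1003, 1004]
depths = [34, 31, 41, 28, 34]
intervals = [(1000, 1002), (1003, 1004)]
inside_masked = mask_positions(positions, depths, intervals)
outside_masked = mask_positions(positions, depths, intervals, opposite=True)
print(inside_masked, outside_masked)
```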
matrix_uniformity(frac=0.1, n=20, m=None)[source]

Compute a matrix of fraction of sampled bases >= coverage with a shape of (coverages, samples).

Parameters
  • frac (float, default: 0.1) – Fraction of data to be sampled (to speed up the process).

  • n (int or list, default: 20) – Number of evenly spaced points to generate for the x-axis. Alternatively, positions can be manually specified by providing a list.

  • m (float, optional) – Maximum point in the x-axis. By default, it will be the maximum depth in the entire dataset.

Returns

Matrix of fraction of sampled bases >= coverage.

Return type

pandas.DataFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.matrix_uniformity()
              A     B
Coverage
1.000000   1.00  1.00
3.368421   1.00  1.00
5.736842   1.00  1.00
8.105263   1.00  1.00
10.473684  1.00  1.00
12.842105  1.00  0.98
15.210526  1.00  0.93
17.578947  1.00  0.87
19.947368  1.00  0.77
22.315789  1.00  0.64
24.684211  1.00  0.50
27.052632  0.97  0.35
29.421053  0.84  0.25
31.789474  0.70  0.16
34.157895  0.51  0.07
36.526316  0.37  0.07
38.894737  0.21  0.03
41.263158  0.09  0.02
43.631579  0.04  0.00
46.000000  0.02  0.00
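
For a single sample, the quantity in each row above is the fraction of bases whose depth is at least the given coverage. The sketch below computes it in plain Python. It assumes the thresholds are n evenly spaced values from 1 to m, which matches the Coverage column in the example (1.0 to 46.0 in 20 steps), and it skips the random subsampling controlled by frac.

```python
def uniformity(depths, n=20, m=None):
    """Hypothetical helper: fraction of bases with depth >= each threshold."""
    if m is None:
        m = max(depths)
    # n evenly spaced thresholds from 1 to m (inclusive).
    step = (m - 1) / (n - 1)
    thresholds = [1 + i * step for i in range(n)]
    total = len(depths)
    return [(t, sum(d >= t for d in depths) / total) for t in thresholds]


result = uniformity([10, 20, 30, 40], n=4)
for threshold, fraction in result:
    print(threshold, fraction)
```

A sample with perfectly uniform depth would stay at 1.0 until the threshold passes its depth and then drop straight to 0.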
merge(other, how='inner')[source]

Merge with the other CovFrame.

Parameters
  • other (CovFrame) – Other CovFrame. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the contig names of self.

  • how (str, default: ‘inner’) – Type of merge as defined in pandas.DataFrame.merge().

Returns

Merged CovFrame.

Return type

CovFrame

See also

merge

Merge multiple CovFrame objects.

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov
>>> data1 = {
...     'Chromosome': ['chr1'] * 5,
...     'Position': np.arange(100, 105),
...     'A': pycov.simulate(loc=35, scale=5, size=5),
...     'B': pycov.simulate(loc=25, scale=7, size=5),
... }
>>> data2 = {
...     'Chromosome': ['1'] * 5,
...     'Position': np.arange(102, 107),
...     'C': pycov.simulate(loc=35, scale=5, size=5),
... }
>>> cf1 = pycov.CovFrame.from_dict(data1)
>>> cf2 = pycov.CovFrame.from_dict(data2)
>>> cf1.df
  Chromosome  Position   A   B
0       chr1       100  40  27
1       chr1       101  32  33
2       chr1       102  32  22
3       chr1       103  32  29
4       chr1       104  37  22
>>> cf2.df
  Chromosome  Position   C
0          1       102  33
1          1       103  29
2          1       104  35
3          1       105  27
4          1       106  25

We can merge the two CovFrames with how=’inner’ (default):

>>> cf1.merge(cf2).df
  Chromosome  Position   A   B   C
0       chr1       102  32  22  33
1       chr1       103  32  29  29
2       chr1       104  37  22  35

We can also merge with how=’outer’:

>>> cf1.merge(cf2, how='outer').df
  Chromosome  Position     A     B     C
0       chr1       100  40.0  27.0   NaN
1       chr1       101  32.0  33.0   NaN
2       chr1       102  32.0  22.0  33.0
3       chr1       103  32.0  29.0  29.0
4       chr1       104  37.0  22.0  35.0
5       chr1       105   NaN   NaN  27.0
6       chr1       106   NaN   NaN  25.0
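
The difference between the two merge types, together with the automatic ‘chr’ prefix matching, can be sketched with dictionaries keyed by (chromosome, position). Everything here is a simplified stand-in: the names normalize and merge_depth are hypothetical, only ‘inner’ and ‘outer’ are handled, and None plays the role of NaN.

```python
def normalize(chrom, use_prefix):
    """Add or strip 'chr' so that contig names follow one convention."""
    bare = chrom[3:] if chrom.startswith("chr") else chrom
    return "chr" + bare if use_prefix else bare


def merge_depth(left, right, how="inner"):
    """Merge two {(chrom, pos): {sample: depth}} mappings on their keys."""
    # Follow the left-hand side's naming convention, as described above.
    use_prefix = any(chrom.startswith("chr") for chrom, _ in left)
    right = {(normalize(c, use_prefix), p): v for (c, p), v in right.items()}
    left_samples = sorted({s for row in left.values() for s in row})
    right_samples = sorted({s for row in right.values() for s in row})
    if how == "inner":
        keys = sorted(set(left) & set(right))
    elif how == "outer":
        keys = sorted(set(left) | set(right))
    else:
        raise ValueError("this sketch only supports 'inner' and 'outer'")
    merged = {}
    for key in keys:
        row = {s: left.get(key, {}).get(s) for s in left_samples}
        row.update({s: right.get(key, {}).get(s) for s in right_samples})
        merged[key] = row
    return merged


left = {
    ("chr1", 100 + i): {"A": a, "B": b}
    for i, (a, b) in enumerate(zip([40, 32, 32, 32, 37], [27, 33, 22, 29, 22]))
}
right = {("1", 102 + i): {"C": c} for i, c in enumerate([33, 29, 35, 27, 25])}

inner = merge_depth(left, right)
outer = merge_depth(left, right, how="outer")
print(len(inner), len(outer))
```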
plot_distribution(mode='aggregated', frac=0.1, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing the distribution of per-base read depth.

Parameters
  • mode ({‘aggregated’, ‘individual’}, default: ‘aggregated’) – Determines how to display the lines:

    • ‘aggregated’: Aggregate over repeated values to show the mean and 95% confidence interval.

    • ‘individual’: Show data for individual samples.

  • frac (float, default: 0.1) – Fraction of data to be sampled (to speed up the process).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='aggregated'), the method will aggregate over repeated values:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.plot_distribution(mode='aggregated', frac=0.9)
>>> plt.tight_layout()
[Figure: distribution of per-base read depth, aggregated over samples]

We can display data for individual samples:

>>> cf.plot_distribution(mode='individual', frac=0.9)
>>> plt.tight_layout()
[Figure: distribution of per-base read depth, one line per sample]
plot_region(sample, region=None, samples=None, label=None, ax=None, figsize=None, **kwargs)[source]

Create read depth profile for specified region.

Region can be omitted if there is only one contig in the CovFrame.

Parameters
  • region (str, optional) – Target region (‘chrom:start-end’).

  • label (str, optional) – Label to use for the data points.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...    'Chromosome': ['chr1'] * 1000,
...    'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> ax = cf.plot_region('A')
>>> plt.tight_layout()
[Figure: read depth profile for sample A]

We can draw multiple profiles in one plot:

>>> ax = cf.plot_region('A', label='A')
>>> cf.plot_region('B', label='B', ax=ax)
>>> ax.legend()
>>> plt.tight_layout()
[Figure: read depth profiles for samples A and B in one plot]
plot_uniformity(mode='aggregated', frac=0.1, n=20, m=None, marker=None, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing the uniformity in read depth.

Parameters
  • mode ({‘aggregated’, ‘individual’}, default: ‘aggregated’) – Determines how to display the lines:

    • ‘aggregated’: Aggregate over repeated values to show the mean and 95% confidence interval.

    • ‘individual’: Show data for individual samples.

  • frac (float, default: 0.1) – Fraction of data to be sampled (to speed up the process).

  • n (int or list, default: 20) – Number of evenly spaced points to generate for the x-axis. Alternatively, positions can be manually specified by providing a list.

  • m (float, optional) – Maximum point in the x-axis. By default, it will be the maximum depth in the entire dataset.

  • marker (str, optional) – Marker style string (e.g. ‘o’).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='aggregated'), the method will aggregate over repeated values:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.plot_uniformity(mode='aggregated')
>>> plt.tight_layout()
[Figure: uniformity in read depth, aggregated over samples]

We can display data for individual samples:

>>> cf.plot_uniformity(mode='individual')
>>> plt.tight_layout()
[Figure: uniformity in read depth, one line per sample]
rename(names, indicies=None)[source]

Rename the samples.

Parameters
  • names (dict or list) – Dict of old names to new names or list of new names.

  • indicies (list or tuple, optional) – List of 0-based sample indices. Alternatively, a tuple (int, int) can be used to specify an index range.

Returns

Updated CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 2,
...     'Position': np.arange(1, 3),
...     'A': pycov.simulate(loc=35, scale=5, size=2),
...     'B': pycov.simulate(loc=25, scale=7, size=2),
...     'C': pycov.simulate(loc=25, scale=7, size=2),
...     'D': pycov.simulate(loc=25, scale=7, size=2),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df
  Chromosome  Position   A   B   C   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['1', '2', '3', '4']).df
  Chromosome  Position   1   2   3   4
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename({'B': '2', 'C': '3'}).df
  Chromosome  Position   A   2   3   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['2', '4'], indicies=[1, 3]).df
  Chromosome  Position   A   2   C   4
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['2', '3'], indicies=(1, 3)).df
  Chromosome  Position   A   2   3   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
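
The three ways of selecting which samples to rename can be sketched on a plain list of names. The function rename_samples is hypothetical; the treatment of a tuple as a half-open index range (start included, end excluded) is inferred from the last example above, where (1, 3) renames B and C only.

```python
def rename_samples(samples, names, indicies=None):
    """Hypothetical helper mirroring the renaming rules shown above."""
    samples = list(samples)
    if isinstance(names, dict):
        # Map old names to new names; unmapped samples keep their name.
        return [names.get(s, s) for s in samples]
    if indicies is None:
        targets = list(range(len(samples)))
    elif isinstance(indicies, tuple):
        # A tuple is read as a half-open range of 0-based indices.
        targets = list(range(indicies[0], indicies[1]))
    else:
        targets = list(indicies)
    if len(names) != len(targets):
        raise ValueError("number of new names must match the selection")
    for index, name in zip(targets, names):
        samples[index] = name
    return samples


samples = ["A", "B", "C", "D"]
by_list = rename_samples(samples, ["1", "2", "3", "4"])
by_dict = rename_samples(samples, {"B": "2", "C": "3"})
by_index = rename_samples(samples, ["2", "4"], indicies=[1, 3])
by_range = rename_samples(samples, ["2", "3"], indicies=(1, 3))
print(by_list, by_dict, by_index, by_range)
```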
property samples

List of the sample names.

Type

list

property shape

Dimensionality of CovFrame (positions, samples).

Type

tuple

slice(region)[source]

Slice the CovFrame for the region.

Parameters

region (str) – Region (‘chrom:start-end’).

Returns

Sliced CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1']*500 + ['chr2']*500,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.slice('chr2').df.head()
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
>>> cf.slice('chr2:1500-1504').df
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
>>> cf.slice('chr2:-1504').df
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
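
The three region forms used above (contig only, full range, and open-ended range) can be handled with a small parser. This sketch is an illustration only: parse_region and slice_rows are hypothetical names, and both ends are treated as inclusive because the example returns positions 1500 through 1504 for ‘chr2:1500-1504’.

```python
def parse_region(region):
    """Hypothetical helper: split a region into (chrom, start, end).

    Missing parts are returned as None, e.g. 'chr2' or 'chr2:-1504'.
    """
    chrom, _, span = region.partition(":")
    if not span:
        return chrom, None, None
    start, _, end = span.partition("-")
    return chrom, int(start) if start else None, int(end) if end else None


def slice_rows(rows, region):
    """Keep (chrom, position) rows that fall inside the region (inclusive)."""
    chrom, start, end = parse_region(region)
    return [
        (c, p) for c, p in rows
        if c == chrom
        and (start is None or p >= start)
        and (end is None or p <= end)
    ]


rows = [("chr1", p) for p in range(1000, 1500)]
rows += [("chr2", p) for p in range(1500, 2000)]

whole = slice_rows(rows, "chr2")
closed = slice_rows(rows, "chr2:1500-1504")
open_start = slice_rows(rows, "chr2:-1504")
print(len(whole), closed, open_start)
```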
subset(samples, exclude=False)[source]

Subset CovFrame for specified samples.

Parameters
  • samples (str or list) – Sample name or list of names (the order matters).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted CovFrame.

Return type

CovFrame

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
...     'C': pycov.simulate(loc=15, scale=2),
...     'D': pycov.simulate(loc=45, scale=8),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B   C   D
0       chr1      1000  30  30  15  37
1       chr1      1001  25  24  11  43
2       chr1      1002  33  24  16  50
3       chr1      1003  29  22  15  46
4       chr1      1004  34  30  11  32

We can subset the CovFrame for the samples A and B:

>>> cf.subset(['A', 'B']).df.head()
  Chromosome  Position   A   B
0       chr1      1000  30  30
1       chr1      1001  25  24
2       chr1      1002  33  24
3       chr1      1003  29  22
4       chr1      1004  34  30

Alternatively, we can exclude those samples:

>>> cf.subset(['A', 'B'], exclude=True).df.head()
  Chromosome  Position   C   D
0       chr1      1000  15  37
1       chr1      1001  11  43
2       chr1      1002  16  50
3       chr1      1003  15  46
4       chr1      1004  11  32
to_file(fn, compression=False)[source]

Write the CovFrame to a TSV file.

If the file name ends with ‘.gz’, the method will automatically use the GZIP compression when writing the file.

Parameters
  • fn (str) – TSV file (compressed or uncompressed).

  • compression (bool, default: False) – If True, use the GZIP compression.

to_string()[source]

Render the CovFrame to a console-friendly tabular output.

Returns

String representation of the CovFrame.

Return type

str

update_chr_prefix(mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from the Chromosome column.

Parameters

mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 3 + ['2'] * 3,
...     'Position': np.arange(1, 7),
...     'A': pycov.simulate(loc=35, scale=5, size=6),
...     'B': pycov.simulate(loc=25, scale=7, size=6),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df
  Chromosome  Position   A   B
0       chr1         1  35  25
1       chr1         2  23  14
2       chr1         3  32  23
3          2         4  38  25
4          2         5  33   8
5          2         6  21  22
>>> cf.update_chr_prefix(mode='remove').df
  Chromosome  Position   A   B
0          1         1  35  25
1          1         2  23  14
2          1         3  32  23
3          2         4  38  25
4          2         5  33   8
5          2         6  21  22
>>> cf.update_chr_prefix(mode='add').df
  Chromosome  Position   A   B
0       chr1         1  35  25
1       chr1         2  23  14
2       chr1         3  32  23
3       chr2         4  38  25
4       chr2         5  33   8
5       chr2         6  21  22
fuc.api.pycov.concat(cfs, axis=0)[source]

Concatenate CovFrame objects along a particular axis.

Parameters
  • cfs (list) – List of CovFrame objects.

  • axis ({0/’index’, 1/’columns’}, default: 0) – The axis to concatenate along.

Returns

Concatenated CovFrame.

Return type

CovFrame

fuc.api.pycov.merge(cfs, how='inner')[source]

Merge CovFrame objects.

Parameters
  • cfs (list) – List of CovFrames to be merged. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the contig names of the first CovFrame.

  • how (str, default: ‘inner’) – Type of merge as defined in pandas.merge().

Returns

Merged CovFrame.

Return type

CovFrame

See also

CovFrame.merge

Merge self with another CovFrame.

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov
>>> data1 = {
...     'Chromosome': ['chr1'] * 5,
...     'Position': np.arange(100, 105),
...     'A': pycov.simulate(loc=35, scale=5, size=5),
...     'B': pycov.simulate(loc=25, scale=7, size=5),
... }
>>> data2 = {
...     'Chromosome': ['1'] * 5,
...     'Position': np.arange(102, 107),
...     'C': pycov.simulate(loc=35, scale=5, size=5),
... }
>>> cf1 = pycov.CovFrame.from_dict(data1)
>>> cf2 = pycov.CovFrame.from_dict(data2)
>>> cf1.df
  Chromosome  Position   A   B
0       chr1       100  33  17
1       chr1       101  36  20
2       chr1       102  39  39
3       chr1       103  31  19
4       chr1       104  31  10
>>> cf2.df
  Chromosome  Position   C
0          1       102  41
1          1       103  37
2          1       104  35
3          1       105  33
4          1       106  39

We can merge the two CovFrames with how=’inner’ (default):

>>> pycov.merge([cf1, cf2]).df
  Chromosome  Position   A   B   C
0       chr1       102  39  39  41
1       chr1       103  31  19  37
2       chr1       104  31  10  35

We can also merge with how=’outer’:

>>> pycov.merge([cf1, cf2], how='outer').df
  Chromosome  Position     A     B     C
0       chr1       100  33.0  17.0   NaN
1       chr1       101  36.0  20.0   NaN
2       chr1       102  39.0  39.0  41.0
3       chr1       103  31.0  19.0  37.0
4       chr1       104  31.0  10.0  35.0
5       chr1       105   NaN   NaN  33.0
6       chr1       106   NaN   NaN  39.0
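
The behaviour illustrated above (contig names harmonized to the first CovFrame, then an inner or outer join on position) can be sketched without pandas. The dictionary layout and the helper names `strip_chr` and `merge_depth` are assumptions made for this sketch, and missing values are represented as None rather than NaN.

```python
def strip_chr(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name

def merge_depth(first, second, how="inner"):
    """Merge two {(contig, position): {sample: depth}} tables (hypothetical layout)."""
    a = {(strip_chr(c), p): row for (c, p), row in first.items()}
    b = {(strip_chr(c), p): row for (c, p), row in second.items()}
    if how == "inner":
        keys = a.keys() & b.keys()   # positions present in both tables
    elif how == "outer":
        keys = a.keys() | b.keys()   # positions present in either table
    else:
        raise ValueError(f"Unsupported merge type: {how}")
    samples = sorted({s for row in list(a.values()) + list(b.values()) for s in row})
    # Follow the naming convention of the first table in the output.
    use_prefix = any(c.startswith("chr") for c, _ in first)
    merged = {}
    for contig, pos in sorted(keys):
        values = {**a.get((contig, pos), {}), **b.get((contig, pos), {})}
        name = "chr" + contig if use_prefix else contig
        merged[(name, pos)] = {s: values.get(s) for s in samples}
    return merged

depths1 = [(33, 17), (36, 20), (39, 39), (31, 19), (31, 10)]
cf1 = {("chr1", 100 + i): {"A": a, "B": b} for i, (a, b) in enumerate(depths1)}
cf2 = {("1", 102 + i): {"C": c} for i, c in enumerate([41, 37, 35, 33, 39])}
inner = merge_depth(cf1, cf2)
outer = merge_depth(cf1, cf2, how="outer")
```

With the data from the example above, `inner` keeps positions 102 to 104 and `outer` keeps all seven positions.
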
fuc.api.pycov.simulate(mode='wgs', loc=30, scale=5, size=1000)[source]

Simulate read depth data for a single sample.

Generated read depth will be integer and non-negative.

Parameters
  • mode ({‘wgs’}, default: ‘wgs’) – Additional modes will be made available in future releases.

  • loc (float, default: 30) – Mean (“centre”) of the distribution.

  • scale (float, default: 5) – Standard deviation (spread or “width”) of the distribution. Must be non-negative.

  • size (int, default: 1000) – Number of base pairs to return.

Returns

Numpy array object.

Return type

numpy.ndarray

Examples

>>> from fuc import pycov
>>> pycov.simulate(size=10)
array([25, 32, 30, 31, 26, 25, 33, 29, 28, 35])
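
For readers who want to see how integer, non-negative depths could be drawn from a distribution with a given mean and standard deviation, here is a minimal standard-library sketch. It assumes a normal distribution, rounds to the nearest integer, and clips negative draws to zero; these choices and the helper name `simulate_depth` are assumptions, not a description of how pycov.simulate works internally.

```python
import random

def simulate_depth(loc=30, scale=5, size=1000, seed=None):
    """Draw integer, non-negative read depths (hypothetical helper)."""
    rng = random.Random(seed)
    # Round each draw to an integer and clip negative values to zero.
    return [max(0, round(rng.gauss(loc, scale))) for _ in range(size)]

depth = simulate_depth(size=10, seed=0)
```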

fuc.pyfq

The pyfq submodule is designed for working with FASTQ files. It implements pyfq.FqFrame which stores FASTQ data as pandas.DataFrame to allow fast computation and easy manipulation.

Classes:

FqFrame(df)

Class for storing FASTQ data.

class fuc.api.pyfq.FqFrame(df)[source]

Class for storing FASTQ data.

Methods:

from_file(fn)

Construct FqFrame from a FASTQ file.

readlen()

Return a dictionary of read lengths and their counts.

to_file(file_path)

Write the FqFrame to a FASTQ file.

Attributes:

shape

Number of sequence reads in the FqFrame.

classmethod from_file(fn)[source]

Construct FqFrame from a FASTQ file.

Parameters

fn (str) – FASTQ file path (compressed or uncompressed).

Returns

FqFrame.

Return type

FqFrame

See also

FqFrame

FqFrame object creation using constructor.

readlen()[source]

Return a dictionary of read lengths and their counts.
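
The kind of dictionary described here can be illustrated with a short standard-library sketch that tallies sequence lengths from FASTQ text. The helper name `read_lengths` and the toy records are made up for illustration; this is not the FqFrame implementation.

```python
from collections import Counter

FASTQ = """@read1
ACGT
+
IIII
@read2
ACGTA
+
IIIII
@read3
TTGA
+
IIII
"""

def read_lengths(fastq_text):
    """Return {read length: count} for FASTQ text (hypothetical helper)."""
    lines = fastq_text.strip().splitlines()
    # Each FASTQ record spans four lines; the sequence is the second one.
    sequences = lines[1::4]
    return dict(Counter(len(seq) for seq in sequences))

counts = read_lengths(FASTQ)
```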

property shape

Number of sequence reads in the FqFrame.

Type

int

to_file(file_path)[source]

Write the FqFrame to a FASTQ file.

fuc.pygff

The pygff submodule is designed for working with GFF/GTF files. It implements pygff.GffFrame which stores GFF/GTF data as pandas.DataFrame to allow fast computation and easy manipulation. The submodule strictly adheres to the standard GFF specification.

A GFF/GTF file contains nine columns as follows:

No.  Name        Description               Examples
1    Seqid       Landmark ID               ‘NC_000001.10’, ‘NC_012920.1’
2    Source      Feature source            ‘RefSeq’, ‘BestRefSeq’, ‘Genescan’, ‘Genebank’
3    Type        Feature type              ‘transcript’, ‘exon’, ‘gene’
4    Start       Start coordinate          11874, 14409
5    End         End coordinate            11874, 14409
6    Score       Feature score             ‘.’, ‘1730.55’, ‘1070’
7    Strand      Feature strand            ‘.’, ‘-’, ‘+’, ‘?’
8    Phase       CDS phase                 ‘.’, ‘0’, ‘1’, ‘2’
9    Attributes  ‘;’-separated attributes  ‘ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA’
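
As an illustration of the ninth column, the sketch below splits a GFF3-style attribute string into a dictionary. It handles only the `key=value` form shown in the example (GTF files use a different `key "value";` syntax), and the helper name `parse_attributes` is hypothetical rather than part of pygff.

```python
def parse_attributes(text):
    """Split a GFF3 attribute string into a dict (hypothetical helper)."""
    fields = [field for field in text.split(";") if field]
    # Split on the first '=' only, since values may themselves contain '='.
    return dict(field.split("=", 1) for field in fields)

attrs = parse_attributes(
    "ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;"
    "chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA"
)
```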

Classes:

GffFrame(meta, df, fasta)

Class for storing GFF/GTF data.

class fuc.api.pygff.GffFrame(meta, df, fasta)[source]

Class for storing GFF/GTF data.

Parameters
  • meta (list) – List of metadata lines.

  • df (pandas.DataFrame) – DataFrame containing GFF/GTF data.

  • fasta (str) – FASTA sequence lines.

Attributes:

df

DataFrame containing GFF/GTF data.

fasta

FASTA sequence lines.

meta

List of metadata lines.

Methods:

from_file(fn)

Construct GffFrame from a GFF/GTF file.

protein_length(gene[, name])

Return the protein length of a gene.

property df

DataFrame containing GFF/GTF data.

Type

pandas.DataFrame

property fasta

FASTA sequence lines.

Type

dict

classmethod from_file(fn)[source]

Construct GffFrame from a GFF/GTF file.

Parameters

fn (str) – GFF/GTF file (compressed or uncompressed).

Returns

GffFrame object.

Return type

GffFrame

property meta

List of metadata lines.

Type

list

protein_length(gene, name=None)[source]

Return the protein length of a gene.

Parameters
  • gene (str) – Name of the gene.

  • name (str, optional) – Protein sequence ID (e.g. ‘NP_005219.2’). Required when the gene has multiple protein sequences available.

Returns

Protein length.

Return type

int

fuc.pykallisto

The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements pykallisto.KallistoFrame which stores Kallisto’s output data as pandas.DataFrame to allow fast computation and easy manipulation. The pykallisto.KallistoFrame class also contains many useful plotting methods such as KallistoFrame.plot_differential_abundance.

Classes:

KallistoFrame(metadata, tx2gene, ...[, ...])

Class for working with RNAseq quantification data from Kallisto.

Functions:

basic_filter(row[, min_reads, min_prop])

A basic filter for removing transcripts with low estimated counts.

class fuc.api.pykallisto.KallistoFrame(metadata, tx2gene, aggregation_column, filter_func=None, filter_target_id=None, filter_off=False)[source]

Class for working with RNAseq quantification data from Kallisto.

Parameters
  • metadata (pandas.DataFrame) – DataFrame containing sample metadata (e.g. group information).

  • tx2gene (pandas.DataFrame) – DataFrame containing transcript to gene mapping data.

  • aggregation_column (str) – Column name in tx2gene to aggregate transcripts to the gene level.

  • filter_func (func, optional) – Filtering function to be applied to each row (i.e. transcript). By default, the pykallisto.basic_filter() function will be used.

  • filter_target_id (list, optional) – Transcripts to filter using methods that can’t be implemented using filter_func. If provided, this will override filter_func.

  • filter_off (bool, default: False) – If True, do not apply any filtering. Useful for generating a simple count or tpm matrix.

Methods:

aggregate([filter])

Aggregate transcript-level data to obtain gene-level data.

compute_fold_change(group, genes[, unit, flip])

Compute fold change of gene expression between two groups.

plot_differential_abundance(gene, group[, ...])

Plot differential abundance results for a single gene.

aggregate(filter=True)[source]

Aggregate transcript-level data to obtain gene-level data.

Running this method will set the attributes KallistoFrame.df_gene_count and KallistoFrame.df_gene_tpm.

Parameters

filter (bool, default: True) – If true, use filtered transcripts only. Otherwise, use all.
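
Conceptually, aggregation collapses transcript-level values into one value per gene using the transcript-to-gene mapping. The sketch below assumes that values are summed per gene, which is a common convention but is not confirmed here for KallistoFrame.aggregate; the helper name `aggregate_to_gene` and the toy data are hypothetical.

```python
from collections import defaultdict

def aggregate_to_gene(tx_values, tx2gene, keep=None):
    """Sum transcript-level values per gene (hypothetical helper).

    keep is an optional set of transcripts that passed filtering.
    """
    gene_values = defaultdict(float)
    for tx, value in tx_values.items():
        if keep is not None and tx not in keep:
            continue  # skip transcripts removed by the filter
        gene_values[tx2gene[tx]] += value
    return dict(gene_values)

tx_tpm = {"tx1": 10.0, "tx2": 5.0, "tx3": 2.5}
tx2gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}
all_tx = aggregate_to_gene(tx_tpm, tx2gene)
filtered = aggregate_to_gene(tx_tpm, tx2gene, keep={"tx1", "tx3"})
```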

compute_fold_change(group, genes, unit='tpm', flip=False)[source]

Compute fold change of gene expression between two groups.

Parameters
  • group (str) – Column in KallistoFrame.metadata specifying group information.

  • genes (list) – Genes to compare.

  • unit ({‘tpm’, ‘count’}, default: ‘tpm’) – Abundance unit to display.

  • flip (bool, default: False) – If true, flip the denominator and numerator.

plot_differential_abundance(gene, group, aggregate=True, filter=True, name='target_id', unit='tpm', ax=None, figsize=None)[source]

Plot differential abundance results for a single gene.

Parameters
  • gene (str) – Gene to compare.

  • group (str) – Column in KallistoFrame.metadata specifying group information.

  • aggregate (bool, default: True) – If true, display gene-level data (the KallistoFrame.aggregate() method must be run beforehand). Otherwise, display transcript-level data.

  • filter (bool, default: True) – If true, use filtered transcripts only. Otherwise, use all. Ignored when aggregate=True.

  • name (str, default: ‘target_id’) – Column in KallistoFrame.tx2gene specifying transcript name to be displayed in the legend. Ignored when aggregate=True.

  • unit ({‘tpm’, ‘count’}, default: ‘tpm’) – Abundance unit to display.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

fuc.api.pykallisto.basic_filter(row, min_reads=5, min_prop=0.47)[source]

A basic filter for removing transcripts with low estimated counts.

By default, the method will filter out rows (i.e. transcripts) that do not have at least 5 estimated counts in at least 47% of the samples. Note that this is equivalent to the sleuth.basic_filter() method.

Parameters
  • row (pandas.Series) – This is a vector of numerics that will be passed in.

  • min_reads (int, default: 5) – The minimum number of estimated counts.

  • min_prop (float, default: 0.47) – The minimum proportion of samples.

Returns

A pandas series of boolean.

Return type

pd.Series
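
The filtering rule described above (at least min_reads estimated counts in at least min_prop of the samples) can be written in a few lines of plain Python. This sketch operates on a single row given as a list of counts and returns one boolean; it illustrates the rule only and is not the pykallisto code, which works on pandas objects.

```python
def passes_basic_filter(row, min_reads=5, min_prop=0.47):
    """Return True if the transcript should be kept (hypothetical helper)."""
    # Count the samples in which the transcript reaches min_reads.
    passing = sum(1 for count in row if count >= min_reads)
    return passing / len(row) >= min_prop

kept = passes_basic_filter([10, 0, 7, 6])      # 3 of 4 samples pass
dropped = passes_basic_filter([10, 0, 0, 0])   # 1 of 4 samples passes
borderline = passes_basic_filter([5, 5, 0, 0]) # 2 of 4 samples pass
```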

fuc.pymaf

The pymaf submodule is designed for working with MAF files. It implements pymaf.MafFrame which stores MAF data as pandas.DataFrame to allow fast computation and easy manipulation. The pymaf.MafFrame class also contains many useful plotting methods such as MafFrame.plot_oncoplot and MafFrame.plot_summary. The submodule strictly adheres to the standard MAF specification.

A typical MAF file contains many columns ranging from gene symbol to protein change. However, most of the analysis in pymaf uses the following columns:

No.  Name                    Description           Examples
1    Hugo_Symbol             HUGO gene symbol      ‘TP53’, ‘Unknown’
2    Chromosome              Chromosome name       ‘chr1’, ‘1’, ‘X’
3    Start_Position          Start coordinate      119031351
4    End_Position            End coordinate        44079555
5    Variant_Classification  Translational effect  ‘Missense_Mutation’, ‘Silent’
6    Variant_Type            Mutation type         ‘SNP’, ‘INS’, ‘DEL’
7    Reference_Allele        Reference allele      ‘T’, ‘-’, ‘ACAA’
8    Tumor_Seq_Allele1       First tumor allele    ‘A’, ‘-’, ‘TCA’
9    Tumor_Seq_Allele2       Second tumor allele   ‘A’, ‘-’, ‘TCA’
10   Tumor_Sample_Barcode    Sample ID             ‘TCGA-AB-3002’
11   Protein_Change          Protein change        ‘p.L558Q’

It is also recommended to include additional custom columns such as variant allele frequency (VAF) and transcript name.

If sample annotation data are available for a given MAF file, use the common.AnnFrame class to import the data.

There are nine nonsynonymous variant classifications that pymaf primarily uses: Missense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Nonsense_Mutation, Nonstop_Mutation, Splice_Site, and Translation_Start_Site.
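
When working with raw classification values outside of pymaf, the nine classes listed above can be kept in a small constant. The names `NONSYN_CLASSES` and `keep_nonsynonymous` below are hypothetical and chosen only for this sketch.

```python
# The nine nonsynonymous variant classifications listed above.
NONSYN_CLASSES = frozenset([
    "Missense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins",
    "In_Frame_Del", "In_Frame_Ins", "Nonsense_Mutation",
    "Nonstop_Mutation", "Splice_Site", "Translation_Start_Site",
])

def keep_nonsynonymous(classifications):
    """Keep only nonsynonymous classifications, preserving order."""
    return [c for c in classifications if c in NONSYN_CLASSES]

observed = ["Missense_Mutation", "Silent", "Splice_Site", "Intron"]
kept = keep_nonsynonymous(observed)
```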

Classes:

MafFrame(df)

Class for storing MAF data.

class fuc.api.pymaf.MafFrame(df)[source]

Class for storing MAF data.

Parameters

df (pandas.DataFrame) – DataFrame containing MAF data.

See also

MafFrame.from_file

Construct MafFrame from a MAF file.

Methods:

calculate_concordance(a, b[, c, mode])

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

compute_clonality(vaf_col[, threshold])

Compute the clonality of variants based on VAF.

copy()

Return a copy of the MafFrame.

filter_annot(af, expr)

Filter the MafFrame using sample annotation data.

filter_indel([opposite, as_index])

Remove rows with an indel.

from_file(fn)

Construct MafFrame from a MAF file.

from_vcf(vcf[, keys, names])

Construct MafFrame from a VCF file or VcfFrame.

get_gene_concordance(gene, a, b)

Test whether two samples have an identical mutation profile for the specified gene.

matrix_genes([mode, count])

Compute a matrix of counts with a shape of (genes, variant classifications).

matrix_prevalence()

Compute a matrix of variant counts with a shape of (genes, samples).

matrix_tmb()

Compute a matrix of variant counts with a shape of (samples, variant classifications).

matrix_waterfall([count, keep_empty])

Compute a matrix of variant classifications with a shape of (genes, samples).

matrix_waterfall_matched(af, patient_col, ...)

Compute a matrix of variant classifications with a shape of (gene-group pairs, patients).

plot_clonality(vaf_col[, af, group_col, ...])

Create a bar plot summarizing the clonality of variants in top mutated genes.

plot_comparison(a, b[, c, labels, ax, figsize])

Create a Venn diagram showing genotype concordance between groups.

plot_evolution(samples, vaf_col[, anchor, ...])

Create a line plot visualizing changes in VAF between specified samples.

plot_genepair(x, y, vaf_col[, af, ...])

Create a scatter plot of VAF between Gene X and Gene Y.

plot_genes([mode, count, flip, ax, figsize])

Create a bar plot showing variant distribution for top mutated genes.

plot_interactions([count, cmap, ax, figsize])

Create a heatmap representing mutually exclusive or co-occurring set of genes.

plot_lollipop(gene[, alpha, ax, figsize, legend])

Create a lollipop or stem plot showing amino acid changes of a gene.

plot_matrixg(gene, af, group_col[, ...])

Create a heatmap of count matrix with a shape of (sample groups, protein changes).

plot_matrixs(gene[, samples, c0, c1, l0, ...])

Create a heatmap of presence/absence matrix with a shape of (samples, protein changes).

plot_mutated([af, group_col, group_order, ...])

Create a bar plot visualizing the mutation prevalence of top mutated genes.

plot_mutated_matched(af, patient_col, ...[, ...])

Create a bar plot visualizing the mutation prevalence of top mutated genes.

plot_oncoplot([count, keep_empty, figsize, ...])

Create an oncoplot.

plot_oncoplot_matched(af, patient_col, ...)

Create an oncoplot for matched samples.

plot_rainfall(sample[, palette, legend, ax, ...])

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

plot_regplot(af, group_col, a, b[, a_size, ...])

Create a scatter plot with a linear regression model fit visualizing correlation between gene mutation frequencies in two sample groups A and B.

plot_snvclsc([af, group_col, group_order, ...])

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

plot_snvclsp([af, group_col, group_order, ...])

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

plot_snvclss([samples, color, colormap, ...])

Create a bar plot showing the proportions of the six SNV classes for individual samples.

plot_summary([figsize, title_fontsize, ...])

Create a summary figure for MafFrame.

plot_titv([af, group_col, group_order, ...])

Create a box plot showing the Ti/Tv proportions of samples.

plot_tmb([samples, width, ax, figsize])

Create a bar plot showing the TMB distributions of samples.

plot_tmb_matched(af, patient_col, group_col)

Create a grouped bar plot showing TMB distributions for different group levels in each patient.

plot_vaf(vaf_col[, count, af, group_col, ...])

Create a box plot showing the VAF distributions of top mutated genes.

plot_varcls([ax, figsize])

Create a bar plot for the nonsynonymous variant classes.

plot_varsum([flip, ax, figsize])

Create a summary box plot for variant classifications.

plot_vartype([palette, flip, ax, figsize])

Create a bar plot summarizing the count distributions of variant types for all samples.

plot_waterfall([count, keep_empty, samples, ...])

Create a waterfall plot (oncoplot).

plot_waterfall_matched(af, patient_col, ...)

Create a waterfall plot using matched samples from each patient.

subset(samples[, exclude])

Subset MafFrame for specified samples.

to_file(fn)

Write MafFrame to a MAF file.

to_string()

Render MafFrame to a console-friendly tabular output.

to_vcf([fasta, ignore_indels, cols, names])

Write the MafFrame to a sorted VcfFrame.

variants()

List unique variants in MafFrame.

Attributes:

genes

List of the genes.

samples

List of the sample names.

shape

Dimensionality of MafFrame (variants, samples).

calculate_concordance(a, b, c=None, mode='all')[source]

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

This method will return (Ab, aB, AB, ab) for comparison between two samples and (Abc, aBc, ABc, abC, AbC, aBC, ABC, abc) for three samples. Note that the former is equivalent to (FP, FN, TP, TN) if we assume A is the test sample and B is the truth sample.

Parameters
  • a, b (str or int) – Name or index of Samples A and B.

  • c (str or int, optional) – Name or index of Sample C.

  • mode ({‘all’, ‘snv’, ‘indel’}, default: ‘all’) – Determines which variant types should be analyzed:

    • ‘all’: Include both SNVs and INDELs.

    • ‘snv’: Include SNVs only.

    • ‘indel’: Include INDELs only.

Returns

Four- or eight-element tuple depending on the number of samples.

Return type

tuple

See also

fuc.api.common.sumstat

Return various summary statistics from (FP, FN, TP, TN).

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.calculate_concordance('TCGA-AB-2988', 'TCGA-AB-2869')
(15, 12, 0, 2064)
>>> mf.calculate_concordance('TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009')
(15, 12, 0, 42, 0, 0, 0, 2022)
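
The meaning of the four-element tuple can be illustrated with set operations. In this sketch each sample is a set of variant identifiers and `universe` is the set of all variants under consideration; both the layout and the helper name `concordance` are assumptions for illustration, not the MafFrame implementation.

```python
def concordance(a, b, universe):
    """Return (Ab, aB, AB, ab) for two sets of variants (hypothetical helper)."""
    only_a = len(a - b)               # Ab: present in A only
    only_b = len(b - a)               # aB: present in B only
    both = len(a & b)                 # AB: present in both
    neither = len(universe - (a | b)) # ab: present in neither
    return (only_a, only_b, both, neither)

universe = {"v1", "v2", "v3", "v4", "v5", "v6"}
sample_a = {"v1", "v2", "v3"}
sample_b = {"v2", "v4"}
result = concordance(sample_a, sample_b, universe)
```

If A is treated as the test sample and B as the truth sample, the tuple reads as (FP, FN, TP, TN).
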
compute_clonality(vaf_col, threshold=0.25)[source]

Compute the clonality of variants based on VAF.

A mutation will be defined as “Subclonal” if the VAF is less than the threshold percentage (e.g. 25%) of the highest VAF in the sample and is defined as “Clonal” if it is equal to or above this threshold.

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • threshold (float, default: 0.25) – Minimum VAF, as a fraction of the highest VAF in the sample, to be considered as “Clonal”.

Returns

Clonality for each variant.

Return type

pandas.Series

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.df['Clonality'] = mf.compute_clonality('i_TumorVAF_WU')
>>> mf.df['Clonality'][:10]
0    Clonal
1    Clonal
2    Clonal
3    Clonal
4    Clonal
5    Clonal
6    Clonal
7    Clonal
8    Clonal
9    Clonal
Name: Clonality, dtype: object
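
The clonality rule described above can be sketched for the variants of one sample as follows. The helper name `classify_clonality` and the toy VAF values are hypothetical; the real method works on a MafFrame column and handles each sample separately.

```python
def classify_clonality(vafs, threshold=0.25):
    """Label each VAF from one sample as 'Clonal' or 'Subclonal' (hypothetical helper)."""
    # The cutoff is a fraction of the highest VAF observed in the sample.
    cutoff = max(vafs) * threshold
    return ["Subclonal" if vaf < cutoff else "Clonal" for vaf in vafs]

labels = classify_clonality([0.48, 0.10, 0.30])
```

Here the cutoff is 25% of 0.48, so only the variant with a VAF of 0.10 falls below it.
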
copy()[source]

Return a copy of the MafFrame.

filter_annot(af, expr)[source]

Filter the MafFrame using sample annotation data.

Samples are selected by querying the columns of an AnnFrame with a boolean expression. Samples not present in the MafFrame will be excluded automatically.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • expr (str) – Query expression to evaluate.

Returns

Filtered MafFrame.

Return type

MafFrame

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> mf = pymaf.MafFrame.from_file('~/fuc-data/tcga-laml/tcga_laml.maf.gz')
>>> af = common.AnnFrame.from_file('~/fuc-data/tcga-laml/tcga_laml_annot.tsv', sample_col=0)
>>> filtered_mf = mf.filter_annot(af, "FAB_classification == 'M4'")
filter_indel(opposite=False, as_index=False)[source]

Remove rows with an indel.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of MafFrame.

Returns

Filtered MafFrame or boolean index array.

Return type

MafFrame or pandas.Series

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.filter_indel().df.Variant_Type.unique()
array(['SNP'], dtype=object)
>>> mf.filter_indel(opposite=True).df.Variant_Type.unique()
array(['DEL', 'INS'], dtype=object)
classmethod from_file(fn)[source]

Construct MafFrame from a MAF file.

Parameters

fn (str) – MAF file (compressed or uncompressed).

Returns

MafFrame object.

Return type

MafFrame

See also

MafFrame

MafFrame object creation using constructor.

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
classmethod from_vcf(vcf, keys=None, names=None)[source]

Construct MafFrame from a VCF file or VcfFrame.

It is recommended that the input VCF data be functionally annotated by an annotation tool such as Ensembl VEP, SnpEff, or ANNOVAR; however, the method can handle unannotated VCF data as well.

The preferred tool for functional annotation is Ensembl VEP with “RefSeq transcripts” as the transcript database and the filtering option “Show one selected consequence per variant”.

Parameters
  • vcf (str or VcfFrame) – VCF file or VcfFrame.

  • keys (str or list) – Genotype key (e.g. ‘AD’, ‘AF’) or list of genotype keys to be added to the MafFrame.

  • names (str or list) – Column name or list of column names for keys (must be the same length). By default, the genotype keys will be used as column names.

Examples

Below is a simple example:

>>> from fuc import pyvcf, pymaf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['CSQ=T|missense_variant|MODERATE|MTOR|2475|Transcript|NM_001386500.1|protein_coding|47/58||||6792|6644|2215|S/Y|tCt/tAt|rs587777894&COSV63868278&COSV63868313||-1||EntrezGene||||||||G|G||deleterious(0)|possibly_damaging(0.876)||||||||||||||||||likely_pathogenic&pathogenic|0&1&1|1&1&1|26619011&27159400&24631838&26018084&27830187|||||', 'CSQ=C|splice_donor_variant|HIGH|MTOR|2475|Transcript|NM_001386500.1|protein_coding||46/57||||||||||-1||EntrezGene||||||||A|A|||||||||||||||||||||||||||||'],
...     'FORMAT': ['GT:AD:DP:AF', 'GT:AD:DP:AF'],
...     'A': ['0/1:176,37:213:0.174', '0/1:966,98:1064:0.092']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER                                               INFO       FORMAT                      A
0  chr1  100  .   G   A    .      .  CSQ=T|missense_variant|MODERATE|MTOR|2475|Tran...  GT:AD:DP:AF   0/1:176,37:213:0.174
1  chr2  101  .   T   C    .      .  CSQ=C|splice_donor_variant|HIGH|MTOR|2475|Tran...  GT:AD:DP:AF  0/1:966,98:1064:0.092
>>> mf = pymaf.MafFrame.from_vcf(vf)
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode
0        MTOR           2475      .          .       chr1             100           100      -      Missense_Mutation          SNP                G                 A                 A       p.S2215Y                    A
1        MTOR           2475      .          .       chr2             101           101      -            Splice_Site          SNP                T                 C                 C              .                    A

We can add genotype keys such as AD and AF:

>>> mf = pymaf.MafFrame.from_vcf(vf, keys=['AD', 'AF'])
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode      AD     AF
0        MTOR           2475      .          .       chr1             100           100      -      Missense_Mutation          SNP                G                 A                 A       p.S2215Y                    A  176,37  0.174
1        MTOR           2475      .          .       chr2             101           101      -            Splice_Site          SNP                T                 C                 C              .                    A  966,98  0.092

The method can accept a VCF file as input instead of VcfFrame:

>>> mf = pymaf.MafFrame.from_vcf('annotated.vcf')

The method can also handle unannotated VCF data:

>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 200, 300],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'C', 'TTC'],
...     'ALT': ['A', 'CAG', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['0/1', '0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID  REF  ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .    G    A    .      .    .     GT  0/1
1  chr1  200  .    C  CAG    .      .    .     GT  0/1
2  chr1  300  .  TTC    T    .      .    .     GT  0/1
>>> mf = pymaf.MafFrame.from_vcf(vf)
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode
0           .              .      .          .       chr1             100           100      .                      .          SNP                G                 A                 A              .                    A
1           .              .      .          .       chr1             200           201      .                      .          INS                -                AG                AG              .                    A
2           .              .      .          .       chr1             301           302      .                      .          DEL               TC                 -                 -              .                    A
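
The last example shows how VCF-style alleles are converted to MAF-style coordinates and alleles: the shared leading base is dropped and the missing allele is written as ‘-’. The sketch below reproduces that conversion for the simple cases shown (SNVs and indels anchored by one leading base). The helper name `vcf_to_maf` is hypothetical, and the real method may handle additional cases.

```python
def vcf_to_maf(pos, ref, alt):
    """Return (start, end, variant_type, ref_allele, alt_allele) (hypothetical helper)."""
    if len(ref) == 1 and len(alt) == 1:
        return (pos, pos, "SNP", ref, alt)
    if len(ref) == 1 and alt.startswith(ref):
        # Insertion: drop the anchor base; the event lies between pos and pos + 1.
        return (pos, pos + 1, "INS", "-", alt[1:])
    if len(alt) == 1 and ref.startswith(alt):
        # Deletion: drop the anchor base; the deleted bases start at pos + 1.
        return (pos + 1, pos + len(ref) - 1, "DEL", ref[1:], "-")
    raise ValueError(f"Unsupported alleles: {ref}>{alt}")

snp = vcf_to_maf(100, "G", "A")
ins = vcf_to_maf(200, "C", "CAG")
dele = vcf_to_maf(300, "TTC", "T")
```

The three results match the Start_Position, End_Position, Variant_Type, and allele columns of the table above.
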
property genes

List of the genes.

Type

list

get_gene_concordance(gene, a, b)[source]

Test whether two samples have an identical mutation profile for the specified gene.

Parameters
  • gene (str) – Name of the gene.

  • a, b (str) – Sample name.

Returns

True if the two samples have the same mutation profile.

Return type

bool

matrix_genes(mode='variants', count=10)[source]

Compute a matrix of counts with a shape of (genes, variant classifications).

This method only considers the nine nonsynonymous variant classifications.

Parameters
  • mode ({‘variants’, ‘samples’}, default: ‘variants’) – Determines how to identify top mutated genes:

    • ‘variants’: Count the number of observed variants.

    • ‘samples’: Count the number of affected samples. Using this option will create an additional variant classification called ‘Multi_Hit’.

  • count (int, default: 10) – Number of top mutated genes to include.

Returns

The said matrix.

Return type

pandas.DataFrame

matrix_prevalence()[source]

Compute a matrix of variant counts with a shape of (genes, samples).

Returns

The said matrix.

Return type

pandas.DataFrame

matrix_tmb()[source]

Compute a matrix of variant counts with a shape of (samples, variant classifications).

Returns

The said matrix.

Return type

pandas.DataFrame

matrix_waterfall(count=10, keep_empty=False)[source]

Compute a matrix of variant classifications with a shape of (genes, samples).

If there are multiple variant classifications available for a given cell, the cell will be labeled ‘Multi_Hit’.

Parameters
  • count (int, default: 10) – Number of top mutated genes to include.

  • keep_empty (bool, default: False) – If True, keep samples with all NaN’s.

Returns

The said matrix.

Return type

pandas.DataFrame
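
The ‘Multi_Hit’ rule can be illustrated with a small plain-Python sketch that builds the (gene, sample) cells from a list of records. The record layout and the helper name `waterfall_cells` are assumptions; in particular, this sketch treats repeated identical classifications in a cell as a single classification, which may differ from what matrix_waterfall does.

```python
def waterfall_cells(records):
    """Map (gene, sample) to one classification label (hypothetical helper).

    records is an iterable of (gene, sample, classification) tuples.
    """
    cells = {}
    for gene, sample, classification in records:
        cells.setdefault((gene, sample), set()).add(classification)
    # A cell with more than one distinct classification becomes 'Multi_Hit'.
    return {
        key: next(iter(values)) if len(values) == 1 else "Multi_Hit"
        for key, values in cells.items()
    }

records = [
    ("TP53", "S1", "Missense_Mutation"),
    ("TP53", "S1", "Nonsense_Mutation"),
    ("TP53", "S2", "Splice_Site"),
    ("FLT3", "S1", "In_Frame_Ins"),
]
matrix = waterfall_cells(records)
```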

matrix_waterfall_matched(af, patient_col, group_col, group_order, count=10)[source]

Compute a matrix of variant classifications with a shape of (gene-group pairs, patients).

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to include.

Returns

The said matrix.

Return type

pandas.DataFrame

plot_clonality(vaf_col, af=None, group_col=None, group_order=None, count=10, threshold=0.25, subclonal=False, ax=None, figsize=None)[source]

Create a bar plot summarizing the clonality of variants in top mutated genes.

Clonality will be calculated based on VAF using MafFrame.compute_clonality().

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to display.

  • threshold (float, default: 0.25) – VAF threshold percentage.

  • subclonal (bool, default: False) – If True, display subclonality (1 - clonality).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.compute_clonality

Compute the clonality of variants based on VAF.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_clonality('i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-13.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_clonality('i_TumorVAF_WU',
...                   af=af,
...                   group_col='FAB_classification',
...                   group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-14.png
plot_comparison(a, b, c=None, labels=None, ax=None, figsize=None)[source]

Create a Venn diagram showing genotype concordance between groups.

This method supports comparison between two groups (Groups A & B) as well as three groups (Groups A, B, & C).

Parameters
  • a, b (list) – Sample names. The lists must have the same shape.

  • c (list, optional) – Same as above.

  • labels (list, optional) – List of labels to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • matplotlib_venn._common.VennDiagram – VennDiagram object.

plot_evolution(samples, vaf_col, anchor=None, normalize=True, count=5, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing changes in VAF between specified samples.

Parameters
  • samples (list) – List of samples to display.

  • vaf_col (str) – MafFrame column containing VAF data.

  • anchor (str, optional) – Sample to use as the anchor. If absent, use the first sample in the list.

  • normalize (bool, default: True) – If False, do not normalize VAF by the maximum value.

  • count (int, default: 5) – Number of top variants to display.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_genepair(x, y, vaf_col, af=None, group_col=None, group_order=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot of VAF between Gene X and Gene Y.

Parameters
  • x, y (str) – Gene names.

  • vaf_col (str) – MafFrame column containing VAF data.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_genepair('DNMT3A', 'FLT3', 'i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-15.png

We can create a grouped scatter plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_genepair('DNMT3A', 'FLT3', 'i_TumorVAF_WU',
...                  af=af,
...                  group_col='FAB_classification')
>>> plt.tight_layout()
_images/api-16.png
plot_genes(mode='variants', count=10, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing variant distribution for top mutated genes.

Parameters
  • mode ({‘variants’, ‘samples’}, default: ‘variants’) – Determines how to identify top mutated genes:

    • ‘variants’: Count the number of observed variants.

    • ‘samples’: Count the number of affected samples. Using this option will create an additional variant classification called ‘Multi_Hit’.

  • count (int, default: 10) – Number of top mutated genes to display.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='variants'), the method identifies top mutated genes by counting the number of observed variants:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_genes()
>>> plt.tight_layout()
_images/api-17.png

We can also identify top mutated genes by counting the number of affected samples:

>>> mf.plot_genes(mode='samples')
>>> plt.tight_layout()
_images/api-18.png
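
The difference between the two modes can be illustrated with a small standalone sketch. It does not use fuc and is not its implementation: the records list and both helper functions are hypothetical, and the 'Multi_Hit' rule shown here (a sample carrying more than one variant in the same gene) is an assumption about how that label is assigned.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (gene, sample, variant classification).
records = [
    ("DNMT3A", "S1", "Missense_Mutation"),
    ("DNMT3A", "S1", "Nonsense_Mutation"),
    ("DNMT3A", "S2", "Missense_Mutation"),
    ("FLT3", "S1", "In_Frame_Ins"),
    ("FLT3", "S3", "Missense_Mutation"),
    ("FLT3", "S4", "Missense_Mutation"),
]


def count_by_variants(records):
    """Count every observed variant once per gene."""
    return Counter(gene for gene, _, _ in records)


def count_by_samples(records):
    """Count each affected sample once per gene.

    Assumption: a sample with several variants in one gene is
    reported under a single 'Multi_Hit' label.
    """
    per_pair = defaultdict(list)
    for gene, sample, classification in records:
        per_pair[(gene, sample)].append(classification)
    counts = defaultdict(Counter)
    for (gene, _), classes in per_pair.items():
        label = classes[0] if len(classes) == 1 else "Multi_Hit"
        counts[gene][label] += 1
    return counts


variant_counts = count_by_variants(records)
sample_counts = count_by_samples(records)
print(dict(variant_counts))
print({gene: dict(c) for gene, c in sample_counts.items()})
```

In this toy data both genes have three variants, but DNMT3A affects only two samples while FLT3 affects three, so the two modes can rank genes differently.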
plot_interactions(count=10, cmap=None, ax=None, figsize=None, **kwargs)[source]

Create a heatmap representing mutually exclusive or co-occurring sets of genes.

This method performs pairwise Fisher’s exact tests to detect significant gene pairs.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • cmap (str, optional) – Color map.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_interactions(count=25, cmap='BrBG')
>>> plt.tight_layout()
_images/api-19.png
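
As a rough illustration of the pairwise test, the sketch below builds a 2×2 contingency table for a single gene pair and computes a two-sided Fisher's exact p-value using only the standard library. The sample sets and function names are made up for this example, and fuc's own implementation may differ in its details.

```python
from math import comb


def contingency_table(all_samples, mutated_x, mutated_y):
    """Return (both, only_x, only_y, neither) sample counts for two genes."""
    both = len(mutated_x & mutated_y)
    only_x = len(mutated_x - mutated_y)
    only_y = len(mutated_y - mutated_x)
    neither = len(all_samples - mutated_x - mutated_y)
    return both, only_x, only_y, neither


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


# Hypothetical cohort in which genes X and Y never co-occur.
all_samples = {f"S{i}" for i in range(1, 9)}
mutated_x = {"S1", "S2", "S3", "S4"}
mutated_y = {"S5", "S6", "S7", "S8"}

table = contingency_table(all_samples, mutated_x, mutated_y)
p_value = fisher_exact_two_sided(*table)
print(table, round(p_value, 4))
```

A table with no samples in the "both" cell, as here, points toward mutual exclusivity; a table dominated by the "both" and "neither" cells points toward co-occurrence.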
plot_lollipop(gene, alpha=0.7, ax=None, figsize=None, legend=True)[source]

Create a lollipop or stem plot showing amino acid changes of a gene.

Parameters
  • gene (str) – Name of the gene.

  • alpha (float, default: 0.7) – Set the color transparency. Must be within the 0-1 range, inclusive.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_lollipop('DNMT3A')
>>> plt.tight_layout()
_images/api-20.png
plot_matrixg(gene, af, group_col, group_order=None, cbar=True, ax=None, figsize=None, **kwargs)[source]

Create a heatmap of a count matrix with shape (sample groups, protein changes).

Parameters
  • gene (str) – Name of the gene.

  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • cbar (bool, default: True) – Whether to draw a colorbar.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_matrixg('IDH1', af, 'FAB_classification', linewidth=0.5, square=True, annot=True)
>>> plt.tight_layout()
_images/api-21.png
plot_matrixs(gene, samples=None, c0='lightgray', c1='red', l0='0', l1='1', cbar=True, square=False, ax=None, figsize=None, **kwargs)[source]

Create a heatmap of a presence/absence matrix with shape (samples, protein changes).

Parameters
  • gene (str) – Name of the gene.

  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • c0 (str, default: ‘lightgray’) – Color for absence.

  • c1 (str, default: ‘red’) – Color for presence.

  • l0 (str, default: ‘0’) – Label for absence.

  • l1 (str, default: ‘1’) – Label for presence.

  • cbar (bool, default: True) – Whether to draw a colorbar.

  • square (bool, default: False) – If True, set the Axes aspect to “equal” so each cell will be square-shaped.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_matrixs('KRAS', linewidth=0.5, square=True)
>>> plt.tight_layout()
_images/api-22.png
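
To make the (samples, protein changes) shape concrete, here is a minimal sketch of how such a presence/absence matrix could be assembled by hand. The records, sample names, and the presence_matrix helper are hypothetical and independent of fuc. Note how a sample without any matching variant still gets an all-zero row, mirroring the "empty bar" behavior described for the samples parameter.

```python
# Hypothetical (sample, protein change) pairs for a single gene.
records = [
    ("S1", "p.G12D"),
    ("S2", "p.G13D"),
    ("S3", "p.G12D"),
]

# Samples to display, in order; S4 has no variant in this gene.
samples = ["S1", "S2", "S3", "S4"]


def presence_matrix(records, samples):
    """Return (column labels, rows) where 1 marks presence and 0 absence."""
    changes = sorted({change for _, change in records})
    observed = set(records)
    rows = [
        [1 if (sample, change) in observed else 0 for change in changes]
        for sample in samples
    ]
    return changes, rows


changes, matrix = presence_matrix(records, samples)
print(changes)
for sample, row in zip(samples, matrix):
    print(sample, row)
```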
plot_mutated(af=None, group_col=None, group_order=None, genes=None, count=10, ax=None, figsize=None)[source]

Create a bar plot visualizing the mutation prevalence of top mutated genes.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • genes (list, optional) – Genes to display. When absent, top mutated genes (count) will be used.

  • count (int, default: 10) – Number of top mutated genes to display. Ignored if genes is specified.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_mutated()
>>> plt.tight_layout()
_images/api-23.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_mutated(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-24.png
plot_mutated_matched(af, patient_col, group_col, group_order, ax=None, figsize=None, **kwargs)[source]

Create a bar plot visualizing the mutation prevalence of top mutated genes.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_oncoplot(count=10, keep_empty=False, figsize=(15, 10), label_fontsize=15, ticklabels_fontsize=15, legend_fontsize=15)[source]

Create an oncoplot.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • keep_empty (bool, default: False) – If True, display samples that do not have any mutations.

  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • label_fontsize (float, default: 15) – Font size of labels.

  • ticklabels_fontsize (float, default: 15) – Font size of tick labels.

  • legend_fontsize (float, default: 15) – Font size of legend texts.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_oncoplot()
_images/api-25.png
plot_oncoplot_matched(af, patient_col, group_col, group_order, colors='Set2', figsize=(15, 10), label_fontsize=12, ticklabels_fontsize=12, legend_fontsize=12)[source]

Create an oncoplot for matched samples.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • colors (str) – Colormap name for the sample groups.

  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • label_fontsize (float, default: 12) – Font size of labels.

  • ticklabels_fontsize (float, default: 12) – Font size of tick labels.

  • legend_fontsize (float, default: 12) – Font size of legend texts.

plot_rainfall(sample, palette=None, legend='auto', ax=None, figsize=None, **kwargs)[source]

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

Parameters
  • sample (str) – Name of the sample.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • legend ({‘auto’, ‘brief’, ‘full’, False}, default: ‘auto’) – Display setting of the legend according to seaborn.scatterplot().

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('brca')
>>> maf_file = '~/fuc-data/brca/brca.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_rainfall('TCGA-A8-A08B',
...                  figsize=(14, 7),
...                  palette=sns.color_palette('Set2')[:6])
>>> plt.tight_layout()
_images/api-26.png
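
The quantity shown in a rainfall plot, the distance between consecutive variants on the same chromosome, can be sketched in a few lines. This is an illustrative computation only, not fuc's code; the variant list and the inter_variant_distances helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical variants for one sample: (chromosome, position).
variants = [
    ("chr1", 1150),
    ("chr1", 100),
    ("chr2", 520),
    ("chr1", 150),
    ("chr2", 500),
]


def inter_variant_distances(variants):
    """Return (chromosome, position, distance to previous variant) tuples.

    The first variant on each chromosome has no predecessor and is skipped.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)
    result = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for previous, current in zip(positions, positions[1:]):
            result.append((chrom, current, current - previous))
    return result


distances = inter_variant_distances(variants)
for row in distances:
    print(row)
```

Clusters of small distances within a short genomic window are the pattern a rainfall plot is typically used to spot.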
plot_regplot(af, group_col, a, b, a_size=None, b_size=None, genes=None, count=10, to_csv=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot with a linear regression model fit visualizing correlation between gene mutation frequencies in two sample groups A and B.

Each point in the plot represents a gene.

The method will automatically calculate and print summary statistics including R-squared and p-value.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • group_col (str) – AnnFrame column containing sample group information.

  • a, b (str) – Sample group names.

  • a_size, b_size (int, optional) – Sample group sizes to use as denominator. By default, these are inferred from the MafFrame and AnnFrame objects.

  • genes (list, optional) – Genes to display. When absent, top mutated genes (count) will be used.

  • count (int, default: 10) – Number of top mutated genes to display. Ignored if genes is specified.

  • to_csv (str, optional) – Write the plot’s data to a CSV file.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.regplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_regplot(af, 'FAB_classification', 'M1', 'M2')
Results for M2 ~ M1:
R^2 = 0.43
  P = 3.96e-02
>>> plt.tight_layout()
_images/api-27.png
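
For readers who want to see what the regression summarizes, the sketch below computes per-gene mutation frequencies for two groups and fits an ordinary least-squares line of group B on group A. All counts, group sizes, and helper names are invented for illustration; the p-value is omitted, and the method's actual computation may differ.

```python
def mutation_frequency(counts, group_size):
    """Fraction of samples in a group carrying a mutation in each gene."""
    return {gene: n / group_size for gene, n in counts.items()}


def linear_fit(x, y):
    """Ordinary least-squares fit of y on x: (slope, intercept, R-squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical numbers of mutated samples per gene in groups A and B.
counts_a = {"G1": 1, "G2": 2, "G3": 3}
counts_b = {"G1": 2, "G2": 6, "G3": 8}
freq_a = mutation_frequency(counts_a, group_size=10)
freq_b = mutation_frequency(counts_b, group_size=20)

genes = sorted(freq_a)
slope, intercept, r_squared = linear_fit(
    [freq_a[g] for g in genes],
    [freq_b[g] for g in genes],
)
print(f"slope={slope:.2f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

The group sizes used as denominators play the same role as a_size and b_size above: changing them rescales the frequencies on the corresponding axis.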
plot_snvclsc(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

A grouped bar plot can be created with group_col (requires an AnnFrame).

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsp

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

MafFrame.plot_snvclss

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_snvclsc(palette=sns.color_palette('Dark2'))
>>> plt.tight_layout()
_images/api-28.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_snvclsc(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-29.png
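
The "six SNV classes" are conventionally obtained by collapsing the twelve possible substitutions onto the pyrimidine (C or T) reference base. The sketch below shows that convention with a hypothetical snv_class helper and toy data; it is assumed, not confirmed, that fuc applies the same collapsing rule.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def snv_class(ref, alt):
    """Collapse a substitution into one of six pyrimidine-centered classes."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


# Hypothetical (reference allele, alternate allele) pairs.
snvs = [("G", "A"), ("C", "T"), ("A", "G"), ("G", "T"), ("C", "A"), ("A", "C")]
class_counts = Counter(snv_class(ref, alt) for ref, alt in snvs)
print(dict(class_counts))
```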
plot_snvclsp(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsc

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

MafFrame.plot_snvclss

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_snvclsp(palette=sns.color_palette('Set2'))
>>> plt.tight_layout()
_images/api-30.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_snvclsp(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-31.png
plot_snvclss(samples=None, color=None, colormap=None, width=0.8, legend=True, flip=False, to_csv=None, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Parameters
  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • color (list, optional) – List of color tuples. See the Control plot colors tutorial for details.

  • colormap (str or matplotlib colormap object, optional) – Colormap to select colors from. See the Control plot colors tutorial for details.

  • width (float, default: 0.8) – The width of the bars.

  • legend (bool, default: True) – Place legend on axis subplots.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • to_csv (str, optional) – Write the plot’s data to a CSV file.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsc

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

MafFrame.plot_snvclsp

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> ax = mf.plot_snvclss(width=1, color=plt.get_cmap('Set2').colors)
>>> ax.legend(loc='upper right')
>>> plt.tight_layout()
_images/api-32.png
plot_summary(figsize=(15, 10), title_fontsize=16, ticklabels_fontsize=12, legend_fontsize=12)[source]

Create a summary figure for MafFrame.

Parameters
  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • title_fontsize (float, default: 16) – Font size of subplot titles.

  • ticklabels_fontsize (float, default: 12) – Font size of tick labels.

  • legend_fontsize (float, default: 12) – Font size of legend texts.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_summary()
_images/api-33.png
plot_titv(af=None, group_col=None, group_order=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the Ti/Tv proportions of samples.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pyvcf.VcfFrame.plot_titv

Similar method for the fuc.api.pyvcf.VcfFrame class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_titv()
>>> plt.tight_layout()
_images/api-34.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_titv(af=af,
...              group_col='FAB_classification',
...              group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-35.png
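
For reference, a transition (Ti) swaps two purines (A, G) or two pyrimidines (C, T), while a transversion (Tv) swaps a purine for a pyrimidine or vice versa. The sketch below computes the Ti and Tv proportions for one hypothetical sample; the helper names and data are illustrative and not part of fuc.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_proportions(snvs):
    """Return (Ti proportion, Tv proportion) for a list of (ref, alt) pairs."""
    n_ti = sum(is_transition(ref, alt) for ref, alt in snvs)
    n_tv = len(snvs) - n_ti
    return n_ti / len(snvs), n_tv / len(snvs)


# Hypothetical SNVs observed in a single sample.
sample_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ti, tv = titv_proportions(sample_snvs)
print(ti, tv)
```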
plot_tmb(samples=None, width=0.8, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the TMB distributions of samples.

Parameters
  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • width (float, default: 0.8) – The width of the bars.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_tmb(width=1)
>>> plt.tight_layout()
_images/api-36.png
plot_tmb_matched(af, patient_col, group_col, group_order=None, patients=None, legend=True, ax=None, figsize=None, **kwargs)[source]

Create a grouped bar plot showing TMB distributions for different group levels in each patient.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • patients (list, optional) – List of patient names.

  • legend (bool, default: True) – Place legend on axis subplots.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_vaf(vaf_col, count=10, af=None, group_col=None, group_order=None, flip=False, sort=True, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the VAF distributions of top mutated genes.

A grouped box plot can be created with group_col (requires an AnnFrame).

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • count (int, default: 10) – Number of top mutated genes to display.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • sort (bool, default: True) – If False, do not sort the genes by median value.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_vaf('i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-37.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_vaf('i_TumorVAF_WU',
...             af=af,
...             group_col='FAB_classification',
...             group_order=['M1', 'M2', 'M3'],
...             count=5)
>>> plt.tight_layout()
_images/api-38.png
plot_varcls(ax=None, figsize=None, **kwargs)[source]

Create a bar plot for the nonsynonymous variant classes.

Parameters
  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to matplotlib.axes.Axes.bar() and seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_varcls()
>>> plt.tight_layout()
_images/api-39.png
plot_varsum(flip=False, ax=None, figsize=None)[source]

Create a summary box plot for variant classifications.

Parameters
  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_varsum()
>>> plt.tight_layout()
_images/api-40.png
plot_vartype(palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of variation types for all samples.

Parameters
  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_vartype()
>>> plt.tight_layout()
_images/api-41.png
plot_waterfall(count=10, keep_empty=False, samples=None, ax=None, figsize=None, **kwargs)[source]

Create a waterfall plot (oncoplot).

See this tutorial to learn how to create customized oncoplots.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • keep_empty (bool, default: False) – If True, display samples that do not have any mutations.

  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_waterfall(linewidths=0.5)
>>> plt.tight_layout()
_images/api-42.png
plot_waterfall_matched(af, patient_col, group_col, group_order, count=10, ax=None, figsize=None)[source]

Create a waterfall plot using matched samples from each patient.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to include.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

property samples

List of the sample names.

Type

list

property shape

Dimensionality of MafFrame (variants, samples).

Type

tuple

subset(samples, exclude=False)[source]

Subset MafFrame for specified samples.

Parameters
  • samples (str, list, or pandas.Series) – Sample name or list of names (the order does not matter).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted MafFrame.

Return type

MafFrame

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.shape
(2207, 193)
>>> mf.subset(['TCGA-AB-2988', 'TCGA-AB-2869']).shape
(27, 2)
>>> mf.subset(['TCGA-AB-2988', 'TCGA-AB-2869'], exclude=True).shape
(2180, 191)
to_file(fn)[source]

Write MafFrame to a MAF file.

Parameters

fn (str) – MAF file path.

to_string()[source]

Render MafFrame to a console-friendly tabular output.

Returns

String representation of MafFrame.

Return type

str

to_vcf(fasta=None, ignore_indels=False, cols=None, names=None)[source]

Convert the MafFrame to a sorted VcfFrame.

Converting from MAF to VCF is straightforward for SNVs, but it can be challenging for INDELs and complex events involving multiple nucleotides (e.g. ‘AAGG’ → ‘CCCG’). This is because, in the latter case, we need to identify the “anchor” nucleotide for each event, which is crucial for constructing a properly formatted VCF. For example, a deletion event ‘AGT’ → ‘-’ in MAF would have to be converted to ‘CAGT’ → ‘C’ in the VCF, where ‘C’ is the anchor nucleotide. The position must be shifted by one as well.

To tackle this issue, the method makes use of a reference assembly (i.e. FASTA file). If SNVs are your only concern, then you do not need a FASTA file and can just set ignore_indels to True. If you are going to provide a FASTA file, please make sure to select the appropriate one (e.g. one that matches the genome assembly). For example, if your MAF is in hg19/GRCh37, use the ‘hs37d5.fa’ file, which can be freely downloaded from the 1000 Genomes Project.
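
The anchoring step for a deletion can be sketched in plain Python. This is only an illustration under stated assumptions, not the method's actual implementation: the helper name anchor_deletion is hypothetical, and the in-memory contig string stands in for a sequence that would normally be read from the FASTA file.

```python
def anchor_deletion(contig_seq, start, deleted):
    """Return (POS, REF, ALT) in VCF style for a MAF deletion.

    contig_seq -- reference sequence of the contig (toy stand-in for FASTA)
    start      -- 1-based position of the first deleted base
    deleted    -- deleted bases as reported in the MAF (e.g. 'AGT')
    """
    if start < 2:
        raise ValueError('no upstream base available to serve as the anchor')
    # Sanity check: the deleted bases must match the reference sequence.
    observed = contig_seq[start - 1:start - 1 + len(deleted)]
    if observed != deleted:
        raise ValueError(f'expected {deleted!r} at {start}, found {observed!r}')
    anchor = contig_seq[start - 2]  # base immediately before the deletion
    # The VCF position is shifted by one so that it points at the anchor.
    return start - 1, anchor + deleted, anchor


contig = 'TTCAGTGG'  # 'AGT' occupies positions 4-6; 'C' sits at position 3
print(anchor_deletion(contig, 4, 'AGT'))  # (3, 'CAGT', 'C')
```

The reference lookup is the reason a FASTA file is needed for INDELs: the anchor base is not present in the MAF record itself.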

Parameters
  • fasta (str, optional) – FASTA file. Required if ignore_indels is False.

  • ignore_indels (bool, default: False) – If True, do not include INDELs in the VcfFrame. Useful when a FASTA file is not available.

  • cols (str or list, optional) – Column(s) in the MafFrame which contain additional genotype data of interest. If provided, these data will be added to individual sample genotypes (e.g. ‘0/1:0.23’).

  • names (str or list, optional) – Name(s) to be displayed in the FORMAT field (e.g. AD, AF, DP). If not provided, the original column name(s) will be displayed.

Returns

VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pymaf
>>> mf = pymaf.MafFrame.from_file('in.maf')
>>> vf = mf.to_vcf(fasta='hs37d5.fa')
>>> vf = mf.to_vcf(ignore_indels=True)
>>> vf = mf.to_vcf(fasta='hs37d5.fa', cols='i_TumorVAF_WU', names='AF')
variants()[source]

List unique variants in MafFrame.

Returns

List of unique variants.

Return type

list

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.variants()[:5]
['1:1571791:1571791:G:A', '1:1747228:1747228:T:G', '1:2418350:2418350:C:T', '1:3328523:3328523:G:A', '1:3638739:3638739:C:T']
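
Judging from the output above, each variant appears to be encoded as five ‘:’-separated fields (chromosome, start, end, reference allele, alternate allele). The following sketch builds and parses such strings; the function names are hypothetical and the field order is inferred from the example rather than taken from the library source.

```python
def make_variant_id(chrom, start, end, ref, alt):
    """Join the five fields into a 'chrom:start:end:ref:alt' string."""
    return f'{chrom}:{start}:{end}:{ref}:{alt}'


def parse_variant_id(variant):
    """Split a variant string back into typed fields.

    Assumes the contig name itself contains no ':' character.
    """
    chrom, start, end, ref, alt = variant.split(':')
    return chrom, int(start), int(end), ref, alt


fields = parse_variant_id('1:1571791:1571791:G:A')
print(fields)
print(make_variant_id(*fields))
```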

fuc.pysnpeff

The pysnpeff submodule is designed for parsing VCF annotation data from the SnpEff program. It should be used with pyvcf.VcfFrame.

One VCF record can have several SnpEff annotations if, for example, the record is a multiallelic site or the variant is shared by multiple genes. When more than one annotation is reported, SnpEff will sort them by their importance. For more details, visit the official website.

For each annotation, SnpEff provides the following data:

  1. Allele - ALT allele.

  2. Annotation - Sequence Ontology terms concatenated using ‘&’.

  3. Annotation_Impact - HIGH, MODERATE, LOW, or MODIFIER.

  4. Gene_Name - Common gene name (HGNC).

  5. Gene_ID - Gene ID.

  6. Feature_Type - Which type of feature is in the next field.

  7. Feature_ID - Transcript ID, Motif ID, miRNA, ChipSeq peak, etc.

  8. Transcript_BioType - Coding or noncoding.

  9. Rank - Exon or Intron rank / total number of exons or introns.

  10. HGVS.c - Variant using HGVS notation (DNA level).

  11. HGVS.p - Variant using HGVS notation (Protein level).

  12. cDNA.pos / cDNA.length - Position in cDNA and transcript’s cDNA length.

  13. CDS.pos / CDS.length - Position and number of coding bases.

  14. AA.pos / AA.length - Position and number of AA.

  15. Distance - All items in this field are options.

  16. ERRORS / WARNINGS - Messages that can affect annotation accuracy.

  17. INFO - Additional information.
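
The field list above can be turned into a small parser. The sketch below is independent of the pysnpeff functions and rests on two assumptions about the raw annotation string: fields within one annotation are separated by ‘|’, and multiple annotations are separated by ‘,’. The example value, gene names, and IDs are made up for illustration.

```python
ANN_FIELDS = [
    'Allele', 'Annotation', 'Annotation_Impact', 'Gene_Name', 'Gene_ID',
    'Feature_Type', 'Feature_ID', 'Transcript_BioType', 'Rank', 'HGVS.c',
    'HGVS.p', 'cDNA.pos / cDNA.length', 'CDS.pos / CDS.length',
    'AA.pos / AA.length', 'Distance', 'ERRORS / WARNINGS', 'INFO',
]


def parse_ann(value):
    """Split a raw annotation value into one dict per annotation."""
    annotations = []
    for raw in value.split(','):
        # zip() stops at the shorter sequence, so trailing fields that are
        # absent from the string are simply left out of the dict.
        record = dict(zip(ANN_FIELDS, raw.split('|')))
        # Sequence Ontology terms are concatenated using '&'.
        record['Annotation'] = record['Annotation'].split('&')
        annotations.append(record)
    return annotations


# Hypothetical value holding two annotations for the same ALT allele.
example = (
    'A|missense_variant&splice_region_variant|MODERATE|GENE1|ID1|'
    'transcript|TX1|protein_coding|2/5|c.10G>A|p.Val4Ile|10/900|10/600|4/199||,'
    'A|upstream_gene_variant|MODIFIER|GENE2|ID2|'
    'transcript|TX2|protein_coding||c.-100G>A|||||100|'
)
annotations = parse_ann(example)
first = annotations[0]  # annotations are sorted by importance
print(first['Gene_Name'], first['Annotation_Impact'], first['Annotation'])
```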

Functions:

filter_ann(vf, targets[, include])

Filter out rows based on the SnpEff annotations.

parseann(vf, idx[, sep])

Parse SnpEff annotations.

row_firstann(r)

Return the first SnpEff annotation for the row.

fuc.api.pysnpeff.filter_ann(vf, targets, include=True)[source]

Filter out rows based on the SnpEff annotations.

Parameters
  • vf (fuc.api.pyvcf.VcfFrame) – Input VcfFrame.

  • targets (list) – List of annotations (e.g. [‘missense_variant’, ‘stop_gained’]).

  • include (bool, default: True) – If True, include only such rows instead of excluding them.

Returns

vf – Filtered VcfFrame.

Return type

VcfFrame

fuc.api.pysnpeff.parseann(vf, idx, sep=' | ')[source]

Parse SnpEff annotations.

Parameters
  • vf (fuc.api.pyvcf.VcfFrame) – Input VcfFrame.

  • idx (list) – List of annotation indices.

  • sep (str, default: ‘ | ’) – Separator for joining requested annotations.

Returns

s – Parsed annotations.

Return type

pandas.Series

fuc.api.pysnpeff.row_firstann(r)[source]

Return the first SnpEff annotation for the row.

fuc.pyvcf

The pyvcf submodule is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. The pyvcf.VcfFrame class also contains many useful plotting methods such as VcfFrame.plot_comparison and VcfFrame.plot_tmb. The submodule strictly adheres to the standard VCF specification.

A typical VCF file contains metadata lines (prefixed with ‘##’), a header line (prefixed with ‘#’), and genotype lines that begin with a chromosome identifier (e.g. ‘chr1’). See the VCF specification above for an example VCF file.

Genotype lines usually consist of nine columns for storing variant information (all fixed and mandatory except for the FORMAT column) plus additional sample-specific columns for expressing individual genotype calls (e.g. ‘0/1’). Missing values are allowed in some cases and can be specified with a dot (‘.’). The first nine columns are:

No.  Column  Description                         Examples
1    CHROM   Chromosome or contig identifier     ‘chr2’, ‘2’, ‘chrM’
2    POS     1-based reference position          10041, 23042
3    ID      ‘;’-separated variant identifiers   ‘.’, ‘rs35’, ‘rs9;rs53’
4    REF     Reference allele                    ‘A’, ‘GT’
5    ALT     ‘,’-separated alternate alleles     ‘T’, ‘ACT’, ‘C,T’
6    QUAL    Phred-scaled quality score for ALT  ‘.’, 67, 12
7    FILTER  ‘;’-separated filters that failed   ‘.’, ‘PASS’, ‘q10;s50’
8    INFO    ‘;’-separated information fields    ‘.’, ‘DP=14;AF=0.5;DB’
9    FORMAT  ‘:’-separated genotype fields       ‘GT’, ‘GT:AD:DP’

You will sometimes come across VCF files that have only eight columns, and do not contain the FORMAT column or sample-specific information. These are called “sites-only” VCF files, and normally represent genetic variation that has been observed in a large population. Generally, information about the population of origin should be included in the header.
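
To make the column layout concrete, here is a minimal sketch that splits one tab-delimited genotype line using the header line, pairs each sample's values with the FORMAT keys, and detects sites-only input. It does not use pyvcf; the function name parse_record and the example lines are assumptions made for illustration.

```python
def parse_record(header_line, record_line):
    """Parse one VCF genotype line with the help of the header line."""
    names = header_line.lstrip('#').rstrip('\n').split('\t')
    values = record_line.rstrip('\n').split('\t')
    if len(names) != len(values):
        raise ValueError('header and record have different column counts')
    record = dict(zip(names, values))
    record['POS'] = int(record['POS'])        # 1-based reference position
    record['ALT'] = record['ALT'].split(',')  # ','-separated alternate alleles
    # A sites-only file stops after INFO, so it has no FORMAT column.
    sites_only = 'FORMAT' not in record
    genotypes = {}
    if not sites_only:
        keys = record['FORMAT'].split(':')    # ':'-separated genotype fields
        for sample in names[9:]:              # sample columns follow FORMAT
            genotypes[sample] = dict(zip(keys, record[sample].split(':')))
    return record, genotypes, sites_only


header = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSteven'
line = 'chr1\t100\t.\tG\tA,T\t.\tPASS\tDP=14\tGT:DP\t0/1:14'
record, genotypes, sites_only = parse_record(header, line)
print(record['ALT'], genotypes, sites_only)
```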

There are several reserved keywords in the INFO and FORMAT columns that are standard across the community. Popular keywords are listed below:

Column  Key  Number  Type     Description
INFO    AC   A       Integer  Allele count in genotypes, for each ALT allele, in the same order as listed
INFO    AN   1       Integer  Total number of alleles in called genotypes
INFO    AF   A       Float    Allele frequency for each ALT allele in the same order as listed (estimated from primary data, not called genotypes)
FORMAT  AD   R       Integer  Total read depth for each allele
FORMAT  AF   1       Float    Allele fraction of the event in the tumor
FORMAT  DP   1       Integer  Read depth
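
As a rough illustration of how an INFO value such as ‘DP=14;AF=0.5;DB’ breaks down into keys, here is a small standalone parser. The function name parse_info is hypothetical; values are kept as lists of strings because a key may carry one value per ALT allele, and keys without ‘=’ are treated as flags.

```python
def parse_info(info):
    """Parse a ';'-separated INFO string into a dict."""
    if info == '.':  # a dot marks a missing INFO value
        return {}
    result = {}
    for field in info.split(';'):
        key, sep, value = field.partition('=')
        # 'KEY=v1,v2' becomes a list of strings; a bare 'KEY' is a flag.
        result[key] = value.split(',') if sep else True
    return result


print(parse_info('DP=14;AF=0.5;DB'))
print(parse_info('AC=1,1;AN=6'))
```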

If sample annotation data are available for a given VCF file, use the common.AnnFrame class to import the data.

Classes:

VcfFrame(meta, df)

Class for storing VCF data.

Functions:

call(fasta, bams[, regions, path, min_mq, ...])

Call SNVs and indels from BAM files.

gt_hasvar(g)

For given genotype, return True if it has variation.

gt_het(g)

For given genotype, return True if it is heterozygous.

gt_miss(g)

For given genotype, return True if it has missing value.

gt_ploidy(g)

For given genotype, return its ploidy number.

gt_polyp(g)

For given genotype, return True if it is polyploid.

gt_pseudophase(g)

For given genotype, return its pseudophased form.

gt_unphase(g)

For given genotype, return its unphased form.

has_chr_prefix(file[, size])

Return True if all of the sampled contigs from a VCF file have the (annoying) 'chr' string.

merge(vfs[, how, format, sort, collapse])

Merge VcfFrame objects.

plot_af_correlation(vf1, vf2[, ax, figsize])

Create a scatter plot showing the correlation of allele frequency between two VCF files.

rescue_filtered_variants(vfs[, format])

Rescue filtered variants if they are PASS in at least one of the input VCF files.

row_computeinfo(r, key[, decimals])

For given row, return AC/AN/AF calculation for INFO column.

row_hasindel(r)

For given row, return True if it has indel.

row_missval(r)

For given row, return formatted missing genotype.

row_parseinfo(r, key)

For given row, return requested data from INFO column.

row_phased(r)

For given row, return True if all genotypes are phased.

row_updateinfo(r, key, value[, force, missing])

For given row, return updated data from INFO column.

slice(file, regions[, path])

Slice a VCF file for specified regions.

split(vcf[, clean])

Split VcfFrame by individual.

class fuc.api.pyvcf.VcfFrame(meta, df)[source]

Class for storing VCF data.

Parameters
  • meta (list) – List of metadata lines.

  • df (pandas.DataFrame) – DataFrame containing VCF data.

See also

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

Constructing VcfFrame from pandas DataFrame:

>>> from fuc import pyvcf
>>> import pandas as pd
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.',],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '0/1'],
... }
>>> df = pd.DataFrame(data)
>>> vf = pyvcf.VcfFrame(['##fileformat=VCFv4.3'], df)
>>> vf.meta
['##fileformat=VCFv4.3']
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1
2  chr1  102  .   A   T    .      .    .     GT    0/1

Methods:

add_af([decimals])

Compute AF from AD and then add it to the FORMAT field.

add_dp()

Compute DP using AD and add it to the FORMAT field.

add_flag(flag[, order, index])

Add the given flag to the INFO field.

calculate_concordance(a, b[, c, mode])

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

collapse()

Collapse duplicate records in the VcfFrame.

combine(a, b)

Combine genotype data from two samples (A, B).

compare(other)

Compare to another VcfFrame and show the differences in genotype calling.

compute_info(key)

Compute AC/AN/AF for INFO column.

copy()

Return a copy of the VcfFrame.

copy_df()

Return a copy of the dataframe.

copy_meta()

Return a copy of the metadata.

drop_duplicates([subset, keep])

Return VcfFrame with duplicate rows removed.

empty_samples([threshold, opposite, as_list])

Remove samples with high missingness.

expand()

Expand each multiallelic locus to multiple rows.

extract_format(k[, func, as_nan])

Extract data for the specified FORMAT key.

extract_info(k[, func, as_nan])

Extract data for the specified INFO key.

fetch(variant)

Fetch the VCF row that matches specified variant.

filter_bed(bed[, opposite, as_index])

Filter rows intersecting with given BED.

filter_empty([threshold, opposite, as_index])

Filter rows with high missingness.

filter_flagall(flags[, opposite, as_index])

Filter rows with all given INFO flags.

filter_flagany(flags[, opposite, as_index])

Filter rows with any given INFO flags.

filter_indel([opposite, as_index])

Filter rows with indel.

filter_multialt([opposite, as_index])

Filter rows with multiple ALT alleles.

filter_pass([opposite, as_index])

Filter rows with PASS in FILTER column.

filter_phased([opposite, as_index])

Filter rows with phased genotypes.

filter_polyp([opposite, as_index])

Filter rows with polyploid genotypes.

filter_qual(threshold[, opposite, as_index])

Filter rows with low QUAL values.

filter_sampall([samples, opposite, as_index])

Filter rows where all given samples have variant.

filter_sampany([samples, opposite, as_index])

Filter rows where any given samples have variant.

filter_sampnum(threshold[, opposite, as_index])

Filter rows with high variant prevalence.

filter_vcf(vcf[, opposite, as_index])

Filter rows intersecting with given VCF.

from_dict(meta, data)

Construct VcfFrame from a dict of array-like or dicts.

from_file(fn[, compression, meta_only, regions])

Construct VcfFrame from a VCF file.

from_string(s[, meta_only])

Construct VcfFrame from a string.

get_af(sample, variant)

Get allele fraction for a pair of sample and variant.

markmiss(expr[, greedy, opposite, samples, ...])

Mark all genotypes that satisfy the query expression as missing.

merge(other[, how, format, sort, collapse])

Merge with the other VcfFrame.

meta_keys()

Print metadata lines with a key.

miss2ref()

Convert missing genotype (./.) to homozygous REF (0/0).

plot_comparison(a, b[, c, labels, ax, figsize])

Create a Venn diagram showing genotype concordance between groups.

plot_hist_format(k[, af, group_col, ...])

Create a histogram showing the distribution of data for the specified FORMAT key.

plot_hist_info(k[, kde, ax, figsize])

Create a histogram showing the distribution of data for the specified INFO key.

plot_rainfall(sample[, palette, ax, ...])

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

plot_region(sample[, k, color, region, ...])

Create a scatter plot showing read depth profile of a sample for the specified region.

plot_regplot(a, b[, ax, figsize])

Create a scatter plot with a linear regression model fit visualizing correlation between TMB in two sample groups.

plot_snvclsc([af, group_col, group_order, ...])

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

plot_snvclsp([af, group_col, group_order, ...])

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

plot_snvclss([color, colormap, width, ...])

Create a bar plot showing the proportions of the six SNV classes for individual samples.

plot_titv([af, group_col, group_order, ...])

Create a box plot showing the Ti/Tv proportions of samples.

plot_tmb([af, group_col, group_order, kde, ...])

Create a histogram showing TMB distribution.

pseudophase()

Pseudophase VcfFrame.

rename(names[, indicies])

Rename the samples.

slice(region)

Slice VcfFrame for specified region.

sort()

Sort the VcfFrame by chromosome and position.

strip([format, metadata])

Remove any unnecessary data.

subset(samples[, exclude])

Subset VcfFrame for specified samples.

subtract(a, b)

Subtract genotype data between two samples (A, B).

to_bed()

Convert VcfFrame to BedFrame.

to_file(fn[, compression])

Write VcfFrame to a VCF file.

to_string()

Render the VcfFrame to a console-friendly tabular output.

to_variants()

List unique variants in VcfFrame.

unphase()

Unphase all the sample genotypes.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the CHROM column.

Attributes:

contigs

List of contig names.

df

DataFrame containing VCF data.

empty

Indicator whether VcfFrame is empty.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

meta

List of metadata lines.

phased

Return True if every genotype in VcfFrame is haplotype phased.

samples

List of sample names.

shape

Dimensionality of VcfFrame (variants, samples).

sites_only

Whether the VCF is sites-only.

add_af(decimals=3)[source]

Compute AF from AD and then add it to the FORMAT field.

This method will compute allele fraction for each ALT allele in the same order as listed.

Parameters

decimals (int, default: 3) – Number of decimals to display.

Returns

Updated VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'G', 'A', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT', 'GT:AD'],
...     'A': ['0/1:12,15', '0/0:32,1', '0/1', './.:.'],
...     'B': ['0/1:13,17', '0/1:14,15', './.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT          A            B
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15    0/1:13,17
1  chr1  101  .   G    T    .      .    .  GT:AD   0/0:32,1    0/1:14,15
2  chr1  102  .   A    G    .      .    .     GT        0/1          ./.
3  chr1  103  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17
>>> vf.add_af().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT                      A                              B
0  chr1  100  .   A    C    .      .    .  GT:AD:AF  0/1:12,15:0.444,0.556          0/1:13,17:0.433,0.567
1  chr1  101  .   G    T    .      .    .  GT:AD:AF   0/0:32,1:0.970,0.030          0/1:14,15:0.483,0.517
2  chr1  102  .   A    G    .      .    .     GT:AF                  0/1:.                          ./.:.
3  chr1  103  .   C  G,A    .      .    .  GT:AD:AF                ./.:.:.  1/2:0,11,17:0.000,0.393,0.607
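
The arithmetic behind the AF strings in the output above can be sketched without the library. The helper below is hypothetical (it is not the method's implementation) and assumes each AD value is either ‘.’ or a comma-separated list of integers; it yields one fraction per AD entry.

```python
def ad_to_af(ad, decimals=3):
    """Convert an AD string such as '12,15' into an AF string."""
    if ad == '.':
        return '.'  # missing AD gives missing AF
    depths = [int(x) for x in ad.split(',')]
    total = sum(depths)
    if total == 0:
        return '.'  # assumption: no reads means AF cannot be computed
    return ','.join(f'{d / total:.{decimals}f}' for d in depths)


for ad in ['12,15', '32,1', '0,11,17', '.']:
    print(ad, '->', ad_to_af(ad))
```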
add_dp()[source]

Compute DP using AD and add it to the FORMAT field.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT:AD', 'GT:AD'],
...     'Steven': ['0/1:12,15', '0/0:32,1', '0/1:16,12', './.:.'],
...     'Sara': ['0/1:13,17', '0/1:14,15', './.:.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT     Steven         Sara
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15    0/1:13,17
1  chr1  100  .   A    T    .      .    .  GT:AD   0/0:32,1    0/1:14,15
2  chr2  200  .   C    G    .      .    .  GT:AD  0/1:16,12        ./.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17

We can add the DP subfield to our genotype data:

>>> vf.add_dp().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT        Steven            Sara
0  chr1  100  .   A    C    .      .    .  GT:AD:DP  0/1:12,15:27    0/1:13,17:30
1  chr1  100  .   A    T    .      .    .  GT:AD:DP   0/0:32,1:33    0/1:14,15:29
2  chr2  200  .   C    G    .      .    .  GT:AD:DP  0/1:16,12:28         ./.:.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD:DP       ./.:.:.  1/2:0,11,17:28
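
For a single genotype, the update shown above amounts to summing the AD values and appending the result as a new subfield. The sketch below uses a hypothetical helper, append_dp, and assumes that a missing AD (‘.’) should produce a missing DP, as in the example output.

```python
def append_dp(fmt, genotype):
    """Return (FORMAT, genotype) with a DP subfield appended."""
    fields = dict(zip(fmt.split(':'), genotype.split(':')))
    ad = fields.get('AD', '.')
    if ad == '.':
        dp = '.'  # no AD available, so DP stays missing
    else:
        dp = str(sum(int(x) for x in ad.split(',')))
    return fmt + ':DP', genotype + ':' + dp


print(append_dp('GT:AD', '0/1:12,15'))
print(append_dp('GT:AD', './.:.'))
```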
add_flag(flag, order='last', index=None)[source]

Add the given flag to the INFO field.

The default behavior is to add the flag to all rows in the VcfFrame.

Parameters
  • flag (str) – INFO flag.

  • order ({‘last’, ‘first’, False}, default: ‘last’) – Determines the order in which the flag will be added.

    • last : Add to the end of the list.

    • first : Add to the beginning of the list.

    • False : Overwrite the existing field.

  • index (list or pandas.Series, optional) – Boolean index array indicating which rows should be updated.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', 'DB', 'DB', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/0
1  chr1  101  .   T   C    .      .   DB     GT    0/1
2  chr1  102  .   A   T    .      .   DB     GT    0/1
3  chr1  103  .   C   A    .      .    .     GT    1/1

We can add the SOMATIC flag to the INFO field:

>>> vf.add_flag('SOMATIC').df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  DB;SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .  DB;SOMATIC     GT    0/1
3  chr1  103  .   C   A    .      .     SOMATIC     GT    1/1

Setting order='first' will add the flag at the beginning:

>>> vf.add_flag('SOMATIC', order='first').df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  SOMATIC;DB     GT    0/1
2  chr1  102  .   A   T    .      .  SOMATIC;DB     GT    0/1
3  chr1  103  .   C   A    .      .     SOMATIC     GT    1/1

Setting order=False will overwrite the INFO field:

>>> vf.add_flag('SOMATIC', order=False).df
  CHROM  POS ID REF ALT QUAL FILTER     INFO FORMAT Steven
0  chr1  100  .   G   A    .      .  SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .  SOMATIC     GT    0/1
3  chr1  103  .   C   A    .      .  SOMATIC     GT    1/1

We can also specify which rows should be updated:

>>> vf.add_flag('SOMATIC', index=[True, True, False, False]).df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  DB;SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .          DB     GT    0/1
3  chr1  103  .   C   A    .      .           .     GT    1/1
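
The behavior of the order and index options can be mimicked on plain INFO strings. This is a sketch that mirrors the outputs above, with a hypothetical helper name; it is not the library's implementation.

```python
def add_info_flag(info, flag, order='last'):
    """Return the INFO string with the flag added."""
    if order is False or info == '.':
        return flag  # overwrite, or replace a missing value
    fields = info.split(';')
    if order == 'last':
        fields.append(flag)
    elif order == 'first':
        fields.insert(0, flag)
    else:
        raise ValueError(f'unsupported order: {order!r}')
    return ';'.join(fields)


infos = ['.', 'DB', 'DB', '.']
index = [True, True, False, False]  # boolean mask selecting rows to update
updated = [
    add_info_flag(info, 'SOMATIC') if selected else info
    for info, selected in zip(infos, index)
]
print(updated)  # ['SOMATIC', 'DB;SOMATIC', 'DB', '.']
```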
calculate_concordance(a, b, c=None, mode='all')[source]

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

This method will return (Ab, aB, AB, ab) for comparison between two samples and (Abc, aBc, ABc, abC, AbC, aBC, ABC, abc) for three samples. Note that the former is equivalent to (FP, FN, TP, TN) if we assume A is the test sample and B is the truth sample.

Only biallelic sites will be used for calculation. Additionally, the method will ignore zygosity and only consider presence or absence of variant calls (e.g. 0/1 and 1/1 will be treated the same).

Parameters
  • a, b (str or int) – Name or index of Samples A and B.

  • c (str or int, optional) – Name or index of Sample C.

  • mode ({‘all’, ‘snv’, ‘indel’}, default: ‘all’) – Determines which variant types should be analyzed:

    • ‘all’: Include both SNVs and INDELs.

    • ‘snv’: Include SNVs only.

    • ‘indel’: Include INDELs only.

Returns

Four- or eight-element tuple depending on the number of samples.

Return type

tuple

See also

fuc.api.common.sumstat

Return various summary statistics from (FP, FN, TP, TN).

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '0/1', '0/0'],
...     'C': ['0/1', '0/1', '1/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G    A    .      .    .     GT  0/1  1/1  0/1
1  chr1  101  .  CT    C    .      .    .     GT  0/0  0/1  0/1
2  chr1  102  .   T    A    .      .    .     GT  0/0  ./.  1/1
3  chr1  103  .   C    T    .      .    .     GT  0/1  0/1  ./.
4  chr1  104  .   A  G,C    .      .    .     GT  0/0  0/0  1/2

We can first compare the samples A and B:

>>> vf.calculate_concordance('A', 'B', mode='all')
(0, 1, 2, 1)
>>> vf.calculate_concordance('A', 'B', mode='snv')
(0, 0, 2, 1)
>>> vf.calculate_concordance('A', 'B', mode='indel')
(0, 1, 0, 0)

We can also compare all three samples at once:

>>> vf.calculate_concordance('A', 'B', 'C')
(0, 0, 1, 1, 0, 1, 1, 0)
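
The two-sample counts can be reproduced by hand, which also shows how (FP, FN, TP, TN) translate into summary statistics. The sketch below uses only the biallelic rows of the example and treats a missing genotype as having no variant, an assumption inferred from the output above. All function names are hypothetical.

```python
import math


def has_variant(gt):
    """Return True if the genotype carries at least one ALT allele."""
    alleles = gt.replace('|', '/').split('/')
    return any(a not in ('0', '.') for a in alleles)


def concordance(calls_a, calls_b):
    """Return (Ab, aB, AB, ab) for two lists of genotypes."""
    counts = {'Ab': 0, 'aB': 0, 'AB': 0, 'ab': 0}
    for gt_a, gt_b in zip(calls_a, calls_b):
        a, b = has_variant(gt_a), has_variant(gt_b)
        if a and b:
            counts['AB'] += 1
        elif a:
            counts['Ab'] += 1
        elif b:
            counts['aB'] += 1
        else:
            counts['ab'] += 1
    return counts['Ab'], counts['aB'], counts['AB'], counts['ab']


def ratio(numerator, denominator):
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


sample_a = ['0/1', '0/0', '0/0', '0/1']  # biallelic rows only
sample_b = ['1/1', '0/1', './.', '0/1']
fp, fn, tp, tn = concordance(sample_a, sample_b)  # A is test, B is truth
sensitivity = ratio(tp, tp + fn)
precision = ratio(tp, tp + fp)
print((fp, fn, tp, tn), round(sensitivity, 3), precision)
```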
collapse()[source]

Collapse duplicate records in the VcfFrame.

Duplicate records have identical values for CHROM, POS, and REF. They can result from merging two VCF files.

Note

The method will sort the order of ALT alleles.

Returns

Collapsed VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT:AD', 'GT:AD'],
...     'Steven': ['0/1:12,15', './.:.', '0/1:16,12', './.:.'],
...     'Sara': ['./.:.', '0/1:14,15', './.:.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT     Steven         Sara
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15        ./.:.
1  chr1  100  .   A    T    .      .    .  GT:AD      ./.:.    0/1:14,15
2  chr2  200  .   C    G    .      .    .  GT:AD  0/1:16,12        ./.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17

We collapse the VcfFrame:

>>> vf.collapse().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT       Steven         Sara
0  chr1  100  .   A  C,T    .      .    .  GT:AD  0/1:12,15,0  0/2:14,0,15
2  chr2  200  .   C  A,G    .      .    .  GT:AD  0/2:16,0,12  1/2:0,17,11
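
One step of collapsing is re-indexing each genotype against the merged, sorted ALT list. The following sketch covers only that step for ‘GT:AD’ genotypes and leaves missing calls untouched; choosing between the duplicate records is not shown. The helper name and its behavior for unphased calls (sorting allele indices) are assumptions chosen to match the example output.

```python
def remap_genotype(genotype, old_alts, new_alts):
    """Re-express a 'GT:AD' genotype against a new ALT allele list."""
    gt, ad = genotype.split(':')
    if '.' in gt or ad == '.':
        return genotype  # leave missing calls untouched
    # Allele index 0 is REF; ALT alleles are numbered from 1.
    index_map = {0: 0}
    for old_index, alt in enumerate(old_alts, start=1):
        index_map[old_index] = new_alts.index(alt) + 1
    sep = '|' if '|' in gt else '/'
    alleles = [index_map[int(a)] for a in gt.split(sep)]
    if sep == '/':
        alleles.sort()  # order carries no meaning for unphased calls
    new_gt = sep.join(str(a) for a in alleles)
    # Move each depth to its new slot; alleles without reads get 0.
    depths = ['0'] * (len(new_alts) + 1)
    for old_index, depth in enumerate(ad.split(',')):
        depths[index_map[old_index]] = depth
    return new_gt + ':' + ','.join(depths)


merged = sorted({'G'} | {'G', 'A'})  # ['A', 'G']
print(remap_genotype('0/1:16,12', ['G'], merged))         # 0/2:16,0,12
print(remap_genotype('1/2:0,11,17', ['G', 'A'], merged))  # 1/2:0,17,11
```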
combine(a, b)[source]

Combine genotype data from two samples (A, B).

This method can be especially useful when you want to consolidate genotype data from replicate samples. See examples below for more details.

Parameters

a, b (str or int) – Name or index of Samples A and B.

Returns

Resulting VCF column.

Return type

pandas.Series

See also

VcfFrame.subtract

Subtract genotype data between two samples (A, B).

Examples

Assume we have the following data where a cancer patient’s tissue sample has been sequenced twice:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'A', 'C', 'G'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Tissue1': ['./.:.', '0/0:7', '0/1:28', '0/1:4', '0/1:32'],
...     'Tissue2': ['0/1:24', '0/1:42', './.:.', './.:.', '0/1:19'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Tissue1 Tissue2
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP   0/0:7  0/1:42
2  chr1  102  .   T   A    .      .    .  GT:DP  0/1:28   ./.:.
3  chr1  103  .   A   C    .      .    .  GT:DP   0/1:4   ./.:.
4  chr1  104  .   C   G    .      .    .  GT:DP  0/1:32  0/1:19

We can combine genotype data from ‘Tissue1’ and ‘Tissue2’ to get a more comprehensive variant profile:

>>> vf.df['Combined'] = vf.combine('Tissue1', 'Tissue2')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Tissue1 Tissue2 Combined
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24   0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP   0/0:7  0/1:42   0/1:42
2  chr1  102  .   T   A    .      .    .  GT:DP  0/1:28   ./.:.   0/1:28
3  chr1  103  .   A   C    .      .    .  GT:DP   0/1:4   ./.:.    0/1:4
4  chr1  104  .   C   G    .      .    .  GT:DP  0/1:32  0/1:19   0/1:32
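
The selection rule suggested by the output above can be written down as a small function. This is a guess inferred from the example only and the library may behave differently: the genotype from Sample A is kept when it has a variant, otherwise Sample B is used if it has one. The handling of the case where neither sample has a variant is purely an assumption.

```python
def has_variant(genotype):
    """Return True if the GT subfield carries at least one ALT allele."""
    gt = genotype.split(':')[0]
    return any(a not in ('0', '.') for a in gt.replace('|', '/').split('/'))


def combine_pair(a, b):
    """Pick one genotype from a pair of replicate calls."""
    if has_variant(a):
        return a
    if has_variant(b):
        return b
    # Assumption: with no variant in either call, prefer a non-missing one.
    return b if a.startswith('.') else a


tissue1 = ['./.:.', '0/0:7', '0/1:28', '0/1:4', '0/1:32']
tissue2 = ['0/1:24', '0/1:42', './.:.', './.:.', '0/1:19']
combined = [combine_pair(a, b) for a, b in zip(tissue1, tissue2)]
print(combined)
```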
compare(other)[source]

Compare to another VcfFrame and show the differences in genotype calling.

Parameters

other (VcfFrame) – VcfFrame to compare with.

Returns

DataFrame containing genotype differences.

Return type

pandas.DataFrame

Examples

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '0/1', '0/0'],
...     'C': ['0/1', '0/1', '1/1', './.', '1/2'],
... }
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['./.', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '1/1', '0/0'],
...     'C': ['0/1', '0/1', '0/1', './.', '1/2'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.compare(vf2)
          Locus Sample Self Other
0  chr1-100-G-A      A  0/1   ./.
1  chr1-102-T-A      C  1/1   0/1
2  chr1-103-C-T      B  0/1   1/1
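
Conceptually, the comparison walks over shared loci and samples and reports every pair of genotypes that differ. The sketch below works on nested dictionaries rather than VcfFrame objects; the function name and the decision to skip loci or samples missing from the other table are assumptions made for illustration.

```python
def compare_calls(self_calls, other_calls):
    """Return (locus, sample, self_gt, other_gt) for differing genotypes."""
    differences = []
    for locus, genotypes in self_calls.items():
        if locus not in other_calls:
            continue  # assumption: loci absent from the other table are skipped
        for sample, gt in genotypes.items():
            other_gt = other_calls[locus].get(sample)
            if other_gt is not None and gt != other_gt:
                differences.append((locus, sample, gt, other_gt))
    return differences


calls1 = {
    'chr1-100-G-A': {'A': '0/1', 'B': '1/1', 'C': '0/1'},
    'chr1-101-CT-C': {'A': '0/0', 'B': '0/1', 'C': '0/1'},
    'chr1-102-T-A': {'A': '0/0', 'B': './.', 'C': '1/1'},
    'chr1-103-C-T': {'A': '0/1', 'B': '0/1', 'C': './.'},
}
calls2 = {
    'chr1-100-G-A': {'A': './.', 'B': '1/1', 'C': '0/1'},
    'chr1-101-CT-C': {'A': '0/0', 'B': '0/1', 'C': '0/1'},
    'chr1-102-T-A': {'A': '0/0', 'B': './.', 'C': '0/1'},
    'chr1-103-C-T': {'A': '0/1', 'B': '1/1', 'C': './.'},
}
differences = compare_calls(calls1, calls2)
for row in differences:
    print(row)
```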
compute_info(key)[source]

Compute AC/AN/AF for INFO column.

The method will ignore and overwrite any existing data for selected key.

Parameters

key ({‘AC’, ‘AN’, ‘AF’}) – INFO key.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chrX'],
...     'POS': [100, 101, 102, 100],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T,G', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['AC=100', 'MQ=59', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT', 'GT', 'GT'],
...     'A': ['1|0:34', '0|0', '1|0', '0'],
...     'B': ['1/1:23', '0/1', '0/0', '0/0'],
...     'C': ['0/0:28', './.', '1/2', '1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER    INFO FORMAT       A       B       C
0  chr1  100  .   G    A    .      .  AC=100  GT:DP  1|0:34  1/1:23  0/0:28
1  chr1  101  .   T    C    .      .   MQ=59     GT     0|0     0/1     ./.
2  chr1  102  .   A  T,G    .      .       .     GT     1|0     0/0     1/2
3  chrX  100  .   C    A    .      .       .     GT       0     0/0       1
>>> vf = vf.compute_info('AC')
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER        INFO FORMAT       A       B       C
0  chr1  100  .   G    A    .      .        AC=1  GT:DP  1|0:34  1/1:23  0/0:28
1  chr1  101  .   T    C    .      .  MQ=59;AC=1     GT     0|0     0/1     ./.
2  chr1  102  .   A  T,G    .      .      AC=1,1     GT     1|0     0/0     1/2
3  chrX  100  .   C    A    .      .        AC=1     GT       0     0/0       1
>>> vf = vf.compute_info('AN')
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER             INFO FORMAT       A       B       C
0  chr1  100  .   G    A    .      .        AC=1;AN=6  GT:DP  1|0:34  1/1:23  0/0:28
1  chr1  101  .   T    C    .      .  MQ=59;AC=1;AN=4     GT     0|0     0/1     ./.
2  chr1  102  .   A  T,G    .      .      AC=1,1;AN=6     GT     1|0     0/0     1/2
3  chrX  100  .   C    A    .      .        AC=1;AN=4     GT       0     0/0       1
>>> vf = vf.compute_info('AF')
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER                        INFO FORMAT       A       B       C
0  chr1  100  .   G    A    .      .          AC=1;AN=6;AF=0.167  GT:DP  1|0:34  1/1:23  0/0:28
1  chr1  101  .   T    C    .      .    MQ=59;AC=1;AN=4;AF=0.250     GT     0|0     0/1     ./.
2  chr1  102  .   A  T,G    .      .  AC=1,1;AN=6;AF=0.167,0.167     GT     1|0     0/0     1/2
3  chrX  100  .   C    A    .      .          AC=1;AN=4;AF=0.250     GT       0     0/0       1
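
As a rough guide to how the AN and AF values in the output above relate to the genotype columns, the sketch below counts called alleles and divides a given AC by that count. The helper names are hypothetical, and the three-decimal formatting is an assumption read off the example output rather than a statement about how compute_info is implemented.

```python
import re

def allele_number(sample_fields):
    """Count called (non-missing) alleles across the sample fields of one row."""
    total = 0
    for field in sample_fields:
        gt = field.split(':')[0]           # GT is the first FORMAT subfield
        alleles = re.split(r'[/|]', gt)    # handles phased and unphased calls
        total += sum(1 for a in alleles if a != '.')
    return total

def allele_frequency(ac, an):
    """Return AC / AN for each ALT allele as a comma-separated string."""
    return ','.join(f'{int(count) / an:.3f}' for count in ac.split(','))

# Sample fields copied from the first and last rows of the example.
an_first = allele_number(['1|0:34', '1/1:23', '0/0:28'])
an_last = allele_number(['0', '0/0', '1'])
print(an_first, an_last)
print(allele_frequency('1,1', 6))
```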
property contigs

List of contig names.

Type

list

copy()

Return a copy of the VcfFrame.

copy_df()

Return a copy of the dataframe.

copy_meta()

Return a copy of the metadata.

property df

DataFrame containing VCF data.

Type

pandas.DataFrame

drop_duplicates(subset=None, keep='first')

Return VcfFrame with duplicate rows removed.

This method essentially wraps the pandas.DataFrame.drop_duplicates() method.

Considering certain columns is optional.

Parameters
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({‘first’, ‘last’, False}, default ‘first’) – Determines which duplicates (if any) to keep.

    • first : Drop duplicates except for the first occurrence.

    • last : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

Returns

VcfFrame with duplicates removed.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', '0/1', './.'],
...     'B': ['./.', '0/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A    C    .      .    .     GT  0/1  ./.
1  chr1  100  .   A    T    .      .    .     GT  ./.  0/1
2  chr2  200  .   C    G    .      .    .     GT  0/1  ./.
3  chr2  200  .   C  G,A    .      .    .     GT  ./.  1/2
>>> vf.drop_duplicates(['CHROM', 'POS', 'REF']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A   C    .      .    .     GT  0/1  ./.
1  chr2  200  .   C   G    .      .    .     GT  0/1  ./.
>>> vf.drop_duplicates(['CHROM', 'POS', 'REF'], keep='last').df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A    T    .      .    .     GT  ./.  0/1
1  chr2  200  .   C  G,A    .      .    .     GT  ./.  1/2
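
The keep options can be illustrated without pandas. The following is a minimal sketch of the three behaviours on a list of dictionaries; the function is written for illustration only and is not how fuc or pandas implements the operation.

```python
from collections import Counter

def drop_duplicates(rows, subset, keep='first'):
    """Drop rows whose values in the subset columns repeat.

    rows   -- list of dicts, one per record
    subset -- column names used to identify duplicates
    keep   -- 'first', 'last', or False (drop every duplicated row)
    """
    keys = [tuple(row[col] for col in subset) for row in rows]
    if keep is False:
        counts = Counter(keys)
        return [row for row, key in zip(rows, keys) if counts[key] == 1]
    if keep == 'first':
        indices = range(len(rows))
    else:
        indices = range(len(rows) - 1, -1, -1)
    seen, kept = set(), []
    for i in indices:
        if keys[i] not in seen:
            seen.add(keys[i])
            kept.append(i)
    return [rows[i] for i in sorted(kept)]  # preserve the original row order

rows = [
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'C'},
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'T'},
    {'CHROM': 'chr2', 'POS': 200, 'REF': 'C', 'ALT': 'G'},
    {'CHROM': 'chr2', 'POS': 200, 'REF': 'C', 'ALT': 'G,A'},
]
subset = ['CHROM', 'POS', 'REF']
first = [r['ALT'] for r in drop_duplicates(rows, subset)]
last = [r['ALT'] for r in drop_duplicates(rows, subset, keep='last')]
print(first, last)
```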
property empty

Indicator whether VcfFrame is empty.

Returns

True if the VcfFrame is empty, False otherwise.

Return type

bool

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr2  101  .   T   C    .      .    .     GT  1/1
>>> vf.df = vf.df[0:0]
>>> vf.df
Empty DataFrame
Columns: [CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, A]
Index: []
>>> vf.empty
True
empty_samples(threshold=0, opposite=False, as_list=False)

Remove samples with high missingness.

Samples with missingness >= threshold will be removed.

Parameters
  • threshold (int or float, default: 0) – Number or fraction of missing variants. By default (threshold=0), only samples with 100% missingness will be removed.

  • opposite (bool, default: False) – If True, return samples that don’t meet the said criteria.

  • as_list (bool, default: False) – If True, return a list of sample names instead of a VcfFrame.

Returns

Subsetted VcfFrame or list of sample names.

Return type

VcfFrame or list

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'G', 'T'],
...     'ALT': ['A', 'C', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/0', '0/0', '0/0', '0/0'],
...     'B': ['./.', '0/0', '0/0', '0/0'],
...     'C': ['./.', './.', '0/0', '0/0'],
...     'D': ['./.', './.', './.', '0/0'],
...     'E': ['./.', './.', './.', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D    E
0  chr1  100  .   G   A    .      .    .     GT  0/0  ./.  ./.  ./.  ./.
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/0  ./.  ./.  ./.
2  chr1  102  .   G   C    .      .    .     GT  0/0  0/0  0/0  ./.  ./.
3  chr1  103  .   T   C    .      .    .     GT  0/0  0/0  0/0  0/0  ./.
>>> vf.empty_samples().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  ./.  ./.  ./.
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/0  ./.  ./.
2  chr1  102  .   G   C    .      .    .     GT  0/0  0/0  0/0  ./.
3  chr1  103  .   T   C    .      .    .     GT  0/0  0/0  0/0  0/0
>>> vf.empty_samples(threshold=2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/0  ./.
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/0
2  chr1  102  .   G   C    .      .    .     GT  0/0  0/0
3  chr1  103  .   T   C    .      .    .     GT  0/0  0/0
>>> vf.empty_samples(threshold=0.5).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/0  ./.
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/0
2  chr1  102  .   G   C    .      .    .     GT  0/0  0/0
3  chr1  103  .   T   C    .      .    .     GT  0/0  0/0
>>> vf.empty_samples(threshold=0.5, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    C    D    E
0  chr1  100  .   G   A    .      .    .     GT  ./.  ./.  ./.
1  chr1  101  .   T   C    .      .    .     GT  ./.  ./.  ./.
2  chr1  102  .   G   C    .      .    .     GT  0/0  ./.  ./.
3  chr1  103  .   T   C    .      .    .     GT  0/0  0/0  ./.
>>> vf.empty_samples(threshold=0.5, opposite=True, as_list=True)
['C', 'D', 'E']
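
One way to read the threshold parameter, consistent with the examples above, is sketched below. It assumes that 0 means "all variants missing", that a float is a fraction of the variant count, and that an int is an absolute count; the helper name and the set of strings treated as missing are likewise assumptions.

```python
def samples_to_remove(genotypes, threshold=0):
    """Return the names of samples whose missingness reaches the threshold.

    genotypes -- dict mapping sample name -> list of sample fields
    threshold -- 0 (all missing), an int count, or a float fraction
    """
    n_variants = len(next(iter(genotypes.values())))
    if threshold == 0:
        cutoff = n_variants                # only 100% missing samples
    elif isinstance(threshold, float):
        cutoff = threshold * n_variants    # fraction of variants
    else:
        cutoff = threshold                 # absolute number of variants
    removed = []
    for sample, calls in genotypes.items():
        # Treat './.' and '.' in the GT subfield as missing (an assumption).
        n_missing = sum(call.split(':')[0] in ('./.', '.') for call in calls)
        if n_missing >= cutoff:
            removed.append(sample)
    return removed

# Genotype columns copied from the example above.
genotypes = {
    'A': ['0/0', '0/0', '0/0', '0/0'],
    'B': ['./.', '0/0', '0/0', '0/0'],
    'C': ['./.', './.', '0/0', '0/0'],
    'D': ['./.', './.', './.', '0/0'],
    'E': ['./.', './.', './.', './.'],
}
print(samples_to_remove(genotypes))
print(samples_to_remove(genotypes, threshold=2))
print(samples_to_remove(genotypes, threshold=0.5))
```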
expand()

Expand each multiallelic locus to multiple rows.

Only the GT subfield of FORMAT will be retained.

Returns

Expanded VcfFrame.

Return type

VcfFrame

See also

VcfFrame.collapse

Collapse duplicate records in the VcfFrame.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T,G', 'G', 'A,G,CT'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Steven': ['0/1:32', './.:.', '0/1:27', '0/2:34'],
...     'Sara': ['0/0:28', '1/2:30', '1/1:29', '1/2:38'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF     ALT QUAL FILTER INFO FORMAT  Steven    Sara
0  chr1  100  .   A       C    .      .    .  GT:DP  0/1:32  0/0:28
1  chr1  101  .   A     T,G    .      .    .  GT:DP   ./.:.  1/2:30
2  chr1  102  .   C       G    .      .    .  GT:DP  0/1:27  1/1:29
3  chr1  103  .   C  A,G,CT    .      .    .  GT:DP  0/2:34  1/2:38

We can expand each of the multiallelic loci:

>>> vf.expand().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara
0  chr1  100  .   A   C    .      .    .     GT    0/1  0/0
1  chr1  101  .   A   T    .      .    .     GT    ./.  0/1
2  chr1  101  .   A   G    .      .    .     GT    ./.  0/1
3  chr1  102  .   C   G    .      .    .     GT    0/1  1/1
4  chr1  103  .   C   A    .      .    .     GT    0/0  0/1
5  chr1  103  .   C   G    .      .    .     GT    0/1  0/1
6  chr1  103  .   C  CT    .      .    .     GT    0/0  0/0
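
The genotype recoding seen in the expanded output can be sketched as follows: for each ALT allele, a call becomes 1 if it refers to that allele and 0 otherwise, and missing calls stay missing. The helper name is hypothetical, and sorting unphased alleles is an assumption made so that the result matches the output above (for example, 1/2 becoming 0/1 rather than 1/0).

```python
def expand_genotype(gt, n_alts):
    """Recode one GT string into one biallelic GT per ALT allele."""
    sep = '|' if '|' in gt else '/'
    alleles = gt.split(sep)
    expanded = []
    for i in range(1, n_alts + 1):
        recoded = ['.' if a == '.' else ('1' if a == str(i) else '0')
                   for a in alleles]
        if sep == '/':
            recoded.sort()  # unphased: allele order carries no meaning
        expanded.append(sep.join(recoded))
    return expanded

# Calls taken from the rows at POS 101 and 103 in the example.
print(expand_genotype('1/2', 2))
print(expand_genotype('0/2', 3))
print(expand_genotype('./.', 2))
```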
extract_format(k, func=None, as_nan=False)

Extract data for the specified FORMAT key.

By default, this method will return string data. Use func and as_nan to output numbers. Alternatively, select one of the special keys for k, which have predetermined values of func and as_nan for convenience.

Parameters
  • k (str) – FORMAT key to use when extracting data. In addition to regular FORMAT keys (e.g. ‘DP’, ‘AD’), the method also accepts the special keys listed below:

    • ‘#DP’: Return numeric DP.

    • ‘#AD_REF’: Return numeric AD for REF.

    • ‘#AD_ALT’: Return numeric AD for ALT. If multiple values are available (i.e. multiallelic site), return the sum.

    • ‘#AD_FRAC_REF’: Return allele fraction for REF.

    • ‘#AD_FRAC_ALT’: Return allele fraction for ALT. If multiple values are available (i.e. multiallelic site), return the sum.

  • func (function, optional) – Function to apply to each of the extracted results.

  • as_nan (bool, default: False) – If True, return missing values as NaN.

Returns

DataFrame containing requested data.

Return type

pandas.DataFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['A', 'C', 'A'],
...     'ALT': ['G', 'T', 'C,T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:AD:DP', 'GT', 'GT:AD:DP'],
...     'A': ['0/1:15,13:28', '0/0', '0/1:9,14,0:23'],
...     'B': ['./.:.:.', '1/1', '1/2:0,11,15:26'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT              A               B
0  chr1  100  .   A    G    .      .    .  GT:AD:DP   0/1:15,13:28         ./.:.:.
1  chr1  101  .   C    T    .      .    .        GT            0/0             1/1
2  chr1  102  .   A  C,T    .      .    .  GT:AD:DP  0/1:9,14,0:23  1/2:0,11,15:26
>>> vf.extract_format('GT')
     A    B
0  0/1  ./.
1  0/0  1/1
2  0/1  1/2
>>> vf.extract_format('GT', as_nan=True)
     A    B
0  0/1  NaN
1  0/0  1/1
2  0/1  1/2
>>> vf.extract_format('AD')
        A        B
0   15,13        .
1     NaN      NaN
2  9,14,0  0,11,15
>>> vf.extract_format('DP', func=lambda x: int(x), as_nan=True)
      A     B
0  28.0   NaN
1   NaN   NaN
2  23.0  26.0
>>> vf.extract_format('#DP') # Same as above
      A     B
0  28.0   NaN
1   NaN   NaN
2  23.0  26.0
>>> vf.extract_format('AD', func=lambda x: float(x.split(',')[0]), as_nan=True)
      A    B
0  15.0  NaN
1   NaN  NaN
2   9.0  0.0
>>> vf.extract_format('#AD_REF') # Same as above
      A    B
0  15.0  NaN
1   NaN  NaN
2   9.0  0.0
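
To make the lookup concrete, the sketch below pulls one FORMAT subfield out of a sample field by position and then derives a numeric reference depth in the spirit of ‘#AD_REF’. Both helper names are hypothetical, and returning None for an absent key is a simplification of the NaN shown in the output above.

```python
import math

def extract_format_value(fmt, sample, key):
    """Return the subfield for key, or None if the key is not available."""
    keys = fmt.split(':')
    values = sample.split(':')
    if key not in keys:
        return None
    index = keys.index(key)
    if index >= len(values):   # trailing subfields may be omitted in VCF
        return None
    return values[index]

def ad_ref(fmt, sample):
    """Return the REF allelic depth as a float, or NaN when unavailable."""
    value = extract_format_value(fmt, sample, 'AD')
    if value is None or value == '.':
        return math.nan
    return float(value.split(',')[0])

print(extract_format_value('GT:AD:DP', '0/1:15,13:28', 'AD'))
print(ad_ref('GT:AD:DP', '1/2:0,11,15:26'))
```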
extract_info(k, func=None, as_nan=False)

Extract data for the specified INFO key.

By default, this method will return string data. Use func and as_nan to output numbers. Alternatively, select one of the special keys for k, which have predetermined values of func and as_nan for convenience.

Parameters
  • k (str) – INFO key to use when extracting data. In addition to regular INFO keys (e.g. ‘AC’, ‘AF’), the method also accepts the special keys listed below:

    • ‘#AC’: Return numeric AC. If multiple values are available (i.e. multiallelic site), return the sum.

    • ‘#AF’: Similar to ‘#AC’.

  • func (function, optional) – Function to apply to each of the extracted results.

  • as_nan (bool, default: False) – If True, return missing values as NaN.

Returns

Requested data.

Return type

pandas.Series

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'C', 'A', 'A'],
...     'ALT': ['G', 'T', 'C,T', 'T'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['AC=1;AF=0.167;H2', 'AC=2;AF=0.333', 'AC=1,2;AF=0.167,0.333;H2', 'AC=.;AF=.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/1', './.'],
...     'B': ['0/0', '1/1', '0/2', './.'],
...     'C': ['0/0', '0/0', '0/2', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER                      INFO FORMAT    A    B    C
0  chr1  100  .   A    G    .      .          AC=1;AF=0.167;H2     GT  0/1  0/0  0/0
1  chr1  101  .   C    T    .      .             AC=2;AF=0.333     GT  0/0  1/1  0/0
2  chr1  102  .   A  C,T    .      .  AC=1,2;AF=0.167,0.333;H2     GT  0/1  0/2  0/2
3  chr1  103  .   A    T    .      .                 AC=.;AF=.     GT  ./.  ./.  ./.
>>> vf.extract_info('H2')
0     H2
1    NaN
2     H2
3    NaN
dtype: object
>>> vf.extract_info('AC')
0      1
1      2
2    1,2
3      .
dtype: object
>>> vf.extract_info('AC', as_nan=True)
0      1
1      2
2    1,2
3    NaN
dtype: object
>>> vf.extract_info('AC', func=lambda x: sum([int(x) for x in x.split(',')]), as_nan=True)
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> vf.extract_info('#AC') # Same as above
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
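
For readers who want to see the parsing step in isolation, here is a small sketch that splits an INFO string into key/value pairs. The helper name parse_info is hypothetical; mapping a flag to its own name mirrors the ‘H2’ output above, but the rest is an illustration rather than fuc's implementation.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags map to their own name."""
    fields = {}
    if info == '.':            # '.' marks an empty INFO column
        return fields
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        fields[key] = value if sep else key
    return fields

info = parse_info('AC=1,2;AF=0.167,0.333;H2')
# Summing per-allele counts, as the lambda in the example does.
ac_total = sum(int(x) for x in info['AC'].split(','))
print(info, ac_total)
```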
fetch(variant)

Fetch the VCF row that matches the specified variant.

Parameters

variant (str) – Target variant, written as CHROM-POS-REF-ALT (e.g. ‘chr1-100-G-A’).

Returns

VCF row.

Return type

pandas.Series

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.fetch('chr1-100-G-A')
CHROM     chr1
POS        100
ID           .
REF          G
ALT          A
QUAL         .
FILTER       .
INFO         .
FORMAT      GT
A          0/1
Name: 0, dtype: object
filter_bed(bed, opposite=False, as_index=False)

Filter rows intersecting with given BED.

Only variants intersecting with given BED data will remain.

Parameters
  • bed (pybed.BedFrame or str) – BedFrame or path to a BED file.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pybed, pyvcf
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr3'],
...     'POS': [100, 201, 450, 99],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'AT', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  201  .  CT   C    .      .    .     GT    1/1
2  chr2  450  .   A  AT    .      .    .     GT    0/1
3  chr3   99  .   C   A    .      .    .     GT    0/1

We can select rows that overlap with the BED data:

>>> vf.filter_bed(bf).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  450  .   A  AT    .      .    .     GT    0/1

We can also remove those rows:

>>> vf.filter_bed(bf, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  201  .  CT   C    .      .    .     GT    1/1
1  chr3   99  .   C   A    .      .    .     GT    0/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_bed(bf, as_index=True)
0     True
1    False
2     True
3    False
dtype: bool
filter_empty(threshold=0, opposite=False, as_index=False)

Filter rows with high missingness.

Variants with missingness >= threshold will be removed.

Parameters
  • threshold (int, default: 0) – Exclude the row if it has a number of missing genotypes that is greater than or equal to this number. When 0 (default), exclude rows where all of the samples have a missing genotype.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C', 'C'],
...     'ALT': ['A', 'C', 'T', 'A', 'T'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', './.', './.', './.'],
...     'B': ['0/0', '0/1', './.', './.', './.'],
...     'C': ['0/0', '0/0', '0/1', './.', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  ./.  0/1  0/0
2  chr1  102  .   A   T    .      .    .     GT  ./.  ./.  0/1
3  chr1  103  .   C   A    .      .    .     GT  ./.  ./.  ./.
4  chr1  104  .   C   T    .      .    .     GT  ./.  ./.  ./.

We can remove rows that are completely empty:

>>> vf.filter_empty().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  ./.  0/1  0/0
2  chr1  102  .   A   T    .      .    .     GT  ./.  ./.  0/1

We can remove rows where at least two samples have a missing genotype:

>>> vf.filter_empty(threshold=2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  ./.  0/1  0/0

We can show rows that are completely empty:

>>> vf.filter_empty(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  103  .   C   A    .      .    .     GT  ./.  ./.  ./.
1  chr1  104  .   C   T    .      .    .     GT  ./.  ./.  ./.

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_empty(as_index=True)
0     True
1     True
2     True
3    False
4    False
dtype: bool
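
The interplay of threshold, opposite and the boolean index can be sketched in a few lines. In this illustration, True marks a row that is kept, threshold=0 is read as "every sample missing", and opposite simply inverts the mask; the function name is hypothetical and only ‘./.’ is treated as missing.

```python
def filter_empty_mask(rows, threshold=0, opposite=False):
    """Return one boolean per row; True marks rows that are kept.

    rows -- list of per-variant genotype lists
    """
    mask = []
    for gts in rows:
        cutoff = threshold if threshold else len(gts)
        keep = sum(gt == './.' for gt in gts) < cutoff
        mask.append(not keep if opposite else keep)
    return mask

# Genotypes per row, copied from the example above.
rows = [
    ['0/1', '0/0', '0/0'],
    ['./.', '0/1', '0/0'],
    ['./.', './.', '0/1'],
    ['./.', './.', './.'],
    ['./.', './.', './.'],
]
print(filter_empty_mask(rows))
print(filter_empty_mask(rows, threshold=2))
print(filter_empty_mask(rows, opposite=True))
```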
filter_flagall(flags, opposite=False, as_index=False)

Filter rows with all given INFO flags.

Only variants with all given INFO flags will remain.

Parameters
  • flags (list) – List of INFO flags.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_flagany

Similar method that selects rows if any one of the given INFO flags is present.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['DB', 'DB;H2', 'DB;H2', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     DB     GT    0/0
1  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
2  chr1  102  .   A   T    .      .  DB;H2     GT    0/1
3  chr1  103  .   C   A    .      .      .     GT    0/0

We can select rows with both the H2 and DB tags:

>>> vf.filter_flagall(['H2', 'DB']).df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
1  chr1  102  .   A   T    .      .  DB;H2     GT    0/1

We can also remove those rows:

>>> vf.filter_flagall(['H2', 'DB'], opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .   DB     GT    0/0
1  chr1  103  .   C   A    .      .    .     GT    0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_flagall(['H2', 'DB'], as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
filter_flagany(flags, opposite=False, as_index=False)

Filter rows with any of the given INFO flags.

Only variants with at least one of the given INFO flags will remain.

Parameters
  • flags (list) – List of INFO flags.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_flagall

Similar method that selects rows if all of the given INFO flags are present.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['DB', 'DB;H2', 'DB;H2', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     DB     GT    0/0
1  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
2  chr1  102  .   A   T    .      .  DB;H2     GT    0/1
3  chr1  103  .   C   A    .      .      .     GT    0/0

We can select rows with the H2 tag:

>>> vf.filter_flagany(['H2']).df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
1  chr1  102  .   A   T    .      .  DB;H2     GT    0/1

We can also remove those rows:

>>> vf.filter_flagany(['H2'], opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .   DB     GT    0/0
1  chr1  103  .   C   A    .      .    .     GT    0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_flagany(['H2'], as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
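
The difference between this method and VcfFrame.filter_flagall comes down to any versus all over the requested flags. The sketch below shows that distinction on the INFO values from the example; the helper names are hypothetical and the INFO column is simply split on semicolons.

```python
def info_items(info):
    """Return the set of semicolon-separated INFO items ('.' means none)."""
    return set() if info == '.' else set(info.split(';'))

def has_all_flags(info, flags):
    items = info_items(info)
    return all(flag in items for flag in flags)

def has_any_flag(info, flags):
    items = info_items(info)
    return any(flag in items for flag in flags)

infos = ['DB', 'DB;H2', 'DB;H2', '.']
all_mask = [has_all_flags(i, ['H2', 'DB']) for i in infos]
any_mask = [has_any_flag(i, ['H2', 'DB']) for i in infos]
print(all_mask)
print(any_mask)
```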
filter_indel(opposite=False, as_index=False)

Filter rows with an indel.

Variants with an indel will be removed.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'C,AT', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '1/2', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF   ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G     A    .      .    .     GT    0/1
1  chr1  101  .  CT     C    .      .    .     GT    0/1
2  chr1  102  .   A  C,AT    .      .    .     GT    1/2
3  chr1  103  .   C     A    .      .    .     GT    0/1

We can remove rows with an indel:

>>> vf.filter_indel().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  103  .   C   A    .      .    .     GT    0/1

We can also select those rows:

>>> vf.filter_indel(opposite=True).df
  CHROM  POS ID REF   ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .  CT     C    .      .    .     GT    0/1
1  chr1  102  .   A  C,AT    .      .    .     GT    1/2

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_indel(as_index=True)
0     True
1    False
2    False
3     True
dtype: bool
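
A simple length-based test reproduces the result above: a row is treated as containing an indel when REF or any ALT allele is longer than one base. This is an assumed rule for illustration (it ignores symbolic alleles, for instance), and the helper name is hypothetical.

```python
def has_indel(ref, alt):
    """Return True if REF or any ALT allele is longer than one base."""
    return len(ref) > 1 or any(len(a) > 1 for a in alt.split(','))

# REF and ALT pairs copied from the example above.
variants = [('G', 'A'), ('CT', 'C'), ('A', 'C,AT'), ('C', 'A')]
# True marks rows that are kept, matching the as_index=True output.
mask = [not has_indel(ref, alt) for ref, alt in variants]
print(mask)
```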
filter_multialt(opposite=False, as_index=False)

Filter rows with multiple ALT alleles.

Variants with multiple ALT alleles will be removed.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C,T', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/2', '0/0', '0/1', './.'],
...     'B': ['0/1', '0/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  101  .   A    T    .      .    .     GT  0/0  0/1
2  chr1  102  .   C    G    .      .    .     GT  0/1  ./.
3  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2

We can remove rows with multiple ALT alleles:

>>> vf.filter_multialt().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  0/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.

We can also select those rows:

>>> vf.filter_multialt(opposite=True).df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_multialt(as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
filter_pass(opposite=False, as_index=False)

Filter rows with PASS in FILTER column.

Only variants with PASS in the FILTER column will remain.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['PASS', 'FAIL', 'PASS', 'FAIL'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', './.', '0/1', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .   PASS    .     GT    0/0
1  chr1  101  .   T   C    .   FAIL    .     GT    ./.
2  chr1  102  .   A   T    .   PASS    .     GT    0/1
3  chr1  103  .   C   A    .   FAIL    .     GT    ./.

We can select rows with PASS:

>>> vf.filter_pass().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .   PASS    .     GT    0/0
1  chr1  102  .   A   T    .   PASS    .     GT    0/1

We can also remove those rows:

>>> vf.filter_pass(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .   T   C    .   FAIL    .     GT    ./.
1  chr1  103  .   C   A    .   FAIL    .     GT    ./.

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_pass(as_index=True)
0     True
1    False
2     True
3    False
dtype: bool
filter_phased(opposite=False, as_index=False)

Filter rows with phased genotypes.

Variants with phased genotypes will be removed.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'C', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['1|0', '0/1', '0/1', '0|1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    1|0
1  chr1  101  .  CT   C    .      .    .     GT    0/1
2  chr1  102  .   A   C    .      .    .     GT    0/1
3  chr1  103  .   C   A    .      .    .     GT    0|1

We can remove rows with a phased genotype:

>>> vf.filter_phased().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .  CT   C    .      .    .     GT    0/1
1  chr1  102  .   A   C    .      .    .     GT    0/1

We can also select those rows:

>>> vf.filter_phased(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    1|0
1  chr1  103  .   C   A    .      .    .     GT    0|1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_phased(as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
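
The row selection above can be pictured with a minimal pure-Python sketch. It assumes that a genotype counts as phased when its GT string uses the '|' separator, as in the VCF convention; the helper name is_phased is hypothetical and this is not necessarily how fuc implements the check.

```python
def is_phased(gt):
    """Return True if a GT string such as '0|1' uses the phased separator."""
    # In VCF, '|' separates phased alleles and '/' separates unphased ones.
    return '|' in gt


# The Steven column from the example above.
genotypes = ['1|0', '0/1', '0/1', '0|1']

# Keep a row only when its genotype is not phased.
keep = [not is_phased(gt) for gt in genotypes]
print(keep)
```

The resulting mask lines up with the boolean index shown for as_index=True.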
filter_polyp(opposite=False, as_index=False)[source]

Filter rows with polyploid genotypes.

Variants with polyploid genotypes will be removed.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0/1', '0/0', '1/1/1', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   C    .      .    .     GT  0/0/1
1  chr1  100  .   A   T    .      .    .     GT    0/0
2  chr2  200  .   C   G    .      .    .     GT  1/1/1
3  chr2  200  .   C   G    .      .    .     GT    ./.

We can remove rows with a polyploid genotype call:

>>> vf.filter_polyp().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   T    .      .    .     GT    0/0
1  chr2  200  .   C   G    .      .    .     GT    ./.

We can also select those rows:

>>> vf.filter_polyp(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   C    .      .    .     GT  0/0/1
1  chr2  200  .   C   G    .      .    .     GT  1/1/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_polyp(as_index=True)
0    False
1     True
2    False
3     True
dtype: bool
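
As a rough illustration of what "polyploid" means here, the sketch below counts the alleles in each GT string and flags anything with more than two. The helper names ploidy and is_polyploid are hypothetical and the logic is an assumption based on the example output, not fuc's actual code.

```python
import re


def ploidy(gt):
    """Count the alleles in a GT string such as '0/0/1'."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    return len(re.split(r'[/|]', gt))


def is_polyploid(gt):
    """Return True when a genotype call has more than two alleles."""
    return ploidy(gt) > 2


# The Steven column from the example above.
genotypes = ['0/0/1', '0/0', '1/1/1', './.']

# Keep a row only when its genotype is not polyploid.
keep = [not is_polyploid(gt) for gt in genotypes]
print(keep)
```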
filter_qual(threshold, opposite=False, as_index=False)[source]

Filter rows with low QUAL values.

Only variants with QUAL >= threshold will remain.

Parameters
  • threshold (float) – Minimum QUAL value.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C', 'C'],
...     'ALT': ['A', 'C', 'T', 'A', 'T'],
...     'QUAL': ['.', 30, 19, 41, 29],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '0/1', '0/1', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C   30      .    .     GT    1/1
2  chr1  102  .   A   T   19      .    .     GT    0/1
3  chr1  103  .   C   A   41      .    .     GT    0/1
4  chr1  104  .   C   T   29      .    .     GT    1/1

We can select rows with minimum QUAL value of 30:

>>> vf.filter_qual(30).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .   T   C   30      .    .     GT    1/1
1  chr1  103  .   C   A   41      .    .     GT    0/1

We can also remove those rows:

>>> vf.filter_qual(30, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  102  .   A   T   19      .    .     GT    0/1
2  chr1  104  .   C   T   29      .    .     GT    1/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_qual(30, as_index=True)
0    False
1     True
2    False
3     True
4    False
dtype: bool
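
Note that the row with a missing QUAL ('.') is dropped in the example above. A minimal sketch of that comparison, assuming a missing value is simply treated as failing the threshold (the helper passes_qual is hypothetical, not part of fuc):

```python
def passes_qual(qual, threshold):
    """Return True if a QUAL value meets the minimum threshold."""
    # A missing QUAL ('.') cannot be compared with a number,
    # so this sketch treats it as failing the filter.
    if qual == '.':
        return False
    return float(qual) >= threshold


# The QUAL column from the example above.
quals = ['.', 30, 19, 41, 29]

mask = [passes_qual(q, 30) for q in quals]
print(mask)
```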
filter_sampall(samples=None, opposite=False, as_index=False)[source]

Filter rows where all given samples have the variant.

Only variants that all given samples have will remain. The default behavior is to use all samples in the VcfFrame.

Parameters
  • samples (list, optional) – List of sample names or indices.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_sampany

Similar method that selects rows if any one of the given samples has the variant.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/0', '0/1', '0/1'],
...     'Sara': ['0/1', '0/1', '0/0', '0/1'],
...     'James': ['0/1', '0/1', '0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/1
3  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can select rows where all three samples have the variant:

>>> vf.filter_sampall().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can also remove those rows:

>>> vf.filter_sampall(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
1  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/1

We can select rows where both Sara and James have the variant:

>>> vf.filter_sampall(samples=['Sara', 'James']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampall(as_index=True)
0     True
1    False
2    False
3     True
dtype: bool
filter_sampany(samples=None, opposite=False, as_index=False)[source]

Filter rows where any of the given samples has the variant.

Only variants that at least one of the given samples has will remain. The default behavior is to use all samples in the VcfFrame.

Parameters
  • samples (list, optional) – List of sample names or indices.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_sampall

Similar method that selects rows if all of the given samples have the variant.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/0', '0/1', '0/0'],
...     'Sara': ['0/0', '0/1', '0/0', '0/0'],
...     'James': ['0/1', '0/0', '0/0', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0
3  chr1  103  .   T   C    .      .    .     GT    0/0  0/0   0/0

We can select rows where at least one sample has the variant:

>>> vf.filter_sampany().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

We can also remove those rows:

>>> vf.filter_sampany(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  103  .   T   C    .      .    .     GT    0/0  0/0   0/0

We can select rows where either Sara or James has the variant:

>>> vf.filter_sampany(samples=['Sara', 'James']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampany(as_index=True)
0     True
1     True
2     True
3    False
dtype: bool
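
The difference between filter_sampany and filter_sampall comes down to any() versus all() over the samples of each row. The sketch below assumes a sample "has the variant" when its GT contains at least one allele other than '0' or '.'; the helper has_variant is hypothetical and only approximates whatever rule fuc applies.

```python
import re


def has_variant(gt):
    """Return True if a genotype carries at least one non-reference allele."""
    # Drop any FORMAT subfields after GT, then split on '/' or '|'.
    alleles = re.split(r'[/|]', gt.split(':')[0])
    return any(allele not in ('0', '.') for allele in alleles)


# One list per row, holding the Steven, Sara and James genotypes from above.
rows = [
    ['0/0', '0/0', '0/1'],
    ['0/0', '0/1', '0/0'],
    ['0/1', '0/0', '0/0'],
    ['0/0', '0/0', '0/0'],
]

any_mask = [any(has_variant(gt) for gt in row) for row in rows]
all_mask = [all(has_variant(gt) for gt in row) for row in rows]
print(any_mask)
print(all_mask)
```

With this data no row has the variant in every sample, so the all() mask is entirely False while the any() mask matches the boolean index shown above.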
filter_sampnum(threshold, opposite=False, as_index=False)[source]

Filter rows with high variant prevalence.

Only variants with variant prevalence >= threshold will remain.

Parameters
  • threshold (int or float) – Minimum number or fraction of samples with the variant.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '0/1'],
...     'Sara': ['0/0', '0/1', '0/0'],
...     'James': ['0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

We can select rows where at least two samples have the variant:

>>> vf.filter_sampnum(2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1

Similarly, we can select rows where at least 50% of the samples have the variant:

>>> vf.filter_sampnum(0.5).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can also remove those rows:

>>> vf.filter_sampnum(0.5, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampnum(2, as_index=True)
0     True
1     True
2    False
dtype: bool
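
The threshold accepts either a count or a fraction. One way to picture the distinction is sketched below, under the assumption that an int is read as a number of samples and a float as a fraction of all samples; the helper meets_prevalence is hypothetical and the dispatch rule is a guess based on the examples above.

```python
def meets_prevalence(n_with_variant, n_samples, threshold):
    """Return True if enough samples carry the variant."""
    # Assumption: an int threshold is a sample count,
    # while a float threshold is a fraction of all samples.
    if isinstance(threshold, int):
        return n_with_variant >= threshold
    return n_with_variant / n_samples >= threshold


# Number of samples (out of three) with the variant in each row above.
counts = [2, 3, 1]

by_count = [meets_prevalence(c, 3, 2) for c in counts]
by_fraction = [meets_prevalence(c, 3, 0.5) for c in counts]
print(by_count)
print(by_fraction)
```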
filter_vcf(vcf, opposite=False, as_index=False)[source]

Filter rows intersecting with given VCF.

Only variants intersecting with given VCF data will remain.

Parameters
  • vcf (VcfFrame or str) – VcfFrame or VCF file.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1', 'chr4', 'chr8', 'chr8'],
...     'POS': [100, 203, 192, 52, 788],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['A', 'C', 'T', 'T', 'GA'],
...     'ALT': ['C', 'G', 'A', 'G', 'G'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/1', '0/1', '0/1', '0/1'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   A   C    .      .    .     GT  0/1
1  chr1  203  .   C   G    .      .    .     GT  0/1
2  chr4  192  .   T   A    .      .    .     GT  0/1
3  chr8   52  .   T   G    .      .    .     GT  0/1
4  chr8  788  .  GA   G    .      .    .     GT  0/1
>>> data2 = {
...     'CHROM': ['chr1', 'chr8'],
...     'POS': [100, 788],
...     'ID': ['.', '.'],
...     'REF': ['A', 'GA'],
...     'ALT': ['C', 'G'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
... }
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO
0  chr1  100  .   A   C    .      .    .
1  chr8  788  .  GA   G    .      .    .

We can select rows that overlap with the VCF data:

>>> vf1.filter_vcf(vf2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   A   C    .      .    .     GT  0/1
1  chr8  788  .  GA   G    .      .    .     GT  0/1

We can also remove those rows:

>>> vf1.filter_vcf(vf2, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  203  .   C   G    .      .    .     GT  0/1
1  chr4  192  .   T   A    .      .    .     GT  0/1
2  chr8   52  .   T   G    .      .    .     GT  0/1

Finally, we can return a boolean index array from the filtering:

>>> vf1.filter_vcf(vf2, as_index=True)
0     True
1    False
2    False
3    False
4     True
dtype: bool
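
A minimal sketch of the intersection, assuming two records "intersect" when they share the same CHROM, POS, REF, and ALT. That matching rule is an assumption consistent with the example above, not a statement of how fuc compares records, and variant_key is a hypothetical helper.

```python
def variant_key(row):
    """Build a hashable key that identifies a variant record."""
    return (row['CHROM'], row['POS'], row['REF'], row['ALT'])


# Records from vf1 and vf2 in the example above, reduced to the key fields.
rows1 = [
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'C'},
    {'CHROM': 'chr1', 'POS': 203, 'REF': 'C', 'ALT': 'G'},
    {'CHROM': 'chr4', 'POS': 192, 'REF': 'T', 'ALT': 'A'},
    {'CHROM': 'chr8', 'POS': 52, 'REF': 'T', 'ALT': 'G'},
    {'CHROM': 'chr8', 'POS': 788, 'REF': 'GA', 'ALT': 'G'},
]
rows2 = [
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'C'},
    {'CHROM': 'chr8', 'POS': 788, 'REF': 'GA', 'ALT': 'G'},
]

# A set gives constant-time membership tests for each record in rows1.
keys2 = {variant_key(row) for row in rows2}
mask = [variant_key(row) in keys2 for row in rows1]
print(mask)
```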
classmethod from_dict(meta, data)[source]

Construct VcfFrame from a dict of array-like or dicts.

Parameters
  • meta (list) – List of the metadata lines.

  • data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

VcfFrame.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

Below is a simple example:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr2  101  .   T   C    .      .    .     GT  1/1
classmethod from_file(fn, compression=False, meta_only=False, regions=None)[source]

Construct VcfFrame from a VCF file.

The method will automatically use BGZF decompression if the filename ends with ‘.gz’.

If the file is large, you can specify regions of interest to speed up data processing. Note that this requires the file to be BGZF compressed and indexed (.tbi) for random access. Each region to be sliced must have the format chrom:start-end and be a half-open interval with (start, end]. This means, for example, ‘chr1:100-103’ will extract positions 101, 102, and 103. Alternatively, you can provide BED data to specify regions.

Parameters
  • fn (str or file-like object) – VCF file (compressed or uncompressed). By file-like object, we refer to objects with a read() method, such as a file handle.

  • compression (bool, default: False) – If True, use BGZF decompression regardless of the filename.

  • meta_only (bool, default: False) – If True, only read metadata and header lines.

  • regions (str, list, or pybed.BedFrame, optional) – Region or list of regions to be sliced. Also accepts a BED file or a BedFrame.

Returns

VcfFrame object.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('unzipped.vcf')
>>> vf = pyvcf.VcfFrame.from_file('zipped.vcf.gz')
>>> vf = pyvcf.VcfFrame.from_file('zipped.vcf', compression=True)
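
The half-open (start, end] convention for regions can be checked with a small standalone sketch. The helpers parse_region and positions_in_region are hypothetical and only illustrate the interval arithmetic described above; they do not read any file.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    # rsplit keeps contig names that themselves contain ':' intact.
    chrom, interval = region.rsplit(':', 1)
    start, end = interval.split('-')
    return chrom, int(start), int(end)


def positions_in_region(region):
    """List the positions covered by a half-open (start, end] region."""
    _, start, end = parse_region(region)
    # The start position is excluded and the end position is included.
    return list(range(start + 1, end + 1))


print(parse_region('chr1:100-103'))
print(positions_in_region('chr1:100-103'))
```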
classmethod from_string(s, meta_only=False)[source]

Construct VcfFrame from a string.

Parameters
  • s (str) – String representation of a VCF file.

  • meta_only (bool, default: False) – If True, only read metadata and header lines.

Returns

VcfFrame object.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict(['##fileformat=VCFv4.3'], data)
>>> s = vf.to_string()
>>> print(s[:20])
##fileformat=VCFv4.3
>>> vf = pyvcf.VcfFrame.from_string(s)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1
get_af(sample, variant)[source]

Get allele fraction for a pair of sample and variant.

The method will return numpy.nan if the value is missing.

Parameters
  • sample (str) – Sample name.

  • variant (str) – Variant name (e.g. ‘chr1-100-A-C’).

Returns

Allele fraction.

Return type

float

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'G', 'A', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD:AF', 'GT:AD:AF', 'GT:AF', 'GT:AD:AF'],
...     'A': ['0/1:12,15:0.444,0.556', '0/0:32,1:0.970,0.030', '0/1:.', './.:.:.'],
...     'B': ['0/1:13,17:0.433,0.567', '0/1:14,15:0.483,0.517', './.:.', '1/2:0,11,17:0.000,0.393,0.607'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT                      A                              B
0  chr1  100  .   A    C    .      .    .  GT:AD:AF  0/1:12,15:0.444,0.556          0/1:13,17:0.433,0.567
1  chr1  101  .   G    T    .      .    .  GT:AD:AF   0/0:32,1:0.970,0.030          0/1:14,15:0.483,0.517
2  chr1  102  .   A    G    .      .    .     GT:AF                  0/1:.                          ./.:.
3  chr1  103  .   C  G,A    .      .    .  GT:AD:AF                ./.:.:.  1/2:0,11,17:0.000,0.393,0.607
>>> vf.get_af('A', 'chr1-100-A-C')
0.556
>>> vf.get_af('B', 'chr1-102-A-G')
nan
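
To see where a value such as 0.556 comes from, the sketch below looks up the AF subfield through the FORMAT string and picks the entry for the first ALT allele. The function allele_fraction and its alt_index parameter are hypothetical, and math.nan stands in for numpy.nan to keep the sketch dependency-free.

```python
import math


def allele_fraction(fmt, sample_value, alt_index=1):
    """Extract the AF entry for one allele from a sample field."""
    keys = fmt.split(':')
    if 'AF' not in keys:
        return math.nan
    # Subfields in the sample value follow the order given by FORMAT.
    field = sample_value.split(':')[keys.index('AF')]
    values = field.split(',')
    # Index 0 is the REF allele, so index 1 is the first ALT allele.
    if alt_index >= len(values) or values[alt_index] == '.':
        return math.nan
    return float(values[alt_index])


af_a = allele_fraction('GT:AD:AF', '0/1:12,15:0.444,0.556')
af_b = allele_fraction('GT:AF', './.:.')
print(af_a, af_b)
```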
property has_chr_prefix

Whether the ‘chr’ prefix is found in contig names.

Type

bool

markmiss(expr, greedy=False, opposite=False, samples=None, as_nan=False)[source]

Mark all genotypes that satisfy the query expression as missing.

Parameters
  • expr (str) – The expression to evaluate. See the examples below for details.

  • greedy (bool, default: False) – If True, mark even ambiguous genotypes as missing.

  • opposite (bool, default: False) – If True, mark all genotypes that do not satisfy the query expression as missing and leave those that do intact.

  • samples (list, optional) – If provided, apply the marking only to these samples.

  • as_nan (bool, default: False) – If True, mark genotypes as NaN instead of as missing.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'G'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP:AD', 'GT:DP:AD', 'GT:DP:AD'],
...     'A': ['0/0:26:0,26', '0/1:32:16,16', '0/0:.:.'],
...     'B': ['./.:.:.', '0/0:31:29,2', './.:.:.'],
...     'C': ['0/1:18:12,6', '0/0:24:24,0', '1/1:8:0,8'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.    1/1:8:0,8

To mark as missing all genotypes with 0/0:

>>> vf.markmiss('GT == "0/0"').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.      ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.    1/1:8:0,8

To mark as missing all genotypes that do not have 0/0:

>>> vf.markmiss('GT != "0/0"').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT            A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD  0/0:26:0,26      ./.:.:.      ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD      ./.:.:.  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD      0/0:.:.      ./.:.:.      ./.:.:.

To mark as missing all genotypes whose DP is below 30:

>>> vf.markmiss('DP < 30').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.  ./.:.:.

Note that the genotype 0/0:.:. was not marked as missing because its DP is missing, so the expression could not be evaluated for it. To mark even ambiguous genotypes like this one as missing, set greedy to True:

>>> vf.markmiss('DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10:

>>> vf.markmiss('AD[1] < 10', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26  ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10 and DP is below 30:

>>> vf.markmiss('AD[1] < 10 and DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10 or DP is below 30:

>>> vf.markmiss('AD[1] < 10 or DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.

To only retain genotypes whose ALT allele has read depth below 10 or DP is below 30:

>>> vf.markmiss('AD[1] < 10 or DP < 30', opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT            A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD  0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD      ./.:.:.  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD      ./.:.:.      ./.:.:.    1/1:8:0,8

To mark as missing all genotypes whose mean of AD is below 10:

>>> vf.markmiss('np.mean(AD) < 10', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.      ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.      ./.:.:.

To do the same as above, but only for the samples A and B:

>>> vf.markmiss('np.mean(AD) < 10', greedy=True, samples=['A', 'B']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.    1/1:8:0,8

To mark as NaN all genotypes whose sum of AD is below 10:

>>> vf.markmiss('sum(AD) < 10', as_nan=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.          NaN

Marking as NaN is useful when, for example, it is necessary to count how many genotypes are marked:

>>> vf.markmiss('sum(AD) < 10', as_nan=True).df.isna().sum().sum()
1
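
The effect of greedy on ambiguous genotypes can be sketched for the single condition DP < min_dp. This simplified, hypothetical helper hard-codes the condition instead of evaluating an expression string, so it only illustrates the decision logic and not how markmiss parses expr.

```python
def parse_sample(fmt, value):
    """Map FORMAT keys to subfield strings, e.g. {'GT': '0/0', 'DP': '26'}."""
    return dict(zip(fmt.split(':'), value.split(':')))


def mark_low_depth(fmt, value, min_dp, greedy=False):
    """Return the missing genotype if DP < min_dp, else the original value."""
    fields = parse_sample(fmt, value)
    # The missing form is './.' for GT and '.' for every other subfield.
    missing = ':'.join('./.' if key == 'GT' else '.' for key in fmt.split(':'))
    if fields['DP'] == '.':
        # DP is absent, so the condition is ambiguous:
        # mark the genotype only when greedy is requested.
        return missing if greedy else value
    return missing if int(fields['DP']) < min_dp else value


print(mark_low_depth('GT:DP:AD', '0/0:26:0,26', 30))
print(mark_low_depth('GT:DP:AD', '0/0:.:.', 30))
print(mark_low_depth('GT:DP:AD', '0/0:.:.', 30, greedy=True))
```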
merge(other, how='inner', format='GT', sort=True, collapse=False)[source]

Merge with the other VcfFrame.

Parameters
  • other (VcfFrame) – Other VcfFrame. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the contig names of self.

  • how (str, default: ‘inner’) – Type of merge as defined in pandas.DataFrame.merge().

  • format (str, default: ‘GT’) – FORMAT subfields to be retained (e.g. ‘GT:AD:DP’).

  • sort (bool, default: True) – If True, sort the VcfFrame before returning.

  • collapse (bool, default: False) – If True, collapse duplicate records.

Returns

Merged VcfFrame.

Return type

VcfFrame

See also

merge

Merge multiple VcfFrame objects.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP'],
...     'A': ['0/0:32', '0/1:29'],
...     'B': ['0/1:24', '1/1:30'],
... }
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr2'],
...     'POS': [100, 101, 200],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP'],
...     'C': ['./.:.', '0/0:24', '0/0:26'],
...     'D': ['0/1:24', '0/1:31', '0/1:26'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       A       B
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       C       D
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP  0/0:26  0/1:26

We can merge the two VcfFrames with how='inner' (default):

>>> vf1.merge(vf2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/1  ./.  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/0  0/1

We can also merge with how='outer':

>>> vf1.merge(vf2, how='outer').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/1  ./.  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/0  0/1
2  chr2  200  .   A   T    .      .    .     GT  ./.  ./.  0/0  0/1

Since both VcfFrames have the DP subfield, we can use format='GT:DP':

>>> vf1.merge(vf2, how='outer', format='GT:DP').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       A       B       C       D
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP   ./.:.   ./.:.  0/0:26  0/1:26
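
The difference between how='inner' and how='outer' can be sketched with plain dictionaries keyed by variant. The function merge_genotypes is hypothetical and handles only GT values; it assumes non-empty inputs and that samples absent from one side are filled with the missing genotype './.', as the outer-merge output above suggests.

```python
def merge_genotypes(left, right, how='inner'):
    """Merge two {variant: [GT per sample]} mappings, sample columns side by side."""
    n_left = len(next(iter(left.values())))
    n_right = len(next(iter(right.values())))
    if how == 'inner':
        # Only variants present on both sides survive.
        keys = [key for key in left if key in right]
    elif how == 'outer':
        # Every variant from either side survives.
        keys = list(left) + [key for key in right if key not in left]
    else:
        raise ValueError(f'unsupported merge type: {how}')
    merged = {}
    for key in keys:
        # Fill the side that lacks this variant with missing genotypes.
        left_gts = left.get(key, ['./.'] * n_left)
        right_gts = right.get(key, ['./.'] * n_right)
        merged[key] = left_gts + right_gts
    return merged


# GT values for samples A, B (left) and C, D (right) from the example above.
left = {
    ('chr1', 100, 'G', 'A'): ['0/0', '0/1'],
    ('chr1', 101, 'T', 'C'): ['0/1', '1/1'],
}
right = {
    ('chr1', 100, 'G', 'A'): ['./.', '0/1'],
    ('chr1', 101, 'T', 'C'): ['0/0', '0/1'],
    ('chr2', 200, 'A', 'T'): ['0/0', '0/1'],
}

inner = merge_genotypes(left, right)
outer = merge_genotypes(left, right, how='outer')
print(len(inner), len(outer))
```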
property meta

List of metadata lines.

Type

list

meta_keys()[source]

Print metadata lines with a key.

miss2ref()[source]

Convert missing genotype (./.) to homozygous REF (0/0).

Returns

VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['./.', '1/1'],
...     'B': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  ./.  ./.
1  chr2  101  .   T   C    .      .    .     GT  1/1  ./.
>>> new_vf = vf.miss2ref()
>>> new_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/0
1  chr2  101  .   T   C    .      .    .     GT  1/1  0/0
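
The conversion itself is a simple substitution. A minimal stand-alone sketch, assuming GT-only genotype strings as in the example above (the helper name miss2ref_list is hypothetical and is not part of fuc):

```python
def miss2ref_list(genotypes):
    """Return a copy of the genotypes with './.' replaced by '0/0'.

    Hypothetical helper: only the GT-only case is handled, so strings
    with extra subfields (for example './.:.') are left unchanged.
    """
    return ["0/0" if gt == "./." else gt for gt in genotypes]


sample_a = ["./.", "1/1"]
sample_b = ["./.", "./."]
print(miss2ref_list(sample_a))
print(miss2ref_list(sample_b))
```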
property phased

Return True if every genotype in VcfFrame is haplotype phased.

Returns

True if the VcfFrame is fully phased, False otherwise. Also False if the VcfFrame is empty.

Return type

bool

Examples

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['1|1', '0|0', '1|0'],
...     'B': ['1|0', '0|1', '1|0'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  1|1  1|0
1  chr1  101  .   T   C    .      .    .     GT  0|0  0|1
2  chr1  102  .   A   T    .      .    .     GT  1|0  1|0
>>> vf1.phased
True
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'C': ['1|1', '0/0', '1|0'],
...     'D': ['1|0', '0/1', '1|0'],
... }
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    C    D
0  chr1  100  .   G   A    .      .    .     GT  1|1  1|0
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/1
2  chr1  102  .   A   T    .      .    .     GT  1|0  1|0
>>> vf2.phased
False
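
The rule described above can be sketched as a small stand-alone function. This is an illustration under the assumption that a genotype counts as phased when its alleles are separated by '|' rather than '/'; the helper name is_phased is hypothetical and is not the fuc implementation.

```python
def is_phased(genotypes):
    """Return True only if there is at least one genotype and all are phased.

    Hypothetical helper: a genotype is treated as phased when it uses
    the '|' separator and contains no '/'.
    """
    genotypes = list(genotypes)
    if not genotypes:
        # Mirrors the documented behavior for an empty VcfFrame.
        return False
    return all("|" in gt and "/" not in gt for gt in genotypes)


# Genotypes of samples A and B, then C and D, from the example above.
phased_gts = ["1|1", "0|0", "1|0", "1|0", "0|1", "1|0"]
mixed_gts = ["1|1", "0/0", "1|0", "1|0", "0/1", "1|0"]
print(is_phased(phased_gts), is_phased(mixed_gts), is_phased([]))
```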
plot_comparison(a, b, c=None, labels=None, ax=None, figsize=None)[source]

Create a Venn diagram showing genotype concordance between groups.

This method supports comparison between two groups (Groups A & B) as well as three groups (Groups A, B, & C).

Parameters
  • a, b (list) – Sample names. The lists must have the same shape.

  • c (list, optional) – Same as above.

  • labels (list, optional) – List of labels to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • matplotlib_venn._common.VennDiagram – VennDiagram object.

Examples

>>> from fuc import pyvcf, common
>>> common.load_dataset('pyvcf')
>>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
>>> vf = pyvcf.VcfFrame.from_file(f)
>>> a = ['Steven_A', 'John_A', 'Sara_A']
>>> b = ['Steven_B', 'John_B', 'Sara_B']
>>> c = ['Steven_C', 'John_C', 'Sara_C']
>>> vf.plot_comparison(a, b)
_images/api-43.png
>>> vf.plot_comparison(a, b, c)
_images/api-44.png
plot_hist_format(k, af=None, group_col=None, group_order=None, kde=True, ax=None, figsize=None, **kwargs)[source]

Create a histogram showing the distribution of data for the specified FORMAT key.

Parameters
  • k (str) – One of the special FORMAT keys as defined in VcfFrame.extract_format().

  • af (common.AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • kde (bool, default: True) – Compute a kernel density estimate to smooth the distribution.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.histplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/normal-tumor.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_hist_format('#DP')
_images/api-45.png

We can draw multiple histograms with hue mapping:

>>> annot_file = '~/fuc-data/pyvcf/normal-tumor-annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col='Sample')
>>> vf.plot_hist_format('#DP', af=af, group_col='Tissue')
_images/api-46.png

We can show the REF allele fraction instead of DP:

>>> vf.plot_hist_format('#AD_FRAC_REF')
_images/api-47.png
plot_hist_info(k, kde=True, ax=None, figsize=None, **kwargs)[source]

Create a histogram showing the distribution of data for the specified INFO key.

Parameters
  • k (str) – One of the special INFO keys as defined in VcfFrame.extract_info().

  • kde (bool, default: True) – Compute a kernel density estimate to smooth the distribution.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.histplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/getrm-cyp2d6-vdr.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_hist_info('#AC')
_images/api-48.png

We can show AF instead of AC:

>>> vf.plot_hist_info('#AF')
_images/api-49.png
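
For readers unfamiliar with the INFO column, the values behind such a histogram come from 'KEY=VALUE' entries separated by semicolons. The following stand-alone sketch shows one way to collect them; parse_info and info_values are hypothetical helpers, and taking only the first of several comma-separated values is a simplifying assumption rather than documented fuc behavior.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AC=2;AF=0.5;DB' into a dict.

    Flag entries without '=' are stored with the value True.
    """
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def info_values(infos, key):
    """Collect numeric values of one INFO key across records.

    Records lacking the key are skipped. Assumption: for multi-valued
    entries such as 'AC=1,2', only the first value is used.
    """
    values = []
    for info in infos:
        value = parse_info(info).get(key)
        if isinstance(value, str):
            values.append(float(value.split(",")[0]))
    return values


records = ["AC=2;AF=0.5", ".", "DB;AC=1;AF=0.25"]
print(info_values(records, "AC"))
print(info_values(records, "AF"))
```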
plot_rainfall(sample, palette=None, ax=None, figsize=None, legend='auto', **kwargs)[source]

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_rainfall() method.

Parameters
  • sample (str) – Name of the sample.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_rainfall

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('brca')
>>> vcf_file = '~/fuc-data/brca/brca.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_rainfall('TCGA-A8-A08B',
...                  figsize=(14, 7),
...                  palette=sns.color_palette('Set2')[:6])
>>> plt.tight_layout()
_images/api-50.png
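
The quantity shown in a rainfall plot, inter-variant distance, can be illustrated with a small stand-alone sketch. The helper inter_variant_distances is hypothetical and is not part of fuc; it assumes the distance is measured from each variant to the previous variant on the same chromosome.

```python
from collections import defaultdict


def inter_variant_distances(variants):
    """Return (chrom, pos, distance) for each variant after the first.

    variants is an iterable of (chrom, pos) pairs in any order.
    Hypothetical helper: distance is the gap to the preceding variant
    on the same chromosome, so the first variant of each chromosome
    has no entry.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)
    distances = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for prev, curr in zip(positions, positions[1:]):
            distances.append((chrom, curr, curr - prev))
    return distances


calls = [("chr1", 100), ("chr1", 150), ("chr2", 10), ("chr1", 120)]
print(inter_variant_distances(calls))
```

Clusters of closely spaced variants appear as points with small distances, which is what makes localized hypermutation stand out in this kind of plot.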
plot_region(sample, k='#DP', color=None, region=None, label=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot showing read depth profile of a sample for the specified region.

Parameters
  • sample (str or int) – Name or index of target sample.

  • k (str, default: ‘#DP’) – Genotype key to use for extracting data:

    • ‘#DP’: Return read depth.

    • ‘#AD_REF’: Return REF allele depth.

    • ‘#AD_ALT’: Return ALT allele depth.

    • ‘#AD_FRAC_REF’: Return REF allele fraction.

    • ‘#AD_FRAC_ALT’: Return ALT allele fraction.

  • color (str, optional) – Marker color.

  • region (str, optional) – Target region (‘chrom:start-end’).

  • label (str, optional) – Label to use for the data points.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to matplotlib.axes.Axes.scatter().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import pyvcf, common
>>> import matplotlib.pyplot as plt
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/getrm-cyp2d6-vdr.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_region('NA18973')
>>> plt.tight_layout()
_images/api-51.png

We can display allele fraction of REF and ALT instead of DP:

>>> ax = vf.plot_region('NA18973', k='#AD_FRAC_REF', label='REF')
>>> vf.plot_region('NA18973', k='#AD_FRAC_ALT', label='ALT', ax=ax)
>>> plt.tight_layout()
_images/api-52.png
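
As a rough illustration of what the allele fraction keys represent, the sketch below derives REF and ALT fractions from a sample's AD subfield. It assumes each fraction is the allele's depth divided by the sum of all AD values, which may differ from how fuc computes '#AD_FRAC_REF' and '#AD_FRAC_ALT'; the helper name allele_fractions is hypothetical.

```python
def allele_fractions(fmt, genotype):
    """Return (ref_fraction, alt_fraction) or None if AD is unusable.

    fmt is the FORMAT string (for example 'GT:AD:DP') and genotype is
    the matching sample string (for example '0/1:30,10:40').
    Assumption: fraction = allele depth / sum of all AD values.
    """
    subfields = dict(zip(fmt.split(":"), genotype.split(":")))
    ad = subfields.get("AD")
    if ad is None or "." in ad.split(","):
        return None
    depths = [int(x) for x in ad.split(",")]
    total = sum(depths)
    if len(depths) < 2 or total == 0:
        return None
    return depths[0] / total, depths[1] / total


print(allele_fractions("GT:AD:DP", "0/1:30,10:40"))
print(allele_fractions("GT:AD:DP", "./.:.:."))
```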
plot_regplot(a, b, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot with a linear regression model fit visualizing correlation between TMB in two sample groups.

The method will automatically calculate and print summary statistics including R-squared and p-value.

Parameters
  • a, b (array-like) – Lists of sample names. The lists must have the same shape.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.regplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/normal-tumor.vcf'
>>> annot_file = '~/fuc-data/pyvcf/normal-tumor-annot.tsv'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col='Sample')
>>> normal = af.df[af.df.Tissue == 'Normal'].index
>>> normal.name = 'Normal'
>>> tumor = af.df[af.df.Tissue == 'Tumor'].index
>>> tumor.name = 'Tumor'
>>> vf.plot_regplot(normal, tumor)
Results for B ~ A:
R^2 = 0.01
P = 7.17e-01
>>> plt.tight_layout()
_images/api-53.png
plot_snvclsc(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

A grouped bar plot can be created with group_col (requires an AnnFrame).

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclsc() method.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclsc

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_snvclsc(palette=sns.color_palette('Pastel1'))
>>> plt.tight_layout()
_images/api-54.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_snvclsc(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-55.png
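
The "six SNV classes" are commonly obtained by collapsing the twelve possible substitutions onto a pyrimidine (C or T) reference base. The sketch below follows that convention as an assumption; snv_class is a hypothetical helper and is not taken from the fuc source.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def snv_class(ref, alt):
    """Return one of C>A, C>G, C>T, T>A, T>C, T>G, or None for non-SNVs.

    Assumption: substitutions with a purine reference (A or G) are
    reported on the complementary strand, e.g. G>A becomes C>T.
    """
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        return None
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


variants = [("G", "A"), ("T", "C"), ("A", "T"), ("C", "T"), ("AT", "A")]
counts = Counter(c for c in (snv_class(r, a) for r, a in variants) if c)
print(counts)
```

Counting these classes per sample gives the values that a plot of this kind summarizes; indels such as AT>A are ignored because they are not SNVs.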
plot_snvclsp(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclsp() method.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclsp

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_snvclsp(palette=sns.color_palette('Pastel1'))
>>> plt.tight_layout()
_images/api-56.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_snvclsp(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-57.png
plot_snvclss(color=None, colormap=None, width=0.8, legend=True, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclss() method.

Parameters
  • color (list, optional) – List of color tuples. See the Control plot colors tutorial for details.

  • colormap (str or matplotlib colormap object, optional) – Colormap to select colors from. See the Control plot colors tutorial for details.

  • width (float, default: 0.8) – The width of the bars.

  • legend (bool, default: True) – Place legend on axis subplots.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclss

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> ax = vf.plot_snvclss(width=1, color=plt.get_cmap('Pastel1').colors)
>>> ax.legend(loc='upper right')
>>> plt.tight_layout()
_images/api-58.png
plot_titv(af=None, group_col=None, group_order=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the Ti/Tv proportions of samples.

Under the hood, this method simply converts the VcfFrame to the pymaf.MafFrame class and then applies the pymaf.MafFrame.plot_titv() method.
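
As background, a transition (Ti) swaps a purine for a purine or a pyrimidine for a pyrimidine, while a transversion (Tv) swaps between the two families. The stand-alone sketch below computes the per-sample proportions under that standard definition; titv_class and titv_proportions are hypothetical helpers and are not part of fuc.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def titv_class(ref, alt):
    """Return 'Ti', 'Tv', or None when the change is not an SNV."""
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        return None
    # Same chemical family on both sides means a transition.
    same_family = (ref in PURINES) == (alt in PURINES)
    return "Ti" if same_family else "Tv"


def titv_proportions(variants):
    """Return {'Ti': p, 'Tv': q} for a list of (ref, alt) pairs.

    Non-SNVs are ignored; returns None if no SNVs remain.
    """
    classes = [c for c in (titv_class(r, a) for r, a in variants) if c]
    if not classes:
        return None
    ti = classes.count("Ti") / len(classes)
    return {"Ti": ti, "Tv": 1 - ti}


sample_variants = [("G", "A"), ("T", "C"), ("A", "T"), ("AT", "A")]
print(titv_proportions(sample_variants))
```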

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_titv

Similar method for the fuc.api.pymaf.MafFrame class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_titv()
>>> plt.tight_layout()
_images/api-59.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_titv(af=af