API

Introduction

This section describes the application programming interface (API) for the fuc package.

Below is the list of submodules available in the fuc API:

  • common : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.

  • pybam : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the pysam package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.

  • pybed : The pybed submodule is designed for working with BED files. It implements pybed.BedFrame which stores BED data as pandas.DataFrame via the pyranges package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard BED specification.

  • pycov : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements pycov.CovFrame which stores read depth data as pandas.DataFrame via the pysam package to allow fast computation and easy manipulation. The pycov.CovFrame class also contains many useful plotting methods such as CovFrame.plot_region and CovFrame.plot_uniformity.

  • pyfq : The pyfq submodule is designed for working with FASTQ files. It implements pyfq.FqFrame which stores FASTQ data as pandas.DataFrame to allow fast computation and easy manipulation.

  • pygff : The pygff submodule is designed for working with GFF/GTF files. It implements pygff.GffFrame which stores GFF/GTF data as pandas.DataFrame to allow fast computation and easy manipulation. The submodule strictly adheres to the standard GFF specification.

  • pymaf : The pymaf submodule is designed for working with MAF files. It implements pymaf.MafFrame which stores MAF data as pandas.DataFrame to allow fast computation and easy manipulation. The pymaf.MafFrame class also contains many useful plotting methods such as MafFrame.plot_oncoplot and MafFrame.plot_summary. The submodule strictly adheres to the standard MAF specification.

  • pysnpeff : The pysnpeff submodule is designed for parsing VCF annotation data from the SnpEff program. It should be used with pyvcf.VcfFrame.

  • pyvcf : The pyvcf submodule is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. The pyvcf.VcfFrame class also contains many useful plotting methods such as VcfFrame.plot_comparison and VcfFrame.plot_tmb. The submodule strictly adheres to the standard VCF specification.

  • pyvep : The pyvep submodule is designed for parsing VCF annotation data from the Ensembl VEP program. It should be used with pyvcf.VcfFrame.

To get help on a specific submodule (e.g. pyvcf):

from fuc import pyvcf
help(pyvcf)

fuc.common

The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.

Classes:

AnnFrame(df)

Class for storing sample annotation data.

Functions:

color_print(s[, color, bold])

Print colored text.

conda_env()

str : Name of the current conda environment.

convert_file2list(fn)

Convert a text file to a list of filenames.

convert_num2cat(s[, n, decimals])

Convert numeric values to categorical variables.

extract_sequence(fasta, region)

Extract the region's DNA sequence from the FASTA file.

get_most_similar(a, l)

Return the most similar string in a list.

get_similarity(a, b)

Return a value from 0 to 1 representing how similar two strings are.

is_similar(a, b[, threshold])

Return True if the similarity is equal to or greater than threshold.

legend_handles(labels[, colors])

Create custom legend handles.

load_dataset(name[, force])

Load an example dataset from the online repository (requires internet).

parse_region(region)

Parse specified genomic region.

parse_variant(variant)

Parse specified genomic variant.

plot_cytobands(cytoband, bed[, ax, figsize])

Create chromosome ideograms along with BED data.

plot_exons(starts, ends[, name, offset, ...])

Create a gene model where exons are drawn as boxes.

rename(original, names[, indicies])

Rename sample names flexibly.

sort_regions(regions)

Return sorted list of regions.

sort_variants(variants)

Return sorted list of variants.

sumstat(fp, fn, tp, tn)

Return various summary statistics from (FP, FN, TP, TN).

class fuc.api.common.AnnFrame(df)[source]

Class for storing sample annotation data.

This class stores sample annotation data as pandas.DataFrame with sample names as index.

Note that an AnnFrame can have a different set of samples than its accompanying pymaf.MafFrame, pyvcf.VcfFrame, etc.

Parameters

df (pandas.DataFrame) – DataFrame containing sample annotation data. The index must be unique sample names.

See also

AnnFrame.from_dict

Construct AnnFrame from dict of array-like or dicts.

AnnFrame.from_file

Construct AnnFrame from a delimited text file.

Examples

>>> import pandas as pd
>>> from fuc import common
>>> data = {
...     'SampleID': ['A', 'B', 'C', 'D'],
...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
...     'Tissue': ['Normal', 'Tissue', 'Normal', 'Tumor'],
...     'Age': [30, 30, 57, 57]
... }
>>> df = pd.DataFrame(data)
>>> df = df.set_index('SampleID')
>>> af = common.AnnFrame(df)
>>> af.df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
B               P1  Tissue   30
C               P2  Normal   57
D               P2   Tumor   57

Attributes:

df

DataFrame containing sample annotation data.

samples

List of the sample names.

shape

Dimensionality of AnnFrame (samples, annotations).

Methods:

from_dict(data, sample_col)

Construct AnnFrame from dict of array-like or dicts.

from_file(fn, sample_col[, sep])

Construct AnnFrame from a delimited text file.

plot_annot(group_col[, group_order, ...])

Create a categorical heatmap for the selected column using unmatched samples.

plot_annot_matched(patient_col, group_col, ...)

Create a categorical heatmap for the selected column using matched samples.

sorted_samples(by[, mf, keep_empty, nonsyn])

Return a sorted list of sample names.

property df

DataFrame containing sample annotation data.

Type

pandas.DataFrame

classmethod from_dict(data, sample_col)[source]

Construct AnnFrame from dict of array-like or dicts.

The dictionary must contain a column that represents sample names.

Parameters
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • sample_col (str or int) – Column containing unique sample names, either given as string name or column index.

Returns

AnnFrame object.

Return type

AnnFrame

See also

AnnFrame

AnnFrame object creation using constructor.

AnnFrame.from_file

Construct AnnFrame from a delimited text file.

Examples

>>> from fuc import common
>>> data = {
...     'SampleID': ['A', 'B', 'C', 'D'],
...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
...     'Tissue': ['Normal', 'Tissue', 'Normal', 'Tumor'],
...     'Age': [30, 30, 57, 57]
... }
>>> af = common.AnnFrame.from_dict(data, sample_col='SampleID') # or sample_col=0
>>> af.df
         PatientID  Tissue  Age
SampleID
A               P1  Normal   30
B               P1  Tissue   30
C               P2  Normal   57
D               P2   Tumor   57
classmethod from_file(fn, sample_col, sep='\t')[source]

Construct AnnFrame from a delimited text file.

The file must contain a column that represents sample names.

Parameters
  • fn (str) – Text file (zipped or unzipped).

  • sample_col (str or int) – Column containing unique sample names, either given as string name or column index.

  • sep (str, default: ‘\t’) – Delimiter to use.

Returns

AnnFrame object.

Return type

AnnFrame

See also

AnnFrame

AnnFrame object creation using constructor.

AnnFrame.from_dict

Construct AnnFrame from dict of array-like or dicts.

Examples

>>> from fuc import common
>>> af = common.AnnFrame.from_file('sample-annot.tsv', sample_col='SampleID')
>>> af = common.AnnFrame.from_file('sample-annot.csv', sample_col=0, sep=',')
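
The role of sample_col (column name or 0-based column index) can be illustrated with a small standard-library sketch. This is not how fuc implements from_file, which builds a pandas.DataFrame; the helper read_annotations and the inline data are hypothetical and only show the idea of keying rows by a unique sample column.

```python
import csv
import io


def read_annotations(handle, sample_col, sep='\t'):
    """Hypothetical helper: map each sample name to its annotation row."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    # Accept the sample column either by name or by 0-based position.
    idx = sample_col if isinstance(sample_col, int) else header.index(sample_col)
    table = {}
    for row in reader:
        sample = row[idx]
        if sample in table:
            raise ValueError(f'Duplicate sample name: {sample}')
        # Keep every column except the one holding the sample name.
        table[sample] = {
            key: value
            for i, (key, value) in enumerate(zip(header, row))
            if i != idx
        }
    return table


text = 'SampleID,PatientID,Age\nA,P1,30\nB,P1,30\nC,P2,57\n'
by_name = read_annotations(io.StringIO(text), 'SampleID', sep=',')
by_index = read_annotations(io.StringIO(text), 0, sep=',')
print(by_name['C'])
```

Both calls select the same column, so they should produce the same mapping.
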
plot_annot(group_col, group_order=None, samples=None, colors='tab10', sequential=False, xticklabels=True, ax=None, figsize=None)[source]

Create a categorical heatmap for the selected column using unmatched samples.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • samples (list, optional) – Display only specified samples (in that order too).

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

  • sequential (bool, default: False) – Whether the column is sequential data.

  • xticklabels (bool, default: True) – If True, plot the sample names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • list – Legend handles.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> ax, handles = af.plot_annot('FAB_classification', samples=af.samples[:10])
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()

[Figure: _images/api-1.png]

We can display only selected groups:

>>> ax, handles = af.plot_annot('FAB_classification', group_order=['M7', 'M6'])
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()

[Figure: _images/api-2.png]

We can also display sequential data in the following way:

>>> ax, handles = af.plot_annot('FAB_classification',
...                             samples=af.samples[:10],
...                             colors='viridis',
...                             sequential=True)
>>> legend = ax.legend(handles=handles)
>>> ax.add_artist(legend)
>>> plt.tight_layout()

[Figure: _images/api-3.png]

plot_annot_matched(patient_col, group_col, annot_col, patient_order=None, group_order=None, annot_order=None, colors='tab10', sequential=False, xticklabels=True, ax=None, figsize=None)[source]

Create a categorical heatmap for the selected column using matched samples.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • annot_col (str) – Column to plot.

  • patient_order (list, optional) – Plot only specified patients (in that order too).

  • group_order (list, optional) – List of sample group names.

  • annot_order (list, optional) – Plot only specified annotations (in that order too).

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

  • sequential (bool, default: False) – Whether the column is sequential data.

  • xticklabels (bool, default: True) – If True, plot the sample names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • list – Legend handles.

property samples

List of the sample names.

Type

list

property shape

Dimensionality of AnnFrame (samples, annotations).

Type

tuple

sorted_samples(by, mf=None, keep_empty=False, nonsyn=False)[source]

Return a sorted list of sample names.

Parameters

by (str or list) – Column or list of columns to sort by.

fuc.api.common.color_print(s, color='green', bold=False)[source]

Print colored text.

fuc.api.common.conda_env()[source]

str : Name of the current conda environment.

fuc.api.common.convert_file2list(fn)[source]

Convert a text file to a list of filenames.

Parameters

fn (str) – File containing one filename per line.

Returns

List of filenames.

Return type

list

Examples

>>> from fuc import common
>>> common.convert_file2list('bam.list')
['1.bam', '2.bam', '3.bam']
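
A minimal sketch of the same idea using only the standard library is shown below. The helper file_to_list is hypothetical, and skipping blank lines is an assumption rather than documented fuc behaviour.

```python
import os
import tempfile


def file_to_list(fn):
    """Hypothetical helper: return the non-empty lines of a text file."""
    with open(fn) as f:
        # Assumption: blank lines are ignored and whitespace is trimmed.
        return [line.strip() for line in f if line.strip()]


# Write a small list file to a temporary location, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bam.list')
    with open(path, 'w') as f:
        f.write('1.bam\n2.bam\n3.bam\n')
    names = file_to_list(path)

print(names)
```
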
fuc.api.common.convert_num2cat(s, n=5, decimals=0)[source]

Convert numeric values to categorical variables.

Parameters
  • s (pandas.Series) – Series object containing numeric values.

  • n (int, default: 5) – Number of variables to output.

Returns

Series object containing categorical variables.

Return type

pandas.Series

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> s = af.df.days_to_last_followup
>>> s[:10]
Tumor_Sample_Barcode
TCGA-AB-2802     365.0
TCGA-AB-2803     792.0
TCGA-AB-2804    2557.0
TCGA-AB-2805     577.0
TCGA-AB-2806     945.0
TCGA-AB-2807     181.0
TCGA-AB-2808    2861.0
TCGA-AB-2809      62.0
TCGA-AB-2810      31.0
TCGA-AB-2811     243.0
Name: days_to_last_followup, dtype: float64
>>> s = common.convert_num2cat(s)
>>> s.unique()
array([ 572.2, 1144.4, 2861. , 2288.8, 1716.6,    nan])
>>> s[:10]
Tumor_Sample_Barcode
TCGA-AB-2802     572.2
TCGA-AB-2803    1144.4
TCGA-AB-2804    2861.0
TCGA-AB-2805    1144.4
TCGA-AB-2806    1144.4
TCGA-AB-2807     572.2
TCGA-AB-2808    2861.0
TCGA-AB-2809     572.2
TCGA-AB-2810     572.2
TCGA-AB-2811     572.2
Name: days_to_last_followup, dtype: float64
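
In the output above, the unique values are multiples of 572.2 (2861 / 5), which suggests that each number is replaced by the upper edge of one of n equal-width bins. The sketch below reproduces that rule with the standard library only. The binning rule is inferred from the example rather than taken from the fuc source, and num_to_cat and its input values are hypothetical.

```python
import math


def num_to_cat(values, n=5):
    """Hypothetical helper: replace each number with the upper edge of its bin."""
    finite = [v for v in values if not math.isnan(v)]
    lo, hi = min(finite), max(finite)
    # n equal-width bins need n + 1 edges between the minimum and the maximum.
    edges = [lo + (hi - lo) * i / n for i in range(n + 1)]
    edges[-1] = hi  # Guard against rounding so the maximum always fits.
    out = []
    for v in values:
        if math.isnan(v):
            out.append(v)  # Missing values stay missing.
            continue
        # Pick the first bin whose upper edge is not below the value.
        out.append(next(b for b in edges[1:] if v <= b))
    return out


binned = num_to_cat([0.0, 365.0, 792.0, 2557.0, 577.0, 2861.0, float('nan')])
print(binned)
```
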
fuc.api.common.extract_sequence(fasta, region)[source]

Extract the region’s DNA sequence from the FASTA file.

Parameters
  • fasta (str) – FASTA file.

  • region (str) – Region (‘chrom:start-end’).

Returns

DNA sequence. Empty string if there is no matching sequence.

Return type

str
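
The region lookup can be pictured with a small in-memory sketch. It assumes that coordinates are 1-based and inclusive, which is not stated above, and it works on FASTA text rather than an indexed file. The helpers read_fasta and extract and the toy sequences are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {contig name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        if line.startswith('>'):
            # The contig name is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.strip())
    return {k: ''.join(v) for k, v in records.items()}


def extract(records, region):
    """Hypothetical helper: slice 'chrom:start-end' (assumed 1-based, inclusive)."""
    chrom, _, span = region.partition(':')
    if chrom not in records:
        return ''  # No matching sequence.
    start, end = (int(x) for x in span.split('-'))
    return records[chrom][start - 1:end]


fasta_text = '>chr1 toy contig\nACGTAC\nGTTTGA\n>chr2\nGGGCCC\n'
records = read_fasta(fasta_text)
print(extract(records, 'chr1:5-8'))
print(repr(extract(records, 'chr3:1-2')))
```
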

fuc.api.common.get_most_similar(a, l)[source]

Return the most similar string in a list.

fuc.api.common.get_similarity(a, b)[source]

Return a value from 0 to 1 representing how similar two strings are.

fuc.api.common.is_similar(a, b, threshold=0.9)[source]

Return True if the similarity is equal to or greater than threshold.
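
These three functions have no examples above, so here is a hedged sketch of how such a similarity score can be computed with difflib from the standard library. Whether fuc uses difflib internally is an assumption. The names similarity, similar, and most_similar are hypothetical stand-ins for get_similarity, is_similar, and get_most_similar.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio in [0, 1]; 1 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def similar(a, b, threshold=0.9):
    """Return True if the ratio is equal to or greater than the threshold."""
    return similarity(a, b) >= threshold


def most_similar(a, candidates):
    """Return the candidate with the highest ratio to `a`."""
    return max(candidates, key=lambda c: similarity(a, c))


print(similarity('TCGA-AB-2802', 'TCGA-AB-2802'))
print(similar('TCGA-AB-2802', 'TCGA-AB-2803'))
print(most_similar('Tumour', ['Normal', 'Tumor', 'Tissue']))
```
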

fuc.api.common.legend_handles(labels, colors='tab10')[source]

Create custom legend handles.

Parameters
  • labels (list) – List of labels.

  • colors (str or list, default: ‘tab10’) – Colormap name or list of colors.

Returns

List of legend handles.

Return type

list

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> fig, ax = plt.subplots()
>>> handles1 = common.legend_handles(['A', 'B'], colors='tab10')
>>> handles2 = common.legend_handles(['C', 'D'], colors=['yellow', 'green'])
>>> legend1 = ax.legend(handles=handles1, loc='center left')
>>> legend2 = ax.legend(handles=handles2)
>>> ax.add_artist(legend1)
>>> ax.add_artist(legend2)
>>> plt.tight_layout()

[Figure: _images/api-4.png]

fuc.api.common.load_dataset(name, force=False)[source]

Load an example dataset from the online repository (requires internet).

Parameters
fuc.api.common.parse_region(region)[source]

Parse specified genomic region.

The method will return parsed region as a tuple with a shape of (chrom, start, end) which has data types of (str, int, int).

Note that only chrom is required when specifying a region. If start and end are omitted, the method will return NaN in their respective positions in the output tuple.

Parameters

region (str) – Region (‘chrom:start-end’).

Returns

Parsed region.

Return type

tuple

Examples

>>> from fuc import common
>>> common.parse_region('chr1:100-150')
('chr1', 100, 150)
>>> common.parse_region('chr1')
('chr1', nan, nan)
>>> common.parse_region('chr1:100')
('chr1', 100, nan)
>>> common.parse_region('chr1:100-')
('chr1', 100, nan)
>>> common.parse_region('chr1:-100')
('chr1', nan, 100)
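
The parsing rules shown in the examples can be reproduced with a few lines of standard-library code. The function parse_region_sketch is a hypothetical re-implementation written for illustration, not the fuc source.

```python
import math


def parse_region_sketch(region):
    """Hypothetical re-implementation of the documented behaviour."""
    chrom, _, span = region.partition(':')
    start_text, _, end_text = span.partition('-')
    # Missing coordinates are reported as NaN, as in the examples above.
    start = int(start_text) if start_text else math.nan
    end = int(end_text) if end_text else math.nan
    return chrom, start, end


for text in ['chr1:100-150', 'chr1', 'chr1:100', 'chr1:100-', 'chr1:-100']:
    print(text, parse_region_sketch(text))
```

Because NaN never compares equal to itself, check missing coordinates with math.isnan rather than ==.
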
fuc.api.common.parse_variant(variant)[source]

Parse specified genomic variant.

Generally speaking, the input string should consist of chromosome, position, reference allele, and alternative allele separated by any one or combination of the following delimiters: -, :, > (e.g. ‘22-42127941-G-A’). The method will return parsed variant as a tuple with a shape of (chrom, pos, ref, alt) which has data types of (str, int, str, str).

Note that it’s possible to omit reference allele and alternative allele from the input string to indicate position-only data (e.g. ‘22-42127941’). In this case, the method will return empty string for the alleles – i.e. (str, int, '', '') if both are omitted and (str, int, str, '') if only alternative allele is omitted.

Parameters

variant (str) – Genomic variant.

Returns

Parsed variant.

Return type

tuple

Examples

>>> from fuc import common
>>> common.parse_variant('22-42127941-G-A')
('22', 42127941, 'G', 'A')
>>> common.parse_variant('22:42127941-G>A')
('22', 42127941, 'G', 'A')
>>> common.parse_variant('22-42127941')
('22', 42127941, '', '')
>>> common.parse_variant('22-42127941-G')
('22', 42127941, 'G', '')
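
A minimal sketch of the delimiter handling described above, using a regular expression. The function parse_variant_sketch is hypothetical and only mirrors the documented examples; it is not the fuc source.

```python
import re


def parse_variant_sketch(variant):
    """Hypothetical re-implementation of the documented behaviour."""
    # Split on '-', ':' or '>', including runs of these delimiters.
    fields = re.split(r'[-:>]+', variant)
    # Pad with empty strings so that omitted alleles come back as ''.
    fields += [''] * (4 - len(fields))
    chrom, pos, ref, alt = fields[:4]
    return chrom, int(pos), ref, alt


for text in ['22-42127941-G-A', '22:42127941-G>A', '22-42127941', '22-42127941-G']:
    print(text, parse_variant_sketch(text))
```
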
fuc.api.common.plot_cytobands(cytoband, bed, ax=None, figsize=None)[source]

Create chromosome ideograms along with BED data.

The method’s source code is derived from a Python script (ideograms.py) written by Ryan Dale. The original script can be found at: https://gist.github.com/daler/c98fc410282d7570efc3#file-ideograms-py

Parameters
  • cytoband (str) – Text file containing cytoband ideogram information.

  • bed (str) – BED file to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> common.load_dataset('cytoband')
>>> cytoband_file = '~/fuc-data/cytoband/cytoBandIdeo.txt.gz'
>>> bed_file = '~/fuc-data/cytoband/ucsc_genes.bed.gz'
>>> common.plot_cytobands(cytoband_file, bed_file, figsize=(10, 8))

[Figure: _images/api-5.png]

fuc.api.common.plot_exons(starts, ends, name=None, offset=1, fontsize=None, color='black', y=0, height=1, ax=None, figsize=None)[source]

Create a gene model where exons are drawn as boxes.

Parameters
  • starts (list) – List of exon start positions.

  • ends (list) – List of exon end positions.

  • name (str, optional) – Gene name.

  • offset (float, default: 1) – How far gene name should be plotted from the gene model.

  • color (str, default: ‘black’) – Box color.

  • y (float, default: 0) – Y position of the backbone.

  • height (float, default: 1) – Height of the gene model.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common
>>> cyp2d6_starts = [42522500, 42522852, 42523448, 42523843, 42524175, 42524785, 42525034, 42525739, 42526613]
>>> cyp2d6_ends = [42522754, 42522994, 42523636, 42523985, 42524352, 42524946, 42525187, 42525911, 42526883]
>>> ax = common.plot_exons(cyp2d6_starts, cyp2d6_ends, name='CYP2D6', fontsize=20)
>>> ax.set_ylim([-2, 2])
>>> plt.tight_layout()

[Figure: _images/api-6.png]

fuc.api.common.rename(original, names, indicies=None)[source]

Rename sample names flexibly.

Parameters
  • original (list) – List of original names.

  • names (dict or list) – Dict of old names to new names or list of new names.

  • indicies (list or tuple, optional) – List of 0-based sample indices. Alternatively, a tuple (int, int) can be used to specify an index range.

Returns

List of updated names.

Return type

list

Examples

>>> from fuc import common
>>> original = ['A', 'B', 'C', 'D']
>>> common.rename(original, ['1', '2', '3', '4'])
['1', '2', '3', '4']
>>> common.rename(original, {'B': '2', 'C': '3'})
['A', '2', '3', 'D']
>>> common.rename(original, ['2', '4'], indicies=[1, 3])
['A', '2', 'C', '4']
>>> common.rename(original, ['2', '3'], indicies=(1, 3))
['A', '2', '3', 'D']
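
The three renaming modes can be summarised in a short sketch. The function rename_sketch is a hypothetical re-implementation that only mirrors the examples above. Reading the tuple form as a half-open range is inferred from the last example, where (1, 3) updates positions 1 and 2.

```python
def rename_sketch(original, names, indicies=None):
    """Hypothetical re-implementation; `indicies` keeps the documented spelling."""
    if isinstance(names, dict):
        # Mapping of old names to new names; unmapped names are kept.
        return [names.get(x, x) for x in original]
    if indicies is None:
        targets = list(range(len(original)))
    elif isinstance(indicies, tuple):
        # (start, end) is read as a half-open range of positions.
        targets = list(range(indicies[0], indicies[1]))
    else:
        targets = list(indicies)
    if len(targets) != len(names):
        raise ValueError('Number of new names does not match the targets')
    updated = list(original)
    for i, name in zip(targets, names):
        updated[i] = name
    return updated


original = ['A', 'B', 'C', 'D']
print(rename_sketch(original, {'B': '2', 'C': '3'}))
print(rename_sketch(original, ['2', '4'], indicies=[1, 3]))
print(rename_sketch(original, ['2', '3'], indicies=(1, 3)))
```
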
fuc.api.common.sort_regions(regions)[source]

Return sorted list of regions.

Parameters

regions (list) – List of regions.

Returns

Sorted list.

Return type

list

Examples

>>> from fuc import common
>>> regions = ['22:1000-1500', '16:100-200', '22:200-300']
>>> sorted(regions) # Lexicographic sorting (not what we want)
['16:100-200', '22:1000-1500', '22:200-300']
>>> common.sort_regions(regions)
['16:100-200', '22:200-300', '22:1000-1500']
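
The difference from lexicographic sorting comes down to the sort key. One possible key is sketched below with the standard library. The function region_key is hypothetical, and placing non-numeric contigs (X, Y, ...) after the numeric ones is an assumption, not documented fuc behaviour.

```python
def region_key(region):
    """Hypothetical sort key: (chromosome rank, start, end)."""
    chrom, _, span = region.partition(':')
    start_text, _, end_text = span.partition('-')
    label = chrom[3:] if chrom.startswith('chr') else chrom
    # Numeric contigs sort by value; others (X, Y, MT, ...) follow by name.
    rank = (0, int(label), '') if label.isdigit() else (1, 0, label)
    return rank, int(start_text or 0), int(end_text or 0)


regions = ['22:1000-1500', '16:100-200', '22:200-300', 'X:5-10', '2:7-9']
print(sorted(regions, key=region_key))
```
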
fuc.api.common.sort_variants(variants)[source]

Return sorted list of variants.

Parameters

variants (list) – List of variants.

Returns

Sorted list.

Return type

list

Examples

>>> from fuc import common
>>> variants = ['5-200-G-T', '5:100:T:C', '1:100:A>C', '10-100-G-C']
>>> sorted(variants) # Lexicographic sorting (not what we want)
['10-100-G-C', '1:100:A>C', '5-200-G-T', '5:100:T:C']
>>> common.sort_variants(variants)
['1:100:A>C', '5:100:T:C', '5-200-G-T', '10-100-G-C']
fuc.api.common.sumstat(fp, fn, tp, tn)[source]

Return various summary statistics from (FP, FN, TP, TN).

This method will return the following statistics:

Terminology                                                 Derivation
----------------------------------------------------------  ------------------------------------------
sensitivity, recall, hit rate, or true positive rate (TPR)  \(TPR = TP / P = TP / (TP + FN) = 1 - FNR\)
specificity, selectivity or true negative rate (TNR)        \(TNR = TN / N = TN / (TN + FP) = 1 - FPR\)
precision or positive predictive value (PPV)                \(PPV = TP / (TP + FP) = 1 - FDR\)
negative predictive value (NPV)                             \(NPV = TN / (TN + FN) = 1 - FOR\)
miss rate or false negative rate (FNR)                      \(FNR = FN / P = FN / (FN + TP) = 1 - TPR\)
fall-out or false positive rate (FPR)                       \(FPR = FP / N = FP / (FP + TN) = 1 - TNR\)
false discovery rate (FDR)                                  \(FDR = FP / (FP + TP) = 1 - PPV\)
false omission rate (FOR)                                   \(FOR = FN / (FN + TN) = 1 - NPV\)
accuracy (ACC)                                              \(ACC = (TP + TN)/(TP + TN + FP + FN)\)

Parameters

fp, fn, tp, tn (int) – Input statistics.

Returns

Dictionary containing summary statistics.

Return type

dict

Examples

This example is taken directly from the Wikipedia article Sensitivity and specificity.

>>> from fuc import common
>>> results = common.sumstat(180, 10, 20, 1820)
>>> for k, v in results.items():
...     print(k, f'{v:.3f}')
...
tpr 0.667
tnr 0.910
ppv 0.100
npv 0.995
fnr 0.333
fpr 0.090
fdr 0.900
for 0.005
acc 0.906
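
The formulas listed above are simple enough to check by hand. The sketch below recomputes them in plain Python for the same counts. The function sumstat_sketch is a hypothetical re-implementation for illustration, and the order of its keys is not meant to match the real function.

```python
def sumstat_sketch(fp, fn, tp, tn):
    """Hypothetical re-implementation of the formulas listed above."""
    stats = {
        'tpr': tp / (tp + fn),
        'tnr': tn / (tn + fp),
        'ppv': tp / (tp + fp),
        'npv': tn / (tn + fn),
        'acc': (tp + tn) / (tp + tn + fp + fn),
    }
    # The remaining rates are complements of the ones above.
    stats['fnr'] = 1 - stats['tpr']
    stats['fpr'] = 1 - stats['tnr']
    stats['fdr'] = 1 - stats['ppv']
    stats['for'] = 1 - stats['npv']
    return stats


results = sumstat_sketch(180, 10, 20, 1820)
print({k: round(v, 3) for k, v in results.items()})
```
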

fuc.pybam

The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the pysam package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.

Functions:

count_allelic_depth(bam, sites)

Count allelic depth for specified sites.

has_chr_prefix(fn)

Return True if contigs have the (annoying) 'chr' string.

tag_sm(fn)

Extract SM tags (sample names) from a BAM file.

tag_sn(fn)

Extract SN tags (contig names) from a BAM file.

fuc.api.pybam.count_allelic_depth(bam, sites)[source]

Count allelic depth for specified sites.

Parameters
  • bam (str) – BAM file.

  • sites (str or list) – Genomic site or list of sites. Each site should consist of chromosome and 1-based position in the format that can be recognized by common.parse_variant() (e.g. ‘22-42127941’).

Returns

DataFrame containing allelic depth.

Return type

pandas.DataFrame

Examples

>>> from fuc import pybam
>>> pybam.count_allelic_depth('in.bam', ['19-41510048', '19-41510053', '19-41510062'])
  Chromosome  Position  Total    A  C    G    T  N  DEL  INS
0         19  41510048    119  106  7    4    0  0    2    0
1         19  41510053    120    1  2    0  116  0    0    1
2         19  41510062    115    0  0  115    0  0    0    0
fuc.api.pybam.has_chr_prefix(fn)[source]

Return True if contigs have the (annoying) ‘chr’ string.

Parameters

fn (str) – BAM file.

Returns

Whether the ‘chr’ string is found.

Return type

bool

fuc.api.pybam.tag_sm(fn)[source]

Extract SM tags (sample names) from a BAM file.

Parameters

fn (str) – BAM file.

Returns

List of SM tags.

Return type

list

Examples

>>> from fuc import pybam
>>> pybam.tag_sm('NA19920.bam')
['NA19920']
fuc.api.pybam.tag_sn(fn)[source]

Extract SN tags (contig names) from a BAM file.

Parameters

fn (str) – BAM file.

Returns

List of SN tags.

Return type

list

Examples

>>> from fuc import pybam
>>> pybam.tag_sn('NA19920.bam')
['chr3', 'chr15', 'chrY', 'chr19', 'chr22', 'chr5', 'chr18', 'chr14', 'chr11', 'chr20', 'chr21', 'chr16', 'chr10', 'chr13', 'chr9', 'chr2', 'chr17', 'chr12', 'chr6', 'chrM', 'chrX', 'chr4', 'chr8', 'chr1', 'chr7']
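
For orientation, SM tags live on @RG header lines and SN tags on @SQ header lines of a SAM/BAM file. The sketch below extracts them from SAM header text with the standard library, whereas pybam reads them from the file through pysam. The helper header_tags and the toy header are hypothetical, and dropping repeated values is an assumption.

```python
def header_tags(header_text, record_type, tag):
    """Hypothetical helper: collect one tag from one kind of SAM header line."""
    values = []
    for line in header_text.splitlines():
        fields = line.split('\t')
        if fields[0] != record_type:
            continue
        for field in fields[1:]:
            key, _, value = field.partition(':')
            # Assumption: repeated values are reported only once.
            if key == tag and value not in values:
                values.append(value)
    return values


header = (
    '@HD\tVN:1.6\tSO:coordinate\n'
    '@SQ\tSN:chr1\tLN:248956422\n'
    '@SQ\tSN:chr2\tLN:242193529\n'
    '@RG\tID:rg1\tSM:NA19920\n'
    '@RG\tID:rg2\tSM:NA19920\n'
)
print(header_tags(header, '@RG', 'SM'))
print(header_tags(header, '@SQ', 'SN'))
```
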

fuc.pybed

The pybed submodule is designed for working with BED files. It implements pybed.BedFrame which stores BED data as pandas.DataFrame via the pyranges package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard BED specification.

BED lines can have the following fields (the first three are required):

No.  Name         Description                        Examples
---  -----------  ---------------------------------  ------------
1    Chromosome   Chromosome                         ‘chr2’, ‘2’
2    Start        Start position                     10041, 23042
3    End          End position                       10041, 23042
4    Name         Feature name                       ‘TP53’
5    Score        Score for color density (0, 1000)  342, 544
6    Strand       ‘+’ or ‘-’ (‘.’ for no strand)     ‘+’, ‘-’
7    ThickStart   Start position for thick drawing   10041, 23042
8    ThickEnd     End position for thick drawing     10041, 23042
9    ItemRGB      RGB value                          ‘255,0,0’
10   BlockCount   Number of blocks (e.g. exons)      12, 8
11   BlockSizes   ‘,’-separated block sizes          ‘224,423’
12   BlockStarts  ‘,’-separated block starts         ‘2345,5245’
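
To make the field layout concrete, the following standard-library sketch maps one tab-delimited BED line to the field names listed above and enforces the three required fields. It is independent of pybed and pyranges. The helper parse_bed_line and the choice of which fields to convert to integers are assumptions made for illustration.

```python
BED_FIELDS = [
    'Chromosome', 'Start', 'End', 'Name', 'Score', 'Strand',
    'ThickStart', 'ThickEnd', 'ItemRGB', 'BlockCount',
    'BlockSizes', 'BlockStarts',
]
INT_FIELDS = ['Start', 'End', 'Score', 'ThickStart', 'ThickEnd', 'BlockCount']


def parse_bed_line(line):
    """Hypothetical helper: map a tab-delimited BED line to named fields."""
    values = line.rstrip('\n').split('\t')
    if len(values) < 3:
        raise ValueError('A BED line needs at least Chromosome, Start and End')
    # zip stops at the shorter input, so optional fields may be absent.
    record = dict(zip(BED_FIELDS, values))
    for name in INT_FIELDS:
        if name in record:
            record[name] = int(record[name])
    return record


print(parse_bed_line('chr2\t10041\t23042'))
print(parse_bed_line('chr2\t10041\t23042\tTP53\t342\t+'))
```
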

Classes:

BedFrame(meta, gr)

Class for storing BED data.

class fuc.api.pybed.BedFrame(meta, gr)[source]

Class for storing BED data.

Parameters
  • meta (list) – Metadata lines.

  • gr (pyranges.PyRanges) – PyRanges object containing BED data.

See also

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> import pandas as pd
>>> import pyranges as pr
>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> df = pd.DataFrame(data)
>>> gr = pr.PyRanges(df)
>>> bf = pybed.BedFrame([], gr)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200

Attributes:

contigs

List of contig names.

gr

Two-dimensional representation of genomic intervals and their annotations.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

meta

Metadata lines.

shape

Dimensionality of BedFrame (intervals, columns).

Methods:

copy_meta()

Return a copy of the metadata.

from_dict(meta, data)

Construct BedFrame from a dict of array-like or dicts.

from_file(fn)

Construct BedFrame from a BED file.

from_frame(meta, data)

Construct BedFrame from a dataframe.

from_regions(meta, regions)

Construct BedFrame from a list of regions.

intersect(other)

Find intersection between the BedFrames.

merge()

Merge overlapping intervals within BedFrame.

sort()

Sort the BedFrame by chromosome and position.

to_file(fn)

Write the BedFrame to a BED file.

to_regions([merge])

Return a list of regions from BedFrame.

to_string()

Render the BedFrame to a console-friendly tabular output.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the Chromosome column.

property contigs

List of contig names.

Type

list

copy_meta()[source]

Return a copy of the metadata.

classmethod from_dict(meta, data)[source]

Construct BedFrame from a dict of array-like or dicts.

Parameters
  • meta (list) – Metadata lines.

  • data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
classmethod from_file(fn)[source]

Construct BedFrame from a BED file.

Parameters

fn (str) – BED file path.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_frame

Construct BedFrame from a dataframe.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> from fuc import pybed
>>> bf = pybed.BedFrame.from_file('example.bed')
classmethod from_frame(meta, data)[source]

Construct BedFrame from a dataframe.

Parameters
  • meta (list) – Metadata lines.

  • data (pandas.DataFrame) – DataFrame containing BED data.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_regions

Construct BedFrame from a list of regions.

Examples

>>> import pandas as pd
>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> df = pd.DataFrame(data)
>>> bf = pybed.BedFrame.from_frame([], df)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
classmethod from_regions(meta, regions)[source]

Construct BedFrame from a list of regions.

Parameters
  • meta (list) – Metadata lines.

  • regions (str or list) – Region or list of regions.

Returns

BedFrame object.

Return type

BedFrame

See also

BedFrame

BedFrame object creation using constructor.

BedFrame.from_dict

Construct BedFrame from a dict of array-like or dicts.

BedFrame.from_file

Construct BedFrame from a BED file.

BedFrame.from_frame

Construct BedFrame from a dataframe.

Examples

>>> from fuc import pybed
>>> data = ['chr1:100-200', 'chr2:100-200', 'chr3:100-200']
>>> bf = pybed.BedFrame.from_regions([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    100  200
2       chr3    100  200
property gr

Two-dimensional representation of genomic intervals and their annotations.

Type

pyranges.PyRanges

property has_chr_prefix

Whether the (annoying) ‘chr’ string is found.

Type

bool

intersect(other)[source]

Find intersection between the BedFrames.

merge()[source]

Merge overlapping intervals within BedFrame.

Returns

Merged BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr1', 'chr2', 'chr2', 'chr3', 'chr3'],
...     'Start': [10, 30, 15, 25, 50, 61],
...     'End': [40, 50, 25, 35, 60, 80]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1     10   40
1       chr1     30   50
2       chr2     15   25
3       chr2     25   35
4       chr3     50   60
5       chr3     61   80
>>> bf.merge().gr.df
  Chromosome  Start  End
0       chr1     10   50
1       chr2     15   35
2       chr3     50   60
3       chr3     61   80
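
The merging rule visible in this example (overlapping intervals are merged, and so are touching ones such as chr2 15-25 and 25-35, while chr3 50-60 and 61-80 stay apart) can be sketched without pyranges. The helper merge_intervals is hypothetical and is not how BedFrame.merge is implemented.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Hypothetical helper: merge (chrom, start, end) tuples per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = []
    # Chromosomes are visited in plain string order in this sketch.
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or touching: extend the open interval.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


intervals = [
    ('chr1', 10, 40), ('chr1', 30, 50),
    ('chr2', 15, 25), ('chr2', 25, 35),
    ('chr3', 50, 60), ('chr3', 61, 80),
]
print(merge_intervals(intervals))
```
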
property meta

Metadata lines.

Type

list

property shape

Dimensionality of BedFrame (intervals, columns).

Type

tuple

sort()[source]

Sort the BedFrame by chromosome and position.

Returns

Sorted BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr3', 'chr1'],
...     'Start': [400, 100, 100],
...     'End': [500, 200, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    400  500
1       chr1    100  200
2       chr3    100  200
>>> bf.sort().gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr1    400  500
2       chr3    100  200
to_file(fn)[source]

Write the BedFrame to a BED file.

to_regions(merge=True)[source]

Return a list of regions from BedFrame.

Parameters

merge (bool, default: True) – Whether to merge overlapping intervals.

Returns

List of regions.

Return type

list

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['chr1', 'chr1', 'chr2', 'chr2', 'chr3', 'chr3'],
...     'Start': [10, 30, 15, 25, 50, 61],
...     'End': [40, 50, 25, 35, 60, 80]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.to_regions()
['chr1:10-50', 'chr2:15-35', 'chr3:50-60', 'chr3:61-80']
>>> bf.to_regions(merge=False)
['chr1:10-40', 'chr1:30-50', 'chr2:15-25', 'chr2:25-35', 'chr3:50-60', 'chr3:61-80']
to_string()[source]

Render the BedFrame to a console-friendly tabular output.

update_chr_prefix(mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from the Chromosome column.

Parameters

mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated BedFrame.

Return type

BedFrame

Examples

>>> from fuc import pybed
>>> data = {
...     'Chromosome': ['1', '1', 'chr2', 'chr2'],
...     'Start': [100, 400, 100, 200],
...     'End': [200, 500, 200, 300]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0          1    100  200
1          1    400  500
2       chr2    100  200
3       chr2    200  300
>>> bf.update_chr_prefix(mode='remove').gr.df
  Chromosome  Start  End
0          1    100  200
1          1    400  500
2          2    100  200
3          2    200  300
>>> bf.update_chr_prefix(mode='add').gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr1    400  500
2       chr2    100  200
3       chr2    200  300
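
The example shows that names already in the requested form are left untouched. The following sketch mimics that behaviour on a plain list of contig names. The function below is a hypothetical stand-alone helper, not the fuc implementation.

```python
def update_chr_prefix(names, mode='remove'):
    """Return contig names with the 'chr' prefix removed or added."""
    if mode == 'remove':
        return [n[3:] if n.startswith('chr') else n for n in names]
    if mode == 'add':
        return [n if n.startswith('chr') else 'chr' + n for n in names]
    raise ValueError(f"mode must be 'add' or 'remove', got {mode!r}")

names = ['1', '1', 'chr2', 'chr2']
removed = update_chr_prefix(names, mode='remove')
added = update_chr_prefix(names, mode='add')
```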

fuc.pycov

The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements pycov.CovFrame which stores read depth data as pandas.DataFrame via the pysam package to allow fast computation and easy manipulation. The pycov.CovFrame class also contains many useful plotting methods such as CovFrame.plot_region and CovFrame.plot_uniformity.

Classes:

CovFrame(df)

Class for storing read depth data from one or more SAM/BAM/CRAM files.

Functions:

concat(cfs[, axis])

Concatenate CovFrame objects along a particular axis.

simulate([mode, loc, scale, size])

Simulate read depth data for a single sample.

class fuc.api.pycov.CovFrame(df)[source]

Class for storing read depth data from one or more SAM/BAM/CRAM files.

Parameters

df (pandas.DataFrame) – DataFrame containing read depth data.

See also

CovFrame.from_bam

Construct CovFrame from one or more SAM/BAM/CRAM files.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> df = pd.DataFrame(data)
>>> cf = pycov.CovFrame(df)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  22  23
1       chr1      1001  34  30
2       chr1      1002  33  27
3       chr1      1003  32  21
4       chr1      1004  32  15

Attributes:

contigs

List of contig names.

df

DataFrame containing read depth data.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

samples

List of the sample names.

shape

Dimensionality of CovFrame (positions, samples).

Methods:

copy()

Return a copy of the CovFrame.

copy_df()

Return a copy of the dataframe.

from_bam([bam, fn, bed, region, zero, ...])

Construct CovFrame from one or more SAM/BAM/CRAM files.

from_dict(data)

Construct CovFrame from dict of array-like or dicts.

from_file(fn[, compression])

Construct CovFrame from a TSV file containing read depth data.

mask_bed(bed[, opposite])

Mask rows that overlap with BED data.

plot_distribution([mode, frac, ax, figsize])

Create a line plot visualizing the distribution of per-base read depth.

plot_region(sample[, region, samples, ...])

Create read depth profile for specified region.

plot_uniformity([mode, frac, n, m, marker, ...])

Create a line plot visualizing the uniformity in read depth.

rename(names[, indicies])

Rename the samples.

slice(region)

Slice the CovFrame for the region.

subset(samples[, exclude])

Subset CovFrame for specified samples.

to_file(fn[, compression])

Write the CovFrame to a TSV file.

to_string()

Render the CovFrame to a console-friendly tabular output.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the Chromosome column.

property contigs

List of contig names.

Type

list

copy()[source]

Return a copy of the CovFrame.

copy_df()[source]

Return a copy of the dataframe.

property df

DataFrame containing read depth data.

Type

pandas.DataFrame

classmethod from_bam(bam=None, fn=None, bed=None, region=None, zero=False, map_qual=None, names=None)[source]

Construct CovFrame from one or more SAM/BAM/CRAM files.

Alignment files must be specified with either bam or fn, but it’s an error to use both.

By default, the method will count all reads within the alignment files. You can specify target regions with either bed or region, but not both. When you do this, pay close attention to the ‘chr’ string in contig names (e.g. ‘chr1’ vs. ‘1’). Note also that region requires the input files be indexed.

Under the hood, the method computes read depth using the samtools depth command.

Parameters
  • bam (str or list, optional) – One or more alignment files.

  • fn (str, optional) – File containing one alignment file per line.

  • bed (str, optional) – BED file.

  • region (str, optional) – Target region (‘chrom:start-end’).

  • zero (bool, default: False) – If True, output all positions (including those with zero depth).

  • map_qual (int, optional) – Only count reads with mapping quality greater than or equal to this number.

  • names (list, optional) – Use these as sample names instead of the SM tags.

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_bam(bam)
>>> cf = pycov.CovFrame.from_bam([bam1, bam2])
>>> cf = pycov.CovFrame.from_bam(bam, region='19:41497204-41524301')
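
To make the argument rules above concrete, here is a hedged sketch of how the parameters could map onto a samtools depth command line. The flags used (-a, -Q, -b, -r, -f) are samtools depth options, but the mapping itself and the helper name `build_depth_command` are assumptions made for illustration; how fuc actually assembles or runs the command is not shown in this documentation.

```python
def build_depth_command(bam=None, fn=None, bed=None, region=None,
                        zero=False, map_qual=None):
    """Assemble a samtools depth argument list (nothing is executed)."""
    if (bam is None) == (fn is None):
        raise ValueError("specify exactly one of 'bam' or 'fn'")
    if bed is not None and region is not None:
        raise ValueError("specify at most one of 'bed' or 'region'")
    cmd = ['samtools', 'depth']
    if zero:
        cmd.append('-a')                 # also report zero-depth positions
    if map_qual is not None:
        cmd += ['-Q', str(map_qual)]     # minimum mapping quality
    if bed is not None:
        cmd += ['-b', bed]               # restrict to BED intervals
    if region is not None:
        cmd += ['-r', region]            # restrict to a single region
    if fn is not None:
        cmd += ['-f', fn]                # file listing the alignment files
    else:
        cmd += [bam] if isinstance(bam, str) else list(bam)
    return cmd

cmd = build_depth_command(bam=['a.bam', 'b.bam'],
                          region='19:41497204-41524301', zero=True)
```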
classmethod from_dict(data)[source]

Construct CovFrame from dict of array-like or dicts.

Parameters

data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_bam

Construct CovFrame from one or more SAM/BAM/CRAM files.

CovFrame.from_file

Construct CovFrame from a text file containing read depth data.

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  36  22
1       chr1      1001  39  35
2       chr1      1002  33  19
3       chr1      1003  36  20
4       chr1      1004  31  24
classmethod from_file(fn, compression=False)[source]

Construct CovFrame from a TSV file containing read depth data.

Parameters
  • fn (str or file-like object) – TSV file (zipped or unzipped). By file-like object, we refer to objects with a read() method, such as a file handle.

  • compression (bool, default: False) – If True, use GZIP decompression regardless of filename.

Returns

CovFrame object.

Return type

CovFrame

See also

CovFrame

CovFrame object creation using constructor.

CovFrame.from_bam

Construct CovFrame from one or more SAM/BAM/CRAM files.

CovFrame.from_dict

Construct CovFrame from dict of array-like or dicts.

Examples

>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_file('unzipped.tsv')
>>> cf = pycov.CovFrame.from_file('zipped.tsv.gz')
>>> cf = pycov.CovFrame.from_file('zipped.tsv', compression=True)
property has_chr_prefix

Whether the (annoying) ‘chr’ string is found.

Type

bool

mask_bed(bed, opposite=False)[source]

Mask rows that overlap with BED data.

Parameters
  • bed (pybed.BedFrame or str) – BedFrame object or BED file.

  • opposite (bool, default: False) – If True, mask rows that don’t overlap with BED data.

Returns

Masked CovFrame.

Return type

CovFrame

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov, pybed
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B
0       chr1      1000  34  31
1       chr1      1001  31  20
2       chr1      1002  41  22
3       chr1      1003  28  41
4       chr1      1004  34  23
>>> data = {
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1000, 1003],
...     'End': [1002, 1004]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start   End
0       chr1   1000  1002
1       chr1   1003  1004

We can mask rows that overlap with the BED data:

>>> cf.mask_bed(bf).df.head()
  Chromosome  Position     A     B
0       chr1      1000   NaN   NaN
1       chr1      1001   NaN   NaN
2       chr1      1002  41.0  22.0
3       chr1      1003   NaN   NaN
4       chr1      1004  34.0  23.0

We can also do the opposite:

>>> cf.mask_bed(bf, opposite=True).df.head()
  Chromosome  Position     A     B
0       chr1      1000  34.0  31.0
1       chr1      1001  31.0  20.0
2       chr1      1002   NaN   NaN
3       chr1      1003  28.0  41.0
4       chr1      1004   NaN   NaN
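
The output above is consistent with treating each BED interval as half-open, so that a position is masked when Start <= Position < End. The sketch below reproduces the masking for sample A under that assumption, using None in place of NaN. The helper `mask_positions` is hypothetical and not part of fuc.

```python
def mask_positions(positions, depths, intervals, opposite=False):
    """Replace depths with None where the position falls in an interval.

    Intervals are (start, end) pairs treated as start <= position < end,
    an assumption inferred from the documented example output.
    """
    masked = []
    for pos, depth in zip(positions, depths):
        inside = any(start <= pos < end for start, end in intervals)
        # Mask overlapping rows by default, non-overlapping rows if opposite.
        masked.append(None if inside != opposite else depth)
    return masked

positions = [1000, 1001, 1002, 1003, 1004]
depths = [34, 31, 41, 28, 34]
intervals = [(1000, 1002), (1003, 1004)]
masked = mask_positions(positions, depths, intervals)
inverse = mask_positions(positions, depths, intervals, opposite=True)
```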
plot_distribution(mode='aggregated', frac=0.1, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing the distribution of per-base read depth.

Parameters
  • mode ({‘aggregated’, ‘individual’}, default: ‘aggregated’) – Determines how to display the lines:

    • ‘aggregated’: Aggregate over repeated values to show the mean and 95% confidence interval.

    • ‘individual’: Show data for individual samples.

  • frac (float, default: 0.1) – Fraction of data to be sampled (to speed up the process).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='aggregated'), the method will aggregate over repeated values:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.plot_distribution(mode='aggregated', frac=0.9)
>>> plt.tight_layout()
_images/api-7.png

We can display data for individual samples:

>>> cf.plot_distribution(mode='individual', frac=0.9)
>>> plt.tight_layout()
_images/api-8.png
plot_region(sample, region=None, samples=None, label=None, ax=None, figsize=None, **kwargs)[source]

Create read depth profile for specified region.

Region can be omitted if there is only one contig in the CovFrame.

Parameters
  • region (str, optional) – Target region (‘chrom:start-end’).

  • label (str, optional) – Label to use for the data points.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> ax = cf.plot_region('A')
>>> plt.tight_layout()
_images/api-9.png

We can draw multiple profiles in one plot:

>>> ax = cf.plot_region('A', label='A')
>>> cf.plot_region('B', label='B', ax=ax)
>>> ax.legend()
>>> plt.tight_layout()
_images/api-10.png
plot_uniformity(mode='aggregated', frac=0.1, n=20, m=None, marker=None, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing the uniformity in read depth.

Parameters
  • mode ({‘aggregated’, ‘individual’}, default: ‘aggregated’) – Determines how to display the lines:

    • ‘aggregated’: Aggregate over repeated values to show the mean and 95% confidence interval.

    • ‘individual’: Show data for individual samples.

  • frac (float, default: 0.1) – Fraction of data to be sampled (to speed up the process).

  • n (int or list, default: 20) – Number of evenly spaced points to generate for the x-axis. Alternatively, positions can be manually specified by providing a list.

  • m (float, optional) – Maximum point in the x-axis. By default, it will be the maximum depth in the entire dataset.

  • marker (str, optional) – Marker style string (e.g. ‘o’).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='aggregated'), the method will aggregate over repeated values:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.plot_uniformity(mode='aggregated')
>>> plt.tight_layout()
_images/api-11.png

We can display data for individual samples:

>>> cf.plot_uniformity(mode='individual')
>>> plt.tight_layout()
_images/api-12.png
rename(names, indicies=None)[source]

Rename the samples.

Parameters
  • names (dict or list) – Dict of old names to new names or list of new names.

  • indicies (list or tuple, optional) – List of 0-based sample indices. Alternatively, a tuple (int, int) can be used to specify an index range.

Returns

Updated CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 2,
...     'Position': np.arange(1, 3),
...     'A': pycov.simulate(loc=35, scale=5, size=2),
...     'B': pycov.simulate(loc=25, scale=7, size=2),
...     'C': pycov.simulate(loc=25, scale=7, size=2),
...     'D': pycov.simulate(loc=25, scale=7, size=2),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df
  Chromosome  Position   A   B   C   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['1', '2', '3', '4']).df
  Chromosome  Position   1   2   3   4
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename({'B': '2', 'C': '3'}).df
  Chromosome  Position   A   2   3   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['2', '4'], indicies=[1, 3]).df
  Chromosome  Position   A   2   C   4
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
>>> cf.rename(['2', '3'], indicies=(1, 3)).df
  Chromosome  Position   A   2   3   D
0       chr1         1  31  19  28  15
1       chr1         2  35  24  22  17
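
The four calls above suggest the following selection logic: a dict maps old names to new ones, a list of indices picks individual samples, and a tuple is read as a half-open index range (so (1, 3) covers indices 1 and 2). The sketch below applies that logic to a plain list of sample names. The helper `rename_samples` is hypothetical, and the half-open reading of the tuple is inferred from the example rather than stated in the documentation.

```python
def rename_samples(samples, names, indicies=None):
    """Return a new list of sample names after renaming."""
    samples = list(samples)
    if isinstance(names, dict):
        return [names.get(s, s) for s in samples]
    if indicies is None:
        targets = list(range(len(samples)))
    elif isinstance(indicies, tuple):
        targets = list(range(indicies[0], indicies[1]))  # half-open range
    else:
        targets = list(indicies)
    if len(targets) != len(names):
        raise ValueError('number of names does not match number of targets')
    for i, name in zip(targets, names):
        samples[i] = name
    return samples

samples = ['A', 'B', 'C', 'D']
by_list = rename_samples(samples, ['1', '2', '3', '4'])
by_dict = rename_samples(samples, {'B': '2', 'C': '3'})
by_index = rename_samples(samples, ['2', '4'], indicies=[1, 3])
by_range = rename_samples(samples, ['2', '3'], indicies=(1, 3))
```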
property samples

List of the sample names.

Type

list

property shape

Dimensionality of CovFrame (positions, samples).

Type

tuple

slice(region)[source]

Slice the CovFrame for the region.

Parameters

region (str) – Region (‘chrom:start-end’).

Returns

Sliced CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1']*500 + ['chr2']*500,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.slice('chr2').df.head()
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
>>> cf.slice('chr2:1500-1504').df
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
>>> cf.slice('chr2:-1504').df
  Chromosome  Position   A   B
0       chr2      1500  37  34
1       chr2      1501  28  12
2       chr2      1502  35  29
3       chr2      1503  34  34
4       chr2      1504  32  21
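
The examples use three region forms: a bare contig ('chr2'), a closed range ('chr2:1500-1504'), and a range with the start omitted ('chr2:-1504'). Below is a small sketch of how such strings could be parsed and applied to rows, with both ends inclusive as in the output above. The helpers `parse_region` and `slice_rows` are hypothetical and not part of fuc.

```python
def parse_region(region):
    """Split a region string into (chrom, start, end); missing parts are None."""
    chrom, _, span = region.partition(':')
    if not span:
        return chrom, None, None
    start, _, end = span.partition('-')
    return chrom, int(start) if start else None, int(end) if end else None

def slice_rows(rows, region):
    """Keep (chrom, position, ...) rows that fall inside the region."""
    chrom, start, end = parse_region(region)
    return [
        row for row in rows
        if row[0] == chrom
        and (start is None or row[1] >= start)
        and (end is None or row[1] <= end)
    ]

rows = [('chr1', 1499, 30), ('chr2', 1500, 37),
        ('chr2', 1504, 32), ('chr2', 1505, 29)]
subset = slice_rows(rows, 'chr2:-1504')
```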
subset(samples, exclude=False)[source]

Subset CovFrame for specified samples.

Parameters
  • samples (str or list) – Sample name or list of names (the order matters).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted CovFrame.

Return type

CovFrame

Examples

Assume we have the following data:

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 1000,
...     'Position': np.arange(1000, 2000),
...     'A': pycov.simulate(loc=35, scale=5),
...     'B': pycov.simulate(loc=25, scale=7),
...     'C': pycov.simulate(loc=15, scale=2),
...     'D': pycov.simulate(loc=45, scale=8),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df.head()
  Chromosome  Position   A   B   C   D
0       chr1      1000  30  30  15  37
1       chr1      1001  25  24  11  43
2       chr1      1002  33  24  16  50
3       chr1      1003  29  22  15  46
4       chr1      1004  34  30  11  32

We can subset the CovFrame for the samples A and B:

>>> cf.subset(['A', 'B']).df.head()
  Chromosome  Position   A   B
0       chr1      1000  30  30
1       chr1      1001  25  24
2       chr1      1002  33  24
3       chr1      1003  29  22
4       chr1      1004  34  30

Alternatively, we can exclude those samples:

>>> cf.subset(['A', 'B'], exclude=True).df.head()
  Chromosome  Position   C   D
0       chr1      1000  15  37
1       chr1      1001  11  43
2       chr1      1002  16  50
3       chr1      1003  15  46
4       chr1      1004  11  32
to_file(fn, compression=False)[source]

Write the CovFrame to a TSV file.

If the file name ends with ‘.gz’, the method will automatically use GZIP compression when writing the file.

Parameters
  • fn (str) – TSV file (zipped or unzipped).

  • compression (bool, default: False) – If True, use GZIP compression.
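
The rule described above (compress when the file name ends with '.gz' or when compression is True) can be sketched with the standard library alone. The helper `open_for_writing` is a hypothetical illustration, not the code fuc uses.

```python
import gzip
import os
import tempfile

def open_for_writing(fn, compression=False):
    """Open a text file for writing, gzip-compressed when requested."""
    if compression or fn.endswith('.gz'):
        return gzip.open(fn, 'wt')
    return open(fn, 'w')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'depth.tsv.gz')
    with open_for_writing(path) as f:
        f.write('Chromosome\tPosition\tA\n')
    with open(path, 'rb') as f:
        magic = f.read(2)  # gzip streams start with the bytes 1f 8b
```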

to_string()[source]

Render the CovFrame to a console-friendly tabular output.

Returns

String representation of the CovFrame.

Return type

str

update_chr_prefix(mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from the Chromosome column.

Parameters

mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated CovFrame.

Return type

CovFrame

Examples

>>> import numpy as np
>>> from fuc import pycov
>>> data = {
...     'Chromosome': ['chr1'] * 3 + ['2'] * 3,
...     'Position': np.arange(1, 7),
...     'A': pycov.simulate(loc=35, scale=5, size=6),
...     'B': pycov.simulate(loc=25, scale=7, size=6),
... }
>>> cf = pycov.CovFrame.from_dict(data)
>>> cf.df
  Chromosome  Position   A   B
0       chr1         1  35  25
1       chr1         2  23  14
2       chr1         3  32  23
3          2         4  38  25
4          2         5  33   8
5          2         6  21  22
>>> cf.update_chr_prefix(mode='remove').df
  Chromosome  Position   A   B
0          1         1  35  25
1          1         2  23  14
2          1         3  32  23
3          2         4  38  25
4          2         5  33   8
5          2         6  21  22
>>> cf.update_chr_prefix(mode='add').df
  Chromosome  Position   A   B
0       chr1         1  35  25
1       chr1         2  23  14
2       chr1         3  32  23
3       chr2         4  38  25
4       chr2         5  33   8
5       chr2         6  21  22
fuc.api.pycov.concat(cfs, axis=0)[source]

Concatenate CovFrame objects along a particular axis.

Parameters
  • cfs (list) – List of CovFrame objects.

  • axis ({0/’index’, 1/’columns’}, default: 0) – The axis to concatenate along.

Returns

Concatenated CovFrame.

Return type

CovFrame

fuc.api.pycov.simulate(mode='wgs', loc=30, scale=5, size=1000)[source]

Simulate read depth data for a single sample.

Generated read depth will be integer and non-negative.

Parameters
  • mode ({‘wgs’}, default: ‘wgs’) – Additional modes will be made available in future releases.

  • loc (float, default: 30) – Mean (“centre”) of the distribution.

  • scale (float, default: 5) – Standard deviation (spread or “width”) of the distribution. Must be non-negative.

  • size (int, default: 1000) – Number of base pairs to return.

Returns

Numpy array object.

Return type

numpy.ndarray

Examples

>>> from fuc import pycov
>>> pycov.simulate(size=10)
array([25, 32, 30, 31, 26, 25, 33, 29, 28, 35])
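
For readers without fuc installed, the idea can be approximated with the standard library. The sketch below assumes a normal distribution (suggested by the loc and scale parameter descriptions), rounds each draw to an integer, and clips negative values to zero. Both the clipping and the helper name `simulate_depth` are assumptions; the documentation only states that the result is integer and non-negative.

```python
import random

def simulate_depth(loc=30, scale=5, size=1000, seed=None):
    """Draw integer, non-negative depths from a normal distribution."""
    rng = random.Random(seed)
    return [max(0, round(rng.gauss(loc, scale))) for _ in range(size)]

depths = simulate_depth(size=10, seed=0)
```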

fuc.pyfq

The pyfq submodule is designed for working with FASTQ files. It implements pyfq.FqFrame which stores FASTQ data as pandas.DataFrame to allow fast computation and easy manipulation.

Classes:

FqFrame(df)

Class for storing FASTQ data.

class fuc.api.pyfq.FqFrame(df)[source]

Class for storing FASTQ data.

Methods:

from_file(fn)

Construct FqFrame from a FASTQ file.

readlen()

Return a dictionary of read lengths and their counts.

to_file(file_path)

Write the FqFrame to a FASTQ file.

Attributes:

shape

Number of sequence reads in the FqFrame.

classmethod from_file(fn)[source]

Construct FqFrame from a FASTQ file.

Parameters

fn (str) – FASTQ file path (zipped or unzipped).

Returns

FqFrame.

Return type

FqFrame

See also

FqFrame

FqFrame object creation using constructor.

readlen()[source]

Return a dictionary of read lengths and their counts.
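
As an illustration of the kind of result this method describes, the sketch below counts read lengths directly from FASTQ lines. It assumes plain four-line records (no wrapped sequences), and `read_lengths` is a hypothetical helper rather than the fuc implementation.

```python
from collections import Counter

def read_lengths(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.strip())] += 1
    return dict(counts)

fastq = [
    '@read1', 'ACGT', '+', 'IIII',
    '@read2', 'ACGTAC', '+', 'IIIIII',
    '@read3', 'TTGA', '+', 'IIII',
]
lengths = read_lengths(fastq)
```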

property shape

Number of sequence reads in the FqFrame.

Type

int

to_file(file_path)[source]

Write the FqFrame to a FASTQ file.

fuc.pygff

The pygff submodule is designed for working with GFF/GTF files. It implements pygff.GffFrame which stores GFF/GTF data as pandas.DataFrame to allow fast computation and easy manipulation. The submodule strictly adheres to the standard GFF specification.

A GFF/GTF file contains nine columns as follows:

No.  Name        Description               Examples
1    Seqid       Landmark ID               ‘NC_000001.10’, ‘NC_012920.1’
2    Source      Feature source            ‘RefSeq’, ‘BestRefSeq’, ‘Genescan’, ‘Genebank’
3    Type        Feature type              ‘transcript’, ‘exon’, ‘gene’
4    Start       Start coordinate          11874, 14409
5    End         End coordinate            11874, 14409
6    Score       Feature score             ‘.’, ‘1730.55’, ‘1070’
7    Strand      Feature strand            ‘.’, ‘-’, ‘+’, ‘?’
8    Phase       CDS phase                 ‘.’, ‘0’, ‘1’, ‘2’
9    Attributes  ‘;’-separated attributes  ‘ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA’
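
The Attributes column packs several key=value pairs into one ‘;’-separated string. A minimal sketch of unpacking it into a dict follows. It covers only the GFF3-style key=value form shown in the example (GTF uses a different key "value" syntax), and `parse_attributes` is a hypothetical helper, not a fuc function.

```python
def parse_attributes(field):
    """Parse a GFF3-style 'key=value;key=value' attributes string."""
    attributes = {}
    for item in field.strip().split(';'):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition('=')
        attributes[key.strip()] = value.strip()
    return attributes

field = ('ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;'
         'chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA')
attrs = parse_attributes(field)
```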

Classes:

GffFrame(meta, df, fasta)

Class for storing GFF/GTF data.

class fuc.api.pygff.GffFrame(meta, df, fasta)[source]

Class for storing GFF/GTF data.

Parameters
  • meta (list) – List of metadata lines.

  • df (pandas.DataFrame) – DataFrame containing GFF/GTF data.

  • fasta (str) – FASTA sequence lines.

Attributes:

df

DataFrame containing GFF/GTF data.

fasta

FASTA sequence lines.

meta

List of metadata lines.

Methods:

from_file(fn)

Construct GffFrame from a GFF/GTF file.

protein_length(gene[, name])

Return the protein length of a gene.

property df

DataFrame containing GFF/GTF data.

Type

pandas.DataFrame

property fasta

FASTA sequence lines.

Type

dict

classmethod from_file(fn)[source]

Construct GffFrame from a GFF/GTF file.

Parameters

fn (str) – GFF/GTF file (zipped or unzipped).

Returns

GffFrame object.

Return type

GffFrame

property meta

List of metadata lines.

Type

list

protein_length(gene, name=None)[source]

Return the protein length of a gene.

Parameters
  • gene (str) – Name of the gene.

  • name (str, optional) – Protein sequence ID (e.g. ‘NP_005219.2’). Required when the gene has multiple protein sequences available.

Returns

Protein length.

Return type

int

fuc.pymaf

The pymaf submodule is designed for working with MAF files. It implements pymaf.MafFrame which stores MAF data as pandas.DataFrame to allow fast computation and easy manipulation. The pymaf.MafFrame class also contains many useful plotting methods such as MafFrame.plot_oncoplot and MafFrame.plot_summary. The submodule strictly adheres to the standard MAF specification.

A typical MAF file contains many columns ranging from gene symbol to protein change. However, most of the analysis in pymaf uses the following columns:

No.  Name                    Description          Examples
1    Hugo_Symbol             HUGO gene symbol     ‘TP53’, ‘Unknown’
2    Chromosome              Chromosome name      ‘chr1’, ‘1’, ‘X’
3    Start_Position          Start coordinate     119031351
4    End_Position            End coordinate       44079555
5    Variant_Classification  Translational effect ‘Missense_Mutation’, ‘Silent’
6    Variant_Type            Mutation type        ‘SNP’, ‘INS’, ‘DEL’
7    Reference_Allele        Reference allele     ‘T’, ‘-’, ‘ACAA’
8    Tumor_Seq_Allele1       First tumor allele   ‘A’, ‘-’, ‘TCA’
9    Tumor_Seq_Allele2       Second tumor allele  ‘A’, ‘-’, ‘TCA’
10   Tumor_Sample_Barcode    Sample ID            ‘TCGA-AB-3002’
11   Protein_Change          Protein change       ‘p.L558Q’

It is also recommended to include additional custom columns such as variant allele frequency (VAF) and transcript name.

If sample annotation data are available for a given MAF file, use the common.AnnFrame class to import the data.

There are nine nonsynonymous variant classifications that pymaf primarily uses: Missense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Nonsense_Mutation, Nonstop_Mutation, Splice_Site, and Translation_Start_Site.
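
As a small illustration, the nine classifications listed above can be kept in a set and used to filter records by their Variant_Classification value. The names `NONSYN_CLASSES` and `is_nonsynonymous` are hypothetical and chosen for this sketch; they are not taken from the pymaf source.

```python
NONSYN_CLASSES = frozenset({
    'Missense_Mutation', 'Frame_Shift_Del', 'Frame_Shift_Ins',
    'In_Frame_Del', 'In_Frame_Ins', 'Nonsense_Mutation',
    'Nonstop_Mutation', 'Splice_Site', 'Translation_Start_Site',
})

def is_nonsynonymous(variant_classification):
    """Return True if the classification is one of the nine listed classes."""
    return variant_classification in NONSYN_CLASSES

# Each record is (Hugo_Symbol, Variant_Classification).
records = [
    ('TP53', 'Missense_Mutation'),
    ('TP53', 'Silent'),
    ('FLT3', 'In_Frame_Ins'),
]
kept = [r for r in records if is_nonsynonymous(r[1])]
```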

Classes:

MafFrame(df)

Class for storing MAF data.

class fuc.api.pymaf.MafFrame(df)[source]

Class for storing MAF data.

Parameters

df (pandas.DataFrame) – DataFrame containing MAF data.

See also

MafFrame.from_file

Construct MafFrame from a MAF file.

Methods:

calculate_concordance(a, b[, c, mode])

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

compute_clonality(vaf_col[, threshold])

Compute the clonality of variants based on VAF.

copy()

Return a copy of the MafFrame.

filter_annot(af, expr)

Filter the MafFrame using sample annotation data.

filter_indel([opposite, as_index])

Remove rows with an indel.

from_file(fn)

Construct MafFrame from a MAF file.

from_vcf(vcf[, keys, names])

Construct MafFrame from a VCF file or VcfFrame.

matrix_genes([mode, count])

Compute a matrix of counts with a shape of (genes, variant classifications).

matrix_prevalence()

Compute a matrix of variant counts with a shape of (genes, samples).

matrix_tmb()

Compute a matrix of variant counts with a shape of (samples, variant classifications).

matrix_waterfall([count, keep_empty])

Compute a matrix of variant classifications with a shape of (genes, samples).

matrix_waterfall_matched(af, patient_col, ...)

Compute a matrix of variant classifications with a shape of (gene-group pairs, patients).

plot_clonality(vaf_col[, af, group_col, ...])

Create a bar plot summarizing the clonality of variants in top mutated genes.

plot_comparison(a, b[, c, labels, ax, figsize])

Create a Venn diagram showing genotype concordance between groups.

plot_evolution(samples, vaf_col[, anchor, ...])

Create a line plot visualizing changes in VAF between specified samples.

plot_genepair(x, y, vaf_col[, af, ...])

Create a scatter plot of VAF between Gene X and Gene Y.

plot_genes([mode, count, flip, ax, figsize])

Create a bar plot showing variant distribution for top mutated genes.

plot_interactions([count, cmap, ax, figsize])

Create a heatmap representing mutually exclusive or co-occurring set of genes.

plot_lollipop(gene[, alpha, ax, figsize, legend])

Create a lollipop or stem plot showing amino acid changes of a gene.

plot_matrixg(gene, af, group_col[, ...])

Create a heatmap of count matrix with a shape of (sample groups, protein changes).

plot_matrixs(gene[, samples, c0, c1, l0, ...])

Create a heatmap of presence/absence matrix with a shape of (samples, protein changes).

plot_mutated([af, group_col, group_order, ...])

Create a bar plot visualizing the mutation prevalence of top mutated genes.

plot_mutated_matched(af, patient_col, ...[, ...])

Create a bar plot visualizing the mutation prevalence of top mutated genes.

plot_oncoplot([count, keep_empty, figsize, ...])

Create an oncoplot.

plot_oncoplot_matched(af, patient_col, ...)

Create an oncoplot for matched samples.

plot_rainfall(sample[, palette, legend, ax, ...])

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

plot_regplot(af, group_col, a, b[, a_size, ...])

Create a scatter plot with a linear regression model fit visualizing correlation between gene mutation frequencies in two sample groups A and B.

plot_snvclsc([af, group_col, group_order, ...])

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

plot_snvclsp([af, group_col, group_order, ...])

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

plot_snvclss([samples, color, colormap, ...])

Create a bar plot showing the proportions of the six SNV classes for individual samples.

plot_summary([figsize, title_fontsize, ...])

Create a summary figure for MafFrame.

plot_titv([af, group_col, group_order, ...])

Create a box plot showing the Ti/Tv proportions of samples.

plot_tmb([samples, width, ax, figsize])

Create a bar plot showing the TMB distributions of samples.

plot_tmb_matched(af, patient_col, group_col)

Create a grouped bar plot showing TMB distributions for different group levels in each patient.

plot_vaf(vaf_col[, count, af, group_col, ...])

Create a box plot showing the VAF distributions of top mutated genes.

plot_varcls([ax, figsize])

Create a bar plot for the nonsynonymous variant classes.

plot_varsum([flip, ax, figsize])

Create a summary box plot for variant classifications.

plot_vartype([palette, flip, ax, figsize])

Create a bar plot summarizing the count distributions of variation types for all samples.

plot_waterfall([count, keep_empty, samples, ...])

Create a waterfall plot (oncoplot).

plot_waterfall_matched(af, patient_col, ...)

Create a waterfall plot using matched samples from each patient.

subset(samples[, exclude])

Subset MafFrame for specified samples.

to_file(fn)

Write MafFrame to a MAF file.

to_string()

Render MafFrame to a console-friendly tabular output.

to_vcf([fasta, ignore_indels, cols, names])

Write the MafFrame to a sorted VcfFrame.

variants()

List unique variants in MafFrame.

Attributes:

genes

List of the genes.

samples

List of the sample names.

shape

Dimensionality of MafFrame (variants, samples).

calculate_concordance(a, b, c=None, mode='all')[source]

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

This method will return (Ab, aB, AB, ab) for comparison between two samples and (Abc, aBc, ABc, abC, AbC, aBC, ABC, abc) for three samples. Note that the former is equivalent to (FP, FN, TP, TN) if we assume A is the test sample and B is the truth sample.

Parameters
  • a, b (str or int) – Name or index of Samples A and B.

  • c (str or int, optional) – Name or index of Sample C.

  • mode ({‘all’, ‘snv’, ‘indel’}, default: ‘all’) – Determines which variant types should be analyzed:

    • ‘all’: Include both SNVs and INDELs.

    • ‘snv’: Include SNVs only.

    • ‘indel’: Include INDELs only.

Returns

Four- or eight-element tuple depending on the number of samples.

Return type

tuple

See also

fuc.api.common.sumstat

Return various summary statistics from (FP, FN, TP, TN).

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.calculate_concordance('TCGA-AB-2988', 'TCGA-AB-2869')
(15, 12, 0, 2064)
>>> mf.calculate_concordance('TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009')
(15, 12, 0, 42, 0, 0, 0, 2022)
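
The counting behind the four-element tuple can be sketched in plain Python. This is an illustration only, not the fuc implementation: the function name count_concordance and the boolean input format (True when a sample carries a variant at a site) are assumptions made for this example.

```python
def count_concordance(a, b):
    """Return (Ab, aB, AB, ab) for two equal-length lists of booleans.

    Each element says whether the sample carries a variant at that site
    (hypothetical input format, chosen for illustration).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    pairs = list(zip(a, b))
    Ab = sum(x and not y for x, y in pairs)      # only in A (FP if B is the truth)
    aB = sum(y and not x for x, y in pairs)      # only in B (FN)
    AB = sum(x and y for x, y in pairs)          # in both (TP)
    ab = sum(not x and not y for x, y in pairs)  # in neither (TN)
    return (Ab, aB, AB, ab)


a = [True, True, False, False, True]
b = [True, False, True, False, True]
print(count_concordance(a, b))
```

The four counts always add up to the number of sites compared, which is a quick sanity check on any result.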
compute_clonality(vaf_col, threshold=0.25)[source]

Compute the clonality of variants based on VAF.

A mutation will be defined as “Subclonal” if the VAF is less than the threshold percentage (e.g. 25%) of the highest VAF in the sample and is defined as “Clonal” if it is equal to or above this threshold.

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • threshold (float, default: 0.25) – Minimum VAF, expressed as a fraction of the highest VAF in the sample, to be considered as “Clonal”.

Returns

Clonality for each variant.

Return type

pandas.Series

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.df['Clonality'] = mf.compute_clonality('i_TumorVAF_WU')
>>> mf.df['Clonality'][:10]
0    Clonal
1    Clonal
2    Clonal
3    Clonal
4    Clonal
5    Clonal
6    Clonal
7    Clonal
8    Clonal
9    Clonal
Name: Clonality, dtype: object
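
The thresholding rule described above can be sketched for a single sample as follows. The function name classify_clonality and the list-based input are assumptions for illustration; the actual method operates on a MafFrame column and may differ in detail.

```python
def classify_clonality(vafs, threshold=0.25):
    """Label each VAF from one sample as 'Clonal' or 'Subclonal'.

    Hypothetical helper: the cutoff is `threshold` times the highest VAF
    observed in the sample, as described in the text.
    """
    cutoff = threshold * max(vafs)
    # Strictly below the cutoff is subclonal; equal to or above is clonal.
    return ["Subclonal" if vaf < cutoff else "Clonal" for vaf in vafs]


# Highest VAF is 0.5, so the cutoff is 0.125.
print(classify_clonality([0.5, 0.3, 0.125, 0.05]))
```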
copy()[source]

Return a copy of the MafFrame.

filter_annot(af, expr)[source]

Filter the MafFrame using sample annotation data.

Samples are selected by querying the columns of an AnnFrame with a boolean expression. Samples not present in the MafFrame will be excluded automatically.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • expr (str) – Query expression to evaluate.

Returns

Filtered MafFrame.

Return type

MafFrame

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> mf = pymaf.MafFrame.from_file('~/fuc-data/tcga-laml/tcga_laml.maf.gz')
>>> af = common.AnnFrame.from_file('~/fuc-data/tcga-laml/tcga_laml_annot.tsv', sample_col=0)
>>> filtered_mf = mf.filter_annot(af, "FAB_classification == 'M4'")
filter_indel(opposite=False, as_index=False)[source]

Remove rows with an indel.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of MafFrame.

Returns

Filtered MafFrame or boolean index array.

Return type

MafFrame or pandas.Series

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.filter_indel().df.Variant_Type.unique()
array(['SNP'], dtype=object)
>>> mf.filter_indel(opposite=True).df.Variant_Type.unique()
array(['DEL', 'INS'], dtype=object)
classmethod from_file(fn)[source]

Construct MafFrame from a MAF file.

Parameters

fn (str) – MAF file (zipped or unzipped).

Returns

MafFrame object.

Return type

MafFrame

See also

MafFrame

MafFrame object creation using constructor.

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
classmethod from_vcf(vcf, keys=None, names=None)[source]

Construct MafFrame from a VCF file or VcfFrame.

It is recommended that the input VCF data be functionally annotated by an annotation tool such as Ensembl VEP, SnpEff, or ANNOVAR; however, the method can handle unannotated VCF data as well.

The preferred tool for functional annotation is Ensembl VEP with “RefSeq transcripts” as the transcript database and the filtering option “Show one selected consequence per variant”.

Parameters
  • vcf (str or VcfFrame) – VCF file or VcfFrame.

  • keys (str or list) – Genotype key (e.g. ‘AD’, ‘AF’) or list of genotype keys to be added to the MafFrame.

  • names (str or list) – Column name or list of column names for keys (must be the same length). By default, the genotype keys will be used as column names.

Examples

Below is a simple example:

>>> from fuc import pyvcf, pymaf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['CSQ=T|missense_variant|MODERATE|MTOR|2475|Transcript|NM_001386500.1|protein_coding|47/58||||6792|6644|2215|S/Y|tCt/tAt|rs587777894&COSV63868278&COSV63868313||-1||EntrezGene||||||||G|G||deleterious(0)|possibly_damaging(0.876)||||||||||||||||||likely_pathogenic&pathogenic|0&1&1|1&1&1|26619011&27159400&24631838&26018084&27830187|||||', 'CSQ=C|splice_donor_variant|HIGH|MTOR|2475|Transcript|NM_001386500.1|protein_coding||46/57||||||||||-1||EntrezGene||||||||A|A|||||||||||||||||||||||||||||'],
...     'FORMAT': ['GT:AD:DP:AF', 'GT:AD:DP:AF'],
...     'A': ['0/1:176,37:213:0.174', '0/1:966,98:1064:0.092']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER                                               INFO       FORMAT                      A
0  chr1  100  .   G   A    .      .  CSQ=T|missense_variant|MODERATE|MTOR|2475|Tran...  GT:AD:DP:AF   0/1:176,37:213:0.174
1  chr2  101  .   T   C    .      .  CSQ=C|splice_donor_variant|HIGH|MTOR|2475|Tran...  GT:AD:DP:AF  0/1:966,98:1064:0.092
>>> mf = pymaf.MafFrame.from_vcf(vf)
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode
0        MTOR           2475      .          .       chr1             100           100      -      Missense_Mutation          SNP                G                 A                 A       p.S2215Y                    A
1        MTOR           2475      .          .       chr2             101           101      -            Splice_Site          SNP                T                 C                 C              .                    A

We can add genotype keys such as AD and AF:

>>> mf = pymaf.MafFrame.from_vcf(vf, keys=['AD', 'AF'])
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode      AD     AF
0        MTOR           2475      .          .       chr1             100           100      -      Missense_Mutation          SNP                G                 A                 A       p.S2215Y                    A  176,37  0.174
1        MTOR           2475      .          .       chr2             101           101      -            Splice_Site          SNP                T                 C                 C              .                    A  966,98  0.092

The method can accept a VCF file as input instead of VcfFrame:

>>> mf = pymaf.MafFrame.from_vcf('annotated.vcf')

The method can also handle unannotated VCF data:

>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 200, 300],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'C', 'TTC'],
...     'ALT': ['A', 'CAG', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['0/1', '0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID  REF  ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .    G    A    .      .    .     GT  0/1
1  chr1  200  .    C  CAG    .      .    .     GT  0/1
2  chr1  300  .  TTC    T    .      .    .     GT  0/1
>>> mf = pymaf.MafFrame.from_vcf(vf)
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode
0           .              .      .          .       chr1             100           100      .                      .          SNP                G                 A                 A              .                    A
1           .              .      .          .       chr1             200           201      .                      .          INS                -                AG                AG              .                    A
2           .              .      .          .       chr1             301           302      .                      .          DEL               TC                 -                 -              .                    A
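
The coordinate and allele changes visible in the output above (the shared leading base is dropped for indels and '-' marks the missing allele) can be reproduced with a small sketch. The helper vcf_to_maf_alleles is hypothetical and covers only the three cases shown; it assumes a single ALT allele and that indels share their first base with REF.

```python
def vcf_to_maf_alleles(pos, ref, alt):
    """Return (variant_type, start, end, ref_allele, alt_allele) in MAF style.

    Illustrative only: handles SNVs, simple insertions and simple deletions.
    """
    if len(ref) == 1 and len(alt) == 1:
        return ("SNP", pos, pos, ref, alt)
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        # Insertion: flanking positions, reference allele is '-'.
        return ("INS", pos, pos + 1, "-", alt[1:])
    if len(alt) == 1 and len(ref) > 1 and ref[0] == alt:
        # Deletion: span of the deleted bases, alternate allele is '-'.
        return ("DEL", pos + 1, pos + len(ref) - 1, ref[1:], "-")
    raise NotImplementedError("complex or multi-base substitutions are not covered")


for record in [(100, "G", "A"), (200, "C", "CAG"), (300, "TTC", "T")]:
    print(vcf_to_maf_alleles(*record))
```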
property genes

List of the genes.

Type

list

matrix_genes(mode='variants', count=10)[source]

Compute a matrix of counts with a shape of (genes, variant classifications).

This method only considers the nine nonsynonymous variant classifications.

Parameters
  • mode ({‘variants’, ‘samples’}, default: ‘variants’) – Determines how to identify top mutated genes:

    • ‘variants’: Count the number of observed variants.

    • ‘samples’: Count the number of affected samples. Using this option will create an additional variant classification called ‘Multi_Hit’.

  • count (int, default: 10) – Number of top mutated genes to include.

Returns

The said matrix.

Return type

pandas.DataFrame

matrix_prevalence()[source]

Compute a matrix of variant counts with a shape of (genes, samples).

Returns

The said matrix.

Return type

pandas.DataFrame

matrix_tmb()[source]

Compute a matrix of variant counts with a shape of (samples, variant classifications).

Returns

The said matrix.

Return type

pandas.DataFrame
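
As a rough picture of what a (samples, variant classifications) count matrix contains, the sketch below tallies made-up records with the standard library. The function count_matrix and the nested-dict output are assumptions for illustration; the method itself returns a pandas.DataFrame and may restrict which classifications are counted.

```python
from collections import Counter


def count_matrix(records):
    """Count variants per (sample, classification) pair.

    `records` is an iterable of (sample, classification) tuples
    (hypothetical input format).
    """
    records = list(records)
    counts = Counter(records)
    samples = sorted({sample for sample, _ in records})
    classes = sorted({varcls for _, varcls in records})
    # Counter returns 0 for pairs that were never observed.
    return {s: {c: counts[(s, c)] for c in classes} for s in samples}


records = [
    ("S1", "Missense_Mutation"),
    ("S1", "Missense_Mutation"),
    ("S1", "Nonsense_Mutation"),
    ("S2", "Missense_Mutation"),
]
print(count_matrix(records))
```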

matrix_waterfall(count=10, keep_empty=False)[source]

Compute a matrix of variant classifications with a shape of (genes, samples).

If multiple variant classifications are available for a given cell, they will be replaced with ‘Multi_Hit’.

Parameters
  • count (int, default: 10) – Number of top mutated genes to include.

  • keep_empty (bool, default: False) – If True, keep samples with all NaN’s.

Returns

The said matrix.

Return type

pandas.DataFrame
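
The ‘Multi_Hit’ rule can be illustrated with a short sketch. It assumes that any (gene, sample) cell with more than one variant record counts as having multiple classifications, and it uses None where the real matrix would hold NaN; the name waterfall_matrix and the input format are hypothetical.

```python
from collections import defaultdict


def waterfall_matrix(records, genes, samples):
    """Build a {gene: {sample: classification}} mapping.

    `records` is an iterable of (gene, sample, classification) tuples.
    """
    hits = defaultdict(list)
    for gene, sample, varcls in records:
        hits[(gene, sample)].append(varcls)

    matrix = {}
    for gene in genes:
        row = {}
        for sample in samples:
            found = hits.get((gene, sample), [])
            if not found:
                row[sample] = None          # no variant in this cell
            elif len(found) == 1:
                row[sample] = found[0]      # single classification
            else:
                row[sample] = "Multi_Hit"   # several records collapse to one label
        matrix[gene] = row
    return matrix


records = [
    ("FLT3", "S1", "Missense_Mutation"),
    ("FLT3", "S1", "In_Frame_Ins"),
    ("NPM1", "S2", "Frame_Shift_Ins"),
]
print(waterfall_matrix(records, ["FLT3", "NPM1"], ["S1", "S2"]))
```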

matrix_waterfall_matched(af, patient_col, group_col, group_order, count=10)[source]

Compute a matrix of variant classifications with a shape of (gene-group pairs, patients).

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to include.

Returns

The said matrix.

Return type

pandas.DataFrame

plot_clonality(vaf_col, af=None, group_col=None, group_order=None, count=10, threshold=0.25, subclonal=False, ax=None, figsize=None)[source]

Create a bar plot summarizing the clonality of variants in top mutated genes.

Clonality will be calculated based on VAF using MafFrame.compute_clonality().

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to display.

  • threshold (float, default: 0.25) – VAF threshold percentage.

  • subclonal (bool, default: False) – If True, display subclonality (1 - clonality).

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.compute_clonality

Compute the clonality of variants based on VAF.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_clonality('i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-13.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_clonality('i_TumorVAF_WU',
...                   af=af,
...                   group_col='FAB_classification',
...                   group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-14.png
plot_comparison(a, b, c=None, labels=None, ax=None, figsize=None)[source]

Create a Venn diagram showing genotype concordance between groups.

This method supports comparison between two groups (Groups A & B) as well as three groups (Groups A, B, & C).

Parameters
  • a, b (list) – Sample names. The lists must have the same shape.

  • c (list, optional) – Same as above.

  • labels (list, optional) – List of labels to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • matplotlib_venn._common.VennDiagram – VennDiagram object.

plot_evolution(samples, vaf_col, anchor=None, normalize=True, count=5, ax=None, figsize=None, **kwargs)[source]

Create a line plot visualizing changes in VAF between specified samples.

Parameters
  • samples (list) – List of samples to display.

  • vaf_col (str) – MafFrame column containing VAF data.

  • anchor (str, optional) – Sample to use as the anchor. If absent, use the first sample in the list.

  • normalize (bool, default: True) – If False, do not normalize VAF by the maximum value.

  • count (int, default: 5) – Number of top variants to display.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.lineplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_genepair(x, y, vaf_col, af=None, group_col=None, group_order=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot of VAF between Gene X and Gene Y.

Parameters
  • x, y (str) – Gene names.

  • vaf_col (str) – MafFrame column containing VAF data.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_genepair('DNMT3A', 'FLT3', 'i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-15.png

We can create a grouped scatter plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_genepair('DNMT3A', 'FLT3', 'i_TumorVAF_WU',
...                  af=af,
...                  group_col='FAB_classification')
>>> plt.tight_layout()
_images/api-16.png
plot_genes(mode='variants', count=10, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing variant distribution for top mutated genes.

Parameters
  • mode ({‘variants’, ‘samples’}, default: ‘variants’) – Determines how to identify top mutated genes:

    • ‘variants’: Count the number of observed variants.

    • ‘samples’: Count the number of affected samples. Using this option will create an additional variant classification called ‘Multi_Hit’.

  • count (int, default: 10) – Number of top mutated genes to display.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

By default (mode='variants'), the method identifies top mutated genes by counting the number of observed variants:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_genes()
>>> plt.tight_layout()
_images/api-17.png

We can also identify top mutated genes by counting the number of affected samples:

>>> mf.plot_genes(mode='samples')
>>> plt.tight_layout()
_images/api-18.png
plot_interactions(count=10, cmap=None, ax=None, figsize=None, **kwargs)[source]

Create a heatmap representing mutually exclusive or co-occurring sets of genes.

This method performs pairwise Fisher’s exact tests to detect significant gene pairs.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • cmap (str, optional) – Color map.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_interactions(count=25, cmap='BrBG')
>>> plt.tight_layout()
_images/api-19.png
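
To make the pairwise test concrete, the sketch below builds a 2x2 table for one gene pair and computes a two-sided Fisher's exact p-value with the standard library. It is not the code used by plot_interactions; the helper names and the sample IDs are invented for illustration.

```python
from math import comb


def contingency(x_samples, y_samples, all_samples):
    """Return (both, only_x, only_y, neither) for two genes' mutated samples."""
    x, y, everyone = set(x_samples), set(y_samples), set(all_samples)
    return (len(x & y), len(x - y), len(y - x), len(everyone - x - y))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(k) for k in support if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


samples = [f"S{i}" for i in range(1, 9)]
table = contingency(["S1", "S2", "S3", "S4"], ["S1", "S2", "S3", "S5"], samples)
print(table)
print(fisher_exact_two_sided(*table))
```

A large "both" count relative to the margins suggests co-occurrence, while a small one suggests mutual exclusivity; the p-value alone does not indicate the direction.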
plot_lollipop(gene, alpha=0.7, ax=None, figsize=None, legend=True)[source]

Create a lollipop or stem plot showing amino acid changes of a gene.

Parameters
  • gene (str) – Name of the gene.

  • alpha (float, default: 0.7) – Set the color transparency. Must be within the 0-1 range, inclusive.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_lollipop('DNMT3A')
>>> plt.tight_layout()
_images/api-20.png
plot_matrixg(gene, af, group_col, group_order=None, cbar=True, ax=None, figsize=None, **kwargs)[source]

Create a heatmap of count matrix with a shape of (sample groups, protein changes).

Parameters
  • gene (str) – Name of the gene.

  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • cbar (bool, default: True) – Whether to draw a colorbar.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_matrixg('IDH1', af, 'FAB_classification', linewidth=0.5, square=True, annot=True)
>>> plt.tight_layout()
_images/api-21.png
plot_matrixs(gene, samples=None, c0='lightgray', c1='red', l0='0', l1='1', cbar=True, square=False, ax=None, figsize=None, **kwargs)[source]

Create a heatmap of presence/absence matrix with a shape of (samples, protein changes).

Parameters
  • gene (str) – Name of the gene.

  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • c0 (str, default: ‘lightgray’) – Color for absence.

  • c1 (str, default: ‘red’) – Color for presence.

  • l0 (str, default: ‘0’) – Label for absence.

  • l1 (str, default: ‘1’) – Label for presence.

  • cbar (bool, default: True) – Whether to draw a colorbar.

  • square (bool, default: False) – If True, set the Axes aspect to “equal” so each cell will be square-shaped.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_matrixs('KRAS', linewidth=0.5, square=True)
>>> plt.tight_layout()
_images/api-22.png
plot_mutated(af=None, group_col=None, group_order=None, genes=None, count=10, ax=None, figsize=None)[source]

Create a bar plot visualizing the mutation prevalence of top mutated genes.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • genes (list, optional) – Genes to display. When absent, top mutated genes (count) will be used.

  • count (int, default: 10) – Number of top mutated genes to display. Ignored if genes is specified.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_mutated()
>>> plt.tight_layout()
_images/api-23.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_mutated(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-24.png
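
One way to read "mutation prevalence" is the fraction of samples carrying at least one variant in a gene. The sketch below computes that quantity under this assumption; the function mutation_prevalence and the (gene, sample) input format are hypothetical and not taken from fuc.

```python
def mutation_prevalence(records, n_samples, count=10):
    """Return [(gene, fraction_of_samples_affected)] for the top `count` genes.

    `records` is an iterable of (gene, sample) tuples; repeated hits in the
    same sample are counted once.
    """
    affected = {}
    for gene, sample in records:
        affected.setdefault(gene, set()).add(sample)
    # Rank by number of affected samples, breaking ties alphabetically.
    ranked = sorted(affected.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(gene, len(hit) / n_samples) for gene, hit in ranked[:count]]


records = [
    ("FLT3", "S1"), ("FLT3", "S1"), ("FLT3", "S2"),
    ("DNMT3A", "S1"),
    ("NPM1", "S1"), ("NPM1", "S2"), ("NPM1", "S3"),
]
print(mutation_prevalence(records, n_samples=4))
```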
plot_mutated_matched(af, patient_col, group_col, group_order, ax=None, figsize=None, **kwargs)[source]

Create a bar plot visualizing the mutation prevalence of top mutated genes.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_oncoplot(count=10, keep_empty=False, figsize=(15, 10), label_fontsize=15, ticklabels_fontsize=15, legend_fontsize=15)[source]

Create an oncoplot.

See this tutorial to learn how to create customized oncoplots.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • keep_empty (bool, default: False) – If True, display samples that do not have any mutations.

  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • label_fontsize (float, default: 15) – Font size of labels.

  • ticklabels_fontsize (float, default: 15) – Font size of tick labels.

  • legend_fontsize (float, default: 15) – Font size of legend texts.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_oncoplot()
_images/api-25.png
plot_oncoplot_matched(af, patient_col, group_col, group_order, colors='Set2', figsize=(15, 10), label_fontsize=12, ticklabels_fontsize=12, legend_fontsize=12)[source]

Create an oncoplot for matched samples.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • colors (str) – Colormap name for the sample groups.

  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • label_fontsize (float, default: 12) – Font size of labels.

  • ticklabels_fontsize (float, default: 12) – Font size of tick labels.

  • legend_fontsize (float, default: 12) – Font size of legend texts.

plot_rainfall(sample, palette=None, legend='auto', ax=None, figsize=None, **kwargs)[source]

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

Parameters
  • sample (str) – Name of the sample.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • legend ({‘auto’, ‘brief’, ‘full’, False}, default: ‘auto’) – Display setting of the legend according to seaborn.scatterplot().

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('brca')
>>> maf_file = '~/fuc-data/brca/brca.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_rainfall('TCGA-A8-A08B',
...                  figsize=(14, 7),
...                  palette=sns.color_palette('Set2')[:6])
>>> plt.tight_layout()
_images/api-26.png
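
The quantity on the vertical axis of a rainfall plot, the distance from each variant to the previous one, can be computed as in the sketch below. It assumes distances are taken within each chromosome after sorting by position; the helper inter_variant_distances is illustrative and not part of fuc.

```python
from collections import defaultdict


def inter_variant_distances(variants):
    """Return [(chrom, pos, distance_to_previous_variant)].

    `variants` is an iterable of (chrom, pos) tuples for one sample. The
    first variant on each chromosome has no predecessor and is skipped.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)

    distances = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        for previous, current in zip(positions, positions[1:]):
            distances.append((chrom, current, current - previous))
    return distances


variants = [("chr1", 1150), ("chr1", 100), ("chr1", 1100), ("chr2", 500)]
print(inter_variant_distances(variants))
```

Clusters of very small distances are the regions that stand out in such a plot.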
plot_regplot(af, group_col, a, b, a_size=None, b_size=None, genes=None, count=10, to_csv=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot with a linear regression model fit visualizing correlation between gene mutation frequencies in two sample groups A and B.

Each point in the plot represents a gene.

The method will automatically calculate and print summary statistics including R-squared and p-value.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • group_col (str) – AnnFrame column containing sample group information.

  • a, b (str) – Sample group names.

  • a_size, b_size (int, optional) – Sample group sizes to use as denominator. By default, these are inferred from the MafFrame and AnnFrame objects.

  • genes (list, optional) – Genes to display. When absent, top mutated genes (count) will be used.

  • count (int, default: 10) – Number of top mutated genes to display. Ignored if genes is specified.

  • to_csv (str, optional) – Write the plot’s data to a CSV file.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.regplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_regplot(af, 'FAB_classification', 'M1', 'M2')
Results for M2 ~ M1:
R^2 = 0.43
  P = 3.96e-02
>>> plt.tight_layout()
_images/api-27.png
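
The R-squared value reported above relates per-gene mutation frequencies in the two groups. A minimal sketch of that calculation follows, using made-up counts and group sizes; the helpers mutation_frequencies and r_squared are hypothetical, and the p-value is left out because it needs a t-distribution that the standard library does not provide.

```python
def mutation_frequencies(counts, group_size, genes):
    """Convert per-gene affected-sample counts into frequencies."""
    return [counts.get(gene, 0) / group_size for gene in genes]


def r_squared(x, y):
    """Squared Pearson correlation, equal to R^2 of a simple linear fit.

    Raises ZeroDivisionError if either input has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


genes = ["FLT3", "NPM1", "DNMT3A", "IDH2"]
a_counts = {"FLT3": 10, "NPM1": 8, "DNMT3A": 6, "IDH2": 2}  # made-up values
b_counts = {"FLT3": 6, "NPM1": 9, "DNMT3A": 5, "IDH2": 4}   # made-up values
x = mutation_frequencies(a_counts, 40, genes)
y = mutation_frequencies(b_counts, 30, genes)
print(round(r_squared(x, y), 2))
```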
plot_snvclsc(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

A grouped bar plot can be created with group_col (requires an AnnFrame).

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsp

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

MafFrame.plot_snvclss

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_snvclsc(palette=sns.color_palette('Dark2'))
>>> plt.tight_layout()
_images/api-28.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_snvclsc(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-29.png
plot_snvclsp(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsc

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

MafFrame.plot_snvclss

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_snvclsp(palette=sns.color_palette('Set2'))
>>> plt.tight_layout()
_images/api-30.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_snvclsp(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-31.png
plot_snvclss(samples=None, color=None, colormap=None, width=0.8, legend=True, flip=False, to_csv=None, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Parameters
  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • color (list, optional) – List of color tuples. See the Control plot colors tutorial for details.

  • colormap (str or matplotlib colormap object, optional) – Colormap to select colors from. See the Control plot colors tutorial for details.

  • width (float, default: 0.8) – The width of the bars.

  • legend (bool, default: True) – Place legend on axis subplots.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • to_csv (str, optional) – Write the plot’s data to a CSV file.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

MafFrame.plot_snvclsc

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

MafFrame.plot_snvclsp

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> ax = mf.plot_snvclss(width=1, color=plt.get_cmap('Set2').colors)
>>> ax.legend(loc='upper right')
>>> plt.tight_layout()
_images/api-32.png
plot_summary(figsize=(15, 10), title_fontsize=16, ticklabels_fontsize=12, legend_fontsize=12)[source]

Create a summary figure for MafFrame.

Parameters
  • figsize (tuple, default: (15, 10)) – Width, height in inches. Format: (float, float).

  • title_fontsize (float, default: 16) – Font size of subplot titles.

  • ticklabels_fontsize (float, default: 12) – Font size of tick labels.

  • legend_fontsize (float, default: 12) – Font size of legend texts.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_summary()
_images/api-33.png
plot_titv(af=None, group_col=None, group_order=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the Ti/Tv proportions of samples.
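
As background, a transition (Ti) swaps a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T), and every other single-base substitution is a transversion (Tv). The following is a minimal sketch of how per-sample Ti/Tv proportions could be computed from (REF, ALT) pairs; titv_proportions is a hypothetical helper and the input is made up, so it should not be read as the method's actual implementation.

```python
# The four transition substitutions; all other SNVs are transversions.
TRANSITIONS = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

def titv_proportions(snvs):
    """Return (Ti proportion, Tv proportion) for a list of (REF, ALT) pairs."""
    if not snvs:
        return (0.0, 0.0)
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return (ti / len(snvs), tv / len(snvs))

# Made-up SNVs for one sample: two transitions and two transversions.
sample_snvs = [('A', 'G'), ('C', 'T'), ('C', 'A'), ('G', 'T')]
print(titv_proportions(sample_snvs))
```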

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pyvcf.VcfFrame.plot_titv

Similar method for the fuc.api.pyvcf.VcfFrame class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_titv()
>>> plt.tight_layout()
_images/api-34.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_titv(af=af,
...              group_col='FAB_classification',
...              group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-35.png
plot_tmb(samples=None, width=0.8, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the TMB distributions of samples.

Parameters
  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • width (float, default: 0.8) – The width of the bars.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_tmb(width=1)
>>> plt.tight_layout()
_images/api-36.png
plot_tmb_matched(af, patient_col, group_col, group_order=None, patients=None, legend=True, ax=None, figsize=None, **kwargs)[source]

Create a grouped bar plot showing TMB distributions for different group levels in each patient.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • patients (list, optional) – List of patient names.

  • legend (bool, default: True) – Place legend on axis subplots.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

plot_vaf(vaf_col, count=10, af=None, group_col=None, group_order=None, flip=False, sort=True, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the VAF distributions of top mutated genes.

A grouped box plot can be created with group_col (requires an AnnFrame).

Parameters
  • vaf_col (str) – MafFrame column containing VAF data.

  • count (int, default: 10) – Number of top mutated genes to display.

  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • sort (bool, default: True) – If False, do not sort the genes by median value.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_vaf('i_TumorVAF_WU')
>>> plt.tight_layout()
_images/api-37.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> mf.plot_vaf('i_TumorVAF_WU',
...             af=af,
...             group_col='FAB_classification',
...             group_order=['M1', 'M2', 'M3'],
...             count=5)
>>> plt.tight_layout()
_images/api-38.png
plot_varcls(ax=None, figsize=None, **kwargs)[source]

Create a bar plot for the nonsynonymous variant classes.

Parameters
  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to matplotlib.axes.Axes.bar() and seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_varcls()
>>> plt.tight_layout()
_images/api-39.png
plot_varsum(flip=False, ax=None, figsize=None)[source]

Create a summary box plot for variant classifications.

Parameters
  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_varsum()
>>> plt.tight_layout()
_images/api-40.png
plot_vartype(palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of variation types for all samples.

Parameters
  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_vartype()
>>> plt.tight_layout()
_images/api-41.png
plot_waterfall(count=10, keep_empty=False, samples=None, ax=None, figsize=None, **kwargs)[source]

Create a waterfall plot (oncoplot).

See this tutorial to learn how to create customized oncoplots.

Parameters
  • count (int, default: 10) – Number of top mutated genes to display.

  • keep_empty (bool, default: False) – If True, display samples that do not have any mutations.

  • samples (list, optional) – List of samples to display (in that order too). If samples that are absent in the MafFrame are provided, the method will give a warning but still draw an empty bar for those samples.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.heatmap().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.plot_waterfall(linewidths=0.5)
>>> plt.tight_layout()
_images/api-42.png
plot_waterfall_matched(af, patient_col, group_col, group_order, count=10, ax=None, figsize=None)[source]

Create a waterfall plot using matched samples from each patient.

Parameters
  • af (AnnFrame) – AnnFrame containing sample annotation data.

  • patient_col (str) – AnnFrame column containing patient information.

  • group_col (str) – AnnFrame column containing sample group information.

  • group_order (list) – List of sample group names.

  • count (int, default: 10) – Number of top mutated genes to include.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

property samples

List of the sample names.

Type

list

property shape

Dimensionality of MafFrame (variants, samples).

Type

tuple

subset(samples, exclude=False)[source]

Subset MafFrame for specified samples.

Parameters
  • samples (str, list, or pandas.Series) – Sample name or list of names (the order does not matter).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted MafFrame.

Return type

MafFrame

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.shape
(2207, 193)
>>> mf.subset(['TCGA-AB-2988', 'TCGA-AB-2869']).shape
(27, 2)
>>> mf.subset(['TCGA-AB-2988', 'TCGA-AB-2869'], exclude=True).shape
(2180, 191)
to_file(fn)[source]

Write MafFrame to a MAF file.

Parameters

fn (str) – MAF file path.

to_string()[source]

Render MafFrame to a console-friendly tabular output.

Returns

String representation of MafFrame.

Return type

str

to_vcf(fasta=None, ignore_indels=False, cols=None, names=None)[source]

Write the MafFrame to a sorted VcfFrame.

Converting from MAF to VCF is straightforward for SNVs, but it can be challenging for INDELs and complex events involving multiple nucleotides (e.g. ‘AAGG’ → ‘CCCG’). This is because, in the latter cases, we need to identify the “anchor” nucleotide for each event, which is crucial for constructing a properly formatted VCF. For example, a deletion event ‘AGT’ → ‘-’ in MAF would have to be converted to ‘CAGT’ → ‘C’ in the VCF, where ‘C’ is the anchor nucleotide. The position should be shifted by one as well.

To tackle this issue, the method makes use of a reference assembly (i.e. a FASTA file). If SNVs are your only concern, you do not need a FASTA file and can simply set ignore_indels to True. If you are going to provide a FASTA file, please make sure to select the appropriate one (i.e. one that matches the genome assembly of the MAF). For example, if your MAF is in hg19/GRCh37, use the ‘hs37d5.fa’ file, which can be freely downloaded from the 1000 Genomes Project.
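
The anchoring step described above can be sketched as follows. This is an illustration only, not the code used by to_vcf: maf_indel_to_vcf is a hypothetical helper, the reference is a short in-memory string standing in for a FASTA lookup, and the sketch assumes the common MAF convention in which a deletion's Start_Position is the first deleted base and an insertion's Start_Position is the base immediately before the inserted sequence.

```python
def maf_indel_to_vcf(ref_seq, start, ref, alt):
    """Convert one MAF-style indel to a VCF-style (POS, REF, ALT) tuple.

    ref_seq : reference sequence of the chromosome (position 1 is ref_seq[0])
    start   : MAF Start_Position (1-based)
    ref/alt : MAF alleles, where '-' marks the empty allele
    """
    if ref == '-':
        # Insertion: the base at Start_Position serves as the anchor.
        anchor = ref_seq[start - 1]
        return (start, anchor, anchor + alt)
    if alt == '-':
        # Deletion: the anchor is the base just before the first deleted
        # base, so the VCF position shifts back by one.
        anchor = ref_seq[start - 2]
        return (start - 1, anchor + ref, anchor)
    raise ValueError('Not a simple insertion or deletion')

# Toy reference: positions 1-8 are G, G, C, A, G, T, T, A.
toy_reference = 'GGCAGTTA'

# Deletion of 'AGT' starting at position 4, anchored on the 'C' at position 3.
print(maf_indel_to_vcf(toy_reference, 4, 'AGT', '-'))  # (3, 'CAGT', 'C')
```

Complex events such as ‘AAGG’ → ‘CCCG’ are deliberately left out of this sketch.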

Parameters
  • fasta (str, optional) – FASTA file. Required if ignore_indels is False.

  • ignore_indels (bool, default: False) – If True, do not include INDELs in the VcfFrame. Useful when a FASTA file is not available.

  • cols (str or list, optional) – Column(s) in the MafFrame which contain additional genotype data of interest. If provided, these data will be added to individual sample genotypes (e.g. ‘0/1:0.23’).

  • names (str or list, optional) – Name(s) to be displayed in the FORMAT field (e.g. AD, AF, DP). If not provided, the original column name(s) will be displayed.

Returns

The VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pymaf
>>> mf = pymaf.MafFrame.from_file('in.maf')
>>> vf = mf.to_vcf(fasta='hs37d5.fa')
>>> vf = mf.to_vcf(ignore_indels=True)
>>> vf = mf.to_vcf(fasta='hs37d5.fa', cols='i_TumorVAF_WU', names='AF')
variants()[source]

List unique variants in MafFrame.

Returns

List of unique variants.

Return type

list

Examples

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(maf_file)
>>> mf.variants()[:5]
['1:1571791:1571791:G:A', '1:1747228:1747228:T:G', '1:2418350:2418350:C:T', '1:3328523:3328523:G:A', '1:3638739:3638739:C:T']

fuc.pysnpeff

The pysnpeff submodule is designed for parsing VCF annotation data from the SnpEff program. It should be used with pyvcf.VcfFrame.

One VCF record can have several SnpEff annotations if, for example, the record is a multiallelic site or the variant is shared by multiple genes. When more than one annotation is reported, SnpEff will sort them by their importance. For more details, visit the official website.

For each annotation, SnpEff provides the following data:

  1. Allele - ALT allele.

  2. Annotation - Sequence Ontology terms concatenated using ‘&’.

  3. Annotation_Impact - HIGH, MODERATE, LOW, or MODIFIER.

  4. Gene_Name - Common gene name (HGNC).

  5. Gene_ID - Gene ID.

  6. Feature_Type - Which type of feature is in the next field.

  7. Feature_ID - Transcript ID, Motif ID, miRNA, ChipSeq peak, etc.

  8. Transcript_BioType - Coding or noncoding.

  9. Rank - Exon or Intron rank / total number of exons or introns.

  10. HGVS.c - Variant using HGVS notation (DNA level).

  11. HGVS.p - Variant using HGVS notation (Protein level).

  12. cDNA.pos / cDNA.length - Position in cDNA and transcript’s cDNA length.

  13. CDS.pos / CDS.length - Position and number of coding bases.

  14. AA.pos / AA.length - Position and number of AA.

  15. Distance - All items in this field are options.

  16. ERRORS / WARNINGS - Messages that can affect annotation accuracy.

  17. INFO - Additional information.
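
To make the layout above concrete, the sketch below pulls selected subfields out of the first annotation in an INFO string. It relies on the SnpEff convention that annotations inside the ANN key are separated by ‘,’ and their subfields by ‘|’. The helper names (first_annotation, pick_subfields), the zero-based indexing, and the INFO string itself are assumptions made for illustration; they are not the pysnpeff implementation.

```python
def first_annotation(info):
    """Return the subfields of the first ANN annotation, or [] if absent."""
    for field in info.split(';'):
        if field.startswith('ANN='):
            # Annotations are comma-separated; subfields are pipe-separated.
            return field[len('ANN='):].split(',')[0].split('|')
    return []

def pick_subfields(info, idx, sep=' | '):
    """Join the requested subfields (zero-based positions) of the first annotation."""
    ann = first_annotation(info)
    if not ann:
        return '.'
    return sep.join(ann[i] for i in idx)

# Made-up INFO value carrying two annotations for the same ALT allele.
info = (
    'DP=14;ANN='
    'T|missense_variant|MODERATE|GENE1|ID1|transcript|TX1|protein_coding|'
    '2/5|c.10C>T|p.Arg4Cys|10/900|10/600|4/200||,'
    'T|upstream_gene_variant|MODIFIER|GENE2|ID2|transcript|TX2|'
    'protein_coding||c.-50C>T|||||50|'
)

# Annotation (position 1) and Gene_Name (position 3) of the first annotation.
print(pick_subfields(info, [1, 3]))
```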

Functions:

filter_ann(vf, targets[, include])

Filter out rows based on the SnpEff annotations.

parseann(vf, idx[, sep])

Parse SnpEff annotations.

row_firstann(r)

Return the first SnpEff annotation for the row.

fuc.api.pysnpeff.filter_ann(vf, targets, include=True)[source]

Filter out rows based on the SnpEff annotations.

Parameters
  • vf (fuc.api.pyvcf.VcfFrame) – Input VcfFrame.

  • targets (list) – List of annotations (e.g. [‘missense_variant’, ‘stop_gained’]).

  • include (bool, default: True) – If True, include only such rows instead of excluding them.

Returns

vf – Filtered VcfFrame.

Return type

VcfFrame

fuc.api.pysnpeff.parseann(vf, idx, sep=' | ')[source]

Parse SnpEff annotations.

Parameters
  • vf (fuc.api.pyvcf.VcfFrame) – Input VcfFrame.

  • idx (list) – List of annotation indices.

  • sep (str, default: ‘ | ‘) – Separator for joining requested annotations.

Returns

s – Parsed annotations.

Return type

pandas.Series

fuc.api.pysnpeff.row_firstann(r)[source]

Return the first SnpEff annotation for the row.

fuc.pyvcf

The pyvcf submodule is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. The pyvcf.VcfFrame class also contains many useful plotting methods such as VcfFrame.plot_comparison and VcfFrame.plot_tmb. The submodule strictly adheres to the standard VCF specification.

A typical VCF file contains metadata lines (prefixed with ‘##’), a header line (prefixed with ‘#’), and genotype lines that begin with a chromosome identifier (e.g. ‘chr1’). See the VCF specification above for an example VCF file.

Genotype lines have nine required fields for storing variant information and variable-length fields for storing sample genotype data. For some fields, missing values are tolerated and can be specified with a dot (‘.’). The nine required fields are:

No.  Name    Description                          Examples
1    CHROM   Chromosome or contig identifier      ‘chr2’, ‘2’, ‘chrM’
2    POS     1-based reference position           10041, 23042
3    ID      ‘;’-separated variant identifiers    ‘.’, ‘rs35’, ‘rs9;rs53’
4    REF     Reference allele                     ‘A’, ‘GT’
5    ALT     ‘,’-separated alternate alleles      ‘T’, ‘ACT’, ‘C,T’
6    QUAL    Phred-scaled quality score for ALT   ‘.’, 67, 12
7    FILTER  ‘;’-separated filters that failed    ‘.’, ‘PASS’, ‘q10;s50’
8    INFO    ‘;’-separated information fields     ‘.’, ‘DP=14;AF=0.5;DB’
9    FORMAT  ‘:’-separated genotype fields        ‘GT’, ‘GT:AD:DP’

You will sometimes come across VCFs that have only eight columns, and contain no FORMAT or sample-specific information. These are called “sites-only” VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
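
The field layout described above can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not how pyvcf reads files: parse_genotype_line is a hypothetical helper, the line is assumed to be tab-delimited, and the sample names are assumed to have been taken from the header line.

```python
FIXED_FIELDS = ['CHROM', 'POS', 'ID', 'REF', 'ALT',
                'QUAL', 'FILTER', 'INFO', 'FORMAT']

def parse_genotype_line(line, samples=()):
    """Split one VCF genotype line into its fixed fields and sample data."""
    fields = line.rstrip('\n').split('\t')
    record = dict(zip(FIXED_FIELDS, fields[:9]))
    record['POS'] = int(record['POS'])        # 1-based reference position
    record['ALT'] = record['ALT'].split(',')  # ','-separated alternate alleles
    # A line with only eight columns has no FORMAT field (sites-only VCF).
    record['sites_only'] = 'FORMAT' not in record
    if not record['sites_only']:
        keys = record['FORMAT'].split(':')    # ':'-separated genotype fields
        record['genotypes'] = {
            name: dict(zip(keys, value.split(':')))
            for name, value in zip(samples, fields[9:])
        }
    return record

full = parse_genotype_line(
    'chr1\t100\t.\tG\tA,T\t.\tPASS\tDP=14\tGT:DP\t0/1:14', samples=['Steven'])
sites = parse_genotype_line('chr1\t100\t.\tG\tA\t.\tPASS\tDP=14')
print(full['genotypes'], sites['sites_only'])
```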

There are several common, reserved genotype keywords that are standards across the community. Currently, the pyvcf submodule is aware of the following:

  • AD - Total read depth for each allele (R, Integer)

  • AF - Allele fraction of the event in the tumor (1, Float)

  • DP - Read depth (1, Integer)

If sample annotation data are available for a given VCF file, use the common.AnnFrame class to import the data.

Classes:

VcfFrame(meta, df)

Class for storing VCF data.

Functions:

gt_hasvar(g)

Return True if sample genotype has at least one variant call.

gt_het(g)

Return True if genotype call is heterozygous.

gt_miss(g)

Return True if sample genotype is missing.

gt_polyp(g)

Return True if sample genotype has a polyploid call.

gt_pseudophase(g)

Return pseudophased genotype call.

gt_unphase(g)

Return unphased sample genotype.

merge(vfs[, how, format, sort, collapse])

Merge VcfFrame objects.

rescue_filtered_variants(vfs[, format])

Rescue filtered variants if they are PASS in at least one of the input VCF files.

row_hasindel(r)

Return True if the row has an indel.

row_missval(r)

Return the correctly formatted missing value for the row.

row_parseinfo(r, key)

Return INFO data in the row that match the given key.

row_updateinfo(r, key, value)

Update INFO data in the row that match the given key.

slice(file, regions[, path])

Slice VCF file for specified regions.
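
To give a feel for what the genotype helpers listed above operate on, here is a rough, standalone sketch working on sample genotype strings such as ‘0/1:12,15’. The function names are deliberately different from the fuc ones because these are simplified re-implementations written for illustration. In particular, treating a partially missing call as missing and sorting alleles when unphasing are assumptions, not documented fuc behavior.

```python
import re

def split_alleles(g):
    """Return the allele codes of the GT subfield, e.g. '0|1:3' -> ['0', '1']."""
    gt = g.split(':')[0]
    return re.split(r'[/|]', gt)

def is_missing(g):
    """True if any allele of the genotype call is missing ('.')."""
    return '.' in split_alleles(g)

def has_variant(g):
    """True if at least one allele is a non-reference call."""
    return any(a not in ('0', '.') for a in split_alleles(g))

def is_het(g):
    """True if the call is complete and carries more than one distinct allele."""
    alleles = split_alleles(g)
    return '.' not in alleles and len(set(alleles)) > 1

def unphase(g):
    """Replace '|' with '/' and order the alleles numerically."""
    gt, *rest = g.split(':')
    alleles = split_alleles(g)
    if '.' not in alleles:
        alleles = sorted(alleles, key=int)
    return ':'.join(['/'.join(alleles)] + rest)

print(is_het('0/1:12,15'), is_missing('./.:.'), unphase('1|0:3'))
```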

class fuc.api.pyvcf.VcfFrame(meta, df)[source]

Class for storing VCF data.

Parameters
  • meta (list) – List of metadata lines.

  • df (pandas.DataFrame) – DataFrame containing VCF data.

See also

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

Constructing VcfFrame from pandas DataFrame:

>>> from fuc import pyvcf
>>> import pandas as pd
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.',],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '0/1'],
... }
>>> df = pd.DataFrame(data)
>>> vf = pyvcf.VcfFrame(['##fileformat=VCFv4.3'], df)
>>> vf.meta
['##fileformat=VCFv4.3']
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1
2  chr1  102  .   A   T    .      .    .     GT    0/1

Methods:

add_af([decimals])

Compute AF from AD and then add it to the FORMAT field.

add_dp()

Compute DP using AD and add it to the FORMAT field.

add_flag(flag[, order, index])

Add the given flag to the INFO field.

calculate_concordance(a, b[, c, mode])

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

cfilter_empty([opposite, as_list])

Remove samples whose genotype calls are all missing.

collapse()

Collapse duplicate records in the VcfFrame.

combine(a, b)

Combine genotype data from two samples (A, B).

compare(other)

Compare to another VcfFrame and show the differences in genotype calling.

copy()

Return a copy of the VcfFrame.

copy_df()

Return a copy of the dataframe.

copy_meta()

Return a copy of the metadata.

drop_duplicates([subset, keep])

Return VcfFrame with duplicate rows removed.

expand()

Expand each multiallelic locus to multiple rows.

extract_format(k[, func, as_nan])

Extract data for the specified FORMAT key.

extract_info(k[, func, as_nan])

Extract data for the specified INFO key.

fetch(variant)

Fetch the VCF row that matches specified variant.

filter_bed(bed[, opposite, as_index])

Select rows that overlap with the given BED data.

filter_empty([opposite, as_index])

Remove rows with no genotype calls at all.

filter_flagall(flags[, opposite, as_index])

Select rows if all of the given INFO flags are present.

filter_flagany(flags[, opposite, as_index])

Select rows if any one of the given INFO flags is present.

filter_indel([opposite, as_index])

Remove rows with an indel.

filter_multialt([opposite, as_index])

Remove rows with multiple ALT alleles.

filter_pass([opposite, as_index])

Select rows with PASS in the FILTER field.

filter_phased([opposite, as_index])

Remove rows with phased genotypes.

filter_polyp([opposite, as_index])

Remove rows with a polyploid genotype call.

filter_qual(threshold[, opposite, as_index])

Select rows with minimum QUAL value.

filter_sampall([samples, opposite, as_index])

Select rows if all of the given samples have the variant.

filter_sampany([samples, opposite, as_index])

Select rows if any one of the given samples has the variant.

filter_sampnum(threshold[, opposite, as_index])

Select rows if the variant is prevalent enough.

filter_vcf(vcf[, opposite, as_index])

Select rows that overlap with the other VCF.

from_dict(meta, data)

Construct VcfFrame from a dict of array-like or dicts.

from_file(fn[, compression, meta_only, regions])

Construct VcfFrame from a VCF file.

from_string(s[, meta_only])

Construct VcfFrame from a string.

get_af(sample, variant)

Get allele fraction for a pair of sample and variant.

markmiss(expr[, greedy, opposite, samples, ...])

Mark all genotypes that satisfy the query expression as missing.

merge(other[, how, format, sort, collapse])

Merge with the other VcfFrame.

meta_keys()

Print metadata lines with a key.

miss2ref()

Convert missing genotype (./.) to homozygous REF (0/0).

plot_comparison(a, b[, c, labels, ax, figsize])

Create a Venn diagram showing genotype concordance between groups.

plot_hist_format(k[, af, group_col, ...])

Create a histogram showing the distribution of data for the specified FORMAT key.

plot_hist_info(k[, kde, ax, figsize])

Create a histogram showing the distribution of data for the specified INFO key.

plot_rainfall(sample[, palette, ax, ...])

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

plot_region(sample[, k, color, region, ...])

Create a scatter plot showing read depth profile of a sample for the specified region.

plot_regplot(a, b[, ax, figsize])

Create a scatter plot with a linear regression model fit visualizing correlation between TMB in two sample groups.

plot_snvclsc([af, group_col, group_order, ...])

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

plot_snvclsp([af, group_col, group_order, ...])

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

plot_snvclss([color, colormap, width, ...])

Create a bar plot showing the proportions of the six SNV classes for individual samples.

plot_titv([af, group_col, group_order, ...])

Create a box plot showing the Ti/Tv proportions of samples.

plot_tmb([af, group_col, group_order, kde, ...])

Create a histogram showing TMB distribution.

pseudophase()

Pseudophase VcfFrame.

rename(names[, indicies])

Rename the samples.

slice(region)

Slice the VcfFrame for the region.

sort()

Sort the VcfFrame by chromosome and position.

strip([format, metadata])

Remove any unnecessary data.

subset(samples[, exclude])

Subset VcfFrame for specified samples.

subtract(a, b)

Subtract genotype data between two samples (A, B).

to_bed()

Write BedFrame from the VcfFrame.

to_file(fn[, compression])

Write VcfFrame to a VCF file.

to_string()

Render the VcfFrame to a console-friendly tabular output.

unphase()

Unphase all the sample genotypes.

update_chr_prefix([mode])

Add or remove the (annoying) 'chr' string from the CHROM column.

variants()

List unique variants in VcfFrame.

Attributes:

contigs

List of contig names.

df

DataFrame containing VCF data.

empty

Indicator whether VcfFrame is empty.

has_chr_prefix

Whether the (annoying) 'chr' string is found.

meta

List of metadata lines.

samples

List of sample names.

shape

Dimensionality of VcfFrame (variants, samples).

sites_only

Whether the VCF is sites-only.

add_af(decimals=3)[source]

Compute AF from AD and then add it to the FORMAT field.

This method will compute allele fraction for each ALT allele in the same order as listed.
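
The underlying arithmetic is simply each allelic depth divided by the total depth. Below is a minimal sketch of that calculation on a single AD string; ad_to_af is a hypothetical helper, and returning ‘.’ for missing or zero-depth data is an assumption rather than documented behavior of add_af.

```python
def ad_to_af(ad, decimals=3):
    """Convert an AD string such as '12,15' into a ','-separated AF string."""
    parts = ad.split(',')
    if '.' in parts:
        return '.'                      # missing allelic depth
    depths = [int(x) for x in parts]
    total = sum(depths)
    if total == 0:
        return '.'                      # avoid division by zero
    # One fraction per allele, in the same order as the depths.
    return ','.join(f'{d / total:.{decimals}f}' for d in depths)

print(ad_to_af('12,15'))
print(ad_to_af('0,11,17'))
```

With these assumptions, ‘12,15’ gives ‘0.444,0.556’ and ‘0,11,17’ gives ‘0.000,0.393,0.607’.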

Parameters

decimals (int, default: 3) – Number of decimals to display.

Returns

Updated VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'G', 'A', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT', 'GT:AD'],
...     'A': ['0/1:12,15', '0/0:32,1', '0/1', './.:.'],
...     'B': ['0/1:13,17', '0/1:14,15', './.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT          A            B
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15    0/1:13,17
1  chr1  101  .   G    T    .      .    .  GT:AD   0/0:32,1    0/1:14,15
2  chr1  102  .   A    G    .      .    .     GT        0/1          ./.
3  chr1  103  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17
>>> vf.add_af().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT                      A                              B
0  chr1  100  .   A    C    .      .    .  GT:AD:AF  0/1:12,15:0.444,0.556          0/1:13,17:0.433,0.567
1  chr1  101  .   G    T    .      .    .  GT:AD:AF   0/0:32,1:0.970,0.030          0/1:14,15:0.483,0.517
2  chr1  102  .   A    G    .      .    .     GT:AF                  0/1:.                          ./.:.
3  chr1  103  .   C  G,A    .      .    .  GT:AD:AF                ./.:.:.  1/2:0,11,17:0.000,0.393,0.607
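
The arithmetic behind the AF values above can be sketched in plain Python. This is a minimal illustration and not fuc's implementation; the helper name `allele_fractions` is hypothetical, and reporting a zero total depth as missing is an assumption.

```python
def allele_fractions(ad, decimals=3):
    """Convert an AD string such as '12,15' into an AF string.

    Hypothetical helper for illustration; not part of the fuc API.
    """
    depths = ad.split(',')
    # A missing AD value ('.') cannot be converted, so keep it missing.
    if '.' in depths:
        return '.'
    counts = [int(d) for d in depths]
    total = sum(counts)
    # Assumption: a zero total depth is reported as missing.
    if total == 0:
        return '.'
    # One fraction per allele, in the same order as AD (REF first).
    return ','.join(f'{c / total:.{decimals}f}' for c in counts)


print(allele_fractions('12,15'))    # 0.444,0.556
print(allele_fractions('0,11,17'))  # 0.000,0.393,0.607
```

Each fraction is the allele's depth divided by the sum of all AD values, which is why the values for one genotype add up to roughly 1.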
add_dp()[source]

Compute DP using AD and add it to the FORMAT field.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT:AD', 'GT:AD'],
...     'Steven': ['0/1:12,15', '0/0:32,1', '0/1:16,12', './.:.'],
...     'Sara': ['0/1:13,17', '0/1:14,15', './.:.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT     Steven         Sara
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15    0/1:13,17
1  chr1  100  .   A    T    .      .    .  GT:AD   0/0:32,1    0/1:14,15
2  chr2  200  .   C    G    .      .    .  GT:AD  0/1:16,12        ./.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17

We can add the DP subfield to our genotype data:

>>> vf.add_dp().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT        Steven            Sara
0  chr1  100  .   A    C    .      .    .  GT:AD:DP  0/1:12,15:27    0/1:13,17:30
1  chr1  100  .   A    T    .      .    .  GT:AD:DP   0/0:32,1:33    0/1:14,15:29
2  chr2  200  .   C    G    .      .    .  GT:AD:DP  0/1:16,12:28         ./.:.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD:DP       ./.:.:.  1/2:0,11,17:28
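
As a rough sketch of what the example shows, DP is the sum of the AD values for a genotype. The helper `append_dp` below is hypothetical and assumes that AD is present in FORMAT; it is not fuc's implementation.

```python
def append_dp(fmt, genotype):
    """Append a DP value, computed from AD, to one genotype string.

    Hypothetical helper for illustration; assumes 'AD' is listed in fmt.
    """
    keys = fmt.split(':')
    values = genotype.split(':')
    ad = values[keys.index('AD')]
    if '.' in ad.split(','):
        dp = '.'  # missing AD gives missing DP
    else:
        dp = str(sum(int(x) for x in ad.split(',')))
    return f'{genotype}:{dp}'


print(append_dp('GT:AD', '0/1:12,15'))    # 0/1:12,15:27
print(append_dp('GT:AD', '1/2:0,11,17'))  # 1/2:0,11,17:28
```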
add_flag(flag, order='last', index=None)[source]

Add the given flag to the INFO field.

The default behavior is to add the flag to all rows in the VcfFrame.

Parameters
  • flag (str) – INFO flag.

  • order ({‘last’, ‘first’, False}, default: ‘last’) – Determines the order in which the flag will be added.

    • last : Add to the end of the list.

    • first : Add to the beginning of the list.

    • False : Overwrite the existing field.

  • index (list or pandas.Series, optional) – Boolean index array indicating which rows should be updated.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', 'DB', 'DB', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/0
1  chr1  101  .   T   C    .      .   DB     GT    0/1
2  chr1  102  .   A   T    .      .   DB     GT    0/1
3  chr1  103  .   C   A    .      .    .     GT    1/1

We can add the SOMATIC flag to the INFO field:

>>> vf.add_flag('SOMATIC').df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  DB;SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .  DB;SOMATIC     GT    0/1
3  chr1  103  .   C   A    .      .     SOMATIC     GT    1/1

Setting order='first' will prepend the flag instead:

>>> vf.add_flag('SOMATIC', order='first').df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  SOMATIC;DB     GT    0/1
2  chr1  102  .   A   T    .      .  SOMATIC;DB     GT    0/1
3  chr1  103  .   C   A    .      .     SOMATIC     GT    1/1

Setting order=False will overwrite the INFO field:

>>> vf.add_flag('SOMATIC', order=False).df
  CHROM  POS ID REF ALT QUAL FILTER     INFO FORMAT Steven
0  chr1  100  .   G   A    .      .  SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .  SOMATIC     GT    0/1
3  chr1  103  .   C   A    .      .  SOMATIC     GT    1/1

We can also specify which rows should be updated:

>>> vf.add_flag('SOMATIC', index=[True, True, False, False]).df
  CHROM  POS ID REF ALT QUAL FILTER        INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     SOMATIC     GT    0/0
1  chr1  101  .   T   C    .      .  DB;SOMATIC     GT    0/1
2  chr1  102  .   A   T    .      .          DB     GT    0/1
3  chr1  103  .   C   A    .      .           .     GT    1/1
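
The three values of order can be summarized with a small string-level sketch. The function `add_info_flag` is a hypothetical stand-in that operates on a single INFO string; it only mirrors the behavior shown in the examples above.

```python
def add_info_flag(info, flag, order='last'):
    """Return the INFO string with the flag added.

    Hypothetical helper for illustration; not part of the fuc API.
    """
    # order=False, or an empty INFO field ('.'), leaves only the new flag.
    if order is False or info == '.':
        return flag
    if order == 'last':
        return f'{info};{flag}'
    if order == 'first':
        return f'{flag};{info}'
    raise ValueError(f'Unsupported order: {order!r}')


print(add_info_flag('DB', 'SOMATIC'))                 # DB;SOMATIC
print(add_info_flag('DB', 'SOMATIC', order='first'))  # SOMATIC;DB
print(add_info_flag('DB', 'SOMATIC', order=False))    # SOMATIC
```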
calculate_concordance(a, b, c=None, mode='all')[source]

Calculate genotype concordance between two (A, B) or three (A, B, C) samples.

This method will return (Ab, aB, AB, ab) for comparison between two samples and (Abc, aBc, ABc, abC, AbC, aBC, ABC, abc) for three samples. Note that the former is equivalent to (FP, FN, TP, TN) if we assume A is the test sample and B is the truth sample.

Only biallelic sites will be used for calculation. Additionally, the method will ignore zygosity and only consider presence or absence of variant calls (e.g. 0/1 and 1/1 will be treated the same).

Parameters
  • a, b (str or int) – Name or index of Samples A and B.

  • c (str or int, optional) – Name or index of Sample C.

  • mode ({‘all’, ‘snv’, ‘indel’}, default: ‘all’) – Determines which variant types should be analyzed:

    • ‘all’: Include both SNVs and INDELs.

    • ‘snv’: Include SNVs only.

    • ‘indel’: Include INDELs only.

Returns

Four- or eight-element tuple depending on the number of samples.

Return type

tuple

See also

fuc.api.common.sumstat

Return various summary statistics from (FP, FN, TP, TN).

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '0/1', '0/0'],
...     'C': ['0/1', '0/1', '1/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G    A    .      .    .     GT  0/1  1/1  0/1
1  chr1  101  .  CT    C    .      .    .     GT  0/0  0/1  0/1
2  chr1  102  .   T    A    .      .    .     GT  0/0  ./.  1/1
3  chr1  103  .   C    T    .      .    .     GT  0/1  0/1  ./.
4  chr1  104  .   A  G,C    .      .    .     GT  0/0  0/0  1/2

We can first compare the samples A and B:

>>> vf.calculate_concordance('A', 'B', mode='all')
(0, 1, 2, 1)
>>> vf.calculate_concordance('A', 'B', mode='snv')
(0, 0, 2, 1)
>>> vf.calculate_concordance('A', 'B', mode='indel')
(0, 1, 0, 0)

We can also compare all three samples at once:

>>> vf.calculate_concordance('A', 'B', 'C')
(0, 0, 1, 1, 0, 1, 1, 0)
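
The two-sample counting rule can be sketched in plain Python as follows. The names `has_variant` and `concordance` are hypothetical, and treating a missing genotype as absence of a variant is an assumption that happens to reproduce the (0, 1, 2, 1) result above.

```python
def has_variant(gt):
    """Return True if the genotype carries at least one ALT allele."""
    alleles = gt.split(':')[0].replace('|', '/').split('/')
    # Assumption: missing alleles ('.') count as absence of a variant.
    return any(a not in ('0', '.') for a in alleles)


def concordance(records):
    """Count (Ab, aB, AB, ab) over (alt, gt_a, gt_b) tuples."""
    counts = {'Ab': 0, 'aB': 0, 'AB': 0, 'ab': 0}
    for alt, gt_a, gt_b in records:
        if ',' in alt:  # only biallelic sites are counted
            continue
        key = ('A' if has_variant(gt_a) else 'a') + \
              ('B' if has_variant(gt_b) else 'b')
        counts[key] += 1
    return counts['Ab'], counts['aB'], counts['AB'], counts['ab']


# (ALT, sample A, sample B) taken from the example data above.
records = [
    ('A', '0/1', '1/1'),
    ('C', '0/0', '0/1'),
    ('A', '0/0', './.'),
    ('T', '0/1', '0/1'),
    ('G,C', '0/0', '0/0'),
]
print(concordance(records))  # (0, 1, 2, 1)
```

Zygosity is ignored because `has_variant` only asks whether any ALT allele is present, so 0/1 and 1/1 fall into the same bucket.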
cfilter_empty(opposite=False, as_list=False)[source]

Remove samples whose genotype calls are all missing.

Parameters
  • opposite (bool, default: False) – If True, return samples that don’t meet the said criteria.

  • as_list (bool, default: False) – If True, return a list of sample names instead of a VcfFrame.

Returns

Filtered VcfFrame, or a list of sample names if as_list is True.

Return type

VcfFrame or list

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'G', 'T'],
...     'ALT': ['A', 'C', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '1/1', '1/1'],
...     'Rachel': ['./.', './.', './.', './.'],
...     'John': ['0/0', './.', '0/0', '0/0'],
...     'Sara': ['./.', './.', './.', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Rachel John Sara
0  chr1  100  .   G   A    .      .    .     GT    0/1    ./.  0/0  ./.
1  chr1  101  .   T   C    .      .    .     GT    1/1    ./.  ./.  ./.
2  chr1  102  .   G   C    .      .    .     GT    1/1    ./.  0/0  ./.
3  chr1  103  .   T   C    .      .    .     GT    1/1    ./.  0/0  ./.

We can remove samples whose genotypes are all missing:

>>> vf.cfilter_empty().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven John
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0
1  chr1  101  .   T   C    .      .    .     GT    1/1  ./.
2  chr1  102  .   G   C    .      .    .     GT    1/1  0/0
3  chr1  103  .   T   C    .      .    .     GT    1/1  0/0

We can also select those samples:

>>> vf.cfilter_empty(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Rachel Sara
0  chr1  100  .   G   A    .      .    .     GT    ./.  ./.
1  chr1  101  .   T   C    .      .    .     GT    ./.  ./.
2  chr1  102  .   G   C    .      .    .     GT    ./.  ./.
3  chr1  103  .   T   C    .      .    .     GT    ./.  ./.

Finally, we can return a list of sample names from the filtering:

>>> vf.cfilter_empty(as_list=True)
['Steven', 'John']
collapse()[source]

Collapse duplicate records in the VcfFrame.

Duplicate records have identical values for CHROM, POS, and REF. They can result from merging two VCF files.

Note

The method will sort the order of ALT alleles.

Returns

Collapsed VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT:AD', 'GT:AD'],
...     'Steven': ['0/1:12,15', './.:.', '0/1:16,12', './.:.'],
...     'Sara': ['./.:.', '0/1:14,15', './.:.', '1/2:0,11,17'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT     Steven         Sara
0  chr1  100  .   A    C    .      .    .  GT:AD  0/1:12,15        ./.:.
1  chr1  100  .   A    T    .      .    .  GT:AD      ./.:.    0/1:14,15
2  chr2  200  .   C    G    .      .    .  GT:AD  0/1:16,12        ./.:.
3  chr2  200  .   C  G,A    .      .    .  GT:AD      ./.:.  1/2:0,11,17

We collapse the VcfFrame:

>>> vf.collapse().df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT       Steven         Sara
0  chr1  100  .   A  C,T    .      .    .  GT:AD  0/1:12,15,0  0/2:14,0,15
2  chr2  200  .   C  A,G    .      .    .  GT:AD  0/2:16,0,12  1/2:0,17,11
combine(a, b)[source]

Combine genotype data from two samples (A, B).

This method can be especially useful when you want to consolidate genotype data from replicate samples. See examples below for more details.

Parameters

a, b (str or int) – Name or index of Samples A and B.

Returns

Resulting VCF column.

Return type

pandas.Series

See also

VcfFrame.subtract

Subtract genotype data between two samples (A, B).

Examples

Assume we have the following data where a cancer patient’s tissue sample has been sequenced twice:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'A', 'C', 'G'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Tissue1': ['./.:.', '0/0:7', '0/1:28', '0/1:4', '0/1:32'],
...     'Tissue2': ['0/1:24', '0/1:42', './.:.', './.:.', '0/1:19'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Tissue1 Tissue2
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP   0/0:7  0/1:42
2  chr1  102  .   T   A    .      .    .  GT:DP  0/1:28   ./.:.
3  chr1  103  .   A   C    .      .    .  GT:DP   0/1:4   ./.:.
4  chr1  104  .   C   G    .      .    .  GT:DP  0/1:32  0/1:19

We can combine genotype data from ‘Tissue1’ and ‘Tissue2’ to get a more comprehensive variant profile:

>>> vf.df['Combined'] = vf.combine('Tissue1', 'Tissue2')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Tissue1 Tissue2 Combined
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24   0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP   0/0:7  0/1:42   0/1:42
2  chr1  102  .   T   A    .      .    .  GT:DP  0/1:28   ./.:.   0/1:28
3  chr1  103  .   A   C    .      .    .  GT:DP   0/1:4   ./.:.    0/1:4
4  chr1  104  .   C   G    .      .    .  GT:DP  0/1:32  0/1:19   0/1:32
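
The per-row choice seen in the 'Combined' column can be approximated with the sketch below. The rule is inferred from the example output only: keep Sample A's call when it carries a variant, otherwise fall back to Sample B's. What happens when neither sample has a variant is not shown in the example, so returning A's call in that case is an assumption. The function names are hypothetical.

```python
def has_variant(genotype):
    """Return True if the GT subfield carries at least one ALT allele."""
    alleles = genotype.split(':')[0].replace('|', '/').split('/')
    return any(a not in ('0', '.') for a in alleles)


def combine_calls(gt_a, gt_b):
    """Pick one genotype string from two replicate calls."""
    if has_variant(gt_a):
        return gt_a
    if has_variant(gt_b):
        return gt_b
    # Assumption: with no variant in either sample, keep Sample A's call.
    return gt_a


tissue1 = ['./.:.', '0/0:7', '0/1:28', '0/1:4', '0/1:32']
tissue2 = ['0/1:24', '0/1:42', './.:.', './.:.', '0/1:19']
combined = [combine_calls(a, b) for a, b in zip(tissue1, tissue2)]
print(combined)
```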
compare(other)[source]

Compare to another VcfFrame and show the differences in genotype calling.

Parameters

other (VcfFrame) – VcfFrame to compare with.

Returns

DataFrame containing genotype differences.

Return type

pandas.DataFrame

Examples

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '0/1', '0/0'],
...     'C': ['0/1', '0/1', '1/1', './.', '1/2'],
... }
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'T', 'C', 'A'],
...     'ALT': ['A', 'C', 'A', 'T', 'G,C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['./.', '0/0', '0/0', '0/1', '0/0'],
...     'B': ['1/1', '0/1', './.', '1/1', '0/0'],
...     'C': ['0/1', '0/1', '0/1', './.', '1/2'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.compare(vf2)
          Locus Sample Self Other
0  chr1-100-G-A      A  0/1   ./.
1  chr1-102-T-A      C  1/1   0/1
2  chr1-103-C-T      B  0/1   1/1
property contigs

List of contig names.

Type

list

copy()[source]

Return a copy of the VcfFrame.

copy_df()[source]

Return a copy of the dataframe.

copy_meta()[source]

Return a copy of the metadata.

property df

DataFrame containing VCF data.

Type

pandas.DataFrame

drop_duplicates(subset=None, keep='first')[source]

Return VcfFrame with duplicate rows removed.

This method essentially wraps the pandas.DataFrame.drop_duplicates() method.

Considering certain columns is optional.

Parameters
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({‘first’, ‘last’, False}, default ‘first’) – Determines which duplicates (if any) to keep.

    • first : Drop duplicates except for the first occurrence.

    • last : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

Returns

VcfFrame with duplicates removed.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', '0/1', './.'],
...     'B': ['./.', '0/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A    C    .      .    .     GT  0/1  ./.
1  chr1  100  .   A    T    .      .    .     GT  ./.  0/1
2  chr2  200  .   C    G    .      .    .     GT  0/1  ./.
3  chr2  200  .   C  G,A    .      .    .     GT  ./.  1/2
>>> vf.drop_duplicates(['CHROM', 'POS', 'REF']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A   C    .      .    .     GT  0/1  ./.
1  chr2  200  .   C   G    .      .    .     GT  0/1  ./.
>>> vf.drop_duplicates(['CHROM', 'POS', 'REF'], keep='last').df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A    T    .      .    .     GT  ./.  0/1
1  chr2  200  .   C  G,A    .      .    .     GT  ./.  1/2
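
To make the effect of subset and keep concrete without pandas, here is a minimal pure-Python sketch over a list of row dictionaries. The function `drop_duplicate_rows` is hypothetical and only imitates the semantics described above.

```python
from collections import Counter


def drop_duplicate_rows(rows, subset, keep='first'):
    """Drop duplicate rows (dicts) judged by the columns in subset."""
    if keep is not False and keep not in ('first', 'last'):
        raise ValueError(f'Unsupported keep: {keep!r}')

    def key(row):
        return tuple(row[column] for column in subset)

    if keep is False:
        # Drop every row whose key occurs more than once.
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] == 1]

    # Scan from the end for keep='last' so the last occurrence wins.
    ordered = rows if keep == 'first' else rows[::-1]
    seen, kept = set(), []
    for row in ordered:
        if key(row) not in seen:
            seen.add(key(row))
            kept.append(row)
    return kept if keep == 'first' else kept[::-1]


rows = [
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'C'},
    {'CHROM': 'chr1', 'POS': 100, 'REF': 'A', 'ALT': 'T'},
    {'CHROM': 'chr2', 'POS': 200, 'REF': 'C', 'ALT': 'G'},
    {'CHROM': 'chr2', 'POS': 200, 'REF': 'C', 'ALT': 'G,A'},
]
first = drop_duplicate_rows(rows, ['CHROM', 'POS', 'REF'])
print([row['ALT'] for row in first])  # ['C', 'G']
```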
property empty

Indicator whether VcfFrame is empty.

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr2  101  .   T   C    .      .    .     GT  1/1
>>> vf.df = vf.df[0:0]
>>> vf.df
Empty DataFrame
Columns: [CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, A]
Index: []
>>> vf.empty
True
expand()[source]

Expand each multiallelic locus to multiple rows.

Only the GT subfield of FORMAT will be retained.

Returns

Expanded VcfFrame.

Return type

VcfFrame

See also

VcfFrame.collapse

Collapse duplicate records in the VcfFrame.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T,G', 'G', 'A,G,CT'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Steven': ['0/1:32', './.:.', '0/1:27', '0/2:34'],
...     'Sara': ['0/0:28', '1/2:30', '1/1:29', '1/2:38'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF     ALT QUAL FILTER INFO FORMAT  Steven    Sara
0  chr1  100  .   A       C    .      .    .  GT:DP  0/1:32  0/0:28
1  chr1  101  .   A     T,G    .      .    .  GT:DP   ./.:.  1/2:30
2  chr1  102  .   C       G    .      .    .  GT:DP  0/1:27  1/1:29
3  chr1  103  .   C  A,G,CT    .      .    .  GT:DP  0/2:34  1/2:38

We can expand each of the multiallelic loci:

>>> vf.expand().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara
0  chr1  100  .   A   C    .      .    .     GT    0/1  0/0
1  chr1  101  .   A   T    .      .    .     GT    ./.  0/1
2  chr1  101  .   A   G    .      .    .     GT    ./.  0/1
3  chr1  102  .   C   G    .      .    .     GT    0/1  1/1
4  chr1  103  .   C   A    .      .    .     GT    0/0  0/1
5  chr1  103  .   C   G    .      .    .     GT    0/1  0/1
6  chr1  103  .   C  CT    .      .    .     GT    0/0  0/0
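
The genotype recoding in the expanded rows can be sketched as follows. For the i-th ALT allele, each allele index equal to i becomes 1 and every other index becomes 0, and the result is sorted so that 0 comes first. The helper `expand_genotype` is hypothetical, handles only unphased GT values (an assumption), and is not fuc's implementation.

```python
def expand_genotype(gt, n_alt):
    """Recode one unphased GT value for each of n_alt ALT alleles."""
    alleles = gt.split('/')
    # A missing genotype stays missing in every expanded row.
    if '.' in alleles:
        return [gt] * n_alt
    expanded = []
    for i in range(1, n_alt + 1):
        recoded = sorted('1' if int(a) == i else '0' for a in alleles)
        expanded.append('/'.join(recoded))
    return expanded


print(expand_genotype('1/2', 2))  # ['0/1', '0/1']
print(expand_genotype('0/2', 3))  # ['0/0', '0/1', '0/0']
```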
extract_format(k, func=None, as_nan=False)[source]

Extract data for the specified FORMAT key.

By default, this method will return string data. Use func and as_nan to output numbers. Alternatively, select one of the special keys for k, which have predetermined values of func and as_nan for convenience.

Parameters
  • k (str) – FORMAT key to use when extracting data. In addition to regular FORMAT keys (e.g. ‘DP’, ‘AD’), the method also accepts the special keys listed below:

    • ‘#DP’: Return numeric DP.

    • ‘#AD_REF’: Return numeric AD for REF.

    • ‘#AD_ALT’: Return numeric AD for ALT. If multiple values are available (i.e. multiallelic site), return the sum.

    • ‘#AD_FRAC_REF’: Return allele fraction for REF.

    • ‘#AD_FRAC_ALT’: Return allele fraction for ALT. If multiple values are available (i.e. multiallelic site), return the sum.

  • func (function, optional) – Function to apply to each of the extracted results.

  • as_nan (bool, default: False) – If True, return missing values as NaN.

Returns

DataFrame containing requested data.

Return type

pandas.DataFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['A', 'C', 'A'],
...     'ALT': ['G', 'T', 'C,T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:AD:DP', 'GT', 'GT:AD:DP'],
...     'A': ['0/1:15,13:28', '0/0', '0/1:9,14,0:23'],
...     'B': ['./.:.:.', '1/1', '1/2:0,11,15:26'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT              A               B
0  chr1  100  .   A    G    .      .    .  GT:AD:DP   0/1:15,13:28         ./.:.:.
1  chr1  101  .   C    T    .      .    .        GT            0/0             1/1
2  chr1  102  .   A  C,T    .      .    .  GT:AD:DP  0/1:9,14,0:23  1/2:0,11,15:26
>>> vf.extract_format('GT')
     A    B
0  0/1  ./.
1  0/0  1/1
2  0/1  1/2
>>> vf.extract_format('GT', as_nan=True)
     A    B
0  0/1  NaN
1  0/0  1/1
2  0/1  1/2
>>> vf.extract_format('AD')
        A        B
0   15,13        .
1     NaN      NaN
2  9,14,0  0,11,15
>>> vf.extract_format('DP', func=lambda x: int(x), as_nan=True)
      A     B
0  28.0   NaN
1   NaN   NaN
2  23.0  26.0
>>> vf.extract_format('#DP') # Same as above
      A     B
0  28.0   NaN
1   NaN   NaN
2  23.0  26.0
>>> vf.extract_format('AD', func=lambda x: float(x.split(',')[0]), as_nan=True)
      A    B
0  15.0  NaN
1   NaN  NaN
2   9.0  0.0
>>> vf.extract_format('#AD_REF') # Same as above
      A    B
0  15.0  NaN
1   NaN  NaN
2   9.0  0.0
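
For readers who want to see the lookup itself, the sketch below extracts one FORMAT key from a single genotype string and mimics the '#AD_ALT' special key by summing the ALT depths. Both helper names are hypothetical, and `None` is used here in place of the NaN that the method returns.

```python
def extract_format_value(fmt, genotype, key):
    """Return the value of key for one genotype, or None if absent."""
    keys = fmt.split(':')
    if key not in keys:
        return None
    return genotype.split(':')[keys.index(key)]


def ad_alt_total(fmt, genotype):
    """Sum the AD values of all ALT alleles (everything after REF)."""
    ad = extract_format_value(fmt, genotype, 'AD')
    if ad is None or '.' in ad.split(','):
        return None
    return float(sum(int(x) for x in ad.split(',')[1:]))


print(extract_format_value('GT:AD:DP', '0/1:15,13:28', 'DP'))  # 28
print(ad_alt_total('GT:AD:DP', '1/2:0,11,15:26'))              # 26.0
```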
extract_info(k, func=None, as_nan=False)[source]

Extract data for the specified INFO key.

By default, this method will return string data. Use func and as_nan to output numbers. Alternatively, select one of the special keys for k, which have predetermined values of func and as_nan for convenience.

Parameters
  • k (str) – INFO key to use when extracting data. In addition to regular INFO keys (e.g. ‘AC’, ‘AF’), the method also accepts the special keys listed below:

    • ‘#AC’: Return numeric AC. If multiple values are available (i.e. multiallelic site), return the sum.

    • ‘#AF’: Similar to ‘#AC’.

  • func (function, optional) – Function to apply to each of the extracted results.

  • as_nan (bool, default: False) – If True, return missing values as NaN.

Returns

Requested data.

Return type

pandas.Series

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'C', 'A', 'A'],
...     'ALT': ['G', 'T', 'C,T', 'T'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['AC=1;AF=0.167;H2', 'AC=2;AF=0.333', 'AC=1,2;AF=0.167,0.333;H2', 'AC=.;AF=.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/0', '0/1', './.'],
...     'B': ['0/0', '1/1', '0/2', './.'],
...     'C': ['0/0', '0/0', '0/2', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER                      INFO FORMAT    A    B    C
0  chr1  100  .   A    G    .      .          AC=1;AF=0.167;H2     GT  0/1  0/0  0/0
1  chr1  101  .   C    T    .      .             AC=2;AF=0.333     GT  0/0  1/1  0/0
2  chr1  102  .   A  C,T    .      .  AC=1,2;AF=0.167,0.333;H2     GT  0/1  0/2  0/2
3  chr1  103  .   A    T    .      .                 AC=.;AF=.     GT  ./.  ./.  ./.
>>> vf.extract_info('H2')
0     H2
1    NaN
2     H2
3    NaN
dtype: object
>>> vf.extract_info('AC')
0      1
1      2
2    1,2
3      .
dtype: object
>>> vf.extract_info('AC', as_nan=True)
0      1
1      2
2    1,2
3    NaN
dtype: object
>>> vf.extract_info('AC', func=lambda x: sum([int(x) for x in x.split(',')]), as_nan=True)
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> vf.extract_info('#AC') # Same as above
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
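
A single INFO string can be parsed with a few lines of plain Python, which may help when reasoning about what '#AC' returns. The helpers `parse_info` and `total_ac` are hypothetical; representing flags as `True` and missing values as `None` are choices made for this sketch only.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags map to True."""
    if info == '.':
        return {}
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields


def total_ac(info):
    """Sum the AC values across ALT alleles, or None if missing."""
    value = parse_info(info).get('AC')
    if value is None or value == '.':
        return None
    return float(sum(int(x) for x in value.split(',')))


print(parse_info('AC=1;AF=0.167;H2'))
print(total_ac('AC=1,2;AF=0.167,0.333;H2'))  # 3.0
```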
fetch(variant)[source]

Fetch the VCF row that matches the specified variant.

Parameters

variant (str) – Target variant.

Returns

VCF row.

Return type

pandas.Series

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.fetch('chr1-100-G-A')
CHROM     chr1
POS        100
ID           .
REF          G
ALT          A
QUAL         .
FILTER       .
INFO         .
FORMAT      GT
A          0/1
Name: 0, dtype: object
filter_bed(bed, opposite=False, as_index=False)[source]

Select rows that overlap with the given BED data.

Parameters
  • bed (pybed.BedFrame or str) – BedFrame or path to a BED file.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pybed, pyvcf
>>> data = {
...     'Chromosome': ['chr1', 'chr2', 'chr3'],
...     'Start': [100, 400, 100],
...     'End': [200, 500, 200]
... }
>>> bf = pybed.BedFrame.from_dict([], data)
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  200
1       chr2    400  500
2       chr3    100  200
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr3'],
...     'POS': [100, 201, 450, 99],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'AT', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  201  .  CT   C    .      .    .     GT    1/1
2  chr2  450  .   A  AT    .      .    .     GT    0/1
3  chr3   99  .   C   A    .      .    .     GT    0/1

We can select rows that overlap with the BED data:

>>> vf.filter_bed(bf).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  450  .   A  AT    .      .    .     GT    0/1

We can also remove those rows:

>>> vf.filter_bed(bf, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  201  .  CT   C    .      .    .     GT    1/1
1  chr3   99  .   C   A    .      .    .     GT    0/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_bed(bf, as_index=True)
0     True
1    False
2     True
3    False
dtype: bool
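
The selection above can be imitated with a simple position check. The sketch below is an assumption-heavy illustration: it tests only POS against each interval using an inclusive comparison (Start <= POS <= End), which was chosen because it reproduces the example output. It says nothing about how fuc or pyranges handles coordinate conventions or variants that span several bases, and the function name `overlaps_bed` is hypothetical.

```python
def overlaps_bed(chrom, pos, intervals):
    """Return True if (chrom, pos) falls inside any interval.

    Assumption: inclusive bounds, checked against POS only.
    """
    return any(
        c == chrom and start <= pos <= end
        for c, start, end in intervals
    )


intervals = [('chr1', 100, 200), ('chr2', 400, 500), ('chr3', 100, 200)]
variants = [('chr1', 100), ('chr1', 201), ('chr2', 450), ('chr3', 99)]
mask = [overlaps_bed(c, p, intervals) for c, p in variants]
print(mask)  # [True, False, True, False]
```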
filter_empty(opposite=False, as_index=False)[source]

Remove rows with no genotype calls at all.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', './.', '0/1', './.'],
...     'Sara': ['0/0', './.', './.', './.'],
...     'James': ['0/0', './.', '0/1', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/0
1  chr1  101  .   T   C    .      .    .     GT    ./.  ./.   ./.
2  chr1  102  .   A   T    .      .    .     GT    0/1  ./.   0/1
3  chr1  103  .   C   A    .      .    .     GT    ./.  ./.   ./.

We can remove empty rows:

>>> vf.filter_empty().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/0
1  chr1  102  .   A   T    .      .    .     GT    0/1  ./.   0/1

We can also select those rows:

>>> vf.filter_empty(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  101  .   T   C    .      .    .     GT    ./.  ./.   ./.
1  chr1  103  .   C   A    .      .    .     GT    ./.  ./.   ./.

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_empty(as_index=True)
0     True
1    False
2     True
3    False
dtype: bool
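
The row-level test behind this filter can be sketched in plain Python: a row is kept when at least one sample has a non-missing genotype. The helper names are hypothetical, and a genotype is treated as missing when every allele in its GT subfield is '.'.

```python
def is_missing(genotype):
    """Return True if every allele in the GT subfield is '.'."""
    gt = genotype.split(':')[0]
    return all(a == '.' for a in gt.replace('|', '/').split('/'))


def row_has_call(genotypes):
    """Return True if at least one sample has a genotype call."""
    return any(not is_missing(g) for g in genotypes)


# One list per row: genotypes for Steven, Sara, and James.
rows = [
    ['0/0', '0/0', '0/0'],
    ['./.', './.', './.'],
    ['0/1', './.', '0/1'],
    ['./.', './.', './.'],
]
mask = [row_has_call(row) for row in rows]
print(mask)  # [True, False, True, False]
```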
filter_flagall(flags, opposite=False, as_index=False)[source]

Select rows if all of the given INFO flags are present.

Parameters
  • flags (list) – List of INFO flags.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_flagany

Similar method that selects rows if any one of the given INFO flags is present.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['DB', 'DB;H2', 'DB;H2', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     DB     GT    0/0
1  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
2  chr1  102  .   A   T    .      .  DB;H2     GT    0/1
3  chr1  103  .   C   A    .      .      .     GT    0/0

We can select rows with both the H2 and DB tags:

>>> vf.filter_flagall(['H2', 'DB']).df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
1  chr1  102  .   A   T    .      .  DB;H2     GT    0/1

We can also remove those rows:

>>> vf.filter_flagall(['H2', 'DB'], opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .   DB     GT    0/0
1  chr1  103  .   C   A    .      .    .     GT    0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_flagall(['H2', 'DB'], as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
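
The difference between matching all flags and matching any flag can be sketched in plain Python, independent of fuc. This is only an illustration and not the library's implementation: the helper info_flags is hypothetical, and it assumes INFO is a semicolon-separated string in which ‘.’ means no entries.

```python
def info_flags(info):
    """Return the set of entries in a semicolon-separated INFO string."""
    if info == '.':
        return set()
    return set(info.split(';'))

infos = ['DB', 'DB;H2', 'DB;H2', '.']
wanted = {'H2', 'DB'}

# A row matches "all" only if every requested flag is present.
match_all = [wanted <= info_flags(i) for i in infos]
# A row matches "any" if at least one requested flag is present.
match_any = [bool(wanted & info_flags(i)) for i in infos]

print(match_all)  # [False, True, True, False]
print(match_any)  # [True, True, True, False]
```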
filter_flagany(flags, opposite=False, as_index=False)[source]

Select rows if any one of the given INFO flags is present.

Parameters
  • flags (list) – List of INFO flags.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_flagall

Similar method that selects rows if all of the given INFO flags are present.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['DB', 'DB;H2', 'DB;H2', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  100  .   G   A    .      .     DB     GT    0/0
1  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
2  chr1  102  .   A   T    .      .  DB;H2     GT    0/1
3  chr1  103  .   C   A    .      .      .     GT    0/0

We can select rows with the H2 tag:

>>> vf.filter_flagany(['H2']).df
  CHROM  POS ID REF ALT QUAL FILTER   INFO FORMAT Steven
0  chr1  101  .   T   C    .      .  DB;H2     GT    0/1
1  chr1  102  .   A   T    .      .  DB;H2     GT    0/1

We can also remove those rows:

>>> vf.filter_flagany(['H2'], opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .   DB     GT    0/0
1  chr1  103  .   C   A    .      .    .     GT    0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_flagany(['H2'], as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
filter_indel(opposite=False, as_index=False)[source]

Remove rows with an indel.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'C,AT', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '1/2', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF   ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G     A    .      .    .     GT    0/1
1  chr1  101  .  CT     C    .      .    .     GT    0/1
2  chr1  102  .   A  C,AT    .      .    .     GT    1/2
3  chr1  103  .   C     A    .      .    .     GT    0/1

We can remove rows with an indel:

>>> vf.filter_indel().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  103  .   C   A    .      .    .     GT    0/1

We can also select those rows:

>>> vf.filter_indel(opposite=True).df
  CHROM  POS ID REF   ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .  CT     C    .      .    .     GT    0/1
1  chr1  102  .   A  C,AT    .      .    .     GT    1/2

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_indel(as_index=True)
0     True
1    False
2    False
3     True
dtype: bool
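
As a rough plain-Python sketch of the rule the example above follows (not fuc's actual implementation), a row can be treated as containing an indel when REF or any comma-separated ALT allele is longer than one base. The helper has_indel is hypothetical, and symbolic alleles are not considered.

```python
def has_indel(ref, alt):
    """Return True if REF or any comma-separated ALT allele is longer than one base."""
    alleles = [ref] + alt.split(',')
    return any(len(a) > 1 for a in alleles)

# (REF, ALT) pairs taken from the example data above.
records = [('G', 'A'), ('CT', 'C'), ('A', 'C,AT'), ('C', 'A')]

# True marks rows that are kept, i.e. rows without an indel.
keep = [not has_indel(ref, alt) for ref, alt in records]
print(keep)  # [True, False, False, True]
```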
filter_multialt(opposite=False, as_index=False)[source]

Remove rows with multiple ALT alleles.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C,T', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/2', '0/0', '0/1', './.'],
...     'B': ['0/1', '0/1', './.', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  101  .   A    T    .      .    .     GT  0/0  0/1
2  chr1  102  .   C    G    .      .    .     GT  0/1  ./.
3  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2

We can remove rows with multiple ALT alleles:

>>> vf.filter_multialt().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  0/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.

We can also select those rows:

>>> vf.filter_multialt(opposite=True).df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_multialt(as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
filter_pass(opposite=False, as_index=False)[source]

Select rows with PASS in the FILTER field.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['PASS', 'FAIL', 'PASS', 'FAIL'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', './.', '0/1', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .   PASS    .     GT    0/0
1  chr1  101  .   T   C    .   FAIL    .     GT    ./.
2  chr1  102  .   A   T    .   PASS    .     GT    0/1
3  chr1  103  .   C   A    .   FAIL    .     GT    ./.

We can select rows with PASS:

>>> vf.filter_pass().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .   PASS    .     GT    0/0
1  chr1  102  .   A   T    .   PASS    .     GT    0/1

We can also remove those rows:

>>> vf.filter_pass(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .   T   C    .   FAIL    .     GT    ./.
1  chr1  103  .   C   A    .   FAIL    .     GT    ./.

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_pass(as_index=True)
0     True
1    False
2     True
3    False
dtype: bool
filter_phased(opposite=False, as_index=False)[source]

Remove rows with phased genotypes.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'CT', 'A', 'C'],
...     'ALT': ['A', 'C', 'C', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['1|0', '0/1', '0/1', '0|1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    1|0
1  chr1  101  .  CT   C    .      .    .     GT    0/1
2  chr1  102  .   A   C    .      .    .     GT    0/1
3  chr1  103  .   C   A    .      .    .     GT    0|1

We can remove rows with a phased genotype:

>>> vf.filter_phased().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .  CT   C    .      .    .     GT    0/1
1  chr1  102  .   A   C    .      .    .     GT    0/1

We can also select those rows:

>>> vf.filter_phased(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    1|0
1  chr1  103  .   C   A    .      .    .     GT    0|1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_phased(as_index=True)
0    False
1     True
2     True
3    False
dtype: bool
filter_polyp(opposite=False, as_index=False)[source]

Remove rows with a polyploid genotype call.

Parameters
  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2', 'chr2'],
...     'POS': [100, 100, 200, 200],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C'],
...     'ALT': ['C', 'T', 'G', 'G'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0/1', '0/0', '1/1/1', './.'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   C    .      .    .     GT  0/0/1
1  chr1  100  .   A   T    .      .    .     GT    0/0
2  chr2  200  .   C   G    .      .    .     GT  1/1/1
3  chr2  200  .   C   G    .      .    .     GT    ./.

We can remove rows with a polyploid genotype call:

>>> vf.filter_polyp().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   T    .      .    .     GT    0/0
1  chr2  200  .   C   G    .      .    .     GT    ./.

We can also select those rows:

>>> vf.filter_polyp(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   A   C    .      .    .     GT  0/0/1
1  chr2  200  .   C   G    .      .    .     GT  1/1/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_polyp(as_index=True)
0    False
1     True
2    False
3     True
dtype: bool
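
One way to picture the polyploid check, as a sketch under stated assumptions rather than fuc's own code, is to count the alleles in the GT subfield and keep rows with at most two. The helper ploidy is hypothetical and assumes alleles are separated by ‘/’ (unphased) or ‘|’ (phased).

```python
import re

def ploidy(genotype):
    """Count alleles in the GT subfield, splitting on '/' or '|'."""
    gt = genotype.split(':')[0]
    return len(re.split(r'[/|]', gt))

# Genotypes of the single sample in the example data above.
genotypes = ['0/0/1', '0/0', '1/1/1', './.']

# True marks rows that are kept, i.e. rows with at most two alleles.
keep = [ploidy(g) <= 2 for g in genotypes]
print(keep)  # [False, True, False, True]
```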
filter_qual(threshold, opposite=False, as_index=False)[source]

Select rows with minimum QUAL value.

Parameters
  • threshold (float) – Minimum QUAL value.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C', 'C'],
...     'ALT': ['A', 'C', 'T', 'A', 'T'],
...     'QUAL': ['.', 30, 19, 41, 29],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '0/1', '0/1', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C   30      .    .     GT    1/1
2  chr1  102  .   A   T   19      .    .     GT    0/1
3  chr1  103  .   C   A   41      .    .     GT    0/1
4  chr1  104  .   C   T   29      .    .     GT    1/1

We can select rows with minimum QUAL value of 30:

>>> vf.filter_qual(30).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  101  .   T   C   30      .    .     GT    1/1
1  chr1  103  .   C   A   41      .    .     GT    0/1

We can also remove those rows:

>>> vf.filter_qual(30, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  102  .   A   T   19      .    .     GT    0/1
2  chr1  104  .   C   T   29      .    .     GT    1/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_qual(30, as_index=True)
0    False
1     True
2    False
3     True
4    False
dtype: bool
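
The example above shows two details worth spelling out: the threshold is inclusive (a QUAL of 30 passes a threshold of 30), and a missing QUAL (‘.’) never passes. The following plain-Python sketch mirrors that behavior; passes_qual is a hypothetical helper, not part of fuc.

```python
def passes_qual(qual, threshold):
    """Return True if QUAL is present and at least the threshold."""
    if qual == '.':
        # A missing QUAL cannot be compared, so the row is not selected.
        return False
    return float(qual) >= threshold

quals = ['.', 30, 19, 41, 29]
keep = [passes_qual(q, 30) for q in quals]
print(keep)  # [False, True, False, True, False]
```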
filter_sampall(samples=None, opposite=False, as_index=False)[source]

Select rows if all of the given samples have the variant.

The default behavior is to use all samples in the VcfFrame.

Parameters
  • samples (list, optional) – List of sample names or indices.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_sampany

Similar method that selects rows if any one of the given samples has the variant.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/0', '0/1', '0/1'],
...     'Sara': ['0/1', '0/1', '0/0', '0/1'],
...     'James': ['0/1', '0/1', '0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/1
3  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can select rows where all three samples have the variant:

>>> vf.filter_sampall().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can also remove those rows:

>>> vf.filter_sampall(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
1  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/1

We can select rows where both Sara and James have the variant:

>>> vf.filter_sampall(samples=['Sara', 'James']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  103  .   T   C    .      .    .     GT    0/1  0/1   0/1

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampall(as_index=True)
0     True
1    False
2    False
3     True
dtype: bool
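
The row-wise logic can be sketched in plain Python without fuc. The sketch assumes that a sample "has the variant" when its GT subfield contains at least one allele other than ‘0’ or ‘.’; the helpers has_variant and select are hypothetical and only illustrate the all/any distinction.

```python
import re

def has_variant(genotype):
    """Return True if the GT subfield contains at least one non-reference allele."""
    gt = genotype.split(':')[0]
    return any(a not in ('0', '.') for a in re.split(r'[/|]', gt))

# Sample columns from the example data above.
samples = {
    'Steven': ['0/1', '0/0', '0/1', '0/1'],
    'Sara': ['0/1', '0/1', '0/0', '0/1'],
    'James': ['0/1', '0/1', '0/1', '0/1'],
}

def select(names, combine):
    """Apply all() or any() across the chosen samples, row by row."""
    columns = [samples[n] for n in names]
    return [combine(has_variant(g) for g in row) for row in zip(*columns)]

print(select(['Steven', 'Sara', 'James'], all))  # [True, False, False, True]
print(select(['Sara', 'James'], all))            # [True, True, False, True]
print(select(['Steven', 'Sara', 'James'], any))  # [True, True, True, True]
```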
filter_sampany(samples=None, opposite=False, as_index=False)[source]

Select rows if any one of the given samples has the variant.

The default behavior is to use all samples in the VcfFrame.

Parameters
  • samples (list, optional) – List of sample names or indices.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

See also

VcfFrame.filter_sampall

Similar method that selects rows if all of the given samples have the variant.

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/0', '0/0', '0/1', '0/0'],
...     'Sara': ['0/0', '0/1', '0/0', '0/0'],
...     'James': ['0/1', '0/0', '0/0', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0
3  chr1  103  .   T   C    .      .    .     GT    0/0  0/0   0/0

We can select rows where at least one sample has the variant:

>>> vf.filter_sampany().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

We can also remove those rows:

>>> vf.filter_sampany(opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  103  .   T   C    .      .    .     GT    0/0  0/0   0/0

We can select rows where either Sara or James has the variant:

>>> vf.filter_sampany(samples=['Sara', 'James']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampany(as_index=True)
0     True
1     True
2     True
3    False
dtype: bool
filter_sampnum(threshold, opposite=False, as_index=False)[source]

Select rows if the variant is present in at least the given number or fraction of samples.

Parameters
  • threshold (int or float) – Minimum number or fraction of samples with the variant.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '0/1'],
...     'Sara': ['0/0', '0/1', '0/0'],
...     'James': ['0/1', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

We can select rows where at least two samples have the variant:

>>> vf.filter_sampnum(2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1

Similarly, we can select rows where at least 50% of the samples have the variant:

>>> vf.filter_sampnum(0.5).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/0   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  0/1   0/1

We can also remove those rows:

>>> vf.filter_sampnum(0.5, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  102  .   T   A    .      .    .     GT    0/1  0/0   0/0

Finally, we can return a boolean index array from the filtering:

>>> vf.filter_sampnum(2, as_index=True)
0     True
1     True
2    False
dtype: bool
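
How an integer threshold differs from a fractional one can be sketched as follows. This is an illustration under assumptions, not fuc's implementation: the helpers has_variant and enough_samples are hypothetical, an int is read as a minimum count, and a float is read as a minimum fraction of all samples.

```python
import re

def has_variant(genotype):
    """Return True if the GT subfield contains at least one non-reference allele."""
    gt = genotype.split(':')[0]
    return any(a not in ('0', '.') for a in re.split(r'[/|]', gt))

def enough_samples(genotypes, threshold):
    """Interpret an int threshold as a count and a float threshold as a fraction."""
    n = sum(has_variant(g) for g in genotypes)
    if isinstance(threshold, int):
        return n >= threshold
    return n / len(genotypes) >= threshold

# Rows from the example data above (Steven, Sara, James).
rows = [
    ['0/1', '0/0', '0/1'],
    ['0/1', '0/1', '0/1'],
    ['0/1', '0/0', '0/0'],
]

by_count = [enough_samples(r, 2) for r in rows]
by_fraction = [enough_samples(r, 0.5) for r in rows]
print(by_count)     # [True, True, False]
print(by_fraction)  # [True, True, False]
```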
filter_vcf(vcf, opposite=False, as_index=False)[source]

Select rows that overlap with the other VCF.

Parameters
  • vcf (VcfFrame or str) – VcfFrame or VCF file.

  • opposite (bool, default: False) – If True, return rows that don’t meet the said criteria.

  • as_index (bool, default: False) – If True, return boolean index array instead of VcfFrame.

Returns

Filtered VcfFrame or boolean index array.

Return type

VcfFrame or pandas.Series

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1', 'chr4', 'chr8', 'chr8'],
...     'POS': [100, 203, 192, 52, 788],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['A', 'C', 'T', 'T', 'GA'],
...     'ALT': ['C', 'G', 'A', 'G', 'G'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/1', '0/1', '0/1', '0/1'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   A   C    .      .    .     GT  0/1
1  chr1  203  .   C   G    .      .    .     GT  0/1
2  chr4  192  .   T   A    .      .    .     GT  0/1
3  chr8   52  .   T   G    .      .    .     GT  0/1
4  chr8  788  .  GA   G    .      .    .     GT  0/1
>>> data2 = {
...     'CHROM': ['chr1', 'chr8'],
...     'POS': [100, 788],
...     'ID': ['.', '.'],
...     'REF': ['A', 'GA'],
...     'ALT': ['C', 'G'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
... }
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO
0  chr1  100  .   A   C    .      .    .
1  chr8  788  .  GA   G    .      .    .

We can select rows that overlap with the VCF data:

>>> vf1.filter_vcf(vf2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   A   C    .      .    .     GT  0/1
1  chr8  788  .  GA   G    .      .    .     GT  0/1

We can also remove those rows:

>>> vf1.filter_vcf(vf2, opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  203  .   C   G    .      .    .     GT  0/1
1  chr4  192  .   T   A    .      .    .     GT  0/1
2  chr8   52  .   T   G    .      .    .     GT  0/1

Finally, we can return a boolean index array from the filtering:

>>> vf1.filter_vcf(vf2, as_index=True)
0     True
1    False
2    False
3    False
4     True
dtype: bool
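
Conceptually, the selection above behaves like a set-membership test. The sketch below assumes that two rows "overlap" when their CHROM, POS, REF, and ALT values are identical; this is an assumption inferred from the example, and the code is plain Python rather than fuc's implementation.

```python
# (CHROM, POS, REF, ALT) tuples from the two example VcfFrames above.
rows1 = [
    ('chr1', 100, 'A', 'C'),
    ('chr1', 203, 'C', 'G'),
    ('chr4', 192, 'T', 'A'),
    ('chr8', 52, 'T', 'G'),
    ('chr8', 788, 'GA', 'G'),
]
rows2 = [('chr1', 100, 'A', 'C'), ('chr8', 788, 'GA', 'G')]

# Build a lookup set once, then test each row of the first table against it.
other = set(rows2)
overlap = [row in other for row in rows1]
print(overlap)  # [True, False, False, False, True]
```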
classmethod from_dict(meta, data)[source]

Construct VcfFrame from a dict of array-like or dicts.

Parameters
  • meta (list) – List of the metadata lines.

  • data (dict) – Of the form {field : array-like} or {field : dict}.

Returns

VcfFrame.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

Below is a simple example:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr2  101  .   T   C    .      .    .     GT  1/1
classmethod from_file(fn, compression=False, meta_only=False, regions=None)[source]

Construct VcfFrame from a VCF file.

The method will automatically use BGZF decompression if the filename ends with ‘.gz’.

If the file is large, you can specify regions of interest to speed up data processing. Note that this requires the file to be BGZF compressed and indexed (.tbi) for random access. Each region to be sliced must have the format chrom:start-end and be a half-open interval with (start, end]. This means, for example, ‘chr1:100-103’ will extract positions 101, 102, and 103. Alternatively, you can provide BED data to specify regions.

Parameters
  • fn (str or file-like object) – VCF file (zipped or unzipped). By file-like object, we refer to objects with a read() method, such as a file handle.

  • compression (bool, default: False) – If True, use BGZF decompression regardless of the filename.

  • meta_only (bool, default: False) – If True, only read metadata and header lines.

  • regions (str, list, or pybed.BedFrame, optional) – Region or list of regions to be sliced. Also accepts a BED file or a BedFrame.

Returns

VcfFrame object.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

VcfFrame.from_string

Construct VcfFrame from a string.

Examples

>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('unzipped.vcf')
>>> vf = pyvcf.VcfFrame.from_file('zipped.vcf.gz')
>>> vf = pyvcf.VcfFrame.from_file('zipped.vcf', compression=True)
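
The half-open (start, end] convention for region strings can be checked with a small plain-Python sketch. The helper positions_in_region is hypothetical and is not part of fuc; it only expands a region string into the positions it covers under that convention.

```python
def positions_in_region(region):
    """Expand 'chrom:start-end' into (chrom, positions) under the (start, end] convention."""
    chrom, interval = region.rsplit(':', 1)
    start, end = (int(x) for x in interval.split('-'))
    # start is excluded and end is included.
    return chrom, list(range(start + 1, end + 1))

print(positions_in_region('chr1:100-103'))  # ('chr1', [101, 102, 103])
```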
classmethod from_string(s, meta_only=False)[source]

Construct VcfFrame from a string.

Parameters
  • s (str) – String representation of a VCF file.

  • meta_only (bool, default: False) – If True, only read metadata and header lines.

Returns

VcfFrame object.

Return type

VcfFrame

See also

VcfFrame

VcfFrame object creation using constructor.

VcfFrame.from_file

Construct VcfFrame from a VCF file.

VcfFrame.from_dict

Construct VcfFrame from a dict of array-like or dicts.

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict(['##fileformat=VCFv4.3'], data)
>>> s = vf.to_string()
>>> print(s[:20])
##fileformat=VCFv4.3
>>> vf = pyvcf.VcfFrame.from_string(s)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1
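
For orientation, the text form of a VCF has three parts: metadata lines starting with ‘##’, one tab-separated header line starting with ‘#CHROM’, and tab-separated data lines. The following sketch separates them in plain Python; split_vcf_text is a hypothetical helper for illustration and does not reflect how fuc parses the string internally.

```python
def split_vcf_text(text):
    """Separate metadata lines, the column header, and data records."""
    meta, header, records = [], None, []
    for line in text.strip().split('\n'):
        if line.startswith('##'):
            meta.append(line)
        elif line.startswith('#'):
            # Drop the leading '#' so the first column is named CHROM.
            header = line[1:].split('\t')
        else:
            records.append(line.split('\t'))
    return meta, header, records

text = (
    '##fileformat=VCFv4.3\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\n'
    'chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\n'
    'chr1\t101\t.\tT\tC\t.\t.\t.\tGT\t0/1\n'
)
meta, header, records = split_vcf_text(text)
print(meta)          # ['##fileformat=VCFv4.3']
print(header[:2])    # ['CHROM', 'POS']
print(len(records))  # 2
```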
get_af(sample, variant)[source]

Get allele fraction for a pair of sample and variant.

The method will return numpy.nan if the value is missing.

Parameters
  • sample (str) – Sample name.

  • variant (str) – Variant name (e.g. ‘chr1-100-A-C’).

Returns

Allele fraction.

Return type

float

Examples

>>> from fuc import pyvcf, common
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'G', 'A', 'C'],
...     'ALT': ['C', 'T', 'G', 'G,A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:AD:AF', 'GT:AD:AF', 'GT:AF', 'GT:AD:AF'],
...     'A': ['0/1:12,15:0.444,0.556', '0/0:32,1:0.970,0.030', '0/1:.', './.:.:.'],
...     'B': ['0/1:13,17:0.433,0.567', '0/1:14,15:0.483,0.517', './.:.', '1/2:0,11,17:0.000,0.393,0.607'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO    FORMAT                      A                              B
0  chr1  100  .   A    C    .      .    .  GT:AD:AF  0/1:12,15:0.444,0.556          0/1:13,17:0.433,0.567
1  chr1  101  .   G    T    .      .    .  GT:AD:AF   0/0:32,1:0.970,0.030          0/1:14,15:0.483,0.517
2  chr1  102  .   A    G    .      .    .     GT:AF                  0/1:.                          ./.:.
3  chr1  103  .   C  G,A    .      .    .  GT:AD:AF                ./.:.:.  1/2:0,11,17:0.000,0.393,0.607
>>> vf.get_af('A', 'chr1-100-A-C')
0.556
>>> vf.get_af('B', 'chr1-102-A-G')
nan
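
The lookup behind these two results can be sketched in plain Python. The sketch assumes, as in the example data, that the AF subfield lists one value per allele with the REF allele first; allele_fraction is a hypothetical helper, not fuc's implementation.

```python
import math

def allele_fraction(fmt, genotype, allele_index):
    """Look up the AF subfield and return the value for the given allele index."""
    keys = fmt.split(':')
    values = genotype.split(':')
    if 'AF' not in keys:
        return math.nan
    field = values[keys.index('AF')]
    if field == '.':
        # A missing AF value is reported as NaN.
        return math.nan
    return float(field.split(',')[allele_index])

# REF is allele 0, so the first ALT allele has index 1.
print(allele_fraction('GT:AD:AF', '0/1:12,15:0.444,0.556', 1))  # 0.556
print(allele_fraction('GT:AF', './.:.', 1))                      # nan
```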
property has_chr_prefix

Whether the (annoying) ‘chr’ string is found.

Type

bool

markmiss(expr, greedy=False, opposite=False, samples=None, as_nan=False)[source]

Mark all genotypes that satisfy the query expression as missing.

Parameters
  • expr (str) – The expression to evaluate. See the examples below for details.

  • greedy (bool, default: False) – If True, mark even ambiguous genotypes as missing.

  • opposite (bool, default: False) – If True, mark all genotypes that do not satisfy the query expression as missing and leave those that do intact.

  • samples (list, optional) – If provided, apply the marking only to these samples.

  • as_nan (bool, default: False) – If True, mark genotypes as NaN instead of as missing.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'G'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP:AD', 'GT:DP:AD', 'GT:DP:AD'],
...     'A': ['0/0:26:0,26', '0/1:32:16,16', '0/0:.:.'],
...     'B': ['./.:.:.', '0/0:31:29,2', './.:.:.'],
...     'C': ['0/1:18:12,6', '0/0:24:24,0', '1/1:8:0,8'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.    1/1:8:0,8

To mark as missing all genotypes with 0/0:

>>> vf.markmiss('GT == "0/0"').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.      ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.    1/1:8:0,8

To mark as missing all genotypes that do not have 0/0:

>>> vf.markmiss('GT != "0/0"').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT            A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD  0/0:26:0,26      ./.:.:.      ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD      ./.:.:.  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD      0/0:.:.      ./.:.:.      ./.:.:.

To mark as missing all genotypes whose DP is below 30:

>>> vf.markmiss('DP < 30').df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.  ./.:.:.

Note that the genotype 0/0:.:. was not marked as missing because its DP is missing and therefore the expression could not be evaluated for it. To mark even ambiguous genotypes like this one as missing, set greedy to True:

>>> vf.markmiss('DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10:

>>> vf.markmiss('AD[1] < 10', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26  ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10 and whose DP is below 30:

>>> vf.markmiss('AD[1] < 10 and DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.  ./.:.:.

To mark as missing all genotypes whose ALT allele has read depth below 10 or whose DP is below 30:

>>> vf.markmiss('AD[1] < 10 or DP < 30', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A        B        C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  ./.:.:.  ./.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.  ./.:.:.  ./.:.:.

To retain only genotypes whose ALT allele has read depth below 10 or whose DP is below 30:

>>> vf.markmiss('AD[1] < 10 or DP < 30', opposite=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT            A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD  0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD      ./.:.:.  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD      ./.:.:.      ./.:.:.    1/1:8:0,8

To mark as missing all genotypes whose mean of AD is below 10:

>>> vf.markmiss('np.mean(AD) < 10', greedy=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.      ./.:.:.
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.      ./.:.:.

To do the same as above, but only for the samples A and B:

>>> vf.markmiss('np.mean(AD) < 10', greedy=True, samples=['A', 'B']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       ./.:.:.      ./.:.:.    1/1:8:0,8

To mark as NaN all genotypes whose sum of AD is below 10:

>>> vf.markmiss('sum(AD) < 10', as_nan=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A            B            C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD   0/0:26:0,26      ./.:.:.  0/1:18:12,6
1  chr1  101  .   T   C    .      .    .  GT:DP:AD  0/1:32:16,16  0/0:31:29,2  0/0:24:24,0
2  chr1  102  .   T   G    .      .    .  GT:DP:AD       0/0:.:.      ./.:.:.          NaN

Marking as NaN is useful when, for example, it is necessary to count how many genotypes are marked:

>>> vf.markmiss('sum(AD) < 10', as_nan=True).df.isna().sum().sum()
1
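
The effect of greedy on genotypes with missing values can be illustrated without fuc. The following is a minimal sketch, not the library's implementation: the helper names parse_genotype and should_mark are hypothetical, and only a simple "field below threshold" test on a scalar FORMAT field is handled.

```python
def parse_genotype(fmt, field):
    """Split a sample field such as '0/0:26:0,26' into a dict keyed by FORMAT."""
    return dict(zip(fmt.split(':'), field.split(':')))


def should_mark(fmt, field, key, threshold, greedy=False):
    """Return True if the genotype should be marked as missing.

    A genotype whose value for `key` is '.' cannot be evaluated, so it is
    marked only when `greedy` is True (assumed behaviour, for illustration).
    """
    value = parse_genotype(fmt, field).get(key, '.')
    if value == '.':
        return greedy
    return float(value) < threshold


fmt = 'GT:DP:AD'
print(should_mark(fmt, '0/1:18:12,6', 'DP', 30))           # True
print(should_mark(fmt, '0/0:.:.', 'DP', 30))               # False
print(should_mark(fmt, '0/0:.:.', 'DP', 30, greedy=True))  # True
```

In this sketch the ambiguous genotype 0/0:.:. is left alone by default and marked only when greedy=True, which mirrors the DP < 30 examples above.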
merge(other, how='inner', format='GT', sort=True, collapse=False)[source]

Merge with the other VcfFrame.

Parameters
  • other (VcfFrame) – Other VcfFrame.

  • how (str, default: ‘inner’) – Type of merge as defined in pandas.DataFrame.merge.

  • format (str, default: ‘GT’) – FORMAT subfields to be retained (e.g. ‘GT:AD:DP’).

  • sort (bool, default: True) – If True, sort the VcfFrame before returning.

  • collapse (bool, default: False) – If True, collapse duplicate records.

Returns

Merged VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP'],
...     'A': ['0/0:32', '0/1:29'],
...     'B': ['0/1:24', '1/1:30'],
... }
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr2'],
...     'POS': [100, 101, 200],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP'],
...     'C': ['./.:.', '0/0:24', '0/0:26'],
...     'D': ['0/1:24', '0/1:31', '0/1:26'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       A       B
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       C       D
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP  0/0:26  0/1:26

We can merge the two VcfFrames with how='inner' (default):

>>> vf1.merge(vf2).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/1  ./.  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/0  0/1

We can also merge with how='outer':

>>> vf1.merge(vf2, how='outer').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/1  ./.  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/0  0/1
2  chr2  200  .   A   T    .      .    .     GT  ./.  ./.  0/0  0/1

Since both VcfFrames have the DP subfield, we can use format='GT:DP':

>>> vf1.merge(vf2, how='outer', format='GT:DP').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       A       B       C       D
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP   ./.:.   ./.:.  0/0:26  0/1:26
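
The difference between how='inner' and how='outer' can be sketched in plain Python. This is an illustration under stated assumptions, not how fuc performs the merge: variants are keyed by a hypothetical (CHROM, POS, REF, ALT) tuple, only the GT subfield is kept, and the function name merge_genotypes is made up for this example.

```python
MISSING = './.'


def merge_genotypes(left, right, left_samples, right_samples, how='inner'):
    """Merge two {variant_key: {sample: GT}} mappings.

    'inner' keeps variants present in both inputs; 'outer' keeps variants
    present in either one and fills absent samples with './.'.
    """
    if how == 'inner':
        keys = [key for key in left if key in right]
    elif how == 'outer':
        keys = list(left) + [key for key in right if key not in left]
    else:
        raise ValueError(f'unsupported merge type: {how!r}')
    merged = {}
    for key in keys:
        row = {}
        for sample in left_samples:
            row[sample] = left.get(key, {}).get(sample, MISSING)
        for sample in right_samples:
            row[sample] = right.get(key, {}).get(sample, MISSING)
        merged[key] = row
    return merged


vf1 = {
    ('chr1', 100, 'G', 'A'): {'A': '0/0', 'B': '0/1'},
    ('chr1', 101, 'T', 'C'): {'A': '0/1', 'B': '1/1'},
}
vf2 = {
    ('chr1', 100, 'G', 'A'): {'C': './.', 'D': '0/1'},
    ('chr1', 101, 'T', 'C'): {'C': '0/0', 'D': '0/1'},
    ('chr2', 200, 'A', 'T'): {'C': '0/0', 'D': '0/1'},
}
inner = merge_genotypes(vf1, vf2, ['A', 'B'], ['C', 'D'])
outer = merge_genotypes(vf1, vf2, ['A', 'B'], ['C', 'D'], how='outer')
print(len(inner), len(outer))  # 2 3
print(outer[('chr2', 200, 'A', 'T')])
```

The chr2 record exists only in the second input, so the outer merge reports samples A and B as ./. for it, as in the how='outer' table above.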
property meta

List of metadata lines.

Type

list

meta_keys()[source]

Print metadata lines with a key.

miss2ref()[source]

Convert missing genotype (./.) to homozygous REF (0/0).

Returns

VcfFrame object.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['./.', '1/1'],
...     'B': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  ./.  ./.
1  chr2  101  .   T   C    .      .    .     GT  1/1  ./.
>>> new_vf = vf.miss2ref()
>>> new_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/0
1  chr2  101  .   T   C    .      .    .     GT  1/1  0/0
plot_comparison(a, b, c=None, labels=None, ax=None, figsize=None)[source]

Create a Venn diagram showing genotype concordance between groups.

This method supports comparison between two groups (Groups A & B) as well as three groups (Groups A, B, & C).

Parameters
  • a, b (list) – Sample names. The lists must have the same shape.

  • c (list, optional) – Same as above.

  • labels (list, optional) – List of labels to be displayed.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

Returns

  • matplotlib.axes.Axes – The matplotlib axes containing the plot.

  • matplotlib_venn._common.VennDiagram – VennDiagram object.

Examples

>>> from fuc import pyvcf, common
>>> common.load_dataset('pyvcf')
>>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
>>> vf = pyvcf.VcfFrame.from_file(f)
>>> a = ['Steven_A', 'John_A', 'Sara_A']
>>> b = ['Steven_B', 'John_B', 'Sara_B']
>>> c = ['Steven_C', 'John_C', 'Sara_C']
>>> vf.plot_comparison(a, b)
_images/api-43.png
>>> vf.plot_comparison(a, b, c)
_images/api-44.png
plot_hist_format(k, af=None, group_col=None, group_order=None, kde=True, ax=None, figsize=None, **kwargs)[source]

Create a histogram showing the distribution of data for the specified FORMAT key.

Parameters
  • k (str) – One of the special FORMAT keys as defined in VcfFrame.extract_format().

  • af (common.AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • kde (bool, default: True) – Compute a kernel density estimate to smooth the distribution.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.histplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/normal-tumor.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_hist_format('#DP')
_images/api-45.png

We can draw multiple histograms with hue mapping:

>>> annot_file = '~/fuc-data/pyvcf/normal-tumor-annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col='Sample')
>>> vf.plot_hist_format('#DP', af=af, group_col='Tissue')
_images/api-46.png

We can show the REF allele fraction instead of DP:

>>> vf.plot_hist_format('#AD_FRAC_REF')
_images/api-47.png
plot_hist_info(k, kde=True, ax=None, figsize=None, **kwargs)[source]

Create a histogram showing the distribution of data for the specified INFO key.

Parameters
  • k (str) – One of the special INFO keys as defined in VcfFrame.extract_info().

  • kde (bool, default: True) – Compute a kernel density estimate to smooth the distribution.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.histplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/getrm-cyp2d6-vdr.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_hist_info('#AC')
_images/api-48.png

We can show AF instead of AC:

>>> vf.plot_hist_info('#AF')
_images/api-49.png
plot_rainfall(sample, palette=None, ax=None, figsize=None, legend='auto', **kwargs)[source]

Create a rainfall plot visualizing inter-variant distance on a linear genomic scale for a single sample.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_rainfall() method.

Parameters
  • sample (str) – Name of the sample.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.scatterplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_rainfall

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('brca')
>>> vcf_file = '~/fuc-data/brca/brca.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_rainfall('TCGA-A8-A08B',
...                  figsize=(14, 7),
...                  palette=sns.color_palette('Set2')[:6])
>>> plt.tight_layout()
_images/api-50.png
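
The quantity shown in a rainfall plot, the distance from each variant to the previous one on the same chromosome, can be sketched as follows. This is an illustration only: the function name intervariant_distances is hypothetical, and the exact way fuc computes and scales the distances is not described here.

```python
def intervariant_distances(variants):
    """Return (chrom, pos, distance to previous variant) tuples.

    `variants` is an iterable of (chrom, pos) pairs. The first variant on
    each chromosome has no predecessor and is therefore skipped.
    """
    by_chrom = {}
    for chrom, pos in variants:
        by_chrom.setdefault(chrom, []).append(pos)
    result = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for previous, current in zip(positions, positions[1:]):
            result.append((chrom, current, current - previous))
    return result


variants = [('chr1', 100), ('chr1', 1150), ('chr1', 150), ('chr2', 10)]
print(intervariant_distances(variants))
# [('chr1', 150, 50), ('chr1', 1150, 1000)]
```

Clusters of closely spaced variants show up as points with small distances, which is what makes such plots useful for spotting localized hypermutation.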
plot_region(sample, k='#DP', color=None, region=None, label=None, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot showing read depth profile of a sample for the specified region.

Parameters
  • sample (str or int) – Name or index of target sample.

  • k (str, default: ‘#DP’) – Genotype key to use for extracting data:

    • ‘#DP’: Return read depth.

    • ‘#AD_REF’: Return REF allele depth.

    • ‘#AD_ALT’: Return ALT allele depth.

    • ‘#AD_FRAC_REF’: Return REF allele fraction.

    • ‘#AD_FRAC_ALT’: Return ALT allele fraction.

  • color (str, optional) – Marker color.

  • region (str, optional) – Target region (‘chrom:start-end’).

  • label (str, optional) – Label to use for the data points.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to matplotlib.axes.Axes.scatter().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import pyvcf, common
>>> import matplotlib.pyplot as plt
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/getrm-cyp2d6-vdr.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_region('NA18973')
>>> plt.tight_layout()
_images/api-51.png

We can display allele fraction of REF and ALT instead of DP:

>>> ax = vf.plot_region('NA18973', k='#AD_FRAC_REF', label='REF')
>>> vf.plot_region('NA18973', k='#AD_FRAC_ALT', label='ALT', ax=ax)
>>> plt.tight_layout()
_images/api-52.png
plot_regplot(a, b, ax=None, figsize=None, **kwargs)[source]

Create a scatter plot with a linear regression model fit visualizing correlation between TMB in two sample groups.

The method will automatically calculate and print summary statistics including R-squared and p-value.

Parameters
  • a, b (array-like) – Lists of sample names. The lists must have the same shape.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.regplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/normal-tumor.vcf'
>>> annot_file = '~/fuc-data/pyvcf/normal-tumor-annot.tsv'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> af = common.AnnFrame.from_file(annot_file, sample_col='Sample')
>>> normal = af.df[af.df.Tissue == 'Normal'].index
>>> normal.name = 'Normal'
>>> tumor = af.df[af.df.Tissue == 'Tumor'].index
>>> tumor.name = 'Tumor'
>>> vf.plot_regplot(normal, tumor)
Results for B ~ A:
R^2 = 0.01
P = 7.17e-01
>>> plt.tight_layout()
_images/api-53.png
plot_snvclsc(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot summarizing the count distributions of the six SNV classes for all samples.

A grouped bar plot can be created with group_col (requires an AnnFrame).

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclsc() method.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.barplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclsc

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_snvclsc(palette=sns.color_palette('Pastel1'))
>>> plt.tight_layout()
_images/api-54.png

We can create a grouped bar plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_snvclsc(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-55.png
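
For readers unfamiliar with the six SNV classes, the sketch below shows the commonly used convention of reporting each substitution from the pyrimidine (C or T) side, which folds the twelve possible substitutions into C>A, C>G, C>T, T>A, T>C, and T>G. That fuc uses exactly this convention is an assumption, and the function name snv_class is hypothetical.

```python
from collections import Counter

COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def snv_class(ref, alt):
    """Return one of the six SNV classes, or None if the variant is not an SNV."""
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        return None
    if ref not in COMPLEMENT or alt not in COMPLEMENT:
        return None
    if ref in ('A', 'G'):
        # Purine reference: report the substitution on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f'{ref}>{alt}'


variants = [('G', 'A'), ('T', 'C'), ('A', 'C'), ('C', 'T'), ('AT', 'A')]
counts = Counter(c for c in (snv_class(r, a) for r, a in variants) if c)
print(counts['C>T'], counts['T>C'], counts['T>G'])  # 2 1 1
```

Per-sample counts of this kind are what a count-distribution bar plot would summarize; the indel ('AT', 'A') is ignored because it is not an SNV.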
plot_snvclsp(af=None, group_col=None, group_order=None, palette=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot summarizing the proportion distributions of the six SNV classes for all samples.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclsp() method.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • palette (str, optional) – Name of the seaborn palette. See the Control plot colors tutorial for details.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclsp

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_snvclsp(palette=sns.color_palette('Pastel1'))
>>> plt.tight_layout()
_images/api-56.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_snvclsp(af=af,
...                 group_col='FAB_classification',
...                 group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-57.png
plot_snvclss(color=None, colormap=None, width=0.8, legend=True, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a bar plot showing the proportions of the six SNV classes for individual samples.

Under the hood, this method simply converts the VcfFrame to the fuc.api.pymaf.MafFrame class and then applies the fuc.api.pymaf.MafFrame.plot_snvclss() method.

Parameters
  • color (list, optional) – List of color tuples. See the Control plot colors tutorial for details.

  • colormap (str or matplotlib colormap object, optional) – Colormap to select colors from. See the Control plot colors tutorial for details.

  • width (float, default: 0.8) – The width of the bars.

  • legend (bool, default: True) – Place legend on axis subplots.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to pandas.DataFrame.plot.bar() or pandas.DataFrame.plot.barh().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_snvclss

Similar method for the fuc.api.pymaf.MafFrame() class.

Examples

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> ax = vf.plot_snvclss(width=1, color=plt.get_cmap('Pastel1').colors)
>>> ax.legend(loc='upper right')
>>> plt.tight_layout()
_images/api-58.png
plot_titv(af=None, group_col=None, group_order=None, flip=False, ax=None, figsize=None, **kwargs)[source]

Create a box plot showing the Ti/Tv proportions of samples.

Under the hood, this method simply converts the VcfFrame to the pymaf.MafFrame class and then applies the pymaf.MafFrame.plot_titv() method.

Parameters
  • af (AnnFrame, optional) – AnnFrame containing sample annotation data.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • flip (bool, default: False) – If True, flip the x and y axes.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.boxplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

See also

fuc.api.pymaf.MafFrame.plot_titv

Similar method for the fuc.api.pymaf.MafFrame class.

Examples

Below is a simple example:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('tcga-laml')
>>> vcf_file = '~/fuc-data/tcga-laml/tcga_laml.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_titv()
>>> plt.tight_layout()
_images/api-59.png

We can create a grouped box plot based on FAB classification:

>>> annot_file = '~/fuc-data/tcga-laml/tcga_laml_annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col=0)
>>> vf.plot_titv(af=af,
...              group_col='FAB_classification',
...              group_order=['M0', 'M1', 'M2'])
>>> plt.tight_layout()
_images/api-60.png
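
As a reminder of what the Ti/Tv proportions measure: a transition (Ti) swaps two purines (A, G) or two pyrimidines (C, T), while a transversion (Tv) swaps a purine for a pyrimidine or vice versa. The sketch below computes the two proportions for one sample; the function names are hypothetical and this is not fuc's implementation.

```python
PURINES = frozenset('AG')
PYRIMIDINES = frozenset('CT')


def titv_class(ref, alt):
    """Return 'Ti', 'Tv', or None if the variant is not an SNV."""
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        return None
    bases = {ref, alt}
    if bases <= PURINES or bases <= PYRIMIDINES:
        return 'Ti'
    if bases <= PURINES | PYRIMIDINES:
        return 'Tv'
    return None


def titv_proportions(snvs):
    """Return the Ti and Tv proportions for a list of (ref, alt) pairs."""
    classes = [titv_class(ref, alt) for ref, alt in snvs]
    ti, tv = classes.count('Ti'), classes.count('Tv')
    total = ti + tv
    if total == 0:
        return None
    return {'Ti': ti / total, 'Tv': tv / total}


print(titv_proportions([('G', 'A'), ('T', 'C'), ('T', 'G'), ('A', 'C')]))
# {'Ti': 0.5, 'Tv': 0.5}
```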
plot_tmb(af=None, group_col=None, group_order=None, kde=True, ax=None, figsize=None, **kwargs)[source]

Create a histogram showing TMB distribution.

Parameters
  • af (common.AnnFrame, optional) – AnnFrame containing sample annotation data. Required when group_col is used.

  • group_col (str, optional) – AnnFrame column containing sample group information.

  • group_order (list, optional) – List of sample group names.

  • kde (bool, default: True) – Compute a kernel density estimate to smooth the distribution.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. Otherwise, create a new one.

  • figsize (tuple, optional) – Width, height in inches. Format: (float, float).

  • kwargs – Other keyword arguments will be passed down to seaborn.histplot().

Returns

The matplotlib axes containing the plot.

Return type

matplotlib.axes.Axes

Examples

Below is a simple example:

>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vcf_file = '~/fuc-data/pyvcf/normal-tumor.vcf'
>>> vf = pyvcf.VcfFrame.from_file(vcf_file)
>>> vf.plot_tmb()
_images/api-61.png

We can draw multiple histograms with hue mapping:

>>> annot_file = '~/fuc-data/pyvcf/normal-tumor-annot.tsv'
>>> af = common.AnnFrame.from_file(annot_file, sample_col='Sample')
>>> vf.plot_tmb(af=af, group_col='Tissue')
_images/api-62.png
pseudophase()[source]

Pseudophase the VcfFrame by replacing the unphased genotype separator (/) with the phased one (|).

Returns

Pseudophased VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr2  101  .   T   C    .      .    .     GT  1/1
>>> vf.pseudophase().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0|1
1  chr2  101  .   T   C    .      .    .     GT  1|1
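
The transformation applied to each sample field can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the function name pseudophase_field is hypothetical, and the sketch rewrites only the GT subfield (assumed to come first) so that values such as AD are untouched.

```python
def pseudophase_field(field):
    """Replace '/' with '|' in the GT subfield of a sample field."""
    genotype, separator, rest = field.partition(':')
    return genotype.replace('/', '|') + separator + rest


print(pseudophase_field('0/1'))           # 0|1
print(pseudophase_field('1/1:28:0,28'))   # 1|1:28:0,28
```

Note that pseudophasing only changes the notation; it does not infer which allele lies on which haplotype.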
rename(names, indicies=None)[source]

Rename the samples.

Parameters
  • names (dict or list) – Dict of old names to new names or list of new names.

  • indicies (list or tuple, optional) – List of 0-based sample indices. Alternatively, a tuple (int, int) can be used to specify an index range.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1'],
...     'B': ['0/1', '0/1'],
...     'C': ['0/1', '0/1'],
...     'D': ['0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['1', '2', '3', '4']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    1    2    3    4
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename({'B': '2', 'C': '3'}).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    3    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['2', '4'], indicies=[1, 3]).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    C    4
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['2', '3'], indicies=(1, 3)).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    3    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
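
The three ways of specifying names can be summarized with a small stand-alone function. This is a sketch, not fuc's implementation: rename_samples is a hypothetical name, and the tuple is treated as a half-open range (start included, end excluded), which is what the (1, 3) example above suggests.

```python
def rename_samples(samples, names, indicies=None):
    """Return a new list of sample names.

    `names` is either a dict of old-to-new names or a list of new names.
    `indicies` is a list of 0-based positions or a (start, end) tuple.
    """
    samples = list(samples)
    if isinstance(names, dict):
        return [names.get(sample, sample) for sample in samples]
    if indicies is None:
        targets = list(range(len(samples)))
    elif isinstance(indicies, tuple):
        targets = list(range(indicies[0], indicies[1]))
    else:
        targets = list(indicies)
    if len(names) != len(targets):
        raise ValueError('number of names does not match number of targets')
    for index, name in zip(targets, names):
        samples[index] = name
    return samples


samples = ['A', 'B', 'C', 'D']
print(rename_samples(samples, {'B': '2', 'C': '3'}))        # ['A', '2', '3', 'D']
print(rename_samples(samples, ['2', '4'], indicies=[1, 3]))  # ['A', '2', 'C', '4']
print(rename_samples(samples, ['2', '3'], indicies=(1, 3)))  # ['A', '2', '3', 'D']
```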
property samples

List of sample names.

Type

list

property shape

Dimensionality of VcfFrame (variants, samples).

Type

tuple

property sites_only

Whether the VCF is sites-only.

Type

bool

slice(region)[source]

Slice the VcfFrame for the region.

Parameters

region (str) – Region (‘chrom:start-end’).

Returns

Sliced VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr2'],
...     'POS': [100, 205, 297, 101],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  205  .   T   C    .      .    .     GT    1/1
2  chr1  297  .   A   T    .      .    .     GT    0/1
3  chr2  101  .   C   A    .      .    .     GT    0/1
>>> vf.slice('chr1:101-300').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  205  .   T   C    .      .    .     GT    1/1
1  chr1  297  .   A   T    .      .    .     GT    0/1
>>> vf.slice('chr1').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  205  .   T   C    .      .    .     GT    1/1
2  chr1  297  .   A   T    .      .    .     GT    0/1
>>> vf.slice('chr1:-296').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  205  .   T   C    .      .    .     GT    1/1
>>> vf.slice('chr1:101').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  205  .   T   C    .      .    .     GT    1/1
1  chr1  297  .   A   T    .      .    .     GT    0/1
>>> vf.slice('chr1:101-').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  205  .   T   C    .      .    .     GT    1/1
1  chr1  297  .   A   T    .      .    .     GT    0/1
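
The region strings accepted above ('chr1', 'chr1:101-300', 'chr1:-296', 'chr1:101', and 'chr1:101-') can be parsed with a short helper. The sketch below is an illustration only: parse_region and in_region are hypothetical names, and treating both bounds as inclusive is an assumption that the examples above do not test.

```python
def parse_region(region):
    """Return (chrom, start, end); a missing bound is returned as None."""
    chrom, _, interval = region.partition(':')
    if not interval:
        return chrom, None, None
    start, _, end = interval.partition('-')
    return chrom, int(start) if start else None, int(end) if end else None


def in_region(chrom, pos, region):
    """Return True if the position falls inside the region (bounds inclusive)."""
    target, start, end = parse_region(region)
    if chrom != target:
        return False
    if start is not None and pos < start:
        return False
    if end is not None and pos > end:
        return False
    return True


records = [('chr1', 100), ('chr1', 205), ('chr1', 297), ('chr2', 101)]
print([r for r in records if in_region(*r, 'chr1:101-300')])
# [('chr1', 205), ('chr1', 297)]
print([r for r in records if in_region(*r, 'chr1:-296')])
# [('chr1', 100), ('chr1', 205)]
```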
sort()[source]

Sort the VcfFrame by chromosome and position.

Returns

Sorted VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr10', 'chr2', 'chr1', 'chr2'],
...     'POS': [100, 101, 102, 90],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'A'],
...     'ALT': ['A', 'C', 'A', 'T'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Steven': ['./.:.', '0/0:29', '0/0:28', '0/1:17']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
   CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven
0  chr10  100  .   G   A    .      .    .  GT:DP   ./.:.
1   chr2  101  .   T   C    .      .    .  GT:DP  0/0:29
2   chr1  102  .   T   A    .      .    .  GT:DP  0/0:28
3   chr2   90  .   A   T    .      .    .  GT:DP  0/1:17

We can sort the VcfFrame by chromosome and position:

>>> vf.sort().df
   CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven
0   chr1  102  .   T   A    .      .    .  GT:DP  0/0:28
1   chr2   90  .   A   T    .      .    .  GT:DP  0/1:17
2   chr2  101  .   T   C    .      .    .  GT:DP  0/0:29
3  chr10  100  .   G   A    .      .    .  GT:DP   ./.:.
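The sort example above places chr10 after chr2, which a plain string sort would not do. The following is a minimal sketch of one way to get that ordering in plain Python. The helper chrom_sort_key is hypothetical, and how the library orders non-numeric contigs (e.g. chrX) is not shown in the example, so that part is an assumption.

```python
def chrom_sort_key(chrom):
    """Order numeric contigs by number, then non-numeric ones by name."""
    name = chrom[3:] if chrom.startswith('chr') else chrom
    if name.isdigit():
        return (0, int(name), '')
    # Assumption: contigs such as chrX or chrM go after the numeric ones.
    return (1, 0, name)


# (CHROM, POS) pairs taken from the example data above.
records = [('chr10', 100), ('chr2', 101), ('chr1', 102), ('chr2', 90)]
ordered = sorted(records, key=lambda r: (chrom_sort_key(r[0]), r[1]))
print(ordered)
# [('chr1', 102), ('chr2', 90), ('chr2', 101), ('chr10', 100)]
```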
strip(format='GT', metadata=False)[source]

Remove any unnecessary data, retaining only the requested FORMAT keys.

Parameters
  • format (str, default: ‘GT’) – FORMAT keys to retain (e.g. ‘GT:AD:DP’).

  • metadata (bool, default: False) – If True, keep the metadata.

Returns

Stripped VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP:AD', 'GT:DP:AD', 'GT'],
...     'A': ['0/1:30:15,15', '1/1:28:0,28', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO    FORMAT             A
0  chr1  100  .   G   A    .      .    .  GT:DP:AD  0/1:30:15,15
1  chr1  101  .   T   C    .      .    .  GT:DP:AD   1/1:28:0,28
2  chr1  102  .   A   T    .      .    .        GT           0/1
>>> vf.strip('GT:DP').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT       A
0  chr1  100  .   G   A    .      .    .  GT:DP  0/1:30
1  chr1  101  .   T   C    .      .    .  GT:DP  1/1:28
2  chr1  102  .   A   T    .      .    .  GT:DP   0/1:.
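In the strip example above, the third record has no DP value, so it is filled with the missing value ‘.’. The per-sample transformation can be sketched as follows. The helper retain_format_keys is hypothetical and only reproduces the behavior visible in the example, not the library's implementation.

```python
def retain_format_keys(old_format, sample, new_format='GT'):
    """Rebuild a sample field so it holds only the keys in new_format.

    Keys that are absent from the original FORMAT are filled with '.'.
    """
    values = dict(zip(old_format.split(':'), sample.split(':')))
    return ':'.join(values.get(key, '.') for key in new_format.split(':'))


print(retain_format_keys('GT:DP:AD', '0/1:30:15,15', 'GT:DP'))  # 0/1:30
print(retain_format_keys('GT', '0/1', 'GT:DP'))                 # 0/1:.
```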
subset(samples, exclude=False)[source]

Subset VcfFrame for specified samples.

Parameters
  • samples (str or list) – Sample name or list of names (the order matters).

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Subsetted VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1'],
...     'B': ['0/0', '0/1'],
...     'C': ['0/0', '0/0'],
...     'D': ['0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/0  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1  0/1  0/0  0/0

We can subset the VcfFrame for the samples A and B:

>>> vf.subset(['A', 'B']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0
1  chr1  101  .   T   C    .      .    .     GT  0/1  0/1

Alternatively, we can exclude those samples:

>>> vf.subset(['A', 'B'], exclude=True).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/0  0/0
subtract(a, b)[source]

Subtract genotype data between two samples (A, B).

This method can be especially useful when you want to distinguish between somatic and germline variants for an individual. See examples below for more details.

Parameters

a, b (str or int) – Name or index of Samples A and B.

Returns

Resulting VCF column.

Return type

pandas.Series

See also

VcfFrame.combine

Combine genotype data from two samples (A, B).

Examples

Assume we have the following data for a cancer patient:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['rs1', 'rs2', 'rs3', 'rs4', 'rs5'],
...     'REF': ['G', 'T', 'C', 'A', 'G'],
...     'ALT': ['A', 'G', 'T', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'Tissue': ['./.', '0/1', '0/1', '0/0', '0/1'],
...     'Blood': ['0/1', '0/1', './.', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS   ID REF ALT QUAL FILTER INFO FORMAT Tissue Blood
0  chr1  100  rs1   G   A    .      .    .     GT    ./.   0/1
1  chr1  101  rs2   T   G    .      .    .     GT    0/1   0/1
2  chr1  102  rs3   C   T    .      .    .     GT    0/1   ./.
3  chr1  103  rs4   A   C    .      .    .     GT    0/0   0/1
4  chr1  104  rs5   G   C    .      .    .     GT    0/1   0/0

We can compare genotype data between ‘Tissue’ and ‘Blood’ to identify somatic variants (i.e. rs3 and rs5; rs2 is most likely germline):

>>> vf.df['Somatic'] = vf.subtract('Tissue', 'Blood')
>>> vf.df
  CHROM  POS   ID REF ALT QUAL FILTER INFO FORMAT Tissue Blood Somatic
0  chr1  100  rs1   G   A    .      .    .     GT    ./.   0/1     ./.
1  chr1  101  rs2   T   G    .      .    .     GT    0/1   0/1     ./.
2  chr1  102  rs3   C   T    .      .    .     GT    0/1   ./.     0/1
3  chr1  103  rs4   A   C    .      .    .     GT    0/0   0/1     0/0
4  chr1  104  rs5   G   C    .      .    .     GT    0/1   0/0     0/1
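The ‘Somatic’ column above is consistent with a simple rule: keep Sample A's genotype unless it is identical to Sample B's, in which case report it as missing. The sketch below encodes only that rule, which is inferred from the five rows shown. The helper subtract_genotypes is hypothetical, and the actual method may handle other cases (phased genotypes, additional FORMAT fields) differently.

```python
def subtract_genotypes(a, b):
    """Return A's genotype, or './.' when A and B carry the same genotype."""
    return './.' if a == b else a


# Genotypes taken from the example data above.
tissue = ['./.', '0/1', '0/1', '0/0', '0/1']
blood = ['0/1', '0/1', './.', '0/1', '0/0']
somatic = [subtract_genotypes(a, b) for a, b in zip(tissue, blood)]
print(somatic)  # ['./.', './.', '0/1', '0/0', '0/1']
```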
to_bed()[source]

Convert the VcfFrame to a BedFrame.

Returns

BedFrame.

Return type

BedFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C', 'ACGT'],
...     'ALT': ['C', 'T,G', 'G', 'A,G,CT', 'A'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Steven': ['0/1:32', './.:.', '0/1:27', '0/2:34', '0/0:31'],
...     'Sara': ['0/0:28', '1/2:30', '1/1:29', '1/2:38', '0/1:27'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID   REF     ALT QUAL FILTER INFO FORMAT  Steven    Sara
0  chr1  100  .     A       C    .      .    .  GT:DP  0/1:32  0/0:28
1  chr1  101  .     A     T,G    .      .    .  GT:DP   ./.:.  1/2:30
2  chr1  102  .     C       G    .      .    .  GT:DP  0/1:27  1/1:29
3  chr1  103  .     C  A,G,CT    .      .    .  GT:DP  0/2:34  1/2:38
4  chr1  104  .  ACGT       A    .      .    .  GT:DP  0/0:31  0/1:27

We can construct a BedFrame from the VcfFrame:

>>> bf = vf.to_bed()
>>> bf.gr.df
  Chromosome  Start  End
0       chr1    100  100
1       chr1    101  101
2       chr1    102  102
3       chr1    103  103
4       chr1    103  104
5       chr1    105  107
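The six intervals above come from five records because each ALT allele is converted separately and duplicate intervals are dropped. The following sketch reproduces that output in plain Python. The helper variant_to_intervals is hypothetical, and its coordinate rules are inferred from the example only: it assumes indels share their leading bases with REF, and its handling of same-length multi-base substitutions is a guess.

```python
def variant_to_intervals(chrom, pos, ref, alt):
    """Yield one (chrom, start, end) interval per ALT allele."""
    for allele in alt.split(','):
        if len(allele) > len(ref):
            # Insertion: the anchor base and the base that follows it.
            yield (chrom, pos, pos + 1)
        elif len(allele) < len(ref):
            # Deletion: the REF bases removed after the shared prefix.
            yield (chrom, pos + len(allele), pos + len(ref) - 1)
        else:
            # SNV or same-length substitution: the bases covered by REF.
            yield (chrom, pos, pos + len(ref) - 1)


# (CHROM, POS, REF, ALT) taken from the example data above.
variants = [
    ('chr1', 100, 'A', 'C'),
    ('chr1', 101, 'A', 'T,G'),
    ('chr1', 102, 'C', 'G'),
    ('chr1', 103, 'C', 'A,G,CT'),
    ('chr1', 104, 'ACGT', 'A'),
]

intervals = []
for variant in variants:
    for interval in variant_to_intervals(*variant):
        if interval not in intervals:  # drop duplicates, keep order
            intervals.append(interval)

for interval in intervals:
    print(*interval)
```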
to_file(fn, compression=False)[source]

Write VcfFrame to a VCF file.

If the filename ends with ‘.gz’, the method automatically uses BGZF compression when writing the file, regardless of the compression argument.

Parameters
  • fn (str) – VCF file path.

  • compression (bool, default: False) – If True, use BGZF compression.

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'Steven': ['0/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict(['##fileformat=VCFv4.3'], data)
>>> vf.to_file('unzipped.vcf')
>>> vf.to_file('zipped.vcf.gz')
>>> vf.to_file('zipped.vcf.gz', compression=True)
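To check which of the files written above are actually compressed, one option is to inspect the first two bytes. BGZF output is gzip-compatible, so it starts with the gzip magic number 0x1f 0x8b. The sketch below uses only the standard library. The helper looks_gzipped is hypothetical, and the demo writes its own temporary files with the gzip module as a stand-in for BGZF output, so it cannot tell BGZF apart from ordinary gzip.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'unzipped.vcf')
    zipped = os.path.join(tmp, 'zipped.vcf.gz')
    with open(plain, 'w') as handle:
        handle.write('##fileformat=VCFv4.3\n')
    with gzip.open(zipped, 'wt') as handle:  # stand-in for BGZF output
        handle.write('##fileformat=VCFv4.3\n')
    results = (looks_gzipped(plain), looks_gzipped(zipped))

print(results)  # (False, True)
```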
to_string()[source]

Render the VcfFrame to a console-friendly tabular output.

Returns

String representation of the VcfFrame.

Return type

str

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict(['##fileformat=VCFv4.3'], data)
>>> print(vf.to_string())
##fileformat=VCFv4.3
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A
chr1    100     .       G       A       .       .       .       GT      0/1
chr1    101     .       T       C       .       .       .       GT      0/1
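The output above follows the usual VCF text layout: metadata lines first, then a header line starting with ‘#’, then one line per record. A minimal sketch of that layout in plain Python is shown below. The helper render_vcf is hypothetical, and the tab separator and trailing newline are assumptions about the rendered string, not confirmed behavior of to_string.

```python
def render_vcf(meta, columns, rows):
    """Join metadata, a '#'-prefixed header and records into VCF-style text."""
    lines = list(meta)
    lines.append('#' + '\t'.join(columns))
    lines.extend('\t'.join(str(value) for value in row) for row in rows)
    return '\n'.join(lines) + '\n'


columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT',
           'QUAL', 'FILTER', 'INFO', 'FORMAT', 'A']
rows = [
    ['chr1', 100, '.', 'G', 'A', '.', '.', '.', 'GT', '0/1'],
    ['chr1', 101, '.', 'T', 'C', '.', '.', '.', 'GT', '0/1'],
]
text = render_vcf(['##fileformat=VCFv4.3'], columns, rows)
print(text, end='')
```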
unphase()[source]

Unphase all the sample genotypes.

Returns

Unphased VcfFrame.

Return type

VcfFrame

Examples

Assume we have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['1|0', './.', '0|1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    1|0
1  chr1  101  .   T   C    .      .    .     GT    ./.
2  chr1  102  .   A   T    .      .    .     GT    0|1
3  chr1  103  .   C   A    .      .    .     GT    0/1

We can unphase the sample genotypes:

>>> vf.unphase().df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C    .      .    .     GT    ./.
2  chr1  102  .   A   T    .      .    .     GT    0/1
3  chr1  103  .   C   A    .      .    .     GT    0/1
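In the unphase example above, ‘1|0’ becomes ‘0/1’, so unphasing both swaps the separator and reorders the alleles. A minimal sketch of that per-genotype step follows. The helper unphase_genotype is hypothetical. Leaving genotypes that are already unphased untouched and passing any trailing FORMAT values through unchanged are assumptions that the example does not cover.

```python
def unphase_genotype(field):
    """Convert a phased genotype such as '1|0' to the unphased '0/1'."""
    gt, sep, rest = field.partition(':')
    if '|' not in gt:
        return field  # already unphased (or missing)
    # Sort alleles numerically; a missing allele ('.') goes last.
    alleles = sorted(
        gt.split('|'),
        key=lambda a: (a == '.', int(a) if a.isdigit() else 0),
    )
    return '/'.join(alleles) + sep + rest


# Genotypes taken from the example data above.
genotypes = ['1|0', './.', '0|1', '0/1']
print([unphase_genotype(g) for g in genotypes])
# ['0/1', './.', '0/1', '0/1']
```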
update_chr_prefix(mode='remove')[source]

Add or remove the (annoying) ‘chr’ string from the CHROM column.

Parameters

mode ({‘add’, ‘remove’}, default: ‘remove’) – Whether to add or remove the ‘chr’ string.

Returns

Updated VcfFrame.

Return type

VcfFrame

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', '2', '2'],
...     'POS': [100, 101, 100, 101],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'C'],
...     'ALT': ['A', 'C', 'C', 'G'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '0/1', '0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1
2     2  100  .   T   C    .      .    .     GT  0/1
3     2  101  .   C   G    .      .    .     GT  0/1
>>> vf.update_chr_prefix(mode='remove').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0     1  100  .   G   A    .      .    .     GT  0/1
1     1  101  .   T   C    .      .    .     GT  0/1
2     2  100  .   T   C    .      .    .     GT  0/1
3     2  101  .   C   G    .      .    .     GT  0/1
>>> vf.update_chr_prefix(mode='add').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  chr1  100  .   G   A    .      .    .     GT  0/1
1  chr1  101  .   T   C    .      .    .     GT  0/1
2  chr2  100  .   T   C    .      .    .     GT  0/1
3  chr2  101  .   C   G    .      .    .     GT  0/1
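The update_chr_prefix example above starts from a mix of prefixed and unprefixed names, and neither mode double-applies its change: ‘add’ leaves ‘chr1’ alone and ‘remove’ leaves ‘2’ alone. The sketch below shows that idempotent behavior for a single contig name. The helper fix_chr_prefix is hypothetical and is not the library's implementation.

```python
def fix_chr_prefix(chrom, mode='remove'):
    """Add or remove a leading 'chr' without ever applying it twice."""
    has_prefix = chrom.startswith('chr')
    if mode == 'remove':
        return chrom[3:] if has_prefix else chrom
    if mode == 'add':
        return chrom if has_prefix else 'chr' + chrom
    raise ValueError(f"mode must be 'add' or 'remove', got {mode!r}")


# CHROM values taken from the example data above.
chroms = ['chr1', 'chr1', '2', '2']
print([fix_chr_prefix(c, 'remove') for c in chroms])  # ['1', '1', '2', '2']
print([fix_chr_prefix(c, 'add') for c in chroms])     # ['chr1', 'chr1', 'chr2', 'chr2']
```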
variants()[source]

List unique variants in VcfFrame.

Returns

List of unique variants.

Return type

list

Examples

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'A,C'],
...     &