CLI
Introduction
This section describes the command line interface (CLI) for the fuc package.
To get help on the fuc CLI:
$ fuc -h
usage: fuc [-h] [-v] COMMAND ...
positional arguments:
COMMAND
bam-aldepth Compute allelic depth from a BAM file.
bam-depth Compute per-base read depth from BAM files.
bam-head Print the header of a BAM file.
bam-index Index a BAM file.
bam-rename Rename the sample in a BAM file.
bam-slice Slice a BAM file.
bed-intxn Find the intersection of BED files.
bed-sum Summarize a BED file.
cov-concat Concatenate depth of coverage files.
cov-rename Rename the samples in a depth of coverage file.
fa-filter Filter sequence records in a FASTA file.
fq-count Count sequence reads in FASTQ files.
fq-sum Summarize a FASTQ file.
fuc-bgzip Write a BGZF compressed file.
fuc-compf Compare the contents of two files.
fuc-demux Parse the Reports directory from bcl2fastq.
fuc-exist Check whether certain files exist.
fuc-find Retrieve absolute paths of files whose name matches a
specified pattern, optionally recursively.
fuc-undetm Compute top unknown barcodes using undetermined FASTQ from
bcl2fastq.
maf-maf2vcf Convert a MAF file to a VCF file.
maf-oncoplt Create an oncoplot with a MAF file.
maf-sumplt Create a summary plot with a MAF file.
maf-vcf2maf Convert a VCF file to a MAF file.
ngs-bam2fq Pipeline for converting BAM files to FASTQ files.
ngs-fq2bam Pipeline for converting FASTQ files to analysis-ready BAM
files.
ngs-hc Pipeline for germline short variant discovery.
ngs-m2 Pipeline for somatic short variant discovery.
ngs-pon Pipeline for constructing a panel of normals (PoN).
ngs-quant Pipeline for running RNAseq quantification from FASTQ files
with Kallisto.
ngs-trim Pipeline for trimming adapters from FASTQ files.
tabix-index Index a GFF/BED/SAM/VCF file with Tabix.
tabix-slice Slice a GFF/BED/SAM/VCF file with Tabix.
tbl-merge Merge two table files.
tbl-sum Summarize a table file.
vcf-call Call SNVs and indels from BAM files.
vcf-filter Filter a VCF file.
vcf-index Index a VCF file.
vcf-merge Merge two or more VCF files.
vcf-rename Rename the samples in a VCF file.
vcf-slice Slice a VCF file for specified regions.
vcf-split Split a VCF file by individual.
vcf-vcf2bed Convert a VCF file to a BED file.
vcf-vep Filter a VCF file by annotations from Ensembl VEP.
optional arguments:
-h, --help Show this help message and exit.
-v, --version Show the version number and exit.
To get help on a specific command (e.g. vcf-merge):
$ fuc vcf-merge -h
bam-aldepth
$ fuc bam-aldepth -h
usage: fuc bam-aldepth [-h] bam sites
Count allelic depth from a BAM file.
Positional arguments:
bam Input alignment file.
sites TSV file containing two columns, chromosome and position. This
can also be a BED or VCF file (compressed or uncompressed).
Input type will be detected automatically.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Provide sites with a TSV file:
$ fuc bam-aldepth in.bam sites.tsv > out.tsv
[Example] Provide sites with a VCF file:
$ fuc bam-aldepth in.bam sites.vcf > out.tsv
bam-depth
$ fuc bam-depth -h
usage: fuc bam-depth [-h] [-r TEXT [TEXT ...]] [--zero] bams [bams ...]
Compute per-base read depth from BAM files.
Under the hood, the command computes read depth using the 'samtools depth'
command.
Positional arguments:
bams One or more input BAM files. Alternatively, you can
provide a text file (.txt, .tsv, .csv, or .list)
containing one BAM file per line.
Optional arguments:
-h, --help Show this help message and exit.
-r TEXT [TEXT ...], --regions TEXT [TEXT ...]
By default, the command counts all reads in BAM
files, which can be excruciatingly slow for large
files (e.g. whole genome sequencing). Therefore, use
this argument to only output positions in given
regions. Each region must have the format
chrom:start-end and be a half-open interval with
(start, end]. This means, for example, chr1:100-103
will extract positions 101, 102, and 103.
Alternatively, you can provide a BED file (compressed
or uncompressed) to specify regions. Note that the
'chr' prefix in contig names (e.g. 'chr1' vs. '1')
will be automatically added or removed as necessary
to match the input BAM's contig names.
--zero Output all positions including those with zero depth.
[Example] Specify regions manually:
$ fuc bam-depth 1.bam 2.bam \
-r chr1:100-200 chr2:400-500 > out.tsv
[Example] Specify regions with a BED file:
$ fuc bam-depth bam.list \
-r in.bed > out.tsv
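The half-open (start, end] convention and the 'chr' prefix matching described for --regions can be illustrated with a short Python sketch. The helper names `expand_region` and `match_prefix` are hypothetical and are not part of fuc; the sketch only shows which positions a region string covers under the stated convention and one plausible way of reconciling contig names.

```python
def expand_region(region):
    """Return (chrom, positions) for a 'chrom:start-end' string.

    Assumes the half-open (start, end] convention described above.
    Illustrative helper only; not part of fuc.
    """
    chrom, interval = region.rsplit(":", 1)
    start, end = (int(x) for x in interval.split("-"))
    # (start, end] excludes 'start' and includes 'end'.
    return chrom, list(range(start + 1, end + 1))


def match_prefix(chrom, bam_contigs):
    """Add or remove the 'chr' prefix so 'chrom' follows the BAM's naming.

    Assumption: the BAM uses one naming style consistently.
    """
    bam_uses_chr = any(c.startswith("chr") for c in bam_contigs)
    if bam_uses_chr and not chrom.startswith("chr"):
        return "chr" + chrom
    if not bam_uses_chr and chrom.startswith("chr"):
        return chrom[3:]
    return chrom


chrom, positions = expand_region("chr1:100-103")
print(chrom, positions)
print(match_prefix(chrom, ["1", "2", "X"]))
```

Under this reading, a region of N positions is written with a start that is one less than the first position of interest.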
bam-head
$ fuc bam-head -h
usage: fuc bam-head [-h] bam
Print the header of a BAM file.
Positional arguments:
bam Input alignment file.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Print the header of a BAM file:
$ fuc bam-head in.bam
bam-index
$ fuc bam-index -h
usage: fuc bam-index [-h] bam
Index a BAM file.
Positional arguments:
bam Input alignment file.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Index a BAM file:
$ fuc bam-index in.bam
bam-rename
$ fuc bam-rename -h
usage: fuc bam-rename [-h] bam name
Rename the sample in a BAM file.
Positional arguments:
bam Input alignment file.
name New sample name.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Write a new BAM file after renaming:
$ fuc bam-rename in.bam NA12878 > out.bam
bam-slice
$ fuc bam-slice -h
usage: fuc bam-slice [-h] [--format TEXT] [--fasta PATH]
bam regions [regions ...]
Slice a BAM file.
Positional arguments:
bam Input BAM file. It must be already indexed to allow random
access. You can index a BAM file with the bam-index command.
regions One or more regions to be sliced. Each region must have the
format chrom:start-end and be a half-open interval with
(start, end]. This means, for example, chr1:100-103 will
extract positions 101, 102, and 103. Alternatively, you can
provide a BED file (compressed or uncompressed) to specify
regions. Note that the 'chr' prefix in contig names (e.g.
'chr1' vs. '1') will be automatically added or removed as
necessary to match the input BAM's contig names.
Optional arguments:
-h, --help Show this help message and exit.
--format TEXT Output format (default: 'BAM') (choices: 'SAM', 'BAM',
'CRAM').
--fasta PATH FASTA file. Required when --format is 'CRAM'.
[Example] Specify regions manually:
$ fuc bam-slice in.bam 1:100-300 2:400-700 > out.bam
[Example] Specify regions with a BED file:
$ fuc bam-slice in.bam regions.bed > out.bam
[Example] Write the sliced output as a CRAM file:
$ fuc bam-slice in.bam regions.bed --format CRAM --fasta ref.fa > out.cram
bed-intxn
$ fuc bed-intxn -h
usage: fuc bed-intxn [-h] bed [bed ...]
Find the intersection of BED files.
Positional arguments:
bed Input BED files.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Find the intersection of three BED files:
$ fuc bed-intxn in1.bed in2.bed in3.bed > out.bed
bed-sum
$ fuc bed-sum -h
usage: fuc bed-sum [-h] [--bases INT] [--decimals INT] bed
Summarize a BED file.
This command will compute various summary statistics for a BED file. The
returned statistics include the total numbers of probes and covered base
pairs for each chromosome.
By default, covered base pairs are displayed in bp, but if you prefer you
can, for example, use '--bases 1000' to display in kb.
Positional arguments:
bed Input BED file.
Optional arguments:
-h, --help Show this help message and exit.
--bases INT Number to divide covered base pairs by (default: 1).
--decimals INT Number of decimals (default: 0).
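As a rough illustration of the statistics described above, the following Python sketch counts probes and covered base pairs per chromosome and applies divisor and rounding arguments analogous to --bases and --decimals. The function `summarize_bed` is hypothetical, and the sketch assumes non-overlapping, 0-based half-open BED intervals; the actual command may compute or present its summary differently.

```python
from collections import defaultdict


def summarize_bed(lines, bases=1, decimals=0):
    """Return {chrom: (probe_count, covered_bases / bases)}.

    Assumes standard BED coordinates (0-based start, exclusive end)
    and that intervals do not overlap. Illustrative only.
    """
    probes = defaultdict(int)
    covered = defaultdict(int)
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        probes[chrom] += 1
        covered[chrom] += int(end) - int(start)
    return {c: (probes[c], round(covered[c] / bases, decimals)) for c in probes}


bed = ["chr1\t100\t1100", "chr1\t2000\t2500", "chr2\t0\t3000"]
summary = summarize_bed(bed, bases=1000, decimals=1)  # report in kb
print(summary)
```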
cov-concat
$ fuc cov-concat -h
usage: fuc cov-concat [-h] [--axis INT] tsv [tsv ...]
Concatenate depth of coverage files.
Positional arguments:
tsv Input TSV files.
Optional arguments:
-h, --help Show this help message and exit.
--axis INT The axis to concatenate along (default: 0) (choices:
0, 1 where 0 is index and 1 is columns).
[Example] Concatenate vertically:
$ fuc cov-concat in1.tsv in2.tsv > out.tsv
[Example] Concatenate horizontally:
$ fuc cov-concat in1.tsv in2.tsv --axis 1 > out.tsv
cov-rename
$ fuc cov-rename -h
usage: fuc cov-rename [-h] [--mode TEXT] [--range INT INT] [--sep TEXT]
tsv names
Rename the samples in a depth of coverage file.
There are three different renaming modes using the names file:
- 'MAP': Default mode. Requires two columns, old names in the first
and new names in the second.
- 'INDEX': Requires two columns, new names in the first and 0-based
indices in the second.
- 'RANGE': Requires only one column of new names but --range must
be specified.
Positional arguments:
tsv TSV file (compressed or uncompressed).
names Text file containing information for renaming the samples.
Optional arguments:
-h, --help Show this help message and exit.
--mode TEXT Renaming mode (default: 'MAP') (choices: 'MAP',
'INDEX', 'RANGE').
--range INT INT Index range to use when renaming the samples.
Applicable only with the 'RANGE' mode.
--sep TEXT Delimiter to use when reading the names file
(default: '\t').
[Example] Using the default 'MAP' mode:
$ fuc cov-rename in.tsv old_new.tsv > out.tsv
[Example] Using the 'INDEX' mode:
$ fuc cov-rename in.tsv new_idx.tsv --mode INDEX > out.tsv
[Example] Using the 'RANGE' mode:
$ fuc cov-rename in.tsv new_only.tsv --mode RANGE --range 2 5 > out.tsv
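The three renaming modes can be sketched in Python as follows. The function `rename_samples` is hypothetical and only mirrors the description above. In particular, treating --range as a Python-style slice (start included, end excluded) is an assumption, so check the behavior on your own data before relying on it.

```python
def rename_samples(samples, rows, mode="MAP", index_range=None):
    """Return a renamed copy of 'samples'.

    'rows' holds the parsed rows of the names file. Illustrative only.
    """
    samples = list(samples)
    if mode == "MAP":
        # Two columns: old name, new name.
        mapping = {old: new for old, new in rows}
        return [mapping.get(s, s) for s in samples]
    if mode == "INDEX":
        # Two columns: new name, 0-based index.
        for new, idx in rows:
            samples[int(idx)] = new
        return samples
    if mode == "RANGE":
        if index_range is None:
            raise ValueError("RANGE mode requires an index range")
        start, end = index_range
        new_names = [row[0] for row in rows]
        # Assumption: the range behaves like a slice, with 'end' excluded.
        if len(new_names) != end - start:
            raise ValueError("number of names does not match the range")
        samples[start:end] = new_names
        return samples
    raise ValueError(f"unknown mode: {mode}")


samples = ["A", "B", "C", "D", "E"]
by_map = rename_samples(samples, [("A", "X")])
by_index = rename_samples(samples, [("Y", "1")], mode="INDEX")
by_range = rename_samples(samples, [("P",), ("Q",), ("R",)],
                          mode="RANGE", index_range=(2, 5))
print(by_map, by_index, by_range)
```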
fa-filter
$ fuc fa-filter -h
usage: fuc fa-filter [-h] [--contigs TEXT [TEXT ...]] [--exclude] fasta
Filter sequence records in a FASTA file.
Positional arguments:
fasta Input FASTA file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
--contigs TEXT [TEXT ...]
One or more contigs to be selected. Alternatively, you can
provide a file containing one contig per line.
--exclude Exclude specified contigs.
[Example] Select certain contigs:
$ fuc fa-filter in.fasta --contigs chr1 chr2 > out.fasta
[Example] Exclude certain contigs:
$ fuc fa-filter in.fasta --contigs contigs.list --exclude > out.fasta
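The select/exclude logic can be pictured with a minimal Python sketch. The function `filter_fasta` is hypothetical, and the sketch assumes the contig name is the first whitespace-delimited token of each header line; it is not the package's implementation.

```python
def filter_fasta(lines, contigs, exclude=False):
    """Keep (or drop, if exclude=True) records whose name is in 'contigs'."""
    selected = set(contigs)
    keep = False
    kept_lines = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Keep listed contigs normally; keep unlisted ones when excluding.
            keep = (name in selected) != exclude
        if keep:
            kept_lines.append(line)
    return kept_lines


fasta = [">chr1 first", "ACGT", ">chr2", "GGCC", ">chrM", "TTAA"]
selected = filter_fasta(fasta, ["chr1", "chr2"])
excluded = filter_fasta(fasta, ["chr1", "chr2"], exclude=True)
print(selected, excluded)
```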
fq-count
$ fuc fq-count -h
usage: fuc fq-count [-h] [fastq ...]
Count sequence reads in FASTQ files.
Positional arguments:
fastq Input FASTQ files (compressed or uncompressed) (default: stdin).
Optional arguments:
-h, --help Show this help message and exit.
[Example] When the input is a FASTQ file:
$ fuc fq-count in1.fastq in2.fastq
[Example] When the input is stdin:
$ cat fastq.list | fuc fq-count
fq-sum
$ fuc fq-sum -h
usage: fuc fq-sum [-h] fastq
Summarize a FASTQ file.
This command will output a summary of the input FASTQ file. The summary
includes the total number of sequence reads, the distribution of read
lengths, and the numbers of unique and duplicate sequences.
Positional arguments:
fastq Input FASTQ file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
[Example] Summarize a FASTQ file:
$ fuc fq-sum in.fastq
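To make the reported quantities concrete, here is a small Python sketch that derives them from four-line FASTQ records. The function `summarize_fastq` is hypothetical. It assumes that "unique" means a sequence seen exactly once and "duplicate" means a sequence seen more than once, which may not match the command's exact definition.

```python
from collections import Counter


def summarize_fastq(lines):
    """Summarize four-line FASTQ records given as a list of lines."""
    # The sequence is the second line of every four-line record.
    sequences = [line.strip() for i, line in enumerate(lines) if i % 4 == 1]
    occurrences = Counter(sequences)
    return {
        "reads": len(sequences),
        "lengths": dict(Counter(len(s) for s in sequences)),
        # Assumed definitions; see the note above.
        "unique": sum(1 for n in occurrences.values() if n == 1),
        "duplicate": sum(1 for n in occurrences.values() if n > 1),
    }


records = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "ACG", "+", "III",
]
summary = summarize_fastq(records)
print(summary)
```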
fuc-bgzip
$ fuc fuc-bgzip -h
usage: fuc fuc-bgzip [-h] [file ...]
Write a BGZF compressed file.
BGZF (Blocked GNU Zip Format) is a modified form of gzip compression which
can be applied to any file format to provide compression with efficient
random access. In addition to being required for random access to and writing
of BAM files, the BGZF format can also be used for most of the sequence data
formats (e.g. FASTA, FASTQ, GenBank, VCF, MAF).
Positional arguments:
file Input file to be compressed (default: stdin).
Optional arguments:
-h, --help Show this help message and exit.
[Example] When the input is a VCF file:
$ fuc fuc-bgzip in.vcf > out.vcf.gz
[Example] When the input is stdin:
$ cat in.vcf | fuc fuc-bgzip > out.vcf.gz
fuc-compf
$ fuc fuc-compf -h
usage: fuc fuc-compf [-h] left right
Compare the contents of two files.
This command will compare the contents of two files, returning 'True' if they
are identical and 'False' otherwise.
Positional arguments:
left Input left file.
right Input right file.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Compare two files:
$ fuc fuc-compf left.txt right.txt
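A comparable content check can be done in Python with the standard library's `filecmp.cmp`. This is a sketch of the idea rather than the command's actual implementation; the wrapper `files_identical` and the temporary file names are made up for the demonstration.

```python
import filecmp
import os
import tempfile


def files_identical(left, right):
    """Return True if the two files have identical contents."""
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting file size and modification time.
    return filecmp.cmp(left, right, shallow=False)


with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, text in [("left", "a\nb\n"), ("right", "a\nb\n"), ("other", "a\nc\n")]:
        paths[name] = os.path.join(tmp, name + ".txt")
        with open(paths[name], "w") as handle:
            handle.write(text)
    same = files_identical(paths["left"], paths["right"])
    different = files_identical(paths["left"], paths["other"])

print(same, different)
```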
fuc-demux
$ fuc fuc-demux -h
usage: fuc fuc-demux [-h] [--sheet PATH] reports output
Parse the Reports directory from bcl2fastq.
This command will parse, and extract various statistics from, HTML files in
the Reports directory created by the bcl2fastq or bcl2fastq2 program. After
creating an output directory, the command will write the following files:
- flowcell-summary.csv
- lane-summary.csv
- top-unknown-barcodes.csv
- reports.pdf
Use --sheet to sort samples in the lane-summary.csv file in the same order
as your SampleSheet.csv file. You can also provide a modified version of your
SampleSheet.csv file to subset samples for the lane-summary.csv and
reports.pdf files.
Positional arguments:
reports Input Reports directory.
output Output directory (will be created).
Optional arguments:
-h, --help Show this help message and exit.
--sheet PATH SampleSheet.csv file. Used for sorting and/or subsetting
samples.
fuc-exist
$ fuc fuc-exist -h
usage: fuc fuc-exist [-h] [files ...]
Check whether certain files exist.
This command will check whether or not specified files including directories
exist, returning 'True' if they exist and 'False' otherwise.
Positional arguments:
files Files and directories to be tested (default: stdin).
Optional arguments:
-h, --help Show this help message and exit.
[Example] Test a file:
$ fuc fuc-exist in.txt
[Example] Test a directory:
$ fuc fuc-exist dir
[Example] When the input is stdin:
$ cat test.list | fuc fuc-exist
fuc-find
$ fuc fuc-find -h
usage: fuc fuc-find [-h] [-r] [-d PATH] pattern
Retrieve absolute paths of files whose name matches a specified pattern,
optionally recursively.
Positional arguments:
pattern Filename pattern.
Optional arguments:
-h, --help Show this help message and exit.
-r, --recursive Turn on recursive retrieving.
-d PATH, --directory PATH
Directory to search in (default: current directory).
[Example] Retrieve VCF files in the current directory only:
$ fuc fuc-find "*.vcf"
[Example] Retrieve VCF files recursively:
$ fuc fuc-find "*.vcf" -r
[Example] Retrieve VCF files in a specific directory:
$ fuc fuc-find "*.vcf" -d /path/to/dir
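The same kind of lookup can be sketched in Python with `pathlib`. The function `find_files` below is hypothetical and only mirrors the options described above (pattern, directory, recursion); details such as the ordering of results are assumptions.

```python
import tempfile
from pathlib import Path


def find_files(pattern, directory=".", recursive=False):
    """Return sorted absolute paths of files matching a glob pattern."""
    root = Path(directory).resolve()
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(str(p) for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny directory tree to search.
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.vcf").touch()
    (Path(tmp) / "sub" / "b.vcf").touch()
    (Path(tmp) / "c.txt").touch()
    top_level = [Path(p).name for p in find_files("*.vcf", tmp)]
    everything = [Path(p).name for p in find_files("*.vcf", tmp, recursive=True)]

print(top_level, everything)
```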
fuc-undetm
$ fuc fuc-undetm -h
usage: fuc fuc-undetm [-h] [--count INT] fastq
Compute top unknown barcodes using undetermined FASTQ from bcl2fastq.
This command will compute top unknown barcodes using undetermined FASTQ from
the bcl2fastq or bcl2fastq2 program.
Positional arguments:
fastq Undetermined FASTQ (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
--count INT Number of top unknown barcodes to return (default: 30).
[Example] Compute top unknown barcodes:
$ fuc fuc-undetm Undetermined_S0_R1_001.fastq.gz
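The idea behind the command can be sketched as a simple tally. The sketch below assumes Illumina-style read headers in which the index sequence is the last colon-separated field (e.g. `... 1:N:0:ACGT`); the function `top_barcodes` and the sample headers are hypothetical, and the command itself may parse the file differently.

```python
from collections import Counter


def top_barcodes(lines, count=30):
    """Return the 'count' most frequent barcodes as (barcode, n) pairs."""
    barcodes = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each four-line record
            # Assumption: the barcode is the last colon-separated field.
            barcodes[line.strip().rsplit(":", 1)[-1]] += 1
    return barcodes.most_common(count)


records = []
for barcode in ["ACGT", "TTTT", "ACGT", "ACGT", "TTTT", "GGGG"]:
    records += [f"@M1:1:FC:1:1:1:1 1:N:0:{barcode}", "ACGTACGT", "+", "IIIIIIII"]

top = top_barcodes(records, count=2)
print(top)
```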
maf-maf2vcf
$ fuc maf-maf2vcf -h
usage: fuc maf-maf2vcf [-h] [--fasta PATH] [--ignore_indels]
[--cols TEXT [TEXT ...]] [--names TEXT [TEXT ...]]
maf
Convert a MAF file to a VCF file.
In order to handle INDELs the command makes use of a reference assembly (i.e.
FASTA file). If SNVs are your only concern, then you do not need a FASTA file
and can just use --ignore_indels.
If you are going to provide a FASTA file, please make sure to select the
appropriate one (e.g. one that matches the genome assembly).
In addition to basic genotype calls (e.g. '0/1'), you can extract more
information from the MAF file by specifying the column(s) that contain
additional genotype data of interest with the '--cols' argument. If provided,
this argument will append the requested data to individual sample genotypes
(e.g. '0/1:0.23').
You can also control how this additional genotype information appears in the
FORMAT field (e.g. AF) with the '--names' argument. If this argument is not
provided, the original column name(s) will be displayed.
Positional arguments:
maf MAF file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
--fasta PATH FASTA file (required to include INDELs in the output).
--ignore_indels Use this flag to exclude INDELs from the output.
--cols TEXT [TEXT ...]
Column(s) in the MAF file.
--names TEXT [TEXT ...]
Name(s) to be displayed in the FORMAT field.
[Example] Convert both SNVs and indels:
$ fuc maf-maf2vcf in.maf --fasta hs37d5.fa > out.vcf
[Example] Convert SNVs only:
$ fuc maf-maf2vcf in.maf --ignore_indels > out.vcf
[Example] Extract AF field:
$ fuc maf-maf2vcf \
in.maf \
--fasta hs37d5.fa \
--cols i_TumorVAF_WU \
--names AF > out.vcf
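To show how the extra genotype data relates to the FORMAT field, here is a minimal Python sketch of the string assembly. The function `build_genotype` is hypothetical; it only reproduces the '0/1:0.23' pattern described above and is not the converter's actual code.

```python
def build_genotype(gt, extra_values, names):
    """Return (FORMAT string, sample string) for one sample.

    'names' are the keys shown in FORMAT; 'extra_values' are the
    matching values taken from the requested MAF columns.
    """
    if len(extra_values) != len(names):
        raise ValueError("each extra value needs a FORMAT name")
    format_field = ":".join(["GT"] + list(names))
    sample_field = ":".join([gt] + [str(v) for v in extra_values])
    return format_field, sample_field


format_field, sample_field = build_genotype("0/1", [0.23], ["AF"])
print(format_field, sample_field)
```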
maf-oncoplt
$ fuc maf-oncoplt -h
usage: fuc maf-oncoplt [-h] [--count INT] [--figsize FLOAT FLOAT]
[--label_fontsize FLOAT] [--ticklabels_fontsize FLOAT]
[--legend_fontsize FLOAT]
maf out
Create an oncoplot with a MAF file.
The format of output image (PDF/PNG/JPEG/SVG) will be automatically
determined by the output file's extension.
Positional arguments:
maf Input MAF file.
out Output image file.
Optional arguments:
-h, --help Show this help message and exit.
--count INT Number of top mutated genes to display (default: 10).
--figsize FLOAT FLOAT
Width, height in inches (default: [15, 10]).
--label_fontsize FLOAT
Font size of labels (default: 15).
--ticklabels_fontsize FLOAT
Font size of tick labels (default: 15).
--legend_fontsize FLOAT
Font size of legend texts (default: 15).
[Example] Output a PNG file:
$ fuc maf-oncoplt in.maf out.png
[Example] Output a PDF file:
$ fuc maf-oncoplt in.maf out.pdf
maf-sumplt
$ fuc maf-sumplt -h
usage: fuc maf-sumplt [-h] [--figsize FLOAT FLOAT] [--title_fontsize FLOAT]
[--ticklabels_fontsize FLOAT] [--legend_fontsize FLOAT]
maf out
Create a summary plot with a MAF file.
The format of output image (PDF/PNG/JPEG/SVG) will be automatically
determined by the output file's extension.
Positional arguments:
maf Input MAF file.
out Output image file.
Optional arguments:
-h, --help Show this help message and exit.
--figsize FLOAT FLOAT
Width, height in inches (default: [15, 10]).
--title_fontsize FLOAT
Font size of subplot titles (default: 16).
--ticklabels_fontsize FLOAT
Font size of tick labels (default: 12).
--legend_fontsize FLOAT
Font size of legend texts (default: 12).
[Example] Output a PNG file:
$ fuc maf-sumplt in.maf out.png
[Example] Output a PDF file:
$ fuc maf-sumplt in.maf out.pdf
maf-vcf2maf
$ fuc maf-vcf2maf -h
usage: fuc maf-vcf2maf [-h] vcf
Convert a VCF file to a MAF file.
Positional arguments:
vcf Annotated VCF file.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Convert VCF to MAF:
$ fuc maf-vcf2maf in.vcf > out.maf
ngs-bam2fq
$ fuc ngs-bam2fq -h
usage: fuc ngs-bam2fq [-h] [--thread INT] [--qsub TEXT] [--force]
manifest output
Pipeline for converting BAM files to FASTQ files.
This pipeline assumes that input BAM files consist of paired-end reads, and
will output two zipped FASTQ files for each sample (forward and reverse
reads). That is, SAMPLE.bam will produce SAMPLE_R1.fastq.gz and
SAMPLE_R2.fastq.gz.
By default, the pipeline will be run in a local environment. Use --qsub to
leverage a parallel environment, in which case SGE is required.
External dependencies:
- [Required] SAMtools: Required for BAM to FASTQ conversion.
- [Optional] SGE: Required for job submission (i.e. qsub).
Manifest columns:
- [Required] BAM: Input BAM file.
Positional arguments:
manifest Sample manifest CSV file.
output Output directory.
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--qsub TEXT SGE resource to request with qsub for BAM to FASTQ
conversion. Since this operation supports multithreading,
it is recommended to specify a parallel environment (PE)
to speed up the process (also see --thread).
--force Overwrite the output directory if it already exists.
[Example] Run in local environment:
$ fuc ngs-bam2fq \
manifest.csv \
output_dir \
--thread 10
$ sh output_dir/shell/runme.sh
[Example] Run in parallel environment with specific queue:
$ fuc ngs-bam2fq \
manifest.csv \
output_dir \
--qsub "-q queue_name -pe pe_name 10" \
--thread 10
$ sh output_dir/shell/runme.sh
[Example] Run in parallel environment with specific nodes:
$ fuc ngs-bam2fq \
manifest.csv \
output_dir \
--qsub "-l h='node_A|node_B' -pe pe_name 10" \
--thread 10
$ sh output_dir/shell/runme.sh
ngs-fq2bam
$ fuc ngs-fq2bam -h
usage: fuc ngs-fq2bam [-h] [--qsub TEXT] [--bed PATH] [--thread INT]
[--platform TEXT] [--job TEXT] [--force] [--keep]
manifest fasta output java vcf [vcf ...]
Pipeline for converting FASTQ files to analysis-ready BAM files.
Here, "analysis-ready" means that the final BAM files are: 1) aligned to a
reference genome, 2) sorted by genomic coordinate, 3) marked for duplicate
reads, 4) recalibrated by BQSR model, and 5) ready for downstream analyses
such as variant calling.
By default, the pipeline will be run in a local environment. Use --qsub to
leverage a parallel environment, in which case SGE is required.
External dependencies:
- [Required] BWA: Required for read alignment (i.e. BWA-MEM).
- [Required] SAMtools: Required for sorting and indexing BAM files.
- [Required] GATK: Required for marking duplicate reads and recalibrating BAM files.
- [Optional] SGE: Required for job submission (i.e. qsub).
Manifest columns:
- Name: Sample name.
- Read1: Path to forward FASTQ file.
- Read2: Path to reverse FASTQ file.
Positional arguments:
manifest Sample manifest CSV file.
fasta Reference FASTA file.
output Output directory.
java Java resource to request for GATK.
vcf One or more reference VCF files containing known variant
sites (e.g. 1000 Genomes Project).
Optional arguments:
-h, --help Show this help message and exit.
--qsub TEXT SGE resource to request for qsub.
--bed PATH BED file.
--thread INT Number of threads to use (default: 1).
--platform TEXT Sequencing platform (default: 'Illumina').
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
--keep Keep temporary files.
[Example] Run in local environment:
$ fuc ngs-fq2bam \
manifest.csv \
ref.fa \
output_dir \
"-Xmx15g -Xms15g" \
1.vcf 2.vcf 3.vcf \
--thread 10
$ sh output_dir/shell/runme.sh
[Example] Run in parallel environment with specific queue:
$ fuc ngs-fq2bam \
manifest.csv \
ref.fa \
output_dir \
"-Xmx15g -Xms15g" \
1.vcf 2.vcf 3.vcf \
--qsub "-q queue_name -pe pe_name 10" \
--thread 10
$ sh output_dir/shell/runme.sh
[Example] Run in parallel environment with specific nodes:
$ fuc ngs-fq2bam \
manifest.csv \
ref.fa \
output_dir \
"-Xmx15g -Xms15g" \
1.vcf 2.vcf 3.vcf \
--qsub "-l h='node_A|node_B' -pe pe_name 10" \
--thread 10
$ sh output_dir/shell/runme.sh
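A sample manifest with the columns listed above can be generated with Python's `csv` module. This is a sketch under assumptions: the function `build_manifest`, the sample names, the directory, and the `_R1`/`_R2` file naming are all hypothetical, so adjust them to your own data.

```python
import csv
import io


def build_manifest(samples, fastq_dir):
    """Return manifest CSV text with Name, Read1, and Read2 columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Read1", "Read2"])
    for sample in samples:
        # Assumed naming scheme: SAMPLE_R1.fastq.gz / SAMPLE_R2.fastq.gz.
        writer.writerow([
            sample,
            f"{fastq_dir}/{sample}_R1.fastq.gz",
            f"{fastq_dir}/{sample}_R2.fastq.gz",
        ])
    return buffer.getvalue()


manifest_text = build_manifest(["S1", "S2"], "/data/fastq")
print(manifest_text)
```

Writing `manifest_text` to a file such as manifest.csv gives a file that could be passed as the `manifest` argument.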
ngs-hc
$ fuc ngs-hc -h
usage: fuc ngs-hc [-h] [--bed PATH] [--dbsnp PATH] [--thread INT]
[--batch INT] [--job TEXT] [--force] [--keep] [--posix]
manifest fasta output qsub java1 java2
Pipeline for germline short variant discovery.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- GATK: Required for variant calling (i.e. HaplotypeCaller) and filtration.
Manifest columns:
- BAM: Recalibrated BAM file.
Positional arguments:
manifest Sample manifest CSV file.
fasta Reference FASTA file.
output Output directory.
qsub SGE resource to request for qsub.
java1 Java resource to request for single-sample variant calling.
java2 Java resource to request for joint variant calling.
Optional arguments:
-h, --help Show this help message and exit.
--bed PATH BED file.
--dbsnp PATH VCF file from dbSNP.
--thread INT Number of threads to use (default: 1).
--batch INT Batch size used for GenomicsDBImport (default: 0). This
controls the number of samples for which readers are
open at once and therefore provides a way to minimize
memory consumption. The size of 0 means no batching (i.e.
readers for all samples will be opened at once).
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
--keep Keep temporary files.
--posix Set GenomicsDBImport to allow for optimizations to improve
the usability and performance for shared Posix Filesystems
(e.g. NFS, Lustre). If set, file level locking is disabled
and file system writes are minimized by keeping a higher
number of file descriptors open for longer periods of time.
Use with --batch if keeping a large number of file
descriptors open is an issue.
[Example] Specify queue:
$ fuc ngs-hc \
manifest.csv \
ref.fa \
output_dir \
"-q queue_name" \
"-Xmx15g -Xms15g" \
"-Xmx30g -Xms30g" \
--dbsnp dbSNP.vcf
[Example] Specify nodes:
$ fuc ngs-hc \
manifest.csv \
ref.fa \
output_dir \
"-l h='node_A|node_B'" \
"-Xmx15g -Xms15g" \
"-Xmx30g -Xms30g" \
--bed in.bed
ngs-m2
$ fuc ngs-m2 -h
usage: fuc ngs-m2 [-h] [--bed PATH] [--force] [--keep]
manifest fasta output pon germline qsub java
Pipeline for somatic short variant discovery.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- GATK: Required for variant calling (i.e. Mutect2) and filtration.
Manifest columns:
- Tumor: Recalibrated BAM file for tumor.
- Normal: Recalibrated BAM file for matched normal.
Positional arguments:
manifest Sample manifest CSV file.
fasta Reference FASTA file.
output Output directory.
pon PoN VCF file.
germline Germline VCF file.
qsub SGE resource to request for qsub.
java Java resource to request for GATK.
Optional arguments:
-h, --help Show this help message and exit.
--bed PATH BED file.
--force Overwrite the output directory if it already exists.
--keep Keep temporary files.
ngs-pon
$ fuc ngs-pon -h
usage: fuc ngs-pon [-h] [--bed PATH] [--force] [--keep]
manifest fasta output qsub java
Pipeline for constructing a panel of normals (PoN).
Dependencies:
- GATK: Required for constructing PoN.
Manifest columns:
- BAM: Path to recalibrated BAM file.
Positional arguments:
manifest Sample manifest CSV file.
fasta Reference FASTA file.
output Output directory.
qsub SGE resource to request for qsub.
java Java resource to request for GATK.
Optional arguments:
-h, --help Show this help message and exit.
--bed PATH BED file.
--force Overwrite the output directory if it already exists.
--keep Keep temporary files.
[Example] Specify queue:
$ fuc ngs-pon \
manifest.csv \
ref.fa \
output_dir \
"-q queue_name" \
"-Xmx15g -Xms15g"
[Example] Specify nodes:
$ fuc ngs-pon \
manifest.csv \
ref.fa \
output_dir \
"-l h='node_A|node_B'" \
"-Xmx15g -Xms15g"
ngs-quant
$ fuc ngs-quant -h
usage: fuc ngs-quant [-h] [--thread INT] [--bootstrap INT] [--job TEXT]
[--force] [--posix] [--stranded TEXT]
manifest index output qsub
Pipeline for running RNAseq quantification from FASTQ files with Kallisto.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- kallisto: Required for RNAseq quantification.
Manifest columns:
- Name: Sample name.
- Read1: Path to forward FASTQ file.
- Read2: Path to reverse FASTQ file.
Positional arguments:
manifest Sample manifest CSV file.
index Kallisto index file.
output Output directory.
qsub SGE resource to request for qsub.
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--bootstrap INT Number of bootstrap samples (default: 50).
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
--posix Set the environment variable HDF5_USE_FILE_LOCKING=FALSE
before running Kallisto. This is required for shared Posix
Filesystems (e.g. NFS, Lustre).
--stranded TEXT Strand specific reads (default: 'none') (choices:
'none', 'forward', 'reverse').
[Example] Specify queue:
$ fuc ngs-quant \
manifest.csv \
transcripts.idx \
output_dir \
"-q queue_name -pe pe_name 10" \
--thread 10
ngs-trim
$ fuc ngs-trim -h
usage: fuc ngs-trim [-h] [--thread INT] [--job TEXT] [--force]
manifest output qsub
Pipeline for trimming adapters from FASTQ files.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- cutadapt: Required for trimming adapters.
Manifest columns:
- Name: Sample name.
- Read1: Path to forward FASTQ file.
- Read2: Path to reverse FASTQ file.
Positional arguments:
manifest Sample manifest CSV file.
output Output directory.
qsub SGE resource to request for qsub.
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
[Example] Specify queue:
$ fuc ngs-trim \
manifest.csv \
output_dir \
"-q queue_name -pe pe_name 10" \
--thread 10
tabix-index
$ fuc tabix-index -h
usage: fuc tabix-index [-h] [--force] file
Index a GFF/BED/SAM/VCF file with Tabix.
The Tabix program is used to index a TAB-delimited genome position file
(GFF/BED/SAM/VCF) and create an index file (.tbi). The input data file must
be position sorted and compressed by bgzip.
Positional arguments:
file File to be indexed.
Optional arguments:
-h, --help Show this help message and exit.
--force Force to overwrite the index file if it is present.
[Example] Index a GFF file:
$ fuc tabix-index in.gff.gz
[Example] Index a BED file:
$ fuc tabix-index in.bed.gz
[Example] Index a SAM file:
$ fuc tabix-index in.sam.gz
[Example] Index a VCF file:
$ fuc tabix-index in.vcf.gz
tabix-slice
$ fuc tabix-slice -h
usage: fuc tabix-slice [-h] file regions [regions ...]
Slice a GFF/BED/SAM/VCF file with Tabix.
After creating an index file (.tbi), the Tabix program is able to quickly
retrieve data lines overlapping regions specified in the format
'chr:start-end'. Coordinates specified in this region format are 1-based and
inclusive.
Positional arguments:
file File to be sliced.
regions One or more regions.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Slice a VCF file:
$ fuc tabix-slice in.vcf.gz chr1:100-200 > out.vcf
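Because these region coordinates are 1-based and inclusive, the overlap test differs slightly from the half-open convention used by some other commands. The Python sketch below spells out the comparison; the function `overlaps` is a hypothetical helper, not part of fuc or Tabix.

```python
def overlaps(record_start, record_end, region_start, region_end):
    """Return True if a record overlaps a region.

    All coordinates are 1-based and inclusive on both ends.
    """
    return record_start <= region_end and record_end >= region_start


# Region chr1:100-200 includes both position 100 and position 200.
print(overlaps(100, 100, 100, 200))
print(overlaps(200, 200, 100, 200))
print(overlaps(201, 201, 100, 200))
```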
tbl-merge
$ fuc tbl-merge -h
usage: fuc tbl-merge [-h] [--how TEXT] [--on TEXT [TEXT ...]] [--lsep TEXT]
[--rsep TEXT] [--osep TEXT]
left right
Merge two table files.
This command will merge two table files using one or more shared columns.
The command essentially wraps the 'pandas.DataFrame.merge' method from the
pandas package. For details on the merging algorithms, please visit the
method's documentation page.
Positional arguments:
left Input left file.
right Input right file.
Optional arguments:
-h, --help Show this help message and exit.
--how TEXT Type of merge to be performed (default: 'inner')
(choices: 'left', 'right', 'outer', 'inner', 'cross').
--on TEXT [TEXT ...] Column names to join on.
--lsep TEXT Delimiter to use for the left file (default: '\t').
--rsep TEXT Delimiter to use for the right file (default: '\t').
--osep TEXT Delimiter to use for the output file (default: '\t').
[Example] Merge two tables:
$ fuc tbl-merge left.tsv right.tsv > merged.tsv
[Example] When the left table is a CSV:
$ fuc tbl-merge left.csv right.tsv --lsep , > merged.tsv
[Example] Merge with the outer algorithm:
$ fuc tbl-merge left.tsv right.tsv --how outer > merged.tsv
tbl-sum
$ fuc tbl-sum -h
usage: fuc tbl-sum [-h] [--sep TEXT] [--skiprows TEXT]
[--na_values TEXT [TEXT ...]] [--keep_default_na]
[--expr TEXT] [--columns TEXT [TEXT ...]] [--dtypes PATH]
table_file
Summarize a table file.
Positional arguments:
table_file Table file.
Optional arguments:
-h, --help Show this help message and exit.
--sep TEXT Delimiter to use (default: '\t').
--skiprows TEXT Comma-separated line numbers to skip (0-indexed) or
number of lines to skip at the start of the file
(e.g. `--skiprows 1,` will skip the second line,
`--skiprows 2,4` will skip the third and fifth lines,
and `--skiprows 10` will skip the first 10 lines).
--na_values TEXT [TEXT ...]
Additional strings to recognize as NA/NaN (by
default, the following values are interpreted
as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND',
'-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN',
'<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
'null').
--keep_default_na Whether or not to include the default NaN values when
parsing the data (see 'pandas.read_table' for details).
--expr TEXT Query the columns of a pandas.DataFrame with a
boolean expression (e.g. `--expr "A == 'yes'"`).
--columns TEXT [TEXT ...]
Columns to be summarized (by default, all columns
will be included).
--dtypes PATH File of column names and their data types (either
'categorical' or 'numeric'); one tab-delimited pair of
column name and data type per line.
[Example] Summarize a table:
$ fuc tbl-sum table.tsv
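The two forms accepted by --skiprows can be told apart by the presence of a comma. The Python sketch below shows one way such a string could be parsed into either a list of 0-indexed line numbers or a single count, which are the two kinds of value that the `skiprows` parameter of `pandas.read_table` accepts. The function `parse_skiprows` is hypothetical and may not match how the command parses its argument.

```python
def parse_skiprows(text):
    """Convert a --skiprows string into a list of line numbers or a count."""
    if "," in text:
        # '1,' -> [1]; '2,4' -> [2, 4]. Empty fields are ignored.
        return [int(x) for x in text.split(",") if x.strip()]
    # No comma: the number of lines to skip at the start of the file.
    return int(text)


print(parse_skiprows("1,"))
print(parse_skiprows("2,4"))
print(parse_skiprows("10"))
```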
vcf-call
$ fuc vcf-call -h
usage: fuc vcf-call [-h] [-r TEXT [TEXT ...]] [--min-mq INT] [--max-depth INT]
[--dir-path PATH] [--gap_frac FLOAT]
[--group-samples PATH]
fasta bams [bams ...]
Call SNVs and indels from BAM files.
Under the hood, the command uses the bcftools program to call variants.
Positional arguments:
fasta Reference FASTA file.
bams One or more input BAM files. Alternatively, you can
provide a text file (.txt, .tsv, .csv, or .list)
containing one BAM file per line.
Optional arguments:
-h, --help Show this help message and exit.
-r TEXT [TEXT ...], --regions TEXT [TEXT ...]
By default, the command looks at each genomic
position with coverage in BAM files, which can be
excruciatingly slow for large files (e.g. whole
genome sequencing). Therefore, use this argument to
only call variants in given regions. Each region must
have the format chrom:start-end and be a half-open
interval with (start, end]. This means, for example,
chr1:100-103 will extract positions 101, 102, and
103. Alternatively, you can provide a BED file
(compressed or uncompressed) to specify regions. Note
that the 'chr' prefix in contig names (e.g. 'chr1'
vs. '1') will be automatically added or removed as
necessary to match the input BAM's contig names.
--min-mq INT Minimum mapping quality for an alignment to be used
(default: 1).
--max-depth INT At a position, read maximally this number of reads
per input file (default: 250).
--dir-path PATH By default, intermediate files (likelihoods.bcf,
calls.bcf, and calls.normalized.bcf) will be stored
in a temporary directory, which is automatically
deleted after the final VCF is created. If you provide a
directory path, intermediate files will be stored
there.
--gap_frac FLOAT Minimum fraction of gapped reads for calling indels
(default: 0.002).
--group-samples PATH By default, all samples are assumed to come from a
single population. This option lets you group
samples into populations and apply the HWE assumption
within but not across the populations. To use this
option, provide a tab-delimited text file with sample
names in the first column and group names in the
second column. If '--group-samples -' is given
instead, no HWE assumption is made at all and
single-sample calling is performed. Note that in low
coverage data this inflates the rate of false
positives. Therefore, make sure you know what you are
doing.
[Example] Specify regions manually:
$ fuc vcf-call ref.fa 1.bam 2.bam \
-r chr1:100-200 chr2:400-500 > out.vcf
[Example] Specify regions with a BED file:
$ fuc vcf-call ref.fa bam.list \
-r in.bed > out.vcf
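The half-open (start, end] convention for regions can be checked with a short sketch. The helper name `region_positions` is hypothetical and the code only illustrates the convention stated in the help text; it is not taken from fuc.

```python
def region_positions(region):
    """Expand 'chrom:start-end' into the positions it covers under (start, end]."""
    # rpartition keeps contig names that themselves contain ':' intact.
    chrom, _, span = region.rpartition(":")
    start, end = (int(value) for value in span.split("-"))
    # start is excluded and end is included.
    return chrom, list(range(start + 1, end + 1))


chrom, positions = region_positions("chr1:100-103")
```

With this convention, a region of `chr1:100-200` covers 100 positions, 101 through 200.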
vcf-filter
$ fuc vcf-filter -h
usage: fuc vcf-filter [-h] [--expr TEXT] [--samples PATH]
[--drop_duplicates [TEXT ...]] [--greedy] [--opposite]
[--filter_empty]
vcf
Filter a VCF file.
Positional arguments:
vcf VCF file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
--expr TEXT Expression to evaluate.
--samples PATH File of sample names to apply the marking (one
sample per line).
--drop_duplicates [TEXT ...]
Only consider certain columns for identifying
duplicates, by default use all of the columns.
--greedy Use this flag to mark even ambiguous genotypes
as missing.
--opposite Use this flag to mark all genotypes that do not
satisfy the query expression as missing and leave
those that do intact.
--filter_empty Use this flag to remove rows with no genotype
calls at all.
[Example] Mark genotypes with 0/0 as missing:
$ fuc vcf-filter in.vcf --expr 'GT == "0/0"' > out.vcf
[Example] Mark genotypes that are not 0/0 as missing:
$ fuc vcf-filter in.vcf --expr 'GT != "0/0"' > out.vcf
[Example] Mark genotypes whose DP is less than 30 as missing:
$ fuc vcf-filter in.vcf --expr 'DP < 30' > out.vcf
[Example] Same as above, but also mark ambiguous genotypes as missing:
$ fuc vcf-filter in.vcf --expr 'DP < 30' --greedy > out.vcf
[Example] Build a complex query to select genotypes to be marked missing:
$ fuc vcf-filter in.vcf --expr 'AD[1] < 10 or DP < 30' --opposite > out.vcf
[Example] Compute summary statistics and subset samples:
$ fuc vcf-filter in.vcf \
--expr 'np.mean(AD) < 10' --greedy --samples sample.list > out.vcf
[Example] Drop duplicate rows:
$ fuc vcf-filter in.vcf --drop_duplicates CHROM POS REF ALT > out.vcf
[Example] Filter out rows without genotypes:
$ fuc vcf-filter in.vcf --filter_empty > out.vcf
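The relationship between --expr and --opposite can be sketched as below. This is an illustration only, under the assumption that each genotype is a mapping of FORMAT fields and that the expression is replaced by a Python callable; `mark_missing` is a hypothetical name, missing genotypes are represented as None for simplicity, and --greedy is not modeled.

```python
def mark_missing(genotypes, predicate, opposite=False):
    """Return a copy of `genotypes` with selected entries replaced by None (missing)."""
    marked = {}
    for sample, fields in genotypes.items():
        hit = predicate(fields)
        if opposite:
            # With --opposite, genotypes that do NOT satisfy the expression are marked.
            hit = not hit
        marked[sample] = None if hit else fields
    return marked


# Hypothetical genotypes for one variant.
genotypes = {
    "A": {"GT": "0/1", "DP": 45},
    "B": {"GT": "0/0", "DP": 12},
}
low_depth_marked = mark_missing(genotypes, lambda g: g["DP"] < 30)
high_depth_marked = mark_missing(genotypes, lambda g: g["DP"] < 30, opposite=True)
```

Here `low_depth_marked` keeps sample A and marks sample B, while `high_depth_marked` does the reverse.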
vcf-index
$ fuc vcf-index -h
usage: fuc vcf-index [-h] [--force] vcf
Index a VCF file.
This command will create an index file (.tbi) for the input VCF.
Positional arguments:
vcf Input VCF file to be indexed. When an uncompressed file is
given, the command will automatically create a BGZF
compressed copy of the file (.gz) before indexing.
Optional arguments:
-h, --help Show this help message and exit.
--force Overwrite the index file if it is already present.
[Example] Index a compressed VCF file:
$ fuc vcf-index in.vcf.gz
[Example] Index an uncompressed VCF file (will create a compressed VCF first):
$ fuc vcf-index in.vcf
vcf-merge
$ fuc vcf-merge -h
usage: fuc vcf-merge [-h] [--how TEXT] [--format TEXT] [--sort] [--collapse]
vcf_files [vcf_files ...]
Merge two or more VCF files.
Positional arguments:
vcf_files VCF files (compressed or uncompressed). Note that the 'chr'
prefix in contig names (e.g. 'chr1' vs. '1') will be
automatically added or removed as necessary to match the
contig names of the first VCF.
Optional arguments:
-h, --help Show this help message and exit.
--how TEXT Type of merge as defined in pandas.DataFrame.merge
(default: 'inner').
--format TEXT FORMAT subfields to be retained (e.g. 'GT:AD:DP')
(default: 'GT').
--sort Use this flag to turn off sorting of records
(default: True).
--collapse Use this flag to collapse duplicate records
(default: False).
[Example] Merge multiple VCF files:
$ fuc vcf-merge 1.vcf 2.vcf 3.vcf > merged.vcf
[Example] Keep the GT, AD, DP fields:
$ fuc vcf-merge 1.vcf 2.vcf --format GT:AD:DP > merged.vcf
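To make the effect of --format concrete, the sketch below trims one sample column down to the requested FORMAT subfields. It relies only on the VCF convention that FORMAT keys and sample values are colon-separated in the same order; `retain_subfields` is a hypothetical helper, and filling an absent subfield with '.' is an assumption rather than documented fuc behavior.

```python
def retain_subfields(format_keys, sample_value, keep):
    """Keep only the FORMAT subfields named in `keep` (e.g. 'GT:AD:DP')."""
    lookup = dict(zip(format_keys.split(":"), sample_value.split(":")))
    # Assumption: a requested subfield that is absent is written as '.'.
    return ":".join(lookup.get(key, ".") for key in keep.split(":"))


trimmed = retain_subfields("GT:AD:DP:GQ", "0/1:10,12:22:99", "GT:AD:DP")
gt_only = retain_subfields("GT:AD:DP:GQ", "0/1:10,12:22:99", "GT")
```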
vcf-rename
$ fuc vcf-rename -h
usage: fuc vcf-rename [-h] [--mode TEXT] [--range INT INT] [--sep TEXT]
vcf names
Rename the samples in a VCF file.
There are three different renaming modes using the 'names' file:
- 'MAP': Default mode. Requires two columns, old names in the first
and new names in the second.
- 'INDEX': Requires two columns, new names in the first and 0-based
indices in the second.
- 'RANGE': Requires only one column of new names but '--range' must
be specified.
Positional arguments:
vcf VCF file (compressed or uncompressed).
names Text file containing information for renaming the samples.
Optional arguments:
-h, --help Show this help message and exit.
--mode TEXT Renaming mode (default: 'MAP') (choices: 'MAP',
'INDEX', 'RANGE').
--range INT INT Index range to use when renaming the samples.
Applicable only with the 'RANGE' mode.
--sep TEXT Delimiter to use for reading the 'names' file
(default: '\t').
[Example] Using the default 'MAP' mode:
$ fuc vcf-rename in.vcf old_new.tsv > out.vcf
[Example] Using the 'INDEX' mode:
$ fuc vcf-rename in.vcf new_idx.tsv --mode INDEX > out.vcf
[Example] Using the 'RANGE' mode:
$ fuc vcf-rename in.vcf new_only.tsv --mode RANGE --range 2 5 > out.vcf
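The three renaming modes can be compared side by side with a small sketch. It is not the fuc implementation; `rename_samples` is a hypothetical name, `rows` stands for the parsed lines of the 'names' file, and the 'RANGE' branch assumes a 0-based, half-open range (like Python's `range`), which the help text does not state explicitly.

```python
def rename_samples(samples, rows, mode="MAP", index_range=None):
    """Return new sample names; `rows` holds the parsed lines of the 'names' file."""
    renamed = list(samples)
    if mode == "MAP":
        # Two columns: old name, new name.
        mapping = {old: new for old, new in rows}
        renamed = [mapping.get(name, name) for name in samples]
    elif mode == "INDEX":
        # Two columns: new name, 0-based index.
        for new, index in rows:
            renamed[int(index)] = new
    elif mode == "RANGE":
        # One column of new names; assumed half-open range [start, end).
        start, end = index_range
        names = [row[0] for row in rows]
        if len(names) != end - start:
            raise ValueError("number of names does not match the range")
        renamed[start:end] = names
    else:
        raise ValueError(f"unknown mode: {mode}")
    return renamed


samples = ["A", "B", "C", "D", "E"]
by_map = rename_samples(samples, [("A", "a"), ("C", "c")])
by_index = rename_samples(samples, [("x", "1"), ("y", "3")], mode="INDEX")
by_range = rename_samples(samples, [("p",), ("q",), ("r",)], mode="RANGE", index_range=(2, 5))
```

Under the half-open assumption, `--range 2 5` in the example above would need a 'names' file with exactly three new names.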
vcf-slice
$ fuc vcf-slice -h
usage: fuc vcf-slice [-h] vcf regions [regions ...]
Slice a VCF file for specified regions.
Positional arguments:
vcf Input VCF file. It must already be BGZF compressed (.gz) and
indexed (.tbi) to allow random access. A VCF file can be
compressed with the fuc-bgzip command and indexed with the
vcf-index command.
regions One or more regions to be sliced. Each region must have the
format chrom:start-end and be a half-open interval with
(start, end]. This means, for example, chr1:100-103 will
extract positions 101, 102, and 103. Alternatively, you can
provide a BED file (compressed or uncompressed) to specify
regions. Note that the 'chr' prefix in contig names (e.g.
'chr1' vs. '1') will be automatically added or removed as
necessary to match the input VCF's contig names.
Optional arguments:
-h, --help Show this help message and exit.
[Example] Specify regions manually:
$ fuc vcf-slice in.vcf.gz 1:100-300 2:400-700 > out.vcf
[Example] Specify regions with a BED file:
$ fuc vcf-slice in.vcf.gz regions.bed > out.vcf
[Example] Output a compressed file:
$ fuc vcf-slice in.vcf.gz regions.bed | fuc fuc-bgzip > out.vcf.gz
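The automatic handling of the 'chr' prefix can be pictured with the following sketch. The function name `match_chr_prefix` and the detection rule (any contig starting with 'chr') are assumptions made for illustration; fuc may decide differently. The input is assumed to always have the chrom:start-end form.

```python
def match_chr_prefix(region, contigs):
    """Add or drop the 'chr' prefix so the region's contig matches `contigs`."""
    chrom, sep, span = region.rpartition(":")
    # Assumption: the file uses the prefix if any of its contigs starts with 'chr'.
    uses_chr = any(name.startswith("chr") for name in contigs)
    if uses_chr and not chrom.startswith("chr"):
        chrom = "chr" + chrom
    elif not uses_chr and chrom.startswith("chr"):
        chrom = chrom[len("chr"):]
    return chrom + sep + span


added = match_chr_prefix("1:100-300", ["chr1", "chr2"])
removed = match_chr_prefix("chr2:400-700", ["1", "2"])
unchanged = match_chr_prefix("chr1:100-300", ["chr1", "chr2"])
```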
vcf-split
$ fuc vcf-split -h
usage: fuc vcf-split [-h] [--clean] [--force] vcf output
Split a VCF file by individual.
Positional arguments:
vcf VCF file to be split.
output Output directory.
Optional arguments:
-h, --help Show this help message and exit.
--clean By default, the command will only return variants present in
each individual. Use this flag to stop this behavior and make
sure that all individuals have the same number of variants.
--force Overwrite the output directory if it already exists.
[Example] Split a VCF file by individual:
$ fuc vcf-split in.vcf output_dir
vcf-vcf2bed
$ fuc vcf-vcf2bed -h
usage: fuc vcf-vcf2bed [-h] vcf
Convert a VCF file to a BED file.
Positional arguments:
vcf VCF file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
[Example] Convert VCF to BED:
$ fuc vcf-vcf2bed in.vcf > out.bed
vcf-vep
$ fuc vcf-vep -h
usage: fuc vcf-vep [-h] [--opposite] [--as_zero] vcf expr
Filter a VCF file by annotations from Ensembl VEP.
Positional arguments:
vcf VCF file annotated by Ensembl VEP (compressed or uncompressed).
expr Query expression to evaluate.
Optional arguments:
-h, --help Show this help message and exit.
--opposite Use this flag to return only records that do not
satisfy the query expression.
--as_zero Use this flag to treat missing values as zero instead of NaN.
[Example] Select variants in the TP53 gene:
$ fuc vcf-vep in.vcf "SYMBOL == 'TP53'" > out.vcf
[Example] Exclude variants from the TP53 gene:
$ fuc vcf-vep in.vcf "SYMBOL != 'TP53'" > out.vcf
[Example] Same as above:
$ fuc vcf-vep in.vcf "SYMBOL == 'TP53'" --opposite > out.vcf
[Example] Select splice donor or stop-gain variants:
$ fuc vcf-vep in.vcf \
"Consequence in ['splice_donor_variant', 'stop_gained']" > out.vcf
[Example] Build a complex query to select specific variants:
$ fuc vcf-vep in.vcf \
"(SYMBOL == 'TP53') and (Consequence.str.contains('stop_gained'))" > out.vcf
[Example] Select variants whose gnomAD AF is less than 0.001:
$ fuc vcf-vep in.vcf "gnomAD_AF < 0.001" > out.vcf
[Example] Variants without AF data will be treated as having AF of 0:
$ fuc vcf-vep in.vcf "gnomAD_AF < 0.001" --as_zero > out.vcf
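The last two examples differ because a comparison against NaN is always false, so a variant without AF data fails `gnomAD_AF < 0.001` unless its missing value is replaced by zero. The sketch below illustrates this with plain Python floats; `passes_af_filter` is a hypothetical helper and missing annotations are represented as None, which is an assumption about the data layout rather than fuc's internals.

```python
import math


def passes_af_filter(af, threshold=0.001, as_zero=False):
    """Evaluate 'AF < threshold' for one variant; `af` is None when no data exists."""
    if af is None:
        # Missing values become 0 with --as_zero, NaN otherwise.
        af = 0.0 if as_zero else math.nan
    # Any comparison with NaN is False, so NaN never passes the filter.
    return af < threshold


without_flag = passes_af_filter(None)              # False: NaN < 0.001 is False
with_flag = passes_af_filter(None, as_zero=True)   # True: 0.0 < 0.001
rare = passes_af_filter(0.0005)                    # True
common = passes_af_filter(0.2)                     # False
```

Whether to use --as_zero therefore depends on whether variants absent from the annotation source should count as rare.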