View on GitHub

Wham

Structural variant detection and association testing

Download this project as a .zip file Download this project as a tar.gz file

What is WHAM?

WHole-genome Alignment Metrics (WHAM) is a structural variant (SV) caller that integrates several sources of mapping information to identify SVs. WHAM classifies SVs using a flexible and extendable machine-learning algorithm (random forest). WHAM is not only accurate at identifying SVs, but its association test can identify shared SVs enriched in a cohort of diseased individuals compared to a background of healthy individuals.

WHAM can be easily run as a stand alone tool or as part of gkno (http://gkno.me) or bcbio-nextgen (http://bcbio-nextgen.readthedocs.org) pipelines.

If you would like to use and cite this tool, please see our paper in PLOS computational biology (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004572).

WHAM-GRAPHENING is under development, but functional for deletions. It can be used to visualize the putative breakpoints of complex SVs.

Installing WHAM

WHAM builds in two simple steps assuming that you have the dependencies. Most linux environments have cmake and OpenMP the two requirements for WHAM. If an install fails you may have to install these libraries.

Fast and dirty install

git clone --recursive https://github.com/jewmanchue/wham.git
cd wham
make

Dependencies

For classification of SV variant types (OPTIONAL), a python distribution with the scikit-learn and argparse packages is required. These come standard with the Anaconda release of python, which can be obtained here:

http://continuum.io/downloads

If you want to continue with your current python distribution use the following commands (NOTE: requires Python (>= 2.6; Python 2.7 is recommended), NumPy (>= 1.6.1), SciPy (>= 0.9):

pip install -U numpy scipy scikit-learn argparse

After dependency installation, clone the git repository with the following commands:

git clone --recursive git@github.com:jewmanchue/wham.git
cd wham
make

If you get an error while downloading wham try https:

git clone --recursive https://github.com/zeeev/wham.git
cd wham
make

WHAM input

WHAM uses soft clipping information and supplementary alignments provided by mapping software. We recommend that you use BWA mem for accurate structural variant calls. The input BAM files must be sorted, duplicates marked or removed and the BAM files need to be indexed. The sorted BAM files need to have the HD:SO tag, this is provided by samtools 0.19 or later.

Running WHAM

usage statement:

usage  : WHAM-BAM -m <INT> -q <INT> -p <INT> -x <INT> -r <STRING> -e <STRING> -t <STRING> -b <STRING>

example: WHAM-BAM -m 2 -q 15 -p 10 -x 20 -r chr1:0-10000 -e genes.bed -t a.bam,b.bam -b c.bam,d.bam

required   : t <STRING> -- comma separated list of target bam files
required   : f <STRING> -- reference sequence reads were aligned to
option     : b <STRING> -- comma separated list of background bam files
option     : r <STRING> -- a genomic region in the format "seqid:start-end"
option     : x <INT>    -- set the number of threads, otherwise max [all]
option     : e <STRING> -- a bedfile that defines regions to score  [none]
option     : m <INT>    -- minimum number of soft-clips supporting
                           START [3]
option     : q <INT>    -- exclude soft-clipped sequences with average base
                           quality below phred scaled value (0-41) [20]
option     : p <INT>    -- exclude soft-clipped reads with mapping quality
                           below value [15]
option     : i          -- base quality is Illumina 1.3+ Phred+64

Version: v1.7.0-57-g76af
Contact: zev.kronenberg [at] gmail.com
Notes  : -If you find a bug, please open a report on github!
         -The options -m,-q, and -p, control the sensitivity and specificity
         -If you have exome data use the -e option for best performance

Example of running Wham

Here we run Wham on the whole NA12878 genome:

./bin/WHAM-BAM -f hg19.fa -t NA12878.sort.bam > NA12878.wham.raw.vcf 2>  NA12878.wham.raw.err 
python utils/classify_WHAM_vcf.py NA12878.wham.raw.vcf wham/data/WHAM_training_data.txt > NA12878.wham.raw.class.vcf 2> NA12878.wham.raw.class.err

The first command runs WHAM on the input bam file and outputs an unsorted VCF file given the reference fasta and the bam file. The second command runs the WHAM classifier, which uses information from the SV calls and from a training dataset to classify each SV call (insertions, deletions, etc.). After classification the output VCF file will have two new keys in the info field: "WC" and "WP".

By default Wham uses all CPUs the machine has. If you only want to use fewer CPUs use the "-x[n]" flag where 'n' is the number of desired CPUs:

./bin/WHAM-BAM -f hg19.fa -x 10 -t NA12878.sort.bam > NA12878.wham.raw.vcf 2>  NA12878.wham.raw.err 

Sensitivity and specificity can be controlled by adjusting -q, -p and -m flags. WHAM interrogates a potential structural variant if more than two soft-clipped reads share the same start or end. By requiring higher base quality and mapping quality of soft-clipped reads the specificity can be increased. Soft-clipping can occur because of low quality bases at the five-prime or three-prime end of a read. In the command below a minimum mapping quality of 30, a minimum base quality of 35 and five soft-clips are required for a SV call:

./bin/WHAM-BAM -f hg19.fa -p 30 -q 35 -m 5 -x 10 -t NA12878.sort.bam > NA12878.wham.raw.vcf 2>  NA12878.wham.raw.err 

To run WHAM for association testing you need to specify a set of target and background files. Here is hypothetical example where three target files (A.bam,B.bam,C.bam) and three background files (D.bam,E.bam,F.bam) are used.

./bin/WHAM-BAM -f hg19.fa -t A.bam,B.bam,C.bam -b D.bam,E.bam,F.bam > wham.raw.vcf 2>  wham.raw.err 

The output of this command will have non-zero LRT values. The higher the LRT value the more likely the allele frequency (of the structural variant) is different between the target group and background group.

There are two ways to run WHAM in regions. The first is to set a region using the -r option. In the example below we score chr1 0-100,000 with WHAM:

./bin/WHAM-BAM -f hg19.fa -r chr1:0-100000 -t NA12878.sort.bam > NA12878.wham.raw.vcf 2> 

Alternatively, WHAM can take a bedfile, defining the regions to run. This is highly advised if you have exome data - or - sparse coverage:

./bin/WHAM-BAM -f hg19.fa -e regions.bed -t NA12878.sort.bam > NA12878.wham.raw.vcf 2> 

Understanding WHAM output

WHAM outputs an unsorted VCFv4.1 file. Below is an example from NA12878.


##fileformat=VCFv4.1
##INFO=<ID=LRT,Number=1,Type=Float,Description="Likelihood Ratio Test Statistic">
##INFO=<ID=WAF,Number=3,Type=Float,Description="Allele frequency of: background,target,combined">
##INFO=<ID=GC,Number=2,Type=Integer,Description="Number of called genotypes in: background,target">
##INFO=<ID=AT,Number=15,Type=Float,Description="Attributes for classification">
##INFO=<ID=KM,Number=3,Type=Float,Description="Kmer filters. The number of 17bp kmers that were assayed, the number of hits to the masking DB, the fraction of hits">
##INFO=<ID=PU,Number=1,Type=Integer,Description="Number of reads supporting position">
##INFO=<ID=SU,Number=1,Type=Integer,Description="Number of supplemental reads supporting position">
##INFO=<ID=CU,Number=1,Type=Integer,Description="Number of neighboring all soft clip clusters across all individuals at pileup position">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Number of reads at pileup position across individuals passing filters">
##INFO=<ID=SP,Number=3,Type=Integer,Description="Number of reads supporting endpoint: insertsize,split-read,alternative-mapping">
##INFO=<ID=CHR2,Number=3,Type=String,Description="Other seqid">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="PE confidence interval around END">
##INFO=<ID=CISTART,Number=2,Type=Integer,Description="PE confidence interval around POS">
##INFO=<ID=DI,Number=1,Type=Character,Description="Consensus is from front or back of pileup : f,b">
##INFO=<ID=NC,Number=1,Type=String,Description="Number of soft clipped sequences collapsed into consensus">
##INFO=<ID=MQ,Number=1,Type=Float,Description="Average mapping quality">
##INFO=<ID=MQF,Number=1,Type=Float,Description="Fraction of reads with MQ less than 50">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between POS and end">
##INFO=<ID=WC,Number=1,Type=String,Description="WHAM classifier variant type">
##INFO=<ID=WP,Number=4,Type=Float,Description="WHAM probability estimate for each structural variant classification from RandomForest model">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GL,Number=A,Type=Float,Description="Genotype likelihood">
##FORMAT=<ID=NR,Number=1,Type=Integer,Description="Number of reads that do not support a SV">
##FORMAT=<ID=NA,Number=1,Type=Integer,Description="Number of reads supporting a SV">
##FORMAT=<ID=NS,Number=1,Type=Integer,Description="Number of reads with a softclip at POS for individual">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Number of reads passing filters">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878.sorted.bam

Here is a deletion call:

chr1
69943725
.
N
ATATTTAAAATGCATCACTTGTGACAACAAATAGTTGGATCTT
.
.
LRT=0;WAF=.,1,1;GC=0,1;AT=0.785714,0.285714,0,0,0,0.857143,0,0.535714,0.857143,0,0,0.0357143,0.392857,0,0.2601;ISTART=69943723,69943725;CIEND=69943948,69944094;PU=15;SU=24;CU=17;RD=28;NC=14;MQ=60;MQF=0;SP=4,24,0;CHR2=chr1;DI=b;END=69944022;SVLEN=298;WC=DEL;WP=0.984,0.0,0.016,0.0
GT:GL:NR:NA:NS:RD
1/1:-255,-255,-1.6e-05:0:16:15:16

The first several columns should be familiar to most users as chrom, pos, ID, ref, alt, qual, filter. The details for the info, format, and genotype fields can be found below. The genotypes for NA12878 is 0/1.

The info field is comprised of 21 key value pairs.

LRT:

This is the likelihood ratio test statistic quantifying the difference in allele frequencies between the cases and controls. If you are only using WHAM for its structural breakpoint detection this field will contain a -nan. Both cases (target) and controls (background) bams must be specified using the -t & -b flags to get an LRT score.

WAF:

A comma separated list of allele frequencies of the background, target, combined. If control (background) bams were not specified the allele frequency for the background will be listed as -nan. WHAM treats each breakpoint as a bi-allelic variant and estimates the frequency based on the genotype counts.

GC:

Wham does not call a genotype unless there are two reliable reads covering a position. Genotypes of './.' are no-calls. CG reports the number of genotypes that could be called reliably.

AT:

This field used by wham to determine a classification. There are 15 read depth normalized counts. The average user should not worry about this field.

CF:

Fraction of reads with more than three cigar operations

CISTART:

Confidence interval for POS, based on the other positions of the soft-clipped sequences in the pileup

CIEND:

Confidence interval for END, based on the other positions of mate-pair, split-read and alternative alignments.

PU:

The number of primary mappings that support the exact breakpoint reported in the POS field.

SU:

The number of supplementary / split read support for the exact breakpoint reported in the POS field.

CU:

WHAM skips between positions that have soft-clipping. There are some number of reads that cover a given soft-clipping position. These reads can have soft-clipping at other locations. The number of other positions where there is soft-clipping is reported in CU.

RD:

Combined read depth across indviduals after filtering

NC:

The number of soft-clipped segments there were collapsed into the consensus sequence.

MQ:

The average mapping quality.

MQF:

The fraction of reads with a mapping quality less than 50.

SP:

The support counts for the END pos (mate-pair, split-read, and alternative alignments).

CHR2:

The other seqid. For translocations the field CHROM does not equal the CHR2 field.

DI:

This field denotes is the breakpoint is supported on the 5' of the pileup or 3' end of pileup. The value 'f' denotes the '5 and 'b' denotes the 3'.

END:

The END position of the structural variant.

SVLEN:

The distance between the two breakpoints wham identified. The length with be zero if the event is inter-chromosomal.

WC:

The class of structural variant classified by the random forest.

WP:

The probabilities for each class label generated by the random forest classifier.

The format field is comprised of six colon-delimited fields.

GT:

A genotype call. The nature of the variant is unknown. WHAM determines the zygosity.

GL:

log10 genotype likelihood under a bi-allelic model.

NR:

The number of reads that do not support the structural variant listed in ALT.

NA:

The number of reads that support the structural variant listed in ALT.

NS:

The number of primary reads supporting with a soft clip at POS

RD:

The number of reads covering the individual at POS after filtering

Filtering Wham

For the Wham manuscript we ran two filters to improve the performance of the tools.

Convert Wham VCF to BEDPE:
perl /whamToBedPe.pl -c --file NA12878.wham.class.vcf > NA12878.wham.s.bedpe
Remove Wham calls in low complexity regions (http://arxiv.org/pdf/1404.0929.pdf)
pairToBed -type neither -b ~/tools/wham/data/LCR-hs37d5.bed -a NA12878.wham.s.bedpe
Remove Wham calls in high depth regions (http://www.genomebiology.com/2014/15/6/R84):
pairToBed -type neither -b ceph18.b37.lumpy.exclude.2014-01-15.nochr.bedpe   -a NA12878.wham.nolcr.bedpe   >  NA12878.wham.nolcr.nh.bedpe