Calling variants with freebayes

Erik Garrison

erik.garrison@bc.edu / @erikgarrison

Summary

This tutorial is a brief walkthrough demonstrating how to use FreeBayes to detect short sequence variants in read alignments generated from a resequencing process.

Getting started

You've sequenced a sample. You've aligned your reads to a reference genome. You've sorted your alignments. You've marked PCR duplicates. You've indexed your alignments.

Now, you want to determine the sequence variants present in your sample. FreeBayes lets you do this very easily!

You first need to download and install freebayes:

git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
make

Alternatively, you can use an existing tarball:

wget http://clavius.bc.edu/~erik/freebayes/freebayes-5d5b8ac0.tar.gz
tar xzvf freebayes-5d5b8ac0.tar.gz
cd freebayes
make

(These tarballs aren't built regularly, so your best bet is to set up and use git to download the code. It's really useful!)

Now that freebayes is built, we can run it:

bin/freebayes        # default summary
bin/freebayes -h     # verbose help text, summary of options

If all is well we should see a description of the method. You can install freebayes by moving it into your path, or adding freebayes/bin to your path.

Working with test data

To call variants we need an alignment file (in BAM format). We also need a reference genome in (uncompressed) FASTA format.

An easy way to get started is to run tests on publicly-available data sets. For this we can use the ever-popular CEU hapmap sample NA12878. This sample is a good target because it provides us a putative truth set, currently in development by the NIST Genome in a Bottle Project.

The 1000 Genomes Project has incorporated a down-sampled alignment set into the project's low-coverage (~5-6x) analyses. As such, we have readily-available alignments processed using the same pipeline used for the alignment of all the 1000 Genomes samples.

We first need to download the reference used for the alignment. Our analysis will be limited to chr20, so we'll just get a copy of that chromosome.

wget http://bioinformatics.bc.edu/marthlab/download/gkno-cshl-2013/chr20.fa

Now, we should grab a piece of the low-coverage NA12878 alignments using samtools:

wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.bai

These alignments are just for human chromosome 20, which amounts to about 1/50th of the genome. This will be plenty for testing.

To have a truth set to compare to, we should download the NIST Genome in a Bottle variants. Note that we are obtaining both the VCF (variants) and a target list in which the project determined it is possible to confidently make genotype calls:

wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/NISTIntegratedCalls_13datasets_130719_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.17_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz
gunzip *.vcf.gz
gunzip *.bed.gz

And, we can also grab a curated set of these variants shared by the Broad Genome Sequencing and Analysis Group:

wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi

Now we have a reference, alignment data, and a set of truth sets available for NA12878.

Variant calling

To call variants, we can just use freebayes in its default configuration:

freebayes -f chr20.fa \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.freebayes.vcf

(This should take 5-10 minutes on a typical system.)

You could target a particular region as well (for instance, to speed up testing) by providing freebayes the --region parameter:

freebayes -f chr20.fa --region 20:1000000-2000000 \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.1mb_2mb.freebayes.vcf

Understanding our results

Take a look at the output.

less -S NA12878.chr20.freebayes.vcf

There is a header:

##fileformat=VCFv4.1
##fileDate=20131121
##source=freeBayes v0.9.9.2-23-g8a98f11-dirty
##reference=chr20.fa
##phasing=none
##commandline="freebayes -f chr20.fa NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
....

And then there are variants:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      61795   .       G       T       127.882 .       AB=0.454545;ABP=3.20771;AC=1;AF=0.5
20      63244   .       A       C       90.7756 .       AB=0;ABP=0;AC=2;AF=1
....

We can quickly examine our aggregate results using tools from vcflib, such as vcfstats:

vcfstats NA12878.chr20.freebayes.vcf

This will provide us a lot of information about small variants, such as the counts of SNPs, indels, and other variant types, the transition/transversion ratio, and a summary of the indel length-frequency spectrum. An interesting thing to note is that the ts/tv ratio is different for high and low-quality variants:

vcffilter -f "QUAL > 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"
# compare to
vcffilter -f "QUAL < 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"

Comparing our results to the Broad NA12878 Knowledge-Base

First, we should either use tabix to extract just chr20 from the NA12878-K.B.:

tabix -h NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20

Or, you can download a pre-made set:

wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi

And, for this comparison we want to both exclude regions which weren't considered confidently-callable by the GiaB set, and keep only variants which are considered TRUE_POSITIVE in the Broad/GiaB NA12878 truth set:

zcat NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz \
    | vcfintersect -b union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
    | grep "^#\|TRUE_POSITIVE" >NA12878.chr20.broad_truth_set.TP.callable_regions.vcf

We'll do the same for our calls, restricting them to confidently-callable regions:

cat NA12878.chr20.freebayes.vcf \
    | vcfintersect -b \
        union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
        >NA12878.chr20.freebayes.callable_regions.vcf

Using this result, we can generate a kind of receiver-operator characteristic using a tool in vcflib, vcfroc:

vcfroc -r chr20.fa \
    -t NA12878.chr20.broad_truth_set.callable_regions.vcf \
    NA12878.chr20.freebayes.vcf >broad_truth_set.roc.tsv

We can generate a plot of this using some tools in vcflib:

( vcfroc -r chr20.fa -t \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    | prependcol set answers; \
    cat broad_truth_set.roc.tsv \
    | prependcol set freebayes | grep -v set ) >roc.tsv
plot_roc.r freebayesNA12878 answers roc.tsv 0 1 0 1

Here, prependcol is a script that prepends a column named "set" with each row set to "answers" or "freebayes."

The resulting image freebayesNA12878.both.png, shows that we can achieve < 5% FDR in this sample (assuming perfect sensitivity in the truth set--- which is unlikely), and that our overall sensitivity is rather low. This matches our expectations given the low coverage of the sample.