Link Search Menu Expand Document

Preparing VCF

The requirements for the VCF input file for PharmCAT are described in the VCF requirement doc. This document explores some reasoning behind those requirements and some specific examples of ways to fulfill them.

Must be aligned to GRCh38 Assembly

PharmCAT requires VCF files aligned to GRCh38.

We recommend aligning to GRCh38 from the very beginning - for example, see our pharmacogenomics NGS pipeline.

While you can translate positions between assemblies, this can easily introduce errors into your data. For example, LiftOver can be problematic due to ambiguity in positions. In addition, many tools won’t update the genotype.

Known issue with remapping

If you use CrossMap it will update the position and reference nucleotides. However, if the reference is now the same as the alternate, it will remove the variant from the file.

Likewise, NCBI Remap will not update genotypes, so even if the updated position means the variant should now be called as 0/0 it will not be updated from 1/1. Presently there are only a few positions where this matters, but this will need to be very carefully checked.

Another issue with NCBI remap is that some of the alternates can be lost following remap. For instance:

 Pre LiftOver: 2	234668879	rs57191451	CAT	C,CATAT,CATATAT	0/3
Post LiftOver: 2	233760233	rs57191451	CAT	CATAT,CATATAT	0/3

We currently do not have a fix for this (NCBI has been contacted), so again you will have to check these sites carefully by hand.

Example: LiftOver using the GATK

This is a GRCh37-toGRCh38 LiftOver example which uses the GATK LiftoverVcf tool on the UK Biobank genotype data in the GRCh37 coordinates. The GATK LiftoverVcf tool can update genomic coordinates. If the reference allele and alternate allele are swapped in the new genome build for a single nucleotide polymorphism (SNP) locus, the GATK LiftoverVcf can reverse-complement the SNP, update the relevant genotypes (GT), and correct AF-like INFO fields. Due to the length of the example, we host the example with detailed explanations on Google Drive.

Note: this example is not meant to be a comprehensive documentation of solutions to all LiftOver issues. LiftOver may require additional data cleaning or preparation steps that are specific to your genomic data.

Must have all allele-defining positions

All positions that are used to define alleles in PharmCAT must be present in the VCF file you want to run through PharmCAT, even if they are 0/0 (reference) or ./. (missing). This is different from a typical VCF file which usually only contains variant sites.

PharmCAT needs this level specificity because reference and missing alleles are interpreted differently by the allele matcher. When positions are missing from the input PharmCAT does make assumptions about whether this is because they are reference or missing.

Missing positions can be added in the following way using GATK to ‘EMIT_ALL_ACTIVE_SITES’:

gatk --java-options "-Xmx4g" HaplotypeCaller \
     -R grc38.reference.fasta -I input.bam -O output.vcf \
     -L pharmcat_positions.vcf -ip 20 --output-mode EMIT_ALL_ACTIVE_SITES

The pharmcat_positions.vcf file is included in the PharmCAT release package on GitHub.

Please refer to the HaplotypeCaller documentation for details and tuning options.

Must normalize variant representation

Variant representation is an on-going problem in NGS (related article). For example, the following vcf lines all specify the same variant with different formats:

chr7    117548628    .    GTTTTTTTA    GTTTTTA    .    PASS    CFTR:5T    GT    0/1
chr7    117548628    .    GTT     G    .    PASS    CFTR:5T    GT    0/1
chr7    117548628    .    G    .    .    PASS    CFTR:5T    GT    0/1
chr7    117548628    .    G(T)7A    G(T)5A    .    PASS    CFTR:5T    GT    0/1

Different NGS pipelines, the way files are created (for instance if a multisample file is split), and post-processing software tools all lead to these differences. As PharmCAT is directly matching these strings to what is in the definition files this can cause problems. For example, PharmCAT expects to find deletions as the ref=”ATCT” alt=”A”, rather than ref=”TCT” alt=”.”. Therefore you will need to replace all the deletions within in the file. If the ref is a single letter it means no variant was found, so it’s safe to replace it with the appropriate nucleotide string.

To avoid ambiguity in variant representation, PharmCAT is using a parsimonious, left-aligned variant representation format (as discussed in Unified Representation of Genetic Variants by Tan, Abecasis, and Kang).

We recommend performing this normalization with bcftools:

bcftools norm -m+ -c ws -Oz -o output.vcf -f grc38.reference.fasta input.vcf
  • -m+ joins biallelic sites into multiallelic records (+)
  • -f <ref_seq_fasta> is the reference sequence. This flag is required for normalization
  • -c ws when incorrect or missing REF allele is encountered, warn (w) and set/fix(s) bad sites

Please consult the bcftools documentation for details.

Alternatively, you can PharmCAT’s VCF preprocessing tool, which also relies on bcftools for this.

It is highly recommended that you always check the output files from these tools manually to make sure the correct format normalizations have been made.

The CHROM field must be in the format chr##

PharmCAT expects the CHROM field to have entries that begin with “chr” (e.g. chr1 instead of just 1).

The PharmCAT VCF preprocessing tool takes care of this issue by automatically detecting and updating the CHROM field format.

Alternatively,CHROM field format can be updated using the following command:

perl -pe '/^((?!^chr).)*$/ && s/^([^#])/chr$1/gsi' merged_output.vcf > merged_output.chrfixed.vcf

The QUAL and FILTER columns are not interpreted

PharmCAT considers all variants in your file, even if they fail filters.

If you use a tool like VSQR you can use the following Perl one-liner to change the genotype to 0/0 if necessary:

perl -pe '/^((?!PASS).)*$/ && /^((?!0\/0).)*$/ && /^((?!0\|0).)*$/ && s/[0-9][\/|][0-9]/0|0/' merged_output.vcf > merged_output_pass.vcf

PharmCAT is managed at Stanford University & University of Pennsylvania (NHGRI U24HG010862)