Examples of VCF Variant Normalization
This page contains examples of different variant representations in VCF and how they should be normalized for PharmCAT using a parsimonious, left-aligned variant representation format.
To verify that the normalization is works as expected, you should be able to concatenate the different examples of VCF records into a single VCF file and run it against the VCF Preprocessor.
VCF Header
This is the header you will need for the VCF file.
##fileformat=VCFv4.3
##source=PharmCAT allele definitions
##reference=hg38
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype filters.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype Likelihoods">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##contig=<ID=chr1,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr2,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr3,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr4,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr5,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr6,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr7,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr8,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr9,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr10,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr11,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr12,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr13,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr14,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr15,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr16,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr17,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr18,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr19,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr20,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr21,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr22,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrX,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrY,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrM,assembly=GRCh38.p13,species="Homo sapiens">
##INFO=<ID=PX,Number=.,Type=String,Description="Gene">
##INFO=<ID=POI,Number=0,Type=Flag,Description="Position of Interest but not part of an allele definition">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=PCATxREF,Description="Reference allele does not match PharmCAT reference alleles">
##FILTER=<ID=PCATxALT,Description="Alternate alleles do not match PharmCAT alternate alleles">
##FILTER=<ID=PCATxINDEL,Description="Unexpected format for INDELs">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1
Chromosome names
Alternate format:
1 97078987 rs114096998 G T . PASS . GT 0/0
Normalized:
chr1 97078987 rs114096998 G T . PASS . GT 0/0
Homozygous reference SNPs with unspecified ALT
Alternate format:
chr1 97078993 rs148799944 C . . PASS . GT 0/0
chr1 97079005 rs140114515 C <*> . PASS . GT 0/0
Normalized:
chr1 97078993 rs148799944 C G . PASS . GT 0/0
chr1 97079005 rs140114515 C T . PASS . GT 0/0
Left-alignment for a tandem repeat INDELs
Note that the ALT allele is updated after left-alignment.
Alternate format:
chr1 97740414 rs72549309 AATGA A . PASS . GT 1/0
Normalized:
chr1 97740410 rs72549309 GATGA G . PASS . GT 1/0
Multi-allelic INDELs
For a genomic position that harbors multiple pharmacogenetic INDELs, all pharmacogenetic INDEL variants should be combined into a multiallelic record.
Alternate format:
chr2 233760233 rs3064744 C CAT . PASS . GT 1/0
chr2 233760233 rs3064744 CAT C . PASS . GT 0/0
chr2 233760233 rs3064744 C CATAT . PASS . GT 0/1
Normalized:
chr2 233760233 rs3064744 CAT CATAT,CATATAT,C . PASS . GT 1/2
Prioritizing pharmacogenetic variant(s) over non-pharmacogenetic ones at a multiallelic locus
Note that the pharmacogenetic SNP is reordered before the INDEL that is located at the same genomic position.
Alternate format:
chr7 117509035 . GA G . PASS . GT 0/0
chr7 117509035 . G A . PASS . GT 0/1
Normalized:
chr7 117509035 rs397508256 G A . PASS . GT 0/1
chr7 117509035 . GA G . PCATxREF . GT 0/0
Left-alignment for long INDELs
Note that the ALT allele is updated after left-alignment.
Alternate format:
chr10 94942212 rs1304490498 AAGAAATGGAA A . PASS . GT 1/0
Normalized:
chr10 94942205 rs1304490498 CAATGGAAAGA C . PASS . GT 1/0
Warning for unexpected INDEL format
In this instance, chr10:94949281 is homozygous reference. There is no sufficient information to infer genotypes for a pharmacogenetic INDEL chr10:94949281:GA:G
at this position. A warning should be listed. And the pharmacogenetic variant should be reported as missing.
Alternate format:
chr10 94949281 . G . . PASS . GT 0/0
Normalized:
chr10 94949281 . G . . PCATxINDEL . GT 0/0
Prioritizing pharmacogenetic variants at positions that have both SNP(s) and INDEL(s)
!! need to double check
Note that pharmacogenetic INDELs should be listed ahead of the non-pharmacogenetic SNPs. Information of other alternative alleles need to be completed so that the variant representation format matches the format that PharmCAT expects.
Alternate format:
chr13 48037782 . A C,<*> . PASS . GT 0/0
chr13 48037782 rs746071566 AGGAGTC A,<*> . PASS . GT 0/0
Normalized:
chr13 48037782 rs746071566 AGGAGTC AGGAGTCGGAGTC,A . PASS . GT 0/0
chr13 48037782 rs746071566 A C . PASS . GT 0/0
chr13 48037782 rs746071566 AGGAGTC <*> . PCATxINDEL . GT 0/0
Filling up multiallelic SNPs
If a SNP position is present in the VCF while the exact pharmacogenetic allele is not present, there is sufficient information to infer that the specific pharmacogenetic alleles are not present in the VCF. It should be safe to add back the pharmacogenetic alleles back to the VCF. Bcftools should probably adjust the genotypes after the alleles are added back.
Alternate format:
chr19 38448712 rs121918592 G A,<*> . PASS . GT 1/0
chr19 40991381 rs33973337 A T . PASS . GT 1/0
Normalized:
chr19 38448712 rs121918592 G A,C . PASS . GT 1/0
chr19 40991381 rs33973337 A C,T . PASS . GT 2/0
Prioritizing pharmacogenetic variants over non-pharmacogenetic variants
This is similar to instance 4 but observed in actual UK Biobank dataset
Alternate format:
chrX 154532608 . C CG,<*> 0 RefCall . 0/0
chrX 154532608 . C T,<*> 0 PASS . 0/0
chrX 154532608 . CG C,<*> 0 RefCall . 0/0
Normalized:
chrX 154532608 . C T 0 RefCall . 0/0
chrX 154532608 . C CG 0 RefCall . 0/0
chrX 154532608 . CG <*> 0 PCATxREF . 0/0