This page presents instances of VCF variant representations that need to be normalized and standardized for PharmCAT.
You should be able to concatenate different instances of VCF records into a single VCF file to test the robustness of a VCF preprocessor tool.
# VCF Header
This is the header you will need for every VCF file.
```VCF header
##fileformat=VCFv4.3
##source=PharmCAT allele definitions
##reference=hg38
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype filters.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype Likelihoods">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##contig=<ID=chr1,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr2,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr3,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr4,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr5,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr6,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr7,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr8,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr9,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr10,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr11,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr12,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr13,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr14,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr15,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr16,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr17,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr18,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr19,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr20,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr21,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr22,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrX,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrY,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrM,assembly=GRCh38.p13,species="Homo sapiens">
##INFO=<ID=PX,Number=.,Type=String,Description="Gene">
##INFO=<ID=POI,Number=0,Type=Flag,Description="Position of Interest but not part of an allele definition">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=PCATxREF,Description="Reference allele does not match PharmCAT reference alleles">
##FILTER=<ID=PCATxALT,Description="Alternate alleles do not match PharmCAT alternate alleles">
##FILTER=<ID=PCATxINDEL,Description="Unexpected format for INDELs">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1
```
# instance 1: chromosome names
```vcf deficient format
1 97078987 rs114096998 G T . PASS . GT 0/0
```
```vcf accurate format
chr1 97078987 rs114096998 G T . PASS . GT 0/0
```
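The contig renaming in this instance can be sketched in a few lines of Python. This is only an illustration, not PharmCAT's preprocessor code: `normalize_chrom` is a hypothetical helper, and mapping `MT` to `chrM` is an assumption based on the contig names declared in the header.

```python
def normalize_chrom(chrom: str) -> str:
    """Return a chr-prefixed contig name (hypothetical helper)."""
    name = chrom.strip()
    # Drop an existing prefix so "chr1", "Chr1" and "1" are treated alike.
    if name.lower().startswith("chr"):
        name = name[3:]
    upper = name.upper()
    # Assumption: mitochondrial aliases map onto the chrM contig from the header.
    if upper in ("M", "MT"):
        return "chrM"
    # Sex chromosomes are upper-cased; numeric names pass through unchanged.
    if upper in ("X", "Y"):
        return "chr" + upper
    return "chr" + name


print(normalize_chrom("1"))     # name from the deficient record
print(normalize_chrom("chr1"))  # already in the expected form
```

Applying such a helper to the first column of every record leaves already-prefixed names untouched, so it can be run safely on mixed input.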
# instance 2: homozygous reference SNPs with unspecified ALT
```vcf deficient format
chr1 97078993 rs148799944 C . . PASS . GT 0/0
chr1 97079005 rs140114515 C <*> . PASS . GT 0/0
```
```vcf accurate format
chr1 97078993 rs148799944 C G . PASS . GT 0/0
chr1 97079005 rs140114515 C T . PASS . GT 0/0
```
# instance 3: left-alignment for tandem repeat INDELs
Note that the ALT allele is updated after left-alignment.
```vcf deficient format
chr1 97740414 rs72549309 AATGA A . PASS . GT 1/0
```
```vcf accurate format
chr1 97740410 rs72549309 GATGA G . PASS . GT 1/0
```
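Left-alignment can be sketched with the usual trim-and-extend procedure: drop a shared trailing base, extend every allele one base to the left whenever an allele becomes empty, and repeat until nothing changes. The Python below is a minimal sketch, not the method PharmCAT or bcftools actually uses. `left_align`, `TOY_SEQ` and `TOY_START` are hypothetical names, and the toy sequence is made up to be consistent with the record above; it is not real GRCh38 sequence.

```python
# Toy reference fragment (assumption, NOT real GRCh38): its first base sits at TOY_START.
TOY_SEQ = "CGATGAATGAC"
TOY_START = 97740409


def left_align(pos, ref, alts, seq, seq_start):
    """Left-align and trim one variant; pos and seq_start are 1-based coordinates."""
    alleles = [ref] + list(alts)
    if len(set(alleles)) == 1:
        return pos, ref, list(alts)  # nothing to align when REF equals every ALT
    changed = True
    while changed:
        changed = False
        # Drop the last base while all alleles are non-empty and share it.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend every allele one base to the left.
        if any(a == "" for a in alleles):
            pos -= 1
            if pos < seq_start:
                raise ValueError("ran off the start of the reference fragment")
            alleles = [seq[pos - seq_start] + a for a in alleles]
            changed = True
    # Trim shared leading bases while every allele keeps at least one base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


print(left_align(97740414, "AATGA", ["A"], TOY_SEQ, TOY_START))
# (97740410, 'GATGA', ['G'])
```

The REF and ALT strings change because the anchor base changes as the deletion slides left through the repeat, which is why the ALT allele differs between the two formats.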
# instance 3: multiallelic INDELs
For a genomic position that harbors multiple pharmacogenetic INDELs, all pharmacogenetic INDEL variants should be combined into a multiallelic record.
```vcf deficient format
chr2 233760233 rs3064744 C CAT . PASS . GT 1/0
chr2 233760233 rs3064744 CAT C . PASS . GT 0/0
chr2 233760233 rs3064744 C CATAT . PASS . GT 0/1
```
```vcf accurate format
chr2 233760233 rs3064744 CAT CATAT,CATATAT,C . PASS . GT 1/2
```
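One way to picture the merge is to rewrite every biallelic record against the longest REF at the position and then collect the ALT alleles the sample carries. The sketch below assumes unphased diploid genotypes, biallelic input records, and that each REF is a prefix of the longest REF. `merge_indel_records` and its `alt_order` parameter are hypothetical; the ALT order passed in the example is copied from the accurate-format record above.

```python
def merge_indel_records(records, alt_order=None, ploidy=2):
    """Merge biallelic (ref, alt, gt) records that share one position."""
    longest_ref = max((ref for ref, _, _ in records), key=len)
    merged_alts = []
    carried = []  # ALT alleles the sample carries, expressed against longest_ref
    for ref, alt, gt in records:
        if not longest_ref.startswith(ref):
            raise ValueError(f"REF {ref} is not a prefix of {longest_ref}")
        # Pad the ALT with the bases that the longer REF adds on the right.
        full_alt = alt + longest_ref[len(ref):]
        if full_alt not in merged_alts:
            merged_alts.append(full_alt)
        carried += [full_alt for idx in gt.replace("|", "/").split("/") if idx == "1"]
    if alt_order is not None:
        # Put ALTs in the requested order; anything unlisted goes last.
        merged_alts = ([a for a in alt_order if a in merged_alts]
                       + [a for a in merged_alts if a not in alt_order])
    if len(carried) > ploidy:
        raise ValueError("more ALT alleles than the ploidy allows")
    indices = [0] * (ploidy - len(carried)) + sorted(
        merged_alts.index(a) + 1 for a in carried)
    return longest_ref, merged_alts, "/".join(str(i) for i in indices)


records = [("C", "CAT", "1/0"), ("CAT", "C", "0/0"), ("C", "CATAT", "0/1")]
print(merge_indel_records(records, alt_order=["CATAT", "CATATAT", "C"]))
# ('CAT', ['CATAT', 'CATATAT', 'C'], '1/2')
```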
# instance 4: prioritizing pharmacogenetic variant(s) over non-pharmacogenetic ones at a multiallelic locus
Note that the pharmacogenetic SNP is reordered before the INDEL that is located at the same genomic position.
```vcf deficient format
chr7 117509035 . GA G . PASS . GT 0/0
chr7 117509035 . G A . PASS . GT 0/1
```
```vcf accurate format
chr7 117509035 rs397508256 G A . PASS . GT 0/1
chr7 117509035 . GA G . PCATxREF . GT 0/0
```
# instance 5: left-alignment for long INDELs
Note that the ALT allele is updated after left-alignment.
```vcf deficient format
chr10 94942212 rs1304490498 AAGAAATGGAA A . PASS . GT 1/0
```
```vcf accurate format
chr10 94942205 rs1304490498 CAATGGAAAGA C . PASS . GT 1/0
```
# instance 6: warning for unexpected INDEL format
In this instance, chr10:94949281 is homozygous reference. There is not enough information to infer the genotype for the pharmacogenetic INDEL `chr10:94949281:GA:G` at this position. A warning should be issued, and the pharmacogenetic variant should be reported as missing.
```vcf deficient format
chr10 94949281 . G . . PASS . GT 0/0
```
```vcf accurate format
chr10 94949281 . G . . PCATxINDEL . GT 0/0
```
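The check for this instance could look roughly like the following. It is a sketch of the rule as described in the text only: `indel_filter` is a hypothetical function, and the exact conditions PharmCAT applies may differ.

```python
def indel_filter(record_ref, record_alt, expected_ref, expected_alt):
    """Return a FILTER value for a record at a position where an INDEL is expected."""
    # An INDEL definition has REF and ALT of different lengths.
    expects_indel = len(expected_ref) != len(expected_alt)
    # "." and "<*>" carry no concrete alternate allele.
    has_no_alt = record_alt in (".", "<*>")
    # Assumption: a record with no concrete ALT and a different REF cannot be
    # matched to the INDEL, so it is flagged rather than passed through.
    if expects_indel and has_no_alt and record_ref != expected_ref:
        return "PCATxINDEL"
    return "PASS"


print(indel_filter("G", ".", "GA", "G"))   # the record from this instance
print(indel_filter("GA", "G", "GA", "G"))  # a record that already matches
```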
# instance 7: prioritizing pharmacogenetic variants at positions that have both SNP(s) and INDEL(s)
Note: this instance still needs to be double-checked.
Pharmacogenetic INDELs should be listed ahead of the non-pharmacogenetic SNPs. Information on the other alternate alleles needs to be completed so that the variant representation matches the format that PharmCAT expects.
```vcf deficient format
chr13 48037782 . A C,<*> . PASS . GT 0/0
chr13 48037782 rs746071566 AGGAGTC A,<*> . PASS . GT 0/0
```
```vcf accurate format
chr13 48037782 rs746071566 AGGAGTC AGGAGTCGGAGTC,A . PASS . GT 0/0
chr13 48037782 rs746071566 A C . PASS . GT 0/0
chr13 48037782 rs746071566 AGGAGTC <*> . PCATxINDEL . GT 0/0
```
# instance 8: filling up multiallelic SNPs
If a SNP position is present in the VCF but the exact pharmacogenetic allele is not, there is enough information to infer that the sample does not carry that pharmacogenetic allele. It should therefore be safe to add the pharmacogenetic alleles back to the VCF. The genotype indices may need to change once the alleles are added back; bcftools should probably adjust them.
```vcf deficient format
chr19 38448712 rs121918592 G A,<*> . PASS . GT 1/0
chr19 40991381 rs33973337 A T . PASS . GT 1/0
```
```vcf accurate format
chr19 38448712 rs121918592 G A,C . PASS . GT 1/0
chr19 40991381 rs33973337 A C,T . PASS . GT 2/0
```
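The genotype adjustment in the second record (`1/0` becomes `2/0`) follows from the new ALT order. The sketch below shows that remapping in plain Python rather than with bcftools. `fill_alts` is a hypothetical helper, and the expected ALT lists in the example are copied from the accurate-format records above; in practice they would come from PharmCAT's allele definitions.

```python
SYMBOLIC = {"<*>", "."}  # placeholders that carry no concrete allele


def fill_alts(alts, gt, expected_alts):
    """Add expected ALT alleles to a record and remap its genotype indices."""
    observed = [a for a in alts if a not in SYMBOLIC]
    # Expected alleles come first, then any observed allele that was not expected.
    new_alts = list(expected_alts) + [a for a in observed if a not in expected_alts]
    sep = "|" if "|" in gt else "/"
    new_indices = []
    for idx in gt.split(sep):
        if idx in (".", "0"):
            new_indices.append(idx)  # missing and reference calls keep their index
            continue
        old_allele = alts[int(idx) - 1]
        if old_allele in SYMBOLIC:
            raise ValueError("genotype points at a symbolic allele")
        new_indices.append(str(new_alts.index(old_allele) + 1))
    return new_alts, sep.join(new_indices)


print(fill_alts(["A", "<*>"], "1/0", ["A", "C"]))  # (['A', 'C'], '1/0')
print(fill_alts(["T"], "1/0", ["C", "T"]))         # (['C', 'T'], '2/0')
```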
# instance 9: prioritizing pharmacogenetic variants over non-pharmacogenetic variants
This is similar to instance 4 but was observed in an actual UK Biobank dataset.
```vcf deficient format
chrX 154532608 . C CG,<*> 0 RefCall . 0/0
chrX 154532608 . C T,<*> 0 PASS . 0/0
chrX 154532608 . CG C,<*> 0 RefCall . 0/0
```
```vcf accurate format
chrX 154532608 . C T 0 RefCall . 0/0
chrX 154532608 . C CG 0 RefCall . 0/0
chrX 154532608 . CG <*> 0 PCATxREF . 0/0
```