PharmCAT VCF Preprocessor
The PharmCAT VCF Preprocessor is a script that can preprocess VCF files for PharmCAT (see PharmCAT's VCF Requirements).
This tool will:
- Strip out PGx positions that PharmCAT does not care about.
- Automatically download the necessary Human Reference Genome Sequence FASTA and index files from the NIH FTP site if they are not provided.
- Perform VCF normalization - a standardization process that turns VCF into a parsimonious, left-aligned variant representation format (as discussed in Unified Representation of Genetic Variants by Tan, Abecasis, and Kang).
- Normalize the multiallelic variant representation to PharmCAT's expectation.
- Optionally filter out data for a subset of samples if requested.
By default, the PharmCAT VCF preprocessing produces two types of output:
- A PharmCAT-ready VCF file
- A report of missing pharmacogenomics core allele defining positions in user's input
Prerequisites
Install the Software
You can skip this if are running PharmCAT in Docker.
- You will need python 3.10.14 or higher
- You will need the following bioinformatic tools:
- Download the preprocessor tar file from our releases page.
- Untar the file
- You will need the following python dependencies:
- colorama >= 0.4.6
- pandas >= 1.5.1
- packaging ~= 21.3
To install the necessary python packages, run the following code:
$ pip3 install -r requirements.txt
You can find requirements.txt
in the preprocessor tar file you downloaded.
How to run the PharmCAT VCF Preprocessor
Below you will find a detailed introduction to the PharmCAT VCF Preprocessor. We have put together interactive materials on the PharmCAT tutorial GitHub repo.
Command line
To normalize and prepare a VCF file, run the following code substituted with proper arguments/inputs:
$ python3 pharmcat_vcf_preprocessor.py -vcf path/to/file.vcf(.bgz)
Mandatory argument: -vcf
.
- -vcf
- Path to a single VCF file or a file containing the list of VCF file paths (one per line), sorted by chromosome position. All VCF files must have the same set of samples. Use this when data for a sample has been split among multiple files (e.g. VCF files from large cohorts, such as UK Biobank). Input VCF files must at least comply with Variant Call Format (VCF) Version >= 4.2.
VCF files can have more than 1 sample and should be bgzip compressed. If not bgzip compressed, they will be automatically bgzipped.
Example valid list file:
chr1_set1.vcf
chr1_set2.vcf
chr2_set1.vcf
chr2_set2.vcf
...
Example invalid list file:
chr3_set2.vcf
chr2_set2.vcf
chr1_set1.vcf
chr1_set2.vcf
...
Optional Arguments
- -S
<txt_file>
or --sample-file<txt_file>
- The list of samples to be processed and prepared for PharmCAT. The file should contain one sample per line.
- -s
<samples>
or --samples<samples>
- A comma-separated list of sample IDs.
- -o
<dir>
or --output-dir<dir>
- Directory to save preprocessed VCF to. Default is the parent directory of the input VCF.
- -bf
<name>
or --base-filename<name>
- Prefix of the output VCF files. Default is the input base file name plus sample IDs.
- -k
or --keep-intermediate-files - This option will help you save useful intermediate files, for example, a normalized, multiallelic VCF named
<base_input_file_name>.pgx_regions.normalized.multiallelic.vcf.bgz
, which will include all PGx regions from the first position to the last one in each chromosome as listed in the reference PGx VCF. - -ss
or --single-sample - Generate 1 VCF file per sample.
- -0
or --missing-to-ref - This option will add missing PGx positions to the output. Missing PGx positions are those absent in the input VCF or whose genotypes are unspecified as "./." across all samples. This option is equivalent to the combination of
--absent-to-ref
and--unspecified-to-ref
.- This option will not convert "./." to "0/0" if any other sample has a specified genotype (
0/0
,0/1
, etc.) as the unspecified genotypes are likely determined so for good reasons. - This SHOULD ONLY BE USED if you are sure your data is reference at the absent positions instead of being unreadable/uncallable. Running PharmCAT with positions as absent vs reference can lead to different results.
- This option will not convert "./." to "0/0" if any other sample has a specified genotype (
- --absent-to-ref
- This option will add absent PGx positions into the output as homozygous reference.
- This SHOULD ONLY BE USED if you are sure your data is reference at the absent positions instead of unreadable/uncallable.
- Running PharmCAT with positions as absent vs reference can lead to different results.
- --unspecified-to-ref
- This option will convert unspecified PGx position to homozygous reference. Unspecified PGx positions are those whose genotypes are unspecified "./." in every single sample.
- This option will not convert "./." to "0/0" when there is a specified genotype at a PGx position as these
./.
calls are likely left unspecified for good reasons. - Running PharmCAT with positions as unspecified vs reference can lead to different results.
- This option will not convert "./." to "0/0" when there is a specified genotype at a PGx position as these
- -c
or --concurrent-mode - Enable concurrent mode. This defaults to using one less than the number of CPU cores available. Note that this is only useful if processing many files/samples. With only a few files/samples, the overhead of using concurrent mode is more than the benefit it may provide.
- -cp
<num processes>
or --max-concurrent-processes<num processes>
- The maximum number of processes to use if concurrent mode is enabled.
- -v
or --verbose - Print verbose messages.
- -V
or --version - Display PharmCAT version.
Advanced Arguments
These options allow you to override default locations if the preprocessor cannot find its dependencies.
- -refVcf
<vcf_file>
or --reference-pgx-vcf<vcf_file>
- A sorted, compressed VCF of PGx core allele defining positions used by PharmCAT. By default, the preprocessor will look for
pharmcat_positions.vcf.bgz
under the current working directory. You can find this VCF in thepharmcat_preprocessor-<release_version>.tar.gz
available from the PharmCAT GitHub releases page. - -refFna
<fna_file>
or --reference-genome<fna_file>
- The GRCh38.p13 FASTA file. The FASTA file can be either decompressed or compressed but has to be indexed (.fai and, in addition, .gzi for the compressed file). We recommended the compressed reference genome FASTA file for the sake of storage. These mandatory files will be automatically downloaded (~0.9 GB) to the same directory as the reference PGx VCF file (
-refVcf
) if not provided by user (see Notes for details). - -R
<bed_file>
or --retain-specific-regions<bed_file>
- A sorted .bed file indicating regions to retain in VCF. For research use only. Additional variants are not used by PharmCAT and will slow PharmCAT down.
- -bcftools
</path/to/bcftools>
or --path-to-bcftools</path/to/bcftools>
- bcftools must be installed. This argument is optional if bcftools is available in your PATH. If not, you can download and compile bcftools and provide the path to the bcftools program. Alternatively, set using the
BCFTOOLS_PATH
environment variable. - -bgzip
</path/to/bgzip>
or --path-to-bgzip</path/to/bgzip>
- bgzip must be installed. This argument is optional if bgzip is available in your PATH. If not, bgzip is a part of the htslib. You can download and compile it and provide the path to the bgzip program. Alternatively, set using the
BGZIP_PATH
environment variable. - -G
or --no-gvcf-check - Bypass the check of the gVCF file format.
- This SHOULD ONLY BE USED if you are certain your data is not a gVCF.
Output
By default, the preprocessor will produce a (multi-sample) VCF named <base_filename>.preprocessed.vcf.bgz
, which is ready to be used by PharmCAT.
All preprocessor output files will use the base filename of the input file unless otherwise specified using the -bf
/--base-filename
argument. For example, if the input file is "study.vcf", then the base filename is "study". If the input file is "biobank_files.txt" then the base filename is "biobank_files".
If there are multiple samples, and the -ss
flag is provided, the preprocessor will produce one PharmCAT-ready VCF file per sample. The output files are named <base_filename>.<sample_id>.preprocessed.vcf
The preprocessor will produce a report named <base_filename>.missing_pgx_var.vcf
when there are absent PGx positions or alleles. This file only reports positions that are absent or unspecified in all samples. The report is based on the input VCF and is not affected by --unspecified-to-ref
or --absent-to-ref
.
Tutorial
Case 1 - single-sample VCF
Imagine we have a VCF named "test_1.vcf.bgz" to be used in PharmCAT.
$ gunzip -c test_1.vcf.bgz
$ cat test_1.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1
2 233760233 rs3064744 C CAT . PASS . GT 1/0
2 233760233 rs3064744 CAT C . PASS . GT 0/0
2 233760233 rs3064744 C CATAT . PASS . GT 0/1
7 117548628 . GTTTTTTTA GTTTTTA . PASS . GT 0/1
Command to run the PharmCAT VCF Preprocessor:
$ python3 pharmcat_vcf_preprocessor.py -vcf test_1.vcf.bgz
The VCF Preprocessor will return two files in this test case.
- one named "test_1.preprocessed.vcf", which is a PharmCAT-ready VCF
- the other named "test_1.missing_pgx_var.vcf" as a report of missing PGx positions.
Note that the chr7 variant is not used in PharmCAT and was removed by the PharmCAT VCF Preprocessor.
$ cat test_1.preprocessed.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1
chr2 233760233 rs3064744 CAT C,CATATAT,CATAT . PASS PX=UGT1A1 3/2
$ cat test_1.missing_pgx_var.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PharmCAT
chr1 97078987 rs114096998 G T . PASS PX=DPYD GT 0/0
chr1 97078993 rs148799944 C G . PASS PX=DPYD GT 0/0
chr1 97079005 rs140114515 C T . PASS PX=DPYD GT 0/0
<...truncated...>
Case 2 - multi-sample VCF
Imagine we have a VCF named "test_2.vcf.bgz" that has two samples.
$ gunzip -c test_2.vcf.bgz
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1 Sample_2
1 97740414 rs72549309 AATGA A . PASS . GT 1/0 0/1
2 233760233 rs3064744 C CAT . PASS . GT 1/0 0/0
2 233760233 rs3064744 CAT C . PASS . GT 0/0 0/1
2 233760233 rs3064744 C CATAT . PASS . GT 0/1 1/0
7 117548628 . GTTTTTTTA GTTTTTA . PASS . GT 0/1 1/0
10 94942212 rs1304490498 AAGAAATGGAA A . PASS . GT 1/0 0/1
13 48037826 rs777311140 G GCGGG . PASS . GT 1/0 0/1
19 38499645 rs121918596 GGAG G . PASS . GT 1/0 0/1
22 42130727 . AG A . PASS . GT 1/0 0/1
M 1555 . G A PASS . GT 1/0 0/1
Command to run the PharmCAT VCF Preprocessor:
$ python3 pharmcat_vcf_preprocessor.py -vcf test_2.vcf.bgz
The VCF Preprocessor will return three (3) files in this test case:
- "test_2.preprocessed.vcf"
- "test_2.missing_pgx_var.vcf"
Note that the PharmCAT-ready VCFs will use the sample names from the input VCF.
$ cat test_2.preprocessed.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1 Sample_2
chr1 97740410 rs72549309 GATGA G . PASS PX=DPYD GT 1/0 0/1
chr2 233760233 rs3064744 CAT C,CATATAT,CATAT . PASS PX=UGT1A1 GT 3/2 2/1
chr10 94942205 rs1304490498 CAATGGAAAGA C . PASS PX=CYP2C9 GT 1/0 0/1
chr13 48037825 rs777311140 C CGCGG . PASS PX=NUDT15 GT 1/0 0/1
chr19 38499644 rs121918596 TGGA T . PASS PX=RYR1 GT 1/0 0/1
$ cat test_2.missing_pgx_var.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PharmCAT
chr1 97078987 rs114096998 G T . PASS PX=DPYD GT 0/0
chr1 97078993 rs148799944 C G . PASS PX=DPYD GT 0/0
chr1 97079005 rs140114515 C T . PASS PX=DPYD GT 0/0
<...truncated...>
Case 3 - multi-sample VCF input and single-sample VCF ouputs
Given the same "test_2.vcf.bgz" as in the case 2, to obtain single-sample VCF files for each sample, run the following command:
$ python3 pharmcat_vcf_preprocessor.py -vcf test_2.vcf.bgz -ss
The VCF Preprocessor will return three (3) files in this test case:
- "test_2.Sample_1.preprocessed.vcf"
- "test_2.Sample_2.preprocessed.vcf"
- "test_2.missing_pgx_var.vcf"
Note that the PharmCAT-ready VCFs will use the sample names from the input VCF.
$ cat test_2.Sample_1.preprocessed.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1
chr1 97740410 rs72549309 GATGA G . PASS PX=DPYD GT 1/0
chr2 233760233 rs3064744 CAT C,CATATAT,CATAT . PASS PX=UGT1A1 GT 3/2
chr10 94942205 rs1304490498 CAATGGAAAGA C . PASS PX=CYP2C9 GT 1/0
chr13 48037825 rs777311140 C CGCGG . PASS PX=NUDT15 GT 1/0
chr19 38499644 rs121918596 TGGA T . PASS PX=RYR1 GT 1/0
$ cat test_2.Sample_2.preprocessed.vcf
<...header truncated...>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_2
chr1 97740410 rs72549309 GATGA G . PASS PX=DPYD GT 0/1
chr2 233760233 rs3064744 CAT C,CATATAT,CATAT . PASS PX=UGT1A1 GT 2/1
chr10 94942205 rs1304490498 CAATGGAAAGA C . PASS PX=CYP2C9 GT 0/1
chr13 48037825 rs777311140 C CGCGG . PASS PX=NUDT15 GT 0/1
chr19 38499644 rs121918596 TGGA T . PASS PX=RYR1 GT 0/1
Explanation of INFO
The PharmCAT VCF Preprocessor updates the INFO on genetic variants that warrant further inspection. Please check positions with these INFO flags:
PCATxREF
- The reference allele at this position does not match the PharmCAT reference allele at this PGx allele defining positions, which is based on the RefSeq reference human genome sequence on GRCh38. This cannot be fixed by normalizing and flipping the REF and ALT alleles in the PharmCAT VCF Preprocessor.
PCATxALT
- The alternate allele at this position does not match the PharmCAT alternate alleles at this PGx allele defining positions.
PCATxINDEL
- This position has an unexpected format for INDELs, especially an INDEL with
<*>
(an unspecified ALT allele ) or.
(absence of alternative alleles). INDELs needs to be represented with a meaningful alternative allele.
- This position has an unexpected format for INDELs, especially an INDEL with
Notes
PharmCAT uses GRCh38.p13. It is available through the NCBI RefSeq FTP site.
PharmCAT takes this file and prepares it for use with the following commands:
# curl -#fSL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz -o genomic.fna.gz
# gunzip genomic.fna.gz
# awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' genomic.fna | grep -v "^>chr\S*_" - | tr "\t" "\n" > genomic.short.fna
# bgzip -c genomic.short.fna > reference.fna.bgz
# samtools faidx reference.fna.bgz
# tar -czvf GRCh38_reference_fasta.tar reference.fna.bgz reference.fna.bgz.fai reference.fna.bgz.gzi
PharmCAT makes this indexed FASTA files available on Zenodo.