The basic process:
- Read in all named allele definitions from the gene definition table. Each gene has a reference allele, defined by the first definition row in the table (e.g. *1). By default, any non-reference named allele that has no base call for a given position (i.e. a blank cell in the definition table) uses the reference row’s base call.
- Read in sample data (VCF file), ignoring positions that are not used in the gene definition tables.
- For each gene:
  - If data is unphased, generate all possible combinations of genotypes for the positions of interest.
  - Attempt to match each combination to a named allele. [1]
  - If there are matches, and data is unphased, try to build diplotypes by making sure that the genotype combinations are possible.
  - If diplotypes can be found, they are scored and only the top-scoring diplotype(s) is returned. [2]
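The combination step for unphased data can be sketched as follows. This is only a minimal illustration, not the matcher's actual implementation: the function name `candidate_haplotypes`, the dictionary layout, and the sample values are all assumptions made for the example.

```python
from itertools import product

def candidate_haplotypes(sample):
    """Enumerate every haplotype an unphased sample could carry.

    `sample` maps a position name to the pair of base calls observed there
    (hypothetical layout, chosen for this sketch).
    """
    positions = sorted(sample)
    # A homozygous call such as C/C contributes a single choice,
    # a heterozygous call such as C/T contributes two.
    choices = [sorted(set(sample[pos])) for pos in positions]
    return [dict(zip(positions, combo)) for combo in product(*choices)]

# Hypothetical unphased sample: heterozygous at rs1 and rs4, homozygous at rs2.
sample = {"rs1": ("C", "T"), "rs2": ("C", "C"), "rs4": ("G", "A")}
haplotypes = candidate_haplotypes(sample)

for haplotype in haplotypes:
    print(haplotype)
```

Two heterozygous positions yield four candidate haplotypes. Each candidate would then be compared against the named allele definitions.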
Each named allele is given a score based on the number of variant positions used to define the allele (non-blank cells in that row). This means that the reference allele will always have the maximum score because all positions are defined for that allele.
Take a look at this sample gene definition table:
Since the gene definition table contains 5 positions, the reference allele, *1, gets a score of 5, while *2 only has 3 positions defined and gets a score of 3. [3]
A diplotype’s score is the combined score of its component named alleles. A *1/*2 from the example above would have a score of 8.
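The scoring rule can be written down in a few lines. The sketch below assumes a definition row is a dictionary in which `None` marks a blank cell. The base calls are made up for illustration; only the counts (5 defined positions for *1, 3 for *2) come from the text.

```python
def allele_score(definition):
    """Score = number of positions with a base call (non-blank cells)."""
    return sum(1 for call in definition.values() if call is not None)

def diplotype_score(allele_a, allele_b):
    """A diplotype's score is the sum of its two named allele scores."""
    return allele_score(allele_a) + allele_score(allele_b)

# Hypothetical rows over five positions; None marks a blank cell.
star1 = {"rs1": "C", "rs2": "C", "rs3": "T", "rs4": "G", "rs5": "A"}
star2 = {"rs1": "T", "rs2": "T", "rs3": None, "rs4": "A", "rs5": None}

print(allele_score(star1), allele_score(star2), diplotype_score(star1, star2))
# prints: 5 3 8
```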
If the sample data is missing a position that a named allele definition requires, that position is dropped from consideration and does not count toward the score.
This is the only reason the score for the same diplotype might differ between two samples.
Using the following gene definition table:
And the following (unphased) sample data:
The potential permutations of those genotypes will match *1, *2, *3 and *5.
From that, plausible diplotypes are:
| Diplotype | Score |
| --- | --- |
| *1/*2 | 5 + 2 = 7 |
| *3/*5 | 1 + 1 = 2 |
As a result, *1/*2 is returned because it has the highest score.
Note that *1/*3 is not a plausible diplotype because one chromosome must have a C and the other must have a T at position rs1. They can’t both be C. Similarly, *2/*3 is not a plausible diplotype either because it cannot be homozygous at rs4.
If we use the same gene definition table as the example above and the following (unphased) sample data, with no data available for rs5:
The results would be the same, except the scores would be different:
| Diplotype | Score |
| --- | --- |
| *1/*2 | 4 + 2 = 6 |
| *3/*5 | 1 + 1 = 2 |
If we use the same gene definition table as the example above and the following (unphased) sample data, with no data available for rs1:
Then the results would be different:
| Diplotype | Score |
| --- | --- |
| *1/*2 | 4 + 1 = 5 |
| *1/*3 | 4 + 1 = 5 |
As such, *1/*2 and *1/*3 would be returned.
Note that *5 could never be matched in this scenario because its defining allele is missing.
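The three scenarios above can be checked with a small brute-force sketch. It is not the matcher's implementation: instead of enumerating haplotypes first, it tests every pair of named alleles directly against the unphased genotypes. The `DEFINITIONS` table and the sample genotypes are hypothetical. They were chosen only so that the scores agree with the ones quoted in the text, and they leave out any allele the text does not mention. The names `base_call`, `score` and `call_diplotypes` are likewise invented for this example.

```python
from itertools import combinations_with_replacement

# Hypothetical definition table (assumption, not the original table).
# Blank cells are simply absent from each row.
REFERENCE = "*1"
DEFINITIONS = {
    "*1": {"rs1": "C", "rs2": "C", "rs3": "T", "rs4": "G", "rs5": "A"},
    "*2": {"rs1": "T", "rs4": "A"},
    "*3": {"rs4": "A"},
    "*5": {"rs1": "T"},
}

def base_call(allele, pos):
    """Blank cells fall back to the reference row's base call."""
    return DEFINITIONS[allele].get(pos, DEFINITIONS[REFERENCE][pos])

def score(allele, sample):
    """Count the allele's defined positions that the sample actually covers."""
    return sum(1 for pos in DEFINITIONS[allele] if pos in sample)

def call_diplotypes(sample):
    """Return (top_score, [top-scoring diplotypes]) for an unphased sample."""
    scored = []
    for a, b in combinations_with_replacement(DEFINITIONS, 2):
        # An allele with no defining position left in the sample cannot match.
        if score(a, sample) == 0 or score(b, sample) == 0:
            continue
        # Plausible only if, at every position, the pair's two base calls
        # are exactly the two calls observed in the sample.
        if all(sorted((base_call(a, pos), base_call(b, pos))) == sorted(calls)
               for pos, calls in sample.items()):
            scored.append((score(a, sample) + score(b, sample), f"{a}/{b}"))
    if not scored:
        return 0, []
    top = max(s for s, _ in scored)
    return top, [d for s, d in scored if s == top]

# Hypothetical unphased sample, heterozygous at rs1 and rs4.
full = {"rs1": ("C", "T"), "rs2": ("C", "C"), "rs3": ("T", "T"),
        "rs4": ("G", "A"), "rs5": ("A", "A")}
no_rs5 = {pos: calls for pos, calls in full.items() if pos != "rs5"}
no_rs1 = {pos: calls for pos, calls in full.items() if pos != "rs1"}

print(call_diplotypes(full))    # (7, ['*1/*2'])
print(call_diplotypes(no_rs5))  # (6, ['*1/*2'])
print(call_diplotypes(no_rs1))  # (5, ['*1/*2', '*1/*3'])
```

With rs1 missing, the hypothetical *2 and *3 rows become indistinguishable (both reduce to an A at rs4), which is why the two diplotypes tie.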
1: If sample data is not phased and we do not assume the reference for missing positions in the definition, it is possible to have multiple matches for a single named allele.
2: This behavior can be modified to return all potential diplotype matches.
3: This score is the same regardless of whether we assume the reference for missing positions.