This week’s post is going to be a fairly short one. When I first started getting into investigative genetic genealogy (IGG), and especially the bioinformatics side of things, I believed that a file with more SNPs would provide better results than one with fewer SNPs. My reasoning at the time was that more SNPs would provide better overlap with kits from the various direct-to-consumer (DTC) companies, and therefore produce more accurate matching at the segment level. However, this is simply not the case. The truth is that a file with, say, 500k SNPs can provide much more accurate matching at GEDmatch than one with 1.2M SNPs.
This really comes down to the fact that not all SNPs are created equal. The main differentiating factor for our purposes is minor allele frequency (MAF). This is the rate at which the less common allele at a SNP (usually the alternate allele, as opposed to the reference genome allele) is seen within a population. In essence, some SNPs are fairly rare (generally <1% MAF), while others are fairly common. In some instances, there may be a 50/50 chance of a given chromosome carrying the reference or the alternate allele at a particular location. Intuitively, it might seem that rare SNPs would be beneficial in genealogical matching; however, the opposite is actually true. Let’s consider a SNP with a MAF of 1%. Assuming the two copies are inherited independently, about 2% of the population being measured will carry one alternate allele at that SNP (heterozygous), and only about 0.01% (1% × 1%) will carry the alternate allele on both chromosomes (homozygous). GEDmatch and others use opposite homozygotes to distinguish segment boundaries, and two people can only be opposite homozygotes at this SNP if one of them falls in that 0.01%. Roughly 99.98% of pairs of unrelated people will therefore match each other at that SNP. In other words, it is really not very useful for genealogical matching purposes. SNPs where both alleles are common provide much stronger distinguishing power for matching.
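To put rough numbers on this, here is a minimal Python sketch that estimates how often two unrelated people would be opposite homozygotes at a SNP of a given MAF. It assumes Hardy-Weinberg equilibrium and randomly chosen, unrelated pairs, which is a simplification. The function names are my own for illustration and have nothing to do with how GEDmatch actually implements its matching.

```python
def genotype_frequencies(maf):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium (an assumption)."""
    p = 1.0 - maf  # frequency of the more common allele
    q = maf        # frequency of the less common allele
    return {"hom_major": p * p, "het": 2 * p * q, "hom_minor": q * q}


def opposite_homozygote_rate(maf):
    """Chance that two unrelated people are opposite homozygotes at one SNP."""
    f = genotype_frequencies(maf)
    # Either person can be the one who is homozygous for the minor allele.
    return 2 * f["hom_major"] * f["hom_minor"]


for maf in (0.01, 0.05, 0.10, 0.25, 0.50):
    rate = opposite_homozygote_rate(maf)
    print(f"MAF {maf:.2f}: opposite homozygotes in {rate:.4%} of unrelated pairs")
```

Under these assumptions, a SNP with a MAF of 1% separates only about 0.02% of unrelated pairs, while a SNP with a MAF of 50% separates 12.5% of them.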
This most often becomes a concern when imputation has been used to generate the SNP file being uploaded. Imputation can produce skewed data, where many of the higher-confidence calls are at SNPs where the vast majority of people share the same alleles. When uploaded to GEDmatch, such a file can produce a “matchy” kit that is either unusable or at least has many false positive matches and segments. This can often be avoided by using only a subset of the imputed SNPs: those with the greatest matching power. Imputation is an extremely valuable tool, and many cases I work on would never get solved without it. However, SNP selection is also extremely important when using it on challenging samples.
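As a rough illustration of what SNP selection could look like, the Python sketch below keeps only confidently imputed calls at common SNPs and ranks them by a simple proxy for matching power. Everything here is an assumption made for the example: the `ImputedSnp` record, its `quality` field, the thresholds, and the ranking formula are all hypothetical, and this is not a description of my actual pipeline or of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ImputedSnp:
    rsid: str       # hypothetical SNP identifier
    maf: float      # minor allele frequency in a reference population (0 to 0.5)
    quality: float  # imputation confidence from 0 to 1 (hypothetical field)


def select_snps(snps, min_maf=0.05, min_quality=0.9, max_snps=500_000):
    """Keep confident calls at common SNPs, most informative first.

    The default thresholds are illustrative only, not recommendations.
    """
    kept = [s for s in snps if s.maf >= min_maf and s.quality >= min_quality]
    # Proxy for matching power: the chance that two unrelated people are
    # opposite homozygotes, 2 * p^2 * q^2 under Hardy-Weinberg equilibrium.
    kept.sort(key=lambda s: 2 * (1 - s.maf) ** 2 * s.maf ** 2, reverse=True)
    return kept[:max_snps]


candidates = [
    ImputedSnp("snp_a", maf=0.005, quality=0.99),  # confident but too rare
    ImputedSnp("snp_b", maf=0.30, quality=0.95),
    ImputedSnp("snp_c", maf=0.45, quality=0.97),
    ImputedSnp("snp_d", maf=0.40, quality=0.60),   # common but low confidence
]
print([s.rsid for s in select_snps(candidates)])
```

In this toy example, `snp_a` is dropped despite its high confidence because nearly everyone shares the same alleles there, which is exactly the kind of call that can make an imputed kit “matchy”.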