Why WGS is superior to SNP arrays for IGG

One of the first hurdles an investigative genetic genealogy (IGG) professional faces when taking on a new case is determining how to get, from the evidence available, a file that can be uploaded to GEDmatch / FTDNA. This evidence could be anything from swabs from a sexual assault kit, to blood stains, or even human remains themselves. For the purposes of this post, however, I am going to assume that the DNA has already been extracted and that the process starts with DNA extract. I will cover DNA extraction in more depth in a later post.

There are generally two main methods used today to generate a SNP data file that can be uploaded to the genealogy databases used for this work: a SNP array (often referred to as a “chip”) and whole genome sequencing (WGS). I’m sure it’s clear from the title which of these methods I prefer; however, both have their pros and cons, which I will detail in this post. Armed with this information, hopefully you’ll be able to determine which route makes sense for your particular sample.

SNP arrays

SNP arrays have been widely used in academic research, genetic testing, and the consumer space for many years now. Because they are relatively inexpensive, this is also the technology used by nearly every direct-to-consumer (DTC) company offering DNA tests – including AncestryDNA, 23andMe, and others. The arrays used in both the DTC and forensic space are generally all made by Illumina and utilize their BeadChip technology. Every array carries hundreds of thousands, or even millions, of silica microbeads. Each microbead is coated with probes that target a particular complementary location in the genome. Once DNA from the sample has bound to a probe, lasers, an imaging system, and software are used to determine the alleles at each SNP. These arrays target a subset of known SNPs, generally around 500k-850k SNPs for most DTC and forensic purposes. The most common arrays used for IGG are the CytoSNP 850k and the Global Screening Array (GSA).

SNP array pros/cons

Pros:
  • Less expensive – forensic offerings generally start around $700 per sample
  • Separate bioinformatics usually not required – most labs will simply provide a file that can be uploaded
  • Can work with relatively small amounts of DNA, down to ~1 ng or less
  • Lower required start-up capital means it’s easier for a lab to begin offering this service
Cons:
  • Does not handle degraded DNA well – samples with an average fragment size of <=150 bp have been found to be unsuccessful (Source)
  • Anecdotally, does not handle bacterial contamination well
  • Generally when array analysis fails, it leaves you with no data at all – it’s all or nothing
  • Bioinformatics techniques such as imputation are of limited usefulness with the limited data provided
  • Depending on the array, mtDNA and X/Y chromosomes may not be tested
  • Less future-proof if different SNPs or other types of genetic data are desired for future applications

Image: NovaSeq 6000 (Magnus Manske, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons)

Whole genome sequencing (WGS), on the other hand, is a relatively new technology – at least when it comes to being financially feasible for consumers, or even forensic applications. Producing the first “finished” human genome sequence (Human Genome Project, 2003) is estimated to have cost $500 million to $1 billion in sequencing alone. However, there are now companies such as Nebula, Dante, and others offering whole genome sequencing directly to consumers at costs as low as $200-300 (though this is likely somewhat subsidized). As opposed to SNP arrays, which only target and provide data on hundreds of thousands of SNPs, whole genome sequencing provides data on all ~3 billion base pairs in the human genome: in effect, thousands of times more data. During WGS, a sequencing library is first prepared from the DNA extract. The sequencer then reads every individual fragment of DNA, including mitochondrial DNA, the autosomal chromosomes, and both sex chromosomes. This is accomplished by loading the sequencing libraries into a flow cell, where a combination of lasers and imaging equipment reads each individual base in a fragment. This is of course a simplified explanation, and I hope to cover the process in more depth in the future.
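
To put “thousands of times more data” in perspective, here is a quick back-of-the-envelope calculation using only the round figures quoted in this post (~3 billion base pairs, 500k-850k array SNPs). It compares the number of positions covered, not the genealogical usefulness of the data, and the names in it are just illustrative.

```python
# Rough scale comparison using the round figures quoted in this post.
GENOME_BASE_PAIRS = 3_000_000_000  # ~3 billion base pairs in the human genome
ARRAY_SNP_COUNTS = {"low end": 500_000, "high end": 850_000}  # typical array sizes


def fold_increase(array_snps: int, genome_bp: int = GENOME_BASE_PAIRS) -> float:
    """Return how many genome positions exist per SNP targeted by an array."""
    return genome_bp / array_snps


for label, snps in ARRAY_SNP_COUNTS.items():
    ratio = fold_increase(snps)
    print(f"{label}: {snps:,} SNPs -> roughly {ratio:,.0f}x more positions with WGS")
```

Depending on the array, that works out to somewhere between roughly 3,500 and 6,000 times as many positions.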

WGS pros/cons

Pros:
  • Data for the entire genome can be obtained – including mtDNA and X/Y chromosomes
  • Can utilize very small amounts of DNA – down to the tens to hundreds of picograms
  • Very degraded DNA can be utilized – fragments as small as ~30-40 bp can be reliably mapped to the human genome
  • Even with heavy bacterial contamination, data from the human DNA contained in the sample is still obtained
  • Advanced bioinformatics techniques such as low-coverage imputation often allow usable data to be recovered from extremely poor quality samples
  • Since data for essentially the entire genome is captured, it is much more “future-proof”
  • STR markers can sometimes be determined from WGS data
Cons:
  • More expensive than SNP arrays – forensic WGS generally starts around $1400
  • Additional bioinformatics are generally required to analyze the raw sequencing data and produce a SNP file that can be utilized for genealogy
  • Fewer options available when it comes to labs providing this service, due to the high capital costs of a sequencer
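
To give a feel for the last step of that additional bioinformatics, here is a minimal sketch of writing genotype calls out as a tab-separated SNP file. It assumes the genotypes have already been called from the sequencing data, and it assumes a four-column layout (rsid, chromosome, position, genotype) resembling DTC raw data exports; check what format your target database actually accepts. The function name, the example calls, and the placeholder rsIDs are all made up for illustration.

```python
import csv
import io

# Assumed four-column header, resembling DTC raw data exports.
HEADER = "# rsid\tchromosome\tposition\tgenotype\n"


def write_snp_file(calls, target_rsids, out):
    """Write only the calls whose rsid is in target_rsids; return the count written.

    calls: iterable of (rsid, chromosome, position, genotype) tuples.
    target_rsids: set of rsids to keep (e.g. the SNPs a database compares on).
    out: any writable text stream.
    """
    out.write(HEADER)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for rsid, chrom, pos, genotype in calls:
        if rsid in target_rsids:
            writer.writerow([rsid, chrom, pos, genotype])
            written += 1
    return written


# Placeholder data: these are NOT real rsIDs or positions.
example_calls = [
    ("rs_example_1", "1", 1000, "AG"),
    ("rs_example_2", "1", 2000, "CC"),
    ("rs_example_3", "X", 3000, "TT"),
]
targets = {"rs_example_1", "rs_example_3"}

buffer = io.StringIO()
count = write_snp_file(example_calls, targets, buffer)
print(f"{count} SNPs written")
print(buffer.getvalue())
```

A real pipeline has a lot more going on before this point (alignment, genotype calling, often imputation), which is why this step usually requires someone with bioinformatics experience.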

As you can see, the main downside to whole genome sequencing is cost. It generally starts at approximately double the cost of a SNP array, though I do believe the cost will continue to decrease in the future. However, in almost every other aspect it is superior to SNP arrays. This is especially true for degraded DNA and/or samples that contain bacterial contamination – both of these are very often the case when it comes to forensic samples. In the end, if you have large amounts of relatively high quality DNA available, it doesn’t hurt to try a SNP array, especially if budget is an issue. However, if budget allows, I would still generally recommend WGS, as it provides many more options to use that data in the future if needed. For any precious sample that is limited in quantity, degraded and/or contains bacterial contamination (especially the case with human remains), I would only recommend pursuing WGS.
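
The recommendation above can be summarized as a simple rule of thumb. The sketch below is just my reasoning from this post written out as code, not a validated triage tool: the field names are hypothetical, the 150 bp cutoff and the starting price come from the figures quoted earlier, and real cases deserve an individual look.

```python
from dataclasses import dataclass

# Figures quoted in this post; treat them as rough starting points.
ARRAY_FAILURE_FRAGMENT_BP = 150  # arrays reported unsuccessful at <=150 bp average
WGS_START_USD = 1400             # approximate starting price of forensic WGS


@dataclass
class Sample:
    mean_fragment_bp: float        # average DNA fragment size
    bacterial_contamination: bool  # known or suspected contamination
    limited_quantity: bool         # precious sample with little extract left
    budget_usd: float              # what the agency can spend on this sample


def recommend(sample: Sample) -> str:
    """Return "WGS" or "SNP array" following the rule of thumb in this post."""
    degraded = sample.mean_fragment_bp <= ARRAY_FAILURE_FRAGMENT_BP
    if degraded or sample.bacterial_contamination or sample.limited_quantity:
        return "WGS"  # only WGS is recommended for these samples
    if sample.budget_usd >= WGS_START_USD:
        return "WGS"  # still preferred when the budget allows
    return "SNP array"  # plentiful, high quality DNA and a tight budget


print(recommend(Sample(120, False, False, 700)))
print(recommend(Sample(400, False, False, 800)))
```

Note that for a degraded, contaminated, or limited sample the rule returns WGS regardless of budget, since a failed array run can consume extract and return no data at all.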

If you are trying to decide the best route to take with a particular sample, I’m always happy to take a look and make a recommendation! Please feel free to reach out via the contact page.
