A veteran of the genetic genealogy industry looks back at the history of DNA testing for family history, and ahead to the future: the reconstruction of the entire human family tree.
A quick history of DNA testing for family history
In the late 1990s and early 2000s, Diahan Southard, other Your DNA Guide staff, and I worked in a university-based research group that made early strides in the infant field of genetic genealogy. We were among the first to collect DNA samples with correlated pedigrees and construct a database for general query by the public. We collected what at the time was a TON of data: full-genome surveys of strategic points along every chromosome for tens of thousands of individuals.
In those early days, we used prototype versions of algorithms that are bread-and-butter tools in genetic genealogy now. For the day, we had powerful computers with huge storage capacity and top-of-the-line specs that allowed us to crunch all that data. But it was still slow, and we had to run sample data in smaller sub-batches. Runs on data files that exceeded a certain size would not only take days, but sometimes they would crash a couple of days in, and we'd have to pare down the experimental design and start over to stay within the capacity of the computers and algorithms.
And then a new tool arrived. A generous philanthropist, Ira A. Fulton, donated funds to build a supercomputer for the university, which he named after his wife, Mary Lou (a philanthropist and educator in her own right). This supercomputer, still hard at work at our alma mater, far exceeded any storage and computing capacity available in even the most sophisticated networks to which we previously had access.
Now we could get some real work done! Our research group didn’t have exclusive access to Mary Lou, so we had to coordinate with other departments and sign up for time slots where we could process our experiments. Typically, we could access Mary Lou once or twice a week.
It’s been over 20 years since that time, and things in the world of genetic genealogy have gotten bigger and better. Databases now host genetic and pedigree data for millions of participants representing the entire world, rather than thousands. The number of DNA sites in genomic surveys is now in the hundreds of thousands, and the industry is preparing for the shift when whole genome sequencing (WGS) becomes the standard. Clients are poised to benefit greatly from more detailed and accurate predictions as databases grow denser with representation from all world populations and WGS delivers the ultimate level of genetic detail.
Wow, wonder, shock, awe. What an age we live in. Not only is it now conceivable to observe and collect genetic data that gives a detailed representation of the entire planet, but the methods to analyze and make sense of it all are rising to the challenge of the day.
What’s on the horizon: Future directions for DNA and family history
Two important and complementary studies, published last year, take a gigantic step forward in being able to “do the genealogy” of everyone. Two separate research groups at Oxford University worked in parallel to address similar questions, and they came up with different but complementary approaches. The big picture is to be able to trace the historic genealogy of each individual DNA segment across the whole genome. Whoa, right?
This isn’t a new concept. For portions of the human genome, this has been done since the 1980s. Segments of mitochondrial DNA have been traced back to deep ancestral lineages in Africa, which gave an unprecedented picture of human origins and migrations. Y-chromosome DNA has also been used to make similarly deep and also more recent connections within the great human pedigree.
However, both of these represent just a tiny fraction of the human genome: mitochondrial DNA about 0.0006% and the Y chromosome about 2%. Further, both are passed down mostly intact to each new generation, which makes tracing their historic trees relatively straightforward.
The remaining ~98% of the genome is made of chromosomes that undergo recombination at each generation, where maternal and paternal genetic material is shuffled at sites throughout the genome to make a new, unique human. Deconvoluting all that shuffling to describe the genealogy of each of those mixed-up chromosomal segments has been seen as a daunting undertaking. Some have even considered the task inherently unsolvable, with explosive complexity and amounts of data that exceed the capacity of modern computing.
The new methods published by the Oxford researchers break down this complexity by examining one site at a time across all chromosomes for a group of people. The algorithm starts at one end of the genome and produces a tree that describes how each person is related through the mutations seen in that particular segment of DNA. It then moves to the next adjacent segment and builds a hybrid tree of the two sites, which works well because adjacent trees tend to be similar anyway, showing similar relationships among the people in the group.
The algorithm continues across the genome, refining a combined picture of the relationships among people revealed by the mutations seen. This relatively simple approach ends up being quite powerful: it captures the shared structure among genome-wide trees and allows for unprecedented efficiency in genetic computing.
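For readers who like to tinker, here is a minimal sketch of this idea using tsinfer and tskit, the open-source Python packages behind one of the two papers (both papers are named in the comments below). The handful of genotypes is invented purely for illustration; real analyses start from genotype files with hundreds of thousands of sites.

```python
import tsinfer

# Toy data: five sampled genomes observed at five DNA sites along a short
# stretch of genome. 0 = ancestral allele, 1 = derived allele.
with tsinfer.SampleData(sequence_length=6) as sample_data:
    sample_data.add_site(0, [0, 1, 0, 0, 0], ["A", "T"])
    sample_data.add_site(1, [0, 0, 0, 1, 1], ["G", "C"])
    sample_data.add_site(2, [0, 1, 1, 0, 0], ["C", "A"])
    sample_data.add_site(3, [0, 1, 1, 0, 0], ["G", "C"])
    sample_data.add_site(4, [0, 0, 0, 1, 1], ["A", "C"])

# Infer a "tree sequence": a succession of local trees, one per segment,
# that together describe how the samples relate all along the genome.
inferred_ts = tsinfer.infer(sample_data)

for tree in inferred_ts.trees():
    left, right = tree.interval.left, tree.interval.right
    print(f"Segment {left:.0f}-{right:.0f}:")
    print(tree.draw_text())
```

Each printed tree covers one segment of the genome, which is exactly the picture described above: adjacent segments get similar, but not identical, trees.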
And efficient genetic computing is important! Previous state-of-the-art tools could handle thousands of DNA sites, but only for dozens of samples, and the analysis was time-intensive. Such tools prove impractical in the age of datasets with millions of individuals catalogued at 750,000+ sites. The new tree methods show similar accuracy to the previous state of the art, have so far been shown to scale to many thousands of samples, and are designed to grow from there. That’s a big breakthrough in the right direction.
Another incredible advantage of analyzing big genetic data through the lens of tree structure is that the data becomes massively compressible. For a database with 10 billion people (more than the entire human population) at 100,000s of DNA sites, traditional data storage strategies would require roughly 25,000 terabytes of hard disks just to hold it. That’s a lot. Super, super, supercomputer size. Tree encoding enables the exact same dataset to be compressed to just 1 TB. That’s amazing. It means that any conceivable human genetic dataset could be stored and processed on a single well-appointed laptop you could buy at Target. Are you kidding me?
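As a rough sanity check on numbers of that scale, here is a back-of-the-envelope calculation. The per-genotype storage cost and the exact site count are my own assumptions about a conventional, uncompressed genotype matrix, not figures taken from the papers.

```python
# Back-of-the-envelope: size of a naive genotype matrix versus tree encoding.
people = 10_000_000_000       # 10 billion sampled people
sites = 500_000               # "100,000s" of DNA sites (assumed value)
bytes_per_genotype = 5        # assumed cost per stored genotype call

naive_tb = people * sites * bytes_per_genotype / 1e12
tree_encoded_tb = 1           # tree-sequence figure quoted above

print(f"Naive genotype matrix: ~{naive_tb:,.0f} TB")            # ~25,000 TB
print(f"Compression vs. tree encoding: ~{naive_tb / tree_encoded_tb:,.0f}x")
```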
Integrating Future Tools into Consumer DNA Testing
It will take some time for consumer genetics companies to decide how to incorporate these new advances into their offerings to the public. When they do, clients can expect to see products that link them to relations near and far in the form of trees based on the inheritance of segments of autosomal DNA throughout human history.
Currently, clients see segment data in terms of cumulative shared centimorgans (cM), but tree-driven reporting will actually show how individual segments have been passed down through larger family groups to the present day. Some of this is extractable from data the companies already provide, but tree-based reporting will put traceable segment data front and center. That’s a pretty cool concept: getting to see where you are placed in the human family based on the record in your chromosomes.
Some of this genetic tree information may complement traditional written genealogical sources, while some researchers may find the two at odds in places. This will continue the great work already underway of using these two sources of information, genetic and written records, to challenge former conceptions and build a fuller picture of where we all came from.
These stories of where we’ve been and where we’re going can stoke our sense of wonder at how this work has accelerated in such a short time. A little healthy astonishment can help us appreciate the new ideas and tools coming to fruition that can further pull back the curtain on the stories our chromosomes can tell us about our history. There is a genealogy apparent in our genes, but there is also a genealogy of ideas, where one generation’s designs become the foundation for the next generation’s innovations, all of which propel us forward to a new level of understanding.
Mary Lou, you supercomputer, you were so fun to work with, and I’m sad we don’t need you anymore. Because now my laptop from Target can store and process all the genetic data for all of the humans on the entire planet. It’s almost absurd. But we do need you, actually, because all the work you and everyone else did to chip away at the challenges you faced helped spark the advances taking center stage today. Even when we shut you down for good, like every new generation, we still stand on your shoulders.
A New Kind of DNA Learning
If you’re ready to get started with your genetic genealogy journey, swing over and download our free guide on your next steps after DNA testing.
Get Free Guide: Finding Ancestors with DNA
Very interesting! Do you have a reference or a link to the Oxford researchers and their algorithm?
Hey Louis, here is what I got from Jayne:
https://www.nature.com/articles/s41588-019-0483-y
https://www.nature.com/articles/s41588-019-0484-x
I would also be interested to know the details of the two groups of researchers at Oxford who are working in this area. I’m aware of the work from Simon Myers’ group and the recent paper by Speidel et al:
https://www.nature.com/articles/s41588-019-0484-x
Leo Speidel did his PhD on this subject:
https://ora.ox.ac.uk/objects/uuid:61e3f8d0-6911-461d-92ea-ee91559cf353
What is the other group at Oxford working on this subject?
Hi Debbie, yes this refers to the Speidel paper in Nature 2019, and also Kelleher Nature 2019, https://www.nature.com/articles/s41588-019-0483-y.
Thanks Jayne. Coincidentally, I just read the Speidel paper the other day. I’ve just found the Kelleher paper in my folder of papers I have been intending to read, so I will look at that with renewed interest. These are very exciting possibilities!
You are describing ancestral recombination graphs (ARGs). Some earlier work on this was by Rasmussen: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004342.
They are also called evolutionary trees in an editorial introducing the Speidel article mentioned earlier in this thread: https://www.nature.com/articles/s41588-019
Kelleher introduced the method he calls tsinfer which, as discussed, compresses DNA datasets substantially without loss of genealogically relevant meaning. The issue, which Kelleher explores, is scalability. He could iterate over 15,800 trees in 11 seconds.
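For anyone who wants to try that kind of benchmark themselves, iterating over the inferred trees is straightforward with the tskit Python package that underlies tsinfer. The file name here is just a placeholder for a tree sequence you have already inferred or downloaded.

```python
import time
import tskit

# Load a previously inferred tree sequence (placeholder file name).
ts = tskit.load("inferred.trees")

start = time.perf_counter()
num_trees = sum(1 for _ in ts.trees())   # walk every local tree along the genome
elapsed = time.perf_counter() - start

print(f"Iterated over {num_trees} trees in {elapsed:.1f} seconds")
```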
More recent work in this sphere includes
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008895
We might adopt the terminology these authors have introduced: ARG and tsinfer?
The more generic description of this work is the use of graph methods. These will supplant the matrix methods used in shared clustering. Shared clustering uses sparse matrices in which there are lots of blank cells consuming computing resources. Personally, I’m now working on a toolkit using Neo4j, a native graph database, which has many uses in genealogy, including inferring family lines from DNA. These methods create edges (relationships) between traditional family trees and the DNA segment data, as in the rough sketch below. I hope to have the prototype available for testing in the next month or so and would welcome some beta-testers. Please email me if interested.
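As a rough illustration only (the node labels, properties, and connection details below are simplified placeholders, not the toolkit’s actual schema), a person-to-segment edge could be created like this with the official Neo4j Python driver:

```python
from neo4j import GraphDatabase

# Placeholder connection details; point these at your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_people_by_segment(tx, person_id, match_id, chromosome, start, end):
    # Create (or reuse) the two person nodes and the shared-segment node,
    # then connect them so the segment becomes a traversable edge in the graph.
    tx.run(
        """
        MERGE (p:Person {id: $person_id})
        MERGE (m:Person {id: $match_id})
        MERGE (s:Segment {chromosome: $chromosome, start: $start, end: $end})
        MERGE (p)-[:SHARES]->(s)
        MERGE (m)-[:SHARES]->(s)
        """,
        person_id=person_id, match_id=match_id,
        chromosome=chromosome, start=start, end=end,
    )

with driver.session() as session:
    session.execute_write(link_people_by_segment, "I123", "I456", "7", 12_000_000, 18_500_000)

driver.close()
```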
David A Stumpf, MD, PhD
Professor Emeritus, Northwestern University
Woodstock, IL
E: genealogy@stumpf.org