DNA ethnicity estimates are so complex—like the weather—that they vary for the same person even at the same testing company. Here, let us explain.
Is the title of this article unsettling? Don’t worry: I think you’ll actually be relieved at the explanation by the time you finish reading this article.
Some autosomal DNA testers have multiple tests in hand for the same person from different companies—but they get differing ethnicity results. These differences can cause confidence in them to plummet. But the reality is that if the same person’s test was run through EVEN THE SAME COMPANY’S ethnicity algorithms repeatedly, it is not expected that they would receive the exact same result every time. There’s good reason for this, and it’s fascinating.
Different Kinds of Complex Predictions
Many systems in our world are so complex that predicting their future behavior with certainty is essentially impossible. The weather, economies, and ecosystems all have numerous inputs and interdependencies which contribute to sudden states that weren’t always predictable. The term nondeterministic describes this situation well. A nondeterministic system is one where the contributing variables are so complex that reliably predicting its outcome is not feasible. Contrast this with a deterministic system where the contributing elements are known, they behave and interact reliably, and produce the same outcome every time. Deterministic systems can still be sophisticated, but given the same input they always produce the exact same results.
Years ago weather predictions generally gave a single most likely scenario (“looks like rain on Tuesday”) and then we all laughed and complained when there wasn’t a cloud in the sky. This is an example of trying to fit a nondeterministic system (the complex dynamics of weather) into a deterministic method of prediction (yep, we predict it’s going to rain). Weather scientists did their best to tell us the temperature, precipitation and wind speed they expected for the day, but this ended up not being entirely useful to the public because it often did not pan out that way.
Today we see forecasts communicated as a spread of estimated likelihoods:
This reflects an evolution in the science of predicting weather. The models and algorithms that crunch weather input data (barometric pressure, temperature, humidity, global oceanic trends, satellite data…) now handle this complex system with an eye for predicting a likely outcome (48% chance of rain at 4:00 am), while acknowledging that in this highly complex climate system there are inevitable uncertainties that may steer the weather into an entirely different direction: no rain at all is a possibility. This is a truly nondeterministic system, and the models and algorithms utilized to crunch the data that result in these more useful predictions are called nondeterministic algorithms.
One of the features of a nondeterministic algorithm is that the output from run to run is hardly ever the same. If you run the algorithm with the exact same input 40 times in a row, you’ll get 40 different answers. Does that seem like a problem?
Reporting predicted percentages is often the preferred method for handling such complex systems because it provides a better global picture of the uncertainties involved (40% chance of rain with decreasing likelihood starting at 4:00 am) rather than just giving a single concrete answer that may turn out not to be meaningful, such as predicting rain on Tuesday without acknowledging uncertainty, and then not having any rain at all.
How DNA Ethnicity is Like the Weather
The ethnicity percentages you see on your report were generated using nondeterministic algorithms. Like weather and economies, reconstructing historical genetics is a complex system full of uncertainty. Consider the inputs: 700,000 points of genetic data per individual, undefined relevance of the genetic markers to the diverse ethnicities being tested, the use of contemporary people to represent historical communities (reference panels) which may or may not be accurate, the definitions of geographic regions with uncertain extent of relevance to genetic relatedness, uncertain time depth of represented by the types of genetic markers used… A deterministic, one stop, cut-and-dry answer would definitely be washing over the complexity here!
So for those who are looking at a handful of ethnicity tests that all have different results, let’s dive into the black box of the computational algorithms and statistical models used to produce the ethnicity estimates. This will give some insight into why they can vary, even for the same person tested within the same company multiple times.
The algorithms utilized to estimate ethnicity are nondeterministic. This means that even with the same input data, the algorithm can exhibit different behavior on different sessions of running the data through. It will produce differing results from run to run. When a DNA profile is obtained for a customer, it is inserted into the ethnicity estimation process and run through the algorithm steps until it determines ethnicity percentages for the individual. And the results may vary each time. The estimations are mostly expected to be in the same ballpark, but they will vary.
The nondeterministic nature of the estimation algorithms may give pause to consumers considering what level of confidence they can have in ethnicity estimations that would be expected to vary for an identical DNA profile.
Let’s look more closely to see how the algorithm itself crunches the genetic data.
Think of an algorithm as a set of steps executed in order. You perform an algorithm when you leave your house to drive to work.
It is a nondeterministic algorithm, because you do not always take the exact same route or end up in the exact same parking place. When you travel on a multi-lane road you may change lanes several times to bypass obstacles that impact your efficiency. At some point, you might decide to turn right at a stoplight instead of going straight to avoid a traffic jam. There are probably many alternate routes that will still get you to work. When you go to park your car you may favor a particular parking space, but if it is occupied you can just use a different one nearby. And some days the parking lot may be completely full so you have to use an overflow lot down the road. The algorithm you perform to drive to work has many points at which you may select a different path, but ultimately it will usually closely approximate the location of your work.
The ethnicity estimation algorithm developed by AncestryDNA works similarly. You can think of it as finding an efficient path through the data, while it attempts to match segments of DNA from the customer to the DNA of the individuals in the reference panel.
The chromosomes are divided into small adjacent fragments that are analyzed in order along the chromosome. If a segment is determined to be likely to come from France, for example, this assignment will influence the likelihood that the next adjacent segment also came from France because it is likely that adjacent pieces of DNA were inherited from the same ancestor and therefore geography. The algorithm proceeds segment by segment through the entire genome, finding the most likely path, assigning each segment along the way to a most likely geography, which is ultimately described as the proportion of ethnicities which you see on your report.
On different runs of the algorithm for the same individual, the path through the data can diverge at any point when compared to the original run, due to factors that may cause any of the adjacent segments to receive a different assignment to a different geography (perhaps because that segment is found in France, but also in Italy). This contributes to the nondeterministic nature of the algorithm, ultimately producing different results for different runs.
So just like when you drive your car to work, the ethnicity estimation algorithm can take different paths through the data as it is searching for the ethnicity assignments that are most likely. Even though it will produce varying results from run to run for the same person, most of the likely paths generally end up in the same neighborhood of the highest probability answer.
This single highest probability answer is still just an estimate, not the literal truth. Data from 1,000 other lower probability ethnicity estimates is used to calculate a confidence range around the highest probability estimate, which we already know can be of great help in allowing us to interpret our ethnicity results.
When you have conflicting ethnicity results, know that the algorithm is expected to produce differing results from run to run, even on the same person. The highest likelihood estimations are expected to be in the same ballpark, but not necessarily exactly the same. This nondeterministic approach, while perhaps initially baffling to interpret, is actually a well-suited approach of a powerful method to sort through a highly complex genetic system with inevitable uncertainties.
Just like you’d evaluate the varying likelihood of rain throughout the day with that percentage spread shown above, use the reported spread of likely proportions for each region as an informative estimate that helps paint the picture of your complex and nuanced genetic history.
What’s next in this DNA ethnicity series: Why you should trust your Finnish DNA ethnicity more than your German (and other examples)….
For More DNA Discoveries…
Guess what? We’ve updated our DNA Resources pages. See at-a-glance many of our most important and popular articles and topics.