Saturday, August 24, 2013

GEDmatch and you

In my last post, trying to explain admixtures, I referred to Wikipedia's statement that "While there are significant differences among the genomes of human individuals (on the order of 0.1%), these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees (approximately 1%) and bonobos", leading me to assert that humans and chimpanzees share approximately 99% of the base pairs in their genomes.

As I already hinted back then (by pointing out that according to dbSNP variation among individual humans is at least 2%, although I grant this is likely to include severe structural deficiencies as well), this doesn't seem to be quite as clear as the page makes it sound. On another page, Wikipedia states "According to preliminary sequences, 99.7% of the base pairs of the modern human and Neanderthal genomes are identical, compared to humans sharing around 98.8% of base pairs with the chimpanzee. (Other studies concerning the commonality between chimps and humans have modified the commonality of 98% to a commonality of only 94%, showing that the genetic gap between humans and chimps is far larger than originally thought.)"

Now, while it's obvious Wikipedia's authors can't count, rounding 98.8% down and calling it 98%, I guess we should just conclude we don't really know. One reason seems to be that the field is moving much faster than glacial institutions can keep up with. Around the middle of the last decade or so it was estimated that Homo sapiens would have around 10 million Single Nucleotide Polymorphisms (SNPs), i.e. single-base differences in the genome. This number was surpassed in 2006 or so, but it's still not uncommon to see sites referring to the 10 million figure; as of this writing the count is past 60 million and still growing, though more slowly. Most of those must be very rare, however.

Why does this matter, unless you're planning a career in genetics? Well, one reason is that GEDmatch, the leading free (as long as you have your genotypes determined by 23andMe, FTDNA or the like first) SNP genealogical analysis site, has just added a test to show how close your genome is to the Neanderthal and Denisova genomes. This, in turn, adds greatly to the confusion over genome similarity percentages.

For me, I get:
Neanderthal:
Total SNPs in common: 924069
Match on at least one allele: 716608 (77.5%)
Match on both alleles: 432012 (46.8%)

Denisova:
Total SNPs in common: 923704
Match on at least one allele: 719676 (77.9%)
Match on both alleles: 436733 (47.3%)

Say what? We just read that the Neanderthal genome is 99.7% identical to that of Homo sapiens, and here we have a result of around 78%. It certainly doesn't help that 23andMe puts my figure at 2.9%, against a European average of 2.7%. So... what gives?

Let's first consider Wikipedia vs. GEDmatch. The answer is NOT "Because Wikipedia sucks", as happy as that answer would seem to make many people. What GEDmatch apparently measures, because it is what they have to work with, is the SNPs in the genotype file: the roughly one million variable spots (out of 60+ million known) tested in an individual's genome. This is the "Total SNPs in common" line, which is a bit of a misnomer, being the number of SNPs that have been tested both for the hominid in question and for the individual being compared against them. So the amount matched is out of those SNPs, and not out of the complete genomic sequence, which we already know to be mostly identical. The short answer, then, is that GEDmatch's numbers and percentages are out of the known variable locations that were tested, while Wikipedia's are out of the total genome.
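To make the denominator explicit, here is a minimal Python sketch that recomputes the percentages from the counts in my report above. The function name match_percentages is just something I made up for illustration; this is not GEDmatch's code, only the division they appear to be doing.

```python
def match_percentages(total_snps, one_allele, both_alleles):
    """Return (one-allele %, both-alleles %) relative to the SNPs tested in both samples."""
    return (100 * one_allele / total_snps,
            100 * both_alleles / total_snps)

# Counts copied from my GEDmatch report.
neanderthal = match_percentages(924069, 716608, 432012)
denisova = match_percentages(923704, 719676, 436733)

for name, (one, both) in (("Neanderthal", neanderthal), ("Denisova", denisova)):
    print(f"{name}: at least one allele {one:.1f}%, both alleles {both:.1f}%")
```

Both percentages divide by "Total SNPs in common", not by the 3 billion base pairs of the whole genome, which is why they look nothing like Wikipedia's figure.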

We can do a quick back-of-the-napkin estimate: if 1 million out of 60 million SNPs have been tested, then possibly the number of matching alleles (here used as a synonym of SNP; geneticists seem to love synonyms with slight differences in nuance) out of all known variations is some 60 times higher than listed here. Plugging in the exact numbers, we have (60 000 000 / 924 069) * 716 608, or about 46.5 million matching; call it 47 million.

But as already said, the rest of the over 3 billion base pairs, those not in dbSNP, are likely to be identical. How identical? Well, using Wikipedia's figure of 99.7% would just lead us full circle, so let's avoid that. According to the previous calculation I can expect to have roughly 60 million - 47 million, or 13 million, base pairs that aren't identical with the Neanderthal sample. Out of 3 billion base pairs total, this is about 0.4%. So I would personally be 99.6% identical with the Neanderthal sample, not far off from Wikipedia's 99.7%.
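The napkin math above can be written out as a short Python sketch. The constants are the round figures used in this post (60 million known SNPs, 3 billion base pairs), and the function name extrapolate_identity is my own invention. It assumes the tested SNPs are representative of all known SNPs and that everything outside dbSNP is identical, both of which are assumptions rather than facts.

```python
KNOWN_SNPS = 60_000_000       # rough dbSNP count quoted in this post
GENOME_BP = 3_000_000_000     # rough size of the human genome in base pairs

def extrapolate_identity(tested, matched, known_snps=KNOWN_SNPS, genome_bp=GENOME_BP):
    """Scale the match rate on tested SNPs up to all known SNPs.

    Assumes positions outside the known SNPs are identical.
    Returns (matching SNPs, differing SNPs, genome-wide identity in %).
    """
    matching = matched / tested * known_snps
    differing = known_snps - matching
    identity_pct = 100 * (1 - differing / genome_bp)
    return matching, differing, identity_pct

# My Neanderthal counts: SNPs in common, and matches on at least one allele.
matching, differing, identity_pct = extrapolate_identity(924069, 716608)
print(f"matching: {matching / 1e6:.1f} million")
print(f"differing: {differing / 1e6:.1f} million")
print(f"identity: {identity_pct:.1f}%")
```

Swapping in the Denisova counts, or a different estimate for the number of known SNPs, shows how sensitive the final percentage is to those inputs.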

So what's up with 23andMe's 2.7%? Well, the short answer is, because Wikipedia sucks. No, but more seriously, 23andMe has a paper on this, basically blaming it on "ascertainment bias". The Illumina genotyping chip generally used is designed to test for 1 million SNPs known to be common among Europeans; it is not designed to test for common differences from Neanderthals. As a result the SNP frequencies cannot be generalized over the whole genome as I've just done above (though this is somewhat debatable, because even studies done on whole genomes, which don't exhibit ascertainment bias, tend to arrive at around 99.5%).

The paper describes 23andMe's method of correcting for this ascertainment bias; it's too long to cover in this post, but suffice it to say it's one model for trying to get real, genome-wide similarity out of biased samples. But why is the number so much lower than the other estimates? Well, that's because 23andMe is trying to estimate the amount of Neanderthal ancestry and not genome-wide similarity. Neanderthals and modern humans share a common ancestor, after all. I dare say this is slightly poorly defined, but I guess they're trying to say that 2.7% of the average European's ancestors were Neanderthals. There's one in every family, apparently... and the rest were chimpanzees.

Again, there are a lot of ways to calculate numbers and percentages, and the results will invariably vary depending on the method. A prime example of "lies, damned lies, and statistics". Still, I think it'd be helpful for the various analysis services to at least try to clarify what their numbers mean rather than just throw out a random percentage and leave the details to be figured out.
