RandomWalk: August 2013

Saturday, August 31, 2013

Admixtures - what they are, and aren't

Well, I promised some results for this week, so I guess I better get writing. Piecing together genetic genealogy sometimes recalls Firesign Theatre's old (and by old, I mean 60's) sketch "Where're you from?" "Nairobi, ma'am. Isn't everybody?" Regardless, "genetic genealogy blogs" have been proliferating, and as hinted earlier there are some tools to help make things a lot easier.

In this post I'll turn our attention to ADMIXTURE; to most people who've looked at personal genetics this will be familiar from GEDmatch's "Ad-Mix Utilities". Unfortunately there are some misconceptions about these calculators, which I'm in part trying to set straight here, while exploring my own ancestry. The first tidbit should be self-evident but doesn't always seem to be: Since all of the admixture calculators on GEDmatch give different results, they can't possibly all be correct.

So what, then, does ADMIXTURE do? To give a fairly technical summary, it tries to determine the frequency of each Single Nucleotide Polymorphism in K different populations, and the contributions of those K populations to the genetic makeup of each individual in the analysis. I.e. if SNP 1 has a frequency of 25% in population A and 75% in population B, then a copy of SNP 1 from both parents turns up with a 6.25% chance (0.25 x 0.25) in population A and a 56.25% chance (0.75 x 0.75) in population B. With nothing else to go on, that works out to roughly a 10% chance of the individual being from population A and a 90% chance of being from population B.

But if they had the SNP - single base change - from just one parent then the odds would be 50/50. Still, this wouldn't necessarily mean that they had one parent from each population. Now, if you apply this analysis to 100,000 different SNP's and determine their contribution to 20 different populations for 3000 individuals, we're much closer to the real situation. The results are still more a probability than a fraction, though, and it all hinges on correct selection of the sample individuals and the number of populations K.
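The single-SNP example can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not what ADMIXTURE itself runs: it assumes random mating within each population, equal prior odds for every population, and that each individual comes wholly from one population. ADMIXTURE instead estimates mixed ancestry fractions for all individuals at once. The function names are made up for this sketch.

```python
from math import prod


def genotype_likelihood(copies, freq):
    """Chance of carrying `copies` (0, 1 or 2) of a variant whose frequency
    in the population is `freq`, assuming random mating."""
    if copies == 2:
        return freq * freq
    if copies == 1:
        return 2 * freq * (1 - freq)
    if copies == 0:
        return (1 - freq) ** 2
    raise ValueError("copies must be 0, 1 or 2")


def population_posterior(genotypes, freqs_by_population):
    """Normalize per-population likelihoods so they sum to 1.

    genotypes: copy count per SNP for one individual.
    freqs_by_population: population name -> variant frequency per SNP.
    Assumes equal prior odds for every population (an assumption of
    this sketch, not something the source specifies).
    """
    likelihoods = {
        name: prod(genotype_likelihood(g, f) for g, f in zip(genotypes, freqs))
        for name, freqs in freqs_by_population.items()
    }
    total = sum(likelihoods.values())
    return {name: lik / total for name, lik in likelihoods.items()}


freqs = {"A": [0.25], "B": [0.75]}
two_copies = population_posterior([2], freqs)  # variant from both parents
one_copy = population_posterior([1], freqs)    # variant from one parent only
print(two_copies)
print(one_copy)
```

With two copies the split comes out at about 10% for A and 90% for B, and with one copy both populations are equally likely. Passing longer lists of genotypes and frequencies multiplies the evidence across SNP's, which is why many thousands of markers are needed before the numbers mean much.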

One good article from the Eurogenes blog deals mainly with my opening pitch: Because ADMIXTURE compares (or classifies) individuals into different population clusters, it is only going to suggest differences from those clusters. Put another way, if there was a "Finnish" cluster, a Finnish individual might get a result of "100% Finnish" with no information about their ancestral makeup. It turns out this is usually what people want out of genetic genealogy, i.e. to display only recent admixture, but it can still provide surprises.

With those caveats in mind, I set out to run some experiments myself. To cut to the chase and provide a gentle introduction both for myself and any readers, I opted to use a dataset prepared by Razib Khan, over at Discover Magazine's Gene Expression. Incidentally, since there are limited sources of public SNP genotypes with accompanying ancestry information, that may also be pretty much the only way to start off. I expect to revisit that issue later, however.

Since I'm also interested in my own genetic origins, I merged in my own genotypes. A peculiarity about this is that my known ancestry is about 3/4 Finnish and 1/4 Germanic, while Razib's dataset contains basically no Northern Europeans to compare with. I was curious to find out if I would form my own population cluster (isn't everybody?) or what other populations I would be mapped to. So with that, onward to Finding K...

Thursday, August 29, 2013

New background

Huge news! Changed the blog background.

Well okay, maybe that's not so huge, but at least it may be a little easier on the eye. I'll look at customizing the background at some point.

In other news, GEDmatch's Neanderthal/Denisova genome comparison tool I covered in the last post has been disabled for a few days. Wonder what's up with that? I doubt that can be due to too high a load, as the site suffers under high load all the time, and those features can't be more demanding than many of the others they offer.

Hopefully they'll be back soon with a little more detailed break-down of the similarity percentages.

No posts on the blog doesn't necessarily mean nothing's going on in the background... I've been running some experimental analysis on admixtures, as briefly explained in an earlier blog post. As of today, my computer's been crunching on the data 2 weeks straight. The idea was to get some basic sample data to illustrate a few points about admixtures, in particular as implemented by the ADMIXTURE utility.

I didn't expect to spend so much computer time on initial exploration, but it's been producing a little more varied results and running a little faster than I expected, so I've decided to let it run longer, but am fast running out of reasonable scope. I may get around to posting some basics about it here this week.

The blog has so far been about genetic testing and genetic genealogy, and I'm considering keeping it that way as this may make it easier for people interested in the topic to find it. The blog name refers to a class of algorithms that is often used in complex simulations and analysis (which was to be my major at university) in any case, but it's also oftentimes an apt description of my methods, in particular with regards to blogs and such :)

Saturday, August 24, 2013

GEDmatch and you

In my last post, trying to explain admixtures, I referred to Wikipedia's statement that "While there are significant differences among the genomes of human individuals (on the order of 0.1%), these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees (approximately 1%) and bonobos", leading me to assert that humans and chimpanzees share approximately 99% of base-pairs in their genomes.

As I already hinted (by pointing out that according to dbSNP variation among individual humans is at least 2%, although I grant this is likely to include severe structural deficiencies as well) back then, this doesn't seem to be quite as clear as the page makes it sound: On another page, Wikipedia states "According to preliminary sequences, 99.7% of the base pairs of the modern human and Neanderthal genomes are identical, compared to humans sharing around 98.8% of base pairs with the chimpanzee. (Other studies concerning the commonality between chimps and humans have modified the commonality of 98% to a commonality of only 94%, showing that the genetic gap between humans and chimps is far larger than originally thought.)"

Now, while it's obvious Wikipedia's authors can't count by rounding 98.8% and calling it 98%, I guess we should just conclude we don't really know. One reason seems to be that the field is moving much faster than glacial institutions can keep up with. Around the middle of the last decade or so it was estimated Homo Sapiens would have around 10 million Single Nucleotide Polymorphisms, i.e. single base differences in the genome. This number was surpassed in 2006 or so, but it's still not uncommon to see sites referring to the 10 million number; as of this writing the number is past 60 million and still growing, though slower. Most of those must be very rare, however.

Why does this matter, unless you're planning a career in genetics? Well, one reason is that GEDmatch, the leading free (as long as you have your genotypes determined by 23andMe, FTDNA or the like first) SNP genealogical analysis site, has just added a test to show how close your genome is to the Neanderthal and Denisova genomes. This, in turn, adds greatly to the confusion of genome similarity percentages.

For me, I get:
Neanderthal:

Total SNPs in common: 924069
Match on at least one allele: 716608 (77.5%)
Match on both alleles: 432012 (46.8%)

Denisova:

Total SNPs in common: 923704
Match on at least one allele: 719676 (77.9%)
Match on both alleles: 436733 (47.3%)

Say what? We just read that the Neanderthal genome is 99.7% identical to Homo Sapiens, and here we have a result around 78%. It certainly doesn't help that 23andMe calls this 2.9%, with a 2.7% European average. So... what gives?

Let's first consider Wikipedia vs. GEDmatch. The answer is NOT "Because Wikipedia sucks", as happy as that answer would seem to make many people. What GEDmatch is apparently actually measuring, what they have to work with, are the SNP's in the genotype file: the around a million variable spots (out of 60+ million) tested in an individual's genome. This is the "Total SNPs in common" line, which is a bit of a misnomer, being the number of SNP's that have been tested both for the hominid in question and the individual being compared against them. So then, the amount matched is out of the SNP's, and not out of the complete genomic sequence, which we already know to be mostly identical. So the short answer is GEDmatch's numbers and percentages are out of the known changing locations tested, while Wikipedia's is out of the total genome.

We can do a quick back of the napkin estimation: if 1 million out of 60 million SNP's have been tested, possibly the number of matching alleles (here used as a synonym of SNP; geneticists seem to love synonyms with slight nuance differences) out of all known variations is 60 times higher than listed here. Plugging in the exact numbers, we have (60 000 000 / 924 069) * 716 608 or roughly 47 million matching.

But as already said, the rest of the over 3 billion base pairs, those not in dbSNP, are likely to be identical. How identical? Well, using Wikipedia's figure of 99.7% would just lead us full circle, so let's avoid that. According to the previous calculation I can expect to have roughly 60 million - 47 million or 13 million base pairs that aren't identical with the Neanderthal sample. Out of 3 billion base pairs total, this is 0.4%. So I would personally be 99.6% identical with the Neanderthal sample - not far off from Wikipedia's 99.7%.
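For anyone wanting to redo the napkin math with their own GEDmatch numbers, here is a minimal sketch of the same extrapolation. It bakes in the same shaky assumptions as the text: that the tested SNP's are representative of all 60 million known ones, and that everything outside dbSNP is identical. The constant and function names are just picked for this example.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # rough genome size used in the post
KNOWN_SNPS = 60_000_000            # rough dbSNP count used in the post


def extrapolate_identity(tested, matched,
                         known_snps=KNOWN_SNPS, genome=GENOME_BASE_PAIRS):
    """Scale the match rate among tested SNP's up to all known SNP's,
    then express the leftover differences as a share of the genome."""
    match_rate = matched / tested
    est_matching = match_rate * known_snps
    est_differing = known_snps - est_matching
    identity = 1 - est_differing / genome
    return match_rate, est_matching, est_differing, identity


# Neanderthal figures from the GEDmatch result above
# ("Match on at least one allele").
rate, matching, differing, identity = extrapolate_identity(924_069, 716_608)
print(f"match rate among tested SNP's: {rate:.1%}")
print(f"estimated matching SNP's: {matching / 1e6:.1f} million")
print(f"estimated differing SNP's: {differing / 1e6:.1f} million")
print(f"estimated genome-wide identity: {identity:.2%}")
```

Without the intermediate rounding the estimate lands at about 46.5 million matching and 13.5 million differing, which still gives a genome-wide identity that rounds to 99.6%.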

So what's up with 23andMe's 2.7%? Well, the short answer is, because Wikipedia sucks. No, but more seriously, 23andMe has a paper on this, basically blaming it on "ascertainment bias". The Illumina genotyping chip generally used is designed to test for 1 million SNP's known to be common among Europeans - it is not designed to test for common differences to Neanderthals. As a result the SNP frequencies cannot be generalized over the whole genome as I've just done above (though this is somewhat debatable because even studies done with whole genomes, which don't exhibit ascertainment bias, tend to arrive at around 99.5%).

The paper describes 23andMe's method to correct for this ascertainment bias; it's too long to cover in this post, but suffice to say it's one model to try to get at real, genome-wide similarity out of biased samples. But why is the number so much lower than the other estimates? Well, that's because 23andMe is trying to estimate the amount of Neanderthal ancestry and not genome-wide similarity. Neanderthals and modern humans share a common ancestor, after all. I dare say this is slightly poorly defined, but I guess they're trying to say that 2.7% of the average European's ancestors were Neanderthals. There's one in every family, apparently... and the rest were chimpanzees.

Again, there's a lot of ways to calculate numbers and percentages, and the results will invariably vary depending on the method. A prime example of "lies, damned lies, and statistics". Still, I think it'd be helpful for the various analysis services to at least try to clarify these rather than just throw out a random percentage and leave the details to be figured out.

Sunday, August 18, 2013

Biology 101

On the post topic: Well, not really.

As related earlier, the genetic testing companies do give you access to your raw genotype data, and I have a penchant for taking things apart to find out what they're made of... This, in turn, can only lead to one thing: Genetic engineering! Okay, kidding on that as well, but there are some analyses I am looking to share.

I'm going to make some posts on genotypes and their analysis, though, and for any of that to be understandable I guess there's a handful of biological basics that need to be covered. To most people, I imagine, these should be fairly familiar from basic biology lessons, but after a decade or a few, it'd be no surprise to have forgotten some, and rehearsal never hurts.
First I have to cover what is meant by "raw genotype data". They're called SNP's, or Single Nucleotide Polymorphisms - this is where a single base in the genetic code has changed into another. 23andMe will also test for a limited number of deletions, insertions and substitutions, which are pretty much what the name implies, but the majority of their raw results are classified as SNP's. In general, SNP's are considered UEP's, "Unique-Event Polymorphisms". What does this mean?

The human genome has 23 chromosome pairs with 3 billion base-pairs among them. But about 99.9% of them are the same among all living people (and still 99% with chimpanzees). This is according to Wikipedia; dbSNP seems to currently list at least 60,558,600 known SNP's for Homo Sapiens, giving 2% out of 3 billion, so there seem to be some differing counts. Regardless, the amount is significantly less than the whole genome, so why not just read the locations likely to be different? This is what "genotyping" does, in the direct-to-consumer tests testing for around a million representative SNP's.

I like Wikipedia's definition of them as "represent the inheritance of events it is believed can be assumed to have happened only once in all human history", or more specifically "In genetic genealogy a unique-event polymorphism (UEP) is a genetic marker that corresponds to a mutation that is likely to occur so infrequently that it is believed overwhelmingly probable that all the individuals who share the marker, worldwide, will have inherited it from the same common ancestor, and the same single mutation event."

Unfortunately for genealogists everywhere, I suppose, Homo Sapiens has been around for hundreds of thousands of years, and most SNP's probably predate even that. Consequently those individual SNP's have ended up all around the place, and it's not generally possible to point to any single SNP and say "Only our clan has that". Instead, people have a random mix of them, and only their proportions vary in different populations. And when people from those populations have children together, some of that structure remains, in what is called admixture.

To understand how those SNP frequencies can survive, I guess a good reference would be The Process of Meiosis, from whence I shall shamelessly link this illustration. The blue and red chromosomes represent a matching chromosome pair (one from each of their parents), already split apart and duplicated, within an individual. At the end of the meiosis shown here the pairs of chromosomes will still split from the middle, forming 4 gametes, one of which will pass to the child. This process happens independently for each chromosome pair within the genome.

This way the child doesn't receive a truly random permutation of the parent genotypes, but the genotypes they receive remain the same between the recombination events, or "crossovers", along the genome. This is to say that "genetically close" genotypes, and thus genes, are likely to be passed together to the offspring. These recombination events happen rarely enough that it is possible to track both relatives (more or less identical stretches of genotypes) and ancestral populations (ranges where the genotype distribution is typical for specific populations).
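A toy simulation may make the "stretches stay together" point more concrete. This is a heavily simplified sketch, not a model of real meiosis: it ignores the four-chromatid stage, fixes the number of crossovers by hand, and places them uniformly at random. The function name and the "M"/"P" labels (for the copies the parent got from their own mother and father) are invented for the example.

```python
import random


def make_gamete(maternal, paternal, crossovers, rng):
    """Build one gamete chromosome from a parent's two homologous copies.

    maternal, paternal: equal-length sequences of alleles.
    crossovers: number of distinct crossover points to place.
    rng: a random.Random instance, so runs can be made repeatable.
    """
    n = len(maternal)
    if len(paternal) != n:
        raise ValueError("both copies must cover the same positions")
    # Crossover points fall between positions, so segments are never empty.
    points = sorted(rng.sample(range(1, n), crossovers))
    sources = (maternal, paternal)
    current = rng.randrange(2)  # which copy the gamete starts on
    gamete = []
    start = 0
    for point in points + [n]:
        gamete.extend(sources[current][start:point])
        current = 1 - current   # switch copies at each crossover
        start = point
    return gamete


rng = random.Random(2013)
maternal = ["M"] * 20
paternal = ["P"] * 20
gamete = make_gamete(maternal, paternal, crossovers=2, rng=rng)
print("".join(gamete))
```

The printed string is a few long runs of M's and P's rather than a shuffled mess, which is the whole point: neighboring markers travel together, and only the crossover points break them up.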

I guess that is enough for one post.

DTC Genetic Testing

Spurred by weird family lore, general curiosity, and a good dose of always wanting to take things apart to find out what they're made of, I decided to turn to Direct to Consumer Genetic Testing.

I know I'm quite late to this train, but then again I don't consider myself rich. As such it's probably worth pointing out that since the end of 2012 23andMe's autosomal DNA test has cost only $99, being cheap enough to no longer count as a luxury. It's equally worth pointing out that to much of Europe the postage costs are almost another 100 dollars, and shipment by DHL may require setting aside a significant amount of time for receiving the test kit and sending it back yourself.

In my case, I decided to take the day off work when I got a call saying the courier would drop the package "somewhere between 10am and 4pm". Having dealt with DHL before I knew I could've just called them to arrange delivery at my workplace instead, but I wanted to avoid all the hassle and was eager to get my hands on the sample kit. Plus, somehow I expected them to drop it in the morning so I could head to work for the rest of the day... In the end it was closer to 5pm before the courier called me to inform me he couldn't find my address, something that's become something of a running gag with deliverymen as of late...

Dropping off the sample kit was a different matter. 23andMe uses saliva for the testing. The amount of saliva they ask for - 10 ml - sounds small and easy to provide, but I can confirm it can be a little challenging, even if you do have 30 minutes before the stabilizer should be added. I expected there to be some kind of a snap or other indicator of when the tube was securely closed, but found nothing like that, while worried the plastic tube would split if screwed on too tight - a clear design fault in my opinion. According to 23andMe's instructions, I should've now taken the sample to DHL's main offices - a trip that would've taken most of the day and come at significant extra cost.

It didn't help that the package contains a scary paper intended for the local DHL main office entitled "Attention DHL Express Operations Personnel - PLEASE READ", declaring that the "exempt human specimen" conforms to all IATA regulations and does not require UN3373 packaging as per Restricted Commodity Group registration number so-and-so etc. On the waybill it's carefully labeled as "exempt human specimen, plastic test tube and inert buffer".

Emboldened by Google search results telling me everybody just had DHL come and pick it up for delivery, I called DHL to ask whether they would pick it up. The conversation with their customer service went somewhat along the lines of "Hello, can you come pick up a package for delivery?" "Is this a delivery or are you expecting a package?" "A delivery." "What's the customer number?" "I don't have a customer number, this is for a company called 23andMe and... " "I need a customer number."

Thus I desperately looked the waybill over, trying to find something that looked like a customer number. "Would Payer Account Number do?" "Oh yes, tell the Payer Account Number." Ah now we're making progress... "What's in the package?" Okay, I guess this is where I'd be expected to spring that scary letter detailing what's in it; besides, I wasn't altogether sure how I should translate "Exempt human specimen" into Finnish. "Well, you see this is for a company called 23andMe and..." "I guess you didn't hear, what's in the package?" "Uh, a sample." "Ok... sample." I'm guessing they had a little different idea of "sample", but I suddenly decided it was not worth it to try to explain.

To avoid a repeat of the experiences with the initial reception of the kit, I arranged the shipment to be picked up at my workplace. Since we regularly deliver products around the world from the office, I made it clear to DHL that this was a private shipment that had nothing to do with the company and they should ask me by name to pick it up. Needless to say, when the courier came to pick it up, he just announced "Picking up a package for DHL". The secretary scrambled to fetch the CEO to determine just what we were expected to be shipping that day. Luckily enough I was positioned close enough to the reception that I heard the commotion and took care of it.

Two days after picking up the sample shipment, DHL called back saying that customs clearance required my Social Security Number. There was a space for it on the DHL waybill, however according to 23andMe's instructions it didn't need to be filled. Apparently, the customs had decided otherwise. Naturally there was no way really to verify that the call was from DHL, but I suppose whoever called at least knew I had recently shipped something by DHL. I can see this being a big sticking point for many people looking for personal genome services.

There are a couple of ways to avoid most of the hassle (besides, of course, living in the USA). One is using services from Family Tree DNA, who have subsequently changed their autosomal DNA testing price to $99 to match 23andMe's price. I did not order from them and thus can't vouch for them, but according to their site shipping is by standard USPS first-class mail at $7 to most locations globally. Not sure that contains return postage.

So why not use FTDNA's services? The first reason is that for genealogy each service matches only against its own userbase, and FTDNA's is much smaller. Also, FTDNA doesn't do health results at all, although you can look them up on many free services, and doesn't test for many of the conditions 23andMe does. Their "Ancestry Origins" feature is much less developed than 23andMe's, and they specifically do not determine the maternal and paternal haplotypes; for that you need to order alternate tests from them.

For me that was enough to stick with 23andMe, but for others FTDNA could be enough. Both services will let you download a file with your SNP's and genotypes, which can be used at other services existing and yet to come. My own unusual verdict is that as a direct to consumer service 23andMe's test is too complex and too much hassle for international customers to consider it a real commodity, but if you're willing to jump through some hoops it's still the more advanced and feature-full alternative.

So there are my experiences with the test itself; hopefully they will help someone stumbling here by Google while considering what test to take and what to expect.

First Post

I guess someone should tell Google about the power of different namespaces.
The first challenge to starting up a new blog is coming up with a cool sounding name for it. The second challenge is discovering all the cool names you can think of are already taken, so your only chance is to end up with something more or less random.

Ah yes, indeed. I guess I'll leave the detailed story behind this blog's name for later, if ever, and just apologize for the rather unpronounceable name. While it would be a cool name for an outdoors blog of almost any kind, I'm afraid I'll have to bring disappointment to anyone looking for that kind of content here. No, I expect this blog to be quite a different kind of random walk - geeky, and a bit quirky. Random, in other words.

I'll also reserve a few complaints for Blogger's usability and idiosyncrasies, though I suppose I'm only about to learn most of them. It helped a lot when I found a place to turn the user-interface language to English from my native Finnish, but some random texts will still appear in Finnish... This is not entirely unprecedented, mind you; Windows for example keeps trying to shove Finnish down our throats even though many technology terms have no agreed upon Finnish equivalents. User interfaces in Finnish end up looking more or less Hebrew to me. Needless to say, this blog will be in English.

For all the complaints, voiced and unvoiced, I have about Blogger, I did have a look at some competing blogging platforms. As of this writing, the otherwise most appealing competition's front page still has this informative tidbit on it: "Script error: /tmp/local_200605.xml does not exist. Please create a blank file named /tmp/local_200605.xml." Uh, okay. There were also several other oddities which make it seem like blogging is still an edgy, dangerous business not for the faint of heart. Oh well, I'm ready for the challenge (for the umpteenth time).