Monday, December 9, 2013
Let's try to keep this short... I mean, shorter; I already had to try to keep the last Bitcoin post short :) It would seem I misjudged at least the short-term direction of Bitcoin, as Monday saw a moderate rise (which is generally good for stability, with less chance of a rapid correction in the other direction). I'm still not convinced the crash is over, however, as I can only anticipate more negative news as businesses react to the volatility and regulators think of new ways to rein Bitcoin in.
There are, however, opposing views, one just recently posted on the Washington Post blog: Here's Why Volatility Isn't a Big Problem for Bitcoin. However, the article stumbles slightly at the delivery, summarizing at the end that "Bitcoin's extreme volatility is a problem right now, but that won't last forever." So which is it, a problem or not? From my present viewpoint, if the volatility continues - maybe even if it doesn't - it'll be solved by the majority losing all interest in Bitcoin.
And here's Friday's Bitcoin flash-crash immortalized in the media. The author, of course, seemingly misreads the action - the Bitcoin price didn't lose its mind; the last moments of the flash-crash were just playing in an auto-repeat loop during the almost 3-hour downtime, which, judging by the internet chatter, fooled a lot of people into thinking things were okay and there were huge profits to be made. I don't think that was Mt.Gox's doing, however - possibly just a flaw in the price-graphing site, as my own price feed showed transactions at a standstill.
For those who still don't even know what Bitcoin is, I'll simply summarize: it's a decentralized, alternative (to the established currencies) digital cryptocurrency that's been around for over 4 years. Certain very complex calculations are performed to seal previously verified transactions into the shared ledger. The first person to successfully do so gets a reward, currently set at 25 bitcoins. This reward halves roughly every 4 years to keep inflation at bay, and the calculation difficulty is adjusted so that one is completed on average every 10 minutes.
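To make that schedule concrete, here's a tiny sketch of the arithmetic (not any actual Bitcoin code; the protocol halves the reward every 210,000 blocks, which at one block per 10 minutes is roughly every 4 years):

def block_reward(height, initial_reward=50.0, halving_interval=210_000):
    """Block subsidy in bitcoins at a given block height."""
    return initial_reward / 2 ** (height // halving_interval)

blocks_per_year = 365 * 24 * 6  # one block every ~10 minutes
for year in range(0, 13, 4):
    height = year * blocks_per_year
    print(f"~year {year}: block {height}, reward {block_reward(height)} BTC")

Running that prints 50, 25, 12.5 and 6.25 BTC at roughly years 0, 4, 8 and 12, which is why the reward currently stands at 25.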
And that's that for now.
Sunday, December 8, 2013
On Bitcoin
Time to make a short digression from my usual subject. When I bought my current computer rig, I got it with a decent graphics card for the purpose of mining Bitcoins to offset the hardware and electricity costs of keeping it running. If you haven't heard of Bitcoins yet, you really should get out more (or less, depending). I'll not rehash the concept, as you're likely to find out all about them whether you want to or not just by looking at recent news (or by Googling, should you have found this blog entry in the distant future... Hi, future me!).
However, I was first sold defective memory (though I now hear all of OCZ's memory and SSD's were defective anyhow... though the computer store advised me it couldn't be the memory, and after trying three sets of OCZ I concurred) and couldn't get the machine running stably, so I barely got any run-time out of it. It wasn't until several years later that I got a different set of memory just to test it out, and got it running. By then, Bitcoin's mining difficulty had already inflated so high that the coins it produced could no longer offset the price of the electricity to generate them...
Or so I thought. Little did I know: this week the Bitcoin price hit 1,200 dollars, or about 900 euros, per Bitcoin. Had I kept the computer on churning out Bitcoins, I suppose I'd have more than made up for the electricity cost, at least until this week.
Anyway, despite the faulty memory I had managed to mine a few Bitcoins and had all but forgotten about them, but reading the news on the price and seeing the exponential rise in the price graph, I figured it was high time to cash them out before the inevitable - even if possibly temporary - crash.
The news media were all over Mt. Gox, the trading-card-site-turned-money-exchange, reputed to be the most popular exchange with the highest exchange rate, so it was natural that I would send my hard-earned Bitcoins over to Mt.Gox. Now, if you've been following Bitcoins at all, you should know you need to check, double-check and then check once again the reputability and terms of anywhere you're sending your bitcoins.
I suppose I have only myself to blame, but I tried to do this with Mt. Gox, though apparently a little too poorly, as after I had exchanged my bitcoins for cash I found out that it was practically impossible to get my money out of the service. It would take weeks to - possibly - clear my identification and proof of residence, and only then could I even enter a months-long waiting list to transfer part of my meager proceeds to my bank account.
Naturally, I then converted my money back to Bitcoins to get them out and cash them out through some other service - incurring another 0.6% exchange fee. It was then that I discovered you can't even get your Bitcoins out of Mt.Gox before your identification & proof of residence clear. This is the main reason for this blog post: to improve the chance of anyone else finding a warning that, as of this writing, Mt.Gox is basically a one-way street - you can check your Bitcoins in, but you can't get them out in a reasonable time or amount - and I want to help get the word out.
Anyway, the next day China forbade official trade in and use of Bitcoins, leaving it still legal for citizens "pending further review", sending Bitcoin prices tumbling, as China's market is currently driving the prices. This, in turn, has led several major players in China, including Baidu (aka the "Google of China") and China Telecom, to announce they no longer accept Bitcoins as payment, due to the proven volatility and the very real risk of being left holding a worthless currency.
Naturally, this sent Bitcoin's price down for further tumbles, including, as of this counting, at least three or four flash-crashes. During the worst of them, Bitcoin's price hit about 400 euros (for those keeping count, that's less than half of the highest point just a day earlier). In the ensuing panic it would probably have dropped to near zero, but Mt.Gox also has a bad (or, some would say, serendipitous) habit of starting to lag in showing prices and rejecting transactions when that happens.
During the worst flash-crash it went down for almost 3 hours, showing a frozen price ticker everywhere until the other exchanges finally recovered to stability... Fortunately, one would say, as otherwise there might be no Bitcoin today (okay, I exaggerate, it keeps coming back, like a zombie). Almost as if by design, though it's certainly not unheard of for brick & mortar stock markets to suspend trading when faced with high volatility, nor for poorly designed web services to fail under load.
Unfortunately, that outage also deprived Mt.Gox users of the possible profits to be made from the volatility during that time, as well as of the chance to protect their money had the other exchanges continued their fall towards zero in the meantime. Consequently, based on my short experience with Mt.Gox, I certainly can't recommend it to anyone for any purpose.
Presently, what remains of my Bitcoin assets is firmly in the euro camp. I could turn a profit from the volatility of the market, but that wasn't enough to fight the overall trend, and I found their net worth dwindling along with the value of Bitcoin. As a result I'm now down to half of what I intended to cash out.
Though I could re-buy at a low point and regain the losses in number of Bitcoins, I don't think I will risk it. As of this writing, Bitcoin has been rising again, but I can't imagine that trend continuing past Monday, as more and more businesses are likely to announce they're suspending Bitcoin payments pending technical review due to the volatility. With a verified Bitcoin transaction taking about an hour, I can't imagine any sane business taking that risk.
About the only thing that could turn around the current downward spiral would be China announcing on Monday morning that upon review they've decided to re-allow banks and other government institutions to deal in Bitcoins and fully endorse them. That is, to put it mildly, extremely unlikely. There is an even higher chance that, after the Bitcoin price has fallen low enough to make it palatable, China will announce that to "protect the citizens" it is completely forbidding dealing in Bitcoins.
Is this the end for Bitcoin? I doubt it - the technology is still solid (though comparatively slow and uneconomical) - but the volatility and the burst bubble seem almost by necessity to be forcing Bitcoin back to being a nerd curiosity. There are numerous alt-coins based on the same principles as Bitcoin but with various problematic aspects fixed, and one of them could overtake Bitcoin, but it's hard to fix their prime failing - price volatility. With that fault they're neither a feasible currency nor an investment, but a pure gamble.
Saturday, November 23, 2013
Not quite dead, not quite alive
The analysis of the data took much longer than I anticipated, largely because the power outage I had didn't turn out to be an isolated incident. My computer rig outgrew my office Uninterruptible Power Supplies long ago, but I never thought that would be an issue, because we experience power disruptions maybe once or twice a year at most. For some reason, the power disruptions carried on daily for days, then stretched into weeks... It made me feel like I was living in some third-world country.
The analysis had reached a phase where each new run took almost a day, and with a daily power outage you can do the math. Worse, there was some filesystem corruption from the sudden computer reboots, so after a few days I just gave up and decided to wait out the blackouts. In truth I finished the current run of the analysis a while ago already, but I've been keeping busy with my job - and still am, so I don't have time for much finesse (as usual).
Incidentally, I guess that's not a huge problem; this is more a research journal than a true blog, though I'm hoping to offer more digestible content and possibly regular updates as time goes by. But for now this resembles more a log of my research, and I don't expect to have many readers beyond those who may find this by googling some of the specific things mentioned here. I'll say welcome to any potential readers, and feel free to say hello.
Speaking of which, a rather sizeable dump of the results from my latest run is available. The time crunch means I still haven't got a decent way to visualize it; instead I'm just listing the 25 highest-affinity samples for each genetic similarity cluster for each tested K value. As before, the clusters form on their own, unsupervised, and vary - often back and forth - between K values, so this time I've simply ordered the clusters from left to right according to my own genetic makeup in the analysis rather than trying to line them up in any absolute sense. I've run the analysis for K 1 to 40 for now. Incidentally, 40 is also the first K value at which the Orcadians separate into their own cluster in the analysis. That is a good point to pause and reflect on whether this analysis is generating anything sensible. I expect to have more to write on these things in the future.
There's a reason I'm hurrying this update out - 23andMe has finally announced and started to roll out their new Ancestry Composition update. Other genetic ancestry blogs have already broken the news all around, so I probably shouldn't waste many words on it. But the 23andMe Ancestry Composition is being expanded to recognize 31 different populations; the additional ones are Japanese, Korean, Yakut, Mongolian, Chinese, Southeast Asian, West African, South African and Central & South African. With these additions there will be less need for anyone to run their own ADMIXTURE analysis except for learning (or research) purposes. But for now I understand no customers have yet received their results on the new populations.
Tuesday, September 10, 2013
Quality Control 1
I've been procrastinating long enough, though as before my computer's been hard at work churning over the data, so the time's hardly been wasted. Except for the rash of short blackouts over the last two days; unfortunately my current rig takes too much power to run off my old uninterruptible power supplies, so I've lost a few days' worth of progress. Just when you thought they had this "power generation" thing down... oh well.
Last time around, I was analyzing Razib Khan's PHYLOCORE genotype dataset, which he provided for personal ADMIXTURE runs on the Gene Expression blog, with different values of K - the number of expected ancestral populations, or clusters. At high K values, several clusters turned out to contain only two or three individuals - people whose genetic variation was covered by each other. This is where I should claim I was planning for that all along, and offer a condescending tidbit of foresight like "This is why you never run an analysis before doing quality control".
Alas, I was simply hoping to cut a few corners, as that dataset was explicitly provided for such analysis, so I thought I'd be okay running it without quality controls and getting straight to business. Well, for future reference, people who google for the dataset or the concepts will hopefully find this blog and the warning. Of course, there remains something of a practical question of quality control for what - or just how much quality control is enough. As the last data indicated, the original dataset may be just fine for runs with low K.
As we presumably want to get sensible results out of high-K runs, how exactly do we perform quality control? One important question at this point is what has already been done. The safest way to know the data's provenance would be to construct my own dataset, but I've chosen to use this one as a shortcut. Besides, individual genotypes are somewhat closely guarded - a couple of well-known sources like the Human Genome Diversity Project and 1000 Genomes form the basis of PHYLOCORE, but there seem to be a few other sources I'm not familiar with.
So let's keep playing with PHYLOCORE. The Swiss army knife of genetic analysis is PLINK, which does pretty much everything that's needed - but it's not as simple as clicking a single button and waiting. A number of steps need to be taken, and decisions made, particularly with regard to thresholds. First and most important are the missingness reports, generated with:
plink --bfile PHYLOCORE --missing --out PHYLOCORE
After this, PHYLOCORE.lmiss will contain statistics on missing genotypes by SNP. A quick look at this file reveals that PHYLOCORE appears to have been cleaned up to contain only SNP's with less than 1% missing, i.e. genotypes that couldn't be determined or weren't tested for. This seems good enough. The PHYLOCORE.imiss file, on the other hand, lists missing genotypes by individual. While each SNP has a missing rate of at most 1%, it would still be possible for the same 1% of individuals to be missing ALL of their genotypes, so we need to check for this too. And here we find our first problem:
FID IID MISS_PHENO N_MISS N_GENO F_MISS
HADZA BAR11 Y 21448 134459 0.1595
HADZA END21 Y 20528 134459 0.1527
HADZA END09 Y 19267 134459 0.1433
HADZA BAR09 Y 17261 134459 0.1284
HADZA BAR04 Y 13558 134459 0.1008
HADZA END22 Y 11094 134459 0.08251
HADZA BAR13 Y 7679 134459 0.05711
HADZA BAR08 Y 7597 134459 0.0565
HADZA BAR01 Y 6439 134459 0.04789
Yoruba NA19248 Y 5149 134459 0.03829
A number of the Hadza in the dataset are missing upwards of 16% of the genotyped SNP's. In practice, the missing SNP's act like wildcards: in the extreme case of an individual with just one SNP genotyped, that single SNP variation would "explain", or cover, everybody else in the data. Optimally, one would carefully analyze the algorithm and determine what effect the missing SNP's have, and then statistically determine a cut-off point for missing SNP's where the expected error margins are met. More practically, this falls to the age-old statistical method of "Whoa, that looks bad", which I've decided to employ here.
The Hadza of Tanzania, numbering around 1000 people, live 50 kilometers from the "Cradle of Humankind" and, according to their oral tradition, have lived there since the dawn of time, when the world was inhabited by hairy giants and fire was against the laws of nature. They're also likely among the oldest branches of humankind still in existence, so while their significance to modern genealogy is small, I'd rather keep as many of them as possible in the analysis. I therefore opted for a cut-off point of 10% missing SNP's.
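For the record, applying just that missingness cut-off could be done directly in PLINK with its --mind filter (a sketch of the command, not what I actually ran - I ended up combining this with the relatedness pruning in a script further below):
plink --bfile PHYLOCORE --mind 0.1 --make-bed --out PHYLOCORE_mind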
Are all the SNP's in the dataset even necessary? Well, the simple answer is no. As people may recall, the DNA strands get tangled at given spots during meiosis, forming a crossover, or recombination event. While the spots are random, they follow a certain probability distribution, and the stretches of DNA - and thus the SNP's - between those spots pass together to an offspring. The probability that two SNP's are passed on together is modeled by linkage disequilibrium. Estimating it with PLINK's indep-pairwise 50 5 X, I obtained the following table of the SNP's that would remain after pruning at the specified R^2 values:
[Table: number of SNP's remaining after LD pruning at various R^2 thresholds]
At LD R^2 1.0, the number of SNP's pruned would be only 45; this low number is most likely because the beadchips used for genotyping are generally LD-pruned already and don't test many neighboring SNP's. It doesn't seem to me like the dataset itself has been LD-pruned; the ADMIXTURE paper recommends pruning, but on the other hand I'm currently experimenting with the dataset across a very wide swath, so I will leave experimenting with LD pruning for later.
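For later reference, the pruning itself should go roughly like this in PLINK - a sketch with an illustrative R^2 threshold of 0.5, which is exactly the decision I'm postponing:
plink --bfile PHYLOCORE --indep-pairwise 50 5 0.5 --out PHYLOCORE_ld
plink --bfile PHYLOCORE --extract PHYLOCORE_ld.prune.in --make-bed --out PHYLOCORE_pruned
The first command only writes the lists of SNP's to keep and to drop (.prune.in and .prune.out); the second actually builds the pruned dataset.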
For this blog entry, finally, the last but not least quality control measure I'll mention is "Identical By Descent" - or, whether individuals are relatives or not. This was the specific problem mentioned in the beginning, and ADMIXTURE is meant to be run on individuals who are not related to each other. This can be determined with "plink --bfile PHYLOCORE --genome --out PHYLOCORE". After a long while, PHYLOCORE.genome is output, containing the calculated amount of identity by descent between each pair of individuals (this can, naturally, also be quite a large file - for 2000 individuals there are roughly 2000*2000/2, or 2 million, pairs, since order doesn't matter).
Here, we concern ourselves mostly with PI_HAT, the estimated proportion of the genome shared identical by descent. A parent and offspring, for example, are 50% identical. In typical Genome-Wide Association Studies the cut-off point seems to be around 25%, which corresponds to second-degree relatives such as half-siblings, but I'm not running a GWAS right now, genotype data is hard to come by for an amateur researcher, and I'm just exploring the dataset over a wide range, so I decided to make the cut-off point 50%.
One quick script to pick, out of each pair of related individuals, the one with the lower genotyping rate - plus the few Hadza with a higher than 10% missing genotype rate - and I had pruned 127 individuals out of the test data and got back to crunching. That list is a little long, and as it depends on the thresholds chosen, I'll provide it on request only. Unfortunately the power cuts came AFTER that and I didn't save the script, so I'll have to recreate it later, when I need it again; a rough sketch of the kind of script I mean follows below.
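Since the original is gone, here's a minimal sketch of that kind of script, assuming PLINK's standard .imiss and .genome column layouts; the file names and thresholds are just the ones used in this post:

F_MISS_CUTOFF = 0.10
PI_HAT_CUTOFF = 0.50

# Per-individual missingness (PLINK .imiss: FID IID MISS_PHENO N_MISS N_GENO F_MISS)
f_miss = {}
with open("PHYLOCORE.imiss") as imiss:
    next(imiss)  # skip header
    for line in imiss:
        fields = line.split()
        f_miss[(fields[0], fields[1])] = float(fields[5])

# Start with everyone who exceeds the per-individual missingness cut-off
remove = {ind for ind, miss in f_miss.items() if miss > F_MISS_CUTOFF}

# Pairwise IBD estimates (PLINK .genome: FID1 IID1 FID2 IID2 ... PI_HAT)
with open("PHYLOCORE.genome") as genome:
    header = genome.readline().split()
    pi_hat_col = header.index("PI_HAT")
    for line in genome:
        fields = line.split()
        a, b = (fields[0], fields[1]), (fields[2], fields[3])
        if float(fields[pi_hat_col]) < PI_HAT_CUTOFF:
            continue
        if a in remove or b in remove:
            continue  # this pair is already broken up
        # Drop whichever member of the pair has the higher missing rate
        remove.add(a if f_miss.get(a, 0.0) >= f_miss.get(b, 0.0) else b)

# Write a remove-list: one "FID IID" per line
with open("remove.txt", "w") as out:
    for fid, iid in sorted(remove):
        out.write(f"{fid} {iid}\n")

print(f"{len(remove)} individuals listed for removal")

The resulting remove.txt can then be fed back to PLINK with --remove to build the cleaned dataset.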
Sunday, September 1, 2013
Finding K
As I mentioned in an earlier post, the goal was finding the optimal K for Razib Khan's "PHYLOCORE" genotype dataset for use with the ADMIXTURE utility - or, in other words, for determining ancestral populations. Another goal was to get an idea of how the choice of K, the number of ancestral populations, affects the results, and to gain overall familiarity with the tools.
Well, I'm putting some of the results, such as they are, in boring table format here. Each 10-row table section represents a run for a single K, starting with 2. For each cluster, or "ancestral population", I've listed up to 10 samples from the analysis that were closest to the center of said population, i.e. most typical of it, as long as their estimated amount is at least 0.5 (i.e. they're at least 50% of said population cluster).
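Pulling such a listing out of ADMIXTURE's output is simple enough; here's a rough sketch, assuming the usual .Q output (one row of ancestry fractions per individual, in the same row order as the accompanying PLINK .fam file) and illustrative file names:

K = 13           # illustrative; repeat for each K tested
TOP_N = 10
MIN_FRACTION = 0.5

with open("PHYLOCORE.fam") as fam:
    ids = [line.split()[1] for line in fam]  # IID column of the .fam file

with open(f"PHYLOCORE.{K}.Q") as qfile:
    q = [[float(x) for x in line.split()] for line in qfile]

for k in range(K):
    ranked = sorted(zip(ids, (row[k] for row in q)), key=lambda t: t[1], reverse=True)
    top = [(iid, frac) for iid, frac in ranked if frac >= MIN_FRACTION][:TOP_N]
    print(f"Cluster {k + 1}: " + ", ".join(f"{iid} ({frac:.2f})" for iid, frac in top))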
For the purposes of finding K, the ADMIXTURE manual is pretty clear. There is a cross-validation switch in the program, which makes it split the genome into (by default) 5 sections, or "folds". It will then estimate the ancestry of all the individuals in the analysis, leaving one fold at a time out of the genome. In theory, if the analysis has been successful, there should be a high degree of agreement between the estimates.
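Concretely, that means one run along these lines for each K being tested (a sketch; the --cv flag takes the number of folds, and the cross-validation error for the run is printed in the log output):
admixture --cv=5 PHYLOCORE.bed 13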
In practice, here's how it went:
[Figure: PHYLOCORE K Cross Validation]
By the ADMIXTURE manual, K=13, where the cross-validation error is at its smallest, would be the optimal choice. At K=13 the clusters form around Lithuania, Sardinia, Balochi, China, Georgia, Bedouin, Nganasan, Hadza, Koryak, Maya, Paniya, Yoruba and San bushmen. At this level my Finnish ancestry maps to Lithuania, but my Germanic ancestry to the Sardinian cluster, which is almost certainly incorrect. In fact, at different levels of K the center of the cluster with my ancestry jumps between the Sardinian and Spanish, until at K=25 they break into separate clusters.
I would suggest that the manual's recommendation of the minimum CV error is a little counter-intuitive; because very few people are unadmixed (i.e. have a genome from just one ancestral population), the cross-validation error is going to keep increasing the more detailed the admixture analysis gets. K values that are a good fit to the underlying population structure are going to pull the cross-validation error downwards, but not enough to overcome the overall trend. Looking at the data, the local minima in the graph - at K=16, K=23, the wide one at K=25, etc. - seem to generally be of better quality (admittedly, I'm mostly looking at the Sardinian cluster, which is doing weird things).
Unfortunately, at K=19, where 4 Dusadh form a cluster by themselves, there's the first hint of something being seriously wrong. At K=28, where the Spanish split in two and three other clusters have only a handful of people in them, this starts to become rather obvious. I just had a look at the PHYLOCORE data, and it would appear it hasn't had even basic sample quality control done on it. So that'll have to be my next "project". I do consider this run to have been useful, though; at the very least it illustrates what happens when there are directly related individuals in an admixture analysis. I'll save other observations for later.
Saturday, August 31, 2013
Admixtures - what they are, and aren't
Well, I promised some results for this week, so I guess I'd better get writing. Piecing together genetic genealogy is something that sometimes recalls Firesign Theatre's old (and by old, I mean 60's) sketch: "Where're you from?" "Nairobi, ma'am. Isn't everybody?" Regardless, "genetic genealogy blogs" have been proliferating, and as hinted earlier there are some tools to help make things a lot easier.
In this post I'll turn our attention to ADMIXTURE; to most people who've looked at personal genetics this may be familiar from GEDmatch's "Ad-Mix Utilities". Unfortunately there are some misconceptions about them, which I'm in part trying to set straight here while exploring my own ancestry. The first tidbit should be self-evident but doesn't always seem to be: since all of the admixture calculators on GEDmatch give different results, they can't possibly all be correct.
So what, then, does ADMIXTURE do? To give a fairly technical summary, it tries to determine the frequency of each Single Nucleotide Polymorphism in K different populations, and the contributions of those K populations to the genetic makeup of each individual in the analysis. For example, if SNP 1 has a frequency of 25% in population A and 75% in population B, then an individual carrying a copy of SNP 1 from both parents has a 0.25^2 = 6.25% chance of that genotype if they come from population A, versus a 0.75^2 = 56.25% chance if they come from population B - which, with nothing else to go on, works out to roughly 10% odds for population A and 90% for population B.
But if they had the SNP - a single base change - from just one parent, then the odds would be 50/50. Still, this wouldn't necessarily mean that they had one parent from each population. Now, if you apply this analysis to 100,000 different SNP's and determine their contributions across 20 different populations for 3,000 individuals, we get much closer to the real situation. The results are still more a probability than a fraction, though, and it all hinges on the correct selection of the sample individuals and the number of populations K.
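Just to make the toy example concrete, here's the arithmetic spelled out (a sketch assuming Hardy-Weinberg proportions within each population and an equal prior on the two populations):

freq_a, freq_b = 0.25, 0.75  # frequency of the variant allele in populations A and B

def genotype_likelihood(freq, copies):
    """Probability of carrying 0, 1 or 2 copies of the variant, given the allele frequency."""
    return {0: (1 - freq) ** 2, 1: 2 * freq * (1 - freq), 2: freq ** 2}[copies]

for copies in (2, 1):
    la = genotype_likelihood(freq_a, copies)
    lb = genotype_likelihood(freq_b, copies)
    post_a = la / (la + lb)  # equal prior on the two populations
    print(f"{copies} copies: P(A) = {post_a:.1%}, P(B) = {1 - post_a:.1%}")

# Prints 10.0% / 90.0% for two copies, and 50.0% / 50.0% for one copy.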
One good article from the Eurogenes blog deals mainly with my opening pitch: because ADMIXTURE compares (or classifies) individuals into different population clusters, it is only going to suggest differences from those clusters. Put another way, if there were a "Finnish" cluster, a Finnish individual might get a result of "100% Finnish", with no information about their deeper ancestral makeup. It turns out this is usually what people want out of genetic genealogy, i.e. to display only recent admixture, but it can still provide surprises.
With those caveats in mind, I set out to run some experiments myself. To cut to the chase and provide a gentle introduction both for myself and any readers, I opted to use a dataset prepared by Razib Khan over at Discover Magazine's Gene Expression. Incidentally, since there's a limited supply of public SNP genotypes with accompanying ancestry information, that may also be pretty much the only way to start off. I expect to revisit that issue later, however.
Since I'm also interested in my own genetic origins, I merged in my own genotypes. A peculiarity about this is that my known ancestry is about 3/4 Finnish and 1/4 Germanic, while Razib's dataset contains basically no Northern Europeans to compare with. I was curious to find out whether I would form my own population cluster (isn't everybody?) or what other populations I would be mapped to. So with that, onward to Finding K...
Thursday, August 29, 2013
New background
Huge news! Changed the blog background.
Well okay, maybe that's not so huge, but at least it may be a little easier on the eye. I'll look at customizing the background further at some point.
In other news, the GEDmatch Neanderthal/Denisova genome comparison tool I covered in the last post has been disabled for a few days. Wonder what's up with that? I doubt it can be due to too high a load, as the site suffers under high load all the time, and those features can't be more demanding than many of the others they offer.
Hopefully they'll be back soon, with a little more detailed breakdown of the similarity percentages.
No posts on the blog doesn't necessarily mean nothing's going on in the background... I've been running some experimental analyses on admixtures, as briefly explained in an earlier blog post. As of today, my computer's been crunching on the data for 2 weeks straight. The idea was to get some basic sample data to illustrate a few points about admixtures, in particular as implemented by the ADMIXTURE utility.
I didn't expect to spend so much computer time on the initial exploration, but it's been producing somewhat more varied results and running a little faster than I expected, so I've decided to let it run longer - though I'm fast running out of reasonable scope. I may get around to posting some basics about it here this week.
The blog has so far been about genetic testing and genetic genealogy, and I'm considering keeping it that way, as this may make it easier for people interested in the topic to find it. The blog name refers to a class of algorithms often used in complex simulations and analysis (which was to be my major at university) in any case, but it's also oftentimes an apt description of my methods, in particular with regard to blogs and such :)
Saturday, August 24, 2013
GEDmatch and you
In my last post, trying to explain admixtures, I referred to Wikipedia's statement that "While there are significant differences among the genomes of human individuals (on the order of 0.1%), these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees (approximately 1%) and bonobos", leading me to assert that humans and chimpanzees share approximately 99% of the base pairs in their genomes.
As I already hinted back then (by pointing out that according to dbSNP the variation among individual humans is at least 2%, although I grant this is likely to include severe structural differences as well), this doesn't seem to be quite as clear-cut as the page makes it sound. On another page, Wikipedia states: "According to preliminary sequences, 99.7% of the base pairs of the modern human and Neanderthal genomes are identical, compared to humans sharing around 98.8% of base pairs with the chimpanzee. (Other studies concerning the commonality between chimps and humans have modified the commonality of 98% to a commonality of only 94%, showing that the genetic gap between humans and chimps is far larger than originally thought.)"
Now, while it's obvious Wikipedia's authors can't count, rounding 98.8% and calling it 98%, I guess we should just conclude we don't really know. One reason seems to be that the field is moving much faster than glacial institutions can keep up with. Around the middle of the last decade it was estimated that Homo sapiens would have around 10 million Single Nucleotide Polymorphisms, i.e. single-base differences in the genome. That number was surpassed in 2006 or so, but it's still not uncommon to see sites referring to the 10 million figure; as of this writing the count is past 60 million and still growing, though more slowly. Most of those must be very rare, however.
Why does this matter, unless you're planning a career in genetics? Well, one reason is that GEDmatch, the leading free (as long as you have your genotypes determined by 23andMe, FTDNA or the like first) SNP genealogical analysis site, has just added a test to show how close your genome is to the Neanderthal and Denisova genomes. This, in turn, adds greatly to the confusion around genome similarity percentages.
For me, I get:
Neanderthal:
Total SNPs in common: 924069
Match on at least one allele: 716608 (77.5%)
Match on both alleles: 432012 (46.8%)
Denisova:
Total SNPs in common: 923704
Match on at least one allele: 719676 (77.9%)
Match on both alleles: 436733 (47.3%)
Say what? We just read that the Neanderthal genome is 99.7% identical to that of Homo sapiens, and here we have a result of around 78%. It certainly doesn't help that 23andMe calls this 2.9%, with a 2.7% European average. So... what gives?
Let's first consider Wikipedia vs. GEDmatch. The answer is NOT "because Wikipedia sucks", as happy as that answer would seem to make many people. What GEDmatch is apparently actually measuring - what they have to work with - are the SNP's in the genotype file: the roughly one million variable spots (out of 60+ million known) tested in an individual's genome. This is the "Total SNPs in common" line, which is a bit of a misnomer; it's the number of SNP's that have been tested both for the hominid in question and for the individual being compared against them. So the amount matched is out of those SNP's, and not out of the complete genomic sequence, which we already know to be mostly identical. In short, GEDmatch's numbers and percentages are out of the known variable locations tested, while Wikipedia's are out of the total genome.
We can do a quick back-of-the-napkin estimate: if about 1 million out of 60 million known SNP's have been tested, the number of matching alleles (here used as a synonym for SNP; geneticists seem to love synonyms with slight nuance differences) out of all known variations is possibly about 60 times higher than listed here. Plugging in the exact numbers, we have (60 000 000 / 924 069) * 716 608, or about 47 million matching.
But as already said, the rest of the over 3 billion base pairs - those not in dbSNP - are likely to be identical. How identical? Well, using Wikipedia's figure of 99.7% would just take us full circle, so let's avoid that. According to the previous calculation, I can expect to have roughly 60 million minus 47 million, or 13 million, base pairs that aren't identical with the Neanderthal sample. Out of 3 billion base pairs total, that is about 0.4%. So I would personally be about 99.6% identical with the Neanderthal sample - not far off from Wikipedia's 99.7%.
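The same napkin math in code form, for anyone who wants to poke at the assumptions (the 60 million and 3 billion figures are the rough ones used above):

KNOWN_SNPS = 60_000_000       # roughly, per dbSNP at the time of writing
GENOME_SIZE = 3_000_000_000   # base pairs, roughly

snps_in_common = 924_069      # GEDmatch "Total SNPs in common" (Neanderthal)
matched = 716_608             # "Match on at least one allele"

# Assume the tested SNP's are representative of all known variable sites
matching_sites = matched / snps_in_common * KNOWN_SNPS
mismatching_sites = KNOWN_SNPS - matching_sites

# Everything outside the known variable sites is treated as identical
similarity = 1 - mismatching_sites / GENOME_SIZE
print(f"~{matching_sites / 1e6:.0f} million matching sites, "
      f"~{similarity:.1%} genome-wide similarity")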
So what's up with 23andMe's 2.7%? Well, the short answer is: because Wikipedia sucks. No, but more seriously, 23andMe has a paper on this, basically blaming it on "ascertainment bias". The Illumina genotyping chip generally used is designed to test for about 1 million SNP's known to be common among Europeans - it is not designed to test for common differences from Neanderthals. As a result the SNP frequencies cannot be generalized over the whole genome as I've just done above (though this is somewhat debatable, because even studies done on the whole genome, which don't exhibit ascertainment bias, tend to arrive at around 99.5%).
The paper describes 23andMe's method of correcting for this ascertainment bias; it's too long to cover in this post, but suffice to say it's one model for trying to get at real, genome-wide similarity from biased samples. But why is their number so much lower than the other estimates? Well, that's because 23andMe is trying to estimate the amount of Neanderthal ancestry, not genome-wide similarity. Neanderthals and modern humans share a common ancestor, after all. I dare say this is slightly poorly defined, but I guess they're trying to say that 2.7% of the average European's ancestry came from Neanderthals. There's one in every family, apparently... and the rest were chimpanzees.
Again, there are a lot of ways to calculate numbers and percentages, and the results will invariably vary depending on the method. A prime example of "lies, damned lies, and statistics". Still, I think it'd be helpful for the various analysis services to at least try to clarify these rather than just throwing out a random percentage and leaving the details to be figured out.
Sunday, August 18, 2013
Biology 101
On the post topic: Well, not really.
As related earlier, the genetic testing companies do give you access to your raw genotype data, and I have a penchant for taking things apart to find out what they're made of... This, in turn, can only lead to one thing: genetic engineering! Okay, kidding on that as well, but there are some analyses I am looking to share.
I'm going to make some posts on genotypes and their analysis, though, and for any of that to be understandable I guess there's a handful of biological basics that need to be covered. To most people, I imagine, these should be fairly familiar from basic biology lessons, but after a decade or few it'd be no surprise to have forgotten some, and a refresher never hurts.
First I have to cover what is meant by "raw genotype data". The entries are called SNP's, or Single Nucleotide Polymorphisms - spots where a single base in the genetic code has changed into another. 23andMe will also test for a limited number of deletions, insertions and substitutions, which are pretty much what the names imply, but the majority of their raw results are classified as SNP's. In general, SNP's are considered UEP's, "Unique-Event Polymorphisms". What does this mean?
The human genome has 23 chromosome pairs with some 3 billion base pairs among them. About 99.9% of those are the same among all living people (and still around 99% with chimpanzees). That's according to Wikipedia; dbSNP currently seems to list at least 60,558,600 known SNP's for Homo sapiens, which gives about 2% of 3 billion, so there are clearly some differing counts. Regardless, the amount is significantly less than the whole genome, so why not just read the locations likely to differ? This is what "genotyping" does: the direct-to-consumer tests cover around a million representative SNP's.
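For the curious, the raw genotype data itself is just a plain text file listing the tested SNP's. A minimal Python sketch for poking at one could look like the following (the tab-separated rsid / chromosome / position / genotype layout is my recollection of 23andMe's format, so treat it as an assumption; the file name is made up):

from collections import Counter

def read_genotypes(path):
    # Skip '#' comment lines; each data line is: rsid, chromosome, position, genotype.
    snps = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            rsid, chromosome, position, genotype = line.rstrip().split("\t")
            snps[rsid] = (chromosome, position, genotype)
    return snps

snps = read_genotypes("raw_genotype_data.txt")   # hypothetical file name
print(len(snps), "SNPs tested")
print(Counter(genotype for _, _, genotype in snps.values()).most_common(5))

Running it should just confirm the roughly one million tested positions mentioned above.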
I like Wikipedia's definition of them as "represent the inheritance of events it is believed can be assumed to have happened only once in all human history", or more specifically "In genetic genealogy a unique-event polymorphism (UEP) is a genetic marker that corresponds to a mutation that is likely to occur so infrequently that it is believed overwhelmingly probable that all the individuals who share the marker, worldwide, will have inherited it from the same common ancestor, and the same single mutation event."
Unfortunately for genealogists everywhere, I suppose, Homo sapiens has been around for hundreds of thousands of years, and most SNP's probably predate even that. Consequently those individual SNP's have ended up all over the place, and it's not generally possible to point to any single SNP and say "only our clan has that". Instead, people have a random mix of them, and only their proportions vary between populations. And when people from those populations have children together, some of that structure remains, in what is called admixture.
To understand how those SNP frequencies can survive, I guess a good reference would be The Process of Meiosis, from whence I shall shamefully link this illustration. The blue and red chromosomes represent a matching chromosome pair (one inherited from each parent), already split apart and duplicated, within an individual. At the end of the meiosis shown here the chromosome pairs will still split down the middle, forming four gametes, one of which may pass on to a child. This process happens independently for each chromosome pair in the genome.
This way the child doesn't receive a truly random permutation of the parent's genotypes; rather, the genotypes they receive stay together between the recombination events, or "crossovers", along the genome. This is to say that "genetically close" genotypes, and thus genes, are likely to be passed on together to the offspring. These recombination events happen rarely enough that it is possible to track both relatives (more or less identical stretches of genotypes) and ancestral populations (ranges where the genotype distribution is typical for specific populations).
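To see why such long stretches survive intact, here's a small toy simulation in Python (entirely my own illustration, with made-up numbers): one child chromosome is copied from one parental copy at a time, switching copies only at a couple of random crossover points.

import random

random.seed(0)
SITES = 1000                                   # markers along one chromosome
mom, dad = ["M"] * SITES, ["D"] * SITES        # label which parental copy each marker came from
crossovers = sorted(random.sample(range(1, SITES), 2))  # assume ~2 crossovers

child, source, start = [], random.choice([mom, dad]), 0
for cut in crossovers + [SITES]:
    child.extend(source[start:cut])            # copy a whole stretch from one parental copy
    source = dad if source is mom else mom     # then switch at the crossover
    start = cut

print("crossovers at:", crossovers)
print("".join(child[:80]), "...")

The printout is just long runs of M's and D's: whole stretches of one parental chromosome passed on as-is, which is what makes both relative matching and admixture analysis possible in the first place.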
I guess that is enough for one post.
DTC Genetic Testing
Spurred by weird family lore, generic curiosity, and a good dose of always wanting to take things apart to find out what they're made of, I decided to turn to direct-to-consumer genetic testing.
I know I'm quite late to this train, but then again I don't consider myself rich. As such it's probably worth pointing out that since the end of 2012, 23andMe's autosomal DNA test has cost only $99, cheap enough to no longer count as a luxury. It's equally worth pointing out that for much of Europe the postage costs are almost another 100 dollars, and shipment by DHL may require setting aside a significant amount of time for receiving the test kit and sending it back yourself.
In my case, I decided to take the day off work when I got a call that the courier would drop off the package "somewhere between 10am and 4pm". Having dealt with DHL before, I knew I could've just called them to arrange delivery at my workplace instead, but I wanted to avoid all the hassle and was eager to get my hands on the sample kit. Plus, somehow I expected them to drop it off in the morning so I could head to work for the rest of the day... In the end it was closer to 5pm before the courier called to inform me he couldn't find my address, something that's become a running gag with deliverymen as of late...
Dropping off the sample kit was a different matter. 23andMe uses saliva for the testing; the amount of saliva they ask for - 10 ml - sounds small and easy to provide, but I can confirm it can be a little challenging, even if you do have 30 minutes before the stabilizer should be added. I expected there to be some kind of a snap or other indicator of when the tube was securely closed, but found nothing like that, while worrying the plastic tube would split if screwed on too tight - a clear design fault in my opinion. According to 23andMe's instructions, I should've then taken the sample to DHL's main office - a trip that would've taken most of a day and come at significant extra cost.
It didn't help that the package contained a scary paper intended for the local DHL main office, entitled "Attention DHL Express Operations Personnel - PLEASE READ", declaring that the "exempt human specimen" conforms to all IATA regulations and does not require UN3373 packaging as per Restricted Commodity Group registration number so-and-so, etc. On the waybill it's carefully labeled as "exempt human specimen, plastic test tube and inert buffer".
Emboldened by Google search results telling me everybody just had DHL come and pick it up for delivery, I called DHL to ask whether they would do so. The conversation with their customer service went somewhat along the lines of "Hello, can you come pick up a package for delivery?" "Is this a delivery or are you expecting a package?" "A delivery." "What's the customer number?" "I don't have a customer number, this is for a company called 23andMe and..." "I need a customer number."
Thus I desperately looked the waybill over, trying to find something that looked like a customer number. "Would a Payer Account Number do?" "Oh yes, tell me the Payer Account Number." Ah, now we're making progress... "What's in the package?" Okay, I guess this is where I'd be expected to spring that scary letter detailing what's in it; besides, I wasn't altogether sure how I should translate "exempt human specimen" into Finnish. "Well, you see, this is for a company called 23andMe and..." "I guess you didn't hear, what's in the package?" "Uh, a sample." "Ok... sample." I'm guessing they had a slightly different idea of "sample", but I suddenly decided it wasn't worth trying to explain.
To avoid a repeat of the experience with the initial reception of the kit, I arranged the shipment to be picked up at my workplace. Since we regularly deliver products around the world from the office, I made it clear to DHL that this was a private shipment that had nothing to do with the company and that they should ask for me by name when picking it up. Needless to say, when the courier came, he just announced "Picking up a package for DHL". The secretary scrambled to fetch the CEO to determine just what we were expected to be shipping that day. Luckily I was positioned close enough to the reception that I heard the commotion and took care of it.
Two days after picking up the sample shipment, DHL called back saying that customs clearance required my Social Security Number. There was a space for it on the DHL waybill; however, according to 23andMe's instructions it didn't need to be filled in. Apparently, customs had decided otherwise. Naturally there was no real way to verify that the call was from DHL, but I suppose whoever called at least knew I had recently shipped something by DHL. I can see this being a big sticking point for many people looking into personal genome services.
There are a couple of ways to avoid most of the hassle (besides, of course, living in the USA). One is using the services of Family Tree DNA, who have since changed their autosomal DNA testing price to $99 to match 23andMe's. I did not order from them and thus can't vouch for them, but according to their site shipping is by standard USPS first-class mail at $7 to most locations globally. I'm not sure whether that includes return postage.
So why not use FTDNA's services? The first reason is that for genealogy each service matches only against its own user base, and FTDNA's is much smaller. Also, FTDNA doesn't do health results at all - although you can look those up on many free services - and doesn't test for many of the conditions 23andMe does. Their "Ancestry Origins" feature is much less developed than 23andMe's, and they specifically do not determine the maternal and paternal haplogroups; for those you need to order separate tests from them.
For me that was enough to stick with 23andMe, but for others FTDNA could be enough. Both services will let you download a file with your SNP's and genotypes, which can be used at other services existing and yet to come. My own, perhaps unusual, verdict is that as a direct-to-consumer service 23andMe's test is too complex and too much hassle for international customers to count as a real commodity, but if you're willing to jump through some hoops it's still the more advanced and feature-rich alternative.
So those are my experiences with the test itself; hopefully they will help someone stumbling in here via Google while considering which test to take and what to expect.
First Post
I guess someone should tell Google about the power of different namespaces.
The first challenge in starting up a new blog is coming up with a cool-sounding name for it. The second challenge is discovering that all the cool names you can think of are already taken, so you end up with something more or less random.
Ah yes, indeed. I guess I'll leave the detailed story behind this blog's name for later, if ever, and just apologize for the rather unpronounceable name. While it would be a cool name for an outdoors blog of almost any kind, I'm afraid I'll have to disappoint anyone looking for that kind of content here. No, I expect this blog to be quite a different kind of random walk - geeky, and a bit quirky. Random, in other words.
I'll also reserve a few complaints for Blogger's usability and idiosyncrasies, though I suppose I'm only about to learn most of them. It helped a lot when I found a place to switch the user-interface language to English from my native Finnish, but some random texts still appear in Finnish... This is not entirely unprecedented, mind you; Windows, for example, keeps trying to shove Finnish down our throats even though many technology terms have no agreed-upon Finnish equivalents. User interfaces in Finnish end up looking more or less like Hebrew to me. Needless to say, this blog will be in English.
For all the complaints, voiced and unvoiced, I have about Blogger, I did have a look at some competing blogging platforms. As of this writing, the otherwise most appealing competitor's front page still has this informative tidbit on it: "Script error: /tmp/local_200605.xml does not exist. Please create a blank file named /tmp/local_200605.xml." Uh, okay. There were also several other oddities which make it seem like blogging is still an edgy, dangerous business not for the faint of heart. Oh well, I'm ready for the challenge (for the umpteenth time).