Computers, biology and 6 million Danes: medical patients have a history

This is a quick and rough translation from my French-language outreach blog, following requests from a few colleagues on Twitter. Original here. Note that the original was intended for outreach, not for communication to fellow bioinformaticians. But if bioinformaticians also find it interesting, why not?

Those who follow me on Twitter suffered last week, as I was at a bioinformatics conference, which I live-tweeted extensively. I learned a lot of interesting things, and I will try to cover several interesting results if I have time. First, the talk of Søren Brunak, a Danish medical bioinformatician:

Creating disease trajectories from big biomedical data

based notably on the article:

Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Jensen et al 2014 Nature Comm 5: 4022

As a starting point, a few Tweets by me or others:

The key concept for Søren is the « trajectory »: a medical patient has a past and a future, which should be reflected in their diagnosis and treatment. So he wants to use the data available in Denmark to determine statistically probable trajectories, and how they influence the effectiveness of treatments, the chances of survival or of complications, etc.

Søren's lab has used the complete data from Danish hospitals from 1996 to 2010: 6.2 million patients, 65 million encounters. We know in which order each patient had which diagnoses or treatments, and with what consequences. They found 1,171 significant « trajectories ». A trajectory is a sequence of diagnoses or medical procedures that are found to follow each other in a certain order more often than expected by chance.

For example:

[Figure: ncomms5022-f2]

In (a) there is a series of diseases which frequently follow each other, related to prostate cancer. In (b), these series are grouped to show all trajectories together.
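To make "more often than expected by chance" a bit more concrete, here is a minimal Python sketch of a permutation test on a toy cohort. This is not the procedure of the paper, just an illustration of the general idea: the data, the function names and the choice of statistic (the number of patients diagnosed with A before B) are all assumptions made for the example.

```python
import random


def count_a_before_b(a_dates, b_dates):
    """Number of patients diagnosed with A strictly before B.

    Each list has one entry per patient: the year of the diagnosis,
    or None if the patient never received it.
    """
    return sum(
        1
        for a, b in zip(a_dates, b_dates)
        if a is not None and b is not None and a < b
    )


def permutation_p_value(a_dates, b_dates, n_perm=2000, seed=0):
    """Estimate how often chance alone gives at least the observed count.

    The B records are shuffled across patients, which breaks any link
    between A and B while keeping the frequency of each diagnosis.
    """
    rng = random.Random(seed)
    observed = count_a_before_b(a_dates, b_dates)
    shuffled = list(b_dates)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_a_before_b(a_dates, shuffled) >= observed:
            hits += 1
    # Adding 1 to both counts avoids reporting a p-value of exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy cohort of 200 hypothetical patients (invented numbers):
# 30 have A in 2000 then B in 2003, 20 have only A, 20 have only B.
a_dates = [2000] * 50 + [None] * 150
b_dates = [2003] * 30 + [None] * 20 + [2003] * 20 + [None] * 130

observed, p_value = permutation_p_value(a_dates, b_dates)
print(observed, p_value)
```

In a real analysis this would be repeated for every pair of diagnoses, so the resulting p-values would need a correction for multiple testing before calling any pair significant.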

An important point is that this is determined automatically, starting from a fairly simple correlation between pairs of diagnoses. The probability of observing such a correlation by chance is estimated by random resampling of the data (shuffling observations at random, in other words) millions of times, with a correction for multiple testing. As these computations take time, they ran them on part of the data, and then used the results to validate a faster approach.

They then assembled pairs of diagnoses into series simply by chaining overlaps (if you have A -> B and B -> C, then you get A -> B -> C), again with a test for statistical significance; to limit statistical noise, paths followed by fewer than 20 patients were removed from the analysis.

The paths are grouped, as shown in (b) above, by Markov clustering. Checking my sources, I learned that this approach, although widely used in bioinformatics, has not really been published beyond a maths PhD thesis. The reference is the web page of the software provided by the aforementioned mathematician: MCL. Basically, the method looks in a graph (points linked by lines, see figure above) for « paths » that are more likely to be followed by a random walk in the graph; these paths correspond to subsets of the graph which are better connected internally. This yields subsets, e.g. of diagnoses, which should be grouped together. QED.

There’s more fun stuff in this work, including the development of a computational method to automatically understand the texts written by doctors in Danish, including negations (very important in diagnosis).
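For the curious, the Markov clustering idea described above can be sketched in a few lines of Python. This is a simplified toy version, not the MCL software itself: it alternates "expansion" (squaring the random-walk matrix, i.e. taking two steps at a time) and "inflation" (raising each entry to a power, which strengthens already likely walks and weakens the others). The function names, the toy graph of 8 hypothetical diagnoses, and the parameter values (inflation of 2, 50 iterations) are assumptions chosen for the example.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (probabilities of leaving a node)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Matrix square: probabilities of a two-step random walk."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(m, power):
    """Raise entries to a power, then renormalize: strong flows get stronger."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(adjacency, inflation=2.0, iterations=50):
    n = len(adjacency)
    # Self-loops are added so that a walker can also stay where it is.
    m = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep some flow are "attractors"; the columns they attract
    # form a cluster. Row sets that share a node are merged.
    clusters = []
    for row in m:
        members = {j for j, x in enumerate(row) if x > 1e-6}
        if not members:
            continue
        for c in [c for c in clusters if c & members]:
            members |= c
            clusters.remove(c)
        clusters.append(members)
    return sorted(sorted(c) for c in clusters)


def two_linked_groups():
    """Toy graph: nodes 0-3 all linked, nodes 4-7 all linked, one link 3-4."""
    n = 8
    adj = [[0.0] * n for _ in range(n)]
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    edges += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
    edges.append((3, 4))
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return adj


clusters = markov_cluster(two_linked_groups())
print(clusters)
```

On this toy graph the single link between the two groups carries too little flow to survive the repeated inflation, so the two tightly connected groups should come out as separate clusters. A higher inflation value tends to give smaller, more numerous clusters.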

Two larger graphs, for the sake of it:


Here we can see for example in (a) that most diseases which follow atherosclerosis, and which might be considered possible complications of it, are in fact complications of chronic obstructive pulmonary disease (COPD), which often but not always follows atherosclerosis.

[Figure: ncomms5022-f4]

Well, you have to admit that it’s pretty.

To show a little more of what can be found in these data, and the importance of personalized medicine, here are the frequencies of some types of diagnoses by gender and by type of visit: hospital in-patient, hospital out-patient, or emergency room:

[Figure: ncomms5022-f1]

We can see that women give birth (green) more frequently than men, and are usually hospitalized at that time. ;-) And injuries (in red) mostly affect men around 21 years old and are seen in the emergency room. This shows that these statistics mostly work.

As you have probably noticed in the tweets above, this study was made possible by very open legislation regarding the collection and use of personal data in Denmark. It is not clear that such studies could be performed in a similar manner in other countries, which may be less inclined to trust their government and their institutions. It is not obvious to me that this would even be desirable, although Søren Brunak clearly thinks it would. If such studies are not repeated elsewhere, there is the risk of having information that is very biased by Danish genetic risks, and especially by the Danish lifestyle, which is apparently characterized by fatty food and little exercise. Søren readily admitted that although the results were partially reproduced in Great Britain and the Netherlands, it would be difficult to generalize them to a Mediterranean country or to East Asia, for example.

The fact remains that the broad outlines of this study are probably correct very generally, and partial information of this kind is better than no information, in my opinion. A frequent complaint from patients about conventional doctors and hospitals is that their personal history is not taken into account, resulting in a tendency to go and see quacks who listen carefully to the history and provide reassurance about the future. We see here that the intelligent use of large amounts of medical data has the potential to allow a rational and truly useful use of patient histories.
