How should abundant gene expression data be presented?

Recently, it so happened that the example we had chosen to highlight our new Bgee gene page was also used as #Geneoftheweek by Ensembl. And Ensembl tweeted the expression of this gene, hemoglobin β (HBB), in ArrayExpress:

As noted in a recent post on the Bgee blog, summarizing the expression of a gene such as HBB (like that of most other genes, which are not extremely tissue-specific) is difficult: it is expressed in a great number of tissues and developmental or aging stages, yet its expression is more relevant in some than in others, and this expression is supported by a variety of data types and lines of evidence. This poses a problem for all expression databases: how to present expression data for a gene? I propose here a small review of some of the solutions chosen, and some thoughts on their trade-offs.

For each database I will include screenshots; clicking on them will take you to the database pages.

Here is the presentation as it was in the old Bgee interface:


I think that we can agree that this was not optimal. Advantages: no information hidden, organised according to the anatomical ontology, and clear links to data. Disadvantages: unreadable, redundant (because the ontology is a graph, not a tree), order without biological motivation (alphabetical order of ontology IDs), did I mention unreadable?

Here is the presentation in ArrayExpress Gene Expression Atlas (GXA):


Notice that, as for old Bgee, the information runs largely off the screen, although here it is more to the right than down; the second screenshot shows the result of scrolling all the way. It's nicer than old Bgee, but we see a similar philosophy of showing all the information, at the cost of potentially overwhelming the user. The picture on the left highlights the highest expression, although the underlying data can only be found by scrolling all the way; which is unlucky, since in this case « whole blood » comes last in the alphabetical order of tissues. Of note, hovering the mouse over the picture highlights tissues, if they are visible on screen at that moment. Images have clear advantages in terms of readability; they also bear a cost for scaling (i.e., making an accurate picture for every new species) and are limited in resolution. Another notable choice here is to present each individual experiment as a row of the table. Personally, when I look for a summary of gene expression I am not really interested in the experiment names, but maybe others are. The table format also presents scaling challenges: once 10 times more experiments and many more tissues or organs are added, the table will become very difficult to navigate.

Next, probably the best interface for a gene expression summary that I know of so far, the database TISSUES:


Here the choice is clearly to put forward what is hopefully the most relevant signal of expression. This is done by ordering evidence types (manual curation from UniProtKB/Swiss-Prot comes first), and by sorting by confidence. While TISSUES combines many sources of data, it does cover a lower number of conditions than Bgee, since it does not integrate in situ hybridizations, and only a small set of RNA-seq and microarray experiments (see Santos et al 2015). In this case, the visual body map corresponds well to the level of detail of the data, and only one species is intended to be covered, so there is no issue of scalability of anatomical representations.

There are many other databases which present expression data, but seem to demand that the user first define a dataset or a condition to look at, and are thus not relevant to the question of how to present a good overview. And of course many large data resources (such as GTEx or TCGA) present their own data, but then they do not have the challenge of integrating large quantities of diverse information that GXA, TISSUES or Bgee have.

Some more generalist databases do aim to present such integrated overviews. For example neXtProt shows a predefined subset of tissues, while GeneCards shows histograms from three datasets (one microarray experiment, one RNA-seq experiment, and SAGE). So in these cases detailed information is not presented, which is fine for such databases, but unsatisfying for a resource dedicated to expression. Moreover, both neXtProt and GeneCards are human-only, while GXA and Bgee must manage to present diverse species.

Thus we get to the recent beta release of Bgee’s new gene page:


As explained on the Bgee blog, we present a sorted list of anatomical terms, where the sorting is based on a weighted average score over all data types. Thus, top terms may be very precise or very general, and may be supported by RNA-seq, in situ hybridization, microarrays, or ESTs. These are not a priori decisions, but rather the result of an algorithm which tries to determine what is most informative for this gene. For each anatomical structure, development and aging are present as an unfoldable list.
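To make the idea concrete, here is a minimal sketch of ranking anatomical terms by a weighted average score across data types. The weights, scores, and term names below are purely hypothetical illustrations, not Bgee's actual algorithm or values:

```python
# Sketch: rank anatomical terms by a weighted average of per-data-type
# scores. All numbers below are invented for illustration only.

def rank_terms(scores_by_term, weights):
    """scores_by_term: {term: {data_type: score}}; weights: {data_type: weight}.
    Returns terms sorted by weighted average score, highest first.
    Only data types actually observed for a term enter its average."""
    def weighted_avg(scores):
        used = {dt: w for dt, w in weights.items() if dt in scores}
        total = sum(used.values())
        return sum(scores[dt] * w for dt, w in used.items()) / total

    return sorted(scores_by_term,
                  key=lambda t: weighted_avg(scores_by_term[t]),
                  reverse=True)

terms = {
    "whole blood":     {"rna_seq": 0.99, "affymetrix": 0.95},
    "bone marrow":     {"rna_seq": 0.90, "est": 0.60},
    "skeletal tissue": {"affymetrix": 0.40},
}
weights = {"rna_seq": 0.5, "affymetrix": 0.3, "est": 0.2}

print(rank_terms(terms, weights))
# ['whole blood', 'bone marrow', 'skeletal tissue']
```

The point of such an ordering is that a term supported by strong evidence from several data types floats to the top, whatever its level of anatomical detail.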

The advantages of this new approach, we hope, are that:

  • it is scalable to new data types, large quantities of data per type, new anatomical or developmental detail, and to new species.
  • all information is present (the 198 anatomical entities for HBB), but the most relevant information is presented first.
  • the source of information is immediately visible, in the form of little vignettes for data types, colored by quality.
  • users do not need to choose a data type, an experiment or a condition a priori.

It is a first release, and obvious features to add are links to source data, showing the score value of each tissue, and the possibility to download results.

Limitations which we accept as part of our design choices (at least for now) are that:

  • development and aging are somewhat hidden.
  • not all data is visible at once (but in fact, due to screen size limitations, it never is, see old Bgee and GXA).
  • the relations between anatomical terms are hidden (e.g., trabecular bone tissue is_a skeletal tissue).
  • there is no graphical representation of expression levels (as in bar charts or histograms).

Finally, it is difficult for a database with 17 species (and soon more), and a large level of anatomical and developmental detail, to represent anatomy by a visual body map. We would need one per developmental stage per species, which is not feasible and of debatable utility. And even then, these would not allow the visualization of fine cell types.

Now you understand where our compromise solution comes from. 😉

I hope that this subjective little tour is helpful in illustrating the challenges of representing the large and growing information that we have on gene expression in humans and other animals. Any great solutions which were overlooked here are welcome as comments below, or as tweets @marc_rr or @Bgeedb.

Posted in bgee, bioinformatics | Tagged , , | Comments closed on How should abundant gene expression data be presented?

Computers, biology and 6 million Danes: medical patients have a history

This is a rough and fast translation from my French-language outreach blog, following requests from a few colleagues on Twitter. The original is here. Note that the original was intended for outreach, not for communication to fellow bioinformaticians. But if bioinformaticians also find it interesting, why not?

Those who follow me on Twitter suffered last week, as I was at a bioinformatics conference, which I live-tweeted extensively. I learned a lot of interesting things, and I will try to cover several interesting results if I have time. First, the talk of Søren Brunak, a Danish medical bioinformatician:

Creating disease trajectories from big biomedical data

based notably on the article:

Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Jensen et al 2014 Nature Comm 5: 4022

As a starting point, a few Tweets by me or others:

The key concept for Søren is the « trajectory »: a medical patient has a past and a future, which should be reflected in their diagnosis and treatment. So he wants to use the data available in Denmark to determine statistically probable trajectories, and how they influence the efficacy of treatments, the chances of survival or of complications, etc.

Søren's lab has used the complete data from Danish hospitals from 1996 to 2010: 6.2 million patients, 65 million encounters. We know in which order each patient received which diagnoses or treatments, and with what consequences. They found 1,171 significant « trajectories ». A trajectory is a sequence of diagnoses or medical procedures that follow each other in a certain order more often than expected by chance.

For example:

In (a) there is a series of diseases which frequently follow each other, related to prostate cancer. In (b), these series are grouped to show all trajectories together.

An important point is that this is determined automatically, first using a fairly simple correlation between diagnoses. The probability of observing a correlation is estimated by random resampling of the data (shuffling observations, in other words) millions of times, correcting for multiple testing. As these computations take time, they performed this on part of the data, and then used the results to validate a faster approach. They assembled pairs of diagnoses into series simply by taking overlaps (if you have A -> B and B -> C, then we get A -> B -> C), with again a test for statistical significance; to limit statistical noise, paths with fewer than 20 patients were eliminated from the analysis. The paths are grouped, as shown in (b) above, by Markov clustering. This is how I learned, by checking my sources, that this approach, widely used in bioinformatics, has not really been published beyond a maths PhD thesis; the reference page is that of the software provided by the aforementioned mathematician: MCL. Basically, the method looks in a graph (points linked by lines, see figure above) for « paths » more likely to be walked randomly, these paths corresponding to subsets of the graph which are better connected, thus forming subsets, e.g. of diagnoses, which should be grouped. QED. There's more fun stuff in this work, including the development of a computational method to automatically understand the free text written by doctors in Danish, including negations (very important in diagnostics).
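The overlap-chaining step can be sketched in a few lines. This is only an illustration of the idea (A -> B and B -> C combine into A -> B -> C); the real analysis by Jensen et al. additionally tests each extended path for significance and discards paths with fewer than 20 patients, and the diagnosis names below are invented for the example:

```python
# Sketch: assemble significant diagnosis pairs into longer trajectories
# by overlap. Diagnosis names are illustrative, not from the real data.

def extend(trajectories, pairs):
    """One round of extension: append C to any path ending in B
    whenever (B, C) is a significant pair; avoid revisiting a diagnosis."""
    extended = []
    for path in trajectories:
        tails = [c for (b, c) in pairs if b == path[-1] and c not in path]
        if tails:
            extended += [path + (c,) for c in tails]
        else:
            extended.append(path)  # cannot be extended further
    return extended

pairs = [("prostate hyperplasia", "prostate cancer"),
         ("prostate cancer", "anaemia"),
         ("prostate cancer", "bone metastasis")]

paths = [(a, b) for (a, b) in pairs]
while True:             # iterate until no path can be extended
    new = extend(paths, pairs)
    if new == paths:
        break
    paths = new

print(paths)
```

Running this yields both three-step trajectories through prostate cancer, plus the unextendable pairs, which is exactly the kind of set that Markov clustering would then group.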

Two larger graphs, for the sake of it:


Here we can see, for example in (a), that most diseases which follow atherosclerosis, and which may be considered possible complications of it, are rather complications of chronic obstructive pulmonary disease (COPD), which often but not always follows the atherosclerosis.

Well, you have to admit that it's pretty.

To show a little more of what can be found in these data, and the importance of personalized medicine, here are the frequencies of some types of diagnoses by gender and by type of visit: hospital in-patient, hospital out-patient, or emergency room:

We can see that women give birth more frequently (green) than men, and are usually hospitalized at that time. ;-) And injuries (in red) mostly affect 21-year-old men and are seen at the emergency room. Showing that these statistics mostly work.

As you have probably noticed in the tweets above, this study was made possible by a very open legislation regarding the collection and use of personal data in Denmark. It is not clear that such studies could be performed in a similar manner in other countries, which may be less inclined to trust their government and institutions. It is not obvious to me that this is even desirable, contrary to what Søren Brunak clearly thinks. If such studies are not repeated elsewhere, there is a risk of having information strongly biased by Danish genetic risks, and especially by the Danish lifestyle, which is apparently characterized by fatty food and little exercise. Søren freely admitted that although the results were partially reproduced in Great Britain and the Netherlands, it would be difficult to generalize to a Mediterranean country or to East Asia, for example.

The fact remains that the broad outlines of this study are probably very generally correct, and partial information of this kind is better than no information, in my opinion. A frequent complaint from patients against traditional doctors and hospitals is that their personal history is not taken into account, resulting in a tendency to go see quacks who listen carefully to that history and provide reassurance about the future. We see here that the intelligent use of large amounts of medical data has the potential to allow a rational and truly useful utilization of patient histories.

Posted in bioinformatics, translation from French | Comments closed on Computers, biology and 6 million Danes: medical patients have a history

Molecular Biology and Evolution impact factor: the MEGA effect, updated

Almost 2 years ago, I calculated how the very high impact factor of Molecular Biology and Evolution was entirely due to one paper, reporting the software MEGA: Molecular Biology and Evolution impact factor: the MEGA effect. A recent Twitter exchange prompted me to update these numbers. Spoiler: the MEGA effect has grown.

Reminder of the 2013 numbers:

  • Cites in 2012 to items published in: 2011 = 3807; 2010 = 1939
  • Number of items published in: 2011 = 297; 2010 = 258

The MEGA5 paper was cited 2,473 times in 2012, so removing it and its citations, we get:

  • 3807-2473+1939 = 3273 cites
  • 297-1+258 = 554 items
  • thus 3273/554 = 5.9 impact factor.

Updated in March 2015:

  • Cites in 2013 to items published in: 2012 = 1800; 2011 = 6656
  • Number of items published in: 2012 = 294; 2011 = 297

The MEGA5 paper was cited 5,052 times in 2013, so removing it and its citations, we get:

  • 1800+6656-5052 = 3404 cites
  • 294+297-1 = 590 items
  • thus 3404/590 = 5.8 impact factor.

Instead of an IF of 14 with MEGA. Yippee.
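The arithmetic above can be checked with a short sketch, using the citation and item counts quoted in this post and its 2013 predecessor:

```python
# Quick check of the MEGA-free impact factor arithmetic.
# Numbers are the ISI counts quoted in the post.

def if_without_mega(cites, items, mega_cites):
    """Two-year impact factor after removing the MEGA5 paper
    (1 item) and its citations from the window."""
    return (cites - mega_cites) / (items - 1)

# 2012 IF window: cites in 2012 to 2010-2011 items, MEGA5 cited 2473 times
print(round(if_without_mega(3807 + 1939, 297 + 258, 2473), 1))  # 5.9
# 2013 IF window: cites in 2013 to 2011-2012 items, MEGA5 cited 5052 times
print(round(if_without_mega(1800 + 6656, 294 + 297, 5052), 1))  # 5.8
```

In other words, one paper out of nearly 600 accounts for the difference between an IF of about 6 and an IF of about 14.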

Posted in publishing, statistics | 2 comments

Exercise for Bachelor students on using #ENCODE data in a browser

Almost two months ago, I asked on Twitter:

To which I got no answer. I had searched the web for such an exercise beforehand, without success.

The context is that I teach « Bioinformatics for genomics » to 3rd-year Bachelor of Biology students, who do not know Unix, and I wanted to replace an old 2-hour hands-on exercise characterizing a Fugu conserved non-coding element with a new one. The new exercise should illustrate the wealth of functional genomics data available through a web browser, and how to use it to answer biological questions.

Finding a good example was not easy. As Dan Graur would probably like to point out, most genes or genomic regions of some interest which I looked into had so much signal of various types as to make interpretation very difficult. So I turned to a data-driven starting point, and considered the genes with the most tissue-specific expression based on mouse and human RNA-seq (which we are analyzing anyway). Several of these had very boring profiles (I need a story to tell students in 2 hours), but I believe we found a nice example in the end.

Here is the exercise in its present form; we will probably modify it a bit next year. Feel free to reuse with citation. Feedback to improve it, or ideas for a better example, are welcome.


Amelogenin regulates biomineralization, especially in tooth development. In eutherian mammals, the gene copies on the X and Y chromosomes have diverged, following the transposition of one copy into an intron of the Rho GTPase-activating protein 6 gene. In this exercise, we are going to use available genomic and functional genomic data, especially from the ENCODE consortium, to compare the regulation of these three genes: AMELX, the X chromosome copy, AMELY, the Y chromosome copy, and ARHGAP6, the GTPase-activating protein.

Go to the UCSC genome browser and find the genomic region containing AMELX in the human genome.

Zoom out 3X to see the gene in a broader context.

We will first consider histone modification marks. Click on « ENCODE Regulation » under « Regulation » to tune the presentation of the ENCODE tracks. Change all histone marks to « full » view. Back in the main view, right-click on each « Layered H… », choose « Configure », and set « Overlay method » to « none ».

1. Which histone marks have peaks in the neighborhood of the AMELX gene?
2. In which cell lines?

Zoom out more, until you see the full ARHGAP6 gene. Compare the histone marks which characterize the AMELX and ARHGAP6 genes.

3. Are AMELX and ARHGAP6 active in the same cell lines?

As you did for the histone marks, show the detailed transcription factor ChIP-seq results.

4. Are AMELX and ARHGAP6 regulated by the same transcription factors?

In a separate browser tab, find the genomic region containing AMELY in the human genome. Use settings similar to those used for the AMELX region.

5. How does the activity and regulation of AMELY compare to that of AMELX?

Zoom out 100X for both regions (AMELX and AMELY) (this might be slow).

6. Do you see more transcripts on the X or the Y?
7. Are the histone marks and transcription factor binding density consistent with this difference?
8. How do you interpret these observations in terms of biology of the X and Y chromosomes?

Posted in bioinformatics, genomics, training | Comments closed on Exercise for Bachelor students on using #ENCODE data in a browser

Student blogs from my graduate course « Blogging and using Twitter for scientific communication » @unil

I’m running a soft-skills course for the PhD schools of Ecology and Evolution and of Genomics (StarOmics): Blogging and using Twitter for scientific communication. During the class, I asked the students to create a blog if they didn’t have one, and write a post. It can be very specialized or very general, aimed at scientists, the educated public, or school kids, and can be in any language (in practice we got English and French, despite some participants expressing interest in blogging in Portuguese or Spanish).

Here are their posts:

Overall, a large diversity of tones and target audiences, from scientific colleagues to the general public, from very technical to quite activist. All of them have managed to provide a nice combination of personal tone and scientific content, in these various ways.

I look forward to seeing these blogs bloom and develop; hopefully at least some will continue after the course!

Link to my slides on figshare.

Posted in training | One comment

Scientific dogma: You keep using that word. I do not think it means what you think it means.

Reading a nice paper on human Y chromosomes (Poznik et al 2013), I was struck by the use of the word dogma at the end of the paper:

Dogma has held that the common ancestor of human patrilineal lineages, popularly referred to as the Y-chromosome “Adam,” lived considerably more recently than the common ancestor of female lineages, the so-called mitochondrial “Eve.”

But the word dogma means:

a principle or set of principles laid down by an authority as incontrovertibly true (Wikipedia)

I do not know of any such principles in science. Principles that were laid down by an authority and must be accepted, i.e., cannot be discussed or doubted?

Interestingly, there is some discussion of the use of « dogma » in scientific discourse on the Wikipedia Talk page, and there is nothing conclusive, apart from the well-documented erroneous usage by Crick for the « central dogma of molecular biology ». Indeed, sequence information could in principle be transferred from protein to DNA or RNA without leading to any excommunication from molecular biology.

We have empirical generalities, theories, models, predictions, and well-established facts. All hold until proven otherwise, and none is dogma. Indeed, Poznik et al have neither left nor been expelled from any dogmatic community of human population genetics as the price of their observations, as far as I know.

And yet, like « paradigm shift », « dogma » is overused in the literature. I find 9,006 occurrences in a EuroPMC full-text search, and 251 occurrences in PubMed titles. I think that this title wins a prize:

Renal-dose dopamine: from hypothesis to paradigm to dogma to myth and, finally, superstition? (Jones & Bellomo 2005)

It seems that in science we like to feel revolutionary, and what better way to make a place in history than to overturn dogmas and paradigms? Maybe we should be more modest, and accept that this is what we are doing:

From The illustrated guide to a Ph.D (click on picture)


Note: there are 1,934 putative comments awaiting moderation on this blog, which are probably all or almost all spam; in addition, there are 500 comments in my spam folder. So if you comment, please also contact me by email or Twitter (@marc_rr) so that I can unblock your comment specifically. UPDATE: Problem solved, thanks to the UNIL I.T.

Posted in fun | 2 comments

Story behind the paper: The hourglass and the early conservation models – co-existing evolutionary patterns in vertebrate development

Following the noble example of Jonathan Eisen, and in the spirit of open science, here is the story behind our recent paper (not so recent, because it took me a while to get around to this, but here it is):

Piasecka B, Lichocki P, Moretti S, Bergmann S, Robinson-Rechavi M (2013) The Hourglass and the Early Conservation Models—Co-Existing Patterns of Developmental Constraints in Vertebrates. PLoS Genet 9(4): e1003476. doi:10.1371/journal.pgen.1003476

We’ve been interested in the hourglass pattern (the idea that mid-development is more evolutionarily conserved than early or late development) for a while, but our early efforts to investigate it using genomics and bioinformatics were not very successful. We did find a significant impact of development on molecular evolution, but nothing like an hourglass (mostly an « early conservation » pattern – Roux et al 2008, Comte et al 2010). And then in December 2010, out came a zebrafish hourglass paper by Domazet-Lošo and Tautz, which shared the cover of Nature with a fly hourglass paper. Domazet-Lošo and Tautz reported a very fine-grained microarray experiment over zebrafish development, and an original analysis indicating that genes expressed in mid-development would be older, thus more conserved.

I was very excited by this paper, and looked forward to building on it to analyse in more detail this elusive hourglass pattern. But when we dug deeper into it, we found some problems with the analysis. Indeed, re-analysing the data using standard microarray procedures produced very different results, with older genes in early development. At this point, we re-contacted Tomislav Domazet-Lošo and Dieter Tautz to discuss our findings. Of note, they were very generous and open about sharing their data and discussing it as soon as we contacted them upon reading the paper, for which I thank them sincerely.

First, I need to explain how the original study was performed. The authors performed an incredibly detailed microarray measurement of gene expression at 60 stages of zebrafish ontogeny. Separately, they calculated the age of each gene, as the age of the common ancestor between zebrafish and the furthest species with a sequenced genome in which a homolog could be detected (sounds complicated, but it is straightforward if you think about it). The ages were not taken in millions of years, but as ranks along the phylogeny: if my ancestor of interest is three nodes from the root of the tree of species-with-sequenced-genomes, it gets a value of 3. These values were then combined into the « transcriptome age index », or TAI, which is defined for each sample (here, each stage of ontogeny) as the sum of gene ages weighted by their expression level at this stage. E.g., a gene with age rank 3 contributes 3 × its expression at each stage. Thus older genes contribute smaller ranks, and genes more expressed at a stage contribute more to the TAI of that stage. The authors thus interpret a strong dip in TAI around the pharyngula stage as evidence for a stronger contribution of older genes at this ontogenic period.

Our observations were:

  • the original microarray data, like most transcriptome data, are log-normal, i.e. there are many low values and a few extremely high values. Log-transformation recovers something pretty close to a normal distribution, and is common and recommended practice in transcriptome analysis (including in other « hourglass » papers).
  • computing the TAI after log-transformation, the original pattern is lost, and a pattern of older genes in early development is recovered.
  • using other metrics of the relation between gene age and expression over development always recovers this pattern of older genes in early development: correlation between gene age and expression level at each stage; calling genes present/absent and calculating the average gene age at each stage; ratio of expression of oldest to youngest genes.
  • also, alternative ways of treating the data recover adult male-biased genes as younger than female-biased genes, consistent with the literature but in contradiction with the original paper.

We also noticed that the paper discussed in detail variations in the TAI which fell within its confidence interval, and even, in one case, a variation entirely due to one outlier probe (whose effect is removed by log-transformation).
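To illustrate why the log-transformation matters, here is a minimal sketch of the TAI computed as an expression-weighted mean of gene age ranks, on toy data (the genes, ranks, and expression values are invented for the example, and this is my reading of the index, not the authors' code). A single extremely expressed young gene dominates the raw TAI, while after log2 transformation its influence shrinks:

```python
import math

# Sketch: TAI as an expression-weighted mean of gene age ranks, on toy
# data. One extremely expressed young gene dominates the raw TAI; after
# log2(x + 1) transformation its influence is much reduced.

def tai(ages, expr):
    """Expression-weighted mean age rank for one stage."""
    return sum(a * e for a, e in zip(ages, expr)) / sum(expr)

ages = [1, 1, 2, 10]            # age ranks: small = old, large = young
expr = [5.0, 8.0, 6.0, 1000.0]  # one young gene with outlier expression

raw = tai(ages, expr)
logged = tai(ages, [math.log2(e + 1) for e in expr])
print(round(raw, 2), round(logged, 2))  # raw ≈ 9.84, logged ≈ 5.99
```

The raw index is pulled almost entirely toward the outlier's age rank of 10, which is exactly the kind of sensitivity our re-analysis was concerned with.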

The answer from Tomislav Domazet-Lošo and Dieter Tautz can be summarized as follows: the TAI in its original form is intuitive, and has the nice property of always adding up to the same total across different microarray measures, which is not the case for a TAI based on log-transformed data. There was also some discussion of the proper role of mathematics in biology.

In coordination with them, we submitted a letter to Nature, which was reviewed and rejected as being « too technical ». I wanted to submit it elsewhere immediately, but Barbara Piasecka (the student who took the lead on this re-analysis) wanted to improve much more on our previous work and this one.

Now I made a mistake which I regret: not making our original correspondence for Nature available on arXiv. By not doing so, we uselessly delayed public discussion of the issues with the TAI, which was later used in other papers.

Anyway, Barbara analyzed the microarray data with a modular approach, finding nice modules of gene expression specific to groups of ontogenic stages. Her first analyses of these modules were a bit disappointing, since we either found no pattern, or an early conservation pattern which we already knew. Then we had the idea to look at non-coding conservation, and there was a striking « hourglass »-type pattern. Interestingly, this is found not only in sequence conservation (conserved non-coding elements), but also in transposon-free regions, which I find fascinating because they provide an orthogonal view of some type of constraint on a genomic region, and in conserved micro-synteny.

Once we had a nice paper (and a nice poster, here at ECCB) (UPDATE: the poster on FigShare), Evo-Devo colleagues encouraged us to submit it to a journal with a wide readership. But both PLOS Biology and PLOS Genetics turned it down. I have found that it can be very difficult to get this type of interdisciplinary paper published (by the way, in this case I did submit it to arXiv). It contains evolution, development, bioinformatics, genomics, molecular evolution. Where does it belong? Our 2008 paper was turned down by Mol Biol Evol before « falling up » to PLOS Genetics. The latter journal saved the day again, while we were shopping for other open access alternatives. After the publication of the plant hourglass paper, and some very constructive discussion with Greg Barsh, Editor-in-Chief, the paper went through the submission process and was accepted with minor changes. Wow! End of a long story.

What can we take home from this story? First, that biology is complicated, and insisting on answers such as « the hourglass exists (and explains diverse data) » or « it doesn't » may not be the best strategy. Second, that the technical details are very important. In fact, I would say that they are an essential characteristic of science. And related to that, third, that the emphasis of journals such as Nature on « broad impact » or whatever it is can cause them to simply ignore the « technicalities » on which the correctness of the conclusions depends. Fourth, that the refusal of Nature to publish such a letter considerably delayed discussion of the limitations of a widely discussed paper. Fifth, that next time I have remarks on a Nature or Science paper, I will first relay them on blogs and on arXiv, rather than keep them on my hard disk. Sixth, that open-minded and interactive editors-in-chief are very important to publishing interdisciplinary science.

And my last point will be on the casualties of the impact-factor cult. Not only is a paper widely assumed to be important because of where it is published, but some of the reviewers of our correspondence with Nature, while abstaining from judging the content of our analysis, wrote that we were probably doing this just to get a Nature paper. No, we were not; we were following the proper and indicated procedure when there is an issue with a published paper. This Nature/Science effect is so strong that it twists all normal scientific discourse. And that is truly a pity.

Posted in bioinformatics, evolution, genomics, story behind the paper | 3 comments

Molecular Biology and Evolution impact factor: the MEGA effect

(update March 2015 here)

So yesterday the new impact factors for journals came out, and while all serious people agree for most of the year that they don't mean much, when they come out and your favorite journal has gone up, you notice.

Molecular Biology and Evolution (MBE) is the main journal of the molecular evolution community. Its impact factor just went up from 5.5 to 10.5. Last time there was such a jump, it was due to citations to a paper describing an update of the MEGA software, so I suspected something of the sort. And indeed, here are the top-cited papers in MBE in 2011-2012:

Title: MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods
Author(s): Tamura, Koichiro; Peterson, Daniel; Peterson, Nicholas; et al.
Source: MOLECULAR BIOLOGY AND EVOLUTION Volume: 28 Issue: 10 Pages: 2731-2739 DOI: 10.1093/molbev/msr121 Published: OCT 2011
Times Cited: 4,135 (from All Databases)

Title: Bayesian Phylogenetics with BEAUti and the BEAST 1.7
Author(s): Drummond, Alexei J.; Suchard, Marc A.; Xie, Dong; et al.
Source: MOLECULAR BIOLOGY AND EVOLUTION Volume: 29 Issue: 8 Pages: 1969-1973 DOI: 10.1093/molbev/mss075 Published: AUG 2012
Times Cited: 64 (from All Databases)

Title: Statistical Properties of the Branch-Site Test of Positive Selection
Author(s): Yang, Ziheng; dos Reis, Mario
Source: MOLECULAR BIOLOGY AND EVOLUTION Volume: 28 Issue: 3 Pages: 1217-1228 DOI: 10.1093/molbev/msq303 Published: MAR 2011
Times Cited: 33 (from All Databases)

Notice anything? It goes on down from there.

Confirmation: the previous MEGA paper in MBE was published in 2007. Here is the graph of citations to MBE from ISI over the years:


Clicking will take you to ISI. If you don't have a subscription, tough luck; that's what happens when we let a corporation rank our journals.

Let’s try to recalculate the impact factor of MBE without MEGA for the last two years. According to ISI, these are the numbers:

  • Cites in 2012 to items published in: 2011 = 3807; 2010 = 1939
  • Number of items published in: 2011 = 297; 2010 = 258

The MEGA5 paper was cited 2,473 times in 2012, so removing it and its citations, we get:

  • 3807-2473+1939 = 3273 cites
  • 297-1+258 = 554 items
  • thus 3273/554 = 5.9 impact factor.

Who would have thought? It didn’t change!

Amusingly, there was an even more striking jump in the impact factor of Briefings in Bioinformatics a few years ago (I don't have time to look up the numbers), which was also entirely due to a MEGA update.

Anyway, have fun checking your new impact factors. Don't forget the three decimal places. For MBE, it's 10.353.

Posted in publishing, statistics | One comment

Don’t complain about NCBI taxonomy, improve it!

This morning I learnt that NCBI taxonomy has added the term Dipnotetrapodomorpha, following a request which I made four days ago on behalf of our database Bgee. More details on the Bgee blog.

I would like to add a more subjective note here. This is the second time that we have contacted NCBI taxonomy. Previously, we asked them to remove Coelomata, which was wrongly used for a grouping at the base of Bilateria. They also reacted very rapidly.

I often hear systematists complain about NCBI taxonomy. Less complaining among systematists, and more requests for change sent to NCBI, would probably be more helpful. When you have an issue with NCBI taxonomy, find references and write a constructive email. I put our recent email to NCBI in the Bgee blog post. For Coelomata, it was easy to cite references to the effect that this was not the proper use of the term, and that the proper use described a discredited grouping (1, 2, 3).

It’s like Wikipedia. Everyone uses it. You can complain that everyone uses a resource that you find sub-par, or you can improve it.

Posted in bioinformatics, evolution | Tagged | One comment

A population genetics test for Junk DNA

A short note after reading Dan Graur's latest series of angry posts about the misuse of the term Junk DNA, or claims of its death: one, two, three.

Population genetics tells us that selection is more efficient in larger populations. Consequently, on average, species with larger effective population sizes should have more Useful Stuff and less Not Useful Stuff Which Has Even a Low Cost. This provides us with a simple test: how does the abundance of the stuff you are interested in correlate with population size?

For example, species with large population sizes have fewer introns, shorter introns, and fewer dead transposons, while species with small population sizes (including the pinnacle of creation) have lots of these things.
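The test itself can be sketched as a rank correlation between effective population size and the abundance of the element of interest across species. The numbers below are purely illustrative (invented to show the expected direction of the signal), not real data:

```python
import math

# Sketch of the proposed test: rank-correlate effective population size
# with the genomic abundance of some element across species.
# All values below are invented for illustration.

def ranks(xs):
    """Rank positions (0 = smallest); no tie handling, for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

ne        = [1e9, 1e7, 1e5, 1e4]    # effective population sizes (illustrative)
junk_frac = [0.05, 0.2, 0.45, 0.6]  # fraction of low-cost junk (illustrative)

print(round(spearman(ne, junk_frac), 2))  # -1.0 on this toy data
```

A significantly negative correlation across many species would be the expected signature of selection purging slightly costly sequence more efficiently in large populations.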

I rest my case.

(This idea is totally not original, but I can't be bothered to look for a reference, and I wanted to put it in writing for my own future reference. Cheers.)

Posted in evolution, genomics | Comments closed on A population genetics test for Junk DNA