How should abundant gene expression data be presented?

Recently, the example we had chosen to highlight our new Bgee gene page also happened to be Ensembl's #Geneoftheweek, and Ensembl tweeted the expression of this gene, hemoglobin β (HBB), in ArrayExpress:

As noted in a recent post on the Bgee blog, summarizing the expression of a gene such as HBB (or indeed of most other genes, which are not extremely tissue-specific) is difficult: it is expressed in a great number of tissues and developmental or aging stages, yet its expression is more relevant in some than in others, and this expression is supported by a variety of data types and lines of evidence. This poses a problem for all expression databases: how should expression data for a gene be presented? I offer here a short review of some of the solutions chosen, and some thoughts on their trade-offs.

For each database I will include screenshots; clicking on them will take you to the database pages.

Here is the presentation as it was in the old Bgee interface:

[Screenshots: old Bgee interface, two views]

I think that we can agree that this was not optimal. Advantages: no information is hidden, it is organised according to the anatomical ontology, and there are clear links to the data. Disadvantages: unreadable, redundant (because the ontology is a graph, not a tree, so a term with several parents appears several times), ordered without biological motivation (alphabetically by ontology ID), and did I mention unreadable?
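
To illustrate the redundancy point, here is a minimal sketch of what happens when a graph-shaped ontology is unfolded into an indented tree. The toy terms and their relations are invented for the example (they are not taken from the real anatomical ontology), and `tree_rows` is a hypothetical helper, not Bgee code.

```python
# Toy ontology (hypothetical relations): "bone tissue" has two parents,
# so the structure is a graph, not a tree.
children = {
    "anatomical structure": ["skeletal system", "connective tissue"],
    "skeletal system": ["bone tissue"],
    "connective tissue": ["bone tissue"],
    "bone tissue": ["trabecular bone tissue"],
    "trabecular bone tissue": [],
}


def tree_rows(term, depth=0):
    """Unfold the graph into indented rows, as a tree-style display would."""
    rows = [(depth, term)]
    for child in children[term]:
        rows.extend(tree_rows(child, depth + 1))
    return rows


rows = tree_rows("anatomical structure")
for depth, term in rows:
    print("  " * depth + term)

distinct_terms = {term for _, term in rows}
print(len(rows), "rows for", len(distinct_terms), "distinct terms")
```

Every term reachable by more than one path is repeated together with everything below it, so the number of rows grows faster than the number of terms.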

Here is the presentation in ArrayExpress Gene Expression Atlas (GXA):

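[Screenshots: ArrayExpress Gene Expression Atlas (GXA), initial view and view after scrolling fully to the right]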

Notice that, as with old Bgee, the information extends largely off the screen, although here it is more to the right than down; the second screenshot shows the result of scrolling all the way. It's nicer than old Bgee, but we see a similar philosophy of showing all the information, at the cost of potentially overwhelming the user. The picture on the left highlights the tissue with the highest expression, although the underlying data can only be found by scrolling all the way across, which is unfortunate since in this case "whole blood" comes last in the alphabetical order of tissues. Of note, passing the mouse over the picture highlights tissues, if they are visible on screen at that moment. Images have clear advantages in terms of readability; they also come at a cost for scaling (i.e., making an accurate picture for every new species) and are limited in resolution. Another notable choice here is to present each individual experiment as a row of the table. Personally, when I look for a summary of gene expression I am not really interested in the experiment names, but maybe others are. The table format also presents scaling challenges: once 10 times more experiments and many more tissues or organs are added, the table will become very difficult to navigate.

Next comes what is probably the best interface I know so far for summarizing gene expression, the TISSUES database:

[Screenshots: TISSUES database, two views]

Here the choice is clearly to put forward what is hopefully the most relevant signal of expression. This is done by ordering the types of evidence (manual curation from UniProtKB/Swiss-Prot comes first), and by ordering by confidence within each. While TISSUES combines many sources of data, it covers fewer conditions than Bgee, since it does not integrate in situ hybridizations, and includes only a small set of RNA-seq and microarray experiments (see Santos et al. 2015). In this case, the visual body map corresponds well to the level of detail of the data, and only one species is intended to be covered, so there is no issue of scalability of anatomical representations.

There are many other databases which present expression data, but seem to demand that the user first define a dataset or a condition to look at, and are thus not relevant to the question of how to present a good overview. And of course many large data resources (such as GTEx or TCGA) present their own data, but then they do not have the challenge of integrating large quantities of diverse information that GXA, TISSUES or Bgee have.

Some more generalist databases do aim to present such integrated overviews. For example, neXtprot shows a predefined subset of tissues, while GeneCards shows histograms from three datasets (one microarray experiment, one RNA-seq experiment, and SAGE). So in these cases detailed information is not presented, which is fine for such databases, but unsatisfying for a resource dedicated to expression. Moreover, both neXtprot and GeneCards are human-only, while GXA and Bgee must manage to present diverse species.

Thus we get to the recent beta release of Bgee’s new gene page:

[Screenshot: new Bgee gene page]

As explained on the Bgee blog, we present a sorted list of anatomical terms, where the sorting is based on a weighted average score over all data types. Thus, top terms may be very precise or very general, and may be supported by RNA-seq, in situ hybridization, microarrays, or ESTs. These are not a priori decisions, but rather the result of an algorithm which tries to determine what is most informative for this gene. For each anatomical structure, the developmental and aging stages are available as an expandable list.
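
The idea of sorting by a weighted average over data types can be sketched as follows. This is only an illustration under stated assumptions, not the actual Bgee algorithm: the weights, the per-data-type scores, the convention that a higher score means a more relevant term, and the names `WEIGHTS`, `weighted_score` and `rank_terms` are all invented for the example.

```python
from typing import Dict, List

# Hypothetical weight per data type (the real weights are not given in this post).
WEIGHTS: Dict[str, float] = {
    "rna_seq": 1.0,
    "in_situ": 1.0,
    "microarray": 0.5,
    "est": 0.25,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Weighted average over the data types that actually have data for a term."""
    total_weight = sum(WEIGHTS[data_type] for data_type in scores)
    weighted_sum = sum(WEIGHTS[data_type] * s for data_type, s in scores.items())
    return weighted_sum / total_weight


def rank_terms(evidence: Dict[str, Dict[str, float]]) -> List[str]:
    """Sort anatomical terms from highest to lowest combined score."""
    return sorted(evidence, key=lambda term: weighted_score(evidence[term]), reverse=True)


# Invented scores in [0, 1] for three anatomical terms (assumed: higher is more relevant).
evidence = {
    "liver": {"microarray": 0.3},
    "bone marrow": {"rna_seq": 0.7, "est": 0.9},
    "blood": {"rna_seq": 0.9, "microarray": 0.8},
}

ranking = rank_terms(evidence)
print(ranking)
```

Because the average is taken only over the data types available for each term, a term supported by a single data type can still be ranked against one supported by several, and no data type has to be chosen in advance.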

The advantages of this new approach, we hope, are that:

  • it is scalable to new data types, large quantities of data per type, new anatomical or developmental detail, and to new species.
  • all information is present (the 198 anatomical entities for HBB), but the most relevant information is presented first.
  • the source of information is immediately visible, in the form of little vignettes for data types, colored by quality.
  • users do not need to choose a data type, an experiment or a condition a priori.

It is a first release, and obvious features to add are links to source data, showing the score value of each tissue, and the possibility to download results.

Limitations which we accept as part of our design choices (at least for now) are that:

  • development and aging are somewhat hidden.
  • not all data is visible at once (but in fact, due to screen size limitations, it never is, see old Bgee and GXA).
  • the relations between anatomical terms are hidden (e.g., trabecular bone tissue is_a skeletal tissue).
  • there is no graphical representation of expression levels (as in bar charts or histograms).

Finally, it is difficult for a database with 17 species and soon more, and a high level of anatomical and developmental detail, to represent anatomy by a visual body map. We would need one per developmental stage per species, which is not feasible and of debatable utility. And even then, these would not allow the visualization of fine cell types.

Now you understand where our compromise solution comes from. 😉

I hope that this subjective little tour is helpful in illustrating the challenges of representing the large and growing information that we have on gene expression in humans and other animals. Any great solutions which were overlooked here are welcome as comments below, or as tweets @marc_rr or @Bgeedb.
