Molecular biology

Genes, the updated 2007 model

Beads of genes scattered along a very long string of non-coding sequences: this was the popular image once the human genome had been sequenced, in 2001. The successful completion of this colossal project, which also taught us that 99% of human DNA is not protein-coding, ushered in a new and even more arduous one: deciphering and analysing these sequences. The international Encode (Encyclopedia of DNA Elements) consortium has just finished analysing 1% of the human genome. In the process it has updated our vision of what a gene is and our understanding of the mysterious role of so-called “junk DNA”.

The coincidence is symbolic. The publication of Encode’s initial findings, which took up the entire June issue of Genome Research, occurred exactly half a century after Francis Crick’s famous 1957 lecture to the British Society for Experimental Biology. At this lecture, a milestone in the history of molecular biology, Crick, who went on to win the Nobel prize for his co-discovery of DNA’s double-helix structure, first expressed the “central dogma” of the discipline: the existence of an information flow from DNA to messenger RNA (mRNA) – the transcription stage – and then from RNA to proteins – the translation stage. According to this founding vision, a given gene corresponds to a particular protein, the gene being defined as a sequence of nucleotides necessary and sufficient to synthesise that protein’s chain of amino acids. The genetic code, whereby each nucleotide triplet corresponds either to an amino acid or to a translation start or stop signal, ensures the correspondence between these two chemically different sequences.
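The translation step just described can be pictured as a simple lookup: each mRNA codon (nucleotide triplet) maps to an amino acid or a stop signal. The sketch below uses only a small illustrative subset of the 64 codons of the standard genetic code, and the mRNA string is an invented example, not a real gene.

```python
# Minimal sketch of mRNA-to-protein translation via the genetic code.
# CODON_TABLE holds only a toy subset of the 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # methionine; also the start signal
    "UUU": "Phe", "GCU": "Ala", "GAA": "Glu",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list[str]:
    """Read the mRNA codon by codon until a stop signal is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        protein.append(residue)
    return protein

print(translate("AUGUUUGCUGAAUAA"))  # ['Met', 'Phe', 'Ala', 'Glu']
```

The one-way flow of this function – DNA-derived sequence in, protein sequence out – is exactly the directionality that the central dogma asserted.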

From dogma to doubt

The surprising choice of the term “dogma” – hardly standard scientific vocabulary – to designate this theory did not go unchallenged. Crick’s response was that he had of course not intended any “religious” connotation, and that we should in fact speak rather of an axiom. But the somewhat ambiguous choice of term (dogma or axiom) meant that the newly defined model became set in stone in the molecular biology laboratories of the 1960s. Genes, which had started out as a speculative concept introduced by Danish scientist Wilhelm Johannsen at the beginning of the 20th century, had taken on a chemical reality. A functional model inspired by information theory, proposed by the French scientists François Jacob and Jacques Monod (Nobel prize 1965), soon followed. The complete elucidation of the secrets of heredity and life seemed just a few years away.

But this initial enthusiasm proved short-lived. More recent research data seemed to contradict the neat certainties of the central dogma. The original conceptual edifice was clearly too simple and might even break apart. In 1977 genes were discovered to have a mosaic structure, alternating regions whose sequence codes for protein fragments (exons) with intervening non-coding regions (hence the name introns). Through the phenomenon of alternative splicing, by which a variety of mRNAs can be generated from different combinations of exons, a single gene can code for several proteins. This was a first body blow to the dogma. In the 1980s scientists went on to discover the phenomenon of RNA editing, by which the mRNA sequence is modified by an enzyme. This seriously damaged the theoretical model by disqualifying the DNA sequence as an accurate predictor of the protein sequence.
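The combinatorial power of alternative splicing can be sketched in a few lines: if exons can be independently included or skipped (a simplification – real splicing is constrained, and first and last exons are usually retained), a gene with n exons can in principle yield up to 2ⁿ − 1 distinct mature mRNAs. The exon sequences below are invented for illustration.

```python
# Hedged sketch of alternative-splicing combinatorics: introns are
# discarded and a subset of exons, kept in genomic order, is joined
# into a mature mRNA. Exon sequences are purely illustrative.
from itertools import combinations

def splice_variants(exons):
    """Enumerate mRNAs from every non-empty, order-preserving exon subset."""
    variants = []
    for n in range(1, len(exons) + 1):
        for combo in combinations(exons, n):  # preserves input order
            variants.append("".join(combo))
    return variants

exons = ["AUGGCU", "GAAUUU", "GCUGAA"]  # hypothetical exons of one gene
print(len(splice_variants(exons)))  # 7 variants from 3 exons (2**3 - 1)
```

Even this toy model shows why the one-gene-one-protein picture could not survive the 1977 discovery.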

Around the same time, researchers also discovered the existence of pseudogenes: non-protein-coding DNA sequences generated by the reverse transcription of RNA into DNA. This seriously holed the dogma, forcing scientists to accept that information can also flow in the reverse direction, from RNA back to DNA.

The final blow to the former certainties came with the gigantic enterprise of sequencing the human genome. The most disconcerting outcome, unveiled in 2001, was that under 1% of our DNA consists of ‘useful’ genes, the definition of which now seemed less clear than ever. It was hard to imagine that the evolutionary process had kept so much ‘junk DNA’, as these non-coding sequences were christened. Crick’s axiom was now, if not dead, at least obsolete, overtaken by later knowledge. Patently the theoretical edifice of molecular biology would have to be rebuilt on new foundations.

Encode: laying new foundations

It was to begin laying these new foundations that the international Encode project was initiated in 2003 by the U.S. National Institutes of Health. This mammoth undertaking, involving some 300 researchers at 80 molecular biology research centres in 11 countries, built on the knowledge accumulated over more than a decade of intense research in the area, including new technologies such as bioinformatics, a field in which Europe ranks among the best in the world (see box). This shared endeavour sought to “start again from zero” by concentrating on just 1% of the sequenced human genome, around 30 million base pairs (out of the three billion that have been counted), each pair formed from two of the four nucleotide “letters” of the DNA alphabet.

Starting from this representative sample, researchers set out to map the fundamental and singularly complex relationship of the DNA–RNA couplet, by which a single gene, isolated like a bead on an enormously long string, can be transcribed to permit the synthesis of a protein. Taking a systematic and exhaustive approach, researchers focused in particular on the second component of this couplet and sought to isolate all the RNAs present in a dozen human cell types.

Junk DNA is not junk

It is this meticulous – and titanic – exercise that yielded the first surprise: the Encode partners discovered that the vast majority of nucleotides in the DNA sequence studied are indeed transcribed into RNA! “This finding tells us that so-called junk is not junk at all, but plays an active role,” sums up Ewan Birney, who led the data analysis work at EMBL’s European Bioinformatics Institute, located at the Wellcome Trust Genome Campus near Hinxton (UK). How, indeed, can we describe as “junk” components that are transcribed, when we still do not know the function of the corresponding RNAs? The more so since the work at Hinxton highlights the fact that certain regions of the human genome, transcribed but considered to be non-coding, have been extraordinarily well conserved during evolution, being found with quasi-identical sequences in 28 other mammalian species.

The global results published by the Encode project in June 2007 reveal a totally new molecular biology landscape, with the world of cellular RNA proving much richer and more complex than previously believed. The three RNA families on which the “old dogma” was founded – messenger RNA, which acts as the template for protein synthesis; ribosomal RNA, a component of the cellular translation machinery; and transfer RNA, which serves as the connector between a triplet of mRNA nucleotides and an amino acid – represent just a tiny minority of the transcription products revealed by the Encode researchers. Alongside them are large numbers of small non-coding RNA sequences, assumed to be involved in regulating transcription and in other as yet unknown roles.

What regulates transcription?

This revelation of the importance and diversity of the various RNA families raises fundamental and far-reaching questions as to how transcription is regulated – in other words, what leads an RNA-polymerising enzyme to attach itself to a strand of DNA. The traditional vision spoke of transcription signals recognised by the RNA polymerase, some lying very close to the gene (its promoter) and others more distant. It was generally agreed that these regions were not part of the gene. This somewhat blurred the very definition of the gene as the sequence necessary and sufficient for the synthesis of a protein, since transcription is impossible without these regulatory elements.

Unconstrained by previous conceptions, Encode produced a radically new perspective. Researchers brought to light a hitherto unsuspected variety of transcription signals. More importantly, they succeeded in pinpointing them, often in unexpected places – even right at the heart of exons. But their distribution is not random: forests of signals alternate with desert stretches containing not a single one. How do we explain this? The avenues the researchers are opening up suggest that we need to look more closely at chromatin, the form that nuclear DNA takes between cell divisions, before it is compacted into chromosomes. Produced by the complex wrapping of the DNA double strand around proteins, chromatin may have an elaborate three-dimensional structure that frees up certain sites, making them accessible to RNA-synthesising enzymes. Our task now, the Encode partners suggest, is to understand how chromatin regulates transcription, just as we explain the function of different proteins by their particular shape.

A new approach to the word ‘gene’

Emerging from this work is a much more complex vision of the organisation of human DNA, in which the sharp dividing line between genes and intergenic regions blurs, since sequences not containing any gene clearly have an important functional role. Should we therefore abandon the concept of ‘gene’?

Encode’s researchers acknowledge the question. But such a sea change would probably serve only to increase confusion. What they propose instead is a new universal definition of ‘gene’, compatible with earlier knowledge – what we used to call a gene remains a gene – but in which the idea of ‘function’ takes on a much broader meaning.

In this way a gene becomes “a union of genomic sequences encoding a coherent set of potentially overlapping functional products”. Exit the old textbook image of DNA as a simple string of coloured beads called genes, each a discrete protein-coding sequence. With this new definition, the beads can be fragmented (a union of sequences). They can also come in several colours, as one and the same gene can code for several functional products (as many as 17 proteins for one of the genes studied by GEncode) and for different types of RNA. But how do we define a “functional product”, bearing in mind that “no demonstrated function” does not mean “non-functional”? The Encode researchers are fully aware that by re-asking “what is a gene?”, they have raised a second question, to which there is perhaps no answer: “what is a function?” Hardly have we redefined the gene than we are already confronted with the limits of our new definition. Not forgetting that 99% of the genome still remains to be tackled. Wasn’t it the famous American physicist Richard Feynman who used to quip that the vaguer a scientific concept, the more useful it is?
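One way to picture Encode’s proposed definition is as a data structure: a gene becomes a union of genomic intervals linked to the set of functional products they jointly encode. The gene name, coordinates, and product labels below are purely hypothetical illustrations, not real annotations.

```python
# Sketch of the "union of genomic sequences encoding a coherent set of
# potentially overlapping functional products" definition. All values
# are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Gene:
    name: str
    # The fragmented "beads": a union of (start, end) genomic intervals.
    sequences: set = field(default_factory=set)
    # The "several colours": proteins and functional RNAs it encodes.
    products: set = field(default_factory=set)

g = Gene("exampleGene")
g.sequences |= {(100, 250), (400, 520)}  # two separated exonic fragments
g.products |= {"protein isoform A", "protein isoform B", "regulatory RNA"}
print(len(g.products))  # 3: one gene, several functional products
```

What the structure deliberately leaves open is exactly the point the researchers concede: nothing in it says what qualifies an entry of `products` as genuinely “functional”.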

Mikhaïl Stein


Read More

European bioinformatics, a cornerstone of Encode

European bioinformaticians played a key role in Encode, in particular those from the BioSapiens network of excellence. It was they who, in the GEncode sub-project, initiated the annotation of the 1% of the genome selected for the project. This provided the basis for tracking all potential genes, in accordance with the more or less agreed definitions used at the time the research started. “The originality of our approach lay in mixing gene prediction using bioinformatics algorithms with validation of these predictions by scientific experimentation. We first identified the already known genes, used them to train our algorithms, and then used these to discover new genes,” explains Roderic Guigó of the Centre de Regulació Genòmica in Barcelona (ES). The biological functions of some 1097 proteins were identified by the GEncode team from the sample of 30 million nucleotide base pairs selected by Encode for examination.

“BioSapiens is without a doubt one of the largest and most ambitious research initiatives ever undertaken in the Union in the field of molecular biology,” notes Frederick Marcus, who supervises this network of excellence at the Commission. The objective is to build on world-class advances in European bioinformatics, symbolised by two databases used across the world, one for DNA (the EMBL gene bank) and the other for proteins (UniProtKB/Swiss-Prot). BioSapiens has a double purpose: to create a truly virtual institute for gene annotation, and to provide high-quality research training through a European School of Bioinformatics. “The scientific value of a genomic sequence is measured by the quality of its annotation,” explains Tim Hubbard of the Sanger Institute at Hinxton (UK) – that is, the quantity of data that can be drawn from the DNA sequence by comparing it with other known sequences. Another challenge of annotation is mapping out the regions, at times reduced to a single nucleotide (Single Nucleotide Polymorphism, or SNP), that vary considerably from one person to another, and correlating these variations with the onset of various pathologies, in particular cancers. The pharmaceutical industry has invested colossal sums in studying these SNPs, which promise to make it possible, in the longer term, to develop designer drugs tailored to each person’s genetic constitution. This explains the presence of representatives of world-class corporations like Hoffmann-La Roche and Pfizer on BioSapiens’ scientific board.

A historical U-turn?

Michel Morange is professor of biology and history of science at the Ecole Normale Supérieure in Paris. He is also the author of a reference work, A History of Molecular Biology, published by Harvard University Press (1998). His are the comments of a close and knowledgeable observer of the “Encode event”.

How do you explain the fact that genes remain a central concept in biology, even though their definition has changed considerably?

Michel Morange: The richness and rootedness of this concept lie in the combination of three aspects: functional unit, mutational unit and recombination unit. People have already tried to distinguish these three aspects by creating different terms, but this has not worked. First because it is difficult to change a nomenclature, in particular when one of the terms has passed into everyday speech. Also because the term gene allows a circulation of ideas between the three accepted meanings: for example between biochemists interested in the functional aspects, and traditional geneticists who are interested in the mutations and recombinations.

Doesn’t Encode’s proposal for a new definition introduce a break point?

It does away with the idea of the functional unit, because one and the same gene can code for very different functional products: various protein isoforms, but also different regulatory RNAs. This immediately raises the question of whether there is still any link between a gene and a phenotype, as the different products of a gene can be involved in determining very varied phenotypes. In fact, without explicitly stating it, the Encode researchers take an evolutionary standpoint. Natural selection acts on one aspect of the phenotype, and hence on the genes involved in determining it. The gene becomes, so to speak, a unit of selection.

Encode also proposes a solution to the paradox of junk DNA. Isn’t this a major turning point?

This paradox appeared at the end of the 1970s with two events. One was experimental: the discovery in 1977 of the fragmented structure of eukaryote genes, with their alternating coding and non-coding regions. The second was theoretical: the publication in 1976 of British biologist Richard Dawkins’ book, The Selfish Gene, which offered a possible interpretation of non-coding sequences. If a gene is a self-reproducing entity, as Dawkins holds, we can imagine that redundant sequences, resulting from the degeneration of old genes not retained by natural selection, will accumulate. The first gene sequencing efforts of the 1980s, followed by the human genome programme, subsequently showed that these non-coding sequences represent 99% of the genome. Encode’s work now shows us that almost all these non-coding sequences are transcribed. Very interesting, but I’m not sure this implies that they have a function. One can imagine a more or less arbitrary basic level of transcription, as if DNA ‘leaked’ RNAs. I believe we will only really have resolved the question of junk DNA when we can explain why a little fish like the fugu (blowfish) can have a genome ten times smaller than that of other fish of the same family and of comparable biological complexity.


To find out more