A detailed analysis of synonymous versus non-synonymous substitution rates in chloroplast gene ycf2 (ORF2280 homologs) in six dicot species

by Frances Raftis, April 11, 2001


The largest coding sequence in the chloroplast genome, ycf2, is known to express a protein that is thought to be essential for plant cell survival in many species, particularly dicots. Analysis of synonymous versus non-synonymous substitution patterns in sequences from six dicots (Arabidopsis thaliana, Nicotiana tabacum, Lotus japonicus, Oenothera elata subsp. hookeri, Epifagus virginiana, and Spinacia oleracea) revealed that purifying selection is acting on the ycf2 locus. The ycf2 locus is located within the conservative IR (inverted repeat) region of the chloroplast genome, so similar analyses of two IR coding genes (rps7 and rpl23) as well as one LSC (large single copy region) gene (psbA) were conducted to compare expected rates of synonymous versus non-synonymous substitution, as well as overall rates of sequence divergence within different regions of the chloroplast genome. The IR genes were found to have lower overall rates of divergence across all species. The non-photosynthetic parasite Epifagus virginiana was found to have conserved ycf2 and rps7 sequences, though rpl23 and psbA showed ds/dn values typical of pseudogenes, as expected for this species. Short-interval analyses of 100, 150, and 200bp segments were conducted on the ycf2 gene for all sequences, and several conserved regions were found. A region of poor conservation was also found; it corresponds to a region of the putative Ycf2 protein that was previously suggested to be non-essential (Downie et al., 1994).


The giant chloroplast gene ycf2 specifies a protein of as yet undetermined function. It is found in the plastome of most land plants, including the non-photosynthetic parasite Epifagus. The gene is translated (Glick and Sears, 1993), and its expression is higher in the fruits and flowers of tomato than in the leaves (Richards et al., 1994). The gene product seems to perform some vital function for the plant cell, as the work of Drescher et al. (2000) suggests that a deletion mutant is lethal. Several other lines of evidence, reviewed below, support the theory that ycf2 is an essential gene.

A reasonable gauge of a protein's necessity is the presence or absence of purifying selection in the gene sequence. For an essential protein, one would expect to see a greater rate of synonymous than non-synonymous substitutions. With the growing availability of chloroplast gene sequences, this approach allows a relatively easy preliminary determination of the state of (dis)use of a plastome coding gene. When compared between species, the rates of synonymous and non-synonymous substitutions can also give a general idea of the rate at which a sequence actually changes at a nucleotide level rather than how it evolves (which is what is determined with a protein sequence). Post-transcriptional mRNA modification (splicing and editing) is quite common in the chloroplast, so a protein sequence may not be the actual protein produced in the chloroplast, unless it is obtained from the isolated protein itself. A detailed investigation of ds/dn for a coding gene can point to which regions of an unfamiliar protein are essential (by their degree of conservation), and possibly which parts of the gene actually make it into the finished protein.
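To make the idea of ds and dn concrete, the following is a minimal sketch of a pairwise calculation in the spirit of the Nei and Gojobori (1986) counting method listed in the references. It is not the SNAP implementation used in this study. The function names are hypothetical, the standard genetic code is assumed, and two simplifications are made: changes to stop codons are counted as non-synonymous, and mutational pathways that pass through stop codons are not excluded. A Jukes-Cantor correction is applied to the observed proportions, so the sketch fails if no synonymous sites are present or if a proportion reaches 0.75.

```python
import math
from itertools import permutations, product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Synonymous sites in one codon: share of single-base changes keeping the amino acid."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    s += 1 / 3
    return s


def count_diffs(c1, c2):
    """Synonymous and non-synonymous differences, averaged over all orders of change."""
    diff_pos = [i for i in range(3) if c1[i] != c2[i]]
    paths = list(permutations(diff_pos))  # a single empty path if the codons are identical
    syn = non = 0.0
    for order in paths:
        current = c1
        for pos in order:
            step = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[step] == CODON_TABLE[current]:
                syn += 1
            else:
                non += 1
            current = step
    return syn / len(paths), non / len(paths)


def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits (requires p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def ds_dn(seq1, seq2):
    """Return (ds, dn) for two aligned, in-frame coding sequences."""
    syn_site_total = codons = syn_diffs = non_diffs = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        codons += 1
        syn_site_total += (syn_sites(c1) + syn_sites(c2)) / 2
        sd, nd = count_diffs(c1, c2)
        syn_diffs += sd
        non_diffs += nd
    non_site_total = 3 * codons - syn_site_total
    return (jukes_cantor(syn_diffs / syn_site_total),
            jukes_cantor(non_diffs / non_site_total))
```

For a pair of sequences that differ only by third-position changes within four-fold degenerate codons, this returns a positive ds and a dn of zero, which is the pattern expected under strong purifying selection. The ds/dn ratio used throughout this paper is then simply the first value divided by the second, when the second is non-zero.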

The ycf2 gene is located within the inverted repeat (IR) region of the plastome in dicots. This region evolves even more slowly than the rest of the slowly evolving chloroplast genome (Rainer et al., 1995). This implies that genes within this region are protected from rapid divergence relative to other plastid genes. Indeed, the other genes in this region typically comprise an rRNA operon, ribosomal protein genes, and tRNA genes, which are ideal candidates for protection from mutation by virtue of their necessity for the translation of the rest of the genes in the plastome. The inclusion of ycf2 in this region hints at its importance.

In Epifagus virginiana, some of the genetic system genes are maintained, and some have become pseudogenes. All photosynthesis genes that persist are also pseudogenes. The Epifagus chloroplast genome therefore provides an ideal gauge for the neutral rate of chloroplast genome evolution. It retains the IR, which contains the ycf2 and rps7 genes as well as a pseudogene of rpl23. The plastid genes used for comparison with ycf2 were psbA, rps7, and rpl23. The D1 protein central to photosystem II is the product of the psbA gene, which is a pseudogene in Epifagus. The rps7 gene encodes a chloroplast ribosomal protein and is located in the IR. It is conserved in Epifagus and all other dicots studied. The rpl23 gene encodes another chloroplast ribosomal protein and is also located in the IR, but it is a pseudogene in Epifagus, and possibly in spinach as well.

Materials and Method:

Nucleotide sequences of the ORF2280 homologue (ycf2 gene) from Arabidopsis thaliana, Nicotiana tabacum, Lotus japonicus, Oenothera elata subsp. hookeri, Epifagus virginiana, and Spinacia oleracea (Table 1) were aligned using ClustalW 1.7. Alignments were evaluated by comparing the parsimony tree they generated in WebPhylip 1.2 (http://www.cbr.nrc.ca/cgi-bin/WebPhylip/index.html) with a standard taxonomic tree. The entire alignment was then tested for rates of synonymous versus non-synonymous substitution using SNAP (Synonymous/Non-synonymous Analysis Program) via http://hiv-web.lanl.gov/SNAP/WEBSNAP/SNAP.html. The alignments were then analyzed for ds/dn in 100, 150, and 200 base pair increments to find conserved regions. Sequences of psbA, rps7, and rpl23 were analyzed in a similar fashion. The psbA alignments were tested for ds/dn both over the entire alignment and over a common segment of 400bp. The rps7 and rpl23 alignments were tested only over their entire length.

Species     | ycf2       rps7       rpl23      psbA
Arabidopsis | 7525012    7525012    7525012    X79898.1
Tobacco     | 2924257    11465934   11465934   J01448.1
Lotus       | 13358958   13518417   13518417   13518417
Oenothera   | 13276709   13276709   13276709   13518298
Epifagus    | 11466954   11466954   11466954   11466954
Spinach     | 7636084    11497503   X07462.1   7636084

Table 1. GenBank accession numbers or GI of sequences used
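The interval analysis described in the Materials and Method can be sketched as follows. How the original intervals were placed relative to the reading frame is not stated, so this sketch makes an assumption: the intervals do not overlap, and each boundary is moved back to the nearest codon boundary so that every interval stays in frame (100 and 200 are not multiples of 3). The function names and the toy alignment are hypothetical.

```python
def codon_windows(length, size):
    """Return (start, end) slices covering `length`, snapped down to codon boundaries."""
    bounds = []
    for raw_start in range(0, length, size):
        start = raw_start - raw_start % 3          # move back to a codon boundary
        raw_end = min(raw_start + size, length)
        end = raw_end - raw_end % 3                # keep only whole codons
        if end > start:
            bounds.append((start, end))
    return bounds


def slice_alignment(alignment, start, end):
    """Cut the same column range out of every sequence in an alignment dict."""
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment of two 300-column sequences (illustrative only, not real ycf2 data).
toy = {"species_a": "ATG" * 100, "species_b": "ATG" * 100}
for size in (100, 150, 200):
    for start, end in codon_windows(300, size):
        segment = slice_alignment(toy, start, end)
        # Each segment would then be submitted to a ds/dn calculation.
        print(size, start, end, len(segment["species_a"]))
```

Each segment produced this way can be analyzed for ds/dn independently, and the values plotted against the interval start position to locate peaks.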


The rate of synonymous substitution over the entire ycf2 sequence was higher than the rate of non-synonymous substitution in every pairwise comparison except that between tobacco and Oenothera (fig. 1). The average value of ds/dn for all the pairwise comparisons was 1.62, with comparisons between spinach and Arabidopsis, tobacco, and Lotus all having ds/dn values greater than 2. Averaged over all sequences, the ds/dn value for rps7 was more than twice as large as that of ycf2 (3.89 versus 1.62), but the values for rpl23 and the full psbA sequence were both lower (1.41 and 0.60, respectively). The ds/dn values for the Epifagus sequences showed some deviation from the global average. The deviation was minimal for ycf2 and rps7 (1.58 versus 1.62, and 3.53 versus 3.89, respectively), but more striking for the rpl23 and full psbA sequences (1.03 versus 1.41, and 0.97 versus 0.60, respectively). The difference was most pronounced in the truncated 400bp psbA sequence (1.37 versus 13.92).

The values of ds and dn also varied somewhat by genome region and between Epifagus and the other species. Values for ds ranged from 0.08 to 0.12 for genes in the IR region, but were slightly higher for the full psbA sequence at 0.19. A similar trend was observed for dn values, which ranged from 0.03 to 0.08 in IR genes, and from 0.19 to 0.22 in the full psbA sequence. The ds and dn values for Epifagus sequences relative to all others were generally somewhat higher, though the trend towards lower values in the IR genes was still observed. For psbA (both full and truncated sequences), the ds and dn values for Epifagus were at least twice as high as the average.

The ycf2 alignment was tested for ds/dn at 100, 150, and 200 base pair intervals to find conserved regions (fig. 2). Relatively high values were found near positions 1000 and 2000, between positions 4500 and 5000, and in the last 1000 nucleotides. The conserved regions are far less pronounced at the larger interval size of 200bp than at 100bp, but the 150bp intervals generally showed intermediate values, so the peaks are probably not artifacts of a particular interval size.


The ds/dn values for ycf2 are almost all greater than 1, which probably indicates that some form of negative (purifying) selection is taking place. Though the average value for the entire sequence is less than half as large as that of rps7, it must be considered that ycf2 is approximately 15 times longer than rps7. It is likely that some regions of ycf2 are very strongly conserved while much of the rest is less well conserved, so that the whole-gene average masks the conserved regions. A similar effect is in fact seen for the psbA gene: the full alignment, which contains considerable extensions on each end, has a ds/dn value so low (0.60) that it suggests positive selection. However, if the analysis is conducted only on the segment of the sequence that is common to all species studied, the average ds/dn value increases drastically to 13.92, which suggests very strong negative selection acting on this segment. If the Epifagus comparisons are omitted from the average, it increases to 20.20 for the 400bp segment.

No region of the ycf2 sequence shows ds/dn values of the same magnitude as the truncated psbA sequence, but the interval analysis does suggest that some regions are more highly conserved than rps7 or rpl23 are overall. The gene shows some homology with FtsH/CDC48 genes, which are involved in cell division, cross-membrane transport, and proteolysis. It has been suggested that the ycf2 protein functions as a plastid-specific ATPase of the CDC48 family (Wolfe, 1994). One of the binding motifs that suggested this relationship was found near residue 450 of the hypothetical protein sequence, which would correspond approximately to the 1300 bp region of the gene. This area shows moderate ds/dn values (between 1 and 2). This does not strongly support negative selection for this region of the gene, but the motif in question is only 18 residues (~54 bp) long and could easily have been overlooked in this analysis. A similar result is found for the nucleotide-binding motif located around residues 1450-1600 (corresponding to the 4350-4800 bp region). While there is a sharp peak in the average ds/dn value there, the 13-residue (~39 bp) motif is probably not the cause of it.

An analysis of the proposed ycf2 gene product was conducted by Downie et al. (1994), in which it was proposed that the region 1000-1500 residues from the N-terminus is probably not essential, owing to the numerous deletions between dicots and Marchantia (a bryophyte) in this region. This region would correspond approximately to the 3000-4500 bp region of the gene, which shows the lowest overall ds/dn values in the sequence.
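The residue-to-nucleotide conversions used in the two paragraphs above follow from the three-bases-per-codon rule, and can be checked with a short sketch. The helper names are hypothetical, coordinates are taken as 1-based and inclusive, and positions refer to an ungapped coding sequence, so they are only approximate for columns of a gapped alignment. The motif start at residue 450 is illustrative, since the text locates it only approximately.

```python
def residues_to_nt(first_residue, last_residue):
    """Map a 1-based, inclusive residue range to 1-based, inclusive CDS positions."""
    return 3 * (first_residue - 1) + 1, 3 * last_residue


def intervals_spanned(nt_start, nt_end, interval_size):
    """0-based indices of the first and last fixed-size intervals a range touches."""
    return (nt_start - 1) // interval_size, (nt_end - 1) // interval_size


# Poorly conserved region discussed by Downie et al. (1994)
print(residues_to_nt(1000, 1500))

# Nucleotide-binding motif region
print(residues_to_nt(1450, 1600))

# An 18-residue motif starting at an assumed residue 450
start, end = residues_to_nt(450, 467)
print(end - start + 1, intervals_spanned(start, end, 100))
```

A motif of 54 bp can fall across two 100 bp intervals, each of which is then dominated by flanking sequence, which is one reason a short conserved motif may leave little trace in an interval-based ds/dn profile.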

The IR region of the chloroplast genome has been shown to have a slower rate of overall substitution than the long or short single-copy regions (Rainer et al., 1995). The values of ds and dn given by this analysis support this. The ranges do not overlap: even the highest ds or dn values for IR genes fall below the lowest psbA values. The IR values range from 0.03 to 0.12, and those of the psbA sequence range from 0.16 to 0.38. The highest IR values come from the ycf2 gene, suggesting that it has a relatively high rate of substitution for an IR gene, but it is still within a reasonable range of the other IR genes. A higher rate of substitution is not surprising in such a large sequence, either.

The Epifagus ycf2 sequence has a ds/dn value close to the average (97%), which suggests that purifying selection is still acting on the gene even in the greatly reduced plastome. The Epifagus plastome has suffered such large deletions since its loss of photosynthetic activity that the IR region occupies a considerably greater proportion of the plastome than it does in tobacco (Wolfe et al., 1992). The IR gene rps7 likewise shows a ds/dn value close to the average (91%). However, neither residence in the IR region of Epifagus nor encoding a ribosomal protein guarantees conservation: rpl23 is a pseudogene in Epifagus, with a ds/dn value of 1.03, which is 73% of the average, or 53% if the spinach (in which rpl23 is thought to be a pseudogene) and Epifagus sequences are omitted from the average. The difference is even more striking for the common segment of psbA, where Epifagus has a ds/dn value that is only 10% of the average, or 7% if the Epifagus sequences are omitted from the average. In both rpl23 and psbA, the Epifagus sequences behave like true pseudogenes (with ds/dn values close to 1). The rps7 gene is clearly conserved in Epifagus. Though Epifagus ycf2 has a ds/dn value less than half that of its rps7, it is still arguably conserved because its value is not appreciably lower than the overall average ds/dn, and because the Epifagus sequence was included in the 100, 150, and 200 bp interval analysis, which showed some regions of high conservation. The conservation of ycf2 in Epifagus would also suggest an explanation for the stubborn persistence of the IR. Generally, the ds and dn values for the Epifagus sequences are slightly higher than the average, which suggests that the Epifagus plastome has more freedom to evolve. This difference is again most pronounced in the psbA sequence, where the Epifagus ds and dn values are at least twice the average.
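The percentages quoted above can be reproduced from the rounded ds/dn values given in the Results. The sketch below only restates that arithmetic; the dictionary layout and function name are hypothetical. With the rounded inputs, ycf2 comes out near 97.5%, so the 97% quoted in the text presumably reflects unrounded values. The 53% figure for rpl23 is not included because the corresponding average is not reported.

```python
# (Epifagus ds/dn, average ds/dn) as reported in the Results
reported = {
    "ycf2": (1.58, 1.62),
    "rps7": (3.53, 3.89),
    "rpl23": (1.03, 1.41),
    "psbA 400bp": (1.37, 13.92),
    "psbA 400bp, Epifagus omitted from average": (1.37, 20.20),
}


def percent_of_average(value, average):
    """Express one ds/dn value as a percentage of an average ds/dn value."""
    return 100.0 * value / average


for gene, (epifagus, average) in reported.items():
    print(f"{gene}: {percent_of_average(epifagus, average):.1f}% of average")
```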

Considering the location of the ycf2 gene in the dicot plastome, it would be surprising if the protein were not of some utility to the cell. Not only is it the largest gene in plastomes where it is present (6843bp in tobacco, representing ~8.7% of the entire plastome), but it is also located in the relatively static IR, which is represented twice in each molecule of chloroplast DNA. Though it is not quite as streamlined as a typical bacterial genome, the chloroplast genome is not as tolerant of extraneous DNA as eukaryotic genomes seem to be (as evinced by the 'purging' of the Epifagus plastome), so any gene that represents such a large fraction of the plastome's coding capacity is most likely important. Though ycf2 does not show the same degree of sequence conservation as other IR genes, it does not behave as an IR pseudogene would (e.g., rpl23 in Epifagus). It would be interesting to compare the conservation of ycf2 to that of the entire IR in photosynthetic dicot species, as it is possible that the more rapid rate of ycf2 evolution is due to its large size.

The ycf2 gene is one of only four protein-coding genes conserved in Epifagus that are not genetic system genes. The maintenance of the genetic system (and the plastid itself) in the parasitic plant is something of an enigma, and it has been suggested that one of these four genes provides some essential function to the plant (Wolfe et al., 1992). The fact that Drescher et al. (2000) were unable to obtain homoplastomic transplastomic knockouts for ycf2 even under intense selection pressure also suggests that a deletion is lethal in some species. The gene seems to be entirely absent from grasses, notably maize and rice (Rainer et al., 1995), but this loss seems to be novel. It is assumed that the gene has been transferred to the nucleus in these species, but no homologue has yet been found. All the species in this analysis were dicots, and it is possible that the gene is necessary for some dicot-specific process. Since it is highly expressed in generative tissues (flowers and fruit of tomato; Richards et al., 1994), it is possible that it plays some role in seed development, which is a different process in dicots and monocots. However, its presence in bryophytes and gymnosperms indicates that this could not be its only role. A dicot-specific role would, however, explain why the protein seems to be more conserved in dicot species than in bryophytes (Downie et al., 1994).


Downie,S.R., Katz-Downie,D.S., Wolfe,K.H., Calie,P.J., and Palmer,J.D.
Structure and evolution of the largest chloroplast gene (ORF2280): internal plasticity and multiple gene loss during angiosperm evolution
Curr. Genet. 25, 367-378 (1994)

Drescher,A., Ruf,S., Calsa,T.Jr., Carrer,H., and Bock,R.
The two largest chloroplast genome-encoded open reading frames of higher plants are essential genes
The Plant Journal 22 (2), 97-104 (2000)

Glick,R.E., and Sears,B.B.
Large unidentified open reading frame in plastid DNA (ORF2280) is expressed in chloroplasts
Plant Mol. Biol. 21, 99-108 (1993)

Hupfer,H., Swiatek,M., Hornung,S., Herrmann,R.G., Maier,R.M., Chiu,W.L. and Sears,B.
Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable euoenothera plastomes
Mol. Gen. Genet. 263 (4), 581-585 (2000)

Kato,T., Kaneko,T., Sato,S., Nakamura,Y. and Tabata,S.
Complete structure of the chloroplast genome of a legume, Lotus japonicus
DNA Res. 7 (6), 323-330 (2000)

Liere,K., Kestermann,M., Muller,U. and Link,G.
Identification and characterization of the Arabidopsis thaliana chloroplast DNA region containing the genes psbA, trnH and rps19
Curr. Genet. 28 (2), 128-130 (1995)

Nei,M. and Gojobori,T.
Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions
Mol. Biol. Evol. 3 (5), 418-426 (1986)

Richards,C.M., Hardison,R.C., and Boyer,C.D.
Expression of the large plastid gene, ORF2280, in tomato fruits and flowers
Curr. Genet. 26, 494-496 (1994)

Rainer,M.M., Neckermann,K., Igloi,G.L., and Kossel,H.
Complete Sequence of the Maize Chloroplast Genome: Gene Content, Hotspots of Divergence and Fine Tuning of Genetic Information by Transcript Editing
J. Mol. Biol. 251, 614-628 (1995)

Sato,S., Nakamura,Y., Kaneko,T., Asamizu,E. and Tabata,S.
Complete structure of the chloroplast genome of Arabidopsis thaliana
DNA Res. 6 (5), 283-290 (1999)

Schmitz-Linneweber,C., Maier,R.M., Alcaraz,J.P., Cottet,A., Herrmann,R.G. and Mache,R.
The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization
Plant Mol. Biol. 45 (3), 307-315 (2001)

Shinozaki,K., Ohme,M., Tanaka,M., Wakasugi,T., Hayashida,N., Matsubayashi,T., Zaita,N., Chunwongse,J., Obokata,J., Yamaguchi-Shinozaki,K., Ohto,C., Torazawa,K., Meng,B.Y., Sugita,M., Deno,H., Kamogashira,T., Yamada,K., Kusuda,J., Takaiwa,F., Kato,A., Tohdoh,N., Shimada,H. and Sugiura,M.
The complete nucleotide sequence of tobacco chloroplast genome: its gene organization and expression
EMBO J. 5, 2043-2049 (1986)

Thompson,J.D., Higgins,D.G. and Gibson,T.J.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Nucleic Acids Res. 22, 4673-4680 (1994)

Wolfe,K.H., Morden,C.W. and Palmer,J.D.
Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant
Proc. Natl. Acad. Sci. U.S.A. 89 (22), 10648-10652 (1992)

Wolfe,K.H.
Similarity between putative ATP-binding sites in land plant plastid ORF2280 proteins and the FtsH/CDC48 family of ATPases
Curr. Genet. 25, 379-383 (1994)

Zurawski,G.R., Bohnert,H.J., Whitfeld,P.R. and Bottomley,W.
Nucleotide sequence of the gene for the M-r 32,000 thylakoid membrane protein from Spinacia oleracea and Nicotiana debneyi predicts a totally conserved primary translation product of M-r 38,950
Proc. Natl. Acad. Sci. U.S.A. 79, 7699-7703 (1982)

Frances Raftis, copyright 2001