Duplicate Gene Analyses

From Purdue Genomics Database Facility


Analyses of Selaginella moellendorffii Duplicate Genes


Michael S. Barker (msbarker@indiana.edu)
Dept. of Biology, Indiana University, Bloomington, IN

Dept. of Botany, University of British Columbia, Vancouver, BC


Summary

Ancient whole genome duplications have been identified in a wide variety of eukaryotes, with numerous examples from plants. Paleopolyploidy has been identified in the history of the model plants Arabidopsis and Physcomitrella, as well as numerous non-model species such as Persea and Solanum. To assess whether Selaginella moellendorffii is a paleopolyploid, I analyzed the age distribution of duplicate genes from the predicted transcripts. Unlike other model plants with fully sequenced genomes, the age distribution of duplicate genes does not reveal any evidence of ancient whole genome duplications in the history of Selaginella. No peaks are evident in the age distribution of 4,164 gene family duplications (Figure 1), and SiZer analyses of this distribution did not identify any significant peaks.

Considering the paucity of extant polyploid species in the family and their relatively low chromosome numbers among the lycophytes, this result is not entirely surprising. Chromosome numbers in the genus range from 7 to 12, with S. moellendorffii and other members of its clade possessing an x = 10 cytotype (Takamiya, 1993; Korall and Kenrick, 2002). Species immediately outside of the S. moellendorffii clade possess a greater diversity of chromosome counts, including x = 8 and 9 cytotypes. If the x = 10 count of S. moellendorffii and its relatives is the result of aneuploidy, we would expect to observe relatively recent, small peaks in their age distribution. Consistent with this pattern, the K-S goodness of fit test (Cui et al., 2006) rejected the null, indicating a deviation from a purely exponential distribution.

Subsequent analysis with a maximum likelihood mixture model found five normal distributions in the data (Table 1, Figure 1). The youngest of these distributions corresponds to the zero class of recent duplicates, and the second youngest is likely a result of heterozygous alleles in our data set. However, two of the remaining distributions, centered at Ks = 0.14 and 0.34, are possibly segmental duplications. Future analyses of fully assembled sequence scaffolds should reveal whether these mixture components are indeed segmental duplications, and whether they are responsible for the x = 10 cytotype observed in S. moellendorffii. Full output of the duplicate analysis is located at http://msbarker.com/selmo/pamloutput_selmo.

Image:smoagedist3.jpg
Figure 1. Age distribution of Selaginella moellendorffii duplicate genes. Black lines indicate normal fits based on mixture model analysis. No statistically significant peaks consistent with paleopolyploidy are observed, although more recent mixture components may correspond to segmental duplications.

Materials & Methods

For the predicted Selaginella moellendorffii transcripts, duplicate gene pairs were identified and their divergence, in terms of substitutions per synonymous site (Ks), was calculated. Duplicate pairs were identified as sequences that demonstrated 40% sequence similarity over at least 300 base pairs from a discontiguous MegaBLAST (Zhang et al., 2000; Ma et al., 2002). Reading frames for duplicate pairs were identified by comparison to available plant protein sequences. Each duplicated gene was searched against all plant proteins available on GenBank (Wheeler et al., 2007) using BLASTX (Altschul et al., 1997). Best hit proteins were paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that did not have a best hit protein at this level were removed before further analyses. To determine reading frame and generate estimated amino acid sequences, each gene was aligned against its best hit protein by Genewise 2.2.2 (Birney et al., 1996). Using the highest scoring Genewise DNA-protein alignments, custom Perl scripts were used to remove stop codons and codons containing “N”, and to produce estimated amino acid sequences for each gene. Amino acid sequences for each duplicate pair were then aligned using MUSCLE 3.6 (Edgar, 2004). The aligned amino acids were subsequently used to align their corresponding DNA sequences using RevTrans 1.4 (Wernersson and Pedersen, 2003). Ks values for each duplicate pair were calculated using the maximum likelihood method implemented in codeml of the PAML package (Yang, 1997) under the F3x4 model (Goldman and Yang, 1994).
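
As a rough illustration of the pair-identification step, the sketch below filters MegaBLAST hits by the two thresholds stated above. It assumes the hits are in the standard 12-column tabular BLAST format and treats the percent-identity column as the similarity measure. The function name and the example gene names are hypothetical, and this is not the script used for the analysis.

```python
import csv

MIN_SIMILARITY = 40.0  # percent, threshold stated in the text
MIN_LENGTH = 300       # aligned base pairs, threshold stated in the text


def duplicate_pairs(tabular_hits):
    """Return unordered gene pairs whose hit passes both thresholds.

    `tabular_hits` is an iterable of tab-separated lines with the columns
    query, subject, percent identity, alignment length, ... (assumed layout).
    """
    pairs = set()
    for row in csv.reader(tabular_hits, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        similarity, length = float(row[2]), int(row[3])
        if query == subject:
            continue  # a gene matching itself is not a duplicate pair
        if similarity >= MIN_SIMILARITY and length >= MIN_LENGTH:
            # Sort the names so A-B and B-A collapse to a single pair.
            pairs.add(tuple(sorted((query, subject))))
    return pairs


# Hypothetical hits: only geneA/geneB passes both thresholds.
EXAMPLE_HITS = [
    "geneA\tgeneA\t100.00\t900\t0\t0\t1\t900\t1\t900\t0.0\t1600",
    "geneA\tgeneB\t82.50\t640\t112\t0\t1\t640\t5\t644\t1e-150\t520",
    "geneB\tgeneA\t82.50\t640\t112\t0\t5\t644\t1\t640\t1e-150\t520",
    "geneA\tgeneC\t91.00\t120\t11\t0\t1\t120\t1\t120\t1e-40\t150",
    "geneD\tgeneE\t38.00\t500\t310\t0\t1\t500\t1\t500\t1e-10\t80",
]

if __name__ == "__main__":
    print(duplicate_pairs(EXAMPLE_HITS))
```

Reciprocal hits are collapsed into one unordered pair so that each duplicate pair contributes a single Ks estimate downstream.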

Further cleaning of the data set was conducted to remove duplication events that could bias the results. All duplicate pairs containing identifiable transposable elements were removed from the analysis because duplication resulting from transposition may obscure a signal from paleopolyploidy. To reduce the possibility that identical genes are represented in the data set, but missed by the TGICL clustering due to alternative splicing, all Ks values from one member of a duplicate pair with Ks = 0 were removed. Further, to reduce the multiplicative effects of multicopy gene families on Ks values, we identified gene families as single-linked clusters, used simple hierarchical clustering to construct a phylogeny for each family (Blanc and Wolfe, 2004a), and calculated the node Ks values. Node Ks values < 2 were used in subsequent analyses.
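
The family-level reduction described above can be sketched as follows. A family of n genes yields up to n(n-1)/2 pairwise Ks values but only n-1 duplication nodes, which is why node values are used. The sketch assumes average linkage and takes each node Ks as the mean of the pairwise values between the two merged clusters; the text only specifies "simple hierarchical clustering", so both choices are assumptions, and all function names and example values are hypothetical.

```python
from itertools import combinations

MAX_NODE_KS = 2.0  # the text keeps node Ks values < 2


def single_linked_families(pair_ks):
    """Group genes into families: genes joined by any pair share a family."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pair_ks:
        parent[find(a)] = find(b)
    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return list(families.values())


def node_ks_values(family, pair_ks):
    """Average-linkage clustering of one family; one Ks per internal node."""
    ks = {frozenset(pair): value for pair, value in pair_ks.items()}
    clusters = [frozenset([gene]) for gene in sorted(family)]
    nodes = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            values = [ks[frozenset((a, b))] for a in c1 for b in c2
                      if frozenset((a, b)) in ks]
            if not values:
                continue  # no measured pair links these two clusters
            mean = sum(values) / len(values)
            if best is None or mean < best[0]:
                best = (mean, c1, c2)
        if best is None:
            break  # nothing left to merge
        mean, c1, c2 = best
        nodes.append(mean)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return nodes


def filtered_node_ks(pair_ks):
    """Node Ks values across all families, keeping only values below the cutoff."""
    kept = []
    for family in single_linked_families(pair_ks):
        kept.extend(k for k in node_ks_values(family, pair_ks) if k < MAX_NODE_KS)
    return sorted(kept)


# Hypothetical pairwise Ks values: one three-gene family and one saturated pair.
EXAMPLE_KS = {("A", "B"): 0.1, ("A", "C"): 0.5, ("B", "C"): 0.7, ("D", "E"): 2.5}

if __name__ == "__main__":
    print(filtered_node_ks(EXAMPLE_KS))
```

In the example, the three pairwise values of the A/B/C family reduce to two node values, and the D/E pair is dropped by the Ks < 2 cutoff.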

To identify significant features in the age distribution we employed three statistical tests. We used the bootstrapped K-S goodness of fit test of Cui et al. (2006) to assess if the overall age distributions deviated from a simulated null. Taxa that deviated significantly from the null were then analyzed with SiZer (Chaudhuri and Marron, 1999) to identify significant features (α = 0.05) in our age distributions. SiZer uses changes in the first derivative of a range of kernel density estimates to find significant slope increases or decreases, and the combination may be used to identify peaks and their ranges (Chaudhuri and Marron, 1999). We also used EMMIX to fit a mixture model of normal distributions to our data by maximum likelihood (McLachlan et al., 1999). Peaks produced by paleopolyploidy are expected to be approximately Gaussian (Schlueter et al., 2004; Blanc and Wolfe, 2004a), and this mixture model test identifies the number of normal distributions and their positions that could produce our observed age distributions. For our analyses, 1 to 10 normal distributions were fitted to the data with 1000 random starts and 100 k-means starts. The Bayesian Information Criterion (BIC) was used to select the best model fit to the data.
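
The mixture model selection step can be illustrated with a minimal sketch. It is not EMMIX: it fits univariate normal mixtures by plain EM from a single deterministic start rather than many random and k-means starts, and it floors each standard deviation at an arbitrary value to avoid degenerate components. BIC is written here as -2 log L + p ln n with p = 3k - 1 free parameters, so lower is better. All function names and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_mixture(data, k, iterations=200, min_sd=0.01):
    """Fit k normal components by EM; return (log-likelihood, components)."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start (assumption): means at evenly spaced quantiles.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    centre = sum(xs) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in xs) / n)
    sds = [max(min_sd, spread)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean, and standard deviation of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(min_sd, math.sqrt(var))
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sds)) or 1e-300)
        for x in xs
    )
    return loglik, list(zip(weights, means, sds))


def bic(loglik, k, n):
    """BIC for a k-component univariate normal mixture (lower is better)."""
    params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return -2.0 * loglik + params * math.log(n)


def best_mixture(data, max_k=10):
    """Fit 1..max_k components and return the fit with the lowest BIC."""
    fits = {k: fit_mixture(data, k) for k in range(1, max_k + 1)}
    scores = {k: bic(fits[k][0], k, len(data)) for k in fits}
    best_k = min(scores, key=scores.get)
    return best_k, fits[best_k][1], scores


def simulate_example(seed=1):
    """Simulated (not real) Ks values drawn from two normal components."""
    rng = random.Random(seed)
    return ([rng.gauss(0.15, 0.03) for _ in range(300)]
            + [rng.gauss(0.80, 0.10) for _ in range(300)])


if __name__ == "__main__":
    k, components, _ = best_mixture(simulate_example(), max_k=3)
    print(k, components)
```

A single start can settle in a local optimum, which is the reason the analysis itself used many random and k-means starts. The sketch is meant only to show how BIC trades log-likelihood against the number of components.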

Literature Cited

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25:3389-3402.
Birney E, Thompson J, Gibson T. 1996. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucl. Acids Res. 24:2730-2739.
Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 13:137-144.
Blanc G, Wolfe KH. 2004a. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell. 16:1667-1678.
Chaudhuri P, Marron JS. 1999. SiZer for exploration of structures in curves. J. Am. Stat. Assoc. 94:807-823.
Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, et al. (13 co-authors). 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res. 16:738-749.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res. 32:1792-1797.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Korall P, Kenrick P. 2002. Phylogenetic relationships in Selaginellaceae based on rbcL sequences. Am. J. Bot. 89:506-517.
Ma B, Tromp J, Li M. 2002. PatternHunter: faster and more sensitive homology search. Bioinformatics. 18:440-445.
McLachlan G, Peel D, Basford K, Adams P. 1999. The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Softw. 4:2.
R Development Core Team. 2005. R: A language and environment for statistical computing, reference index version 2.x.x. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.
Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome. 47:868-876.
Takamiya M. 1993. Comparative karyomorphology and interrelationships of Selaginella in Japan. J. Plant Res. 106:149-166.
Wernersson R, Pedersen AG. 2003. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucl. Acids Res. 31:3537-3539.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al. (30 co-authors). 2007. Database resources of the National Center for Biotechnology Information. Nucl. Acids Res. 35:D5-12.
Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.
