Whole Genome Syntenic Analysis of Haplotypes
From Purdue Genomics Database Facility
Whole Genome Syntenic Analysis of Selaginella moellendorffii Haplotypes
Eric Lyons (elyons@nature.berkeley.edu)
Dept. of Plant and Microbial Biology, University of California, Berkeley, CA
Summary
The genome sequence of Selaginella moellendorffii comprises two distinct haplotype genomes. Although these haplotype genomes are not fully assembled, syntenic dotplot analysis of the entire genome reveals which genomic regions are haplotypes of one another (Figure 1). Interestingly, the two haplotype genomes differ structurally: a given genomic region of one haplotype is often broken into several distinct pieces in the other. The exact order of these pieces cannot be determined with the existing genomic assembly, but the haplotype fragments are clearly inserted and/or rearranged with respect to one another. Two other interesting genomic structural patterns are also seen: the probable large-scale insertion of a genomic region from one haplotype into the other, and a few massive arrays of local gene duplications. This disjointed and fragmented nature of the two haplotype genomes may explain why isolation of a single haplotype organism was not possible.
Figures
Whole genome syntenic comparison of Selaginella to itself. Both axes represent the genome of Selaginella ordered by contig size, with the largest contig beginning in the lower left corner. The black smears at the top and the right of the image are due to the large number of very short contigs. Putative gene homologs are identified using blastn with default values and an e-value cutoff of 0.05, and each is drawn on the image at its corresponding position as a grey dot. These results are then processed by DAGChainer (using genomic gene order and parameters -g 2 -D 5 -A 3) to identify collinear series of putative gene homologs, which are evidence for synteny (two genomic regions being derived from a common ancestral genomic region). Identified syntenic gene pairs are drawn on the dotplot as green dots. The green line running along the 45-degree diagonal is the self-self match. Other regions of synteny are easily seen, the majority of which result from the two different haplotype genomes comprising this diploid genome. Nearly the entire genome is covered by such syntenic regions derived from the two haplotypes.
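The idea behind the collinearity step in the legend above can be illustrated with a minimal sketch. This is not DAGChainer's algorithm: it is a simple greedy chain builder over gene-order coordinates. The function name, the input tuple layout, and the parameters max_gap and min_pairs are hypothetical; their defaults echo the -D 5 and -A 3 values in the legend, but whether they match DAGChainer's exact semantics is an assumption, and its scoring and the -g option are not modelled. Only chains in the forward orientation are found.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.05  # cutoff quoted in the figure legend
MAX_GAP = 5            # assumed analogue of DAGChainer -D 5 (in gene-order units)
MIN_PAIRS = 3          # assumed analogue of DAGChainer -A 3


def collinear_chains(hits, max_gap=MAX_GAP, min_pairs=MIN_PAIRS,
                     cutoff=E_VALUE_CUTOFF):
    """Group gene-pair hits into forward collinear chains.

    hits: iterable of (scaffold_a, index_a, scaffold_b, index_b, e_value),
    where index_* is the rank of the gene along its scaffold.
    Returns a list of ((scaffold_a, scaffold_b), [(index_a, index_b), ...]).
    """
    by_scaffolds = defaultdict(list)
    for scaf_a, idx_a, scaf_b, idx_b, e_value in hits:
        is_self_match = (scaf_a, idx_a) == (scaf_b, idx_b)
        if e_value <= cutoff and not is_self_match:
            by_scaffolds[(scaf_a, scaf_b)].append((idx_a, idx_b))

    chains = []
    for pair, points in sorted(by_scaffolds.items()):
        open_chains = []
        for a, b in sorted(set(points)):
            # Extend the first chain whose last point is close enough on both axes.
            for chain in open_chains:
                last_a, last_b = chain[-1]
                if 0 < a - last_a <= max_gap and 0 < b - last_b <= max_gap:
                    chain.append((a, b))
                    break
            else:
                open_chains.append([(a, b)])
        # Keep only chains with enough aligned pairs to count as syntenic.
        chains.extend((pair, c) for c in open_chains if len(c) >= min_pairs)
    return chains
```

In this sketch, isolated hits (the grey dots) are discarded because they never reach min_pairs, while runs of hits that advance together along both scaffolds survive as chains (the green dots).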
Same figure as above, but with numbers and arrows pointing to various regions shown in greater detail below.
A: Close-up of scaffold_5 compared to itself and its haplotype (scaffold_1). Note that while the haplotypes are very similar overall, they differ with regard to intergenic space and the copy number of a locally duplicated gene array. B: Comparison of 100kb of haplotypes using BLASTZ, with GEvo used to visualize the results (http://tinyurl.com/c2cp89). Top and bottom panels are from scaffold_5 and scaffold_1 respectively. The dashed line in each panel separates the top and bottom strands of the genomic region. Gene models are drawn as yellow/green/blue composite arrows. Regions of sequence similarity identified by BLASTZ are drawn as pink blocks, with those in the (++) and (+-) orientation drawn above and below the dashed line respectively. Large regions of sequence similarity have pink lines connecting the two genomic regions and have an average percent identity of >95%. Note the large spaces without sequence similarity, which highlight the differences in DNA sequence content between the haplotype regions. (Orange regions in the background of panels are unsequenced regions of the genome.)
A: Close-up of scaffold_2 self-self syntenic comparison. This scaffold shows a large duplicated region which is not found elsewhere in the genome. While nearly all other genomic regions have a corresponding haplotype region elsewhere, this duplication may be the result of a large-scale insertion and rearrangement event between haplotype genomes. B: Pairwise sequence analysis of 100kb of this duplicated region on scaffold_2 (image legend described above). This region shows a pattern of sequence similarity and overall coverage similar to that seen between haplotype regions. Combined, these data suggest that this is indeed a rearrangement event that occurred between haplotypes; the placement of the two copies near one another is likely fortuitous. Given a more complete assembly of these genomes, it would be interesting to quantify the frequency of this type of genomic evolutionary event between the haplotype genomes. This type of genomic evolution (inter-haplotype genomic exchange) has not been characterized in other plant genomes, and may be unique to the Selaginella lineage. However, it shows a pattern similar to that of plant genomes which have had a recent whole genome duplication event (e.g. Populus trichocarpa).
A: Close-up of local gene arrays found on scaffold_27, with multiple homologs found on scaffolds 17, 81, and 43. Of these, scaffold_81 has a haplotype region. B: This gene family (or its shared conserved domains) is found in rice, Arabidopsis, and Physcomitrella in low copy number (2-3), as determined by BLASTP. The genes in these organisms have unknown function.
Syntenic comparison of scaffold_10 to a set of other scaffolds. Note that the haplotype syntenic regions are broken up among three scaffolds: 6, 14, and 29. Scaffolds 6 and 14 each have a region where synteny stops, which is evidence that the haplotype genomes are structurally different due to large-scale rearrangements.
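One way to summarize this kind of fragmentation is to walk along a reference scaffold and record each run of syntenic gene pairs that points to the same partner scaffold. The sketch below assumes syntenic pairs have already been identified; the function name and input layout are hypothetical, and the values in the usage example are illustrative rather than taken from the Selaginella data.

```python
from itertools import groupby


def partner_segments(syntenic_pairs):
    """Summarize which partner scaffold covers each stretch of a reference.

    syntenic_pairs: iterable of (gene_index_on_reference, partner_scaffold).
    Returns a list of (partner_scaffold, first_index, last_index, n_pairs),
    one entry per consecutive run of pairs hitting the same partner.
    """
    ordered = sorted(syntenic_pairs)  # walk along the reference scaffold
    segments = []
    for partner, run in groupby(ordered, key=lambda pair: pair[1]):
        run = list(run)
        segments.append((partner, run[0][0], run[-1][0], len(run)))
    return segments


# Illustrative input only: gene ranks along a reference scaffold.
example = [(1, "scaffold_6"), (2, "scaffold_6"), (3, "scaffold_14"),
           (4, "scaffold_14"), (5, "scaffold_29"), (6, "scaffold_6")]
for segment in partner_segments(example):
    print(segment)
```

A partner scaffold that appears in more than one segment, or a reference scaffold covered by several partners, marks a candidate break in collinearity between the haplotypes.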
Conclusion
While the evidence presented here makes a strong case that the haplotype genomes have distinct genomic structures, it will be interesting to reconcile it with the data obtained in the Duplicate Gene Analyses. Those analyses use the distribution of Ks values (synonymous substitutions per synonymous site) of putative homologous gene pairs to infer several sets of gene duplication events, including recent duplicates, duplicates between the haplotypes, and possible segmental duplications. The sets of genes comprising the peaks in the distribution of Ks values can be mapped to the syntenic data presented here in order to determine the origin and evolution of those gene sets.
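The mapping proposed above could be approached by splitting the Ks distribution according to whether each gene pair also appears in the syntenic data. The following is a minimal sketch under that assumption; the function name, input layout, and bin width are hypothetical, and Ks values are taken as already computed elsewhere.

```python
import math
from collections import Counter


def ks_histogram(gene_pairs, syntenic_pairs, bin_width=0.1):
    """Bin Ks values separately for syntenic and non-syntenic gene pairs.

    gene_pairs: iterable of (gene_a, gene_b, ks).
    syntenic_pairs: set of frozenset({gene_a, gene_b}) from the synteny analysis.
    Returns {"syntenic": Counter, "other": Counter}, keyed by bin index,
    where bin i covers Ks values from i * bin_width up to (i + 1) * bin_width.
    """
    histogram = {"syntenic": Counter(), "other": Counter()}
    for gene_a, gene_b, ks in gene_pairs:
        label = "syntenic" if frozenset((gene_a, gene_b)) in syntenic_pairs else "other"
        # Round before flooring so values on a bin edge are not shifted
        # down by floating-point error.
        bin_index = math.floor(round(ks / bin_width, 9))
        histogram[label][bin_index] += 1
    return histogram
```

Comparing the two histograms would show which Ks peaks are made up mostly of gene pairs lying in syntenic (haplotype or segmental) regions and which are made up of pairs outside them.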

