Distribution of genes

From Purdue Genomics Database Facility


S. moellendorffii genes are compressed into small gene-rich regions, while large gene deserts are retained

T. Nishiyama (tomoakin@kenroku.kanazawa-u.ac.jp)

To explore how S. moellendorffii can have such a small genome, various size measurements were compared among P. patens, S. moellendorffii, and A. thaliana. As a result, introns and 3' UTRs were found to be shorter in S. moellendorffii, while CDS sizes were as large as in A. thaliana. This comparison suggests that S. moellendorffii evolved toward a reduced total genome size, and that only the CDS, which is functionally constrained against shortening, retained its size. Although the measurement of the 5' UTR might be inaccurate because of differing cDNA library construction methods, the peak location is consistent with this hypothesis.


Although the total genome size is small, the S. moellendorffii genome seems to have large repeats with little gene content.

In contrast, the gene-rich regions are very tightly compressed, with little intergenic sequence.

To describe the situation quantitatively, I calculated the distance between neighboring genes from the GFF files as

distance = (this transcript start) - (previous transcript end)

The distribution of this distance was calculated for the S. moellendorffii Filtered Model 2 dataset and the P. patens Filtered model dataset.
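
The distance calculation above can be sketched in Python as follows. This is a minimal illustration and not the script actually used for this page. It assumes a tab-separated GFF in which each transcript is one feature line (the feature type "mRNA" and the function names are assumptions), orders transcripts by start position within each scaffold, ignores strand, and never measures across scaffolds.

```python
from itertools import groupby


def read_spans(gff_lines, feature_type="mRNA"):
    """Collect (seqid, start, end) for every feature of the given type.

    Assumption: transcripts appear as single lines of type `feature_type`.
    """
    spans = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[2] != feature_type:
            continue
        spans.append((cols[0], int(cols[3]), int(cols[4])))
    return spans


def neighbor_distances(spans):
    """distance = (this transcript start) - (previous transcript end),
    computed only between neighbors on the same sequence."""
    distances = []
    for _, group in groupby(sorted(spans), key=lambda span: span[0]):
        prev_end = None
        for _, start, end in group:
            if prev_end is not None:
                distances.append(start - prev_end)
            prev_end = end
    return distances


# Toy input (hypothetical coordinates): the third transcript overlaps the second.
example = [
    "scaffold_1\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=t1",
    "scaffold_1\tsrc\tmRNA\t900\t1500\t.\t-\t.\tID=t2",
    "scaffold_1\tsrc\tmRNA\t1400\t2000\t.\t+\t.\tID=t3",
    "scaffold_2\tsrc\tmRNA\t50\t300\t.\t+\t.\tID=t4",
]
distances = neighbor_distances(read_spans(example))
print(distances)
```

Overlapping transcripts give a negative distance, which is why negative minima can appear in the summary. For GFF files that list only exons, the exons of each transcript would first have to be merged into one span.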

The summary statistics were:

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                   -33074     -17007        -9065
1st Qu.                  393       1647          309
Median                  1086       4916          904
3rd Qu.                 3038      13913         2190
Max                   275050     272201       497963
Mean                    4139      10524         2221

Image:3dists.gif
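
The row labels in the table resemble the output of R's summary(), although the tool actually used is not stated here. A minimal Python sketch of the same six statistics is given below. The function name and the toy values are hypothetical, and the choice of linear-interpolation quartiles (method="inclusive") is an assumption.

```python
import statistics


def summary(values):
    """Return Min, 1st Qu., Median, 3rd Qu., Max and Mean of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "1st Qu.": q1,
        "Median": median,
        "3rd Qu.": q3,
        "Max": max(values),
        "Mean": statistics.fmean(values),
    }


# Toy distances (not real data): one very large gap and one overlap.
toy_distances = [-100, 400, 900, 1500, 20000]
for name, value in summary(toy_distances).items():
    print(f"{name:8s} {value:10.0f}")
```

In the toy data a single very large gap pulls the mean far above the median while the quartiles stay put, which is the kind of effect large gene deserts have on the mean.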


The data show that typical intergenic regions in S. moellendorffii are more than 4-fold smaller than in P. patens (median 1086 vs. 4916), though the mean is less affected because S. moellendorffii contains very large gene deserts and its maximum is larger. The distribution in S. moellendorffii looks similar to that of A. thaliana. However, the above calculation reflects the effort made in A. thaliana to clone full-length cDNAs and identify UTRs, so its annotated transcripts extend further and the distances between them appear shorter. A similar comparison was therefore performed between CDS:

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                   -32950      -8229        -4556
1st Qu.                  445       1812         1363
Median                  1158       5063         2696
3rd Qu.                 3119      14055         4732
Max                   275050     272201       501725
Mean                    4139      10672         3993

Image:3distsCDS.gif

This comparison shows that the typical distance, as represented by the 1st quartile, median, and 3rd quartile, is smallest in S. moellendorffii, while A. thaliana is intermediate.



Speculations

The skew in the pattern of intergenic regions might suggest that the selective force reducing genome size, whatever its nature (which is unclear), acts more easily on euchromatic regions than on heterochromatic regions. Alternatively, heterochromatin may be more stable in some respect, while euchromatin is more easily attacked.


Note: A negative distance indicates that overlapping transcripts were predicted. The extremely large negative values seem to result from the selection of overlapping gene models that are unlikely to be good ones.

S. moellendorffii genes have smaller untranslated regions and introns, while the CDS size is not smaller

The sizes of the 5' and 3' untranslated regions of P. patens and S. moellendorffii were calculated from the start and stop codon annotations and the exon annotations of each gene in the GFF files. For A. thaliana, the five_prime_UTR and three_prime_UTR features were used. The size of the CDS was calculated by summing the sizes of the CDS features belonging to each gene. UTRs of length 0 were assumed to be undetermined and were removed.
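
The UTR and CDS size calculations described above could look roughly like the following Python sketch. It is an illustration under stated assumptions rather than the procedure actually used: all function names and coordinates are hypothetical, each codon feature is taken to be a single unsplit interval, and the stop codon is not counted as part of the 3' UTR.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Bases shared by two closed, 1-based intervals (0 if they do not overlap)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def utr_lengths(exons, start_codon, stop_codon, strand):
    """Return (5' UTR length, 3' UTR length) for one transcript.

    exons, start_codon and stop_codon use GFF-style (start, end) coordinates.
    Assumption: the stop codon is not counted as UTR.
    """
    first = min(start for start, _ in exons)
    last = max(end for _, end in exons)
    if strand == "+":
        five_region = (first, start_codon[0] - 1)
        three_region = (stop_codon[1] + 1, last)
    else:  # on the minus strand the 5' end has the higher coordinates
        five_region = (start_codon[1] + 1, last)
        three_region = (first, stop_codon[0] - 1)
    five = sum(overlap(s, e, *five_region) for s, e in exons)
    three = sum(overlap(s, e, *three_region) for s, e in exons)
    return five, three


def cds_length(cds_features):
    """Sum the sizes of all CDS features of one gene."""
    return sum(end - start + 1 for start, end in cds_features)


# Toy plus-strand gene with two exons (hypothetical coordinates).
exons = [(100, 300), (500, 800)]
utr5, utr3 = utr_lengths(exons, start_codon=(201, 203), stop_codon=(697, 699), strand="+")
cds = cds_length([(201, 300), (500, 699)])
print(utr5, utr3, cds)

# UTRs of length 0 are treated as undetermined and dropped before summarizing.
utr5_lengths = [n for n in [utr5, 0, 42] if n > 0]
```

Only exon bases outside the codon-delimited coding span are counted, so introns inside a UTR do not inflate its length.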

For the CDS comparison, when the complete annotation data were used, the distribution in P. patens showed a strange peak around 400 bp, and its shape seemed different from those of S. moellendorffii and A. thaliana. Since this was suspected to come from erroneous annotation or transposon-associated gene fragments, CDS of actively transcribed genes were compared by using only the CDS data associated with a 3' UTR annotation. Because 3' UTRs are usually identified from EST data, this should restrict the comparison to actively transcribed genes.
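
The restriction to genes with a 3' UTR annotation amounts to a simple filter, sketched below in Python. The gene names, lengths, and function name are hypothetical, and the sketch assumes CDS and 3' UTR lengths have already been computed per gene.

```python
def cds_of_genes_with_3utr(cds_len_by_gene, utr3_len_by_gene):
    """Keep CDS lengths only for genes whose 3' UTR length is known and > 0."""
    return {
        gene: cds_len
        for gene, cds_len in cds_len_by_gene.items()
        if utr3_len_by_gene.get(gene, 0) > 0
    }


# Toy data: geneB has no 3' UTR record, geneC has a 0-length (undetermined) one.
cds_len_by_gene = {"geneA": 1209, "geneB": 399, "geneC": 873}
utr3_len_by_gene = {"geneA": 143, "geneC": 0}
kept = cds_of_genes_with_3utr(cds_len_by_gene, utr3_len_by_gene)
print(kept)
```

Genes without EST-supported 3' UTRs drop out, which is how short, possibly spurious gene models would be excluded from the CDS distribution.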


5' UTR length (bp):

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                        1          1            1
1st Qu.                   37        117           57
Median                   116        236          100
3rd Qu.                  396        445          175
Max                   275050     272201       501725
Mean                     347        393          139

Image:5UTRhist.gif


CDS size, all annotations (bp):

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                      150        150           51
1st Qu.                  684        528          597
Median                  1071        873         1041
3rd Qu.                 1557       1440         1557
Max                    25390      14670        16010
Mean                    1276       1137         1211

Note: the minimum of 150 is perhaps the cut-off of the annotation and not biologically meaningful.

Image:CDS-hist.png

CDS size, genes with a 3' UTR annotation only (bp):

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                      150        156           51
1st Qu.                  800        708          666
Median                  1209       1158         1080
3rd Qu.                 1737       1749         1578
Max                    25390      14580        16010
Mean                    1445       1404         1255

Image:CDSw3UTR.gif

3' UTR length (bp):

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                        1          1            1
1st Qu.                   79        223          151
Median                   143        351          205
3rd Qu.                  302        520          269
Max                     7840       7901         3118
Mean                     308        459          227

Image:3UTRhist.gif

Intron length (bp):

Statistic  S. moellendorffii  P. patens  A. thaliana
Min                        0          0            1
1st Qu.                   52        137           86
Median                    58        206           99
3rd Qu.                   72        323          166
Max                    48860      42110        10234
Mean                      58        206          164

Image:intronhist.gif
