Distribution of genes
From Purdue Genomics Database Facility
S. moellendorffii genes are compressed into small gene-rich regions while large gene deserts are retained
T. Nishiyama (tomoakin@kenroku.kanazawa-u.ac.jp)
To explore how S. moellendorffii could have such a small genome, various size measurements were compared among P. patens, S. moellendorffii, and A. thaliana. As a result, introns and 3' UTRs were found to be shorter in S. moellendorffii, while CDS sizes remained as large as in A. thaliana. This comparison suggests that S. moellendorffii evolved toward a smaller total genome size and that only the CDS, which is functionally constrained against shortening, retained its size. Although the measurement of 5' UTRs might be inaccurate because of differences in cDNA library construction methods, the peak location is consistent with this hypothesis.
Although the total genome size is small, the S. moellendorffii genome seems to have large repeats with little gene content.
In contrast, the gene-rich regions are very tightly compressed, with little intergenic sequence.
To describe the situation quantitatively, I calculated the distance between neighboring genes based on the GFF files as
distance = (this transcript start) - (previous transcript end)
The distribution of this distance was calculated for the S. moellendorffii Filtered Model 2 dataset and the P. patens Filtered model dataset.
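The distance definition above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for this page: the function names, the `mRNA` feature type, and the example rows are assumptions, and strand is ignored. Transcripts are ordered by start coordinate within each scaffold, and each start is compared with the end of the preceding transcript.

```python
from collections import defaultdict

def transcript_spans(gff_lines, feature="mRNA"):
    """Group (start, end) of every `feature` row by sequence ID (GFF column 1).

    The feature name "mRNA" is an assumption; use whatever row type marks
    a whole transcript in the GFF file at hand.
    """
    spans = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, blank lines, and rows without the nine GFF columns.
        if line.startswith("#") or len(cols) < 9:
            continue
        if cols[2] == feature:
            spans[cols[0]].append((int(cols[3]), int(cols[4])))
    return spans

def neighbor_distances(spans):
    """distance = (this transcript start) - (previous transcript end)."""
    distances = []
    for seqid in sorted(spans):
        ordered = sorted(spans[seqid])  # by start, then by end
        for (_, prev_end), (start, _) in zip(ordered, ordered[1:]):
            distances.append(start - prev_end)
    return distances

# Made-up rows for illustration only.
example = [
    "scaffold_1\tsrc\tmRNA\t100\t200\t.\t+\t.\tID=t1",
    "scaffold_1\tsrc\tmRNA\t500\t900\t.\t-\t.\tID=t2",
    "scaffold_1\tsrc\tmRNA\t850\t1000\t.\t+\t.\tID=t3",
    "scaffold_2\tsrc\tmRNA\t10\t50\t.\t+\t.\tID=t4",
]
print(neighbor_distances(transcript_spans(example)))
```

In the example, the third transcript starts before the second one ends, so its distance is negative. A scaffold with a single transcript contributes no distance.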
The summary statistics were:
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | -33074 | -17007 | -9065 |
| 1st Qu. | 393 | 1647 | 309 |
| Median | 1086 | 4916 | 904 |
| 3rd Qu. | 3038 | 13913 | 2190 |
| Max | 275050 | 272201 | 497963 |
| Mean | 4139 | 10524 | 2221 |
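A six-number summary with these row labels can be reproduced as sketched below. The page does not state which tool produced the table, so the quartile method is an assumption: `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which is also the default in many statistics packages. The function name `six_number_summary` is hypothetical.

```python
import statistics

def six_number_summary(values):
    """Return Min, quartiles, Max, and Mean for a list of distances."""
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between neighboring data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "1st Qu.": q1,
        "Median": median,
        "3rd Qu.": q3,
        "Max": max(values),
        "Mean": statistics.fmean(values),
    }

# Toy input; real input would be the list of neighbor distances.
for label, value in six_number_summary([1, 2, 3, 4, 5]).items():
    print(label, value)
```

Because the mean is sensitive to a few very large values while the quartiles are not, comparing the two gives a quick indication of a long right tail in the distance distribution.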
The data show that typical intergenic regions are more than 4-fold smaller than in P. patens (for example, a median of 1086 versus 4916), though the mean is less affected because S. moellendorffii contains very large gene deserts and its maximum is larger. The distribution in S. moellendorffii looks similar to that of A. thaliana. However, the above calculation reflects the effort made to clone full-length cDNAs and identify the UTRs in A. thaliana. A similar comparison was therefore performed using the distance between CDSs:
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | -32950 | -8229 | -4556 |
| 1st Qu. | 445 | 1812 | 1363 |
| Median | 1158 | 5063 | 2696 |
| 3rd Qu. | 3119 | 14055 | 4732 |
| Max | 275050 | 272201 | 501725 |
| Mean | 4139 | 10672 | 3993 |
This comparison shows that the typical distance, as represented by the 1st quartile, median, and 3rd quartile, is smallest in S. moellendorffii, with A. thaliana intermediate.
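The fold differences behind these comparisons can be checked directly from the tabulated medians and means. The short sketch below only restates numbers from the two distance tables on this page; the dictionary layout and the function name `fold_vs_selaginella` are illustrative assumptions.

```python
# Values copied from the transcript-based and CDS-based distance tables.
transcript_based = {
    "Median": {"S. moellendorffii": 1086, "P. patens": 4916, "A. thaliana": 904},
    "Mean": {"S. moellendorffii": 4139, "P. patens": 10524, "A. thaliana": 2221},
}
cds_based = {
    "Median": {"S. moellendorffii": 1158, "P. patens": 5063, "A. thaliana": 2696},
    "Mean": {"S. moellendorffii": 4139, "P. patens": 10672, "A. thaliana": 3993},
}

def fold_vs_selaginella(row):
    """Express each species' value as a multiple of the S. moellendorffii value."""
    base = row["S. moellendorffii"]
    return {species: value / base for species, value in row.items()}

for label, table in (("transcript", transcript_based), ("CDS", cds_based)):
    for stat, row in table.items():
        folds = fold_vs_selaginella(row)
        print(label, stat, {sp: round(f, 2) for sp, f in folds.items()})
```

The median ratio for P. patens is above 4 in both tables, while the mean ratio is much lower, which is the pattern expected when a few very large gene deserts inflate the S. moellendorffii mean.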
Speculations
The skew in the pattern of intergenic regions might suggest that the selective force to reduce genome size, whatever its nature (which is unclear), acts more readily on euchromatic regions than on heterochromatic regions. Alternatively, heterochromatin may be more stable in some regard, while euchromatin is more easily attacked.
Note: A negative distance indicates that overlapping transcripts were predicted.
The extremely large negative values seem to result from the selection of
overlapping gene models that are unlikely to be good ones.
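Overlapping neighbors can be separated from the rest with a simple sign test, as in the sketch below. The function name and the `extreme` cut-off of -1000 are hypothetical choices for illustration; the page does not define what counts as an extremely large negative value.

```python
def split_overlaps(distances, extreme=-1000):
    """Separate overlapping neighbors (negative distance) from the rest.

    `extreme` is a hypothetical cut-off for flagging gene models whose
    overlap is large enough to deserve manual inspection.
    """
    overlapping = [d for d in distances if d < 0]
    suspicious = [d for d in overlapping if d <= extreme]
    non_overlapping = [d for d in distances if d >= 0]
    return overlapping, suspicious, non_overlapping

# Toy input mixing ordinary gaps with a small and a very large overlap.
overlapping, suspicious, non_overlapping = split_overlaps([300, -50, -33074, 1086])
print(len(overlapping), "overlaps, of which", len(suspicious), "look suspicious")
```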
S. moellendorffii genes have smaller untranslated regions and introns, while the CDS size is not smaller
The sizes of the 5' and 3' untranslated regions of P. patens and S. moellendorffii were calculated from the start and stop codon annotations and the exon annotations of each gene in the GFF files. For A. thaliana, the five_prime_UTR and three_prime_UTR features were used. The size of the CDS was calculated by summing the sizes of the CDS features belonging to each gene. UTRs of length 0 were assumed to be undetermined and were removed.
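One way to derive UTR sizes from exon and start/stop codon coordinates is sketched below. This is a simplified illustration under stated assumptions, not the original script: coordinates are 1-based and inclusive, the stop codon is not counted as UTR, and the function names and coordinate tuples are hypothetical. Exon bases lying outside the span from start codon to stop codon are counted on each side, and the strand decides which side is the 5' end.

```python
def exon_bases_outside(exons, lo, hi):
    """Count exon bases left of `lo` and right of `hi` (1-based, inclusive)."""
    left = sum(max(0, min(end, lo - 1) - start + 1) for start, end in exons)
    right = sum(max(0, end - max(start, hi + 1) + 1) for start, end in exons)
    return left, right

def utr_and_cds_sizes(exons, cds, start_codon, stop_codon, strand):
    """Return 5' UTR, 3' UTR, and CDS sizes for one gene model."""
    codon_lo = min(start_codon[0], stop_codon[0])
    codon_hi = max(start_codon[1], stop_codon[1])
    left, right = exon_bases_outside(exons, codon_lo, codon_hi)
    # On the minus strand the 5' end is at the higher coordinate.
    five, three = (left, right) if strand == "+" else (right, left)
    cds_size = sum(end - start + 1 for start, end in cds)
    # A zero-length UTR is treated as undetermined, as described in the text.
    return {
        "five_prime_UTR": five or None,
        "three_prime_UTR": three or None,
        "CDS": cds_size,
    }

# Made-up two-exon gene on the plus strand.
print(utr_and_cds_sizes(
    exons=[(100, 300), (400, 700)],
    cds=[(151, 300), (400, 600)],
    start_codon=(151, 153),
    stop_codon=(598, 600),
    strand="+",
))
```

Whether the stop codon is included in the CDS features differs between annotation sources, so the CDS size here is simply the sum of whatever CDS intervals are supplied.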
For the CDS comparison, when all of the annotation data were used, the P. patens distribution showed a strange peak around 400 bp, and its shape seemed different from those of S. moellendorffii and A. thaliana. Since this was suspected to come from erroneous annotation or transposon-associated gene fragments, the CDS of actively transcribed genes was compared by using only the CDS data associated with a 3' UTR annotation. Because 3' UTRs are usually identified from EST data, this should have restricted the comparison to actively transcribed genes.
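The restriction to genes with a 3' UTR annotation amounts to a simple filter, sketched here. The per-gene record layout (a dictionary with `CDS` and `three_prime_UTR` keys) and the function name are hypothetical; a missing or zero-length 3' UTR is treated as undetermined.

```python
def cds_sizes_with_three_prime_utr(genes):
    """Keep CDS sizes only for genes with a determined (non-zero) 3' UTR."""
    return [g["CDS"] for g in genes if g.get("three_prime_UTR")]

# Made-up records: the second and third genes lack a usable 3' UTR.
genes = [
    {"name": "gene_a", "CDS": 1209, "three_prime_UTR": 143},
    {"name": "gene_b", "CDS": 402, "three_prime_UTR": None},
    {"name": "gene_c", "CDS": 399, "three_prime_UTR": 0},
    {"name": "gene_d", "CDS": 1737, "three_prime_UTR": 302},
]
print(cds_sizes_with_three_prime_utr(genes))
```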
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | 1 | 1 | 1 |
| 1st Qu. | 37 | 117 | 57 |
| Median | 116 | 236 | 100 |
| 3rd Qu. | 396 | 445 | 175 |
| Max | 275050 | 272201 | 501725 |
| Mean | 347 | 393 | 139 |
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | 150 | 150 | 51 |
| 1st Qu. | 684 | 528 | 597 |
| Median | 1071 | 873 | 1041 |
| 3rd Qu. | 1557 | 1440 | 1557 |
| Max | 25390 | 14670 | 16010 |
| Mean | 1276 | 1137 | 1211 |

Note on Min: 150 is perhaps the cut-off of the annotation and not biologically meaningful.
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | 150 | 156 | 51 |
| 1st Qu. | 800 | 708 | 666 |
| Median | 1209 | 1158 | 1080 |
| 3rd Qu. | 1737 | 1749 | 1578 |
| Max | 25390 | 14580 | 16010 |
| Mean | 1445 | 1404 | 1255 |
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | 1 | 1 | 1 |
| 1st Qu. | 79 | 223 | 151 |
| Median | 143 | 351 | 205 |
| 3rd Qu. | 302 | 520 | 269 |
| Max | 7840 | 7901 | 3118 |
| Mean | 308 | 459 | 227 |
| statistics | S. moellendorffii | P. patens | A. thaliana |
| --- | --- | --- | --- |
| Min | 0 | 0 | 1 |
| 1st Qu. | 52 | 137 | 86 |
| Median | 58 | 206 | 99 |
| 3rd Qu. | 72 | 323 | 166 |
| Max | 48860 | 42110 | 10234 |
| Mean | 58 | 206 | 164 |







