Contamination analysis

From Purdue Genomics Database Facility

Articles

MEGAN analysis

MEGAN (MEtaGenome ANalyzer) is a tool for analyzing large metagenomic datasets (see the MEGAN homepage and the online publication). MEGAN was used to check the Selaginella genome assembly for contamination, based on a blastp search against Genpept (rel. 162). For this purpose, the Selaginella genomic scaffolds were cut into 2000 nt subsequences and subjected to BLAST. MEGAN was then used to map the BLAST hits onto the NCBI taxonomy in order to summarize and order the results.
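The splitting step can be illustrated with a minimal Python sketch. It is not the script used for this analysis. The function name `split_sequence`, the label format, and the choice of non-overlapping windows starting at position 1 (with a shorter final window) are assumptions made for illustration only.

```python
def split_sequence(name, seq, size=2000):
    """Yield (label, subsequence) pairs for consecutive windows of `size` nt.

    Assumption: windows do not overlap, start at position 1, and the last
    window may be shorter than `size`.
    """
    for start in range(0, len(seq), size):
        chunk = seq[start:start + size]
        # Label with 1-based, inclusive coordinates: name:start-end
        yield f"{name}:{start + 1}-{start + len(chunk)}", chunk


# Toy input: a made-up 5000 nt sequence, not real Selaginella data
toy_chunks = list(split_sequence("toy_scaffold", "ACGT" * 1250))
toy_labels = [label for label, _ in toy_chunks]
```

With the toy input, the final window covers only the remaining 1000 nt; whether such short tail fragments were kept in the actual analysis is not stated here.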


MEGAN mapping results

Filtering preferences:

  • bit score cut-off: 50
  • top percentage score: 10
  • min support for taxa: 5
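
To make the first two preferences concrete, the following Python sketch shows one common reading of them: hits below the bit score cut-off are dropped, and of the remaining hits only those scoring within the top percentage of the best hit for that read are kept. This is a simplified illustration, not MEGAN's code; the function name `filter_hits` and the toy data are hypothetical, and the "min support" step (requiring a minimum number of reads per taxon) is not modeled.

```python
MIN_BIT_SCORE = 50   # bit score cut-off from the preferences above
TOP_PERCENT = 10     # top percentage score from the preferences above


def filter_hits(hits, min_score=MIN_BIT_SCORE, top_percent=TOP_PERCENT):
    """Filter the hits of one read.

    hits: list of (taxon, bit_score) pairs (hypothetical input format).
    Returns the hits that pass the cut-off and lie within `top_percent`
    percent of the best passing bit score.
    """
    passing = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not passing:
        return []
    best = max(score for _, score in passing)
    threshold = best * (1 - top_percent / 100)
    return [(taxon, score) for taxon, score in passing if score >= threshold]


# Toy hits for a single read (made-up taxa and scores)
toy_hits = [("taxon_A", 200.0), ("taxon_B", 185.0), ("taxon_C", 120.0), ("taxon_D", 40.0)]
kept = filter_hits(toy_hits)
```

In this toy case the best score is 200, so only hits scoring at least 180 are retained for taxonomic placement.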

Summary

  • File: Selaginella_moellendorfii_split_2000.fas.megan2
  • Reads total: 107693
  • Reads assigned: 62724
  • Reads unassigned: 2272
  • Reads with no hits: 42967
  • Reads that only hit unknown taxa: 29
  • Hits total: 107963

File:Selaginella moellendorffi bioperled split 200043.png

The size of each circle is proportional to the number of sequences assigned to the corresponding taxon/genus. The numbers on the right are the numbers of Selaginella subsequences yielding a significant hit.

The hits in Metazoa and Fungi are mainly transposon related. Such hits are also observed in Physcomitrella and should not be a cause for concern.

There are significant hits in Bacteria, especially B. selenitireducens. This organism was also sequenced by the JGI (B. selenitireducens genome draft). We looked at these hits in more detail and noticed that only two proteins account for most of the hits.

These two proteins are:

Genpept accession | Description | Number of Selaginella subsequences with hits
EDP81118.1 | hypothetical protein BselDRAFT_2604 [Bacillus selenitireducens MLS10] | 1620
EDP81119.1 | isochorismatase hydrolase [Bacillus selenitireducens MLS10] | 955
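
A per-protein tally like the one above can be obtained by counting, for each subject accession, how many distinct query subsequences hit it. The sketch below assumes the BLAST results have already been reduced to (query ID, subject accession) pairs; the function name `queries_per_subject` and the toy rows are hypothetical and do not reproduce the real hit data.

```python
from collections import defaultdict


def queries_per_subject(rows):
    """Count distinct query IDs per subject accession.

    rows: iterable of (query_id, subject_accession) pairs (assumed format).
    Returns (accession, count) pairs, largest count first.
    """
    seen = defaultdict(set)
    for query_id, subject in rows:
        seen[subject].add(query_id)
    counts = [(subject, len(queries)) for subject, queries in seen.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Toy rows (made-up query IDs; "OTHER.1" is a placeholder accession)
toy_rows = [
    ("sub_1", "EDP81118.1"),
    ("sub_1", "EDP81118.1"),  # second alignment of the same pair, counted once
    ("sub_2", "EDP81118.1"),
    ("sub_2", "EDP81119.1"),
    ("sub_3", "OTHER.1"),
]
toy_counts = queries_per_subject(toy_rows)
```

Using sets means a subsequence with several alignments to the same protein contributes only once to that protein's count.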

The Selaginella scaffold subsequences yielding these hits are spread all over the Selaginella scaffolds.

Is there contamination all over the Selaginella scaffolds? To answer this question, we had a closer look at these regions in the Selaginella genome, e.g.:

  • scaffold_121:440881-442880
  • scaffold_1:2169498-2171497
  • scaffold_5:297595-299594
  • scaffold_27:1465927-1467926
  • scaffold_3:1458579-1460578

These regions seem to be always intergenic and related to LTR retrotransposons. In conclusion, the Selaginella sequences producing these Bacillus selenitireducens protein hits are repetitive in the Selaginella genome and might be related to LTR retrotransposons. The fact that only two "Bacillus selenitireducens" proteins produce all these ~2000 hits suggests that these proteins are in fact contamination in the Bacillus selenitireducens genome. The two proteins, EDP81118.1 and EDP81119.1, lie on a single contig (4000241_Cont78) of the Bacillus genome. The results above support the conclusion that at least this contig is not bacterial but belongs to Selaginella.

Finally, the results suggest that there is no obvious bacterial contamination in the Selaginella genome.

G/C content of Selaginella genomic scaffolds

We used the software geecee (EMBOSS geecee) to calculate the fraction of G+C bases of the genomic scaffolds.
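
For readers who want to reproduce the idea without EMBOSS, the G+C fraction of a sequence can be sketched in a few lines of Python. This is not geecee itself. In particular, the handling of ambiguous bases (ignoring everything except A, C, G and T) is an assumption of this sketch and may differ from geecee's behavior; the function name `gc_fraction` is hypothetical.

```python
def gc_fraction(seq):
    """Return the fraction of G+C among the unambiguous bases of `seq`.

    Assumption: characters other than A, C, G, T (e.g. N) are ignored.
    Returns 0.0 if the sequence contains no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Toy sequences, not real scaffold data
toy_values = [gc_fraction(s) for s in ("GGCCAATT", "ggccNN", "NNNN")]
```

Computing this value per scaffold and plotting the distribution gives the kind of plot shown below, in which a contaminating organism with a different base composition would be expected to appear as a separate peak.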

Selaginella genomic scaffolds G/C distribution plot.

File:Selaginella geecee.png

The absence of a detectable secondary G/C peak indicates that there is no obvious or large-scale contamination in the genome assembly.


More Information

For further information and questions please contact:
andreas.zimmer@biologie.uni-freiburg.de
stefan.rensing@biologie.uni-freiburg.de
