BioinformaticsLearnTeam solutions

DNA Case team Backstabbers

We developed workflow utilizing Blast and Centrifuge toolkits, that is able to provide precise metagenomics information about food composition, from comparing DNA reads with reference genomes of various species. Our workflow is optimized to work on Google Cloud instance (Compute Engine) with 24 CPUs and 200 GB of RAM.

10
votes

What really goes into sausages?


Research Problem Background

General problem with any metagenomics study is that the sequenced sample consists of multiple, sometimes very numerous, complex organisms. Given that sequence reads are relatively short and organisms may be closely related, it becomes clear that accurate mapping is needed in order to determine which read originates from which organism.

Business Problem Background

Malicious or accidental crisis in food processing industry nowadays have  cross-regional, national or international reach,  and are certainly followed by significant risk to public health and  safety along with  substantial financial loss to consumers or businesses. Biosurveillance emerged as the first line of defense against threats of  potential disease outbreak. But it comes with a cost. Enormous expert man power and computational resources are needed, since multiple DNA references need to be checked against all reads. This is often a very slow process since reference index must be created for each genome of interest, which is both computationally and financially expensive.

(!) WIDGET INSIDE AN ARTICLE (!)

Here we demonstrate use of alternative approach, utilizing free Centrifuge toolkit, to map reads to all given references at the same time. Contrary to what one might believe, this process is very fast since this toolkit uses one single index for all references, which is very small (~4GB for all given organisms in this task).

Case

We developed a straightforward workflow for this case (see diagram below). We want to demonstrate that running a well structured underlying pipeline can easily be done in the cloud. We do this by building a web interface for successfully building and executing tools for this case.

 

Sequenced reads, provided in file Case2.fasta were mapped to hybrid index created by Centrifuge-build tool from six genomes obtained from NCBI. Additional files we used:

  • names.dmp and nodes.dmp, obtained from this link.
  • taxid2seqid.map, manually created for six genomes. Python jupyter notebook used for generating taxid2seqid.map is presented below  link

 

Following command line was used for this step:

centrifuge-build -p <num_threads> --conversion-table <taxid2seqid.map> --taxonomy-tree <nodes.dmp> --name-table <names.dmp> <fasta.list> <output_base_name>

Once built, we used index to map reads from input fasta file to it. This is done in the matter of minutes. The tool produces classification file along with summary metrics. Command line for running centrifuge is below.

centrifuge --report-file <input_file-1.Centrifuge_report.tsv> -x $basename -U </path/to/input_file-1.ext>,</path/to/input_file-2.ext> -S <input_file-1.Classification_result.txt>

Results

Classification results file shows that even though majority of reads has been uniquely mapped to specific reference, large number of them are multimapped. We split these results into two sets. One with uniquely mapped reads and other with ambiguous ones. Below are shown read counts from output of Centrifuge classifier.

 

To improve accuracy of results, we preformed BLAST search of all reads against reference genomes. Ones from set2 were than cross-referenced with BLAST results and this information was used to confirm the correct origin of each read. Finally, we used BLAST report information to further reduce the number of unclassified (unassigned) reads. We observed a more than 3 fold reduction of unassigned reads, thus showing great improvement in analysis sensitivity. The finale summary metrics are shown below.

 

Python code used for this analysis is available here:

final_analysis (1)

 

We also provide full bash script for running this workflow.

Pipeline Bash Command

A Hidden Gem

In today’s increasing computational requirements of bioinformatics tools, we recognize the need for fast and scalable way of executing tasks on large computing instances. Therefore, we made little extra effort to build a simple but scalable web platform that interacts with computing instances required for running tools for this case. The website is available at https://datathon2018.herokuapp.com/. You can find all files and tools we used during Datathon. Please note that due to our limited budget, we have restricted use of storage and resources. However, we have enabled usage of computing resources for the final analysis, which is used for parsing and evaluating data obtained from Centrifuge and BLAST (see section above).

If you have problem with the link above, try deleting “https://” from the address.

 

Data Section

 

Analysis Section

 

Results Section

 

 

Links

[1] Centrifuge Homepage

[2] Centrifuge Paper

[3] Blast Homepage

[4] Blast Paper

[5] Taxonomy Download

 

 

Share this

6 thoughts on “DNA Case team Backstabbers

  1. 1
    votes

    Excellent presentation, I like the diagrams and barplots. Shows good command of visualization tools – such an important aspect of data analysis. Results are plausible. I tried the website, but could not figure out the credentials and it was too late to ask.

  2. 1
    votes

    Very good article. Clear, explanatory, high level. The model diagram helps a lot to the understanding; the online demo and the demonstrative video are very impressive and a great addition. The model itself looks very promising. Great work, guys!

  3. 2
    votes

    Great job guys! Besides covering the task requirements, you took time to add some extra spice.

    I especially enjoy the URL: sausage-dna.science! 😀

Leave a Reply