high performance computing on graphics processing units: hgpu.org

Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis

Johannes Köster

Technische Universität Dortmund, 2014

Package: PEANUT

The analysis of next-generation sequencing (NGS) data is a major topic in bioinformatics: short reads obtained from DNA, the molecule encoding the genome of living organisms, are processed to provide insight into biological or medical questions. This thesis provides novel solutions to major problems in the analysis of NGS data, focusing on parallelization, scalability, and reproducibility.

The read mapping problem is to find the origin of the short reads within a given reference genome. We contribute the q-group index, a novel data structure for read mapping with a particularly small memory footprint. The q-group index comes with massively parallel build and query algorithms targeted at modern graphics processing units (GPUs). Building on it, we present the read mapping software PEANUT, which outperforms state-of-the-art read mappers in speed while maintaining their accuracy.

The variant calling problem is to infer (i.e., call) genetic variants of individuals relative to a reference genome using mapped reads. It is usually solved with a Bayesian approach. Variant calling is often followed by filtering the variants of different biological samples against each other. In state-of-the-art solutions, the filtering is decoupled from the calling, which makes it difficult to control the false discovery rate. In this work, we show how to integrate the filtering into the calling with an algebraic approach. This yields an intuitive way to control the false discovery rate and addresses other challenges of variant calling, such as scaling with a growing set of biological samples. To this end, we present a hierarchical index data structure for storing preprocessing results, together with compression strategies. The developed methods are implemented in the software ALPACA.

Depending on the research question, the analysis of NGS data entails many other steps, typically involving diverse tools, data transformations, and aggregation of results. These steps can be orchestrated by workflow management. We present the general-purpose workflow system Snakemake, which provides an easy-to-read domain-specific language for defining and documenting workflows, thereby ensuring the reproducibility of analyses. The language is complemented by an execution environment that scales a workflow to the available resources, including parallelization across CPU cores or cluster nodes and limits on memory usage or on the number of available coprocessors such as GPUs. The benefits of Snakemake are demonstrated by combining the presented approaches for read mapping and variant calling into a complete, scalable, and reproducible NGS analysis.
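
To make the read mapping problem from the abstract concrete, the following is a minimal Python sketch of seeding with a plain q-gram index. It is not the q-group index and not PEANUT's algorithm: it uses an ordinary hash table, runs on the CPU, and handles only exact q-gram matches. The function names, the toy reference, and the choice q = 4 are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def build_qgram_index(reference: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of the reference to the positions where it starts.

    Hypothetical helper: a plain hash-based index, not the q-group index.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    return dict(index)


def candidate_origins(read: str, index: dict[str, list[int]], q: int) -> list[tuple[int, int]]:
    """Rank candidate start positions of the read in the reference.

    A q-gram at read offset i that occurs at reference position p
    votes for the read starting at p - i. Candidates are returned as
    (start, votes) pairs, most votes first.
    """
    votes = Counter()
    for offset in range(len(read) - q + 1):
        for pos in index.get(read[offset:offset + q], ()):
            votes[pos - offset] += 1
    return votes.most_common()


# Toy data (assumed, not from the thesis).
reference = "ACGTTGCAACGTAGCTAGGCTA"
read = "AACGTAGC"
q = 4

index = build_qgram_index(reference, q)
best_start, best_votes = candidate_origins(read, index, q)[0]
print(best_start, best_votes)  # prints "7 5"
```

In this toy case all five q-grams of the read agree on start position 7, while a repeated q-gram (ACGT) casts one stray vote elsewhere. A real read mapper would then verify the top candidates with an alignment step that tolerates mismatches and indels.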

Tags: Algorithms, Bayesian, Bioinformatics, Biology, Filtering, Next-Generation sequencing, nVidia, nVidia GeForce GTX 580, nVidia GeForce GTX 780, OpenCL, Package, Thesis

March 23, 2015 by hgpu
