Transcriptomics

Brief introduction

The Transcriptomics Atlas gathers uniformly processed data from a representative set of human tissues. This resource can be used in a wide range of scientific applications, including pharmacogenomics and biomarker discovery, where transcriptomic analyses are often performed in a comparative framework that examines different health states, diseases, or stimuli relative to baseline conditions. Using the Transcriptomics Atlas should reduce the monetary and operational costs of transcriptomics and wet lab experiments. The Transcriptomics Atlas can also serve as a reference database for future clinical practice, when gene expression profiles are used in disease diagnosis and treatment. Moreover, faster processing of RNA sequences with the STAR aligner can improve medical services if STAR is used for diagnosis by medical specialists. This, in turn, supports more accurate and timely diagnosis and treatment, which is key to the development of healthcare systems in modern society.

Introduction of the problem

The efficient processing of large-scale RNA sequences presents significant challenges. Bioinformatics tools, such as the STAR aligner, consume substantial CPU and memory resources. For example, processing over 7000 FASTQ files means handling 130 TB of data. Moreover, the Transcriptomics Atlas pipeline requires a high-throughput solution for data access and for distributing the index to the worker nodes. While cloud services provide a plethora of options, identifying the optimal architecture and configuration remains a complex undertaking.

How NEARDATA will address the challenge

To attain high computational efficiency, we implemented numerous optimizations within the NearData project, encompassing both application-specific and cloud-related improvements. For instance, the early stopping functionality integrated into the STAR aligner enhances system throughput by 20% by terminating the alignment process for low-quality files. Moreover, the Transcriptomics Atlas pipeline itself has been tailored for the cloud environment:

The solution effectively utilizes spot instances, reducing the computational costs by 50%-60%.
It efficiently distributes the STAR index to worker nodes.
It leverages the most cost-effective instance type, r7a.2xlarge with the newest processors on EC2, which is 25% cheaper and also faster than the older generation.

The optimizations implemented for the Transcriptomics Atlas pipeline can be applied to other bioinformatics pipelines and compute environments.
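The early stopping idea mentioned above can be illustrated with a minimal sketch. It assumes the aligner's progress can be observed as a count of processed reads and a running mapping rate; the `Progress` structure, the function name, and both thresholds are hypothetical and chosen only for illustration, not taken from the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    """One progress snapshot of a running alignment (hypothetical structure)."""
    reads_processed: int
    reads_total: int
    mapped_fraction: float  # fraction of processed reads mapped so far, 0.0 to 1.0


def should_stop_early(p: Progress,
                      min_processed_fraction: float = 0.10,   # assumed threshold
                      min_mapped_fraction: float = 0.30) -> bool:  # assumed threshold
    """Return True when enough reads were seen and the mapping rate is still too low."""
    if p.reads_total <= 0:
        return False
    processed_fraction = p.reads_processed / p.reads_total
    if processed_fraction < min_processed_fraction:
        return False  # too early to judge the quality of this file
    return p.mapped_fraction < min_mapped_fraction
```

A caller would poll the aligner's progress periodically and terminate the alignment process once `should_stop_early` returns `True`, freeing the worker for the next file.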

How it will work

In the serverless version of the pipeline, we could use NearData tools and connectors, which enable efficient parallelization of workflows in cloud and HPC environments while keeping data movement intact:

The DataPlug connector, which we extended with a new backend for the SRA data format by introducing new system-level data-slicing techniques,
Lithops HPC, which we ported to the Ares cluster at the Cyfronet Academic Computer Centre in Krakow, allowing us to parallelize the pseudoalignment part of the pipeline using the Salmon tool,
PyRun, an advanced serverless platform that allows simple automated deployment and parallelization of relevant applications within the pipeline.
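The slice-and-map pattern behind these tools can be sketched roughly as follows. This is not the DataPlug or Lithops API: the example uses Python's standard `concurrent.futures` as a local stand-in for a map-style executor, and `slice_ranges`, `process_slice`, and `run_parallel` are hypothetical names. Plain byte ranges are also a simplification, since slicing a real sequencing format would presumably have to respect record boundaries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def slice_ranges(total_size: int, num_slices: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into num_slices contiguous half-open ranges."""
    base, extra = divmod(total_size, num_slices)
    ranges, start = [], 0
    for i in range(num_slices):
        # The first `extra` slices take one additional unit each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def process_slice(byte_range: Tuple[int, int]) -> int:
    """Hypothetical per-slice task; a real one would read and quantify the reads."""
    start, end = byte_range
    return end - start  # here: just report how much data the slice covers


def run_parallel(total_size: int, num_slices: int, max_workers: int = 4) -> List[int]:
    """Map the per-slice task over all slices using a pool of workers."""
    ranges = slice_ranges(total_size, num_slices)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_slice, ranges))
```

Because each slice is described only by its range, workers can fetch just their own part of the input, and the same map call could in principle be dispatched to serverless functions or HPC nodes instead of local threads.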

Summary of some results

The optimizations we introduced within NearData into the Transcriptomics Atlas pipeline can be summarized as contributing to the following key performance indicators:

Significant performance improvements (data throughput, data transfer reduction).
Early stopping, which enhances system throughput by ~20%.
Execution time reduction of up to ~12x due to the use of a newer release of the human genome.
Spot instances reduce the cost by a factor of ~2.
Cost-efficient instance type for STAR reduces the monetary cost by ~25%.
Demonstrated resource auto-scaling for batch and stream data processing.
Scalability of the solution and the underlying tools, with experiments showing speedups of up to 100x due to efficient parallel processing.
Adaptation of the pipeline to dynamic spot instances.
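One way a worker can adapt to dynamic spot instances is to watch for the EC2 spot interruption notice and hand unfinished work back to a queue. The sketch below is an assumption about how such handling could look, not a description of the actual pipeline; `parse_interruption`, `drain_queue`, and the callback parameters are hypothetical names. The metadata path is the documented EC2 one, but the HTTP call is left to the caller so that the logic stays testable.

```python
import json
from typing import Callable, Iterable, List, Optional

# Documented EC2 instance-metadata path for spot interruption notices.
# It returns HTTP 404 while no interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_interruption(body: Optional[str]) -> Optional[dict]:
    """Return the notice as a dict, or None when no interruption is scheduled."""
    if not body:
        return None
    try:
        notice = json.loads(body)
    except json.JSONDecodeError:
        return None
    return notice if isinstance(notice, dict) and "action" in notice else None


def drain_queue(tasks: Iterable[str],
                fetch_notice: Callable[[], Optional[str]],
                run_task: Callable[[str], None],
                requeue: Callable[[str], None]) -> List[str]:
    """Run tasks one by one; on an interruption notice, requeue the rest and stop."""
    done: List[str] = []
    pending = list(tasks)
    while pending:
        if parse_interruption(fetch_notice()) is not None:
            for task in pending:
                requeue(task)  # give unfinished work back to the shared queue
            break
        task = pending.pop(0)
        run_task(task)
        done.append(task)
    return done
```

In a real deployment, `fetch_notice` would issue a GET request to `SPOT_ACTION_URL` (with a session token when IMDSv2 is enforced) and return `None` on a 404 response. Checking only between tasks is a simplification; a long-running alignment would more likely be monitored from a background thread.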

Contact us

Project Coordinator

Dr. Pedro García López

pedro.garcia@urv.cat

NEARDATA has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101092644.