Transcriptomics

Brief introduction

The Transcriptomics Atlas collects uniformly processed data from a representative set of human tissues. This resource is useful in a wide range of scientific applications, including pharmacogenomics and biomarker discovery, where transcriptomic analyses are often performed in a comparative framework that examines different health states, diseases, or stimuli relative to baseline conditions. Using the Transcriptomics Atlas should reduce the monetary and operational costs of transcriptomics and wet lab experiments. The Transcriptomics Atlas can also serve as a reference database for future clinical practice, when gene expression profiles are used in disease diagnosis and treatment. Moreover, faster processing of RNA sequences with the STAR aligner can improve medical services if STAR is used for diagnosis by medical specialists. This, in turn, supports more accurate and timely diagnosis and treatment, which is key to the development of healthcare systems in modern society.


Introduction of the problem

Processing large-scale RNA sequence data efficiently presents significant challenges. Bioinformatics tools such as the STAR aligner consume substantial CPU and memory resources. For example, processing over 7000 FASTQ files means handling 130 TB of data. Moreover, the Transcriptomics Atlas pipeline requires a high-throughput solution for data access and index distribution to the worker nodes. While cloud services provide a plethora of options, identifying the optimal architecture and configuration remains a complex undertaking.
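To give a rough sense of scale, the figures above can be turned into an average input size per file. This is only a back-of-the-envelope sketch: it assumes exactly 7000 files and decimal units (1 TB = 1000 GB), whereas the text says "over 7000", so the true average is somewhat lower. The function name is illustrative.

```python
def average_file_size_gb(total_tb: float, n_files: int) -> float:
    """Average size per file in GB, assuming decimal units (1 TB = 1000 GB)."""
    if n_files <= 0:
        raise ValueError("n_files must be positive")
    return total_tb * 1000 / n_files


# Figures from the text; 7000 is a lower bound on the file count,
# so the result is an upper bound on the average size.
avg_gb = average_file_size_gb(total_tb=130, n_files=7000)
print(f"about {avg_gb:.1f} GB per FASTQ file on average")
```

An average in this range helps explain why data access throughput and index distribution to the worker nodes matter as much as raw compute.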

How NEARDATA will address the challenge

To attain high computational efficiency, we implemented numerous optimizations within the NearData project, encompassing both application-specific and cloud-related improvements. For instance, the early stopping functionality integrated into the STAR aligner increases system throughput by 20% by terminating the alignment process for low-quality files. Moreover, the Transcriptomics Atlas pipeline itself has been tailored for the cloud environment:

The optimizations implemented for the Transcriptomics Atlas pipeline can be applied to other bioinformatics pipelines and compute environments.
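The early stopping idea mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: the decision criterion (a low mapping rate after a minimum share of reads has been processed), the `Progress` structure, and both threshold values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    """One progress snapshot of a running alignment (hypothetical structure)."""
    reads_processed: int
    reads_mapped: int


def should_stop_early(
    progress: Progress,
    total_reads: int,
    min_fraction_seen: float = 0.10,  # assumed: judge only after 10% of reads
    min_mapping_rate: float = 0.30,   # assumed: below 30% counts as low quality
) -> bool:
    """Return True when enough reads were seen and the mapping rate is too low."""
    if total_reads <= 0 or progress.reads_processed <= 0:
        return False  # nothing to judge yet
    fraction_seen = progress.reads_processed / total_reads
    if fraction_seen < min_fraction_seen:
        return False  # too early to make a reliable decision
    mapping_rate = progress.reads_mapped / progress.reads_processed
    return mapping_rate < min_mapping_rate


# Example: 20% of the reads processed, only 10% of them mapped.
snapshot = Progress(reads_processed=200_000, reads_mapped=20_000)
print(should_stop_early(snapshot, total_reads=1_000_000))
```

A monitor built along these lines would poll the aligner's progress periodically and terminate the job once the check returns true, freeing the worker for the next file.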

How it will work

In the serverless version of the pipeline, we could use NearData tools and connectors, which enable efficient parallelization of the workflow in cloud and HPC environments while keeping data movement intact:

Summary of some results

The optimizations we introduced within NearData into the Transcriptomics Atlas pipeline can be summarized as contributing to the following key performance indicators:

Contact us

Project Coordinator

Dr. Pedro García López

pedro.garcia@urv.cat


NEARDATA has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101092644.