The current state of the art has focused on determining how single genome variants are associated with complex diseases. However, recent studies have shown that while two variants may each show no association with a disease on their own, their combination within a single genome may. It is therefore necessary to study the impact of variant interactions on such diseases. In this section, we present a use case for the identification of such interactions with two different but complementary methodologies, which we define as separate use cases and experiments.
The discovery of variant groups associated with a disease involves the analysis of over 15 million variants. Studying all possible pairs of variants amounts to more than 6×10^13 tests. Using the MareNostrum 4 (MN4) architecture (165,888 CPUs with 1.880 GB per core, and 208 GPUs with 16 GB each), this process is estimated to take approximately 9 days, which exemplifies why this use case constitutes an extreme data problem. The combinatorial analysis based on machine learning approaches, in turn, requires the execution of over 112 million tests, which represents 3 days of continuous analysis using the whole MN4 architecture. Moreover, this computation considers only pairs of variants; the ideal scenario would examine combinations across all 23 chromosomes, i.e. sets of 23 rather than just pairs. That problem would take months of computation using the whole supercomputer.
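The scale quoted above follows directly from the pairwise combinatorics. A minimal back-of-the-envelope check in Python (the variant count is the one stated in the text; the exact test count in a real run depends on filtering):

```python
from math import comb

# Number of unordered pairs among ~15 million variants: n * (n - 1) / 2
n_variants = 15_000_000
pair_tests = comb(n_variants, 2)

print(f"{pair_tests:.2e}")  # on the order of 10^14, i.e. more than 6x10^13
```

This confirms that exhaustive pairwise testing sits beyond 6×10^13 tests, and that extending from pairs to 23-way combinations grows the search space astronomically.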
Traditional association testing methods often overlook the combined contribution of multiple genomic variants to disease development. Machine learning (ML) approaches have emerged as promising tools for identifying variant groups and assessing their impact on disease risk. However, the structure of genomic datasets, characterized by millions of features across a relatively limited number of individuals, poses significant challenges for the effective application of ML techniques. To address this, we implement a combinatorial strategy that aligns with the statistical constraints of ML methodologies. The genomic data is first partitioned into manageable chunks that conform to the computational limitations of ML tools, ensuring robust analytical performance. These chunks are then paired and analysed for their association with the disease. This approach enables scalable, genome-wide analysis and contributes to enhanced understanding of disease mechanisms, with potential implications for early detection and precision medicine.
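The chunk-and-pair strategy described above can be sketched in a few lines. This is an illustrative outline, not the project's actual implementation; the chunk size and feature counts here are hypothetical placeholders, whereas real runs use millions of variants and chunk sizes tuned to the statistical constraints of the ML tools:

```python
from itertools import combinations

def make_chunks(n_features: int, chunk_size: int) -> list[range]:
    """Partition feature (variant) indices into contiguous chunks that
    fit the computational limits of the ML tool."""
    return [range(start, min(start + chunk_size, n_features))
            for start in range(0, n_features, chunk_size)]

def chunk_pairs(chunks: list[range]) -> list[tuple[range, range]]:
    """Enumerate every unordered pair of chunks to be analysed jointly
    for association with the disease."""
    return list(combinations(chunks, 2))

# Hypothetical small example: 10 features split into chunks of 4.
chunks = make_chunks(n_features=10, chunk_size=4)  # 3 chunks
pairs = chunk_pairs(chunks)                        # 3 chunk pairs
```

Each chunk pair becomes an independent task, which is what makes the analysis embarrassingly parallel and scalable to the genome-wide setting.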
The substantial volume of data stored in genomic datasets, often encompassing tens of millions of variants across hundreds of thousands of individuals, presents a significant challenge for efficient data ingestion. To ensure optimal performance and avoid computational bottlenecks, it is critical to minimize redundant read operations from the same file. This is particularly relevant when working with formats such as Variant Call Format (VCF), Binary Call Format (BCF), or scalable alternatives like Apache Parquet, which support columnar storage and compression. To address these challenges, the ingestion process partitions the data into discrete chunks that can be processed in parallel. This approach reduces I/O overhead and enables distributed execution using Lithops in combination with DataPlug, providing efficient data access and transfer and thus allowing the scalable ingestion and processing of large genomic datasets.
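The core idea behind partitioned ingestion is that each worker reads only its own byte range of the file, so no byte is read twice. The sketch below illustrates this with local threads as a stand-in for the Lithops/DataPlug machinery; it is a simplified assumption-laden example (in practice, DataPlug aligns partitions to record boundaries so that VCF/BCF records are never split across chunks, which this sketch does not do):

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size: int, n_parts: int) -> list[tuple[int, int]]:
    """Split a file of total_size bytes into n_parts contiguous
    (start, end) ranges, end-exclusive, covering every byte exactly once."""
    step = -(-total_size // n_parts)  # ceiling division
    return [(s, min(s + step, total_size)) for s in range(0, total_size, step)]

def process_range(path: str, start: int, end: int) -> int:
    # Each worker seeks to its own range, avoiding redundant reads.
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    return len(data)  # placeholder for real variant parsing

# Usage sketch (hypothetical file name): Lithops would map process_range
# over serverless or HPC workers instead of local threads.
# with ThreadPoolExecutor() as pool:
#     sizes = pool.map(lambda r: process_range("cohort.vcf", *r),
#                      byte_ranges(total_size, n_parts=16))
```

The same partition plan works unchanged whether the backend is threads, serverless functions, or HPC nodes, which is what lets the ingestion scale with the dataset.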
To enable such solutions, we implemented four main connectors:
The development of the HPC extreme data connector and its integration within Lithops has delivered speed-ups of up to 36x in the data ingestion of the GWD use case, resolving its main bottleneck. Further hyperparameter tuning and selection improved performance by another 5x. The Lithops-HPC data connector, in turn, has improved not only performance but also the user experience, by making the supercomputing resources much easier to handle. This enhancement allows non-technical users to manage the supercomputer and launch the GWD use case without dealing with the hurdles of HPC resource management. Such efficiency has been further boosted by integrating the auto-scaler connectors, which allow Lithops-HPC to dynamically adapt resources to foreseen demands.
Project Coordinator
Dr. Pedro García López
pedro.garcia@urv.cat
NEARDATA has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101092644.