Pathogen genomics

Brief introduction

Public health agencies such as the UK Health Security Agency (UKHSA) rely on large-scale pathogen genomics to monitor outbreaks, track variants, and inform rapid response strategies. Modern sequencing technologies generate massive volumes of FASTQ data that must be ingested, processed, compared, and stored efficiently. In this use case, NEARDATA provides a stream-based architecture capable of managing large-scale genomic data while enabling real-time analytics and AI-driven pathogen surveillance across a federated public health ecosystem.

Introduction of the problem

Pathogen genomics in public health presents three major challenges.

First, national surveillance programs must rapidly process, compare, and retrieve millions of pathogen genomes. Traditional workflows based on static batch processing and cluster-based analytics struggle to scale elastically during outbreak peaks.
Second, genomic data must be accessible through both streaming and object storage interfaces. While streaming systems enable real-time ingestion and monitoring, many legacy genomics tools expect file-based object storage. Managing these dual interfaces efficiently, without data duplication or performance loss, is non-trivial.
Third, deploying scalable analytics engines introduces operational overhead, including cluster provisioning, configuration, and maintenance. This complexity becomes a bottleneck in public health contexts, where rapid deployment and incident response are crucial.

Together, these challenges require a new architecture that combines high-throughput ingestion, elastic compute, efficient data reduction, and AI-ready representations.

How NEARDATA will address the challenge

NEARDATA addresses these challenges through a two-pronged strategy.

On the architectural side, we integrate Pravega (streaming storage) and Nexus (tiered data management) to create a unified dual-access architecture. FaaStream (serverless orchestration) can optionally be added. We developed specialised connectors, including the Nexus FASTQGzip streamlet, which compresses FASTQ data while preserving parallel access. This enables efficient ingestion, a reduced storage footprint, and compatibility with both stream-based and object-based workflows.

On the analytical side, we introduced and fully integrated KPop, a novel embedding-based comparative genomics method. Unlike sketch-based approaches, KPop computes the complete k-mer spectrum of genomes and transforms it into compact vector embeddings. These embeddings enable accurate pathogen classification, rapid nearest-neighbour retrieval, and direct integration with AI/ML pipelines.
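To make the idea of a complete k-mer spectrum concrete, the following minimal sketch counts every overlapping k-mer in a sequence and lays the counts out as a fixed-length, unit-normalised vector. It is not KPop's implementation: the function names `kmer_spectrum` and `spectrum_vector`, the choice of k = 3, the skipping of windows containing non-ACGT symbols, and the plain length normalisation are all assumptions made for illustration. The transformation KPop applies to obtain its compact embeddings is not described here and is not reproduced.

```python
from collections import Counter
from itertools import product
import math


def kmer_spectrum(sequence, k=3):
    """Count every overlapping k-mer; windows with non-ACGT symbols are skipped."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def spectrum_vector(counts, k=3):
    """Place counts in a fixed k-mer order (4**k entries) and scale to unit length."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    raw = [counts.get(kmer, 0) for kmer in kmers]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


# Hypothetical toy sequence; the N makes three windows invalid.
spectrum = kmer_spectrum("ACGTACGTNACGT", k=3)
vec = spectrum_vector(spectrum, k=3)
```

Unlike a sketch, which keeps only a sample of k-mers, the vector above has one entry for every possible k-mer, so its length grows as 4^k. That growth is one reason a subsequent reduction to compact embeddings is useful.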

This combined architectural and methodological innovation reduces operational complexity, increases throughput, and prepares genomic data for scalable AI-driven analytics.

How it will work

Incoming FASTQ data streams are ingested via Pravega and processed through Nexus streamlets. The FASTQGzip streamlet applies windowed compression while annotating offsets, enabling efficient parallel batch access without disrupting real-time streaming reads.
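The windowed compression idea can be sketched with Python's standard `gzip` module. The on-disk format of the FASTQGzip streamlet is not described here, so the layout below is an assumption: each window of records is compressed as an independent gzip member, and an index of (offset, length) pairs records where each member starts. The names `compress_windows` and `read_window`, the window size, and the toy reads are hypothetical.

```python
import gzip


def compress_windows(records, window_size=2):
    """Compress records window by window, one gzip member per window.

    Returns the concatenated bytes plus an index of (offset, length) pairs.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(records), window_size):
        window = "".join(records[start:start + window_size]).encode("ascii")
        member = gzip.compress(window)
        index.append((len(blob), len(member)))  # annotate where this window lives
        blob.extend(member)
    return bytes(blob), index


def read_window(blob, index, i):
    """Decompress a single window without touching the others."""
    offset, length = index[i]
    return gzip.decompress(blob[offset:offset + length]).decode("ascii")


# Hypothetical toy FASTQ records (four lines each).
records = [
    "@read1\nACGT\n+\nIIII\n",
    "@read2\nTTGA\n+\nIIII\n",
    "@read3\nGGCC\n+\nIIII\n",
    "@read4\nATAT\n+\nIIII\n",
    "@read5\nCCGG\n+\nIIII\n",
]
blob, index = compress_windows(records)

# Sequential consumers can still read the whole thing as one gzip stream.
streamed = gzip.decompress(blob).decode("ascii")
# Parallel consumers can each pick a window by offset.
second_window = read_window(blob, index, 1)
```

Because concatenated gzip members form a valid gzip stream, a sequential reader sees ordinary compressed data, while the index lets batch workers decompress windows independently. A monolithic gzip file offers no such entry points. Smaller windows give finer-grained parallelism at the cost of a lower compression ratio.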

We also tested execution of genomic analytics workloads, such as KPop, using FaaStream, a serverless framework that orchestrates AWS Lambda functions. This allows dynamic scaling according to workload intensity. Stream-native primitives (e.g., shuffle and stateful coordination) support high-throughput distributed processing.
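As an illustration of what a shuffle primitive does in this setting, the sketch below simulates, in a single process, a map stage that counts k-mers per read, a shuffle that routes each k-mer to a partition by hash, and a reduce stage that sums the counts per partition. It does not use FaaStream or AWS Lambda, whose APIs are not shown here; every function name, the partition count, and the toy reads are assumptions. `zlib.crc32` is used instead of the built-in `hash` so that routing is stable across processes.

```python
import zlib
from collections import Counter


def count_kmers(read, k=3):
    """Map stage: count overlapping k-mers in one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def partition_of(kmer, n_partitions):
    """Deterministic routing: the same k-mer always lands in the same partition."""
    return zlib.crc32(kmer.encode("ascii")) % n_partitions


def shuffle(partial_counts, n_partitions):
    """Shuffle stage: regroup (k-mer, count) pairs by destination partition."""
    buckets = [[] for _ in range(n_partitions)]
    for counts in partial_counts:
        for kmer, n in counts.items():
            buckets[partition_of(kmer, n_partitions)].append((kmer, n))
    return buckets


def reduce_bucket(bucket):
    """Reduce stage: sum the counts for every k-mer in one partition."""
    total = Counter()
    for kmer, n in bucket:
        total[kmer] += n
    return total


reads = ["ACGTAC", "GTACGT", "TTACGA"]            # hypothetical toy reads
partials = [count_kmers(r) for r in reads]        # one map task per read
buckets = shuffle(partials, n_partitions=2)
reduced = [reduce_bucket(b) for b in buckets]     # one reduce task per partition
merged = sum(reduced, Counter())
```

In a serverless deployment each map and reduce task would run as a separate function invocation, and the number of invocations could follow the volume of incoming reads. The routing rule is what guarantees that all counts for a given k-mer meet in the same reducer.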

KPop generates dataset-specific vector embeddings that can be used for classification, clustering, relatedness analysis, and AI-based downstream applications. Because the architecture supports both streaming and object interfaces, results can seamlessly feed legacy genomics pipelines or modern cloud-native AI frameworks.
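Once genomes are represented as vectors, classification and relatedness queries reduce to nearest-neighbour search. The sketch below ranks a small labelled database by cosine similarity to a query vector. The three-dimensional vectors, the lineage labels, and the function names are hypothetical, and the linear scan is chosen for clarity only; it makes no claim about how KPop performs retrieval at million-genome scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query, database, top=1):
    """Return the labels of the `top` most similar entries.

    `database` is a list of (label, vector) pairs.
    """
    ranked = sorted(database, key=lambda item: cosine(query, item[1]), reverse=True)
    return [label for label, _ in ranked[:top]]


# Hypothetical embeddings, one per reference lineage.
database = [
    ("lineage_A", [0.9, 0.1, 0.0]),
    ("lineage_B", [0.1, 0.9, 0.1]),
    ("lineage_C", [0.0, 0.2, 0.9]),
]
hits = nearest([0.8, 0.2, 0.1], database, top=2)
```

Taking the label of the top hit gives a simple classifier, while the ranked list serves as a relatedness query. The same vectors can be passed unchanged to clustering or other machine learning libraries.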

This design ensures elasticity during outbreak peaks, reduces idle infrastructure costs, and supports UKHSA’s federated operating model, where multiple institutions require controlled and timely access to genomic insights.

Summary of some results

The UKHSA use case achieved significant improvements in throughput, scalability, and cost efficiency.

KPop demonstrated near-perfect classification performance, achieving over 98% accuracy on large real-world datasets, including one comprising more than 1.28 million SARS-CoV-2 genomes. Retrieval of related sequences from million-scale databases can be performed in seconds.

The Nexus FASTQGzip streamlet achieved a 3.8× compression ratio while preserving parallel access, offering a strong trade-off between data reduction and processing speed compared to standard FASTQ or monolithic GZip files. The serverless FaaStream implementation of KPop achieved execution times comparable to Apache Flink while reducing end-to-end cost by up to 65% when cluster provisioning time is considered.

Although these results were demonstrated on a single, well-defined workflow, they establish the proof-of-principle feasibility of AI-based, serverless, stream-based genomics analytics for national-scale pathogen surveillance. NEARDATA’s architecture can therefore support resilient, scalable, and cost-efficient genomics services in a federated public health setting.

Contact us

Project Coordinator

Dr. Pedro García López

pedro.garcia@urv.cat

NEARDATA has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101092644.