Public health agencies such as the UK Health Security Agency (UKHSA) rely on large-scale pathogen genomics to monitor outbreaks, track variants, and inform rapid response strategies. Modern sequencing technologies generate massive volumes of FASTQ data that must be ingested, processed, compared, and stored efficiently. In this use case, NEARDATA provides a stream-based architecture capable of managing large-scale genomic data while enabling real-time analytics and AI-driven pathogen surveillance across a federated public health ecosystem.
Pathogen genomics in public health presents three major challenges.
Together, these challenges require a new architecture that combines high-throughput ingestion, elastic compute, efficient data reduction, and AI-ready representations.
NEARDATA addresses these challenges through a two-pronged strategy.
On the architectural side, we integrate Pravega (streaming storage), Nexus (tiered data management), and FaaStream (serverless orchestration) to create a unified dual-access architecture. We developed specialised connectors, including the Nexus FASTQGzip streamlet, which compresses FASTQ data while preserving parallel access. This enables efficient ingestion, reduced storage footprint, and compatibility with both stream-based and object-based workflows.
On the analytical side, we introduced and fully integrated KPop, a novel embedding-based comparative genomics method. Unlike sketch-based approaches, KPop computes the complete k-mer spectrum of genomes and transforms it into compact vector embeddings. These embeddings enable accurate pathogen classification, rapid nearest-neighbour retrieval, and direct integration with AI/ML pipelines.
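To make the contrast with sketch-based approaches concrete, the following minimal sketch counts every k-mer in a sequence rather than a subsample, and lays the counts out as a fixed-length frequency vector. It is an illustration only, not KPop's implementation: the function names `kmer_spectrum` and `spectrum_vector` are hypothetical, skipping k-mers with ambiguous bases is an assumption, and the step that turns the spectrum into a compact embedding is not shown because this description does not specify it.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(sequence: str, k: int) -> dict[str, int]:
    """Count every overlapping k-mer in the sequence (no subsampling)."""
    seq = sequence.upper()
    counts: Counter[str] = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers containing ambiguous bases such as N are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return dict(counts)


def spectrum_vector(sequence: str, k: int) -> list[float]:
    """Return the spectrum as relative frequencies over all 4**k k-mers."""
    counts = kmer_spectrum(sequence, k)
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts.get(kmer, 0) / total for kmer in all_kmers]
```

For example, `kmer_spectrum("ACGTACGT", 2)` counts `AC`, `CG`, and `GT` twice each and `TA` once. The vector has 4**k entries, which grows quickly with k and is one reason a compact embedding is useful downstream.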
This combined architectural and methodological innovation reduces operational complexity, increases throughput, and prepares genomic data for scalable AI-driven analytics.
Incoming FASTQ data streams are ingested via Pravega and processed through Nexus streamlets. The FASTQGzip streamlet applies windowed compression while annotating offsets, enabling efficient parallel batch access without disrupting real-time streaming reads.
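The idea of windowed compression with annotated offsets can be sketched with the Python standard library. This is an assumption-laden illustration, not the Nexus FASTQGzip streamlet itself: the functions `compress_windows` and `read_window`, the fixed window size in records, and the in-memory index are all hypothetical. Each window of FASTQ records is compressed as an independent gzip member, and the byte offset and length of every member are recorded so that a batch reader can decompress any window without touching the others.

```python
import gzip


def compress_windows(fastq_text: str, records_per_window: int):
    """Compress FASTQ text window by window and record each window's offset."""
    lines = fastq_text.splitlines(keepends=True)
    lines_per_window = 4 * records_per_window  # a FASTQ record spans 4 lines
    blob = bytearray()
    index = []  # one (byte_offset, byte_length) pair per window
    for start in range(0, len(lines), lines_per_window):
        chunk = "".join(lines[start:start + lines_per_window]).encode("ascii")
        member = gzip.compress(chunk)  # independent gzip member
        index.append((len(blob), len(member)))
        blob.extend(member)
    return bytes(blob), index


def read_window(blob: bytes, index, i: int) -> str:
    """Decompress only window i, using its recorded offset and length."""
    offset, length = index[i]
    return gzip.decompress(blob[offset:offset + length]).decode("ascii")
```

Because concatenated gzip members form a valid multi-member gzip stream, a sequential reader can still decompress the whole blob from start to end, while parallel workers each pick a window from the index.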
Genomic analytics workloads such as KPop are executed using FaaStream, a serverless framework that orchestrates AWS Lambda functions, so compute scales dynamically with workload intensity. Stream-native primitives (e.g., shuffle and stateful coordination) support high-throughput distributed processing.
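The shuffle primitive mentioned above can be illustrated with a local stand-in. The sketch below does not use the FaaStream API, which is not described here: a thread pool plays the role of the serverless functions, and `map_count`, `partition_of`, `shuffle`, `reduce_partition`, and `run` are hypothetical names. Map tasks count k-mers per batch of reads, the shuffle routes each k-mer to a partition chosen by a deterministic hash, and reduce tasks sum the counts within their partition.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_count(reads, k):
    """Map task: count k-mers in one batch of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def partition_of(kmer, n_partitions):
    """Deterministic hash so every worker routes a k-mer to the same partition."""
    return zlib.crc32(kmer.encode("ascii")) % n_partitions


def shuffle(partials, n_partitions):
    """Group partial counts by partition so each reducer sees all counts for its keys."""
    buckets = [[] for _ in range(n_partitions)]
    for partial in partials:
        for kmer, count in partial.items():
            buckets[partition_of(kmer, n_partitions)].append((kmer, count))
    return buckets


def reduce_partition(pairs):
    """Reduce task: sum the counts of the k-mers assigned to one partition."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals


def run(batches, k, n_partitions):
    """Run map, shuffle, and reduce; threads stand in for serverless functions."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda batch: map_count(batch, k), batches))
        buckets = shuffle(partials, n_partitions)
        reduced = list(pool.map(reduce_partition, buckets))
    merged = Counter()
    for part in reduced:
        merged.update(part)  # partitions hold disjoint keys, so this is a union
    return merged
```

In a serverless deployment the number of map and reduce invocations would follow the incoming data volume, which is the elasticity property the paragraph above describes.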
KPop generates dataset-specific vector embeddings that can be used for classification, clustering, relatedness analysis, and AI-based downstream applications. Because the architecture supports both streaming and object interfaces, results can seamlessly feed legacy genomics pipelines or modern cloud-native AI frameworks.
This design ensures elasticity during outbreak peaks, reduces idle infrastructure costs, and supports UKHSA’s federated operating model, where multiple institutions require controlled and timely access to genomic insights.
The UKHSA use case achieved significant improvements in throughput, scalability, and cost efficiency.
KPop demonstrated near-perfect classification performance, achieving over 98% accuracy on large real-world datasets, including one comprising more than 1.28 million SARS-CoV-2 genomes. Retrieval of related sequences from million-scale databases can be performed in seconds.
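Retrieval over embeddings amounts to ranking stored vectors by similarity to a query vector. The sketch below is a brute-force illustration under stated assumptions: cosine similarity is chosen here for simplicity and is not confirmed as the metric KPop uses, the names `cosine` and `nearest` are hypothetical, and a million-scale database would in practice need a vectorised or indexed search rather than this linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, database, top_k=1):
    """Return the top_k (label, similarity) pairs, most similar first."""
    scored = [(label, cosine(query, vector)) for label, vector in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With a toy database `{"A": [1, 0], "B": [0, 1], "C": [1, 1]}` and the query `[0.9, 0.1]`, the ranking is A, then C, then B. Classification can then be done by assigning the query the label of its nearest stored embedding.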
The Nexus FASTQGzip streamlet achieved a 3.8× compression ratio, meaning stored data occupies roughly 26% of its original size, while preserving parallel access. This offers a strong trade-off between data reduction and processing speed compared to uncompressed FASTQ or monolithic gzip files. The serverless FaaStream implementation of KPop achieved execution times comparable to Apache Flink while reducing end-to-end cost by up to 65% when cluster provisioning time is considered.
Although these results were demonstrated on a single, well-defined workflow, they establish proof-of-principle feasibility of AI-based, serverless, stream-based genomics analytics for national-scale pathogen surveillance. NEARDATA’s architecture can therefore support resilient, scalable, and cost-efficient genomics services across a federated public health system.
Project Coordinator
Dr. Pedro García López
pedro.garcia@urv.cat
NEARDATA has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101092644.