A Unified Ingestion and Storage System for Scalable Big Data Processing


KerA is developed within the KerData team. The project is hosted on GitLab and licensed under Apache 2.0.

This project has been partially funded by EIT Digital.


KerA is a unified ingestion and storage system that meets the following design principles:

Unified analytics architecture

KerA focuses on how to transform data through a workflow composed of stateful and/or stateless stream-based operators. It has been designed with the computation in mind: how to define the computation, when to trigger the computation and how to combine the computation with offline analytics. Furthermore, all high-level data management (fault-tolerance, persistence of operator states, etc.) are handled natively by the unified layer. Stream-based interfaces are provided for data ingestion, storage and processing.

Scalable data ingestion

KerA offers dynamic partitioning using semantic grouping. Logical partitions defined by the user correspond to multiple physical sub-partitions that are created and filled in order by producers and dynamically discovered and processed in order by consumers. KerA also includes a lightweight offset indexing that assigns offsets at coarse granularity (sub-partition level).

Diverse data access patterns

KerA provides interfaces for both online (records or streams) and offline (objects) access patterns. The data model is flexible enough to allow the access to streams/objects and the fine-grained access to records of a stream. The latter are modelled with a multi-key-value format. Furthermore, the ingestion/storage system leverages a log-structured approach (based on RAMCloud) for both data in memory and on disk. Replication is possible in both synchronous and asynchronous modes.

Efficient integration

KerA should be co-located with processing engines by leveraging a shared memory buffer, thus bypassing the network and reducing the communication interferences between reads and writes. Metadata is handled on the broker providing access to the corresponding data.