top of page

Building the Data Lake Architecture


The keynote explores how we, at Quantal re-architected and simplified an essential component in our data pipeline, delivering a large amount of business value in the process.

We a data-driven company provides operational intelligence and data input automation by analyzing business activities. Our matching technology is enabled by our data pipeline, which ingests and analyzes series of activities (such as tasks, alerts, or incidents etc) on a daily basis.

These activities are then used as a source to build the data presentation layer, to perform a variety of risk analysis and exploration of tasks. Until recently, most of our data stores were either key-value, non-relational or relational databases and served both online and offline data processing tasks, which required expensive replicas or oversized clusters. This approach proved to be extremely expensive to maintain and not scalable at all in the long run. We faced critical issues with offline batch jobs impacting the performance of the app as well as poor performance and bottlenecks when reading the bulk data real time.

That’s when we decided to fully separate online and offline storage and use optimal technology for both.

Thought of Data lake

The concept of a data lake is hardly new to us and it was nicely explained in the article by Martin Fowler. Since we have a multitude of batch jobs analyzing data sets, when we decided to build an offline data storage, our main concerns were storage costs and data read performance.

We explored multiple ways to solve this specific issue. We come to the point that Amazon S3 and Apache Parquet (well suited for the rise in interactive query services) have proven to be a great fit for both of these requirements. S3 storage is highly cost-effective and offers great read throughput, partitioned Hive tables on top of Parquet files allow us to query batch data efficiently.

This is how it looks like:


  • The activity ingestion pipeline processes data and stores the data in online storage. Simultaneously, the pipeline streams data to a Kinesis stream.

  • The raw data is stored on S3 in a JSON format.

  • Using Spark for data extraction, transformation, and loading (ETL), we perform cleanup and transformations on the data and store the output as parquet files on S3, while also making the data accessible for querying through hive meta store.

  • The structured data from a relational databases, is dumped directly to S3 as Parquet files and indexed in the hive metastore.

  • The analytics jobs query the hive tables, then populate online stores such as elastic search, or produce results to SQS, S3 or any other data sink.

The approach above allowed us to decouple online and offline data storage systems as well as enable easier access to the data for our operational intelligence.

Conclusion

Using S3 and Hive, we’ve seen a 5-6x improvement in the performance of our heavier batch jobs that previously loaded data from key-value databases. Offloading batch processing jobs from the online databases to the data lake has removed CPU spikes and allowed us to downscale our most expensive database cluster.

In the end, moving to the new data architecture allowed us to significantly cut infrastructure costs, improve batch jobs performance in offline as well as online, enable easier and uniform access to the data.

Extreme imagination and never settle for anything, the key to building great technology products for the future

3 views

Commentaires


bottom of page