The real-time nature of our datasets poses unique challenges. We need to surface unusual activity and SOS alerts quickly, serve relevant metrics to our business team, and address many other real-time use cases. Our Pub/Sub system provides the infrastructure to handle this workload.
We use Apache Kafka as a messaging bus to connect different parts of our tech ecosystem. We collect system and application logs, indoor beacon location tracking data, and event trigger data from our service providers' executive apps. We then make this data available to a variety of downstream consumers via Kafka.
Data in Kafka feeds both real-time streaming and batch pipelines. It is used for complex business computations, real-time alerts, analytics dashboards, and more.
Kafka Ecosystem @ Risk tech
Given our current Kafka architecture and data volume, achieving lossless delivery for our data pipeline is cost-prohibitive on AWS EC2. Accounting for this, we designed our infrastructure to tolerate an acceptable amount of data loss while balancing cost. We’ve achieved a daily data loss rate of less than 0.01%. We gather metrics on dropped messages so we can take action if needed.
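To make the loss figure concrete, here is a minimal sketch of how a loss rate could be derived from attempted and dropped counters and compared against the 0.01% target. The class `DeliveryStats` and its method names are hypothetical, chosen for illustration only; they are not taken from our client library.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counters for one reporting window (for example, one day). */
final class DeliveryStats {
    private final LongAdder attempted = new LongAdder();
    private final LongAdder dropped = new LongAdder();

    void recordAttempts(long n) { attempted.add(n); }

    void recordDrops(long n) { dropped.add(n); }

    /** Fraction of attempted messages that were dropped; 0.0 when nothing was attempted. */
    double lossRate() {
        long total = attempted.sum();
        return total == 0 ? 0.0 : (double) dropped.sum() / total;
    }

    /** True when the loss rate is strictly above the threshold, e.g. 0.0001 for 0.01%. */
    boolean exceeds(double threshold) {
        return lossRate() > threshold;
    }
}
```

Under these assumptions, 50 drops out of 1,000,000 attempts gives a loss rate of 0.005%, which stays under the 0.01% target, while 150 drops out of the same volume would exceed it.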
The Apache Kafka pipeline produces messages asynchronously, without blocking other microservices. If a message cannot be delivered after a few retries, the producer drops it to preserve the availability of the application and a good user experience.
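The drop-after-retries policy described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not our actual producer: `BestEffortSender`, its `transport` predicate (which stands in for the network send and returns true on success), and the attempt count are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Hypothetical best-effort sender: never blocks the caller, drops after maxAttempts. */
final class BestEffortSender {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Predicate<String> transport; // assumed stand-in for the real send
    private final int maxAttempts;
    final AtomicLong delivered = new AtomicLong();
    final AtomicLong dropped = new AtomicLong();

    BestEffortSender(Predicate<String> transport, int maxAttempts) {
        this.transport = transport;
        this.maxAttempts = maxAttempts;
    }

    /** Returns immediately; delivery is attempted on the background thread. */
    void sendAsync(String message) {
        worker.execute(() -> {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    if (transport.test(message)) {
                        delivered.incrementAndGet();
                        return;
                    }
                } catch (RuntimeException e) {
                    // Treat an exception like a failed attempt and try again.
                }
            }
            // All attempts failed: drop the message and count it for monitoring.
            dropped.incrementAndGet();
        });
    }

    /** Stops accepting work and waits briefly for queued sends to finish. */
    void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The sketch uses an unbounded in-memory queue for brevity; a production sender would also bound that buffer. In the Apache Kafka Java client, the comparable pieces are the asynchronous `send(record, callback)` call and the `retries` and `delivery.timeout.ms` producer settings.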
Most applications in our technology stack use our Java client library to produce to the Kafka pipeline. Each instance of those applications runs multiple Kafka producers, each producing to a predefined Kafka cluster for sink-level isolation. The producers support flexible topic routing: the source and sink configuration can be changed at runtime without restarting the whole application process. This makes it possible to redirect traffic and migrate topics across multiple Kafka clusters.
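One way to picture runtime-changeable routing is a lookup table that is swapped atomically while the application keeps running. The sketch below is a guess at the shape of such a component, not the library's real design: `TopicRouter`, `Route`, and the source, cluster, and topic names in the usage note are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical routing table mapping a source name to a (cluster, topic) sink. */
final class TopicRouter {
    static final class Route {
        final String cluster;
        final String topic;

        Route(String cluster, String topic) {
            this.cluster = cluster;
            this.topic = topic;
        }
    }

    private final AtomicReference<Map<String, Route>> routes = new AtomicReference<>();

    TopicRouter(Map<String, Route> initial) {
        update(initial);
    }

    /** Replaces the whole table atomically; concurrent readers see the old or the new one. */
    void update(Map<String, Route> next) {
        routes.set(Collections.unmodifiableMap(new HashMap<>(next)));
    }

    /** Returns the current route for a source, or null if none is configured. */
    Route resolve(String source) {
        return routes.get().get(source);
    }
}
```

With this shape, migrating a source such as `beacon-events` from `cluster-a` to `cluster-b` is a single `update` call with the new table, and producers pick up the new route on their next `resolve` without a restart.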
Although we’ve made these improvements to scale the Kafka service, many more interesting problems need to be solved to bring the service to the next level. For instance, we’ll be exploring an abstraction layer for Kafka.
We have many interesting engineering problems to solve, from building scalable, reliable, and efficient infrastructure to applying cutting-edge machine learning technologies to help our business grow.