Welcome to the 13th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Kafka success stories and why Kafka matters, the data quality tools landscape, the MLOps tools landscape, schema evolution, and Airflow DAG best practices from Twitter, Grab, Climate Corp, and Confluent.
Schema evolution is a fundamental aspect of data management and, consequently, data governance. Applications evolve, and their internal data definitions need to change with them. The article walks through the different schema compatibility types, focusing on Avro and the Confluent Schema Registry.
https://medium.com/data-rocks/schema-evolution-is-not-that-complex-b7cf7eb567ac
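To make the compatibility idea concrete, here is a minimal sketch of backward-compatible reads, assuming a heavily simplified model of Avro-style schema resolution. It does not use the Avro library or the Schema Registry; `read_with_schema`, `NO_DEFAULT`, and the user schemas are hypothetical names made up for illustration.

```python
# Simplified illustration of Avro-style schema resolution (not the Avro library).
NO_DEFAULT = object()  # hypothetical sentinel: the field declares no default


def read_with_schema(record, reader_fields):
    """Project a record written with an older schema onto a reader schema.

    reader_fields maps field name -> default value, or NO_DEFAULT when the
    field has no default. Fields unknown to the reader are dropped.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            # Field was added after the record was written: use its default.
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# v2 adds "email" with a default, so old records remain readable.
USER_V2 = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": None}
# v3 adds "country" without a default, so old records can no longer be read.
USER_V3_BROKEN = {"id": NO_DEFAULT, "name": NO_DEFAULT, "country": NO_DEFAULT}

old_record = {"id": 1, "name": "Ada"}
```

Reading `old_record` with `USER_V2` fills in `email` from its default, while reading it with `USER_V3_BROKEN` raises an error. That is the intuition behind the rule that new fields need defaults for a change to stay backward compatible.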
Data quality is critical for running highly functional business operations. High-quality datasets produce highly accurate model predictions, which give the business a strategic advantage. The article is a pretty good overview of the various players in the data quality space.
https://medium.com/memory-leak/data-quality-a-primer-f6a945915511
MLOps goes through a maturity cycle from manual model building to deployment pipelines to CI/CD integration, just like any software maturity model. The article walks through the ML platform offerings of the major cloud providers (AWS, Azure, and Google Cloud) for data labeling, experimentation, and industrialisation.
https://towardsdatascience.com/which-cloud-servicer-provider-ml-platform-do-you-need-69ff5d96b7db
Continuing with MLOps, the article compares five MLOps tools: MLflow, Pachyderm, Kubeflow, DataRobot, and Algorithmia.
https://medium.com/better-programming/5-great-mlops-tools-to-launch-your-next-machine-learning-model-3e403d0c97d3
Twitter's Kafka infrastructure processes 150 million events per second. Some mind-blowing stats: 80 Kafka clusters with up to 200 brokers per cluster, 40,000 subscribers (Kafka clients), and 2000+ Kafka topics. The article is a good walkthrough of Twitter's Kafka adoption.
https://videos.confluent.io/watch/3V7HtAVxvGv2zpcewb5zT3
Grab shared its experience on optimally scaling Kafka consumer applications that handle 400 billion events per day.
https://engineering.grab.com/optimally-scaling-kafka-consumer-applications
Machine learning in the agriculture domain is an exciting space to watch, and I hope the bright minds of our time start focusing on the growth of civilization instead of ad clickbait. Climate Corp shared its experience in building a recommendation engine for farmers.
https://blog.dominodatalab.com/bringing-ml-to-agriculture/
Astronomer shared some best practices for writing Airflow DAGs. The article focuses on idempotent, incremental data processing.
https://www.astronomer.io/guides/dag-best-practices/
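As a rough illustration of what idempotent, incremental processing means in practice, the sketch below rebuilds exactly one daily partition per run and overwrites it rather than appending. It is plain Python, not Airflow code; `load_partition`, the in-memory `warehouse` dict, and the row layout are assumptions made for the example. In a real DAG the run date would typically come from the run's logical date rather than from the wall clock.

```python
from datetime import date


def load_partition(warehouse, run_date, source_rows):
    """Rebuild exactly one daily partition from the source rows.

    Incremental: only rows for run_date are processed.
    Idempotent: the partition is overwritten, so a retry or backfill of the
    same run_date leaves the warehouse in the same state.
    """
    rows_for_day = [row for row in source_rows if row["event_date"] == run_date]
    warehouse[run_date] = rows_for_day  # overwrite, never append
    return len(rows_for_day)


# Hypothetical source data spanning two days.
SOURCE_ROWS = [
    {"event_date": date(2020, 10, 12), "amount": 5},
    {"event_date": date(2020, 10, 13), "amount": 7},
    {"event_date": date(2020, 10, 13), "amount": 3},
]

warehouse = {}
load_partition(warehouse, date(2020, 10, 13), SOURCE_ROWS)
load_partition(warehouse, date(2020, 10, 13), SOURCE_ROWS)  # simulated retry
```

After the simulated retry, the partition for 2020-10-13 still holds two rows. An append-based load would have duplicated them, which is the failure mode the idempotency guideline is meant to avoid.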
What is happening right now in my data stream? Materialization is a critical feature of a stream processing engine for answering this question. The article is an excellent walkthrough of how ksqlDB handles stream materialization.
https://www.confluent.io/blog/how-real-time-materialized-views-work-with-ksqldb/
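The core idea of materialization can be sketched in a few lines: rather than rescanning the stream on every query, keep a table that is updated incrementally as each event arrives. The toy class below is only a conceptual model, not how ksqlDB is implemented; `MaterializedClickCounts` and the event fields are hypothetical.

```python
from collections import defaultdict


class MaterializedClickCounts:
    """Toy materialized view: running click totals per user."""

    def __init__(self):
        self._table = defaultdict(int)

    def apply(self, event):
        # Incremental update: touch only the row affected by this event.
        self._table[event["user"]] += event["clicks"]

    def query(self, user):
        # Reads hit the precomputed table, not the event history.
        return self._table.get(user, 0)


view = MaterializedClickCounts()
for event in [
    {"user": "alice", "clicks": 2},
    {"user": "bob", "clicks": 1},
    {"user": "alice", "clicks": 3},
]:
    view.apply(event)
```

Here `view.query("alice")` is answered from the current state of the table, so the cost of a lookup does not grow with the length of the stream.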
Continuing with stream processing, managing state is critical to building distributed stateful applications. Apache Flink is one of the mature stream processing engines in the market, and the blog narrates the internals of its Stateful Functions implementation.
https://flink.apache.org/news/2020/10/13/stateful-serverless-internals.html
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.