Data Engineering Weekly #4

Weekly data engineering newsletter

Aug 16, 2020

Welcome to the fourth edition of the data engineering newsletter. This week's edition brings a new set of articles on data orchestration, ML applications, tuning data workloads, and Kafka on Kubernetes.

Airflow is a huge step forward over loosely coupled cron jobs for running data pipelines. Dagster, built around a data-aware, typed, self-describing logical orchestration graph, takes data orchestration to the next level by focusing on local development, code that is testable before production, and linking data assets to the code that produced them. The focus on data dependencies rather than pure execution dependencies is a data engineer's dream come true.

https://medium.com/dagster-io/dagster-the-data-orchestrator-5fe5cadb0dfb

Amundsen, the data discovery and metadata engine open-sourced by Lyft, joins the LF AI Foundation as a new incubation project.

https://lfai.foundation/blog/2020/08/11/amundsen-joins-lf-ai-as-new-incubation-project/

Vimeo writes about its video social analytics infrastructure built on Apache Spark. The major challenge is integrating external APIs guarded by severe rate limits. Using micro-batching to work around the external API rate limits and to decouple the application logic from API data sourcing is a pragmatic approach and makes for an exciting read.

https://medium.com/vimeo-engineering-blog/video-social-analytics-at-scale-using-apache-spark-5bf34359c9ba
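The micro-batching idea from the Vimeo item above can be illustrated with a small sketch. This is not Vimeo's implementation and it does not use Spark; it only shows the general pattern of grouping IDs into small batches and spacing out calls to a rate-limited API. The names `micro_batches`, `fetch_with_rate_limit`, `fetch_batch`, and `min_interval_s` are hypothetical, chosen for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def micro_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def fetch_with_rate_limit(
    ids: Iterable[T],
    fetch_batch: Callable[[List[T]], List[R]],  # hypothetical API client call
    batch_size: int,
    min_interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[R]:
    """Call fetch_batch once per micro-batch, keeping at least
    min_interval_s seconds between the start of consecutive calls."""
    results: List[R] = []
    last_call = None
    for batch in micro_batches(ids, batch_size):
        if last_call is not None:
            wait = min_interval_s - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.extend(fetch_batch(batch))
    return results
```

Because the caller only sees the returned results, the application logic stays independent of how and when the external API is called, which is the decoupling the post highlights. A real pipeline would also need retries and persistence of partial results, which this sketch leaves out.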

Slack writes about the ML infrastructure it uses to block spam invites. The key takeaway is the simplicity of the approach and its focus on the operational aspects of the ML application.

https://slack.engineering/blocking-slack-invite-spam-with-machine-learning/

Koalas is an open-source project that provides a drop-in replacement for pandas with a focus on scalability. Databricks writes about how PySpark and Koalas can work together effectively.

https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html

Amazon EMR is a widely used big data service from AWS. Monitoring EMR clusters is essential for detecting critical issues with applications or infrastructure in real time and for identifying root causes quickly. AWS writes about how to integrate EMR metrics with Prometheus and other parts of the monitoring ecosystem, such as Grafana for dashboards and Amazon SNS for sending notifications and alerts.

https://aws.amazon.com/blogs/big-data/monitor-and-optimize-analytic-workloads-on-amazon-emr-with-prometheus-and-grafana/

The buy vs. build question is on the table when it comes to stream processing, given the complexity of the system. Apache Kafka and AWS Kinesis are the leading competitors among message brokers. It's (not) surprising that Apache Kafka is still years ahead in stream processing.

https://medium.com/flo-engineering/kinesis-vs-kafka-6709c968813

Strimzi is an open-source CNCF sandbox project that focuses on running Apache Kafka on Kubernetes. It provides container images for Apache Kafka itself, ZooKeeper, and the other components of the Strimzi ecosystem. The blog post walks through how to move Apache Kafka workloads to Kubernetes.

https://developers.redhat.com/blog/2020/08/14/introduction-to-strimzi-apache-kafka-on-kubernetes-kubecon-europe-2020/

Apache Kafka consumers follow a single-threaded processing model in which each partition is consumed by one thread. The model simplifies ordering and processing guarantees when handling a stream of events. The downside of the approach is that we often underutilize the CPU. Confluent writes a blog post on how to implement multi-threaded message consumption with the Apache Kafka consumer and the challenges around it.

https://www.confluent.io/blog/kafka-consumer-multi-threaded-messaging/
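One common way to add threads without losing per-partition ordering can be sketched in a few lines. This is an illustration of the general idea only, not the code from the Confluent post, and it deliberately uses no Kafka client: the `Record` type, `process_partition`, and `process_poll_result` are hypothetical names. The sketch assumes the records from one poll are grouped by partition, each partition's records are processed in order by a single pool task, and the next offset to commit is reported per partition.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a consumed Kafka record."""
    partition: int
    offset: int
    value: str


def process_partition(records: List[Record], handler: Callable[[Record], None]) -> int:
    """Process one partition's records in order.
    Returns the next offset to commit (last processed offset + 1)."""
    last_offset = -1
    for record in records:
        handler(record)
        last_offset = record.offset
    return last_offset + 1


def process_poll_result(
    records: Iterable[Record],
    handler: Callable[[Record], None],
    pool: ThreadPoolExecutor,
) -> Dict[int, int]:
    """Fan one poll's records out to the pool, one task per partition,
    and return {partition: next offset to commit}."""
    by_partition: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        by_partition[record.partition].append(record)

    # Partitions run in parallel; records within a partition stay ordered
    # because a single task walks them sequentially.
    futures = {
        partition: pool.submit(process_partition, recs, handler)
        for partition, recs in by_partition.items()
    }
    # Waiting here keeps the sketch simple; a real poll loop would keep
    # polling and pause the partitions that are still being processed.
    return {partition: future.result() for partition, future in futures.items()}
```

A caller would create a `ThreadPoolExecutor`, pass each poll's records through `process_poll_result`, and commit the returned offsets. In Python, threads mainly help when the handler is I/O-bound; rebalances, error handling, and back-pressure are the hard parts and are not covered here.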

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.
