Data Engineering Weekly #2

Weekly data engineering newsletter

Welcome to the second edition of the data engineering newsletter. This week's release is a new set of articles that focus on workflow schedulers, data pipeline observability, performance, and cost-effectiveness.


Apache Pinot is gaining momentum as a real-time OLAP system for data engineering needs. In this blog post, Sapient narrates its experience benchmarking Apache Pinot. An ingestion rate crossing 120k entries/second on a single node is impressive.

https://medium.com/@shounakmk219/tasted-apache-pinot-and-we-loved-it-85f9022c30f7


Netflix open-sourced Metaflow (metaflow.org) in December 2019. Metaflow follows a layered architecture for running data workloads, in contrast to Airflow's tightly coupled scheduler architecture. In this post, Netflix explains how it integrated the scheduler layer with AWS Step Functions.

https://netflixtechblog.com/unbundling-data-science-workflows-with-metaflow-and-aws-step-functions-d454780c6280


An Airflow operator represents a single idempotent task. Operators determine what executes when your DAG runs. One drawback of operators is that Airflow does not have explicit inter-operator communication, aka no easy way to pass messages between operators! The AIP-31 proposal adopts a functional DAG abstraction to hide that complexity. The following article explains how a functional definition can solve inter-operator communication.

https://medium.com/databand-ai/aip-31-airflow-functional-dag-definition-b34852a632d0
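
To make the functional idea concrete, here is a minimal toy sketch in plain Python. It is not Airflow's API and not the AIP-31 implementation; `task`, `TaskCall`, and `run` are hypothetical names invented for illustration. The point it tries to show is that when tasks are plain functions, passing one task's result to another both moves the data and declares the dependency.

```python
# Toy illustration only: `task`, `TaskCall`, and `run` are hypothetical
# names, not part of Airflow.
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TaskCall:
    """A deferred call to a task; its arguments may be other TaskCalls."""
    func: Callable[..., Any]
    args: tuple = ()

    @property
    def upstream(self) -> list:
        # Dependencies are inferred from the arguments rather than
        # declared separately with explicit wiring.
        return [a for a in self.args if isinstance(a, TaskCall)]


def task(func: Callable[..., Any]) -> Callable[..., TaskCall]:
    """Turn a plain function into a task whose call is deferred."""
    def deferred(*args: Any) -> TaskCall:
        return TaskCall(func, args)
    return deferred


def run(call: TaskCall, cache: Optional[dict] = None) -> Any:
    """Run upstream tasks first, then pass their results as arguments."""
    cache = {} if cache is None else cache
    if id(call) not in cache:
        resolved = [run(a, cache) if isinstance(a, TaskCall) else a
                    for a in call.args]
        cache[id(call)] = call.func(*resolved)
    return cache[id(call)]


@task
def extract():
    return [1, 2, 3]


@task
def double(rows):
    return [r * 2 for r in rows]


@task
def total(rows):
    return sum(rows)


# Composing the calls builds the graph: total <- double <- extract.
pipeline = total(double(extract()))
print(run(pipeline))  # 12
```

Nothing runs until `run` is called; composing the functions only records which task feeds which. A real scheduler would also have to serialize the intermediate values between workers, which this sketch ignores.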


Python has become the de facto language for data science workloads, and the Apache Spark community continually improves the performance of PySpark. Pinterest writes about the data infrastructure that empowers its data science workloads. The design approach of isolating the Python environment for each workload and the use of SparkMagic make for an exciting read.

https://medium.com/pinterest-engineering/empowering-pinterest-data-scientists-and-machine-learning-engineers-with-pyspark-f41b0d1dd1b8


Cost optimization is an essential engineering discipline in cloud computing. Netflix writes about its cost optimization platform in this blog post. The highlight of the post is the automated TTL recommendations, which are made only for tables with material cost-saving potential.

https://netflixtechblog.com/byte-down-making-netflixs-data-infrastructure-cost-effective-fee7b3235032


Square writes about using Amundsen to support users' privacy. The post narrates the challenges of labeling columns that hold sensitive data and the use of Google's Cloud Data Loss Prevention tool.

https://developer.squareup.com/blog/using-amundsen-to-support-user-privacy-via-metadata-collection-at-square/


Spark 3.0 brought many improvements to Spark SQL. The article explains the internals of the Spark SQL execution plan and how to interpret a query plan to optimize execution.

https://towardsdatascience.com/mastering-query-plans-in-spark-3-0-f4c334663aa4


Structured Streaming was initially introduced in Apache Spark 2.0. It has proven to be the best platform for building distributed stream processing applications. The article walks through troubleshooting streaming performance using the new Structured Streaming UI in Spark 3.0.

https://databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html


Support for running Spark on Kubernetes was added in version 2.3, and Spark-on-k8s adoption has been accelerating ever since. The lack of an external shuffle service is one of the drawbacks of adopting Spark on Kubernetes. Spark 3.0 added support for soft dynamic allocation to mitigate the issue. The benchmark in the blog shows the performance difference between Spark on K8s and Spark on YARN narrowing.

https://towardsdatascience.com/performance-of-apache-spark-on-kubernetes-has-caught-up-with-yarn-73730878a792
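
For readers who want to try dynamic allocation without an external shuffle service, the sketch below collects the relevant Spark 3.0 properties and renders them as `spark-submit` flags. The property names are real Spark settings; the executor limits and the `to_submit_args` helper are illustrative assumptions, not values taken from the benchmark.

```python
# Property names are real Spark 3.0 settings. The values and the helper
# function are illustrative assumptions, not tuning advice.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # Tracks which executors still hold shuffle data so they are not
    # removed while that data may be needed, which lets dynamic
    # allocation work without an external shuffle service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_submit_args(conf):
    """Render a config mapping as a flat list of spark-submit --conf flags."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

The same key/value pairs can also be set on a `SparkSession` builder or in `spark-defaults.conf`; the flag form is used here only because it is easy to print and inspect.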


The modern data platform is moving from the traditional data warehouse, through the data lake, to the data mesh. This blog post offers blueprint guidance on how to move from the data warehouse to the data mesh world. The blog focuses on Google Cloud offerings, but the concepts are still applicable to any cloud infrastructure.

https://medium.com/swlh/building-a-data-platform-to-enable-analytics-and-ai-driven-innovation-1bd95e37efb9


There has been growing interest across the industry lately in getting better control over one's data ecosystem and improving its operational efficiency. Following Amundsen (Lyft), DataHub (LinkedIn), Databook (Uber), and Metacat (Netflix), Criteo has published its internal data discovery system, DataDoc.

https://medium.com/criteo-labs/datadoc-the-criteo-data-observability-platform-2cd826a9a1af


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.