Welcome to the 11th edition of the data engineering newsletter. This week's release is a new set of articles that focus on the 2020 data infrastructure trends, seven principles of data ops, data quality, ML transference & performance tuning, and the Samza runner for Beam, from LinkedIn, Twitter, DoorDash, Airbnb, Shopify, Apache Pinot, and Dagster.
Developer productivity and the ability to iterate quickly on the correctness of a job are always challenging. The Airflow test utility took the first step toward improving developer productivity. In this post, Dagster takes it to the next level, describing how to run PySpark on either EMR or Dagster with a mode switch.
https://dagster.io/blog/pyspark
Seven principles for reliable data pipelines is an excellent read that draws comparisons with the Google SRE principles. The author narrates the importance of adopting SLOs & SLIs, reducing toil, monitoring the pipeline, and keeping things simple.
https://medium.com/toro-data-quality/seven-principles-for-reliable-data-pipelines-e82a82810e4f
The 2020 data & AI landscape is an excellent read. The author narrates some of the recent trends in data infrastructure. The shift from Hadoop systems to cloud warehouses like Snowflake and Google BigQuery, the growing momentum of data lineage and discovery tools, the second-generation orchestration tools Prefect & Dagster, and the rise of AIOps are the exciting trends to watch.
https://mattturck.com/data2020/
LinkedIn published the benchmarking results for the Samza runner for Apache Beam. It's a good reference article on how to think about performance improvement as a continuous process.
https://engineering.linkedin.com/blog/2020/building-a-better-and-faster-beam-samza-runner
Twitter writes a short and exciting blog about the recent image cropping transparency issue. It is a good reminder that machine learning is not always the answer and that users should be able to choose what they want.
https://blog.twitter.com/en_us/topics/product/2020/transparency-image-cropping.html
Airbnb writes about how it builds the data platform for revenue forecasting. The blog is an excellent narration of some practical challenges with the data infrastructure: supporting multiple query engines, generating metrics dynamically, handling late-arriving data, and maintaining SLAs.
https://medium.com/@jerry.chu/airbnbs-data-platform-of-revenue-forecasting-2e95a01122e6
DoorDash writes about its recent performance challenges with the search scoring and ranking model infrastructure. The blog narrates its migration to the internal prediction service, emphasizing the importance of a dedicated feature store.
https://doordash.engineering/2020/10/01/integrating-a-scoring-framework-into-a-prediction-service/
Critical site-facing analytical applications require high throughput and strict p99 query latency. Apache Pinot is an excellent OLAP engine for serving user-facing analytical solutions, and the article narrates the challenges of running concurrent queries under a low-latency SLA with Apache Pinot.
Data quality has been a consistent focus, as poor data quality often leads to issues that can go unnoticed for a long time, bring entire pipelines to a halt, and erode stakeholders' trust in the reliability of their analytical insights. Great Expectations writes an excellent narration of how data quality is key to the success of MLOps.
https://medium.com/@expectgreatdata/why-data-quality-is-key-to-successful-ml-ops-a18d6e373ca9
Descriptive statistics and correlations are data scientists' bread and butter, but they often come with the caveat that correlation isn't causation. In this blog post, Shopify narrates different causal inference methods and how it uses them to build great products.
https://engineering.shopify.com/blogs/engineering/using-quasi-experiments-counterfactuals
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.