Data Engineering Weekly #68
Weekly Data Engineering Newsletter
Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
OtterTune: Databases in 2021: A Year in Review
A comprehensive overview of databases in 2021, from the dominance of PostgreSQL to the database vendor fights over performance benchmark results. It is an exciting time for cloud databases: companies like ClickHouse Inc, StarTree, Imply, and SingleStore collectively raised around $480M in 2021.
Shipyard: dbt Coalesce 2021 Takeaways
A collection of notes from dbt Coalesce 2021, in case you missed it.
Ernest Chan: Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
A comprehensive overview of ML platforms across companies. It is not so surprising to see that most of the platforms were developed in-house. I hope 2022 paves the way to democratize and simplify MLOps.
Netflix: Robust Foundation for Data Pipelines at Scale - Lessons from Netflix
Job orchestration and scheduling are core parts of data engineering. In this InfoQ talk, Netflix narrates its data pipeline scheduler design and the lessons learned from operating large-scale pipelines.
Shopify: Shopify’s Unique Data Science Hierarchy Of Needs
Data analytics goes through its hierarchy of needs, from descriptive to predictive analytics to prescriptive action. Shopify shares its journey on handling pandemic data and the data science maturity model.
Sponsored: The Data Stack Show Live - What is the Modern Data Stack?
Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.
Etsy: Redesigning Etsy’s Machine Learning Platform
We often underestimate the lead time to train a new engineer on an internal tool, which is a significant push for companies to adopt open standards and open source systems. Etsy writes about its journey through the ML platform redesign on Google Cloud.
Twitter: Advancing Jupyter Notebooks at Twitter
The flexibility of Notebooks also brings the challenge of disconnected infrastructure in an organization. Twitter narrates the tools that helped simplify notebook lifecycle management and integrated development environments.
Khuyen Tran: Top Bootcamps for Data Professionals— An Analysis of 5000 Profiles
What are the top bootcamps and universities for data scientists? The author takes an excellent data-driven approach to figure this out. It is interesting to see Udacity ranked above traditional universities, and in this analysis it is the top bootcamp for data science.
Vinoth Chandar: Lakehouse Concurrency Controls - Are we too optimistic?
Data lakes have come a long way, and supporting transactions is now an essential characteristic of Lakehouse design. The author writes an exciting overview of different patterns of concurrency control and Apache Hudi's support for it.
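For readers new to the idea, here is a minimal sketch of optimistic concurrency control in general, not Apache Hudi's implementation. The `Table` class, its file-level conflict check, and all names are hypothetical: writers proceed without locks, and a commit is rejected only if another writer changed one of the same files after this writer started.

```python
class CommitConflict(Exception):
    """Raised when a concurrent commit touched the same files."""


class Table:
    """Toy single-process table log (hypothetical, for illustration only)."""

    def __init__(self):
        self.version = 0
        # Maps a file name to the version of the last commit that changed it.
        self.last_changed = {}

    def begin(self):
        # A writer remembers the table version it started from.
        return self.version

    def commit(self, start_version, files):
        # Validate at commit time: did anyone change our files since we began?
        conflicts = sorted(
            f for f in files if self.last_changed.get(f, 0) > start_version
        )
        if conflicts:
            raise CommitConflict(f"files changed concurrently: {conflicts}")
        self.version += 1
        for f in files:
            self.last_changed[f] = self.version
        return self.version
```

Two writers touching different files both succeed, while the second of two writers touching the same file has to retry. A real system would also need the validate-and-commit step itself to be atomic, which this sketch ignores.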
Yotpo: Scheduling Millions Of Messages With Kafka & Debezium
Yotpo writes about its CDC pipeline using Kafka & Debezium for email services. Domain event sourcing is increasingly adopting the transactional outbox pattern. IMAO, Debezium is an underrated yet critical open source system in data engineering. It deserves much more limelight than it gets now.
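As a rough illustration of the transactional outbox pattern, here is a minimal sketch using SQLite. It is not Yotpo's code; the table names, columns, and the `schedule_email` helper are assumptions made up for the example. The point is that the business row and the event row are written in one transaction, so a CDC tool such as Debezium can later stream the outbox table to Kafka without the two ever disagreeing.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one for the business data, one for outgoing events.
    conn.executescript(
        """
        CREATE TABLE scheduled_emails (
            id INTEGER PRIMARY KEY,
            recipient TEXT NOT NULL,
            send_at TEXT NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            aggregate_id INTEGER NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        """
    )


def schedule_email(conn, recipient, send_at):
    """Insert the email and its event atomically; return the new email id."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute(
            "INSERT INTO scheduled_emails (recipient, send_at) VALUES (?, ?)",
            (recipient, send_at),
        )
        email_id = cur.lastrowid
        payload = json.dumps(
            {"email_id": email_id, "recipient": recipient, "send_at": send_at}
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?)",
            (email_id, "EmailScheduled", payload),
        )
    return email_id
```

The application never publishes to Kafka directly here. Publishing is left to whatever reads the outbox table, which is the role a CDC connector plays in this pattern.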
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.