Data Engineering Weekly #47

Weekly Data Engineering Newsletter

Welcome to the 47th edition of the data engineering newsletter. This week's edition brings a new set of articles: Erik Bernhardsson on building a data team, Jamie Brandon's Against SQL, Benn Stancil's Self-serve is a feeling, Saxo Bank on enabling data discovery, OpenLineage on backfilling Airflow DAGs using Marquez, Shopify on tuning Trino queries, DoorDash on modularizing its recommendation engine, Uber on tuning ML models, the Financial Times on learning from rapid experiments, and OLAP != OLAP Cube.

Erik Bernhardsson: Building a data team at a mid-stage startup - a short story

The blog is possibly one of the best narrations of the real-world complexity of data engineering. It walks through the stages a data team goes through as an organization grows and shows how the team can be a catalyst for a data-driven culture.

Jamie Brandon: Against SQL

SQL is the lingua franca for databases, but how does it compare with general-purpose languages? The author takes a fresh perspective on SQL, discussing some of its shortcomings and what might come after SQL.

Benn Stancil: Self-serve is a feeling

Every organization loves to talk about self-serve analytics, but the definition of self-serve is often vague. Is a bot that answers every business question self-serve? The author narrates the many aspects of self-serve and emphasizes the importance of chasing the self-serve experience that makes the data team and the data consumers feel most at home.

OpenLineage: Backfilling Airflow DAGs using Marquez

Backfilling is a vital part of operating a data pipeline, whether to fix a computation or to produce a newer version of a dataset. In a typical functional data engineering setup, a backfill can have a cascading effect on downstream jobs. Though systems like Airflow do provide backfilling capabilities out of the box, the scope is limited to a single DAG definition. Marquez writes an exciting blog that narrates how to use the Marquez lineage API to trigger end-to-end backfilling.
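
To illustrate why lineage matters for a cascading backfill, here is a minimal sketch. It does not call the Marquez or Airflow APIs; the LINEAGE dictionary, the job names, and the downstream_backfill_order function are all hypothetical. Given a job that needs to be rerun, it walks the lineage graph and returns the downstream jobs in an order where every job runs after the jobs it depends on.

```python
from collections import deque

# Hypothetical lineage graph: each job maps to the jobs that consume its output.
# In practice this information would come from a lineage service.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["exec_dashboard"],
    "customer_ltv": ["exec_dashboard"],
    "exec_dashboard": [],
}


def downstream_backfill_order(lineage, start):
    """Return `start` and all jobs downstream of it, in dependency order."""
    # Step 1: collect every job reachable from the starting job.
    reachable = set()
    stack = [start]
    while stack:
        job = stack.pop()
        if job in reachable:
            continue
        reachable.add(job)
        stack.extend(lineage.get(job, []))

    # Step 2: topological sort (Kahn's algorithm) restricted to those jobs.
    indegree = {job: 0 for job in reachable}
    for job in reachable:
        for child in lineage.get(job, []):
            indegree[child] += 1

    queue = deque(sorted(job for job, deg in indegree.items() if deg == 0))
    order = []
    while queue:
        job = queue.popleft()
        order.append(job)
        for child in sorted(lineage.get(job, [])):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(order) != len(reachable):
        raise ValueError("lineage graph contains a cycle")
    return order


if __name__ == "__main__":
    for job in downstream_backfill_order(LINEAGE, "clean_orders"):
        print("backfill", job)
```

Note that exec_dashboard is scheduled only after both of its upstream jobs, which is the part a per-DAG backfill cannot guarantee when the jobs live in different DAGs.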

DataHub / Saxo Bank: Enabling Data Discovery in a Data Mesh - The Saxo Journey

Saxo Bank writes about its data infrastructure and its in-house central data management application, "Data Workbench," powered by DataHub and Great Expectations. The blog narrates the data inconsistency issues that result from inconsistent naming and Saxo Bank's approach to addressing them with the data glossary feature.

Shopify: Shopify's Path to a Faster Trino Query Execution - Infrastructure

Shopify writes about its experience in tuning the Trino query infrastructure. The workload-specific Trino clusters, the analysis of coordinator node congestion, and limiting the number of drivers per query to prevent compute starvation are some of the exciting reads.

DoorDash: Leveraging the Pipeline Design Pattern to Modularize Recommendation Services

DoorDash writes about its experience applying the pipeline design pattern to the explore page to improve modularization. The blog is an exciting read on how a pipeline approach decouples retrieval from ranking to solve the information retrieval problem efficiently.
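
As a rough illustration of the pipeline design pattern, the sketch below chains small, independent stages so that retrieval and ranking can change without touching each other. It is not DoorDash's implementation; the stage names, the store fields (is_open, score), and the run_pipeline helper are assumptions made up for this example.

```python
from typing import Callable, Dict, List

Store = Dict[str, object]
Stage = Callable[[List[Store]], List[Store]]


def retrieve_open_stores(stores: List[Store]) -> List[Store]:
    """Retrieval stage: keep only candidates that are eligible to be shown."""
    return [s for s in stores if s["is_open"]]


def rank_by_score(stores: List[Store]) -> List[Store]:
    """Ranking stage: order candidates by a (hypothetical) relevance score."""
    return sorted(stores, key=lambda s: s["score"], reverse=True)


def top_k(k: int) -> Stage:
    """Post-processing stage factory: truncate the ranked list to k items."""
    def stage(stores: List[Store]) -> List[Store]:
        return stores[:k]
    return stage


def run_pipeline(stages: List[Stage], stores: List[Store]) -> List[Store]:
    """Run each stage in order, feeding its output to the next stage."""
    for stage in stages:
        stores = stage(stores)
    return stores


STORES = [
    {"name": "sushi", "is_open": True, "score": 0.7},
    {"name": "pizza", "is_open": False, "score": 0.9},
    {"name": "tacos", "is_open": True, "score": 0.8},
    {"name": "salad", "is_open": True, "score": 0.2},
]

if __name__ == "__main__":
    pipeline = [retrieve_open_stores, rank_by_score, top_k(2)]
    print([s["name"] for s in run_pipeline(pipeline, STORES)])
```

Because every stage has the same signature, a new ranker or an extra filter can be swapped in by editing the list of stages rather than the stages themselves.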

Uber: Tuning Model Performance

Creating and maintaining a high-performing model is an iterative process. Uber writes about its ML platform, Michelangelo, and its support for both iterative tuning and one-off comprehensive tuning of ML models.

Financial Times: 6 Lessons from rapid experimentation at the Financial Times

A fast-changing content platform makes running A/B tests challenging. The Financial Times writes about the lessons it learned from adopting a rapid experimentation strategy.

Holistic Blog: OLAP != OLAP Cube

OLAP and OLAP Cubes are often confused: OLAP describes an access pattern, while an OLAP Cube is a data structure. The blog walks through the distinction between the two terms.
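
The distinction can be sketched in a few lines. This is a toy example under assumed names (SALES, olap_query, build_cube), not code from the blog: olap_query answers an analytical question by aggregating the raw rows at query time, while build_cube precomputes the same aggregates for every combination of dimensions and stores them for lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: one row per sale.
SALES = [
    {"region": "EU", "product": "book", "amount": 10},
    {"region": "EU", "product": "pen", "amount": 5},
    {"region": "US", "product": "book", "amount": 20},
]
DIMENSIONS = ("region", "product")


def olap_query(rows, group_by):
    """OLAP as an access pattern: aggregate the raw rows at query time."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dim] for dim in group_by)
        totals[key] += row["amount"]
    return dict(totals)


def build_cube(rows, dimensions):
    """OLAP cube as a data structure: precompute totals for every
    combination of dimensions so later queries become dictionary lookups."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cube[dims] = olap_query(rows, dims)
    return cube


if __name__ == "__main__":
    # Same question, answered two ways.
    print(olap_query(SALES, ("region",)))
    cube = build_cube(SALES, DIMENSIONS)
    print(cube[("region",)])
```

Both calls return the same totals; the difference is whether the aggregation work happens when the question is asked or ahead of time when the cube is built.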

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.