Data Engineering Weekly #90

The Weekly Data Engineering Newsletter

Jun 26, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

dbt: How We Rebuilt the dbt Cloud Scheduler

A scalable orchestration engine is the holy grail of a data pipeline. dbt writes an exciting blog detailing the redesign of its cloud orchestration engine. The use of round-robin load balancing is an interesting choice for a multi-tenant scheduler, since different tenants will have their own priorities to maintain. Airflow works around this with a priority queue, and YARN with hierarchical queues. It's an incredible effort by the team to increase reliability from only 21% of runs being "on time" at the beginning of the year to 99.99%.
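To make the round-robin idea concrete, here is a minimal sketch of taking one job per tenant per pass. It is an illustration only, not dbt's implementation; the function name `round_robin` and the tenant and job names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


def round_robin(queues: Dict[str, Deque[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (tenant, job) pairs, taking one job per tenant per pass."""
    # Only tenants that currently have work take part in the rotation.
    pending = deque(tenant for tenant, jobs in queues.items() if jobs)
    while pending:
        tenant = pending.popleft()
        yield tenant, queues[tenant].popleft()
        # A tenant with remaining jobs goes to the back of the line.
        if queues[tenant]:
            pending.append(tenant)


# Hypothetical tenants with uneven backlogs.
queues = {
    "tenant_a": deque(["a1", "a2", "a3"]),
    "tenant_b": deque(["b1"]),
    "tenant_c": deque(["c1", "c2"]),
}
order = [job for _, job in round_robin(queues)]
print(order)
```

Every tenant gets a turn no matter how large another tenant's backlog is, which is the fairness property. What it does not express is priority between tenants, which is the trade-off noted above.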

https://www.getdbt.com/blog/rebuilding-dbt-scheduler/

Cedric Chin: Slowly Changing Dimensions (SCDs) In The Age of The Cloud Data Warehouse

I happened to see this article pretty late; nonetheless, it is an excellent comparison of how the Kimball methodology handles SCDs with how we could handle them today.
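For readers who have not worked with SCDs, here is a minimal sketch of the classic Type 2 technique: close the current row and append a new version. It is not taken from the article; the `DimRow` fields and the helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    customer_id: str
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[DimRow], customer_id: str, city: str, changed_on: date) -> None:
    """Type 2 update: close the current row, then append the new version."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == city:
                return  # attribute unchanged, keep the current row open
            row.valid_to = changed_on
    history.append(DimRow(customer_id, city, changed_on))


def city_as_of(history: List[DimRow], customer_id: str, when: date) -> Optional[str]:
    """Return the city that was valid for the customer on the given date."""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from <= when
                and (row.valid_to is None or when < row.valid_to)):
            return row.city
    return None


# Hypothetical customer who moves once.
history = [DimRow("c1", "Chennai", date(2021, 1, 1))]
apply_change(history, "c1", "Bengaluru", date(2022, 3, 1))
print(city_as_of(history, "c1", date(2021, 6, 1)))
print(city_as_of(history, "c1", date(2022, 6, 1)))
```

The validity ranges are what let a fact table join to the dimension as it was on the day of the event.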

https://www.holistics.io/blog/scd-cloud-data-warehouse/

Mikkel Dengsøe: Stakeholders - The most important relationship for analysts

Building trust with your stakeholders is the first step to building trust in the data. The author lays out excellent do's and don'ts for both data analysts and stakeholders to establish that trust.

https://medium.com/@mikldd/stakeholders-the-most-important-relationship-for-analysts-e078f3ea0c60

Emilie Schario: Building more effective data teams using the JTBD framework

Staying with building trust in data, an efficient execution framework is essential to facilitate data empowerment across an organization. The author introduces the JTBD (Jobs To Be Done) framework, which can help data teams prioritize the right work and be more impactful in their day-to-day. The JTBD framework focuses on five major areas of data functions:

Data Activation
Metrics Management
Proactive Insight Discovery
Driving Experimentation
Interfacing with the Data

https://locallyoptimistic.com/post/building-more-effective-data-teams-using-the-jtbd-framework/

Gradient Flow: Data Quality Unpacked

Data management is moving to a healthy space where folks have started to talk more about the quality and integrity of data than about its volume. The blog highlights various dimensions of data quality, such as data profiling, data quality measurement, data cleansing & repair, and advanced capabilities such as entity resolution & collaborative metadata management.

https://gradientflow.com/data-quality-unpacked/

LakeFS: The State of Data Engineering 2022

Are there too many tools to manage a data infrastructure? One cannot help wondering why we have three formats under "Open Table Format." The blog rightly calls out that the future of the metastore is still in the dark. The blog is an excellent summary of the state of data engineering in 2022.

https://lakefs.io/the-state-of-data-engineering-2022/

Motif Analytics: Everything Is a Funnel, But SQL Doesn’t Get It

Is SQL a silver bullet for all data analytics problems? The blog narrates the complexity of computing funnel analysis in SQL. I remember a few conversations on tracing as a better choice for funnel analytics, and I am curious to see what comes out of Motif Analytics.
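Part of what makes funnels awkward in SQL is that they are about ordered sequences of events per user. A minimal sketch of that logic outside SQL, assuming a hypothetical event tuple of (user_id, timestamp, event_name) and hypothetical step names:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int, str]  # (user_id, timestamp, event_name)


def funnel_counts(events: Iterable[Event], steps: List[str]) -> List[int]:
    """Count users reaching each step, requiring the steps to occur in order."""
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, ts, name in events:
        by_user[user].append((ts, name))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        next_step = 0
        # Walk each user's events in time order and advance one step at a time.
        for _, name in sorted(user_events):
            if next_step < len(steps) and name == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return counts


# Hypothetical event log.
events = [
    ("u1", 1, "visit"), ("u1", 2, "add_to_cart"), ("u1", 3, "purchase"),
    ("u2", 1, "visit"), ("u2", 2, "purchase"),  # skipped add_to_cart
    ("u3", 1, "add_to_cart"), ("u3", 2, "visit"), ("u3", 3, "add_to_cart"),
]
counts = funnel_counts(events, ["visit", "add_to_cart", "purchase"])
print(counts)
```

The same ordering constraint in SQL typically needs self-joins or window functions per step, which is where the complexity creeps in.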

https://motifanalytics.medium.com/everything-is-a-funnel-but-sql-doesnt-get-it-c35356424044

Shopify: Introducing ShopifyQL - Our New Commerce Data Querying Language

Staying in "Is SQL good enough?" territory, Shopify writes a timely article about ShopifyQL, a SQL-ish domain language for e-commerce analytics. Perhaps it is the beginning of domain-specific SQL-ish DSLs?

https://shopifyengineering.myshopify.com/blogs/engineering/shopify-commerce-data-querying-language-shopifyql

Ben Rogojan: Why Are We Still Struggling To Answer How Many Active Customers We Have?

We often quote counting as the most challenging problem in data engineering, but why? Why do we still struggle to answer basic questions like how many customers are still active or what exactly is the company’s churn? The author walks through the practical complexity of designing systems for something that seems as simple as computing churn.
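One reason the count is slippery is that "active" depends on the definition you pick. A minimal sketch, assuming a hypothetical definition of "placed an order within the last N days" and made-up customers:

```python
from datetime import date, timedelta
from typing import Dict


def active_customers(last_order: Dict[str, date], as_of: date, window_days: int) -> int:
    """Count customers whose last order falls within the trailing window."""
    cutoff = as_of - timedelta(days=window_days)
    return sum(1 for d in last_order.values() if cutoff < d <= as_of)


# Hypothetical last-order dates per customer.
last_order = {
    "c1": date(2022, 6, 20),
    "c2": date(2022, 5, 10),
    "c3": date(2022, 1, 15),
    "c4": date(2021, 6, 1),
}
as_of = date(2022, 6, 26)
by_window = {days: active_customers(last_order, as_of, days) for days in (30, 90, 365)}
print(by_window)
```

The same data gives three different answers for three reasonable windows, before even touching refunds, trials, or duplicate accounts.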

https://medium.com/coriers/why-are-we-still-struggling-to-answer-how-many-active-customers-we-have-191b2b3c09a0

Data Science at Microsoft: Scalable time series forecasting

Microsoft writes about forecasting multiple univariate time series in a scalable manner, focusing on multi-horizon forecasts instead of single-horizon forecasts. The article compares the results of TFT (Temporal Fusion Transformer) with the optimized Prophet approach published earlier.

https://medium.com/data-science-at-microsoft/scalable-time-series-forecasting-fee61da75923

Instacart: Griffin - How Instacart’s ML Platform Tripled ML Applications in a year

Instacart writes about its ML platform Griffin, an extensible platform that supports diverse data management systems and integrates with multiple machine learning tools and workflows. Airflow combined with AWS SageMaker and a cloud data warehouse is becoming a de facto ML platform over bespoke vendor solutions.

https://tech.instacart.com/griffin-how-instacarts-ml-platform-tripled-ml-applications-in-a-year-d3d4dcae3690

Canva Engineering: Service-aligned Data Platform Architecture

Canva publishes an excellent account of its data ingestion journey from database snapshots to CDC (Change Data Capture) for ingesting business process changes. The blog narrates how Snowplow and S3 are used to integrate continuous data ingestion into Snowflake.

https://canvatechblog.com/service-aligned-data-platform-architecture-6b5a6fc366c4

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
