Data Engineering Weekly #90
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
dbt: How We Rebuilt the dbt Cloud Scheduler
A scalable orchestration engine is the holy grail of a data pipeline. dbt writes an exciting blog detailing the redesign of its cloud orchestration engine. The use of round-robin load balancing is an interesting choice for a multi-tenant scheduler, since different tenants will have their own priorities to maintain. Airflow went with a priority queue as a workaround, and YARN with hierarchical queues. It's an incredible effort by the team to increase reliability from only 21% of runs being "on time" at the beginning of the year to 99.99%.
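To make the trade-off concrete, here is a minimal sketch of round-robin dispatch across tenants. It is not dbt Cloud's implementation; the tenant names, job names, and the `round_robin` helper are all hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Tuple


def round_robin(queues: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (tenant, job) pairs, taking one job per tenant per pass."""
    # Keep only tenants that have work, in insertion order.
    pending: Deque[Tuple[str, Deque[str]]] = deque(
        (tenant, deque(jobs)) for tenant, jobs in queues.items() if jobs
    )
    while pending:
        tenant, jobs = pending.popleft()
        yield tenant, jobs.popleft()
        if jobs:
            # Tenant still has work: send it to the back of the line.
            pending.append((tenant, jobs))


# Hypothetical tenants and jobs, for illustration only.
schedule = list(
    round_robin(
        {
            "tenant_a": ["a1", "a2", "a3"],
            "tenant_b": ["b1"],
            "tenant_c": ["c1", "c2"],
        }
    )
)
order = [job for _tenant, job in schedule]
print(order)
```

Every tenant gets a turn before any tenant gets a second one, so a tenant with a long backlog cannot starve the others. The flip side is that no tenant's jobs can jump the line, which is where priority or hierarchical queues come in.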
Cedric Chin: Slowly Changing Dimensions (SCDs) In The Age of The Cloud Data Warehouse
I happened to see this article pretty late; nonetheless, it is an excellent comparison of how the Kimball methodology handles SCDs with how we could handle them today.
Mikkel Dengsøe: Stakeholders - The most important relationship for analysts
Building trust with your stakeholders is the first step to building trust in the data. The author writes excellent do's and don'ts for both data analysts and stakeholders to establish that trust.
Emilie Schario: Building more effective data teams using the JTBD framework
Staying with building trust in data: an efficient execution framework is essential to facilitate data empowerment across an organization. The author introduces the JTBD (Jobs To Be Done) framework, which can help data teams prioritize the right work and be more impactful in their day-to-day. The JTBD framework focuses on five major areas of data functions, including:
Proactive Insight Discovery
Interfacing with the Data.
Sponsored: Firebolt - Embedded Analytics vs Data Apps
But Data Apps is still a loosely defined term, and there’s a lot of debate and confusion about what it really means, and how it differs from traditional dashboarding and embedded analytics. Boaz Farkash shares his point of view on the subject.
Gradient Flow: Data Quality Unpacked
Data management is moving to a healthy space where folks have started to talk more about the quality and integrity of data than about its volume. The blog highlights various dimensions of data quality, such as data profiling, data quality measurement, data cleansing & repair, and advanced capabilities such as entity resolution & collaborative metadata management.
LakeFS: The State of Data Engineering 2022
Are there too many tools to manage a data infrastructure? One cannot stop wondering why we have three formats under "Open Table Format." The blog rightly points out that the future of the metastore is still in the dark. The blog is an excellent summary of the state of data engineering in 2022.
Motif Analytics: Everything Is a Funnel, But SQL Doesn’t Get It
Is SQL a silver bullet for all data analytics problems? The blog narrates the complexity of computing funnel analysis in SQL. I remember a few conversations on tracing as a better choice for funnel analytics, and I am curious to know what comes out of Motif Analytics.
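For a sense of why funnels sit awkwardly in set-based SQL, here is a minimal sketch of an ordered funnel count written as a sequential pass over events. It is not Motif Analytics' approach; the event names, the sample data, and the `funnel_counts` helper are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (user_id, event_name, timestamp)


def funnel_counts(events: Iterable[Event], steps: List[str]) -> List[int]:
    """Count users who reached each funnel step, in the given order."""
    progress: Dict[str, int] = {}  # user_id -> number of steps completed
    for user, name, _ts in sorted(events, key=lambda e: e[2]):
        reached = progress.get(user, 0)
        # Only the next expected step advances a user through the funnel.
        if reached < len(steps) and name == steps[reached]:
            progress[user] = reached + 1
    return [
        sum(1 for done in progress.values() if done > i)
        for i in range(len(steps))
    ]


# Hypothetical sample data, for illustration only.
STEPS = ["view", "cart", "buy"]
EVENTS = [
    ("u1", "view", 1), ("u1", "cart", 2), ("u1", "buy", 3),
    ("u2", "cart", 1), ("u2", "view", 2),  # cart before view does not count
    ("u3", "view", 1), ("u3", "cart", 5),
]
counts = funnel_counts(EVENTS, STEPS)
print(dict(zip(STEPS, counts)))
```

The logic is a per-user state machine over time-ordered events. Expressing the same thing in SQL typically takes one self-join or window function per step, which is the complexity the blog describes.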
Shopify: Introducing ShopifyQL - Our New Commerce Data Querying Language
Staying in "is SQL good enough" territory, Shopify writes a timely article about ShopifyQL, a SQL-ish domain language for e-commerce analytics. Perhaps it is the beginning of domain-specific SQL-ish DSLs?
Sponsored: RudderStack - What is the Growth Stack?
A detailed guide to building the Growth Stack—an architecture to centralize every data point into a comprehensive source of truth and activate that centralized data in downstream tools. The growth stack is phase two of RudderStack's Data Maturity Journey framework.
Ben Rogojan: Why Are We Still Struggling To Answer How Many Active Customers We Have?
We often quote counting as the most challenging problem in data engineering, but why? Why do we still struggle to answer basic questions like how many customers are still active or what exactly is the company’s churn? The author walks through the practical complexity of designing systems that seem as simple as computing churn.
Data Science at Microsoft: Scalable time series forecasting
Microsoft writes about forecasts for multiple univariate time series in a scalable manner, focusing on multi-horizon forecasts instead of single horizon forecasts. The article compares the result of TFT (Temporal Fusion Transformer) with the optimized Prophet approach published earlier.
Instacart: Griffin - How Instacart’s ML Platform Tripled ML Applications in a year
Instacart writes about its ML platform Griffin, an extensible platform that supports diverse data management systems and integrates with multiple machine learning tools and workflows. Airflow combined with AWS SageMaker and a cloud data warehouse is becoming the de facto ML platform over bespoke vendor solutions.
Canva Engineering: Service-aligned Data Platform Architecture
Canva publishes an excellent account of its data ingestion journey, from database snapshots to CDC (Change Data Capture), to ingest business process changes. The blog narrates how Snowpipe and S3 are used to integrate continuous data ingestion into Snowflake.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
Nit: At Canva we use Snowpipe, not Snowplow.