Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Devoted Health: One Year of dbt
Devoted Health shares its experience running dbt for a year. The two-phase commit strategy to publish datasets, the wrapper tooling to handle authentication, and the unit testing framework on top of dbt are some of the exciting reads. The blog is an excellent reminder that any tool integrated well with the developer workflow multiplies its effectiveness.
https://tech.devoted.com/one-year-of-dbt-b2e8474841ca
Matt: The future history of Data Engineering
The article captures the trend toward commoditizing data infrastructure complexity. Any successful technology should move from niche to commoditized to survive in the long term.
However, it is vital to remember that What Goes Around Comes Around in software engineering. In the past, companies ran their own CRM and HR systems in-house, and tools like Microsoft SSIS and Informatica successfully commoditized ETL until the underlying business model changed to SaaS. With the likes of Fivetran and Airbyte, we are just reinventing SSIS. We don't know how the underlying business model will change in the next decade, so long live ETL.
https://groupby1.substack.com/p/data-engineering
What Goes Around Comes Around Paper.
[Must read in case you missed it]
Thinh Ha: 10 reasons why you are not ready to adopt data mesh
This article is an excellent checklist to review before starting your data mesh journey. The author highlights the need for organizational maturity before taking the data mesh approach, since its principles require a strong foundation and tooling.
https://medium.com/google-cloud/10-reasons-why-you-should-not-adopt-data-mesh-7a0b045ea40f
Mikkel Dengsøe: Data to engineers ratio - A deep dive into 50 top European tech companies
The blog is an excellent analysis of the data-to-engineers ratio in an organization and how the organization's engineering culture impacts its hiring pattern. It is interesting to see platform/marketplace companies hire more data engineers than B2B companies.
https://mikkeldengsoe.substack.com/p/data-to-engineers
Halodoc: Lake House Architecture @ Halodoc - Data Platform 2.0
Halodoc writes an excellent overview of its data platform 2.0, focusing on the LakeHouse architecture. The blog narrates some of the key takeaways from implementing Apache Hudi and a configuration-driven approach to onboarding new tables. Kudos for including the end-to-end reference architecture diagram.
https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/amp/
Sponsored: New Year, Better Event Data with Avo & RudderStack
Join RudderStack and Avo for a live webinar on January 27 @ 9am PT to learn how you can increase your event data quality and streamline your behavioral data pipelines.
https://www.avo.app/event-driven-infrastructure-webinar
Picnic: Picnic Analytics Platform - Migration from AWS Kinesis to Confluent Cloud
Picnic writes about its migration from AWS Kinesis to Confluent Cloud. The prime motivation behind the move seems to be longer retention and access to the broader Kafka ecosystem. Interestingly, Kinesis can't extend its hot data retention for more than seven days!
PayPal: Sales Pipeline Management with Machine Learning - A Lightweight Two-Layer Ensemble Classifier Framework
PayPal writes about ML-driven sales pipeline management. The lightweight two-layer ensemble classifier framework as a solution to progressive prediction problems is an exciting read.
https://medium.com/paypal-tech/sales-pipeline-management-with-machine-learning-15398bab913b
Apache DolphinScheduler: From Airflow to Apache DolphinScheduler, the Evolution of Scheduling System On Youzan Big Data Development Platform
Youzan writes an in-depth overview of the migration of its data orchestration engine from Airflow to Apache DolphinScheduler. The article contains an excellent comparison of Airflow and DolphinScheduler regarding scalability and high availability.
Google Cloud: Announcing preview of BigQuery’s native support for semi-structured data
I firmly believe that native indexing support for semi-structured data is a must-have feature in modern data warehouse systems. It is exciting to see Google BigQuery announce native support for semi-structured data.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.