Data Engineering Weekly #69

Weekly Data Engineering Newsletter

Jan 10, 2022

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Devoted Health: One Year of dbt

Devoted Health shares its experience running dbt for a year. The two-phase commit strategy to publish datasets, the wrapper tooling to handle authentication, and the unit testing framework built on top of dbt are the highlights. The blog is an excellent reminder that any tool integrated well with the developer workflow multiplies its effectiveness.

https://tech.devoted.com/one-year-of-dbt-b2e8474841ca
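For readers new to the idea, a two-phase publish can be sketched as "build and validate in staging, then swap into place." The snippet below is only an illustration under stated assumptions, not Devoted Health's implementation: it uses SQLite instead of a warehouse, and the function name `publish_two_phase`, the `__staging` suffix, and the sample tables are all hypothetical.

```python
import sqlite3


def publish_two_phase(conn, table, build_sql, check_sql):
    """Build into a staging table, validate it, then swap it into place.

    Hypothetical helper: `check_sql` must return a count of bad rows and
    use "{t}" as a placeholder for the staging table name.
    """
    staging = f"{table}__staging"
    cur = conn.cursor()

    # Phase 1: build the candidate dataset where consumers do not read it.
    cur.execute(f"DROP TABLE IF EXISTS {staging}")
    cur.execute(f"CREATE TABLE {staging} AS {build_sql}")

    bad_rows = cur.execute(check_sql.format(t=staging)).fetchone()[0]
    if bad_rows:
        # Validation failed: discard the candidate, leave the published table alone.
        cur.execute(f"DROP TABLE {staging}")
        return False

    # Phase 2: replace the published table inside one transaction.
    cur.execute("BEGIN")
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"ALTER TABLE {staging} RENAME TO {table}")
    cur.execute("COMMIT")
    return True


# Assumption: autocommit mode, so the explicit BEGIN/COMMIT above controls the swap.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE raw_members (id INTEGER, plan TEXT)")
conn.executemany(
    "INSERT INTO raw_members VALUES (?, ?)", [(1, "gold"), (2, "silver")]
)

ok = publish_two_phase(
    conn,
    "members",
    "SELECT id, plan FROM raw_members",
    "SELECT COUNT(*) FROM {t} WHERE plan IS NULL",
)
```

The point of the design is that a failed validation never touches the table consumers read; only a candidate that passes its checks gets published.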

Matt: The future history of Data Engineering

The article captures the trend toward commoditizing data infrastructure complexity. Any successful technology has to move from niche to commodity to survive in the long term.

However, it is vital to remember that What Goes Around Comes Around in software engineering. In the past, companies ran their CRM and HR systems in-house, and tools like Microsoft SSIS and Informatica successfully commoditized ETL until the underlying business model shifted to SaaS. With the likes of Fivetran and Airbyte, we are just reinventing SSIS. We don't know how the underlying business model will change in the next decade, so long live ETL.

https://groupby1.substack.com/p/data-engineering

What Goes Around Comes Around Paper. [Must read in case you missed it]

Thinh Ha: 10 reasons why you are not ready to adopt data mesh

This article is an excellent checklist to review before starting your data mesh journey. The author highlights the need for organizational maturity before taking the data mesh approach, since its principles require a strong foundation and tooling.

https://medium.com/google-cloud/10-reasons-why-you-should-not-adopt-data-mesh-7a0b045ea40f

Mikkel Dengsøe: Data to engineers ratio - A deep dive into 50 top European tech companies

The blog is an excellent analysis of the data-to-engineers ratio across organizations and how an organization's engineering culture impacts its hiring pattern. It is interesting to see platform/marketplace companies hire more data engineers than B2B companies.

https://mikkeldengsoe.substack.com/p/data-to-engineers

Halodoc: Lake House Architecture @ Halodoc - Data Platform 2.0

Halodoc writes an excellent overview of its data platform 2.0, focusing on the Lake House architecture. The blog narrates some of the key takeaways from implementing Apache Hudi and from taking a configuration-driven approach to onboarding new tables. Kudos for including the end-to-end reference architecture diagram.

https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/amp/

Picnic: Picnic Analytics Platform - Migration from AWS Kinesis to Confluent Cloud

Picnic writes about its migration story from AWS Kinesis to Confluent Cloud. The prime motivation behind the move seems to be longer retention and access to the broad Kafka ecosystem. Interestingly, Kinesis can't extend its hot data retention beyond seven days!

https://blog.picnic.nl/picnic-analytics-platform-migration-from-aws-kinesis-to-confluent-cloud-adb06601c78

PayPal: Sales Pipeline Management with Machine Learning - A Lightweight Two-Layer Ensemble Classifier Framework

PayPal writes about ML-driven sales pipeline management. Its lightweight two-layer ensemble classifier framework, built as a solution to progressive prediction problems, is an exciting read.

https://medium.com/paypal-tech/sales-pipeline-management-with-machine-learning-15398bab913b

Apache DolphinScheduler: From Airflow to Apache DolphinScheduler, the Evolution of Scheduling System On Youzan Big Data Development Platform

Youzan writes an in-depth overview of migrating its data orchestration engine from Airflow to Apache DolphinScheduler. The article contains an excellent comparison of Airflow and DolphinScheduler regarding scalability and high availability.

https://medium.com/@ApacheDolphinScheduler/from-airflow-to-apache-dolphinscheduler-the-evolution-of-scheduling-system-on-youzan-big-data-ec897f310f91

Google Cloud: Announcing preview of BigQuery’s native support for semi-structured data

I firmly believe that native indexing support for semi-structured data is a must-have feature in modern data warehouse systems. It is exciting to see Google BigQuery announce native support for semi-structured data.

https://cloud.google.com/blog/products/data-analytics/bigquery-now-natively-supports-semi-structured-data

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
