Benn Stancil: Business in the back, party in the front
Another exciting take from Benn on the current state of the analytics ecosystem. As an industry, we have reached a consensus on the ELT approach, with well-established tools and practices to support it. The analytical and BI tools, however, still attack the consumption problem from different directions.
My take on this:
I believe last-mile data consumption is going to remain. Unlike the ELT system, the BI system has a human in the loop. It might change from one form to another, but it is here to stay; any human-in-the-loop problem is inherently a wicked problem. Just as we moved from AOL -> Yahoo Messenger -> MySpace -> Orkut -> Facebook -> WhatsApp, it is a never-ending process.
Damon Cortesi: An Introduction to Modern Data Lake Storage Layers
The blog is an exciting overview of the three lakehouse storage formats: Apache Hudi, Apache Iceberg, and Delta Lake. The author walks through the key features of each and how they support time travel queries. I'm looking forward to the follow-up blog on the Delete operation.
https://dacort.dev/posts/modern-data-lake-storage-layers/
Here is the talk covering the same material.
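To give a taste of the time travel feature the post covers, here is a minimal PySpark sketch against a Delta Lake table, assuming a Spark session already configured with the Delta Lake package; the table path is hypothetical, and Hudi and Iceberg expose similar capabilities under different option names.

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake package.
spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Hypothetical table path, used for illustration only.
table_path = "s3://my-bucket/warehouse/orders"

# Read the table as of an earlier version number...
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

# ...or as of a point in time.
orders_last_week = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-02-01")
    .load(table_path)
)

orders_v0.show()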
Stanford HAI: Data-Centric AI - AI Models Are Only as Good as Their Data Pipeline
Stanford HAI (Human-Centered Artificial Intelligence) writes an exciting blog on data-centric AI.
Developers must turn their attention toward the data side of AI research, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence. “One of the best ways to improve algorithms’ trustworthiness is to improve the data that goes into training and evaluating the algorithm,” he says.
https://hai.stanford.edu/news/data-centric-ai-ai-models-are-only-good-their-data-pipeline
All the Stanford HAI data-centric AI virtual workshops are available on YouTube.
Emily Thompson: Productizing Analytics - A Retrospective
One of the burning questions from every data team is:
How do you build an analytics platform that is flexible enough to let people explore and answer their questions, while at the same time making sure there are guardrails in place so that similar-sounding metrics that are calculated slightly differently don’t compete with each other, causing a loss of hard-earned trust in the data team?
The author wrote an exciting retrospective on their data team's mission and accomplishments.
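One lightweight way to picture the guardrails the quote asks for is a single, shared definition of each metric that every report builds from. The sketch below is hypothetical, not from the post, and the metric names and table are made up for illustration.

# Hypothetical single source of truth for metric definitions.
# Dashboards build queries from these snippets instead of re-deriving the logic.
METRICS = {
    "active_users": "COUNT(DISTINCT user_id)",
    "revenue_usd": "SUM(amount_usd)",
}

def metric_sql(name: str, table: str, where: str = "1=1") -> str:
    """Build a query from the shared definition so every team computes it the same way."""
    return f"SELECT {METRICS[name]} AS {name} FROM {table} WHERE {where}"

print(metric_sql("active_users", "events", "event_date >= '2022-01-01'"))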
Zhenzhong Xu: The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure
Netflix is one of the companies best known for running large-scale real-time data infrastructure. The author writes an exciting blog narrating the four generations of Netflix's real-time data infrastructure.
Mikkel Dengsøe: We’ve only scratched the surface of the full potential for the data warehouse
Researchers project the data warehouse market to grow 34% each year until it reaches $39b in 2026. The author makes a well-articulated case that the data warehouse will become the core of the modern company.
Barr Moses: Stop Treating Your Data Engineer Like a Data Catalog
Data certification is a standard approach in many data-driven companies to streamline business metrics and build trust in data. The author writes an excellent blog on six steps to implementing a data certification program.
https://barrmoses.medium.com/stop-treating-your-data-engineer-like-a-data-catalog-14ed3eacf646
Monzo: How we validated our handling time data
Counting is the most complex problem in data engineering; in fact, that is the only problem we are all trying to solve other than moving data from one S3 bucket to another. - The hard truth of data engineering that no one wants to hear :-)
How long does Monzo's customer service staff spend on a given task? A simple counting query, isn't it? Monzo narrates its experience validating handling time data and the complexity hiding behind it.
https://monzo.com/blog/2022/02/04/how-we-validated-our-handling-time-data
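To illustrate why the "simple" query isn't so simple, here is a minimal pandas sketch over a hypothetical task-event log (not Monzo's actual schema): the naive start-minus-finish calculation silently loses every task that never records a finish event, which is exactly the kind of gap validation has to catch.

import pandas as pd

# Hypothetical task-event log; not Monzo's actual schema.
events = pd.DataFrame(
    {
        "task_id": [1, 1, 2, 3],
        "event": ["start", "finish", "start", "start"],  # tasks 2 and 3 never finish
        "at": pd.to_datetime(
            ["2022-02-01 09:00", "2022-02-01 09:07", "2022-02-01 09:10", "2022-02-01 09:12"]
        ),
    }
)

# The "simple" version: pivot start/finish per task and subtract.
wide = events.pivot(index="task_id", columns="event", values="at")
wide["handling_time"] = wide["finish"] - wide["start"]

# Reality check: tasks with no finish event become NaT and quietly
# drop out of any average handling time.
print(wide)
print("tasks missing a finish event:", wide["finish"].isna().sum())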
Vimeo: dbt development at Vimeo
Vimeo writes an exciting blog on its adoption of dbt and how it compares to their previous workflow. I like how the author focused on developer workflow rather than comparing the functionality of the tools. The pain points of maintaining Jinja-templated SQL pipelines in Airflow, and how dbt solves them, are a great read.
https://medium.com/vimeo-engineering-blog/dbt-development-at-vimeo-fe1ad9eb212
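For a flavour of the pain point, here is a minimal sketch of hand-templating SQL with Jinja, the pattern the post describes moving away from; the table and variable names are hypothetical. In dbt, the upstream dependency would instead be expressed with ref() and resolved by the tool, so the dependency graph no longer lives in the pipeline author's head.

from jinja2 import Template

# Hand-maintained Jinja template for one SQL step; the upstream table name and
# the execution date have to be wired in manually by the orchestration code.
QUERY = Template(
    """
    SELECT video_id, COUNT(*) AS plays
    FROM {{ source_table }}
    WHERE play_date = '{{ ds }}'
    GROUP BY video_id
    """
)

rendered = QUERY.render(source_table="analytics.video_plays", ds="2022-02-01")
print(rendered)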
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.