Data Engineering Weekly

Data Engineering Weekly #73

www.dataengineeringweekly.com

Weekly Data Engineering Newsletter

Ananth Packkildurai
Feb 7, 2022
Benn Stancil: Business in the back, party in the front

Another exciting take from Benn on the current state of the analytical ecosystem. The industry has reached a consensus on the ELT approach, with well-established tools and practices to support it. Analytical and BI tools, however, are tackling their problems in many different directions.

My take on this:

I believe the last-mile data consumption problem is going to remain. Unlike the ELT system, the BI system has a human in the loop. It might change from one form to another, but it is here to stay; any human-in-the-loop problem is inherently a wicked problem. Just as we switched from AOL -> Yahoo Messenger -> MySpace -> Orkut -> Facebook -> WhatsApp, it is a never-ending process.

benn.substack
Business in the back, party in the front
The front is falling off. Or, more accurately, the front is splitting into a thousand tiny pieces, dumping 20,000 tons of crude oil into our corporate environments. In this case, our enormous faceless frigate is the front of the modern data stack. Over the last decade, the data industry has been building a giant ship, now worth hundreds of billions of do…

Damon Cortesi: An Introduction to Modern Data Lake Storage Layers

The blog is an exciting overview of three lakehouse systems: Apache Hudi, Apache Iceberg, and Delta Lake. The author walks through the features of each system and how they support time travel queries. I'm looking forward to the follow-up blog on the Delete operation.

https://dacort.dev/posts/modern-data-lake-storage-layers/

The author has also given a talk on the same topic.
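
To make the idea of a time travel query concrete, here is a minimal sketch in plain Python. It is a toy model, not the API or storage layout of Hudi, Iceberg, or Delta Lake: the `ToySnapshotTable` class, its methods, and the sample rows are all hypothetical, and the integer timestamps are an assumption made for readability.

```python
from dataclasses import dataclass, field


@dataclass
class ToySnapshotTable:
    """Toy snapshot log (hypothetical; not a real lakehouse API)."""

    # Each entry is (commit_timestamp, copy of all rows at that commit).
    snapshots: list = field(default_factory=list)

    def commit(self, ts: int, rows: dict) -> None:
        # Real systems track incremental metadata; a full copy keeps the sketch short.
        if self.snapshots and ts <= self.snapshots[-1][0]:
            raise ValueError("commit timestamps must be strictly increasing")
        self.snapshots.append((ts, dict(rows)))

    def read_as_of(self, ts: int) -> dict:
        # Time travel: return the latest snapshot committed at or before ts.
        visible = [rows for commit_ts, rows in self.snapshots if commit_ts <= ts]
        if not visible:
            raise LookupError(f"no snapshot at or before ts={ts}")
        return dict(visible[-1])


table = ToySnapshotTable()
table.commit(100, {"user_1": "active"})
table.commit(200, {"user_1": "active", "user_2": "active"})
table.commit(300, {"user_2": "active"})  # user_1 is deleted in this commit

latest = table.read_as_of(300)
as_of_250 = table.read_as_of(250)  # still sees user_1, deleted later at ts=300
print(latest, as_of_250)
```

The point of the sketch is only that old snapshots stay addressable after a delete, which is what lets a reader query the table as it looked at an earlier commit.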


Stanford HAI: Data-Centric AI - AI Models Are Only as Good as Their Data Pipeline

Stanford HAI (Human-Centered Artificial Intelligence) writes an exciting blog on data-centric AI.

developers must turn their attention toward the data side of AI research, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence. “One of the best ways to improve algorithms’ trustworthiness is to improve the data that goes into training and evaluating the algorithm,” he says.

https://hai.stanford.edu/news/data-centric-ai-ai-models-are-only-good-their-data-pipeline

All the Stanford HAI data-centric AI virtual workshops are available on YouTube.


Emily Thompson: Productizing Analytics - A Retrospective

One of the burning questions for every data team is:

How do you build an analytics platform that is flexible enough to let people explore and answer their questions, while at the same time making sure there are guardrails in place so that similar-sounding metrics that are calculated slightly differently don’t compete with each other, causing a loss of hard-earned trust in the data team?

The author wrote an exciting retrospective on their data team's mission and accomplishments.

The Data Leader's Survival Guide
Productizing Analytics: A Retrospective
Like many companies that formed before the era of “Big Data”, Mozilla needed to undergo a paradigm shift in order to bring data into the heart of its decision-making. While I was there, we established a cross-functional team that went on to build a centralized experimentation platform, brought Data Science, Marketing Analy…

Zhenzhong Xu: The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure

Kishore Gopalakrishna (@KishoreBytes), 12:39 AM, Jan 16, 2022:
"@gunnarmorling @sc13ts Stream processing is 1000x harder than what people think it is.. Once they realize it is hard 😉 Most start thinking it’s a simple function applied on every event in the stream."

Netflix is one of the companies known for running large-scale real-time data infrastructure. The author writes an exciting blog narrating the four generations of Netflix's real-time data infrastructure.

https://zhenzhongxu.com/the-four-innovation-phases-of-netflixs-trillions-scale-real-time-data-infrastructure-2370938d7f01


Mikkel Dengsøe: We’ve only scratched the surface of the full potential for the data warehouse

Researchers project the data warehouse market to grow 34% each year until it reaches $39b in 2026. The author makes a well-articulated case that data warehouses will become the core of modern companies.
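
As a quick sanity check on what that projection implies, the sketch below back-solves the starting market size from the two quoted numbers. The base year of 2021 is my assumption, not something stated above, and the function name is hypothetical.

```python
def implied_base(target: float, annual_growth: float, years: int) -> float:
    """Starting value that compounds to `target` after `years` of growth."""
    return target / (1 + annual_growth) ** years


TARGET_2026_USD_B = 39.0  # from the projection quoted above
ANNUAL_GROWTH = 0.34      # from the projection quoted above
ASSUMED_BASE_YEAR = 2021  # assumption: the base year is not given above

years = 2026 - ASSUMED_BASE_YEAR
base = implied_base(TARGET_2026_USD_B, ANNUAL_GROWTH, years)

# Print the implied market size for each year under constant compounding.
for offset in range(years + 1):
    size = base * (1 + ANNUAL_GROWTH) ** offset
    print(f"{ASSUMED_BASE_YEAR + offset}: ${size:.1f}b")
```

Under that assumption, the implied 2021 market is about $9b, so the projection amounts to the market growing more than fourfold in five years.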

Inside Data by Mikkel Dengsøe
We’ve only scratched the surface of the full potential for the data warehouse
It may feel like we’re at the peak point for the data warehouse. Data teams are approaching 50% of engineering team size in some companies, Snowflake revenue has grown more than 100% the last year and the modern data stack is now a commonly used term…

Barr Moses: Stop Treating Your Data Engineer Like a Data Catalog

Data certification is a standard approach in many data-driven companies to streamline business metrics and build trust in data. The author writes an excellent blog on six steps to implementing a data certification program.

https://barrmoses.medium.com/stop-treating-your-data-engineer-like-a-data-catalog-14ed3eacf646


Monzo: How we validated our handling time data

Counting is the most complex problem in data engineering; in fact, that is the only problem we are all trying to solve other than moving data from one S3 bucket to another. - The hard truth of data engineering that no one wants to hear :-)

How long do Monzo's customer service staff spend on a given task? A simple counting query, isn't it? Monzo narrates their experience and the complexity of validating handling-time data.

https://monzo.com/blog/2022/02/04/how-we-validated-our-handling-time-data
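
To illustrate why a "simple counting query" can go wrong, here is a minimal sketch of one common pitfall: overlapping task intervals being double-counted. This is not Monzo's method. The event tuples, agent names, and function names are hypothetical, and times are assumed to be seconds since the start of a shift.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so shared time counts once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def handling_seconds(events):
    """Total handling time per agent, without double-counting overlaps."""
    by_agent = defaultdict(list)
    for agent, start, end in events:
        if end < start:
            raise ValueError(f"interval ends before it starts: {(agent, start, end)}")
        by_agent[agent].append((start, end))
    return {
        agent: sum(end - start for start, end in merge_intervals(intervals))
        for agent, intervals in by_agent.items()
    }


# Hypothetical events: (agent, task_start_seconds, task_end_seconds).
events = [
    ("agent_a", 0, 600),
    ("agent_a", 300, 900),    # overlaps the previous task
    ("agent_a", 1200, 1500),
    ("agent_b", 0, 300),
]

naive_a = sum(end - start for agent, start, end in events if agent == "agent_a")
totals = handling_seconds(events)
print(naive_a, totals)
```

In this made-up data, the naive sum for `agent_a` is 1500 seconds, while the merged total is 1200 seconds. Which number is "right" depends on the definition of handling time, and that definition is the hard part.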


Vimeo: dbt development at Vimeo

Vimeo writes an exciting blog on its adoption of dbt and how it compares to their previous workflow. I like how the author focused on developer workflow rather than comparing the functionality of the tools. The discussion of the pain points of using Airflow Jinja templates for SQL pipelines, and how dbt solves them, is a great read.

https://medium.com/vimeo-engineering-blog/dbt-development-at-vimeo-fe1ad9eb212


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
