Benn Stancil: Business in the back, party in the front
Another exciting take from Benn on the current state of the analytics ecosystem. As an industry, we have reached a consensus on the ELT approach, with well-established tools and practices to support it. The analytical and BI tools, however, still attack the consumption problem from different directions.
My take on this:
I believe last-mile data consumption is going to remain. Unlike the ELT system, the BI system has a human in the loop. It might change from one form to another, but it is here to stay; any human-in-the-loop problem is inherently a wicked problem. Just as we moved from AOL -> Yahoo Messenger -> MySpace -> Orkut -> Facebook -> WhatsApp, it is a never-ending process.
Damon Cortesi: An Introduction to Modern Data Lake Storage Layers
The blog is an exciting overview of the three lakehouse storage formats: Apache Hudi, Apache Iceberg, and Delta Lake. The author walks through the key features of each and how they support time travel queries. I'm looking forward to the follow-up blog on the Delete operation.
https://dacort.dev/posts/modern-data-lake-storage-layers/
Here is the talk covering the same material.
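To give a taste of the time travel feature the post covers, here is a minimal PySpark sketch against a Delta Lake table, assuming a Spark session already configured with the Delta Lake package; the table path is hypothetical, and Hudi and Iceberg expose similar capabilities under different option names.

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake package.
spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Hypothetical table path, used for illustration only.
table_path = "s3://my-bucket/warehouse/orders"

# Read the table as of an earlier version number...
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

# ...or as of a point in time.
orders_last_week = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-02-01")
    .load(table_path)
)

orders_v0.show()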
Stanford HAI: Data-Centric AI - AI Models Are Only as Good as Their Data Pipeline
Stanford HAI (Human-Centered Artificial Intelligence) writes an exciting blog on data-centric AI.
Developers must turn their attention toward the data side of AI research, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence. “One of the best ways to improve algorithms’ trustworthiness is to improve the data that goes into training and evaluating the algorithm,” he says.
https://hai.stanford.edu/news/data-centric-ai-ai-models-are-only-good-their-data-pipeline
All the Stanford HAI data-centric AI virtual workshops are available on YouTube.
Emily Thompson: Productizing Analytics - A Retrospective
One of the burning questions from every data team is:
How do you build an analytics platform that is flexible enough to let people explore and answer their questions, while at the same time making sure there are guardrails in place so that similar-sounding metrics that are calculated slightly differently don’t compete with each other, causing a loss of hard-earned trust in the data team?
The author wrote an exciting retrospective on their data team's mission and accomplishments.
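One lightweight way to picture the guardrails the quote asks for is a single, shared definition of each metric that every report builds from. The sketch below is hypothetical, not from the post, and the metric names and table are made up for illustration.

# Hypothetical single source of truth for metric definitions.
# Dashboards build queries from these snippets instead of re-deriving the logic.
METRICS = {
    "active_users": "COUNT(DISTINCT user_id)",
    "revenue_usd": "SUM(amount_usd)",
}

def metric_sql(name: str, table: str, where: str = "1=1") -> str:
    """Build a query from the shared definition so every team computes it the same way."""
    return f"SELECT {METRICS[name]} AS {name} FROM {table} WHERE {where}"

print(metric_sql("active_users", "events", "event_date >= '2022-01-01'"))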
Zhenzhong Xu: The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure
Netflix is one of the companies best known for running large-scale real-time data infrastructure. The author writes an exciting blog narrating the four generations of Netflix's real-time data infrastructure.
Mikkel Dengsøe: We’ve only scratched the surface of the full potential for the data warehouse
Researchers project the data warehouse market to grow 34% each year until it reaches $39b in 2026. The author makes a well-articulated case that the data warehouse will become the core of the modern company.
Barr Moses: Stop Treating Your Data Engineer Like a Data Catalog
Data certification is a standard approach in many data-driven companies to streamline business metrics and build trust in data. The author writes an excellent blog on six steps to implementing a data certification program.
https://barrmoses.medium.com/stop-treating-your-data-engineer-like-a-data-catalog-14ed3eacf646
Monzo: How we validated our handling time data
Counting is the most complex problem in data engineering; in fact, that is the only problem we are all trying to solve other than moving data from one S3 bucket to another. - The hard truth of data engineering that no one wants to hear :-)
How long does Monzo's customer service staff spend on a given task? A simple counting query, isn't it? Monzo narrates its experience validating handling time data and the complexity hiding behind it.
https://monzo.com/blog/2022/02/04/how-we-validated-our-handling-time-data
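To illustrate why the "simple" query isn't so simple, here is a minimal pandas sketch over a hypothetical task-event log (not Monzo's actual schema): the naive start-minus-finish calculation silently loses every task that never records a finish event, which is exactly the kind of gap validation has to catch.

import pandas as pd

# Hypothetical task-event log; not Monzo's actual schema.
events = pd.DataFrame(
    {
        "task_id": [1, 1, 2, 3],
        "event": ["start", "finish", "start", "start"],  # tasks 2 and 3 never finish
        "at": pd.to_datetime(
            ["2022-02-01 09:00", "2022-02-01 09:07", "2022-02-01 09:10", "2022-02-01 09:12"]
        ),
    }
)

# The "simple" version: pivot start/finish per task and subtract.
wide = events.pivot(index="task_id", columns="event", values="at")
wide["handling_time"] = wide["finish"] - wide["start"]

# Reality check: tasks with no finish event become NaT and quietly
# drop out of any average handling time.
print(wide)
print("tasks missing a finish event:", wide["finish"].isna().sum())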
Vimeo: dbt development at Vimeo
Vimeo writes an exciting blog on its adoption of dbt and how it compares to their previous workflow. I like how the author focused on developer workflow rather than comparing the functionality of the tools. The pain points of maintaining Jinja-templated SQL pipelines in Airflow, and how dbt solves them, are a great read.
https://medium.com/vimeo-engineering-blog/dbt-development-at-vimeo-fe1ad9eb212
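For a flavour of the pain point, here is a minimal sketch of hand-templating SQL with Jinja, the pattern the post describes moving away from; the table and variable names are hypothetical. In dbt, the upstream dependency would instead be expressed with ref() and resolved by the tool, so the dependency graph no longer lives in the pipeline author's head.

from jinja2 import Template

# Hand-maintained Jinja template for one SQL step; the upstream table name and
# the execution date have to be wired in manually by the orchestration code.
QUERY = Template(
    """
    SELECT video_id, COUNT(*) AS plays
    FROM {{ source_table }}
    WHERE play_date = '{{ ds }}'
    GROUP BY video_id
    """
)

rendered = QUERY.render(source_table="analytics.video_plays", ds="2022-02-01")
print(rendered)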
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.