Data Engineering Weekly #157
The Weekly Data Engineering Newsletter
RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.
Joe Reis: Definition of Data Modeling & What Data Modeling Is not
Joe raises a fundamental question in data engineering: what is data modeling, and what is it not? He rightly points out that many data engineers will name one of the modeling techniques when asked to define data modeling. Those techniques are indeed a core part of data modeling, yet they are techniques rather than a definition.
Joe goes on to define a data model as follows:
A data model is a structured representation that organizes and standardizes data to enable and guide human and machine behavior, inform decision-making, and facilitate actions.
The definition indeed elevates the purpose of data modeling techniques. If I were to define data modeling at a very high level, my definition would be:
Every business, in a way, is a state machine. The user journey, sales process, marketing campaign, everything falls under a state machine. Data modeling is a collaborative process across business units to capture state changes in business activity.
The state-machine analogy is vital; your design approach and usage will significantly improve once you get that perspective.
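To make the analogy concrete, here is a minimal sketch of a business entity modeled as a state machine whose every state change is captured as a record. The order lifecycle, the state names, and the `StateChange` and `Order` types are all hypothetical and chosen only for illustration; they do not come from Joe's post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical order lifecycle: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
    "refunded": set(),
}


@dataclass(frozen=True)
class StateChange:
    """One captured state change: the unit the data model records."""
    entity_id: str
    from_state: str
    to_state: str
    changed_at: datetime


@dataclass
class Order:
    order_id: str
    state: str = "created"
    history: list = field(default_factory=list)

    def transition(self, to_state, changed_at=None):
        # Reject moves the business process does not allow.
        if to_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition: {self.state} -> {to_state}")
        change = StateChange(
            entity_id=self.order_id,
            from_state=self.state,
            to_state=to_state,
            changed_at=changed_at or datetime.now(timezone.utc),
        )
        self.history.append(change)  # append-only log of state changes
        self.state = to_state
        return change


order = Order("o-1")
order.transition("paid")
order.transition("shipped")
```

In this sketch, the `history` list plays the role of an event table, and the transition map is the part that business units would need to agree on together.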
Grab: Rethinking Stream Processing - Data Exploration
Grab writes an excellent blog about data exploration on stream processing. The solution centers on a notebook that opens a Flink session against the Kafka stream to continue the exploration.
It brings back old memories of trying to solve this problem, first with the Presto-Kafka connector and then with OLAP engines like Druid and Apache Pinot. The challenge with querying Kafka directly is that more ad-hoc consumers saturate the network bandwidth (Kafka has since improved here by allowing consumption from follower replicas, but the problem persists). The challenge with an OLAP engine is that it is an additional system to maintain, and indexing everything is too costly.
I believe supporting exploratory analytics in stream processing remains an open deep-tech problem.
Michiel De Muynck: 7 Lessons Learned migrating dbt code from Snowflake to Trino
SQL standards are like toothbrushes: everyone agrees they're essential, but no one wants to use someone else's.
The blog is an excellent reminder that, no matter how standardized your tools are, migrating from one database to another is always a nightmare.
Condé Nast: Our transformation journey toward an open data platform
Condé Nast writes about its transformation journey of adopting Databricks and the Lakehouse architecture, moving away from Presto and a data lake. The reasoning for choosing Databricks over Snowflake for data science workloads is noteworthy.
Sponsored: RudderStack Launches Data Quality Toolkit
Leading data teams leverage their customer data to deliver high-impact machine learning projects like churn prediction and personalized recommendation systems to create significant competitive advantages for their companies. If you have a data quality problem, success like this can seem out of reach.
Poor data quality leads to lackluster results and frustrated stakeholders, but fixing bad data can become an endless task that keeps you from key initiatives. To help you drive data quality at the source, RudderStack just launched a Data Quality Toolkit. It includes features for collaborative event definitions, violation management, real-time schema fixes, and monitoring and alerting. With the toolkit, you can spend less time wrangling and more time helping your business drive revenue.
Rapido: Data Platform at Rapido — Cheap, efficient, and scalable analytics.
Rapido, on the other hand, writes about its journey toward improving the efficiency of the Trino query engine through its Trino query adoption process. It also reminds me that there is no modern alternative to Secor.
Please comment in the thread if you’re using any alternative for Secor in your pipeline.
Klaviyo: Data Dictionary: How I Learned to Stop Worrying and Love Reporting Standardization
How you measure a business activity will differ depending on whom you ask and which business unit they belong to. The worst part is that the same question may get a completely different answer from the same person at a different time. Klaviyo writes about one such experience and the process it followed to standardize reporting and metric definitions.
Yuval Itzchakov: Snowpipe Streaming Deep Dive
Snowpipe Streaming is a Snowflake feature that supports streaming writes to the underlying Snowflake tables. The author takes the Snowpipe Streaming Java SDK and deep-dives into the various stages of how it works internally. It's an excellent read if you're a Snowflake user.
Meta: Improving machine learning iteration speed with faster application build and packaging
A slow build is a bottleneck in disguise, turning the CI/CD pipeline from a speedway into a scenic route. The goal is to deliver, not detour.
Meta writes about its techniques to speed up machine learning iteration with faster application builds and packaging. The build process uses the Buck2 build system with the Remote Execution API and addresses non-determinism in tooling and build rules.
Shweta Shrestha: Decoding BigQuery Expenses: The Ultimate Query for Analyzing Your Analysis Costs
Simple tooling can tame a complicated process; I found the cost analysis query simple, yet useful as a starting point for many cost optimization techniques.
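As a rough illustration of this kind of analysis (not the query from the blog), here is a minimal sketch that attributes on-demand cost to users from job records. It assumes each record carries `user_email` and `total_bytes_billed`, as exposed by BigQuery's `INFORMATION_SCHEMA.JOBS` view; the `cost_by_user` helper and the `price_per_tib` parameter are hypothetical, and the price you pass in must come from your own pricing plan.

```python
from collections import defaultdict

TIB = 2 ** 40  # bytes in one tebibyte


def cost_by_user(jobs, price_per_tib):
    """Sum estimated on-demand cost per user, most expensive first.

    `jobs` is an iterable of dicts with `user_email` and
    `total_bytes_billed`; `price_per_tib` is an assumed input,
    not a value taken from BigQuery.
    """
    totals = defaultdict(float)
    for job in jobs:
        # Cached or failed jobs may report no billed bytes.
        billed = job.get("total_bytes_billed") or 0
        totals[job["user_email"]] += billed / TIB * price_per_tib
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


sample_jobs = [
    {"user_email": "a@example.com", "total_bytes_billed": 2 * TIB},
    {"user_email": "b@example.com", "total_bytes_billed": 1 * TIB},
    {"user_email": "b@example.com", "total_bytes_billed": None},
]
report = cost_by_user(sample_jobs, price_per_tib=5.0)
```

Grouping by user is only one cut; the same shape works for grouping by project, label, or day.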
How are you analyzing the cost of your infrastructure?
Garima Arora: Summarizing PostgreSQL Indexes
Going back to the basics, I found the blog to be a simple explanation of PostgreSQL indexes. It covers index properties and the various index types available in Postgres.
Murat: Looking Back at Postgres
Speaking of Postgres, one can’t deny its impact on the industry and on storage engines. We have started to see many “Serverless Postgres as a Service” companies. The blog is an excellent summary of the well-known Looking Back at Postgres paper.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.