Data Engineering Weekly #157

The Weekly Data Engineering Newsletter

Feb 04, 2024

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.

Joe Reis: Definition of Data Modeling & What Data Modeling Is not

Joe raises a fundamental question in data engineering: what is data modeling, and what is it not? He rightly points out that many data engineers, when asked, will name one of the modeling techniques. Those techniques are indeed a core part of data modeling, yet they are techniques rather than a definition.

Joe goes on to define a data model as follows:

A data model is a structured representation that organizes and standardizes data to enable and guide human and machine behavior, inform decision-making, and facilitate actions.

The definition indeed elevates the purpose of data modeling techniques. If I were to define data modeling at a very high level, my definition would be:

Every business, in a way, is a state machine. The user journey, sales process, marketing campaign, everything falls under a state machine. Data modeling is a collaborative process across business units to capture state changes in business activity.

The state-machine analogy is vital; your design approach and usage will significantly improve once you get that perspective.
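
To make the state-machine view concrete, here is a minimal sketch of a sales process modeled as states and allowed transitions, where every state change is recorded. The state names, the `Deal` class, and the transition table are hypothetical and chosen only for illustration; they do not come from Joe's posts.

```python
from dataclasses import dataclass, field

# Hypothetical sales-process states and the transitions allowed between them.
ALLOWED_TRANSITIONS = {
    "lead": {"qualified", "lost"},
    "qualified": {"proposal", "lost"},
    "proposal": {"won", "lost"},
    "won": set(),
    "lost": set(),
}


@dataclass
class Deal:
    deal_id: str
    state: str = "lead"
    # Every state change is kept as a (from_state, to_state) record.
    history: list = field(default_factory=list)

    def transition(self, new_state: str) -> None:
        # Reject any change the business process does not allow.
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.history.append((self.state, new_state))
        self.state = new_state


deal = Deal("deal-1")
deal.transition("qualified")
deal.transition("proposal")
deal.transition("won")
print(deal.history)
```

In this framing, the `history` list plays the role of the data model's output: a record of state changes in a business activity that downstream tables and metrics can be built from.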

https://practicaldatamodeling.substack.com/p/what-data-modeling-is-not

https://practicaldatamodeling.substack.com/p/my-definition-of-data-modeling-for

Grab: Rethinking Stream Processing - Data Exploration

Grab writes an excellent blog about data exploration on stream processing. The solution centers on a notebook that opens a Flink session against the Kafka stream to carry out the exploration.

It brings back old memories: I tried to solve this problem first with the Presto-Kafka connector and then with OLAP engines like Druid and Apache Pinot. The challenge with connecting directly to Kafka is that more ad-hoc consumers saturate the network bandwidth (Kafka has since improved here by allowing consumption from follower replicas, but the problem persists). The challenge with an OLAP engine is that it is an additional system to maintain, and indexing everything is too costly.

I believe supporting exploratory analytics in stream processing remains an open, deep technical problem.

https://engineering.grab.com/rethinking-streaming-processing-data-exploration

Michiel De Muynck: 7 Lessons Learned migrating dbt code from Snowflake to Trino

SQL standards are like toothbrushes: everyone agrees they're essential, but no one wants to use someone else's.

The blog is an excellent reminder that, no matter how standardized your tools are, migrating from one database to another is always a nightmare.

https://medium.com/datamindedbe/7-lessons-learned-migrating-dbt-code-from-snowflake-to-trino-42fc907f0202

Condé Nast: Our transformation journey toward an open data platform

Condé Nast writes about its transformation journey in adopting Databricks and a Lakehouse architecture, moving away from Presto and a data lake. The reasoning for choosing Databricks over Snowflake for data science workloads is noteworthy.

https://medium.com/@bxh_io/our-transformation-journey-toward-an-open-data-platform-b6f869b6a173

Rapido: Data Platform at Rapido — Cheap, efficient, and scalable analytics.

Rapido, on the other hand, writes about its journey to improve the efficiency of its Trino query engine through a Trino query adoption process. It also reminds me that there is no modern alternative to Secor.

Please comment in the thread if you’re using any alternative for Secor in your pipeline.

https://medium.com/rapido-labs/data-platform-rapido-part-i-cheap-efficient-and-scalable-analytics-52662111b2d2

Klaviyo: Data Dictionary: How I Learned to Stop Worrying and Love Reporting Standardization

How you measure a business activity will differ depending on whom you ask and which business unit they belong to. The worst part is that the same question may get a completely different answer from the same person at a different time. Klaviyo writes about one such experience and the process it followed to standardize reporting and metric definitions.

https://klaviyo.tech/data-dictionary-how-i-learned-to-stop-worrying-and-love-reporting-standardization-2c756a226549

Yuval Itzchakov: Snowpipe Streaming Deep Dive

Snowpipe Streaming is a Snowflake feature that supports streaming writes to the underlying Snowflake tables. The author takes the Snowpipe Streaming Java SDK and dives deep into the stages of how it works internally. It's an excellent read if you're a Snowflake user.

https://blog.yuvalitzchakov.com/snowpipe-streaming-deep-dive/

Meta: Improving machine learning iteration speed with faster application build and packaging

A slow build is a bottleneck in disguise, turning the CI/CD pipeline from a speedway into a scenic route. The goal is to deliver, not detour.

Meta writes about its technique for speeding up machine learning iteration with faster application builds and packaging. The build process uses the Buck2 build engine with the Remote Execution API to avoid non-determinism in tooling and build rules.

https://engineering.fb.com/2024/01/29/ml-applications/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging/

Shweta Shrestha: Decoding BigQuery Expenses: The Ultimate Query for Analyzing Your Analysis Costs

Simple tooling can solve a complicated process; I found the cost-analysis query simple yet useful for many cost optimization techniques.

How are you analyzing the cost of your infrastructure?

https://medium.com/@shwetastha1/decoding-bigquery-expenses-the-ultimate-query-for-analyzing-your-analysis-costs-2e163bb28538

Garima Arora: Summarizing PostgreSQL Indexes

Going back to the basics, I found the blog a simplified explanation of PostgreSQL indexes. It covers index properties and the various index types available in Postgres.

https://medium.com/@aroragarima/summarizing-postgresql-indexes-53ae5ca3e6f8

Murat: Looking Back at Postgres

Speaking of Postgres, one can’t deny its impact on the industry and on storage engines. We have started to see many “Serverless Postgres as a Service” companies. The blog is an excellent summary of the well-known Looking Back at Postgres paper.

https://muratbuffalo.blogspot.com/2024/01/looking-back-at-postgres.html

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
