Data Engineering Weekly #123

The Weekly Data Engineering Newsletter

Mar 20, 2023

Contribute to the RudderStack Transformations Library, Win $1,000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/

Sanjeev Mohan: What Exactly is a Data Product?

Is ChatGPT a data product? Is data a product? What is a data product, indeed? The author makes an interesting analogy: if you bought your favorite cereal without the box, ingredient details, and other relevant information, would you trust it? I wouldn't.

The author defines a data product as the combination of:

Datasets
Domain
Access

It is an exciting time for the data industry, as we increasingly talk about philosophies for adopting data in an organization rather than technology complexities such as Hadoop and Spark.

https://sanjmo.medium.com/what-exactly-is-a-data-product-7f6935a17912

Uber: Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

Uber writes a comprehensive guide to running incremental ETL using Apache Hudi. The article discusses its incremental processing strategy, handling of late-arriving data, and backfilling, with design patterns that explain how Apache Hudi simplifies ETL processing.

https://www.uber.com/blog/ubers-lakehouse-architecture/
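To make the incremental-processing idea concrete, here is a minimal sketch in plain Python. It is not Hudi's API and not Uber's implementation: `Change`, `incremental_pull`, `upsert`, and `run_once` are hypothetical names, and an in-memory dict stands in for the target table. The assumption it illustrates is that each run pulls only the records committed after the previous checkpoint and upserts them by key, so a late-arriving event (old event time, new commit time) is still picked up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str          # record key, e.g. a trip id (hypothetical)
    event_time: int   # when the event happened at the source
    commit_time: int  # when the change landed in the upstream table
    amount: float


def incremental_pull(changes, last_checkpoint):
    """Return only the changes committed after the previous run's checkpoint."""
    return [c for c in changes if c.commit_time > last_checkpoint]


def upsert(target, batch):
    """Merge a batch into the target, keeping the latest event per key."""
    for change in batch:
        current = target.get(change.key)
        if current is None or change.event_time >= current.event_time:
            target[change.key] = change
    return target


def run_once(changes, target, last_checkpoint):
    """One incremental run: pull, upsert, and return the new checkpoint."""
    batch = incremental_pull(changes, last_checkpoint)
    upsert(target, batch)
    return max((c.commit_time for c in batch), default=last_checkpoint)


# First run: two records land upstream and are processed.
upstream = [
    Change("trip-1", event_time=10, commit_time=100, amount=12.0),
    Change("trip-2", event_time=11, commit_time=101, amount=8.5),
]
target = {}
checkpoint = run_once(upstream, target, last_checkpoint=0)

# A late-arriving record: it happened earlier (event_time=5) but was
# committed later, so the commit-time checkpoint still picks it up.
upstream.append(Change("trip-0", event_time=5, commit_time=150, amount=3.0))
checkpoint = run_once(upstream, target, checkpoint)
```

Filtering on commit time rather than event time is the design choice that matters here: a filter on event time alone would skip `trip-0`, because its event time is older than anything already processed.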

Data Engineering Weekly has talked in detail about adopting functional data engineering principles, and Apache Hudi certainly supports them out of the box.

Data Engineering Weekly: Functional Data Engineering - A Blueprint (Ananth Packkildurai)

The Rise of Data Modeling: Data modeling has been one of the hot topics in Data LinkedIn. Hadoop put forward the schema-on-read strategy that leads to the disruption of data modeling techniques as we know until then. We went through a full cycle that…

Whatnot: Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot

Whatnot writes an interesting article about its shift from a loosely coupled system to a centralized core data model built on Kimball-style dimensional modeling. The blog discusses implementing Type-2 SCD modeling and strategies for generating surrogate keys and bridge tables to handle many-to-many relationships.

https://medium.com/whatnot-engineering/same-data-sturdier-frame-layering-in-dimensional-data-modeling-at-whatnot-5e6a548ee713
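For readers new to Type-2 SCD, the pattern can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Whatnot's implementation: `AddressRow`, `apply_change`, and `as_of` are hypothetical names, a list stands in for the dimension table, and the surrogate key is a simple incrementing integer (the article discusses its own key-generation strategies).

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AddressRow:
    surrogate_key: int        # warehouse-generated key, unique per version
    user_id: str              # natural (business) key from the source system
    city: str                 # the tracked attribute
    valid_from: date
    valid_to: Optional[date]  # None means "still current"


def apply_change(rows, user_id, city, effective):
    """Return a new row list with a Type-2 change applied."""
    current = next(
        (r for r in rows if r.user_id == user_id and r.valid_to is None), None
    )
    if current is not None and current.city == city:
        return list(rows)  # nothing changed, so no new version

    # Close the current version (if any), then append the new one.
    updated = [
        replace(r, valid_to=effective) if r is current else r for r in rows
    ]
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    updated.append(AddressRow(next_key, user_id, city, effective, None))
    return updated


def as_of(rows, user_id, day):
    """Look up the version that was valid on a given day."""
    for r in rows:
        if r.user_id == user_id and r.valid_from <= day and (
            r.valid_to is None or day < r.valid_to
        ):
            return r
    return None


dim = []
dim = apply_change(dim, "u1", "Austin", date(2023, 1, 1))
dim = apply_change(dim, "u1", "Denver", date(2023, 3, 1))
dim = apply_change(dim, "u1", "Denver", date(2023, 3, 15))  # no-op: same city
```

Each change closes the current row and appends a new version, so history is never overwritten and facts can join to the version that was valid at the time of the event.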

Open Questions on Type-2 modeling

I keep thinking about Type-2 SCD and the complexity it adds to the data pipeline. The article's example, Address, is a self-contained unit, so it is easier to implement the Kimball-style Type-2 model. What happens in core models like “customers,” where more than one dimension change requires tracking? How do you handle these models without compromising scale and usability? A map table vs. a complex data structure? Share your thoughts if you’ve implemented multi-dimensional SCD Type-2 changes and how you track them at scale.

Gloat: A Startup Journey to a Modern Data Platform

Gloat writes about its journey as a startup adopting a modern data platform with a simplified technology stack:

Debezium
Kafka
Snowflake
Airflow & dbt

If you are confused by the modern data landscape, the blog is an excellent reminder that you only need a handful of systems to process your data.

https://theblog.workey.co/a-startup-journey-to-a-modern-data-platform-4c6a884f70da

Netflix: Building a Media Understanding Platform for ML Innovations

Netflix recently wrote a series of blogs about its media ML platform. Much of the series focuses on model training, evaluation, and scoring; this blog discusses how these ML models integrate with the application to serve users. It narrates the use cases for such a media ML platform and weighs the pros and cons of on-demand analysis vs. pre-computation.

https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7

Expedia: Unified Machine Learning Platform at Expedia Group

Along similar lines to the Netflix media ML platform, Expedia discusses its unified ML platform approach to enable innovation across the organization. The blog narrates how Expedia, which had nine ML systems and no unified way to build and deploy, standardized its infrastructure by adopting a build → optimize → enhance methodology.

https://medium.com/expedia-group-tech/unified-machine-learning-platform-at-expedia-group-5aee72606c74

Shopify: Unlocking Real-time Predictions with Shopify's Machine Learning Platform

Shopify wrote about its unified ML platform, Merlin, in the past. Continuing to expand Merlin, Shopify now writes about Merlin’s online inference system design, which unlocks real-time prediction services. The platform approach to online inference, supporting no-code, low-code, and fully custom interfaces, makes for an exciting read.

https://shopifyengineering.myshopify.com/blogs/engineering/shopifys-machine-learning-platform-real-time-predictions

Microsoft: Unpacking churn with survival models

Churn prediction is a vital part of growth engineering: it helps you understand trends in customer churn and deploy mitigation plans to prevent it. The author discusses the problem of hazard ratios that vary over time in churn and compares the churn analysis results of the traditional Cox model vs. the time-varying Cox model. The article brings back some fond memories, as the very first ML problem I worked on was predicting when an employee would resign from a company :-) Yes, you heard that correctly.

https://medium.com/data-science-at-microsoft/unpacking-churn-with-survival-models-762822132c21

BuzzFeed: Lessons Learned Building Products Powered by Generative AI

BuzzFeed writes about the lessons it learned building products powered by generative AI. I can’t highlight this statement enough:

At BuzzFeed, we believe AI will bring a new era of creativity. We think it will open up brand-new content formats, new ideas, and novel ways for content creators to interact with their audience.

BuzzFeed shares seven learnings on integrating generative AI:

Get the technology into the hands of your employees, especially the creative ones.
Good and effective prompts are the result of close collaboration between writers and engineers.
Moderation is essential. Build guardrails into your prompt.
LLMs are not dark magic. Demystifying the technical concepts behind this technology can lead to better application of those tools.
Integrating with OpenAI and scaling your usage to thousands of requests per minute is easy, but be prepared for some downtime.
The economics of using Generative AI can be tough, especially for ad-supported business models.
There are a lot of good tools and resources out there to help you experiment with this technology.

https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376

Socure: Migrating Large-Scale Data Pipelines

There will always be a migration project in data engineering :-) Socure writes about another migration case: moving from Redshift to Snowflake and deprecating a few internal systems. The blog narrates the engineering principles they adopted for a large-scale migration and shares some lessons learned along the way.

https://medium.com/the-socure-technology-blog/migrating-large-scale-data-pipelines-493655a47fa6

Exness: The best orchestration tool for MLOps: a real story about difficult choices

What is the best orchestration engine for MLOps? The author compares Airflow, Prefect, and Kubeflow, and concludes by deciding to continue with Kubeflow.

What is your orchestration engine of choice? Share it in the comments.

https://medium.com/exness-blog/the-best-orchestration-tool-for-mlops-a-real-story-about-difficult-choices-5ee6a087c9e3

Xiaoxu Gao: How to Build an On-Call Culture in a Data Engineering Team

Building an efficient on-call culture is critical to bringing ownership and operational excellence to running data pipelines. The author discusses what data engineers do while on call, along with the workflow, tools, and complete process associated with running a successful on-call rotation.

https://towardsdatascience.com/how-to-build-an-on-call-culture-in-a-data-engineering-team-7856fac0c99

Hiflylabs: dbt Docs as a Static Website

I often joke, “This data catalog tool could be the static website out of dbt docs.” The blog narrates how to build a data catalog out of dbt docs without spending money!

https://medium.com/hiflylabs/dbt-docs-as-a-static-website-c50a5b306514

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.