Data Engineering Weekly #123
The Weekly Data Engineering Newsletter
Contribute to the Rudderstack Transformations Library, Win $1000
Sanjeev Mohan: What Exactly is a Data Product?
Is ChatGPT a data product? Is data a product? What is a data product, indeed? The author makes an interesting analogy: if you buy your favorite cereal without the box, ingredient details, and other relevant information, do you trust it? I wouldn't.
The author defines Data Product as the combination of
It is an exciting time for the data industry, as we are increasingly talking about philosophies for adopting data in an organization rather than technology complexities such as Hadoop and Spark.
Uber: Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi
Uber writes a comprehensive guide on running incremental ETL using Apache Hudi. The article discusses incremental processing strategy, handling late-arriving data, and backfilling, with design patterns that explain how Apache Hudi simplifies ETL processing.
Data Engineering Weekly has talked in detail about adopting functional data engineering principles, and Apache Hudi certainly supports them out of the box.
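To make the incremental idea concrete, here is a minimal sketch in plain Python. It does not use the Apache Hudi API, and the `Record` fields, the integer commit timestamps, and the helper names are all assumptions made for illustration. It shows a checkpoint-based incremental read, an idempotent upsert, and how a late-arriving row (old business date, new commit time) is picked up on the next run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str          # primary key of the row (hypothetical)
    event_date: str   # business date the row belongs to
    commit_ts: int    # when the row landed in the source table
    value: int


def incremental_read(source, checkpoint):
    """Return rows committed after the checkpoint, plus the new checkpoint."""
    new_rows = [r for r in source if r.commit_ts > checkpoint]
    new_checkpoint = max((r.commit_ts for r in new_rows), default=checkpoint)
    return new_rows, new_checkpoint


def upsert(target, rows):
    """Idempotent upsert keyed by (event_date, key); the latest commit wins."""
    for r in sorted(rows, key=lambda r: r.commit_ts):
        target[(r.event_date, r.key)] = r.value
    return target


source = [
    Record("trip-1", "2023-03-01", commit_ts=100, value=10),
    Record("trip-2", "2023-03-02", commit_ts=110, value=20),
]
target, checkpoint = {}, 0

# First run: everything is new.
rows, checkpoint = incremental_read(source, checkpoint)
upsert(target, rows)

# A late-arriving correction for 2023-03-01 lands after the first run.
source.append(Record("trip-1", "2023-03-01", commit_ts=120, value=15))

# Second run: only the late row is read, and it overwrites the old value.
rows, checkpoint = incremental_read(source, checkpoint)
upsert(target, rows)

# Backfill: replaying from checkpoint 0 rebuilds the same target.
replayed = upsert({}, incremental_read(source, 0)[0])
```

Because the upsert is keyed and deterministic, a backfill that replays the source from the beginning produces the same result as the incremental runs, which is the property functional data engineering asks for.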
Whatnot: Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot
Whatnot writes an interesting article about the shift from a loosely coupled system to a centralized core data model built on Kimball-style dimensional modeling. The blog discusses implementing Type-2 SCD modeling and strategies to generate surrogate keys and bridge tables to handle many-to-many relationships.
Open Questions on Type-2 modeling
I keep thinking about Type-2 SCD and the complexity it adds to the data pipeline. The Address example is a self-contained unit, so it is easier to implement the Kimball-style Type-2 model. What happens in some core models like “customers” where more than one dimension change requires tracking? How do you handle these models without compromising scale and usability? Map table vs. using a complex data structure? Share your thoughts if you’ve implemented multi-dimensional SCD Type-2 changes and how you track them at scale.
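For readers new to the pattern, the single-attribute case can be sketched in a few lines of Python. This is an illustrative sketch, not Whatnot's implementation: the `AddressVersion` fields, the in-memory list standing in for the dimension table, and the sequential surrogate key are all assumptions.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class AddressVersion:
    surrogate_key: int        # warehouse-generated key, unique per version
    user_id: str              # natural (business) key
    address: str
    valid_from: str           # ISO date, inclusive
    valid_to: Optional[str]   # ISO date, exclusive; None means still current


_next_key = count(1)  # hypothetical surrogate key generator


def apply_change(dim, user_id, address, effective_date):
    """Close the current version and append a new one if the address changed."""
    current = next(
        (row for row in dim if row.user_id == user_id and row.valid_to is None),
        None,
    )
    if current is not None:
        if current.address == address:
            return dim  # nothing changed, so no new version
        current.valid_to = effective_date
    dim.append(
        AddressVersion(next(_next_key), user_id, address, effective_date, None)
    )
    return dim


def as_of(dim, user_id, date):
    """Return the version that was valid for user_id on the given ISO date."""
    for row in dim:
        if (
            row.user_id == user_id
            and row.valid_from <= date
            and (row.valid_to is None or date < row.valid_to)
        ):
            return row
    return None


dim = []
apply_change(dim, "u1", "12 Oak St", "2023-01-01")
apply_change(dim, "u1", "12 Oak St", "2023-02-01")   # unchanged, ignored
apply_change(dim, "u1", "98 Elm Ave", "2023-03-01")  # new version
```

The open question above starts where this sketch ends: once several attributes on the same entity change independently, each change either creates a new row for the whole entity or has to be tracked in a separate structure.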
Gloat: A Startup Journey to a Modern Data Platform
Gloat writes about a startup journey adopting a modern data platform with a simplified technology stack.
Airflow & dbt
If anyone is confused by the modern data landscape, the blog is an excellent reminder that you only need a handful of systems to process your data.
Sponsored: [New] Data Observability for Startups - New Free Trial for Growth Teams
Being in the dark is unnerving. Your data team is moving fast, but you can’t be everywhere and test for everything, especially in a small team with limited resources. With a flip of the switch, you can shine a light across all of your data pipelines and tables. Spot bad data before it impacts your stakeholders, costing you time and credibility. Here is an easy, no commitment way to see how you can make a big impact with your data with data observability.
Netflix: Building a Media Understanding Platform for ML Innovations
Netflix recently wrote a series of blogs about its media ML platform. Much of it focuses on model training, evaluation, and scoring. The blog discusses how these ML models integrate with the application to serve users. The blog narrates the use cases for such media ML platforms and discusses the pros & cons of doing on-demand analysis vs. pre-computation.
Expedia: Unified Machine Learning Platform at Expedia Group
Along similar lines to the Netflix media ML platform, Expedia discusses its unified ML platform approach to enable innovation across the organization. The blog narrates how Expedia, with nine ML systems and no unified way to build and deploy, standardized its infrastructure by adopting a build → optimize → enhance methodology.
Shopify: Unlocking Real-time Predictions with Shopify's Machine Learning Platform
Shopify wrote about its unified ML platform, Merlin, in the past. Continuing to expand Merlin, Shopify now writes about the design of Merlin’s online inference system, which unlocks real-time prediction services. The platform approach to online inference, supporting no-code, low-code, and fully custom interfaces, is an exciting read.
Sponsored: RudderStack Transformations - Move Faster and Build Data Trust
With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).
RudderStack Product Manager Badri Veeraragavan details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.
Microsoft: Unpacking churn with survival models
Churn prediction is a vital part of growth engineering: it helps you understand trends in customer churn and deploy mitigation plans to prevent it. The author discusses the problem of hazard ratios that vary over time and compares the churn analysis results of the traditional Cox model vs. the time-varying Cox model. The article brings back some fond memories, as it was the very first ML problem I worked on: predicting when an employee will resign from a company :-) Yes, you heard it correctly.
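A small Python sketch may help show why the distinction matters. In the standard Cox model, the hazard for a customer with covariate x is h0(t) * exp(beta * x), so the hazard ratio between x = 1 and x = 0 is exp(beta) at every point in time. A time-varying coefficient model replaces beta with a function beta(t). The form beta(t) = beta0 + beta1 * log(t) used here is one common choice and is an assumption, not necessarily the one in the article; the coefficient values are made up for illustration.

```python
import math


def hazard_ratio_constant(beta):
    """Standard Cox model: the hazard ratio exp(beta) does not depend on time."""
    return math.exp(beta)


def hazard_ratio_time_varying(beta0, beta1, t):
    """Time-varying coefficient, assuming beta(t) = beta0 + beta1 * log(t)."""
    return math.exp(beta0 + beta1 * math.log(t))


# Hypothetical coefficients, chosen only for illustration.
BETA0, BETA1 = 0.9, -0.3

for month in (1, 6, 12):
    constant = hazard_ratio_constant(BETA0)
    varying = hazard_ratio_time_varying(BETA0, BETA1, month)
    print(f"month {month}: constant={constant:.2f} time-varying={varying:.2f}")
```

With a negative beta1, the effect of the covariate on churn risk fades as tenure grows, something a single constant hazard ratio cannot express.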
BuzzFeed: Lessons Learned Building Products Powered by Generative AI
BuzzFeed writes about building products powered by generative AI. I can’t highlight this statement enough.
At BuzzFeed, we believe AI will bring a new era of creativity. We think it will open up brand-new content formats, new ideas, and novel ways for content creators to interact with their audience.
BuzzFeed shares seven lessons on integrating generative AI:
Get the technology into the hands of your employees, especially the creative ones.
Good and effective prompts are the result of close collaboration between writers and engineers.
Moderation is essential. Build guardrails into your prompt.
LLMs are not dark magic. Demystifying the technical concepts behind this technology can lead to better application of those tools.
Integrating with OpenAI and scaling your usage to thousands of requests per minute is easy, but be prepared for some downtime.
The economics of using Generative AI can be tough, especially for ad-supported business models.
There are a lot of good tools and resources out there to help you experiment with this technology.
Socure: Migrating Large-Scale Data Pipelines
There will always be a migration project in data engineering :-) Socure writes about another such case: a migration from Redshift to Snowflake, along with the deprecation of a few internal systems. The blog narrates the engineering principles they adopted for a large-scale migration and shares some lessons learned along the way.
Exness: The best orchestration tool for MLOps: a real story about difficult choices
What is the best orchestration engine for MLOps? The author compares Airflow, Prefect, and Kubeflow, and concludes by continuing with Kubeflow.
What is your choice of Orchestration Engine? Comment on your choice.
Xiaoxu Gao: How to Build an On-Call Culture in a Data Engineering Team
Building an efficient on-call culture is critical to bringing ownership and operational excellence to running data pipelines. The author discusses what data engineers do while on call, along with the workflow, tools, and complete process associated with running a successful on-call rotation.
Hiflylabs: dbt Docs as a Static Website
I often joke, “This data catalog tool could be the static website out of dbt docs.” The blog narrates how to build a data catalog from dbt docs without spending any money!
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.