Data Engineering Weekly

Data Engineering Weekly #123

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Mar 20, 2023

Contribute to the Rudderstack Transformations Library, Win $1000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/


Sanjeev Mohan: What Exactly is a Data Product?

Is ChatGPT a data product? Is data a product? What is a data product, indeed? The author makes an interesting analogy: if you bought your favorite cereal without the box, ingredient details, and other relevant information, would you trust it? I wouldn't.

The author defines a data product as the combination of:

  1. Datasets

  2. Domain

  3. Access

It is an exciting time for the data industry, as we increasingly talk about philosophies for adopting data in an organization rather than technology complexities such as Hadoop and Spark.

https://sanjmo.medium.com/what-exactly-is-a-data-product-7f6935a17912


Uber: Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

Uber writes a comprehensive guide to running incremental ETL with Apache Hudi. The article covers the incremental processing strategy, the handling of late-arriving data, and backfilling, with design patterns that show how Apache Hudi simplifies ETL processing.
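To make the incremental idea concrete, here is a minimal sketch in plain Python. It does not use Hudi's API, and it is not Uber's implementation; the `Record` type and the `incremental_pull` and `upsert` helpers are hypothetical names chosen for illustration. It assumes each row carries both an event date and a commit timestamp, and that the pipeline checkpoints on the commit timestamp, so a late-arriving row (old event date, new commit) is still picked up on the next run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str          # primary key of the row
    event_date: str   # when the event happened (partition column)
    commit_ts: int    # when the row landed in the source table
    value: int


def incremental_pull(source, checkpoint):
    """Return rows committed after the checkpoint, plus the new checkpoint."""
    new_rows = [r for r in source if r.commit_ts > checkpoint]
    new_checkpoint = max((r.commit_ts for r in new_rows), default=checkpoint)
    return new_rows, new_checkpoint


def upsert(target, rows):
    """Upsert rows keyed by (event_date, key); the latest commit wins."""
    for r in sorted(rows, key=lambda r: r.commit_ts):
        target[(r.event_date, r.key)] = r
    return target


source = [
    Record("trip-1", "2023-03-18", commit_ts=100, value=10),
    Record("trip-2", "2023-03-19", commit_ts=110, value=20),
]
target = {}

# First run: process everything committed so far.
rows, checkpoint = incremental_pull(source, checkpoint=0)
upsert(target, rows)

# A late-arriving update for an older event date lands after the first run.
source.append(Record("trip-1", "2023-03-18", commit_ts=120, value=15))

# Second run: only the new commit is read, and it updates the old partition.
rows, checkpoint = incremental_pull(source, checkpoint)
upsert(target, rows)
```

Because the checkpoint tracks commit order rather than event time, the second run reads a single row instead of rescanning the whole source, yet the older partition still ends up corrected.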

https://www.uber.com/blog/ubers-lakehouse-architecture/

Data Engineering Weekly has previously covered functional data engineering principles in detail, and Apache Hudi certainly supports them out of the box.

Functional Data Engineering - A Blueprint

Whatnot: Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot

Whatnot writes an interesting article about its shift from a loosely coupled system to a centralized core data model built on Kimball-style dimensional modeling. The blog discusses implementing Type-2 SCD (slowly changing dimension) modeling and strategies for generating surrogate keys and bridge tables to handle many-to-many relationships.
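For readers less familiar with Type-2 SCDs, the sketch below shows the core mechanics in plain Python: a change closes the current row and appends a new version with its own surrogate key, so history is preserved. This is an illustration only, not Whatnot's implementation; the `AddressRow` type, the `apply_change` and `address_on` helpers, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class AddressRow:
    surrogate_key: int         # warehouse-generated key, unique per version
    user_id: str               # natural (business) key from the source system
    address: str               # the tracked attribute
    valid_from: date
    valid_to: Optional[date]   # None marks the current version


_next_key = count(1)  # simple in-memory surrogate key generator


def apply_change(dim, user_id, address, as_of):
    """Apply one source change to the dimension using Type-2 semantics."""
    current = next(
        (r for r in dim if r.user_id == user_id and r.valid_to is None), None
    )
    if current is not None:
        if current.address == address:
            return dim  # nothing changed, keep the current version open
        current.valid_to = as_of  # close the old version instead of overwriting
    dim.append(AddressRow(next(_next_key), user_id, address, as_of, None))
    return dim


def address_on(dim, user_id, day):
    """Look up the version that was valid on a given day."""
    for r in dim:
        if r.user_id == user_id and r.valid_from <= day and (
            r.valid_to is None or day < r.valid_to
        ):
            return r
    return None


dim = []
apply_change(dim, "u1", "12 Oak St", date(2023, 1, 1))
apply_change(dim, "u1", "99 Pine Ave", date(2023, 3, 1))
```

Fact tables would store the surrogate key rather than `user_id`, which is what lets a historical order point at the address version that was valid when the order happened.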

https://medium.com/whatnot-engineering/same-data-sturdier-frame-layering-in-dimensional-data-modeling-at-whatnot-5e6a548ee713

Open Questions on Type-2 modeling

I keep thinking about Type-2 SCD and the complexity it adds to the data pipeline. The Address example is a self-contained unit, so it is easier to implement the Kimball-style Type-2 model. What happens in core models like “customers,” where changes to more than one dimension require tracking? How do you handle these models without compromising scale and usability? A map table vs. a complex data structure? Share your thoughts if you’ve implemented multi-dimensional SCD Type-2 changes and how you track them at scale.


Gloat: A Startup Journey to a Modern Data Platform

Gloat writes about its journey as a startup adopting a modern data platform with a simplified technology stack:

  1. Debezium

  2. Kafka

  3. Snowflake

  4. Airflow & dbt

If you are confused by the modern data landscape, the blog is an excellent reminder that you only need a handful of systems to process your data.

https://theblog.workey.co/a-startup-journey-to-a-modern-data-platform-4c6a884f70da


Sponsored: [New] Data Observability for Startups - New Free Trial for Growth Teams

Being in the dark is unnerving. Your data team is moving fast, but you can’t be everywhere and test for everything, especially in a small team with limited resources. With a flip of the switch, you can shine a light across all of your data pipelines and tables. Spot bad data before it impacts your stakeholders, costing you time and credibility. Here is an easy, no commitment way to see how you can make a big impact with your data with data observability.

Try it free


Netflix: Building a Media Understanding Platform for ML Innovations

Netflix recently wrote a series of blogs about its media ML platform. Much of the series focuses on model training, evaluation, and scoring. This blog discusses how these ML models integrate with the application to serve users. It narrates the use cases for such media ML platforms and discusses the pros and cons of on-demand analysis vs. pre-computation.

https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7


Expedia: Unified Machine Learning Platform at Expedia Group

Along similar lines to the Netflix media ML platform, Expedia discusses its unified ML platform approach to enabling innovation across the organization. The blog narrates how Expedia, which had nine ML systems and no unified way to build and deploy models, standardized its infrastructure by adopting a build → optimize → enhance methodology.

https://medium.com/expedia-group-tech/unified-machine-learning-platform-at-expedia-group-5aee72606c74


Shopify: Unlocking Real-time Predictions with Shopify's Machine Learning Platform

Shopify has written about its unified ML platform, Merlin, in the past. Continuing to expand Merlin, Shopify now writes about the design of Merlin’s online inference system, which unlocks real-time prediction services. The platform approach to online inference, with support for no-code, low-code, and fully custom interfaces, makes for an exciting read.

https://shopifyengineering.myshopify.com/blogs/engineering/shopifys-machine-learning-platform-real-time-predictions


Sponsored: RudderStack Transformations - Move Faster and Build Data Trust

With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).

RudderStack Product manager, Badri Veeraragavan, details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.

https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/


Microsoft: Unpacking churn with survival models

Churn prediction is a vital part of growth engineering: it helps teams understand trends in customer churn and deploy mitigation plans to prevent it. The author discusses the problem of hazard ratios that vary over time and compares churn analysis results from the traditional Cox model with those from the time-varying Cox model. The article brings back some fond memories, as the very first ML problem I worked on was predicting when an employee would resign from a company :-) Yes, you heard that correctly.
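The Cox models compared in the article typically rely on a survival-analysis library. As a smaller, library-free illustration of the underlying idea, the sketch below computes a Kaplan-Meier survival curve, which is a different and simpler technique than Cox regression. The `kaplan_meier` function and the customer data are made up for illustration and do not come from the article.

```python
def kaplan_meier(observations):
    """Estimate the survival curve from (duration, churned) pairs.

    churned=False means the customer was still active when observation
    ended (a censored record). Returns a list of (time, survival) pairs.
    """
    event_times = sorted({t for t, churned in observations if churned})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still being observed at time t.
        at_risk = sum(1 for d, _ in observations if d >= t)
        # Customers who churned exactly at time t.
        events = sum(1 for d, churned in observations if d == t and churned)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up data: months as a customer, and whether they churned.
customers = [(1, True), (2, True), (2, False), (3, True), (4, False)]
curve = kaplan_meier(customers)
```

On this toy data the estimated probability of still being a customer falls to roughly 0.3 after month 3. Censored customers count toward the at-risk total until they drop out of observation, which is the key difference from simply averaging churn rates.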

https://medium.com/data-science-at-microsoft/unpacking-churn-with-survival-models-762822132c21


BuzzFeed: Lessons Learned Building Products Powered by Generative AI

BuzzFeed writes about the lessons it learned while building products powered by generative AI. I can’t highlight this statement enough:

At BuzzFeed, we believe AI will bring a new era of creativity. We think it will open up brand-new content formats, new ideas, and novel ways for content creators to interact with their audience.

BuzzFeed shares seven learnings on integrating generative AI:

  1. Get the technology into the hands of your employees, especially the creative ones.

  2. Good and effective prompts are the result of close collaboration between writers and engineers.

  3. Moderation is essential. Build guardrails into your prompt.

  4. LLMs are not dark magic. Demystifying the technical concepts behind this technology can lead to better application of those tools.

  5. Integrating with OpenAI and scaling your usage to thousands of requests per minute is easy, but be prepared for some downtime.

  6. The economics of using Generative AI can be tough, especially for ad-supported business models.

  7. There are a lot of good tools and resources out there to help you experiment with this technology.

https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376


Socure: Migrating Large-Scale Data Pipelines

There will always be a migration project in data engineering :-) Socure writes about another such case: a migration from Redshift to Snowflake, along with the deprecation of a few internal systems. The blog narrates the engineering principles they adopted for a large-scale migration and shares some lessons learned along the way.

https://medium.com/the-socure-technology-blog/migrating-large-scale-data-pipelines-493655a47fa6


Exness: The best orchestration tool for MLOps: a real story about difficult choices

What is the best orchestration engine for MLOps? The author compares Airflow, Prefect, and Kubeflow, and concludes by continuing with Kubeflow.

What is your orchestration engine of choice? Share it in the comments.

https://medium.com/exness-blog/the-best-orchestration-tool-for-mlops-a-real-story-about-difficult-choices-5ee6a087c9e3


Xiaoxu Gao: How to Build an On-Call Culture in a Data Engineering Team

Building an efficient on-call culture is critical for bringing ownership and operational excellence to data pipeline operations. The author discusses what data engineers do while on call, along with the workflow, tools, and complete process associated with running a successful on-call rotation.

https://towardsdatascience.com/how-to-build-an-on-call-culture-in-a-data-engineering-team-7856fac0c99


Hiflylabs: dbt Docs as a Static Website

I often joke, “This data catalog tool could be the static website out of dbt docs.” The blog narrates how to build a data catalog on top of dbt docs without spending money!

https://medium.com/hiflylabs/dbt-docs-as-a-static-website-c50a5b306514


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
