Data Engineering Weekly #111

The Weekly Data Engineering Newsletter

Dec 12, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: Upcoming articles in Data Engineering Weekly

It is a nervous week at the desk as I'm writing the 111th edition of Data Engineering Weekly. If you're a cricket follower, you know what I'm talking about. Yes, we hit the Nelson. If you're unfamiliar with cricket and Nelson, here is an interesting history behind the number 111.

David Shepherd — The man who hopped his way to the top as an umpire

2022 has been a remarkable year and one of the most memorable years for me personally. It's the year's end, so there are plenty of 2023 predictions in Data Engineering. Should I write another prediction? Probably not (I can hear the sigh of relief from all of you). Instead, I will write three back-to-back articles drawn from my experience in recent data engineering projects.

What am I planning to write?

Functional Data Engineering - A Blueprint

There has been an uptick in discussion about data modeling in recent years. Maxime Beauchemin wrote an influential article, Functional Data Engineering — a modern paradigm for batch data processing. I recently worked on sketching a data architecture from scratch, and the article summarises the data pipeline patterns I put together to adopt the functional principles.

Data Catalogs - A broken promise

I've been a big fan of data catalogs for a long time. After actively observing a couple of data catalog implementations, I started questioning my beliefs. The article is a reflection of my thoughts and experience with data catalogs. Is Data Catalog a product or a feature? 🤔

Data Quality - Shift Left, bring consumers closer.

Data Quality is close to my heart, and I continue studying the social & organizational dynamics of quality in other domains. The Data Quality tools available in the market focus on Quality Control but not much on Quality Assurance. I think we have barely scratched the surface of Data Quality Management.

With that, let’s jump into this week’s top articles.

Etsy: Understanding the collective impact of experiments

Understanding the combined effect of experiments is essential for judging the overall business impact of the changes made across all teams. Etsy writes about using holdout groups to estimate the collective impact of individual experiments.

https://www.etsy.com/codeascraft/understanding-the-collective-impact-of-experiments
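To make the holdout idea concrete, here is a minimal sketch, not Etsy's actual method. It assumes a hypothetical per-user metric (say, purchases per user) collected for a holdout group that saw none of the launched changes and for an exposed group that saw all of them. The function name and the sample numbers are illustrative only.

```python
from statistics import mean


def collective_lift(holdout_metric, exposed_metric):
    """Relative difference in the mean metric between exposed and holdout users.

    Illustrative only: a real analysis would also report a confidence
    interval and check that the two groups are comparable.
    """
    baseline = mean(holdout_metric)
    if baseline == 0:
        raise ValueError("holdout baseline is zero; relative lift is undefined")
    return (mean(exposed_metric) - baseline) / baseline


# Hypothetical per-user values, invented for this example.
holdout = [10, 12, 8, 10]   # users who saw none of the launched changes
exposed = [11, 12, 10, 11]  # users who saw every launched change

lift = collective_lift(holdout, exposed)
print(f"Estimated collective lift: {lift:.1%}")
```

The point of comparing against one long-lived holdout is that the estimate captures interactions between experiments, which adding up individual experiment results cannot.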

LinkedIn: Our Approach to Research and A/B Testing

Experimentation is cultural; either you believe in experiments, or you don't. The article from LinkedIn is a reminder of this, tracing the journey from Why Experimentation is so Important for LinkedIn to Approach to Research and A/B Testing.

https://engineering.linkedin.com/blog/2022/our-approach-to-research-and-a-b-testing-

Instacart: Personalizing Recommendations for a Learning User

Instacart summarizes a talk from its Distinguished Speaker Series by Professor Hongning Wang of the University of Virginia. Talks at Google is one of my go-to sources for informative talks, and kudos to the Instacart team for sharing theirs in the same spirit.

https://tech.instacart.com/personalizing-recommendations-for-a-learning-user-ed170a197f2e

Nvidia: What Is a Pretrained AI Model?

Instead of building an AI model from scratch, developers can use pre-trained models and customize them to meet their requirements. How exciting! Nvidia writes about some of the sources of pre-trained AI models and their applications.

https://blogs.nvidia.com/blog/2022/12/08/what-is-a-pretrained-ai-model/

Grab: Zero trust with Kafka

Zero Trust is a security framework requiring all users, whether inside or outside the organization’s network, to be authenticated, authorized, and continuously validated for security configuration and posture before being granted or keeping access to applications and data. Grab writes about how it implemented Zero Trust infrastructure for Kafka.

https://engineering.grab.com/zero-trust-with-kafka

Lumen: Our journey with Apache Flink (Part 1) - Operation and deployment tips

Lumen shares a few practical tips to run Flink in production, reflecting a few core themes to scale the streaming infrastructure.

Multi-instance (one cluster per job) infrastructure scales better than a multi-tenant (one cluster for all jobs) cluster.
Automate the CI/CD pipeline and observability, so scaling stays relatively simple.

https://medium.com/lumen-engineering-blog/our-journey-with-apache-flink-part-1-operation-and-deployment-tips-5c23e1b96bf7

AutoTrader: Scoring Adverts Quickly but Fairly

AutoTrader writes about its advert scoring feature, which uses prior weightings to give adverts with few observations a fair score that changes smoothly as more data arrives. Thank you, Mark Crossfield, for contributing the article to Data Engineering Weekly.

https://engineering.autotrader.co.uk/2022/10/28/scoring-adverts-quickly-but-fairly.html
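One common way to get this behaviour is to blend each advert's observed rate with a prior rate, where the prior counts as a fixed number of pseudo-observations. The sketch below illustrates that general technique under assumed numbers. It is not AutoTrader's implementation, and `smoothed_score`, `prior_mean`, and `prior_weight` are hypothetical names.

```python
def smoothed_score(successes, observations, prior_mean, prior_weight):
    """Blend an advert's observed success rate with a prior rate.

    prior_weight acts like a number of pseudo-observations at prior_mean:
    with no data the score equals the prior, and as observations grow the
    score moves smoothly toward the advert's own observed rate.
    """
    if observations < 0 or successes < 0 or successes > observations:
        raise ValueError("need 0 <= successes <= observations")
    if prior_weight <= 0:
        raise ValueError("prior_weight must be positive")
    return (successes + prior_mean * prior_weight) / (observations + prior_weight)


# Hypothetical numbers: a site-wide rate of 5%, worth 100 pseudo-observations.
PRIOR_MEAN, PRIOR_WEIGHT = 0.05, 100

for successes, observations in [(0, 0), (1, 1), (10, 100), (500, 5000)]:
    score = smoothed_score(successes, observations, PRIOR_MEAN, PRIOR_WEIGHT)
    print(f"{successes}/{observations} -> {score:.4f}")
```

A single success on a single view barely moves the score away from the prior, while an advert with thousands of observations is scored almost entirely on its own data.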

Trivago: Marketing Attribution: Evaluating The Path to Purchase in the Product Ecosystem

It is an excellent article about marketing attribution and the trade-offs between attribution models. The author discusses the pros and cons of the Last Touch Attribution Model, First Touch Attribution Model, Linear Attribution Model, and Multi-Touch Attribution Model (Markov Chain).

https://tech.trivago.com/post/2022-12-06-marketing-attribution-evaluating-the-path-to-purchase/
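For readers new to these models, here is a minimal sketch of the three simplest ones applied to a single conversion path. The channel names and the path are made up for illustration, and the Markov chain model is left out because it needs many paths to estimate transition probabilities.

```python
from collections import defaultdict


def first_touch(path):
    """Give all credit to the first channel the user interacted with."""
    return {path[0]: 1.0}


def last_touch(path):
    """Give all credit to the last channel before the conversion."""
    return {path[-1]: 1.0}


def linear(path):
    """Split credit equally across every touchpoint in the path."""
    credit = defaultdict(float)
    share = 1.0 / len(path)
    for channel in path:
        credit[channel] += share
    return dict(credit)


# Hypothetical ordered touchpoints that ended in one conversion.
path = ["search", "email", "display", "email"]

print("first touch:", first_touch(path))
print("last touch: ", last_touch(path))
print("linear:     ", linear(path))
```

In this example, the repeated `email` touchpoint earns twice the credit of the other channels under the linear model, gets all of it under last touch, and none under first touch.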

TASK Group: Reshaping Data Engineering at Plexure

A fragmented, multi-cloud data infrastructure is a nightmare for innovation. The Plexure team writes about how it simplified its architecture by moving from a mix of Azure & AWS services to Databricks and Prefect.

https://medium.com/task-group/reshaping-data-engineering-at-plexure-5897bf398b2b

HomeToGo: How HomeToGo has connected Superset Dashboards to dbt Exposures to improve Data Discoverability

dbt Exposure is one of my favorite features; it helps codify a model's downstream usage. Defining the upstream dependencies may need to become more scalable, but it is a helpful feature for delivering last-mile insights with data visualization tools. HomeToGo writes about its experience of integrating dbt Exposures with Apache Superset.

https://engineering.hometogo.com/how-hometogo-has-connected-superset-dashboards-to-dbt-exposures-to-improve-data-discoverability-3d0add162e4a

Angad Singh: Data tooling is not the problem, processes and people are

An opinionated tool/platform that drives behavioral and process change is the secret sauce of many successful companies. In a way, the title is misleading, but the essence of the blog is to build tools that drive process change rather than general-purpose tools.

https://angadsg.medium.com/data-tooling-is-not-the-problem-processes-and-people-are-da25973caa2f

Tony Seale: Building Your Connected Data Catalog

Standardization of naming and glossary is often a significant hurdle in data management. The author proposes taking inspiration from the Schema.org approach of shareable dataset definitions across organizations. Data Contracts are a critical missing piece to achieve this, since the shared definitions must integrate with the developer's workflow instead of living in a separate one.

https://medium.com/@Tonyseale/building-your-connected-data-catalog-634674b41770

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
