Data Engineering Weekly #111
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Upcoming articles in Data Engineering Weekly
It is a nervous week at the desk as I'm writing the 111th edition of Data Engineering Weekly. If you're a cricket follower, you know what I'm talking about. Yes, we hit the Nelson. If you're unfamiliar with cricket and Nelson, here is an interesting history behind the number 111.
2022 has been a remarkable year and one of the most memorable years for me personally. It's the year's end, so there are plenty of 2023 predictions in Data Engineering. Should I write another prediction? Probably not (I can hear the sigh of relief from all of you). Instead, I will write three back-to-back articles drawn from my experience in recent data engineering projects.
What am I planning to write?
Functional Data Engineering - A Blueprint
There has been an uptick in discussion about data modeling in recent years. Maxime Beauchemin wrote an influential article, Functional Data Engineering — a modern paradigm for batch data processing. I recently worked on sketching a data architecture from scratch, and the article summarizes the data pipeline patterns I put together to adopt the functional principles.
Data Catalogs - A broken promise
I've been a big fan of data catalogs for a long time. After actively observing a couple of data catalog implementations, I started questioning my beliefs. The article reflects my thoughts on and experience with data catalogs. Is the Data Catalog a product or a feature? 🤔
Data Quality - Shift Left, bring consumers closer.
Data Quality is close to my heart, and I continue studying the social & organizational dynamics around quality in other domains. The Data Quality tools available in the market focus on Quality Control but pay little attention to Quality Assurance. I think we have barely scratched the surface of Data Quality Management.
With that, let's jump into this week's top articles.
Etsy: Understanding the collective impact of experiments
Understanding the collective impact of experiments is essential to gauge the overall business impact of changes made across all teams. Etsy writes about using holdout groups to estimate the collective impact of individual experiments.
LinkedIn: Our Approach to Research and A/B Testing
Experimentation is cultural; either you believe in experiments, or you don't. The article from LinkedIn reinforces this by tracing the journey from why experimentation is so important for LinkedIn to its approach to research and A/B testing.
Instacart: Personalizing Recommendations for a Learning User
Instacart summarizes a talk from its Distinguished Speaker Series by Professor Hongning Wang of the University of Virginia. Talks at Google is one of my sources of informative talks, and kudos to the Instacart team for sharing theirs.
Nvidia: What Is a Pretrained AI Model?
Instead of building an AI model from scratch, developers can use pre-trained models and customize them to meet their requirements. How exciting! Nvidia writes about some of the sources of pre-trained AI models and their applications.
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Grab: Zero trust with Kafka
Zero Trust is a security framework requiring all users, whether inside or outside the organization's network, to be authenticated, authorized, and continuously validated for security configuration and posture before being granted or keeping access to applications and data. Grab writes about how it implemented Zero Trust infrastructure for Kafka.
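In Kafka terms, Zero Trust usually starts with mutual TLS (every client presents a certificate) plus ACLs that deny by default. A minimal broker-side sketch of those settings — illustrative only, not Grab's actual configuration, and the file paths are placeholders:

```properties
# server.properties — hypothetical broker settings for an mTLS + ACL setup
listeners=SSL://:9093
security.inter.broker.protocol=SSL

# Require every client to present a certificate (mutual TLS)
ssl.client.auth=required
ssl.keystore.location=/etc/kafka/broker.keystore.jks
ssl.truststore.location=/etc/kafka/broker.truststore.jks

# Deny access unless an ACL explicitly grants it
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
allow.everyone.if.no.acl.found=false
```

With `allow.everyone.if.no.acl.found=false`, an authenticated identity still gets nothing until a per-topic ACL is granted, which is the "never trust, always verify" posture the article describes.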
Lumen: Our journey with Apache Flink (Part 1) - Operation and deployment tips
Lumen shares a few practical tips to run Flink in production, reflecting a few core themes to scale the streaming infrastructure.
A multi-instance (one cluster per job) setup scales better than a multi-tenant (one cluster for all jobs) setup.
Automate the CI/CD pipeline and observability so scaling is relatively simple.
Sponsored: Monte Carlo - [New Guide] Data Pipeline Monitoring 101
Minimize data downtime, maximize data trust. As data becomes increasingly important to modern companies, it's crucial for it to be trusted and accessible. Learn how to stop missing incidents and spending precious engineering hours maintaining static tests, and how data pipeline monitoring can take your team to the next level, by accessing this new "Data Pipeline Monitoring 101" guide.
AutoTrader: Scoring Adverts Quickly but Fairly
AutoTrader writes about its advert scoring feature, which uses prior weightings to give adverts with few observations a fair score that changes smoothly as data accumulates. Thank you, Mark Crossfield, for contributing the article to Data Engineering Weekly.
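One common way to implement this kind of prior weighting is a Bayesian-style shrinkage toward a prior mean — a minimal sketch of the general technique, not necessarily AutoTrader's exact method:

```python
def smoothed_score(observed_mean: float, n_observations: int,
                   prior_mean: float, prior_weight: float) -> float:
    """Shrink an advert's observed score toward a prior.

    The prior dominates when n_observations is small and fades as
    data accumulates, so scores change smoothly with more evidence.
    prior_weight acts like a count of 'pseudo-observations'.
    """
    return (prior_weight * prior_mean + n_observations * observed_mean) / (
        prior_weight + n_observations
    )
```

With no observations the advert simply gets the prior mean; after a few thousand observations the score is essentially the observed mean.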
Trivago: Marketing Attribution: Evaluating The Path to Purchase in the Product Ecosystem
It is an excellent article about marketing attribution and various attribution models. The author discusses the pros and cons of the Last Touch Attribution Model, First Touch Attribution Model, Linear Attribution Model, and Multi-Touch Attribution Model (Markov Chain).
Sponsored: Rudderstack - Take Control of Your Customer Data With RudderStack
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data.
TASK Group: Reshaping Data Engineering at Plexure
A fragmented multi-cloud data infrastructure is a nightmare for innovation. The Plexure team writes about how it simplified its architecture by consolidating Azure & AWS services onto Databricks and Prefect.
HomeToGo: How HomeToGo has connected Superset Dashboards to dbt Exposures to improve Data Discoverability
dbt Exposures are one of my favorite features, helping codify a model's downstream usage. Defining exposures by hand may not scale well, but it is a helpful feature for delivering last-mile insights with data visualization tools. HomeToGo writes about one such experience integrating dbt exposures with Apache Superset.
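For readers who haven't used the feature, an exposure is declared in a dbt YAML file and names the downstream consumer plus the models it depends on. A minimal sketch (the dashboard name, URL, and model are hypothetical):

```yaml
# models/exposures.yml — hypothetical example of a dbt exposure
version: 2

exposures:
  - name: weekly_kpi_dashboard
    type: dashboard
    maturity: high
    url: https://superset.example.com/superset/dashboard/42/
    description: Superset dashboard tracking weekly order KPIs
    depends_on:
      - ref('fct_orders')
    owner:
      name: Data Team
      email: data-team@example.com
```

Once declared, the exposure shows up in dbt's lineage graph, which is what makes the Superset integration the article describes possible.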
Angad Singh: Data tooling is not the problem, processes and people are
An opinionated tool/platform that drives behavioral and process change is the secret sauce of many successful companies. In a way, the title is misleading, but the essence of the blog is to build tools that drive process changes, not general-purpose tools.
Tony Seale: Building Your Connected Data Catalog
Standardization of naming and glossaries is often a significant hurdle in data management. The author proposes taking inspiration from the Schema.org approach of sharable dataset definitions across organizations. Data Contracts are a critical missing piece in achieving this, since the shared definitions must integrate with the developer's workflow instead of living in a separate one.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.