Data Engineering Weekly #153

The Weekly Data Engineering Newsletter

Dec 18, 2023

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.

Lorin Hochstein: “Human error” means they don’t understand how the system worked

The post is not directly about data engineering but about system operations in general. I included it because I often see loud LinkedIn posts blaming human error, especially for data quality issues.

How about we play this game: whenever a LinkedIn post blames humans for data quality issues, replace “human error” with “they don’t understand how the system worked.”

https://surfingcomplexity.blog/2023/12/10/human-error-means-they-dont-understand-how-the-system-worked/

Netflix: Our First Netflix Data Engineering Summit

Netflix published the tech talk videos from its internal data summit. It is great to see an internal tech talk series focused on data engineering. My highlight is the talk on data processing patterns for incremental data pipelines. Databricks’ Auto Loader works in a similar way, using RocksDB. Are there any mainstream orchestration engines that support these patterns out of the box?
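
To make the incremental idea concrete, here is a minimal sketch of one common variant: remember which input files were already processed and only touch the new ones on each run. This is not how Netflix or Databricks implement it. The `ProcessedFileStore` class and `run_incremental` function are hypothetical names, and SQLite stands in for an embedded key-value store such as RocksDB.

```python
import sqlite3
from typing import Callable, Iterable, List


class ProcessedFileStore:
    """Tracks which input files were already processed.

    SQLite is a stand-in for an embedded key-value store (an assumption
    made for illustration only).
    """

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS processed (file_name TEXT PRIMARY KEY)"
        )

    def is_processed(self, file_name: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM processed WHERE file_name = ?", (file_name,)
        ).fetchone()
        return row is not None

    def mark_processed(self, file_name: str) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO processed (file_name) VALUES (?)", (file_name,)
        )
        self._conn.commit()


def run_incremental(
    available_files: Iterable[str],
    store: ProcessedFileStore,
    process: Callable[[str], None],
) -> List[str]:
    """Process only files not seen in earlier runs; return what was processed."""
    newly_processed: List[str] = []
    for file_name in sorted(available_files):
        if store.is_processed(file_name):
            continue
        process(file_name)
        # Record the file only after processing succeeds, so a failed run
        # retries it next time (at-least-once semantics).
        store.mark_processed(file_name)
        newly_processed.append(file_name)
    return newly_processed


store = ProcessedFileStore()
first_run = run_incremental(["a.parquet", "b.parquet"], store, print)
second_run = run_incremental(["a.parquet", "b.parquet", "c.parquet"], store, print)
```

The second run only picks up `c.parquet`, because the store already holds the first two files. A real pipeline would also need to handle late-arriving or rewritten files, which this sketch ignores.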

https://netflixtechblog.com/our-first-netflix-data-engineering-summit-f326b0589102

Instacart: Monte Carlo, Puppetry and Laughter: The Unexpected Joys of Prompt Engineering

Instacart writes about the prompting techniques it uses for internal productivity tooling. It is one of the more structured articles I have come across on the topic, covering Chain of Thought, ReAct, and other advanced prompting techniques to boost productivity.

https://tech.instacart.com/monte-carlo-puppetry-and-laughter-the-unexpected-joys-of-prompt-engineering-4b9272e0c4eb

LinkedIn: Privacy Preserving Single Post Analytics

LinkedIn writes about its efforts to balance the utility of post analytics with the privacy of its members. The article discusses the design of PEDAL (Privacy Enhanced Data Analytics Layer), a mid-tier service between applications and backend services like Pinot, to implement differential privacy, including differentially private algorithms, a metadata store, and a privacy loss tracker.
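
For readers new to differential privacy, here is a minimal sketch of two of the building blocks named above: a differentially private count and a privacy loss tracker. It is not PEDAL's actual design. It assumes the textbook Laplace mechanism and simple sequential composition of epsilon, and `PrivacyLossTracker` and `private_count` are hypothetical names.

```python
import random


class PrivacyLossTracker:
    """Hypothetical tracker: adds up epsilon spent under sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance guards against floating-point rounding.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    true_count: int,
    epsilon: float,
    tracker: PrivacyLossTracker,
    rng: random.Random,
    sensitivity: float = 1.0,
) -> float:
    """Return a noisy count, charging the budget before answering."""
    tracker.charge(epsilon)
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)
tracker = PrivacyLossTracker(total_epsilon=1.0)
first_answer = private_count(120, 0.5, tracker, rng)
second_answer = private_count(120, 0.5, tracker, rng)

try:
    private_count(120, 0.5, tracker, rng)
    budget_exhausted = False
except RuntimeError:
    # The third query would exceed the total budget, so it is refused.
    budget_exhausted = True
```

The point of the tracker is that repeated queries against the same data are refused once the budget runs out, which is what stops someone from averaging away the noise.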

https://engineering.linkedin.com/blog/2023/privacy-preserving-single-post-analytics

SquareUp: How To Train Your Own GenAI Model

SquareUp writes a practical guide to training a GenAI model using GPT-2, emphasizing its advantages over larger models like GPT-3.5 for certain applications. GPT-2 is highlighted for its open-source nature, smaller size, and ability to run on mobile devices, including offline operation. The article covers the essentials of Seq2Seq models, training processes, and considerations for efficient training, such as GPU VRAM, GPU Compute, Max Length, Batch Size, and Number of Records.

https://developer.squareup.com/blog/how-to-train-your-own-genai-model/

PayPal: Declarative Feature Engineering at PayPal

The ultimate goal of any data platform is to allow data scientists to declare their features rather than explicitly specify how to construct them on top of different execution platforms. PayPal talks about declarative feature engineering and how it reduces hidden tech debt and decreases the total cost of ownership.
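
To illustrate what "declare, don't construct" can look like, here is a minimal sketch in which one feature declaration drives two execution paths: a SQL rendering and an in-memory computation. This is not PayPal's framework. `FeatureSpec`, `to_sql`, `compute_in_memory`, and the `event_date` column are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Any, Dict, List


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical declaration: what the feature is, not how to compute it."""

    name: str
    entity_key: str
    value_column: str
    aggregation: str  # "sum" or "count"
    window_days: int

    def __post_init__(self) -> None:
        if self.aggregation not in ("sum", "count"):
            raise ValueError(f"unsupported aggregation: {self.aggregation}")


def to_sql(spec: FeatureSpec, table: str, as_of: date) -> str:
    """Render the declaration as SQL for a warehouse-style engine."""
    start = as_of - timedelta(days=spec.window_days)
    agg = "SUM" if spec.aggregation == "sum" else "COUNT"
    return (
        f"SELECT {spec.entity_key}, {agg}({spec.value_column}) AS {spec.name} "
        f"FROM {table} "
        f"WHERE event_date > DATE '{start.isoformat()}' "
        f"AND event_date <= DATE '{as_of.isoformat()}' "
        f"GROUP BY {spec.entity_key}"
    )


def compute_in_memory(
    spec: FeatureSpec, rows: List[Dict[str, Any]], as_of: date
) -> Dict[Any, float]:
    """Evaluate the same declaration over plain Python rows (assumes no NULLs)."""
    start = as_of - timedelta(days=spec.window_days)
    result: Dict[Any, float] = {}
    for row in rows:
        if not (start < row["event_date"] <= as_of):
            continue
        increment = row[spec.value_column] if spec.aggregation == "sum" else 1
        key = row[spec.entity_key]
        result[key] = result.get(key, 0) + increment
    return result


spec = FeatureSpec("txn_amount_7d", "user_id", "amount", "sum", 7)
rows = [
    {"user_id": "u1", "amount": 10.0, "event_date": date(2023, 12, 15)},
    {"user_id": "u1", "amount": 5.0, "event_date": date(2023, 12, 17)},
    {"user_id": "u2", "amount": 7.0, "event_date": date(2023, 12, 1)},
]
as_of = date(2023, 12, 18)
sql = to_sql(spec, "transactions", as_of)
features = compute_in_memory(spec, rows, as_of)
```

The data scientist only writes the `FeatureSpec` line. Adding another execution platform means adding another function that reads the spec, without touching any feature definitions.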

https://medium.com/paypal-tech/declarative-feature-engineering-at-paypal-eddcae81c06d

Picnic: Running demand forecasting machine learning models at scale

Picnic writes about how it leverages advanced machine learning models, including Temporal Fusion Transformers and TiDE, for demand forecasting to minimize food waste and meet customer demand efficiently. It is great to see the article emphasize the importance of maintaining data quality and regularly retraining models to address data drift.

https://blog.picnic.nl/running-demand-forecasting-machine-learning-models-at-scale-bd058c9d4aa7

Vimeo: ClickHouse is in the house

Vimeo writes about its migration from an HBase + Phoenix infrastructure to ClickHouse. The article discusses the growing challenges in video analytics, the scalability problems with HBase, and ClickHouse’s integration with Apache Spark.

https://medium.com/vimeo-engineering-blog/clickhouse-is-in-the-house-413862c8ac28

Adyen: Apache Airflow at Adyen: Our journey and challenges to achieve reliability at scale

Adyen writes about its journey and the challenges of operating Apache Airflow reliably at scale. The article focuses on four key areas:

Building a scalable multi-tenant Airflow setup
Handling machine failures
Handling task priority
Maximizing user productivity

https://medium.com/apache-airflow/apache-airflow-at-adyen-our-journey-and-challenges-to-achieve-reliability-at-scale-c5535a7061bf

Mixpanel: How Mixpanel Built a “Fast Lane” for Our Modern Data Stack

Mixpanel writes about the challenges in its marketing architecture: multi-level syncs between different apps, delays introduced by reverse ETL tooling, and the need for a real-time architecture for marketing ops. The article discusses the available options for bringing near-real-time infrastructure to marketing ops, such as:

Build classic real-time infrastructure with stream processing frameworks
Use low-code/no-code integration tools
Connect all internal SaaS tools with their native integrations

[Figure: architectural diagram of the YCat system, shown as a flow chart.]

https://engineering.mixpanel.com/how-mixpanel-built-a-fast-lane-for-our-modern-data-stack-680701736f8c

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
