Data Engineering Weekly #153
The Weekly Data Engineering Newsletter
RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.
Lorin Hochstein: “Human error” means they don’t understand how the system worked
The post is not directly about data engineering but about system operations in general. I included it because I often see strongly worded LinkedIn posts blaming humans, especially for data quality issues.
How about we play this game: whenever a LinkedIn post blames humans for data quality issues, replace “human error” with “they don’t understand how the system worked.”
Netflix: Our First Netflix Data Engineering Summit
Netflix publishes the tech talk videos from its internal data engineering summit. It is great to see an internal tech talk series focused on data engineering. My highlight is the talk on the data processing pattern for incremental data pipelines. Databricks' Auto Loader works in a similar way using RocksDB. Are there any mainstream orchestration engines supporting these patterns out of the box?
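To make the incremental idea concrete, here is a minimal sketch of file-level incremental processing: keep a record of what has already been processed and handle only the new arrivals on each run. It is a generic illustration, not Netflix's or Databricks' implementation. The function name `process_new_files`, the in-memory `set` standing in for a durable state store such as RocksDB, and the file names are all assumptions made for the example.

```python
def process_new_files(available_files, processed_state, handler):
    """Run `handler` only on files not seen in earlier runs.

    `processed_state` is a plain set here; a real pipeline would keep
    this state in a durable store so it survives restarts (assumption).
    """
    new_files = sorted(f for f in available_files if f not in processed_state)
    for path in new_files:
        handler(path)
        # Record the file only after the handler succeeds, so a failed
        # file is retried on the next run.
        processed_state.add(path)
    return new_files


state = set()
handled = []
process_new_files(["a.json", "b.json"], state, handled.append)
# The second run sees one extra file and processes only that one.
process_new_files(["a.json", "b.json", "c.json"], state, handled.append)
```

An orchestrator that supported this pattern out of the box would own the state store and the "what is new since last run" question, leaving only the handler to the pipeline author.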
Instacart: Monte Carlo, Puppetry and Laughter: The Unexpected Joys of Prompt Engineering
Instacart writes an article exploring the prompt techniques used for internal productivity tooling. It is one of the well-structured articles I have come across on prompt techniques, discussing Chain of Thought, ReAct, and other advanced prompting techniques to boost productivity.
LinkedIn: Privacy Preserving Single Post Analytics
LinkedIn writes about its efforts to balance the utility of post analytics with the privacy of its members. The article discusses the design of PEDAL (Privacy Enhanced Data Analytics Layer), a mid-tier service between applications and backend services like Pinot, to implement differential privacy, including differentially private algorithms, a metadata store, and a privacy loss tracker.
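As a rough illustration of how a differentially private count and a privacy loss tracker fit together, here is a minimal sketch using the standard Laplace mechanism. It is not PEDAL's design or API. The class `PrivacyLossTracker`, the function `private_count`, and the budget values are hypothetical names and numbers chosen for the example.

```python
import random


class PrivacyLossTracker:
    """Tracks cumulative epsilon spent against a fixed budget (hypothetical)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance guards against floating-point accumulation.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, tracker, rng, sensitivity=1.0):
    """Return a noisy count, charging the tracker before answering."""
    tracker.charge(epsilon)
    return true_count + laplace_noise(sensitivity / epsilon, rng)


tracker = PrivacyLossTracker(total_epsilon=1.0)
rng = random.Random(0)
noisy_views = private_count(100, epsilon=0.5, tracker=tracker, rng=rng)
```

Each query spends part of the budget, and once the budget is used up the tracker refuses further answers. That is the role a privacy loss tracker plays in limiting how much repeated queries can reveal.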
Sponsored: RudderStack launches Trino Reverse ETL source
The integration supports warehouse-based diffing, making it the most performant Reverse ETL solution for Trino.
RudderStack just launched Trino as a Reverse ETL source. It's the only Trino Reverse ETL solution that supports warehouse-based CDC. With RudderStack and Trino, you can also create custom SQL queries for building data models, execute them using Trino via RudderStack, and seamlessly deliver your data to your downstream business tools. Read the announcement for more details.
SquareUp: How To Train Your Own GenAI Model
SquareUp writes a practical guide to training a GenAI model using GPT2, emphasizing its advantages over larger models like GPT3.5 for certain applications. GPT2 is highlighted for its open-source nature, smaller size, and ability to run on mobile devices, including offline operation. The article covers the essentials of Seq2Seq models, training processes, and considerations for efficient training, such as GPU VRAM, GPU Compute, Max Length, Batch Size, and Number of Records.
PayPal: Declarative Feature Engineering at PayPal
The ultimate goal of any data platform is to allow data scientists to declare their features rather than explicitly specify how to construct them on top of different execution platforms. PayPal talks about declarative feature engineering and how it reduces hidden tech debt and decreases the total cost of ownership.
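A minimal sketch of what "declaring" a feature can look like: the data scientist states the entity, the column, and the aggregation, and a separate engine decides how to compute it. This is a generic illustration, not PayPal's system. `FeatureSpec`, `compute_feature`, and the sample rows are hypothetical, and the in-memory engine stands in for whatever execution platform sits underneath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Declarative description of a feature: what to compute, not how."""
    name: str
    entity_key: str
    column: str
    aggregation: str  # one of "sum", "count", "max"


AGGREGATIONS = {"sum": sum, "count": len, "max": max}


def compute_feature(spec, rows):
    """One possible engine: an in-memory group-by over a list of dicts."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[spec.entity_key], []).append(row[spec.column])
    aggregate = AGGREGATIONS[spec.aggregation]
    return {entity: aggregate(values) for entity, values in grouped.items()}


rows = [
    {"user": "a", "amount": 10},
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
]
total_spend = FeatureSpec("total_spend", "user", "amount", "sum")
result = compute_feature(total_spend, rows)
```

Because the spec carries no execution details, the same declaration could in principle be handed to a batch engine or a streaming engine without the feature author rewriting anything.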
Picnic: Running demand forecasting machine learning models at scale
Picnic writes about how it leverages advanced machine learning models, including Temporal Fusion Transformers and TiDE, for demand forecasting to minimize food waste and meet customer demand efficiently. It is great to see the article emphasize the importance of maintaining data quality and regularly retraining models to address data drift.
Vimeo: ClickHouse is in the house
Vimeo writes about its migration journey from an HBase + Phoenix infrastructure to ClickHouse. The article discusses the growing challenges of video analytics, scalability issues with HBase, and ClickHouse integration with Apache Spark.
Adyen: Apache Airflow at Adyen: Our journey and challenges to achieve reliability at scale
Adyen writes about its journey and challenges in achieving reliability at scale while operating Apache Airflow. The article focuses on four key areas:
Setting up a scalable multi-tenant Airflow setup
Handling machine failures
Handling task priority
Maximizing user productivity
Mixpanel: How Mixpanel Built a “Fast Lane” for Our Modern Data Stack
Mixpanel writes about the challenges in marketing architecture with multi-level sync from different apps, the delays with reverse ETL tooling, and the need for real-time architecture for marketing ops. The article discusses the available options for bringing near-real-time infrastructure to marketing ops, such as:
Build classic real-time infrastructure with stream processing frameworks
Use low-code/no-code integration tools
Connect all internal SaaS tools with their native integrations
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.