Data Engineering Weekly #119
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Netflix: Scaling Media Machine Learning at Netflix
Netflix writes about its media-focused ML infrastructure, built to reduce the time from ideation to productization for media ML practitioners. The focus is on bringing in data specific to its media assets and building a feature store. It is interesting to see a pattern similar to the Data Mart emerging in ML infrastructure. Is it the beginning of domain-specific ML platforms?
DoorDash: Lifecycle of a Successful ML Product - Reducing Dasher Wait Times
Viewing DoorDash as a real-time supply chain optimization problem is an interesting way to look at its business. DoorDash writes about the ML product lifecycle, from ideation to partnering with customers to verify efficiency gains, with the goal of reducing Dasher wait times.
Microsoft: Measuring fairness in Machine Learning
Our day-to-day life increasingly depends on AI and ML systems. We often trust AI decision-making over human decision-making in the hope of eliminating human bias. However, we can’t forget that these AI/ML systems are also built by humans with biases. The blog discusses fairness in ML and demonstrates why high accuracy doesn’t mean the algorithm is fair.
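To make the accuracy-versus-fairness point concrete, here is a minimal sketch in plain Python. The data is made up for illustration (two hypothetical groups, "A" and "B", with hand-written labels and predictions) and is not taken from the Microsoft blog. It compares overall accuracy with per-group accuracy and per-group selection rate.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def selection_rate(y_true, y_pred):
    """Fraction of rows predicted positive (labels are not used)."""
    return sum(y_pred) / len(y_pred)


def by_group(metric, y_true, y_pred, groups):
    """Apply a metric separately to the rows of each group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: metric(ts, ps) for g, (ts, ps) in buckets.items()}


# Hypothetical data: group A has 90 rows and is predicted perfectly;
# group B has 10 rows and the model always predicts the negative class.
y_true = [1] * 45 + [0] * 45 + [1] * 5 + [0] * 5
y_pred = [1] * 45 + [0] * 45 + [0] * 10
groups = ["A"] * 90 + ["B"] * 10

overall_accuracy = accuracy(y_true, y_pred)
group_accuracy = by_group(accuracy, y_true, y_pred, groups)
group_selection = by_group(selection_rate, y_true, y_pred, groups)

# Gap in selection rate between groups (one common fairness measure,
# often called the demographic parity difference).
parity_gap = max(group_selection.values()) - min(group_selection.values())

print("overall accuracy:", overall_accuracy)
print("accuracy by group:", group_accuracy)
print("selection rate by group:", group_selection)
print("selection rate gap:", parity_gap)
```

In this toy setup the overall accuracy looks high because the larger group dominates the average, while the smaller group is never selected at all. Which fairness metric is appropriate depends on the application; selection rate gap is only one option.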
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn through strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Foodpanda: Menu Ranking
Foodpanda, which operates in a similar space to DoorDash, talks about optimizing menu ranking with A/B testing. The blog mostly focuses on how it optimizes the batch pipeline using Airflow and BigQuery.
Shopify: ShopifyQL Notebooks: Simplifying Querying with Commerce Data Models
Shopify writes about its analytics engineering process for building a commerce data model that simplifies analytics for non-SQL users. The process applies to any analytics engineering practice that starts with the business problem:
Design mock data
Find the data
Assess data quality and consistency
Assess model freshness
Assess model performance
Sponsored: Replacing GA4 with Analytics on your Data Cloud
The GA4 migration deadline is fast approaching. If you’re still heavily reliant on Google for data collection and reporting, now is the perfect time to center your data analytics strategy around your data warehouse. Join our webinar to learn how you can replace GA with analytics on your data cloud.
Alex Woodie: Open Table Formats Square Off in Lakehouse Data Smackdown
The open lakehouse table format is emerging as the de facto storage format for data warehouses. The blog compares the features available in Apache Hudi, Iceberg, and Delta Lake. The author recommends:
“If you’re looking for full-featured, more real-time, go with Hudi; if you’re Spark-oriented and very much in the Databricks ecosystem, that choice is obvious. If you’re looking for something with multivendor support right now, go with Iceberg.”
Plum Living: Building a semantic layer in Preset (Superset) with dbt
The semantic layer as the place to define metrics is gaining adoption, with dbt Labs recently acquiring Transform. Plum Living writes about integrating the dbt semantic layer with Superset and the resulting developer workflow.
Tinybird: Horizontally scaling Kafka consumers with rendezvous hashing
Optimizing Kafka consumer resources is an exciting problem. Tinybird writes about its optimization problem with Kafka connect and how it uses rendezvous hashing to balance the consumer workload.
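For readers new to rendezvous (highest random weight) hashing, here is a minimal sketch of the general technique in Python. It is not Tinybird's implementation: the consumer names, the topic/partition key format, and the choice of SHA-256 as the scoring hash are all assumptions made for illustration. Each key goes to the consumer with the highest hash score for that key, so removing a consumer only moves the keys that consumer owned.

```python
import hashlib


def score(consumer, key):
    """Deterministic pseudo-random weight for a (consumer, key) pair."""
    digest = hashlib.sha256(f"{consumer}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(key, consumers):
    """Pick the consumer with the highest score for this key."""
    return max(consumers, key=lambda consumer: score(consumer, key))


# Hypothetical consumers and work items (names are illustrative only).
consumers = ["consumer-0", "consumer-1", "consumer-2", "consumer-3"]
keys = [f"topic-{t}/partition-{p}" for t in range(20) for p in range(8)]

before = {k: assign(k, consumers) for k in keys}

# Simulate one consumer going away and recompute the assignment.
remaining = [c for c in consumers if c != "consumer-2"]
after = {k: assign(k, remaining) for k in keys}

# Only keys previously owned by the removed consumer change owner.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The design choice worth noting is that no shared ring or lookup table is needed: any node that knows the current consumer list can compute the same assignment independently.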
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is approaching, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
Sourygna Luangsay: BigQuery, Spark, or Dataflow? A Story of Speed and Other Comparisons
Every cloud company offers multiple tools to run analytics, and Google Cloud is no different. The author compares BigQuery, Dataflow, and Spark on Google Cloud to measure performance and cost. Dataflow, not surprisingly, costs much more than Spark and BigQuery.
Thank you, Sourygna Luangsay, for submitting this article to the Data Engineering Weekly GitHub.
Antons Tocilins-Ruberts: End-to-End ML Pipelines with MLflow: Tracking, Projects & Serving
MLflow is an open-source platform for managing the machine learning lifecycle. I found this blog an excellent introduction to how MLflow can help build an ML pipeline that covers training, registering, and serving the model.
Craig Kerstiens: Using Postgres FILTER
TIL about the Postgres FILTER clause, a much more readable expression than a CASE statement. It turns out that FILTER not only increases readability but also improves performance.
Performance Benchmark: https://blog.jooq.org/the-performance-impact-of-sqls-filter-clause/
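As a quick illustration of the readability difference, the sketch below runs the same conditional aggregation twice, once with CASE and once with FILTER. It uses Python's built-in sqlite3 module as a stand-in for Postgres, which is an assumption of this example: SQLite accepts the same FILTER syntax on aggregates from version 3.30 onward. The orders table and its columns are hypothetical. The sketch only shows that the two forms return the same result; it does not measure performance.

```python
import sqlite3

# In-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (status, amount) VALUES (?, ?)",
    [("paid", 10.0), ("paid", 20.0), ("refunded", 5.0), ("pending", 7.5)],
)

# Conditional aggregation written with CASE expressions.
case_query = """
SELECT COUNT(*) AS total,
       SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid,
       SUM(CASE WHEN status = 'refunded' THEN amount END) AS refunded_amount
FROM orders
"""

# The same aggregation written with the FILTER clause.
filter_query = """
SELECT COUNT(*) AS total,
       COUNT(*) FILTER (WHERE status = 'paid') AS paid,
       SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
"""

case_row = conn.execute(case_query).fetchone()
filter_row = conn.execute(filter_query).fetchone()

print("CASE:  ", case_row)
print("FILTER:", filter_row)
```

With FILTER, the condition sits next to the aggregate it applies to, so each output column reads as "aggregate X over rows where Y" without nesting.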
Finally, Transformers have accelerated the development of new techniques and models for natural language processing (NLP) tasks. I found the Transformers-Recipe GitHub repo a very informative source for learning more about transformers.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.