Data Engineering Weekly #119

The Weekly Data Engineering Newsletter

Feb 20, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Netflix: Scaling Media Machine Learning at Netflix

Netflix writes about its media-focused ML infrastructure, built to reduce the time from ideation to productization for media ML practitioners. The focus is on bringing in data specific to their media assets and building a feature store. It is interesting to see a pattern similar to the data mart emerging in ML infrastructure. Is it the beginning of domain-specific ML platforms?

https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243

DoorDash: Lifecycle of a Successful ML Product - Reducing Dasher Wait Times

Viewing DoorDash as a real-time supply chain optimization problem is an interesting way to look at its business. DoorDash writes about the ML lifecycle, from ideation to partnering with customers to verify the product’s effectiveness in reducing Dasher wait times.

https://doordash.engineering/2023/02/15/lifecycle-of-a-successful-ml-product-reducing-dasher-wait-times/

Microsoft: Measuring fairness in Machine Learning

Our day-to-day life increasingly depends on AI and ML systems. We often trust AI decision-making more than human decision-making, expecting it to eliminate human bias. However, we can’t forget that these AI/ML systems are also built by humans with biases. The blog discusses fairness in ML and demonstrates why high accuracy doesn’t mean the algorithm is fair.
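To make the accuracy-versus-fairness point concrete, here is a minimal sketch on synthetic toy data. The records, the group labels "A" and "B", and the choice of per-group true positive rate as the fairness metric are all assumptions for illustration and are not taken from the Microsoft post.

```python
from collections import defaultdict

# Synthetic toy data (hypothetical): (group, true_label, predicted_label).
records = (
    [("A", 1, 1)] * 5 + [("A", 0, 0)] * 5    # group A: every prediction correct
    + [("B", 1, 0)] * 2 + [("B", 0, 0)] * 8  # group B: every positive is missed
)


def accuracy(rows):
    """Share of rows where the prediction matches the true label."""
    return sum(y == p for _, y, p in rows) / len(rows)


def true_positive_rate_by_group(rows):
    """Per group: share of truly positive rows that were predicted positive."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for group, y, p in rows:
        if y == 1:
            positives[group] += 1
            hits[group] += int(p == 1)
    return {group: hits[group] / positives[group] for group in positives}


overall_accuracy = accuracy(records)
tpr = true_positive_rate_by_group(records)

print(f"overall accuracy: {overall_accuracy:.2f}")
for group, rate in sorted(tpr.items()):
    print(f"true positive rate for group {group}: {rate:.2f}")
```

On this toy data the model is right 18 times out of 20, yet it never identifies a positive case in group B. An aggregate accuracy number hides that gap, which is why per-group metrics are worth checking.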

https://medium.com/data-science-at-microsoft/measuring-fairness-in-machine-learning-3211b62340b

Foodpanda: Menu Ranking

Foodpanda, in a line of business similar to DoorDash, talks about optimizing menu ranking with A/B testing. The blog mostly focuses on how it optimizes the batch pipeline using Airflow and BigQuery.

https://medium.com/foodpanda-data/menu-ranking-422ad21f381e

Shopify: ShopifyQL Notebooks: Simplifying Querying with Commerce Data Models

Shopify writes about its analytics engineering process for building a commerce data model that simplifies analytics for non-SQL users. The process applies to any analytics engineering practice:

Start with the business problem
Design mock data
Find the data
Assess data quality and consistency
Assess model freshness
Assess model performance

https://shopifyengineering.myshopify.com/blogs/engineering/building-commerce-data-models-with-shopifyql

Alex Woodie: Open Table Formats Square Off in Lakehouse Data Smackdown

The open lakehouse format is emerging as the de facto storage format for data warehouses. The blog compares the features available in Apache Hudi, Iceberg, and Delta Lake. The author recommends:

“If you’re looking for full-featured, more real-time, go with Hudi; if you’re Spark-oriented and very much in the Databricks ecosystem, that choice is obvious. If you’re looking for something with multivendor support right now, go with Iceberg.”

https://www.datanami.com/2023/02/15/open-table-formats-square-off-in-lakehouse-data-smackdown/

Plum Living: Building a semantic layer in Preset (Superset) with dbt

The semantic layer for defining metrics is gaining adoption, as dbt Labs recently acquired Transform. Plum Living writes about integrating the dbt semantic layer with Superset and the resulting developer workflow.

https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20

Tinybird: Horizontally scaling Kafka consumers with rendezvous hashing

Optimizing Kafka consumer resources is an exciting problem. Tinybird writes about its optimization problem with Kafka connect and how it uses rendezvous hashing to balance the consumer workload.
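For readers new to the technique, here is a minimal sketch of rendezvous (highest random weight) hashing. It is not Tinybird's implementation; the consumer names, the partition names, and the use of SHA-256 as the scoring function are assumptions for illustration.

```python
import hashlib


def score(consumer: str, item: str) -> int:
    """Deterministic pseudo-random weight for a (consumer, item) pair."""
    digest = hashlib.sha256(f"{consumer}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(item: str, consumers: list[str]) -> str:
    """Give the item to the consumer with the highest weight for it."""
    if not consumers:
        raise ValueError("no consumers available")
    return max(consumers, key=lambda consumer: score(consumer, item))


# Hypothetical workload: 12 partitions spread over 3 consumers.
partitions = [f"events-{i}" for i in range(12)]
consumers = ["consumer-a", "consumer-b", "consumer-c"]

before = {p: assign(p, consumers) for p in partitions}
after = {p: assign(p, consumers[:-1]) for p in partitions}  # consumer-c leaves

# Only the partitions that belonged to the departed consumer are reassigned.
moved = [p for p in partitions if before[p] != after[p]]
print(f"{len(moved)} of {len(partitions)} partitions moved")
```

Every node can compute the same assignment independently, with no shared lookup table, and removing a consumer only moves the items that consumer owned.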

https://www.tinybird.co/blog-posts/kafka-horizontal-scaling

Data Council - Austin 2023 Discount Code

Data Council - Austin 2023 is approaching, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.

Link to Register: https://www.datacouncil.ai/austin

Promo Code: DataWeekly20

Sourygna Luangsay: BigQuery, Spark, or Dataflow? A Story of Speed and Other Comparisons

Every cloud company offers multiple tools to run analytics, and Google Cloud is no different. The author compares BigQuery, Dataflow, and Spark on Google Cloud to measure performance and cost. Dataflow, not surprisingly, costs much more than Spark and BigQuery.

Thank you, Sourygna Luangsay, for submitting this article to the Data Engineering Weekly GitHub.

https://medium.com/cts-technologies/bigquery-spark-or-dataflow-a-story-of-speed-and-other-comparisons-fb1b8fea3619

Antons Tocilins-Ruberts: End-to-End ML Pipelines with MLflow: Tracking, Projects & Serving

MLflow is an open-source platform for managing the machine learning lifecycle. I found this blog an amazing introduction to how MLflow can help build an ML pipeline that covers training, registering, and serving the model.

https://towardsdatascience.com/end-to-end-ml-pipelines-with-mlflow-tracking-projects-serving-1b491bcdc25f

Craig Kerstiens: Using Postgres FILTER

TIL about the Postgres FILTER clause, a much more readable alternative to a CASE expression. It turns out that FILTER not only increases readability but also improves performance.
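A minimal sketch of the readability difference follows. To keep it self-contained, it uses Python's built-in sqlite3 module as a stand-in for Postgres (SQLite accepts the same FILTER syntax on aggregates from version 3.30). The orders table and its values are hypothetical, and the sketch says nothing about performance.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("paid", 10), ("paid", 20), ("refunded", 5), ("pending", 7)],
)

# Conditional aggregates written with FILTER.
filter_query = """
SELECT
    count(*)    FILTER (WHERE status = 'paid')     AS paid_orders,
    sum(amount) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
"""

# The same aggregates written with CASE expressions.
case_query = """
SELECT
    count(CASE WHEN status = 'paid' THEN 1 END)        AS paid_orders,
    sum(CASE WHEN status = 'refunded' THEN amount END) AS refunded_amount
FROM orders
"""

filter_result = conn.execute(filter_query).fetchone()
case_result = conn.execute(case_query).fetchone()
print(filter_result, case_result)
```

Both queries return the same row. The FILTER version keeps the aggregate function and its condition visually separate, which is the readability gain.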

https://www.crunchydata.com/blog/using-postgres-filter

Performance Benchmark: https://blog.jooq.org/the-performance-impact-of-sqls-filter-clause/

Finally, Transformers have accelerated the development of new techniques and models for natural language processing (NLP) tasks. I found the Transformers-Recipe GitHub repo a very informative source for learning more about transformers.

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
