Data Engineering Weekly

Data Engineering Weekly #119

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Feb 20

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Netflix: Scaling Media Machine Learning at Netflix

Netflix writes about its media-focused ML infrastructure, built to reduce the time from ideation to productization for media ML practitioners. The focus is on bringing in data specific to their media assets and building a feature store. It is interesting to see a pattern similar to the data mart emerging in ML infrastructure. Is it the beginning of domain-specific ML platforms?

https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243


DoorDash: Lifecycle of a Successful ML Product - Reducing Dasher Wait Times

Viewing DoorDash as a real-time supply chain optimization problem is an interesting way to look at its business. DoorDash writes about the ML lifecycle, from ideation to partnering with customers to verify the efficiency gains, with the goal of reducing Dasher wait times.

https://doordash.engineering/2023/02/15/lifecycle-of-a-successful-ml-product-reducing-dasher-wait-times/


Microsoft: Measuring fairness in Machine Learning

Our day-to-day life increasingly depends on AI and ML systems. We often trust AI decision-making over human decision-making in the hope of eliminating human bias. However, we can’t forget that these AI/ML systems are also built by humans with biases. The blog discusses fairness in ML and demonstrates why high accuracy doesn’t mean the algorithm is fair.

https://medium.com/data-science-at-microsoft/measuring-fairness-in-machine-learning-3211b62340b
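
To make the "high accuracy doesn’t mean fair" point concrete, here is a minimal sketch on made-up data. The groups, labels, and predictions below are invented for illustration, and the two metrics shown (per-group accuracy and the gap in positive-prediction rates, often called the demographic parity difference) are common choices, not necessarily the ones the Microsoft blog uses.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return accuracy and positive-prediction (selection) rate per group."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "predicted_pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(truth == pred)
        s["predicted_pos"] += int(pred == 1)
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            "selection_rate": s["predicted_pos"] / s["n"],
        }
        for g, s in stats.items()
    }


# Hypothetical toy data: group A dominates the sample, group B is small.
groups = ["A"] * 18 + ["B"] * 2
y_true = [1] * 9 + [0] * 9 + [1, 1]
# The model is perfect on group A and rejects everyone in group B.
y_pred = [1] * 9 + [0] * 9 + [0, 0]

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = group_metrics(y_true, y_pred, groups)
parity_gap = abs(metrics["A"]["selection_rate"] - metrics["B"]["selection_rate"])

print(f"overall accuracy: {overall_accuracy:.2f}")
for name, m in sorted(metrics.items()):
    print(name, m)
print(f"demographic parity gap: {parity_gap:.2f}")
```

In this toy setup the overall accuracy looks healthy because the majority group hides the fact that every prediction for the minority group is wrong.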


Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture

If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn through strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide


Foodpanda: Menu Ranking

Foodpanda, in a similar line of business to DoorDash, writes about optimizing menu ranking by applying A/B testing. The blog mostly focuses on how it optimizes the batch pipeline using Airflow and BigQuery.

https://medium.com/foodpanda-data/menu-ranking-422ad21f381e


Shopify: ShopifyQL Notebooks: Simplifying Querying with Commerce Data Models

Shopify writes about its analytics engineering process for building a commerce data model that simplifies analytics for non-SQL users. The process applies to any analytics engineering practice:

  1. Start with the business problem

  2. Design mock data

  3. Find the data

  4. Assess data quality and consistency

  5. Assess model freshness

  6. Assess model performance

https://shopifyengineering.myshopify.com/blogs/engineering/building-commerce-data-models-with-shopifyql


Sponsored: Replacing GA4 with Analytics on your Data Cloud

The GA4 migration deadline is fast approaching. If you’re still heavily reliant on Google for data collection and reporting, now is the perfect time to center your data analytics strategy around your data warehouse. Join our webinar to learn how you can replace GA with analytics on your data cloud.
https://www.rudderstack.com/events/replacing-ga4-with-analytics-on-your-data-cloud/


Alex Woodie: Open Table Formats Square Off in Lakehouse Data Smackdown

The open lakehouse table format is emerging as the de facto storage format for data warehouses. The blog compares the features available in Apache Hudi, Apache Iceberg, and Delta Lake. The author recommends:

“If you’re looking for full-featured, more real-time, go with Hudi; if you’re Spark-oriented and very much in the Databricks ecosystem, that choice is obvious. If you’re looking for something with multivendor support right now, go with Iceberg.”

https://www.datanami.com/2023/02/15/open-table-formats-square-off-in-lakehouse-data-smackdown/


Plum Living: Building a semantic layer in Preset (Superset) with dbt

The semantic layer for defining metrics is gaining adoption, as dbt Labs recently acquired Transform. Plum Living writes about integrating the dbt semantic layer with Superset and the developer workflow around it.

https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20


Tinybird: Horizontally scaling Kafka consumers with rendezvous hashing

Optimizing Kafka consumer resources is indeed an exciting problem. Tinybird writes about its optimization problem with Kafka connect and how it uses rendezvous hashing to balance the consumer workload.

https://www.tinybird.co/blog-posts/kafka-horizontal-scaling
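
For readers new to the technique, here is a minimal sketch of rendezvous (highest random weight) hashing. It is a generic illustration, not Tinybird’s implementation: the consumer names, topic names, and the `score` and `assign` helpers are hypothetical.

```python
import hashlib
from typing import Sequence


def score(consumer: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (consumer, key) pair."""
    digest = hashlib.sha256(f"{consumer}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(key: str, consumers: Sequence[str]) -> str:
    """Pick the consumer with the highest weight for this key."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    return max(consumers, key=lambda consumer: score(consumer, key))


# Hypothetical consumers and work items (for example, topics to ingest).
consumers = ["consumer-1", "consumer-2", "consumer-3"]
topics = [f"topic-{i}" for i in range(100)]

before = {t: assign(t, consumers) for t in topics}

# Simulate losing one consumer: only its topics should be reassigned.
remaining = [c for c in consumers if c != "consumer-2"]
after = {t: assign(t, remaining) for t in topics}

moved = [t for t in topics if before[t] != after[t]]
print(f"{len(moved)} of {len(topics)} topics moved after removing consumer-2")
```

The appeal of this design is that every node can compute the assignment independently from the same consumer list, and adding or removing a consumer only moves the keys that consumer wins or loses.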


Data Council - Austin 2023 Discount Code

Data Council - Austin 2023 is approaching, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.

Link to Register: https://www.datacouncil.ai/austin

Promo Code: DataWeekly20


Sourygna Luangsay: BigQuery, Spark, or Dataflow? A Story of Speed and Other Comparisons

Every cloud company offers multiple tools to run analytics, and Google Cloud is no different. The author compares BigQuery, Dataflow, and Spark on Google Cloud to measure performance and cost. Dataflow, not surprisingly, costs much more than Spark and BigQuery.

Thank you, Sourygna Luangsay, for submitting this article to Data Engineering Weekly GitHub.

https://medium.com/cts-technologies/bigquery-spark-or-dataflow-a-story-of-speed-and-other-comparisons-fb1b8fea3619


Antons Tocilins-Ruberts: End-to-End ML Pipelines with MLflow: Tracking, Projects & Serving

MLflow is an open-source platform for managing the machine learning lifecycle. I found this blog an amazing introduction to how MLflow can help build an ML pipeline, from training to registering and serving the model.

https://towardsdatascience.com/end-to-end-ml-pipelines-with-mlflow-tracking-projects-serving-1b491bcdc25f


Craig Kerstiens: Using Postgres FILTER

TIL about the Postgres FILTER clause, a much more readable expression than a CASE statement inside an aggregate. It turns out that FILTER not only increases readability but also improves performance.

https://www.crunchydata.com/blog/using-postgres-filter

Performance Benchmark: https://blog.jooq.org/the-performance-impact-of-sqls-filter-clause/
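
Here is a small side-by-side sketch of the two styles. To keep it runnable without a database server, it uses Python’s built-in sqlite3 module as a stand-in for Postgres (SQLite 3.30 or newer accepts the same aggregate FILTER syntax). The `orders` table and its rows are made up for illustration, and the sketch says nothing about performance.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption made
# so the example runs anywhere; needs SQLite 3.30+ for FILTER).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (status, amount) VALUES (?, ?)",
    [("paid", 10.0), ("paid", 15.0), ("refunded", 5.0), ("pending", 20.0)],
)

# Conditional aggregation written with CASE expressions.
case_query = """
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_count,
    SUM(CASE WHEN status = 'refunded' THEN amount END) AS refunded_amount
FROM orders
"""

# The same aggregation written with the FILTER clause.
filter_query = """
SELECT
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE status = 'paid') AS paid_count,
    SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
"""

case_row = conn.execute(case_query).fetchone()
filter_row = conn.execute(filter_query).fetchone()

print("CASE:  ", case_row)
print("FILTER:", filter_row)
```

Both queries return the same row. The FILTER version keeps the aggregate and its condition visibly separate, which is where the readability gain comes from.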


Finally, Transformers have accelerated the development of new techniques and models for natural language processing (NLP) tasks. I found the Transformers-Recipe GitHub repo a very informative source for learning more about transformers.


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

© 2023 Ananth Packkildurai