Data Engineering Weekly #142

The Weekly Data Engineering Newsletter

Aug 13, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.

Editor’s Note: DewCon.ai Registration is Now Open

Great news! We've overcome some unexpected hiccups, and guess what? Conference registration is officially OPEN! 🎉

Mark your calendars: DEWCon is happening on October 12th at the luxurious Taj Hotel on MG Road. 🏨

We've got some exciting keynotes lined up: Joe Reis, co-author of "Fundamentals of Data Engineering," and Vinoth Chandar, creator of Apache Hudi and founder of OneHouse.ai, will be diving into the latest industry buzz. 📚💡🚀 Stay tuned for all the details! 🔍

https://www.dewcon.ai/

Meta: Scaling the Instagram Explore recommendations system

AI plays an important role in what people see on Meta’s platforms. Meta writes about its multi-stage ranking approach for Instagram Explore, in which each well-defined stage focuses on different objectives and algorithms:

Retrieval
First-stage ranking
Second-stage ranking
Final reranking
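
The stages above work as a funnel: each stage scores fewer candidates with a more expensive model. Here is a minimal Python sketch of that shape. The item fields, scoring weights, and cutoffs are hypothetical and chosen only for illustration; they are not Meta's models or parameters.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    item_id: int
    popularity: float  # hypothetical cheap signal
    affinity: float    # hypothetical expensive, personalized signal
    seen: bool         # hypothetical flag used by the final rerank rule


def top_k(items: List[Item], score: Callable[[Item], float], k: int) -> List[Item]:
    """Keep the k highest-scoring items."""
    return sorted(items, key=score, reverse=True)[:k]


def rank_funnel(candidates: List[Item]) -> List[Item]:
    # Retrieval: cheap filter over the full candidate pool.
    retrieved = top_k(candidates, lambda it: it.popularity, k=6)
    # First-stage ranking: a lightweight model, approximated by an even blend.
    first = top_k(retrieved, lambda it: 0.5 * it.popularity + 0.5 * it.affinity, k=4)
    # Second-stage ranking: a heavier model, approximated by an affinity-heavy blend.
    second = top_k(first, lambda it: 0.2 * it.popularity + 0.8 * it.affinity, k=3)
    # Final reranking: a rule on top of the model order; already-seen items go last.
    return sorted(second, key=lambda it: it.seen)


CANDIDATES = [
    Item(1, 0.9, 0.10, False),
    Item(2, 0.8, 0.90, True),
    Item(3, 0.7, 0.80, False),
    Item(4, 0.6, 0.70, False),
    Item(5, 0.5, 0.20, False),
    Item(6, 0.4, 0.95, False),
    Item(7, 0.3, 0.99, False),
    Item(8, 0.2, 0.50, False),
]

final_ids = [it.item_id for it in rank_funnel(CANDIDATES)]
```

Note that item 7 has the highest affinity but never reaches the later stages because retrieval drops it. That trade-off between cost and recall is the usual reason such systems tune each stage separately.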

https://engineering.fb.com/2023/08/09/ml-applications/scaling-instagram-explore-recommendations-system/

Weiting Chen: Accelerate Spark SQL Queries with Gluten

Spark SQL started as a stop-gap bridge to make Spark RDDs accessible to a mass audience, and ever since, there has been a constant effort to improve its SQL performance. Databricks attempts to solve this with Photon. Gluten, from Intel and Kyligence, is an open-source attempt to accelerate Spark SQL queries.

https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e

GitHub: https://github.com/oap-project/gluten

Vimeo: From idea to reality - Elevating our customer support through generative AI

The application of GenAI has huge potential to disrupt the customer support industry. Vimeo writes about indexing its Zendesk help articles in a vector database to improve search performance, and using ConversationalRetrievalQAChain to connect the vector store to the LLM.
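
To make the retrieval step concrete, here is a toy Python sketch of the index-then-retrieve pattern. It assumes a bag-of-words "embedding" and an in-memory list in place of a learned embedding model and a real vector database, and it does not use LangChain. The class, function names, and sample articles are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._docs.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


# Hypothetical help-center articles.
store = ToyVectorStore()
store.add("how to reset your account password")
store.add("how to upload a video file")
store.add("billing and refund policy")


def build_prompt(question: str) -> str:
    """Put the retrieved article into the prompt that would be sent to the LLM."""
    context = "\n".join(store.search(question, k=1))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("reset password")
```

A conversational chain adds one more step on top of this: it rewrites the follow-up question using the chat history before running the search.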

https://medium.com/vimeo-engineering-blog/from-idea-to-reality-elevating-our-customer-support-through-generative-ai-101a2c5ea680

OLX Engineering: Machine Learning for Delivery Time Estimation

Accurate delivery time estimation has become crucial to ensuring customer satisfaction and building trust in e-commerce platforms. OLX writes an exciting article that walks through every step of a data science project, from defining the problem and success criteria to taking the solution to production and identifying the next steps.

https://tech.olx.com/machine-learning-for-delivery-time-estimation-1-591c8df849a0

Swiggy: Predicting Food Delivery Time at Cart

Continuing with the delivery time estimation problem, Swiggy writes about the challenges in building a food delivery time estimation service. Food delivery is a multi-stage business process in which each stage has its own set of features for predicting the delivery time:

Food Preparation Features
Assignment and First Mile Features
Last Mile Features

https://bytes.swiggy.com/predicting-food-delivery-time-at-cart-cda23a84ba63

Just Eat: Building a Listwise Ranking TF recommender - A step-by-step guide.

Taking a break from delivery estimation, Just Eat, a food delivery service, publishes a step-by-step guide to building a listwise ranking TF recommender. Listwise ranking, often discussed in the context of Learning to Rank (L2R) in the machine learning domain, is an approach to ranking where entire lists of items are used as training examples rather than individual items or pairs of items.
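
To show what "the whole list is the training example" means, here is a small Python sketch of one common listwise loss, a ListNet-style softmax cross-entropy. This is an illustrative assumption; the loss used in the Just Eat guide may differ, and the scores and relevance labels below are made up.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over one list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_softmax_loss(scores: List[float], relevance: List[float]) -> float:
    """Cross-entropy between the relevance distribution and the score distribution.

    Both distributions are computed over the entire list, so the loss depends on
    how every item is scored relative to the others, not on items in isolation.
    """
    target = softmax(relevance)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


relevance = [3.0, 1.0, 0.0]        # hypothetical graded labels for one list
good_scores = [2.0, 0.5, -1.0]     # model agrees with the label order
bad_scores = [-1.0, 0.5, 2.0]      # model reverses the label order

good_loss = listwise_softmax_loss(good_scores, relevance)
bad_loss = listwise_softmax_loss(bad_scores, relevance)
```

Here `good_loss` comes out lower than `bad_loss`, since the first scoring agrees with the label order across the whole list.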

https://medium.com/justeattakeaway-tech/building-a-listwise-ranking-tf-recommender-a-step-by-step-guide-727e572860b8

DraftKings: Intro — Sports Intelligence at DraftKings

Sports analytics is an exciting domain, and I’m sure many of our readers are interested. DraftKings writes an in-depth article about its Sports Intelligence team's mission, a high-level architectural overview, and the data science software development lifecycle for MLOps.

https://medium.com/draftkings-engineering/intro-sports-intelligence-draftkings-51c285bb3737

Faire: The building blocks of Faire’s Data Team onboarding

Onboarding is a vital aspect that defines the engineering culture of a company. A best-in-class onboarding process is a reflection of an inclusive and empathetic engineering culture.

Faire writes about its onboarding process and course structure, designed to help engineers and stakeholders become familiar with data engineering.

A visual showing each class category and the different classes that fall into each one. Data Infrastructure includes Querying Data 101 and Data Quality 101; Data Warehouse includes Data Infra 101, Airflow 101, and Airflow 201; Experimentation includes Experimentation 101 and Experimentation 201; Machine Learning includes Features 101, Model Training 101, Model Deployment 101, and Features 201; Backend includes Backend 101 and Backend 201.

https://craft.faire.com/the-building-blocks-of-faires-data-team-onboarding-628229b043b6

Sivabalan Narayanan: Different Query types with Apache Hudi

Incremental data processing and time travel are the core access patterns of data pipelines, and they are the foundation of transactional LakeHouse formats like Apache Hudi. However, I think SQL does not fully reflect the underlying pattern. The author's walkthrough of the different query types in Apache Hudi gives a good foundation for thinking about LakeHouse access patterns.
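
As a mental model for these access patterns, here is a toy Python sketch of a commit timeline with snapshot, time travel, and incremental reads. It is a simplified illustration and not the Hudi API. The timeline, record keys, and function names are hypothetical, and the begin instant is treated as exclusive in this toy.

```python
from typing import Dict, List, Tuple

# Each commit is (instant_time, upserts). Later upserts overwrite earlier values.
# The timeline is assumed to be ordered by instant time.
Commit = Tuple[str, Dict[str, int]]

TIMELINE: List[Commit] = [
    ("001", {"a": 1, "b": 2}),
    ("002", {"b": 20, "c": 3}),
    ("003", {"a": 100}),
]


def time_travel(timeline: List[Commit], as_of: str) -> Dict[str, int]:
    """Table state as of the given instant (inclusive)."""
    state: Dict[str, int] = {}
    for instant, upserts in timeline:
        if instant <= as_of:
            state.update(upserts)
    return state


def snapshot(timeline: List[Commit]) -> Dict[str, int]:
    """Latest table state: time travel to the newest commit."""
    return time_travel(timeline, timeline[-1][0]) if timeline else {}


def incremental(timeline: List[Commit], begin: str) -> Dict[str, int]:
    """Latest values of only the records changed after the begin instant."""
    changed: Dict[str, int] = {}
    for instant, upserts in timeline:
        if instant > begin:
            changed.update(upserts)
    return changed


latest = snapshot(TIMELINE)
as_of_002 = time_travel(TIMELINE, "002")
since_001 = incremental(TIMELINE, "001")
```

A downstream pipeline that stores the last instant it processed can call the incremental read with that instant and handle only the changed records, instead of rescanning the whole table.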

https://medium.com/@simpsons/different-query-types-with-apache-hudi-e14c2064cfd6

Lucas Gabriel: Dask, Dagster, and Coiled for Production Analysis at OnlineApp

Dask is a flexible parallel computing library for analytics. As Python takes center stage in AI computing, reading about parallel processing engines like Dask is fascinating. The blog narrates the integration of Dask with Dagster for analytical pipeline design.

https://medium.com/coiled-hq/dask-dagster-and-coiled-for-production-analysis-at-onlineapp-f22eb2573967

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
