Data Engineering Weekly #124

The Weekly Data Engineering Newsletter

Mar 27, 2023

Contribute to the RudderStack Transformations Library, Win $1,000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/

Editor’s Note: Last Call for Data Council & Get 30% Off the Real-Time Analytics Summit 2023

I’m excited to attend this year’s Data Council conference in Austin. Last year around this time, Bundling vs. Unbundling was the talk of the town. With the current economic climate, I suppose there is a consensus that bundling is inevitable. I’m excited to listen to all the excellent speakers at the conference, and I’m looking forward to a week of learning.

Data Engineering Weekly readers get a 20% discount by applying

Promo Code: DataWeekly20

Data Council website: https://www.datacouncil.ai/austin

The Real-Time Analytics Summit is on April 25-26 in downtown San Francisco, CA. Come and hear talks from companies like StarTree, Confluent, LinkedIn, DoorDash, Imply, and Uber on how they are advancing the state of the art in user-facing analytics delivered instantly.

Go to rtasummit.com and register with DEW30 for 30% off.

dbt: State of Analytics Engineering

dbt publishes the State of Analytics [data???🤔] Engineering report. If you follow Data Engineering Weekly, you know we actively talk about data contracts and how data is a collaboration problem, not just an ETL problem. The survey validates this: two of the top five concerns are data ownership and collaboration between data producers and consumers. Here are the top five key learnings from the report.

46% of respondents plan to invest more in data quality and observability this year, the most popular area for future investment.
Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
Data engineers and analytics engineers are most likely to believe they have clear goals, and most likely to agree their work is valued.
71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”

https://www.getdbt.com/state-of-analytics-engineering-2023/

Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting

It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of the industrial revolution of computing.

Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT, and computing to date look like the spinning jenny that was the start of the industrial revolution.

🤺🤺🤺🤺🤺🤺🤺🤺🤺 May the best LLM win!! 🤺🤺🤺🤺🤺🤺

https://www.rittmananalytics.com/blog/2023/3/26/chatgpt-large-language-models-and-the-future-of-dbt-and-analytics-consulting

Redpoint: Introducing the Redpoint Open-source Top 25

Open source has revolutionized enterprise applications by enabling rapid innovation, collaboration, and cost reduction while increasing transparency and interoperability. Redpoint Ventures writes about the top 25 open-source companies ranked by adoption, momentum, usage, and health.

https://cloudinfrastructure.substack.com/p/introducing-the-redpoint-open-source

NYT: Day in the Life of a Senior Analyst in the Data and Insights Group

NYT publishes an article on a day in the life of a senior analyst. The blog highlights that the job is not just writing SQL but providing strategic business solutions for the organization.

https://open.nytimes.com/day-in-the-life-of-a-senior-analyst-in-the-data-and-insights-group-626c5e1e94f1

Stitch Fix: A New Era of Creativity: Expert-in-the-loop Generative AI at Stitch Fix

Stitch Fix writes about the efficient use of generative AI for their marketing and product content generation. I predict that generative AI will greatly impact last-mile analytics delivery. The BI dashboard tools could be simple prompts. It is an exciting breakthrough to watch and adapt.

https://multithreaded.stitchfix.com/blog/2023/03/06/expert-in-the-loop-generative-ai-at-stitch-fix/

LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

One of the curses of adopting the Lambda Architecture is the need to rewrite business logic in both the streaming and batch pipelines. Spark attempts to solve this with a unified RDD model for streaming and batch; Flink introduces the Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, which offers a unified pipeline abstraction that can run on any supported data processing runtime, such as Samza, Spark, and Flink.

https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc
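
To make the "write the logic once" idea concrete, here is a minimal pure-Python sketch. It is not Apache Beam's API and not LinkedIn's code; the function names, the event shape, and the micro-batch window are all illustrative assumptions. The business logic lives in one function, and a hypothetical batch runner and a hypothetical streaming runner both reuse it.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_page_views(events: Iterable[dict]) -> Counter:
    """Business logic written once: count 'view' events per page."""
    return Counter(e["page"] for e in events if e.get("type") == "view")


def run_batch(events: List[dict]) -> Counter:
    """Hypothetical batch runner: process the whole bounded dataset at once."""
    return count_page_views(events)


def run_streaming(events: Iterator[dict], window_size: int = 2) -> Counter:
    """Hypothetical streaming runner: apply the same logic per micro-batch
    and merge the partial counts as they arrive."""
    totals: Counter = Counter()
    window: List[dict] = []
    for event in events:
        window.append(event)
        if len(window) == window_size:
            totals.update(count_page_views(window))
            window = []
    if window:  # flush the last, partially filled window
        totals.update(count_page_views(window))
    return totals


# Made-up sample events, for illustration only.
EVENTS = [
    {"type": "view", "page": "/home"},
    {"type": "click", "page": "/home"},
    {"type": "view", "page": "/jobs"},
    {"type": "view", "page": "/home"},
    {"type": "view", "page": "/feed"},
]

batch_result = run_batch(EVENTS)
stream_result = run_streaming(iter(EVENTS))
```

Both runners return the same counts because they share `count_page_views`; that shared function is the part a Lambda Architecture would otherwise force you to write twice.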

Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices

Wix writes about managing schemas for 2000 (😬) microservices by standardizing schema structure with protobuf and the Kafka schema registry. Some exciting reads include patterns like an internal Wix Docs approach and the integration of documentation publishing into the CI/CD pipelines.

https://medium.com/wix-engineering/how-wix-manages-schemas-for-kafka-and-grpc-used-by-2000-microservices-2117416ea17b

Instacart: Distributed Machine Learning at Instacart

There is much talk about the impact of Machine Learning; however, a stable underlying infrastructure is vital to realize the true potential of Machine Learning. Instacart writes about its usage of Ray and demonstrates how hosting a monolithic service as the computation backend for all distributed ML applications has limitations in scalability, efficiency, and diversity, given the rapidly evolving and highly diversified nature of ML applications.

https://tech.instacart.com/distributed-machine-learning-at-instacart-4b11d7569423

Airbnb: Building Airbnb Categories with ML & Human in the Loop

Airbnb Engineering writes about creating listing categories using machine learning and human expertise. The "Human-in-the-Loop" strategy involves training models on a rich dataset of user-generated content, then using human reviewers to validate and refine these models iteratively. Airbnb highlights how the Human-in-the-Loop approach improves the accuracy of its category predictions and the user experience on the platform.

https://medium.com/airbnb-engineering/building-airbnb-categories-with-ml-human-in-the-loop-35b78a837725

DoorDash: Using CockroachDB to Reduce Feature Store Costs by 75%

DoorDash writes about switching its feature store from Redis to CockroachDB for efficiency and cost savings. DoorDash highlights the operational overhead of Redis as one motivation; changing the number of nodes in a Redis cluster required the following steps.

Spin up a Redis cluster with the desired number of nodes from the most recent daily backup
Replay all of the writes from the last day on the new cluster
Switch over traffic to the new cluster
Delete the old cluster

https://doordash.engineering/2023/03/21/using-cockroachdb-to-reduce-feature-store-costs-by-75/
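
The four Redis steps listed above can be sketched as a toy simulation. This is only an illustration under stated assumptions: plain dictionaries stand in for Redis clusters, the write log is a list of key/value pairs, and `restore_and_replay` and `Router` are hypothetical names, not DoorDash's tooling or any Redis client API.

```python
from typing import Dict, List, Tuple


def restore_and_replay(
    daily_backup: Dict[str, str],
    write_log: List[Tuple[str, str]],
) -> Dict[str, str]:
    """Steps 1 and 2: build the new cluster from the latest backup,
    then replay the writes made since that backup, in order."""
    new_cluster = dict(daily_backup)   # step 1: restore from backup
    for key, value in write_log:       # step 2: replay the last day's writes
        new_cluster[key] = value
    return new_cluster


class Router:
    """Hypothetical traffic router that points readers at one cluster."""

    def __init__(self, active: Dict[str, str]) -> None:
        self.active = active

    def switch_to(self, cluster: Dict[str, str]) -> Dict[str, str]:
        """Step 3: switch traffic; return the old cluster for cleanup."""
        old, self.active = self.active, cluster
        return old


# Made-up data, for illustration only.
old_cluster = {"user:1": "v1", "user:2": "v1", "user:3": "v1"}
backup = {"user:1": "v1", "user:2": "v0"}          # taken before today's writes
todays_writes = [("user:2", "v1"), ("user:3", "v1")]

router = Router(old_cluster)
new_cluster = restore_and_replay(backup, todays_writes)
retired = router.switch_to(new_cluster)
retired.clear()                                     # step 4: delete the old cluster
```

Even in this toy form, the procedure depends on the backup and the write log together covering every key, which hints at why it is tedious to run against a production store.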

Lyft: lyft2vec — Embeddings at Lyft

Lyft Engineering introduces Lyft2Vec, an embedding framework representing various entities within the Lyft ecosystem. Leveraging the power of graph-based embeddings, Lyft2Vec captures the relationships and interactions between different entities, such as drivers, riders, and locations. This approach enables Lyft to optimize its services, including dispatching, pricing, and routing. The article outlines the challenges faced, the methodology, and the results achieved, showcasing the potential of Lyft2Vec in enhancing the platform's overall performance.

https://eng.lyft.com/lyft2vec-embeddings-at-lyft-d4231a76d219

Canva: Understanding a Diverse User Base with Frequency Segmentation at Scale

Canva writes about frequency segmentation at scale using the Buy-Till-You-Die (BTYD) model. The blog taught me a lot about measuring the lifetime value of a customer and segmenting users by buying behavior. The BTYD model is excellent for building a recommendation engine and marketing personalization.

https://canvatechblog.com/understanding-a-diverse-user-base-with-frequency-segmentation-at-scale-34dc285f0f75

Adam Brownell writes an in-depth overview of the BTYD model in the blog "Customer Behavior Modeling: Buy-til-you-Die Models", a brief intro to the BTYD family, Pareto/NBD, & Pareto/GGG for predicting buying behavior.

https://towardsdatascience.com/customer-behavior-modeling-buy-til-you-die-models-6f9580e38cf4
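
BTYD models such as Pareto/NBD are typically fitted on a compact per-customer summary rather than on raw transactions: frequency (number of repeat purchase days), recency (time between the first and last purchase), and T (time between the first purchase and the end of the observation period). Here is a minimal sketch of building that summary in plain Python. The function name and the sample data are made up for illustration, and this is not Canva's implementation.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple


def btyd_summary(
    transactions: Iterable[Tuple[str, date]],
    observation_end: date,
) -> Dict[str, Dict[str, int]]:
    """Summarize (customer_id, purchase_date) pairs into the
    frequency / recency / T inputs used by BTYD-style models."""
    purchase_days: Dict[str, Set[date]] = defaultdict(set)
    for customer_id, purchase_date in transactions:
        purchase_days[customer_id].add(purchase_date)

    summary = {}
    for customer_id, days in purchase_days.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,             # repeat purchase days only
            "recency": (last - first).days,         # customer age at last purchase
            "T": (observation_end - first).days,    # customer age at period end
        }
    return summary


# Made-up transactions, for illustration only.
TRANSACTIONS = [
    ("alice", date(2023, 1, 1)),
    ("alice", date(2023, 1, 15)),
    ("alice", date(2023, 3, 1)),
    ("bob", date(2023, 2, 10)),
]

SUMMARY = btyd_summary(TRANSACTIONS, observation_end=date(2023, 3, 27))
```

A one-time buyer like "bob" ends up with frequency 0 and recency 0, which is the case BTYD models use to reason about whether a customer has quietly churned.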

All rights reserved ProtoGrowth Inc, India. We have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
