Data Engineering Weekly #124
The Weekly Data Engineering Newsletter
Contribute to the Rudderstack Transformations Library, Win $1000
Editor’s Note: Last Call for Data Council & Get 30% off on Real-Time Analytics Summit 2023
I’m excited to attend this year’s Data Council Austin conference. Last year around this time, Bundling vs. Unbundling was the talk of the town. With the current economic climate, I suppose there is a consensus that bundling is inevitable. I’m excited to listen to all the excellent speakers at the conference, and I’m looking forward to a week of learning.
Data Engineering Weekly readers get a 20% discount by applying
Promo Code: DataWeekly20
Data Council website: https://www.datacouncil.ai/austin
The Real-Time Analytics Summit is on April 25-26 in downtown San Francisco, CA. Come and hear talks from companies like StarTree, Confluent, LinkedIn, DoorDash, Imply, and Uber on how they are advancing the state of the art in user-facing analytics delivered instantly.
Go to rtasummit.com and register with DEW30 for 30% off.
dbt: State of Analytics Engineering
dbt publishes the State of Analytics [data???🤔] Engineering report. If you follow Data Engineering Weekly, you know we actively talk about data contracts and how data is a collaboration problem, not just an ETL problem. The survey validates this: two of the top 5 concerns are data ownership and collaboration between data producers and consumers. Here are the top 5 key learnings from the report.
46% of respondents plan to invest more in data quality and observability this year, the most popular area for future investment.
Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
Data engineers and analytics engineers are most likely to believe they have clear goals, and most likely to agree their work is valued.
71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”
Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of an industrial revolution in computing.
Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT, and computing to date look like the spinning jenny that was the start of the industrial revolution.
May the best LLM win!! 🤺🤺🤺🤺🤺🤺
Redpoint: Introducing the Redpoint Open-source Top 25
Open source has revolutionized enterprise applications by enabling rapid innovation, collaboration, and cost reduction while increasing transparency and interoperability. Redpoint Ventures writes about the top 25 open-source companies ranked by adoption, momentum, usage, and health.
NYT: Day in the Life of a Senior Analyst in the Data and Insights Group
NYT publishes an article on a day in the life of a senior analyst. The blog highlights that the job is not just writing SQL but providing strategic business solutions for the organization.
Stitch Fix: A New Era of Creativity: Expert-in-the-loop Generative AI at Stitch Fix
Stitch Fix writes about the efficient use of generative AI for their marketing and product content generation. I predict that generative AI will greatly impact last-mile analytics delivery. The BI dashboard tools could be simple prompts. It is an exciting breakthrough to watch and adapt.
Sponsored: [Webinar] How to Scale Data Reliability
Learn how Blend, a cloud infrastructure platform powering digital experiences for some of the world’s largest financial institutions, combined cloud-based data transformations and data observability to deliver trustworthy insights faster.
LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam
One of the curses of adopting the Lambda Architecture is the need to rewrite business logic in both the streaming and batch pipelines. Spark attempts to solve this with a unified RDD model for streaming and batch; Flink introduces the Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, which offers a unified pipeline abstraction that can run on any target data processing runtime, such as Samza, Spark, and Flink.
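To make the "write the business logic once" idea concrete, here is a minimal plain-Python sketch. It deliberately does not use Apache Beam's API; the function names (`count_views_per_member`, `run_batch`, `run_streaming`) and the event shape are hypothetical and only illustrate the principle of one transform shared by a batch runner and a streaming runner.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def count_views_per_member(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Shared business logic: yield a running (member, count) per page view."""
    counts: Dict[str, int] = {}
    for event in events:
        if event.get("type") != "page_view":
            continue  # ignore every other event type
        member = event["member"]
        counts[member] = counts.get(member, 0) + 1
        yield member, counts[member]


def run_batch(events: List[dict]) -> Dict[str, int]:
    """Batch runner: consume the whole bounded input and keep the final counts."""
    # dict() keeps the last value seen for each key, i.e. the final count.
    return dict(count_views_per_member(events))


def run_streaming(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Streaming runner: emit each updated count as soon as an event arrives."""
    return count_views_per_member(events)
```

Both runners call the same transform, so a change to the counting rule lands in one place. A real unified framework adds windowing, state, and distributed execution on top of this idea.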
Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices
Wix writes about managing schemas for 2000 (😬) microservices by standardizing the schema structure with protobuf and the Kafka schema registry. Some exciting reads include patterns like an internal Wix Docs approach and the integration of documentation publishing into the CI/CD pipelines.
Instacart: Distributed Machine Learning at Instacart
There is much talk about the impact of Machine Learning; however, a stable underlying infrastructure is vital to realize the true potential of Machine Learning. Instacart writes about its usage of Ray and demonstrates how hosting a monolithic service as the computation backend for all distributed ML applications has limitations in scalability, efficiency, and diversity, given the rapidly evolving and highly diversified nature of ML applications.
Sponsored: The Data Stack Show Meet and Greet at Data Council Austin
Headed to Data Council Austin? Join The Data Stack Show at Scholz Garten on Wednesday, March 29th, for a night of bowling, beers, and brats. You'll have a chance to meet hosts Eric and Kostas in person and nerd out on data with some new friends.
Airbnb: Building Airbnb Categories with ML & Human in the Loop
Airbnb Engineering writes about creating listing categories using machine learning and human expertise. The "Human-in-the-Loop" strategy involves training models on a rich dataset of user-generated content, then utilizing human reviewers to validate and refine these models iteratively. Airbnb highlights that the Human-in-the-Loop approach improved the accuracy of its category predictions and the user experience on the platform.
DoorDash: Using CockroachDB to Reduce Feature Store Costs by 75%
DoorDash writes about switching its feature store from Redis to CockroachDB for efficiency and cost savings. One pain point DoorDash highlights is the operational overhead of resizing a Redis cluster, which required the following steps:
Spin up a Redis cluster with the desired number of nodes from the most recent daily backup
Replay all of the writes from the last day on the new cluster
Switch over traffic to the new cluster
Delete the old cluster
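The four steps above can be sketched as a toy in-memory simulation. This is not DoorDash's tooling and makes no real Redis calls; the function `resize_redis_cluster`, the dict-based "clusters", and the `router` object are all hypothetical stand-ins meant only to show the order of operations.

```python
from typing import Any, Dict, List, Tuple


def resize_redis_cluster(
    router: Dict[str, Any],
    daily_backup: Dict[str, Any],
    writes_since_backup: List[Tuple[str, Any]],
    node_count: int,
) -> Dict[str, Any]:
    """Simulate a backup-restore-and-replay resize (hypothetical model)."""
    old_cluster = router["active"]

    # Step 1: spin up a new cluster of the desired size from the daily backup.
    new_cluster = {"nodes": node_count, "data": dict(daily_backup), "deleted": False}

    # Step 2: replay the last day's writes in their original order.
    for key, value in writes_since_backup:
        new_cluster["data"][key] = value

    # Step 3: switch traffic over to the new cluster.
    router["active"] = new_cluster

    # Step 4: delete the old cluster.
    old_cluster["data"].clear()
    old_cluster["deleted"] = True
    return new_cluster
```

Even in this toy form, the procedure needs a full second cluster and a replayable write log for every resize, which hints at why it is costly to run in practice.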
Lyft: lyft2vec — Embeddings at Lyft
Lyft Engineering introduces Lyft2Vec, an embedding framework representing various entities within the Lyft ecosystem. Leveraging the power of graph-based embeddings, Lyft2Vec captures the relationships and interactions between different entities, such as drivers, riders, and locations. This approach enables Lyft to optimize its services, including dispatching, pricing, and routing. The article outlines the challenges faced, the methodology, and the results achieved, showcasing the potential of Lyft2Vec in enhancing the platform's overall performance.
Canva: Understanding a Diverse User Base with Frequency Segmentation at Scale
Canva writes about frequency segmentation at scale using the Buy-Till-You-Die (BTYD) model. The blog taught me a lot about measuring customer lifetime value and segmenting customers by buying behavior. The BTYD model is excellent for building a recommendation engine and marketing personalization.
Adam Brownell writes an in-depth overview of the BTYD model in the blog "Customer Behavior Modeling: Buy-til-you-Die Models - A brief intro to the BTYD family, Pareto/NBD, & Pareto/GGG for Predicting Buying Behavior."
All rights reserved ProtoGrowth Inc, India. We have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.