DROP the Modern Data Stack
It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.
Matt Weingarten: Data & AI Summit Takeaways
The Databricks Data & AI Summit 2022 brought some exciting talks and announcements. The improvements to Delta Sharing, similar to Snowflake and Redshift data sharing, are an exciting development across cloud data stores.
Matt wrote an excellent summary of key takeaways from the Data & AI Summit.
Part 1: https://medium.com/@matt_weingarten/data-ai-summit-takeaways-part-i-ed1da1aa8125
Part 2: https://medium.com/@matt_weingarten/data-ai-summit-takeaways-part-ii-2f940eee2487
One critical development among the announcements is Spark Connect, a “thin client” that enables Spark query capability on low-compute devices. Stan writes an excellent summary of Spark Connect.
Apache Airflow: Airflow Survey 2022
Apache Airflow released its 2022 survey capturing developers’ thoughts about the current state of Apache Airflow. Data-driven scheduling and signal-based scheduling seem to be the most wanted features. I wish more open source projects published their user surveys to bring more openness.
https://airflow.apache.org/blog/airflow-survey-2022/
Ilan Man: People-first Data stacks
Is the Modern Data Stack(?) solving the right problem? The author raises an interesting question: can it solve people's problems?
My take on this: I think it is a bit tricky to answer. From a systems thinking perspective, the tools provide a feedback loop, e.g., how slow is the query running? Which dashboards are used most frequently? Every company's workflow and culture is different. It's up to the data team to use the feedback loop and create a balancing action on the workflow. The workflow defines the culture of an org, and I don't think any one vendor can provide an out-of-the-box solution to this problem.
https://locallyoptimistic.com/post/people-first-data-stacks/
Salma Bakouk: Things I Wish I Knew When I Was Building a Data Team
The author writes an excellent overview of how to think about building a data team: start with the right objective [vision & mission], then build a team structure that enables execution, followed by the tech stack and hiring strategies.
https://towardsdatascience.com/things-i-wish-i-knew-when-i-was-building-a-data-team-efcb43591204
Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end users love.
Nagesh Singh Chauhan: Model Evaluation Metrics in Machine Learning
ML model evaluation metrics aim to estimate how well a model will perform on unseen data. The author discusses various evaluation metrics one could use for model evaluation.
https://www.kdnuggets.com/2020/05/model-evaluation-metrics-machine-learning.html
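To make the idea concrete, here is a minimal sketch of four widely used binary classification metrics (accuracy, precision, recall, and F1) computed from a confusion matrix. This is an illustration, not code from the article; the function name `binary_classification_metrics` and the sample labels are assumptions made for the example.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels.

    Hypothetical helper for illustration; labels must be 0 or 1.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    # Guard every ratio against a zero denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Held-out labels vs. model predictions (made-up sample data).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_classification_metrics(y_true, y_pred))
```

The metrics only say something about unseen data if `y_true` and `y_pred` come from a held-out set that the model never saw during training.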
Evidently AI: Which test is the best? We compared 5 methods to detect data drift on large datasets.
Data drift occurs when the dataset used to train your model no longer resembles the data you receive in production. An unexpected and undocumented change to the data structure, semantics, or infrastructure can cause data drift. The article discusses various testing strategies to identify data drift.
https://evidentlyai.com/blog/data-drift-detection-large-datasets
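As a rough illustration of what a drift check looks like, the sketch below computes the Population Stability Index (PSI), one commonly used drift statistic, between a reference (training) sample and a current (production) sample of a single numeric feature. It is not taken from the article; the function name, the equal-width binning, and the `eps` floor for empty bins are assumptions made for this example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins from the reference.

    Hypothetical helper for illustration. `eps` is an assumed floor that
    keeps empty bins from producing log(0) or division by zero.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    # Each term is non-negative, so PSI is 0 only when the bins match.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    # Made-up data: production values are shifted toward the upper half.
    train = [i / 100 for i in range(100)]
    prod = [0.5 + i / 200 for i in range(100)]
    print("no drift:", population_stability_index(train, train))
    print("shifted: ", population_stability_index(train, prod))
```

A larger PSI means the binned distributions differ more. Where to draw the alerting threshold depends on the feature and the sample size, so it is left out of the sketch.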
Sponsored: Rudderstack - What is the Machine Learning Stack?
A detailed guide to building the Machine Learning Stack—an architecture to help you take your first steps into the world of ML and move from historical analytics to predictive analysis. The ML stack is phase three of RudderStack's Data Maturity Journey framework.
https://www.rudderstack.com/blog/what-is-the-ml-stack
Halodoc: Data Model Adoption at Halodoc
The importance of data modeling has been highlighted often recently. Halodoc writes about its data modeling strategy and the choice of a Hybrid Data model with a combination of Dimensions, Facts, and denormalized structure.
https://blogs.halodoc.io/data-model-adoption-at-halodoc/
Zomato: Powering Zomato’s data analytics using Trino
Zomato writes about its Trino infrastructure and how it enables data analytics across the org. The experimentation around scaling needs and the design for handling skewed traffic make for an exciting read.
https://www.zomato.com/blog/powering-data-analytics-with-trino
Bazaar: Why Apache Pinot is Worth the Hype? Bazaar’s Success Story
Bazaar writes about its adoption of Apache Pinot. The blog post gives an overview of Apache Pinot’s architecture and its integration with the data tools ecosystem.
https://medium.com/@fizzabid96/why-apache-pinot-is-worth-the-hype-bazaars-success-story-1f90b5212fe4
AWS: Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads
Disaster recovery is a critical aspect of data reliability engineering for ensuring business continuity. AWS writes about disaster recovery design strategies for Spark workloads running on Amazon EMR on Amazon EC2.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.