Data Engineering Weekly #91

The Weekly Data Engineering Newsletter

Jul 04, 2022

DROP the Modern Data Stack

It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.

Matt Weingarten: Data & AI Summit Takeaways

The Databricks Data & AI Summit 2022 brought some exciting talks and announcements. The improvements to Delta Sharing, similar to Snowflake and Redshift data sharing, are an exciting development across the cloud data stores.

Ananth Packkildurai (@ananthdurai)

With @databricks announcing delta sharing, the cloud data warehouses are building a compelling case for all the B2B SaaS businesses not to build their own reporting solutions. It will be interesting to see how customer-facing analytics change in a couple of years!!

9:53 PM · Jun 28, 2022


Matt wrote an excellent summary of key takeaways from the Databricks Data & AI Summit.

Part 1: https://medium.com/@matt_weingarten/data-ai-summit-takeaways-part-i-ed1da1aa8125

Part 2: https://medium.com/@matt_weingarten/data-ai-summit-takeaways-part-ii-2f940eee2487

One critical development among the announcements is Spark Connect, a thin client that enables Spark query capability on low-compute devices. Stan writes an excellent summary of Spark Connect.

https://medium.com/@coderstan/understanding-spark-connect-reynold-xins-keynote-on-data-ai-summit-2022-6b9e1b5e1fd9

Apache Airflow: Airflow Survey 2022

Apache Airflow released its 2022 survey, capturing developers’ thoughts about the current state of Apache Airflow. Data-driven scheduling and signal-based scheduling seem to be the most wanted features. I wish more open source projects published their user surveys to bring more openness.

https://airflow.apache.org/blog/airflow-survey-2022/

Ilan Man: People-first Data stacks

Is the Modern Data Stack(?) solving the right problem? The author raises an interesting question: can it solve people's problems?

My take on this: I think it is a bit tricky to answer. From a typical systems-thinking perspective, the tools provide a feedback loop, e.g., how slow is a query running? Which dashboard is used most frequently? Every company's workflow and culture is different. It's up to the data team to use the feedback loop and create a balancing action on the workflow. The workflow defines the culture of an org, and I don't think any one vendor can provide an out-of-the-box solution to this problem.

https://locallyoptimistic.com/post/people-first-data-stacks/

Salma Bakouk: Things I Wish I Knew When I Was Building a Data Team

The author writes an excellent overview of how to think about building a data team. Start with the right objective [vision & mission], then build the team structure, tech stack, and hiring strategies that enable the execution.

https://towardsdatascience.com/things-i-wish-i-knew-when-i-was-building-a-data-team-efcb43591204

Nagesh Singh Chauhan: Model Evaluation Metrics in Machine Learning

ML model evaluation metrics aim to estimate how well a model will perform on unseen data. The author discusses various metrics one could use for model evaluation.

https://www.kdnuggets.com/2020/05/model-evaluation-metrics-machine-learning.html
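
To make the idea concrete, here is a minimal sketch of a few widely used binary classification metrics (accuracy, precision, recall, and F1) computed on a held-out set. The function name and the toy labels are assumptions for illustration only, not taken from the article, and the article covers more metrics than these.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels.

    Hypothetical helper for illustration; labels must be 0 or 1.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy held-out labels and predictions (made up for this sketch).
    held_out_labels = [1, 0, 1, 1, 0, 1, 0, 0]
    predictions = [1, 0, 0, 1, 1, 1, 0, 0]
    print(binary_classification_metrics(held_out_labels, predictions))
```

Which metric matters most depends on the cost of false positives versus false negatives, so accuracy alone can mislead on imbalanced data.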

Evidently AI: Which test is the best? We compared 5 methods to detect data drift on large datasets.

Data drift occurs when the data your model receives in production no longer resembles the dataset used to train it. An unexpected and undocumented change to the data structure, semantics, or infrastructure can potentially cause data drift. The article discusses various testing strategies to identify data drift.

https://evidentlyai.com/blog/data-drift-detection-large-datasets
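
As a rough illustration of what a drift check looks like, here is a minimal sketch of the Population Stability Index (PSI), a common drift statistic that compares how a numeric feature is distributed across bins in a reference (training) sample versus a current (production) sample. The function name, bin count, and smoothing constant are assumptions chosen for this sketch; it is not the article's implementation, and the article's own comparison and conclusions are in the link above.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two numeric samples using equal-width bins.

    Bin edges come from the reference sample; `bins` and `eps` are
    illustrative defaults, not recommendations from the article.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bucket_shares(reference)
    cur_shares = bucket_shares(current)
    return sum(
        (c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares)
    )


if __name__ == "__main__":
    # Synthetic data: training values spread over [0, 1),
    # production values squeezed into the upper half.
    training = [i / 100 for i in range(100)]
    production = [0.5 + i / 200 for i in range(100)]
    print("same data:", population_stability_index(training, training))
    print("shifted  :", population_stability_index(training, production))
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift. Where to set the alert threshold is a judgment call that depends on the feature and the dataset size.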

Halodoc: Data Model Adoption at Halodoc

The importance of data modeling has been highlighted often recently. Halodoc writes about its data modeling strategy and its choice of a hybrid data model that combines dimensions, facts, and denormalized structures.

https://blogs.halodoc.io/data-model-adoption-at-halodoc/

Zomato: Powering Zomato’s data analytics using Trino

Zomato writes about its Trino infrastructure and how it enables data analytics across the org. The experimentation around the scaling needs and design to handle skewed traffic is an exciting read.

https://www.zomato.com/blog/powering-data-analytics-with-trino

Bazaar: Why Apache Pinot is Worth the Hype? Bazaar’s Success Story

Bazaar writes about its adoption of Apache Pinot. The blog gives an overview of Apache Pinot’s architecture and its integration with the data tools ecosystem.

https://medium.com/@fizzabid96/why-apache-pinot-is-worth-the-hype-bazaars-success-story-1f90b5212fe4

AWS: Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads

Disaster recovery is a critical aspect of data reliability engineering that ensures business continuity. AWS writes about strategies for disaster recovery designs with Amazon EMR on Amazon EC2 for Spark workloads.

https://aws.amazon.com/blogs/big-data/disaster-recovery-considerations-with-amazon-emr-on-amazon-ec2-for-spark-workloads/

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
