Data Engineering Weekly #112

The Weekly Data Engineering Newsletter

Dec 18, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

McKinsey: The state of AI in 2022—and a half decade in review

McKinsey publishes its state of AI in 2022 report, along with a review of the last five years. A few highlights:

63 percent of respondents say they expect their organizations’ investments to increase in AI over the next three years.
Today, the biggest reported revenue effects are found in marketing and sales, product and service development, and strategy and corporate finance, and respondents report the highest cost benefits from AI in supply chain management.
The tech talent shortage shows no sign of easing, threatening to slow that shift for some companies. A majority of respondents report difficulty in hiring for each AI-related role in the past year, and most say it either wasn’t any easier or was more difficult to acquire this talent than in years past.

https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review

Jacob Matson: Modern Data Stack in a Box with DuckDB

One important characteristic of data infrastructure is that the more recent the data, the more frequently it is accessed. Given that characteristic, do we really have a “Big Data” problem? Can we spin up a single machine with the whole data stack and run the analysis on it? The author writes an exciting blog, Modern Data Stack in a Box!

https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html

Data Engineering Central: Why is everyone trying to kill Airflow?

Airflow is probably one of the top five breakthrough data technologies of the last ten years. The author narrates today's competitive landscape for orchestration engines by comparing some of the pros and cons of Airflow as it stands today.

https://dataengineeringcentral.substack.com/p/why-is-everyone-trying-to-kill-airflow

Confessions of a Data Guy: Dataframe Showdown – Polars vs. Spark vs. Pandas vs. DataFusion. Guess who wins?

The dataframe is a mainstream data abstraction now, and as its popularity increases, so does innovation in the tools that run dataframe workloads efficiently. In the author's test results, the Polars implementation performs much better than Apache Spark.

https://www.confessionsofadataguy.com/dataframe-showdown-polars-vs-spark-vs-pandas-vs-datafusion-guess-who-wins/

Wayve: Wayve's End-to-End Deep Learning Model for Self-Driving Cars

Wayve, an autonomous driving technology company built on computer vision and machine learning, writes about its end-to-end deep learning model for self-driving cars. I found the tech forum from Scale AI very informative about the various approaches to building self-driving cars.

https://www.infoq.com/news/2022/12/wayve-deep-learning-model/

Percona: JSON and Relational Databases – Part One

Whether we like it or not, most data engineering and modeling challenges will be handling semi-structured data in the coming years.

SaaS companies like Salesforce and Zendesk are increasingly processing and emitting semi-structured data. We have already seen systems like Apache Pinot and Apache Druid improve their JSON support. The Percona blog walks through JSON support in relational databases.

https://www.percona.com/blog/json-and-relational-databases-part-one/

Etsy: Mitigating the winner’s curse in online experiments

I enjoy reading Etsy's blogs about A/B testing. TIL about the winner’s curse in experimentation, and the blog narrates how Etsy approaches mitigating it.

https://www.etsy.com/codeascraft/mitigating-the-winners-curse-in-online-experiments
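
For readers new to the winner's curse, here is a minimal simulation sketch. It is not Etsy's method. It assumes every experiment has the same small true lift and the same standard error (both values are made up for illustration) and that we only "launch" experiments whose observed lift clears a significance cutoff. The average observed lift among those winners ends up far above the true lift.

```python
import random
import statistics


def simulate_winners_curse(true_lift=0.1, std_error=1.0, n_experiments=20000, seed=7):
    """Simulate many A/B tests that share one true lift.

    Returns (mean observed lift over all tests, mean observed lift over winners).
    All parameter values are illustrative assumptions, not real experiment data.
    """
    rng = random.Random(seed)
    # Roughly the cutoff for a two-sided 5% significance test.
    threshold = 1.96 * std_error
    # Each observed lift is the true lift plus sampling noise.
    observed = [rng.gauss(true_lift, std_error) for _ in range(n_experiments)]
    # "Winners" are the experiments we would have declared significant and launched.
    winners = [lift for lift in observed if lift > threshold]
    return statistics.mean(observed), statistics.mean(winners)


all_mean, winner_mean = simulate_winners_curse()
print(f"mean observed lift, all experiments: {all_mean:.3f}")
print(f"mean observed lift, winners only:    {winner_mean:.3f}")
```

Averaged over all experiments, the observed lift stays close to the true lift. Conditioning on significance is what inflates the estimate, because every winner must exceed the cutoff by construction.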

Neil Raden: We need a real semantic layer - but something is missing

Will the semantic layer create more problems than it solves? The author explains the problem with customer mapping. Who is a customer? The question remains the same but gets a different answer from marketing, sales, and product. The author gives a fresh perspective on the semantic layer!

https://diginomica.com/we-need-real-semantic-layer-something-missing

Motherbrain: Disrupting private capital using machine learning and an event-driven architecture

The blog is an exciting one, giving a peek into how a private capital firm approaches finding startup investments. The blog doesn’t reveal the data sources they consume, but I'm curious:

What are the data sources that private venture capital firms depend on? Let me know in the comments or DM me on LinkedIn.

https://motherbrain.ai/disrupting-private-capital-using-machine-learning-and-an-event-driven-architecture-a966c66ac93a

Monzo: Building an extension framework for dbt

Possibly one of the most brilliant pieces of engineering I read this year.

Kudos to the Monzo data team. The blog narrates bringing a platform approach to dbt, lessons learned, backtracking, and pragmatic hacking into dbt core to build the extension framework. A great joy to read.

I hope we will see dbt-core support the extension framework out of the box.

https://monzo.com/blog/2022/12/15/building-an-extension-framework-for-dbt

Shopify: 3 (More) Tips for Optimizing Apache Flink Applications

Shopify writes three more practical tips for optimizing Apache Flink applications. TIL about Hybrid Source support in Apache Flink and its role in backfilling. I recently had to design for a similar problem and arrived at a vaguely similar strategy, but I thought it might be complicated. Seeing Shopify implement it gives me hope to explore the option further. Thank you, Shopify data team.

https://shopifyengineering.myshopify.com/blogs/engineering/optimizing-apache-flink-tips-part-two

First part: https://shopify.engineering/optimizing-apache-flink-applications-tips

Microsoft: Search and ranking for information retrieval (IR)

The blog is a good summary of the search and ranking problem domain. The author narrates techniques for finding the best-matching documents [search] and ordering them [rank]. TIL about Pointwise, Pairwise, and Listwise learning methods.

https://medium.com/data-science-at-microsoft/search-and-ranking-for-information-retrieval-ir-5f9ca52dd056

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
