Data Engineering Weekly

Share this post
Data Engineering Weekly #112
www.dataengineeringweekly.com

Data Engineering Weekly #112

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Dec 18, 2022
10
Share this post
Data Engineering Weekly #112
www.dataengineeringweekly.com

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


McKinsey: The state of AI in 2022—and a half decade in review

McKinsey publishes the state of AI in 2022 with the last five years’ review. A few highlights

  • 63 percent of respondents say they expect their organizations’ investments to increase in AI over the next three years.

  • Today, the biggest reported revenue effects are found in marketing and sales, product and service development, and strategy and corporate finance, and respondents report the highest cost benefits from AI in supply chain management

  • The tech talent shortage shows no sign of easing, threatening to slow that shift for some companies. A majority of respondents report difficulty in hiring for each AI-related role in the past year, and most say it either wasn’t any easier or was more difficult to acquire this talent than in years past

https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review


Jacob Matson: Modern Data Stack in a Box with DuckDB

One important characteristic of the data infrastructure is that the more recent the data more frequent the access. Given the characteristic, are we having a “Big Data” problem? Can we spin off a machine with all the data stack and run through the analysis? The author writes an exciting blog, Modern data stack in a Box!!

https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html


Data Engineering Central: Why is everyone trying to kill Airflow?

Airflow is probably one of the Top 5 breakthrough data technology in the last ten years. The author narrates the competitive landscape in the orchestration engine today by comparing some of the pros and cons of Airflow as its stands today.

https://dataengineeringcentral.substack.com/p/why-is-everyone-trying-to-kill-airflow


Confessions of a Data Guy: Dataframe Showdown – Polars vs. Spark vs. Pandas vs. DataFusion. Guess who wins?

Dataframe is mainstream data abstraction now, and as the popularity increases, the innovation around the tools to run efficiently increases. Looking at the test results, Polars implementation performs much better than Apache Spark.

https://www.confessionsofadataguy.com/dataframe-showdown-polars-vs-spark-vs-pandas-vs-datafusion-guess-who-wins/


Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform.

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Wayve: Wayve's End-to-End Deep Learning Model for Self-Driving Cars

Wayve, the autonomous driving technology based on computer vision and machine learning, writes about its end-to-end deep learning model for self-driving cars. I found the tech forum from Scale AI very informative about the various approaches in self-driving car efforts.

https://www.infoq.com/news/2022/12/wayve-deep-learning-model/


Percona: JSON and Relational Databases – Part One

Whether we like it or not, most data engineering and modeling challenges will be handling semi-structured data in the coming years.

SaaS companies like Salesforce and Zendesk are increasingly processing and emitting sem-structure data. We have already seen systems like Apache Pinot; Apache Druid improves their JSON support. The Percona blog walkthrough JSON support in the relational databases.

https://www.percona.com/blog/json-and-relational-databases-part-one/


Sponsored: Monte Carlo - [New Guide] Data Pipeline Monitoring 101

Minimize data downtime, and maximize data trust. As data becomes increasingly important to modern companies, it’s crucial for it to be trusted and accessible. Learn how to stop missing incidents and spending precious engineering hours maintaining static tests and learn how data pipeline monitoring can help take your team to the next level by accessing this new "Data Pipeline Monitoring 10" guide.

Get The Guide


Etsy: Mitigating the winner’s curse in online experiments

I enjoy reading Etsy blogs about A/B testing, TIL about the winner’s curse in the experimentation, and the blog narrates how Etsy approaches to mitigate the winner’s curse.

https://www.etsy.com/codeascraft/mitigating-the-winners-curse-in-online-experiments


Neil Raden: We need a real semantic layer - but something is missing

Will the semantic layer induce more challenges than the problem it solves? The author explains the problem with customer mapping. Who is a customer? The question remains the same but will have a different answer from marketing, sales, and products. The author gives a fresh perspective to the semantic layer!!

https://diginomica.com/we-need-real-semantic-layer-something-missing


Sponsored: Take Control of Your Customer Data With RudderStack

Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data. 

Take control of your customer data today.


Motherbrain: Disrupting private capital using machine learning and an event-driven architecture

The blog is an exciting one giving a peak into the private capital ventures approach to finding startup investment strategy. The blog doesn’t leave any traces of the data sources they consume, but curious

What are the data sources the private venture capital firms depend on? Let me know in the comments or DM me on LinkedIn

https://motherbrain.ai/disrupting-private-capital-using-machine-learning-and-an-event-driven-architecture-a966c66ac93a.


Monzo: Building an extension framework for dbt

Possibly one of the most brilliant pieces of engineering I read this year

Kudos to the Monzo data team. The blog narrates bringing a platform approach to dbt, lessons learned, tracking back, and pragmatic hacking into dbt core to build the extension framework; A great joy to read.

I hope we will see dbt-core support the extension framework out of the box

https://monzo.com/blog/2022/12/15/building-an-extension-framework-for-dbt.


Shopify: 3 (More) Tips for Optimizing Apache Flink Applications

Shopify writes three more practical tips for optimizing Apache Flink. TIL about the Hybrid Source support from Apache Flink and the role in Backfilling. I recently had to design for a similar problem and vaguely arrived at a similar strategy, but I thought it might be complicated. Seeing Shopify implement it gives much hope to explore the option further. Thank you, Shopify data team.

https://shopifyengineering.myshopify.com/blogs/engineering/optimizing-apache-flink-tips-part-two

First part: https://shopify.engineering/optimizing-apache-flink-applications-tips


Microsoft: Search and ranking for information retrieval (IR)

The blog is a good summarization of the searching and ranking problem domain. The author narrates techniques to adopt finding the best matching document [search] and order them [rank]. TIL about Pointwise, Pairwise, and Listwise learning methods.

https://medium.com/data-science-at-microsoft/search-and-ranking-for-information-retrieval-ir-5f9ca52dd056


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Share this post
Data Engineering Weekly #112
www.dataengineeringweekly.com
Comments
TopNewCommunity

No posts

Ready for more?

© 2023 Ananth Packkildurai
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing