Data Engineering Weekly #77

Weekly Data Engineering Newsletter

Mar 07, 2022

Data Council - Austin 2022

Data Council has published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount with the promo code DataWeekly20.

https://www.datacouncil.ai/austin

Datanami: Harvard’s New Data Storage Is to Dye For, Avoids DNA Storage Pitfalls

The explosion in data collection makes storing enormous amounts of data challenging, particularly for archival data. Harvard researchers introduce a new medium for long-term storage: dye!

https://www.datanami.com/2022/02/14/harvards-new-data-storage-is-to-dye-for-avoids-dna-storage-pitfalls/

LinkedIn: Near real-time features for near real-time personalization

Establishing a faster feedback loop is vital in developing a recommendation engine. LinkedIn writes about using Samza SQL and Apache Pinot to build near real-time personalization.

https://engineering.linkedin.com/blog/2022/near-real-time-features-for-near-real-time-personalization

eBay: Building a Deep Learning-Based Retrieval System for Personalized Recommendations

Along similar lines to the LinkedIn blog above, eBay writes about the maturity phases of its recommendation engine. The blog narrates how eBay's architecture evolved from batch-only, to batch plus near real-time, to a near-real-time (NRT) system.

https://tech.ebayinc.com/engineering/building-a-deep-learning-based-retrieval-system-for-personalized-recommendations/

Spotify: Search Journey Towards Better Experimentation Practices

Spotify writes about the adoption of experimentation in its search product. Every product goes through a technology adoption lifecycle (the Chasm theory), yet we rarely talk about it. Spotify narrates the adoption curve and the importance of building and maintaining momentum in adoption.

https://engineering.atspotify.com/2022/02/search-journey-towards-better-experimentation-practices/

Spotify’s New Experimentation Platform (Part 1)

Spotify’s New Experimentation Platform (Part 2)

Future: Why SQL Needs Software Libraries

Over the last two decades, the industry has tried to invent alternatives to SQL with no success. The lack of software libraries and the difficulty of distributing SQL code are some of SQL's significant shortcomings. It's an exciting conversation on SQL, software libraries, and dbt.

https://future.a16z.com/sql-needs-software-libraries/

Aliaksei Mikhailiuk: Nine Tools I Wish I Mastered before My Ph.D. in Machine Learning

A good collection of tools to learn before starting to work on machine learning & AI engineering at an industrial scale.

Question to the readers: What would be the top 9 tools you wish you had learned before entering data/analytical engineering? Please tweet back to @data_weekly.

https://towardsdatascience.com/nine-tools-i-wish-i-mastered-before-my-phd-in-machine-learning-708c6dcb2fb0

Mihail Eric: MLOps Is a Mess, But That's to be Expected

The ML & data landscape is fragmented, with each tool trying to solve a niche problem. The author narrates the current state of MLOps and offers a few predictions, including increasing consolidation around end-to-end platforms, similar to the bundling & unbundling conversation in the data landscape.

https://www.mihaileric.com/posts/mlops-is-a-mess/

James Le: What I Learned From Attending Tecton apply(meetup) 2022

The author shares notes from the Tecton conference. I didn't have a chance to go through the full notes, but there is a ton to learn. Thanks, James, for sharing your notes.

https://data-notes.co/what-i-learned-from-attending-tecton-apply-meetup-2022-4b7be87e2f17

Monte Carlo: Building End-to-End Field Level Lineage for Modern Data Systems

Data lineage is a critical connector to establish end-to-end observability and explainability of the analytical pipeline. Monte Carlo writes about the importance of the column-level lineage of the SQL pipeline and the design journey to establish observability.

There is a lot of talk and effort around "Explainable AI," but have we achieved "Explainable Analytics"? Is there any tool a salesperson can use to understand the business logic of an ARR computation without understanding SQL? If you have found anything, please tweet @data_weekly.

https://www.infoq.com/articles/field-level-lineage-modern-data-systems/

Expedia: Handling Incompatible Schema Changes with Avro

A backward-incompatible schema change is painful; I still remember fixing an incompatible Thrift change to make sure future backfills would not break. Expedia writes an exciting blog that narrates how to handle incompatible Avro schema changes.

https://medium.com/expedia-group-tech/handling-incompatible-schema-changes-with-avro-2bc147e26770
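For readers new to the topic, here is a minimal sketch of one common source of incompatibility: a reader schema that adds a field without a default cannot resolve records written with the older schema. This is not Expedia's approach from the blog; the `Booking` record, its fields, and the helper `unresolvable_fields` are hypothetical names for illustration, and the check deliberately ignores aliases, type promotion, and nested records.

```python
def unresolvable_fields(writer_schema, reader_schema):
    """Return names of reader fields that old data cannot satisfy.

    Simplified rule: a reader field that is absent from the writer
    schema must declare a default, otherwise resolution fails.
    """
    writer_names = {field["name"] for field in writer_schema["fields"]}
    return [
        field["name"]
        for field in reader_schema["fields"]
        if field["name"] not in writer_names and "default" not in field
    ]


# Hypothetical schemas, written as plain dicts in Avro's JSON layout.
old_schema = {
    "type": "record",
    "name": "Booking",
    "fields": [{"name": "id", "type": "string"}],
}

# Adding a field WITH a default: old records can still be read.
compatible_schema = {
    "type": "record",
    "name": "Booking",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

# Adding a field WITHOUT a default: old records cannot be resolved.
incompatible_schema = {
    "type": "record",
    "name": "Booking",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "currency", "type": "string"},
    ],
}

print(unresolvable_fields(old_schema, compatible_schema))
print(unresolvable_fields(old_schema, incompatible_schema))
```

The first call reports no problem fields, while the second flags `currency`. A schema registry usually performs a much more thorough version of this check before accepting a new schema.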

PyCaret: Multiple Time Series Forecasting with PyCaret

TIL about PyCaret, an open-source, low-code machine learning library and end-to-end model management tool built in Python for automating machine learning workflows.

https://www.kdnuggets.com/2021/04/multiple-time-series-forecasting-pycaret.html
