Data Engineering Weekly #77
Weekly Data Engineering Newsletter
Data Council - Austin 2022
Data Council published the Austin 2022 schedule here. Data Engineering Weekly readers can get a 20% discount using promo code
Datanami: Harvard’s New Data Storage Is to Dye For, Avoids DNA Storage Pitfalls
The explosion in data collection has made storing enormous amounts of data challenging, particularly for archival data. Harvard researchers introduce a new container for long-term storage: dye.
LinkedIn: Near real-time features for near real-time personalization
Establishing a faster feedback loop is vital in developing a recommendation engine. LinkedIn writes about using Samza SQL and Apache Pinot to build near real-time personalization.
eBay: Building a Deep Learning-Based Retrieval System for Personalized Recommendations
On a similar note to the LinkedIn blog above, eBay writes about the maturity phases of its recommendation engine. The blog narrates the architecture styles eBay adopted, moving from batch-only, to batch plus near real-time, to a near-real-time (NRT) system.
Spotify: Search Journey Towards Better Experimentation Practices
Spotify writes about the adoption of experimentation in its search product. Any product goes through a technology adoption lifecycle (Chasm theory), yet we rarely talk about it. Spotify narrates the adoption curve and the importance of starting and maintaining momentum in adoption.
Future: Why SQL Needs Software Libraries
In the last two decades, the industry has tried to invent an alternative to SQL with no success. The lack of software libraries and the limitations in distributing SQL are some of its significant shortcomings. It is an exciting conversation on SQL, software libraries, and dbt.
Aliaksei Mikhailiuk: Nine Tools I Wish I Mastered before My Ph.D. in Machine Learning
A good collection of tools to learn before starting to work on machine learning and AI engineering at industrial scale.
Question to the readers: What would be the top 9 tools you wish you had learned before entering data/analytical engineering? Please tweet back to
Sponsored: From First-Touch to Multi-Touch Attribution With RudderStack, Dbt, and SageMaker
Here, RudderStack provides a detailed overview of the architecture, data, and modeling required to assess the contribution to conversion in multi-touch customer journeys.
Mihail Eric: MLOps Is a Mess, But That's to be Expected
The ML & data landscape is fragmented, with each tool trying to solve a niche problem. The author narrates the current state of MLOps and offers a few predictions, including increasing consolidation around end-to-end platforms, similar to the conversation in the data landscape about bundling & unbundling.
James Le: What I Learned From Attending Tecton apply(meetup) 2022
The author shares notes from the Tecton conference. I didn't have a chance to go through the full notes, but there is a ton to learn. Thanks, James, for sharing your notes.
Monte Carlo: Building End-to-End Field Level Lineage for Modern Data Systems
Data lineage is a critical connector to establish end-to-end observability and explainability of the analytical pipeline. Monte Carlo writes about the importance of the column-level lineage of the SQL pipeline and the design journey to establish observability.
There is a lot of talk and effort around "Explainable AI," but have we achieved "Explainable Analytics"? Is there any tool a salesperson can use to understand the business logic of ARR computation without understanding SQL? If you have found anything, please tweet
Expedia: Handling Incompatible Schema Changes with Avro
A backward-incompatible schema change is painful, and I still remember fixing an incompatible Thrift change to make sure future backfills would not break. Expedia writes an exciting blog that narrates how to handle incompatible Avro schema changes.
PyCaret: Multiple Time Series Forecasting with PyCaret
TIL about PyCaret, an open-source, low-code machine learning library and end-to-end model management tool built in Python for automating machine learning workflows.
Sponsored: RudderStack - The Data Stack Show Live: Is Reverse ETL Just Another Data Pipeline?
You’ve heard about Reverse ETL. Here’s your chance to learn all about the tooling from the folks who are creating it. Join hosts Eric and Kostas for a live recording of The Data Stack Show on March 9th to get insights from experts at Census, Hightouch, and Workato.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.