Data Engineering Weekly #74
Weekly Data Engineering Newsletter
Amit Prakash: How to design your data stack for curiosity
Current data stack design skews toward serving well-defined dashboards. Curious questions often end up as ad hoc requests that trigger a long tail of manual exploration. The author makes an excellent case for tuning data stacks to answer curious questions. The part on invoking curiosity through adjacency is an excellent read.
My take on this:
The data analytics world can learn a lot from content platforms like YouTube and TikTok, and from influencer marketing. Until then, the data world has to live with the zombie dashboard apocalypse.
Twitter: Next-generation data insights using natural language queries
Continuing the quest to overcome the dashboard apocalypse, Twitter writes an exciting blog describing the architecture of its Qurious app [a great coincidence with the previous blog on a data stack for curiosity!] and its natural language query system with a Slack interface.
Joanna He: Understanding the Metrics Store
Business intelligence and data warehousing have come a long way from Teradata to Snowflake. Consistency across metrics is still challenging, which paves the way for the metrics layer/store/platform, or headless BI (Ha, we need naming consistency for the metrics layer first!). The author gives an excellent overview of what a metrics store is.
Shopify: Shopify's Playbook for Scaling Machine Learning
Shopify writes an exciting blog about its playbook for scaling machine learning. Identifying what sits downstream and optimizing machine learning for business outcomes is an excellent model for anyone starting machine learning from scratch.
Uber: DeepETA- How Uber Predicts Arrival Times Using Deep Learning
Uber writes about the design of its deep learning system for predicting arrival times. The design of a general ETA prediction service across all of Uber's businesses is an exciting read.
eBay: Creating High-Quality Staging Data with a NoSQL Data Migration System
One of the challenging tasks of data engineering is creating a staging environment that closely mimics production. Previous approaches, such as a random event generator or a web event simulator, are not optimal. I believe anonymized production data with a sampling technique is the best fit for a staging environment. It's great to see eBay write about its staging system design along the same lines.
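To make the "anonymized production data with sampling" idea concrete, here is a minimal sketch. It is not eBay's design; the field names (`user_id`, `email`), the salt, and the helper names are all hypothetical. It samples by hashing a key, so the same users are always selected, and replaces identifying fields with salted hashes. Note that salted hashing is pseudonymization rather than full anonymization, so a real system would likely need more than this.

```python
import hashlib


def stable_bucket(key: str, buckets: int = 100) -> int:
    """Map a key to a bucket in [0, buckets) deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_sample(key: str, percent: int) -> bool:
    """Keep roughly `percent` percent of keys; the same key always gets the same answer."""
    return stable_bucket(key) < percent


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a truncated salted hash (pseudonymization, not anonymization)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def to_staging(records, percent, salt, key_field="user_id", pii_fields=("email",)):
    """Sample records by key and mask identifying fields before loading into staging."""
    staged = []
    for record in records:
        # Decide on the raw key so that all rows for one user stay together.
        if not in_sample(str(record[key_field]), percent):
            continue
        clean = dict(record)
        clean[key_field] = pseudonymize(str(record[key_field]), salt)
        for field in pii_fields:
            if field in clean:
                clean[field] = pseudonymize(str(clean[field]), salt)
        staged.append(clean)
    return staged


if __name__ == "__main__":
    # Hypothetical production rows, for illustration only.
    production = [
        {"user_id": i, "email": f"user{i}@example.com", "amount": i * 10}
        for i in range(1000)
    ]
    sample = to_staging(production, percent=10, salt="staging-salt")
    print(len(sample), "of", len(production), "rows kept")
```

Sampling on the key rather than on each row keeps every row for a selected user, which preserves the joins and per-user sequences that a random event generator tends to lose.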
Validio: 5 Data Trends in 2022
There are many data predictions; this is an excellent summary of the trends to watch in 2022. I believe operational and real-time analytics will play a vital role in data engineering, and it is great to see the author reflect the same.
Maarten Grootendorst: 9 Distance Measures in Data Science
Distance measures are vital in recommendation systems and similarity classifiers. The author did a fantastic job explaining the available distance measures, their advantages, and their use cases.
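For a quick feel of how such measures differ, here is a small sketch of four common ones in plain Python. The selection and the function names are mine, not necessarily the article's, and the code assumes equal-length, non-zero vectors and non-empty sets.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores magnitude, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(s, t):
    """One minus the ratio of shared items to all items across two sets."""
    s, t = set(s), set(t)
    return 1.0 - len(s & t) / len(s | t)


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(manhattan([0, 0], [3, 4]))
    print(cosine_distance([1, 2], [2, 4]))
    print(jaccard_distance({"a", "b"}, {"b", "c"}))
```

The pair `[1, 2]` and `[2, 4]` is far apart by Euclidean distance but nearly identical by cosine distance, which is why the choice of measure matters for recommendations based on rating or embedding vectors.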
AWS: MemQ by Pinterest - An efficient, scalable, cloud-native publish/subscribe system
Pinterest has written in the past about MemQ, a scalable cloud-native pub/sub system. It's great to see MemQ open-sourced, and AWS writes about how it works with the AWS ecosystem.
Jeff: The Case for Marketing Attribution
Accounting for the unobserved parts of the journey in marketing attribution is always challenging. Can we use probabilistic techniques to account for the unknown? The author makes a compelling case for marketing attribution using a Hidden Markov Model.
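To illustrate the general idea of a Hidden Markov Model in this setting, here is a minimal sketch of the forward algorithm. It is not the author's model: the hidden states, the observed touchpoints, and every probability below are made up for illustration. The hidden state stands for where a customer is in the journey, and the observations are the touchpoints we can actually see.

```python
# All states, observations, and probabilities are hypothetical.
STATES = ("unaware", "considering", "ready_to_buy")
OBSERVATIONS = ("no_signal", "ad_click", "site_visit")

START_P = {"unaware": 0.7, "considering": 0.25, "ready_to_buy": 0.05}

TRANS_P = {
    "unaware": {"unaware": 0.6, "considering": 0.35, "ready_to_buy": 0.05},
    "considering": {"unaware": 0.1, "considering": 0.6, "ready_to_buy": 0.3},
    "ready_to_buy": {"unaware": 0.05, "considering": 0.15, "ready_to_buy": 0.8},
}

EMIT_P = {
    "unaware": {"no_signal": 0.8, "ad_click": 0.15, "site_visit": 0.05},
    "considering": {"no_signal": 0.4, "ad_click": 0.35, "site_visit": 0.25},
    "ready_to_buy": {"no_signal": 0.2, "ad_click": 0.2, "site_visit": 0.6},
}


def forward(observations, states, start_p, trans_p, emit_p):
    """Return (likelihood of the observed sequence, final forward probabilities)."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values()), alpha


def final_state_posterior(observations, states, start_p, trans_p, emit_p):
    """Probability of each hidden state at the last step, given the observations."""
    likelihood, alpha = forward(observations, states, start_p, trans_p, emit_p)
    return {s: alpha[s] / likelihood for s in states}


if __name__ == "__main__":
    journey = ["ad_click", "no_signal", "site_visit"]
    posterior = final_state_posterior(journey, STATES, START_P, TRANS_P, EMIT_P)
    for state, prob in posterior.items():
        print(f"{state}: {prob:.3f}")
```

The "no_signal" step is where the unobserved part of the journey shows up: the model still updates its belief about the hidden state through the transition probabilities even when nothing was tracked.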
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.