Data Engineering Weekly #74

Weekly Data Engineering Newsletter

Feb 14, 2022

Amit Prakash: How to design your data stack for curiosity

Current data stack designs skew toward serving well-defined dashboards. Curious questions often end up as ad hoc requests that trigger a long tail of manual exploration. The author makes an excellent case for tuning data stacks to answer curious questions. The discussion of invoking curiosity through adjacency is an excellent read.

My take on this:

The data analytics world can learn a lot from content platforms like YouTube and TikTok, and from influencer marketing. Until then, the data world has to live with the zombie dashboard apocalypse.

https://prakasha.substack.com/p/how-to-design-your-data-stack-for

Twitter: Next-generation data insights using natural language queries

Continuing the quest to overcome the dashboard apocalypse, Twitter writes an exciting blog post describing the architecture of Qurious, its natural language query system with a Slack interface. (A great coincidence with the previous post on designing a data stack for curiosity!)

https://blog.twitter.com/engineering/en_us/topics/insights/2022/next-generation-data-insights-using-natural-language-queries

Joanna He: Understanding the Metrics Store

Business intelligence and data warehousing have come a long way from Teradata to Snowflake. Consistency across metrics is still a challenge, which has paved the way for the metrics layer, metrics store, metrics platform, or headless BI. (Ha, we need naming consistency for the metrics layer first!) The author gives an excellent overview of what a metrics store is.

https://medium.com/kyligence/understanding-the-metrics-store-c213341e4c25

Shopify: Shopify's Playbook for Scaling Machine Learning

Shopify writes an exciting blog post about its playbook for scaling machine learning. Identifying the downstream use cases and optimizing machine learning for business outcomes is an excellent model for anyone starting machine learning from scratch.

https://shopifyengineering.myshopify.com/blogs/engineering/shopify-playbook-scaling-machine-learning

Uber: DeepETA - How Uber Predicts Arrival Times Using Deep Learning

Uber writes about the design of its deep learning system for predicting arrival times. The design of a general ETA prediction service that spans all of Uber's businesses is an exciting read.

https://eng.uber.com/deepeta-how-uber-predicts-arrival-times/

eBay: Creating High-Quality Staging Data with a NoSQL Data Migration System

One of the challenging tasks in data engineering is creating a staging environment that closely mimics production. Earlier approaches, such as a random event generator or a web event simulator, are not optimal. I believe anonymized production data combined with a sampling technique is the best fit for a staging environment. It's great to see eBay write about a staging system designed along the same lines.

https://tech.ebayinc.com/engineering/creating-high-quality-staging-data-with-a-nosql-data-migration-system/
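
To make the "anonymized production data with sampling" idea concrete, here is a minimal Python sketch. It is not eBay's system: the function names, the record shape, the salt, and the sampling rate are all assumptions made for illustration. Note also that salted hashing is pseudonymization rather than full anonymization, so a real pipeline would likely need stronger guarantees.

```python
import hashlib
import random


def anonymize(record, pii_fields, salt):
    """Return a copy of `record` with the listed fields replaced by salted hashes.

    Hypothetical helper: hashing keeps values consistent across records
    (so joins still work) while hiding the raw value.
    """
    masked = dict(record)
    for field in pii_fields:
        if field in masked:
            raw = (salt + str(masked[field])).encode("utf-8")
            masked[field] = hashlib.sha256(raw).hexdigest()[:16]
    return masked


def sample_for_staging(records, rate, pii_fields, salt, seed=0):
    """Keep roughly `rate` of the records and anonymize each kept record."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [
        anonymize(record, pii_fields, salt)
        for record in records
        if rng.random() < rate
    ]


# Made-up "production" rows, purely for demonstration.
production = [
    {"user_id": i, "email": f"user{i}@example.com", "amount": i * 10}
    for i in range(1000)
]
staging = sample_for_staging(
    production, rate=0.1, pii_fields=["email"], salt="demo-salt"
)
```

Non-sensitive fields such as `amount` pass through unchanged, so the staging data keeps the shape and distribution of production while the email column is masked.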

Validio: 5 Data Trends in 2022

There are many data predictions out there; this is an excellent summary of the trends to watch in 2022. I believe operational and real-time analytics will play a vital role in data engineering, and it is great to see the author say the same.

https://medium.com/validio/5-data-trends-in-2022-4035c099aac2

Maarten Grootendorst: 9 Distance Measures in Data Science

Distance measures are vital in recommendation systems and similarity-based classifiers. The author does a fantastic job explaining the available distance measures, along with their advantages and use cases.

https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
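
For a quick feel of how such measures differ, here is a small Python sketch of three widely used ones: Euclidean, Manhattan, and cosine distance. The function names are my own and this is not code from the article; it assumes equal-length numeric vectors and, for cosine distance, non-zero vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores magnitude, compares direction only.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7. Cosine distance treats (1, 2) and (2, 4) as essentially identical because they point in the same direction, which is why it is popular when magnitude should not matter.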

AWS: MemQ by Pinterest - An efficient, scalable, cloud-native publish/subscribe system

Pinterest has written about MemQ, a scalable cloud-native pub/sub system, in the past. It's great to see MemQ open-sourced and AWS writing about how it works with the AWS ecosystem.

https://aws.amazon.com/blogs/storage/memq-by-pinterest-an-efficient-scalable-cloud-native-publish-subscribe-system/

Jeff: The Case for Marketing Attribution

Accounting for the unobserved parts of the customer journey in marketing attribution is always challenging. Can we use probabilistic techniques to reason about the unknown? The author makes a compelling case for marketing attribution using a Hidden Markov Model.

https://jwithing.com/the-case-for-marketing-attribution/
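
As a rough illustration of the probabilistic idea, the forward algorithm for a Hidden Markov Model computes how likely an observed sequence of touchpoints is when the customer's true state is hidden. The sketch below is not the author's model: the states, the observation names, and every probability are made-up placeholders.

```python
# Toy HMM. All states, observations, and probabilities are hypothetical.
STATES = ("unaware", "considering")
START_P = {"unaware": 0.8, "considering": 0.2}
TRANS_P = {
    "unaware": {"unaware": 0.7, "considering": 0.3},
    "considering": {"unaware": 0.1, "considering": 0.9},
}
EMIT_P = {
    "unaware": {"no_touch": 0.9, "ad_click": 0.1},
    "considering": {"no_touch": 0.4, "ad_click": 0.6},
}


def sequence_likelihood(observations):
    """Forward algorithm: P(observations), summed over all hidden state paths."""
    if not observations:
        raise ValueError("need at least one observation")
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: START_P[s] * EMIT_P[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT_P[s][obs] * sum(alpha[p] * TRANS_P[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


likelihood = sequence_likelihood(["no_touch", "ad_click", "ad_click"])
```

The hidden states stand in for the unobserved parts of the journey; only the touchpoints are seen, and the model weighs every possible hidden path when scoring a sequence.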

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
