Data Engineering Weekly #85

The Weekly Data Engineering Newsletter

May 02, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Monzo: Monzo’s machine learning stack

Monzo writes about its machine learning stack, built on three principles:

Autonomy
Flexibility
Reuse over rebuild

The blog is an exciting read, describing a mix of Google Cloud for data science work and AWS for production deployment.

https://monzo.com/blog/2022/04/26/monzos-machine-learning-stack

Netflix: Evolution of ML Fact Store

Netflix writes about Axion, its fact store, which holds the large volume of high-quality data used to compute ML features offline. The blog is an exciting read, pointing out the importance of high-quality data for machine learning applications and the long-standing challenge of accessing a small subset of data in data lake systems.

https://netflixtechblog.com/evolution-of-ml-fact-store-5941d3231762

Meta: SQL Notebooks - Combining the power of Jupyter and SQL editors for data analytics

Meta writes about SQL Notebooks, combining the power of notebooks and SQL editors. The blog narrates some of the enforcement challenges with CTEs (common table expressions) and how a notebook-style cell reference design makes the code more modular.

https://engineering.fb.com/2022/04/26/developer-tools/sql-notebooks/

Etsy: Building a Platform for Serving Recommendations at Etsy

Etsy writes about the evolution of its recommendation engine over the years. The blog narrates the journey from static, pre-computed recommendations to a platform-centric approach.

https://www.etsy.com/codeascraft/building-a-platform-for-serving-recommendations-at-etsy

PayPal: Words All the Way Down — Conversational Sentiment Analysis

Sentiment in a conversation is more complex than in a movie or product review. It changes rapidly with the context of the conversation for both speakers. PayPal writes about how it approaches conversational sentiment analysis.

https://medium.com/paypal-tech/words-all-the-way-down-conversational-sentiment-analysis-afe0165b84db

Vinoth Chandar: Corrections in data lakehouse table format comparisons

LakeHouse capabilities are evolving fast, and it's hard to keep track of all the features. We've seen a few lake format comparison studies, and the author rightly points out a few corrections to them.

Unlike the ugly benchmark wars between a few systems in the past, I like Vinoth Chandar's approach of collaboratively building the future of the LakeHouse.

Open Call for the LakeHouse Formats

It would be excellent if these systems came together and built an OpenLakeHouse format. Data quality, data visualization, and data discovery systems could all benefit from the shared metadata of an OpenLakeHouse format. Historically, this metadata has been used only by query optimizers within closed-loop systems. With the LakeHouse format, we genuinely have an opportunity to build an ecosystem around metadata.

https://bytearray.io/corrections-in-data-lakehouse-table-format-comparisons-b72eb63ece32

Pinterest: Optimizing Pinterest’s Data Ingestion Stack: Findings and Learnings

Pinterest writes about the optimization strategies it adopted in its logging infrastructure. The journey from a round-robin strategy for writing to Kafka partitions, to the RandomPartition approach, to a static partition write approach is an exciting read.

https://medium.com/@Pinterest_Engineering/optimizing-pinterests-data-ingestion-stack-findings-and-learnings-994fddb063bf

It reminds me of our earlier design of Slack's logging pipeline, Murron.
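
To make the trade-off between the partitioning strategies concrete, here is a minimal Python sketch. It is not Pinterest's implementation: the function names and the exact semantics of each strategy (in particular, re-picking a random partition once per batch and pinning a producer by its id) are my simplifying assumptions. It only illustrates how many distinct partitions, and therefore roughly how many broker connections, a single producer ends up touching.

```python
import itertools
import random
from typing import Iterator, Set


def round_robin(num_partitions: int) -> Iterator[int]:
    """Cycle through every partition in order, one message at a time."""
    return itertools.cycle(range(num_partitions))


def random_per_batch(
    num_partitions: int, batch_size: int, rng: random.Random
) -> Iterator[int]:
    """Pick a random partition and reuse it for a whole batch (assumed semantics)."""
    while True:
        partition = rng.randrange(num_partitions)
        for _ in range(batch_size):
            yield partition


def static_partition(num_partitions: int, producer_id: int) -> Iterator[int]:
    """Pin a producer to one fixed partition derived from its id (assumed semantics)."""
    return itertools.repeat(producer_id % num_partitions)


def partitions_touched(strategy: Iterator[int], num_messages: int) -> Set[int]:
    """Return the distinct partitions a producer writes to for num_messages messages."""
    return set(itertools.islice(strategy, num_messages))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs behave the same
    print("round robin:", len(partitions_touched(round_robin(8), 100)))
    print("random/batch:", len(partitions_touched(random_per_batch(8, 50, rng), 100)))
    print("static:", len(partitions_touched(static_partition(8, 11), 100)))
```

Under these assumptions, round robin spreads 100 messages over all 8 partitions, the per-batch random strategy touches at most two, and the static strategy touches exactly one.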

FindHotel: Enriching Looker with reliability metadata

The Modern Data Stack tries to solve each niche data engineering challenge; however, it brings its own problem: a disjointed data workflow. The tweet below summarizes this.

Josh Wills @josh_wills

To my many friends/followers doing metadata/catalog startups, I have a request: please integrate the metadata info with my BI tool so that I can see it *while I am doing queries.* I have no desire to *ever* visit a third website to just "browse the metadata."

FindHotel reflects on a similar pain point in its data ecosystem and writes about how it integrates reliability metadata with Looker.

Most of these tools store the results of the reliability tests in a database and expose them in a custom front-end application.
However, the need to access another tool can become a problem. Data consumers are usually familiar with the BI tool. Having them open the BI tool in one tab, the data reliability tool in another tab, and perhaps other tools (data catalog, etc.) in different tabs, may have a bad impact on adoption and usage.

https://blog.findhotel.net/enriching-looker-with-reliability-metadata-8a4aff6667cb

Shopify: Data Is An Art, Not Just A Science—And Storytelling Is The Key

Data storytelling is vital to influencing the decision-making process, and Shopify writes about strategies for adopting it.

https://shopifyengineering.myshopify.com/blogs/engineering/data-storytelling-shopify

Mothership: Analytics Engineering At Mothership

A good data modeling process empowers the business to make data-driven decisions, enables curated self-service, and ensures scalability, reliability, and shared context. The Analytics Engineering Roundup recently wrote about Data Modeling for Collaboration. Mothership writes about its journey of adopting a data modeling process.

https://medium.com/mothership/analytics-engineering-at-mothership-8d061b66bec3

Swiggy: An end to end system to detect and explain anomalies in operational metrics

Swiggy writes about its end-to-end system for detecting anomalies in critical business metrics. Adopting an expert system module that adds human input to account for rain or other local events is an exciting approach.

https://bytes.swiggy.com/an-end-to-end-system-to-detect-and-explain-anomalies-in-operational-metrics-448bc74c700e

Michael Katz: Will the CDP be unbundled?

The data engineering world can't escape the bundling and unbundling debate! Hightouch writes "The CDP as we know it is dead: Introducing the Unbundled CDP," and mParticle follows up with "Will the CDP be unbundled?"

As the tweets below point out, this complexity arises from low-quality event tracking.

Sarah Catanzaro @sarahcat21

You can’t transform digital exhaust into actionable insights. It’s shocking how many companies invest in analytics tools without spending time on event tracking.

pipeline gig worker @CSVjanitor

Event instrumentation and analytics on top of it is a fucking disaster. I’d guess maybe ~20% of events tops are setup correctly? And in most cases no party has full visibility or incentive to improve it (data team / instrumenting SWEs / vendor)

https://medium.com/@mkatz0630/will-the-cdp-be-unbundled-6e8308b2e0e1

Disney Streaming: The Fine Art of Visualizing Experiment Results

Data visualization plays a vital role in experiment evaluation. Disney writes about automated visualization at the ad-hoc and platform levels to simplify experiment evaluation.

https://medium.com/disney-streaming/the-fine-art-of-visualizing-experiment-results-95a687b2bb0e

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
