Data Engineering Weekly

Data Engineering Weekly #85

The Weekly Data Engineering Newsletter

Ananth Packkildurai
May 2, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Monzo: Monzo’s machine learning stack

Monzo writes about its machine learning stack, built on three principles:

  1. Autonomy

  2. Flexibility

  3. Reuse over rebuild

The blog is an exciting read, notably for its mix of Google Cloud for data science work and AWS for production deployment.

https://monzo.com/blog/2022/04/26/monzos-machine-learning-stack


Netflix: Evolution of ML Fact Store

Netflix writes about Axion, its fact store, which holds the large volume of high-quality data used for offline computation. The blog is an exciting read, pointing out the importance of high-quality data for machine learning applications and the long-standing challenge of accessing a small subset of data in data lake systems.

https://netflixtechblog.com/evolution-of-ml-fact-store-5941d3231762


Meta: SQL Notebooks - Combining the power of Jupyter and SQL editors for data analytics

Meta writes about SQL Notebooks, which combine the power of notebooks and SQL editors. The blog narrates some of the enforcement challenges with CTEs and how Meta uses a notebook-style cell reference design to make the code more modular.

https://engineering.fb.com/2022/04/26/developer-tools/sql-notebooks/
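
To make the cell reference idea concrete, here is a minimal Python sketch of one way a notebook could resolve references between named SQL cells by inlining the referenced cells as CTEs. It is an illustration only, not Meta's implementation: the `compile_cell` function, the cell names, and the table names are all hypothetical.

```python
import re


def compile_cell(name, cells):
    """Build one SQL statement for cell `name`, emitting referenced cells as CTEs.

    `cells` maps a (hypothetical) cell name to the SQL text of that cell.
    A cell references another cell simply by using its name as a table.
    Limitation of this sketch: identifiers inside string literals are
    also treated as references.
    """
    ordered = []  # cells in dependency-first order

    def visit(cell, stack):
        if cell in stack:
            raise ValueError(f"circular reference involving cell: {cell}")
        if cell in ordered:
            return  # already scheduled via another path
        for ref in re.findall(r"\b[A-Za-z_]\w*\b", cells[cell]):
            if ref in cells and ref != cell:
                visit(ref, stack | {cell})
        ordered.append(cell)

    visit(name, frozenset())

    # Every cell except the last (the requested one) becomes a CTE.
    ctes = [f"{c} AS ({cells[c]})" for c in ordered[:-1]]
    prefix = f"WITH {', '.join(ctes)} " if ctes else ""
    return prefix + cells[name]


# Hypothetical notebook with two cells; the second refers to the first by name.
cells = {
    "orders_2022": "SELECT * FROM orders WHERE year = 2022",
    "totals": "SELECT user_id, SUM(amount) AS total FROM orders_2022 GROUP BY user_id",
}
print(compile_cell("totals", cells))
```

Each cell stays small and independently runnable, while the final query is assembled only when a downstream cell is executed. That is the modularity benefit in miniature.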


Etsy: Building a Platform for Serving Recommendations at Etsy

Etsy writes about the evolution of its recommendation engine over the years. The blog narrates the journey from a static, pre-computed approach to a platform-centric one.

https://www.etsy.com/codeascraft/building-a-platform-for-serving-recommendations-at-etsy


Sponsored: What Makes Firebolt's Cloud Data Warehouse So Fast

Watch how Firebolt's cloud data warehouse for engineers delivers sub-second analytics at data lake scale.
https://www.firebolt.io/resources/cloud-data-warehouse-demo


PayPal: Words All the Way Down — Conversational Sentiment Analysis

Sentiment in a conversation is more complex than in a movie or product review. It changes rapidly with the context of the conversation for both speakers. PayPal writes about how it approaches conversational sentiment analysis.

https://medium.com/paypal-tech/words-all-the-way-down-conversational-sentiment-analysis-afe0165b84db


Vinoth Chandar: Corrections in data lakehouse table format comparisons

LakeHouse capabilities are evolving fast, and it's hard to keep track of all the features. We've seen a few lake format comparison studies, and the author rightly points out a few corrections.

Unlike the ugly benchmark wars waged by a few systems in the past, I like Vinoth Chandar's approach of collaboratively building the future of the LakeHouse.

Open Call for the LakeHouse Formats

It would be excellent if these systems came together and built an OpenLakeHouse format. Data quality, data visualization, and data discovery systems could all benefit from the shared metadata of an OpenLakeHouse format. Historically, this metadata has been used only by query optimizers within closed-loop systems. With the LakeHouse format, we genuinely have an opportunity to build an ecosystem around metadata.

https://bytearray.io/corrections-in-data-lakehouse-table-format-comparisons-b72eb63ece32


Sponsored: RudderStack - A Practical Guide to The Modern Data Stack: The Data Maturity Journey

Data maturity is rapidly becoming a matter of survival, but the modern data stack can be overwhelming. Here, RudderStack provides a helpful framework that places the tools of the modern stack in the context of a 4-stage journey to help you build the right stack at every stage.

https://www.rudderstack.com/blog/a-practical-guide-to-the-modern-data-stack-the-data-maturity-journey


Pinterest: Optimizing Pinterest’s Data Ingestion Stack: Findings and Learnings

Pinterest writes about the optimization strategies adopted in its logging infrastructure. The journey from a round-robin strategy for writing to Kafka partitions, to the RandomPartition approach, and then to a static partition write approach is an exciting read.

https://medium.com/@Pinterest_Engineering/optimizing-pinterests-data-ingestion-stack-findings-and-learnings-994fddb063bf

It reminds me of our earlier design of Slack's logging pipeline Murron.
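
As a rough illustration of the three strategies named in the Pinterest summary, the Python sketch below shows how a producer might choose a partition under each one. The function names and the exact selection rules are assumptions made for illustration, not Pinterest's implementation or Kafka's partitioner API.

```python
import itertools
import random
import zlib


def round_robin_partitioner(num_partitions):
    """Return a chooser that cycles through partitions 0..num_partitions-1."""
    cycle = itertools.cycle(range(num_partitions))
    return lambda: next(cycle)


def random_partitioner(num_partitions, rng=None):
    """Return a chooser that picks a partition uniformly at random on each call."""
    rng = rng or random.Random()
    return lambda: rng.randrange(num_partitions)


def static_partitioner(num_partitions, producer_id):
    """Return a chooser that always picks the same partition for one producer.

    Assumption: the fixed partition is derived from a stable hash of a
    hypothetical producer identifier such as a host name.
    """
    fixed = zlib.crc32(producer_id.encode("utf-8")) % num_partitions
    return lambda: fixed


# Compare how many distinct partitions one producer touches over 8 writes.
choosers = {
    "round_robin": round_robin_partitioner(4),
    "random": random_partitioner(4),
    "static": static_partitioner(4, "host-1"),
}
for label, choose in choosers.items():
    touched = {choose() for _ in range(8)}
    print(label, "touched", len(touched), "partition(s)")
```

The trade-off this sketch hints at: round-robin spreads load evenly but makes every producer talk to every partition, while a static assignment keeps each producer on a single partition at the cost of relying on many producers to balance the load.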


FindHotel: Enriching Looker with reliability metadata

The modern data stack tries to solve each niche data engineering challenge; however, it brings a problem of its own: a disjointed data workflow. The tweet below sums it up.

Josh Wills @josh_wills
To my many friends/followers doing metadata/catalog startups, I have a request: please integrate the metadata info with my BI tool so that I can see it *while I am doing queries.* I have no desire to *ever* visit a third website to just "browse the metadata."
4:49 PM ∙ Apr 29, 2022

FindHotel reflects a similar pain point in its data ecosystem and writes how it integrates reliability metadata with Looker.

Most of these tools store the results of the reliability tests in a database and expose them in a custom front-end application.

However, the need to access another tool can become a problem. Data consumers are usually familiar with the BI tool. Having them open the BI tool in one tab, the data reliability tool in another tab, and perhaps other tools (data catalog, etc.) in different tabs, may have a bad impact on adoption and usage.

https://blog.findhotel.net/enriching-looker-with-reliability-metadata-8a4aff6667cb


Shopify: Data Is An Art, Not Just A Science—And Storytelling Is The Key

Data storytelling is vital to influencing the decision-making process, and Shopify writes about strategies for adopting it.

https://shopifyengineering.myshopify.com/blogs/engineering/data-storytelling-shopify


Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook

Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our exclusive guide.

Download the modern data leader’s playbook


Mothership: Analytics Engineering At Mothership

A good data modeling process empowers the business to make data-driven decisions, enables curated self-service, and ensures scalability, reliability, and shared context. The Analytics Engineering Roundup recently wrote about Data Modeling for Collaboration. Mothership writes about its journey of adopting the data modeling process.

https://medium.com/mothership/analytics-engineering-at-mothership-8d061b66bec3


Swiggy: An end to end system to detect and explain anomalies in operational metrics

Swiggy writes about its end-to-end system for detecting anomalies in critical business metrics. Adopting an expert system module that adds human input to account for rain or other local events is an exciting approach.

https://bytes.swiggy.com/an-end-to-end-system-to-detect-and-explain-anomalies-in-operational-metrics-448bc74c700e


Michael Katz: Will the CDP be unbundled?

The data engineering world can't escape the bundling and unbundling debate! Hightouch writes "The CDP as we know it is dead: Introducing the Unbundled CDP," and mParticle follows up with "Will the CDP be unbundled?"

As the tweets below point out, this complexity arises from low-quality event tracking.

Sarah Catanzaro @sarahcat21
You can’t transform digital exhaust into actionable insights. It’s shocking how many companies invest in analytics tools without spending time on event tracking.
pipeline gig worker @CSVjanitor
Event instrumentation and analytics on top of it is a fucking disaster. I’d guess maybe ~20% of events tops are setup correctly? And in most cases no party has full visibility or incentive to improve it (data team / instrumenting SWEs / vendor)
10:37 PM ∙ Apr 4, 2022

https://medium.com/@mkatz0630/will-the-cdp-be-unbundled-6e8308b2e0e1


Disney Streaming: The Fine Art of Visualizing Experiment Results

Data visualization plays a vital role in experiment evaluation. Disney writes about automating visualization at both the ad-hoc and platform levels to simplify experiment evaluation.

https://medium.com/disney-streaming/the-fine-art-of-visualizing-experiment-results-95a687b2bb0e


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
