Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Monzo: Monzo’s machine learning stack
Monzo writes about its machine learning stack, built on three principles.
Reuse over a rebuild.
The blog is an exciting read, with its mix of Google Cloud for data science work and AWS for production deployment.
Netflix: Evolution of ML Fact Store
Netflix writes about Axion, its fact store, which holds the large volume of high-quality data used for offline computation. The blog is an exciting read, pointing out the importance of high-quality data for machine learning applications and the long-standing challenge of accessing a small subset of data in data lake systems.
Meta: SQL Notebooks - Combining the power of Jupyter and SQL editors for data analytics
Meta writes about SQL Notebooks, combining the power of notebooks and SQL editors. The blog narrates some of the enforcement challenges with CTEs and how Meta uses a notebook-style cell reference design to make the code more modular.
Etsy: Building a Platform for Serving Recommendations at Etsy
Etsy writes about the evolution of its recommendation engine over the years. The blog narrates the journey from a static pre-computed approach to a platform-centric approach.
Sponsored: What Makes Firebolt's Cloud Data Warehouse So Fast
Watch how Firebolt's cloud data warehouse for engineers delivers sub-second analytics at data lake scale.
PayPal: Words All the Way Down — Conversational Sentiment Analysis
Sentiment in a conversation is more complex than in a movie or product review. It changes rapidly with the context of the conversation for both speakers. PayPal writes about how it approaches conversational sentiment analysis.
Vinoth Chandar: Corrections in data lakehouse table format comparisons
The LakeHouse capabilities are evolving fast, and it's hard to keep track of all the features. We've seen a few lake format comparison studies, and the author rightly pointed out a few corrections.
Unlike the ugly benchmark war by a few systems in the past, I like the approach from Vinoth Chandar on collaboratively building the future of LakeHouse.
Open Call for the LakeHouse Formats
It would be excellent if these systems came together and built an OpenLakeHouse format. Data quality, data visualization, and data discovery systems can all benefit from the shared metadata of an OpenLakeHouse format. Historically, this metadata has been used only by query optimizers within closed-loop systems. With the LakeHouse format, we genuinely have an opportunity to build an ecosystem around metadata.
Sponsored: RudderStack - A Practical Guide to The Modern Data Stack: The Data Maturity Journey
Data maturity is rapidly becoming a matter of survival, but the modern data stack can be overwhelming. Here, RudderStack provides a helpful framework that places the tools of the modern stack in the context of a 4-stage journey to help you build the right stack at every stage.
Pinterest: Optimizing Pinterest’s Data Ingestion Stack: Findings and Learnings
Pinterest writes about the optimization strategies adopted in its logging infrastructure. The journey from a round-robin strategy for writing to Kafka partitions, to the RandomPartition approach, and then to the static partition write approach is an exciting read.
It reminds me of our earlier design of Slack's logging pipeline Murron.
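For readers less familiar with the three write strategies named above, here is a minimal generic sketch of how a producer might pick a Kafka partition under each one. It is not Pinterest's implementation: the function names (`round_robin_chooser`, `random_chooser`, `static_chooser`), the `producer_id` parameter, and the CRC32-based mapping are all assumptions made for illustration.

```python
import itertools
import random
import zlib


def round_robin_chooser(num_partitions):
    """Cycle through partitions 0..n-1 in order, one per call."""
    counter = itertools.count()

    def choose():
        return next(counter) % num_partitions

    return choose


def random_chooser(num_partitions, seed=None):
    """Pick a uniformly random partition on every call."""
    rng = random.Random(seed)

    def choose():
        return rng.randrange(num_partitions)

    return choose


def static_chooser(num_partitions, producer_id):
    """Always return the same partition for a given producer.

    The producer_id -> partition mapping via CRC32 is a hypothetical
    choice for this sketch; any stable hash would do.
    """
    fixed = zlib.crc32(producer_id.encode("utf-8")) % num_partitions

    def choose():
        return fixed

    return choose


if __name__ == "__main__":
    rr = round_robin_chooser(3)
    print("round robin:", [rr() for _ in range(6)])

    st = static_chooser(8, "host-a")
    print("static:", [st() for _ in range(3)])
```

The trade-off this sketch hints at: round-robin and random spread load evenly but make every producer talk to every partition over time, while a static assignment keeps each producer on one partition at the cost of needing enough producers to balance the load.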
FindHotel: Enriching Looker with reliability metadata
ModernDataStack tries to solve each niche data engineering challenge; however, it brings a problem of its own: a disjointed data workflow. The tweet summarizes the same.
FindHotel reflects a similar pain point in its data ecosystem and writes about how it integrates reliability metadata with Looker.
Most of these tools store the results of the reliability tests in a database and expose them in a custom front-end application.
However, the need to access another tool can become a problem. Data consumers are usually familiar with the BI tool. Having them open the BI tool in one tab, the data reliability tool in another tab, and perhaps other tools (data catalog, etc.) in different tabs, may have a bad impact on adoption and usage.
Shopify: Data Is An Art, Not Just A Science—And Storytelling Is The Key
Data storytelling is vital to influencing the decision-making process, and Shopify writes about strategies for adopting it.
Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook
Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our exclusive guide.
Mothership: Analytics Engineering At Mothership
A good data modeling process empowers the business to make data-driven decisions, enables curated self-service, and ensures scalability, reliability, and shared context. The Analytics Engineering Roundup recently wrote about Data Modeling for Collaboration. Mothership writes about its journey of adopting a data modeling process.
Swiggy: An end to end system to detect and explain anomalies in operational metrics
Swiggy writes about its end-to-end system for detecting and explaining anomalies in critical business metrics. Adopting an expert system module that adds human input to account for rain or other local events is an exciting approach.
Michael Katz: Will the CDP be unbundled?
The data engineering world can't escape the bundling and unbundling debate! Hightouch writes "The CDP as we know it is dead: Introducing the Unbundled CDP," and mParticle follows up with "Will the CDP be unbundled?"
As the tweet below points out, this complexity arises from low-quality event tracking systems.
pipeline gig worker (@CSVjanitor): "Event instrumentation and analytics on top of it is a fucking disaster. I’d guess maybe ~20% of events tops are setup correctly? And in most cases no party has full visibility or incentive to improve it (data team / instrumenting SWEs / vendor)"
Disney Streaming: The Fine Art of Visualizing Experiment Results
Data visualization plays a vital role in experiment evaluation. Disney writes about automated visualization at the ad-hoc and platform levels to simplify experiment evaluation.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.