Data Engineering Weekly #87
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Let’s start this week with this thought.
There were some fantastic, lively discussions on data deletion, the need for it, and the cultural aspects within teams. The discussions added a few exciting reads to my list.
***The IEEE March 2022 edition discusses data errors and privacy.***
LinkedIn: One-stop MLOps portal at LinkedIn
Though the ML lifecycle has several phases, it is all part of the developer workflow. A single interface to manage the data/ML lifecycle yields significant productivity gains. The LinkedIn blog on the One-Stop MLOps portal is an excellent reminder of the importance of developer productivity in the data workflow.
Adidas: Adidas Data Mesh Journey - Sharing data efficiently at scale
As data adoption grows, the data engineering challenge of the next few years will be figuring out data contract frameworks and efficient data sharing, both internally and externally, across different organizations. I started my attempt to solve this problem with Schemata.
Read more about Schemata; I'm also looking for contributors.
Adidas writes about its data mesh adoption, which is designed on a similar principle: defining contracts between data producers and consumers.
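To make the idea of a producer/consumer contract concrete, here is a minimal sketch in Python. It is not how Schemata or the Adidas data mesh implements contracts; the `FieldSpec` class, the `validate` function, and the `order_created` fields are all hypothetical names chosen for illustration. The producer publishes the expected fields and types, and either side can check a record against them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in a (hypothetical) data contract."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an "order_created" event, owned by the producer.
ORDER_CREATED_CONTRACT = [
    FieldSpec("order_id", str),
    FieldSpec("amount_cents", int),
    FieldSpec("coupon_code", str, required=False),
]


def validate(record: dict, contract: list[FieldSpec]) -> list[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for spec in contract:
        value = record.get(spec.name)
        if value is None:
            # Absent (or null) optional fields are allowed.
            if spec.required:
                errors.append(f"missing required field: {spec.name}")
            continue
        if not isinstance(value, spec.type_):
            errors.append(
                f"{spec.name}: expected {spec.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

A producer could run `validate` before publishing, and a consumer could run the same check on ingestion, so both sides fail fast on the same definition rather than discovering a schema drift downstream.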
Sponsored: Firebolt - How Vimeo Keeps Data Intact with 85 Billion Events Per Month
Lior Solomon, VP of Data Engineering at Vimeo, shares his own experience on The Data Engineering Show: What made him recently build a new data ops team? How do you operate a data stack that supports 85 billion events per month and 2 PBs of data? What does Fatal Attraction have to do with all of this?
HubSpot: How to Get Better at Updating Your Data Infrastructure
Infrastructure upgrades are always challenging and require disciplined engineering practices. HubSpot writes an exciting article describing the engineering practices for upgrading its data infrastructure.
Lyft: Trino - Open Source Infrastructure Upgrading at Lyft
Staying on upgrades, Lyft writes about its Trino upgrade story. The article is an exciting read to understand Trino internals and efficient flame graph usage in performance debugging.
Michael Toy: Designing Malloy — Introduction & The Syntactic Shell
Malloy is a query language for data, an alternative to SQL. It is an exciting project that I'm actively following and looking forward to trying out. The blog is an exciting read to understand the evaluation and design thinking behind building a language.
Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook
Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our complete guide.
Download the modern data leader’s playbook
Kovid Rathee: Data Quality and Testing Frameworks
Data quality and testing frameworks play a vital role in establishing data contracts and a healthy data exchange between data producers and consumers. The author writes an excellent comparison of data quality and testing frameworks.
Amit Prakash: The metrics layer has growing up to do
The metric layer is picking up momentum with the dbt metric layer and the introduction of MetricFlow. Discoverable, shareable, and reusable metric definitions offer a significant advantage in streamlining analytical processing. The author discusses various metric query patterns and the logical layer needed to integrate the metric layer.
Twitter: Scaling data access by moving an exabyte of data to Google Cloud
Twitter started its migration from on-prem to Google BigQuery. The blog narrates the gains realized from the cloud migration and how the automated data ingestion framework eased the initial productivity and onboarding hiccups.
Sponsored: RudderStack - The Future of Customer Data Platforms: To Bundle or Not to Bundle?
Here, RudderStack examines the technical limitations behind the push to unbundle the CDP and assesses whether unbundling is the appropriate way to overcome these limitations.
Rovio Tech: Unlocking interactive dashboards at Rovio with Druid and Spark
Rovio writes about its adoption of Druid and its need to write its own custom Druid ingestion framework. In the past, I had tons of trouble with the Druid middle managers and zombie tasks left behind by unsuccessful segment commits. It is exciting to see Rovio's design without the Druid middle managers.
"How did we find out about Druid? We came across some existing BI tools that supported pivot charts: Turnilo and Superset."
I found this interesting quote in the blog; it highlights the importance of community integrations.
Adobe: Exploring Kafka Producer’s Internals
Kafka, or some implementation of the Kafka protocol, has become a de facto component of data infrastructure. Adobe writes an educational blog that narrates the internals of Kafka producers.
Cockroach Labs: Idempotency and Ordering in Event-Driven Systems
Idempotency and event ordering guarantees are critical properties to understand in event-driven architectures. The blog narrates these properties and walks through a few examples.
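As a rough illustration of the two properties (not taken from the Cockroach Labs article), here is a minimal Python sketch of a consumer that tracks a per-key sequence number. The `AccountProjection` class and the `account`, `seq`, and `delta` event fields are hypothetical. It assumes the producer assigns each key a gapless sequence starting at 1, which lets the consumer ignore redelivered events and detect out-of-order ones.

```python
from dataclasses import dataclass, field


@dataclass
class AccountProjection:
    """Applies balance events at most once and in per-account order."""
    balances: dict = field(default_factory=dict)
    last_seq: dict = field(default_factory=dict)  # highest applied seq per account

    def apply(self, event: dict) -> str:
        key, seq = event["account"], event["seq"]
        last = self.last_seq.get(key, 0)
        if seq <= last:
            # Redelivery of an event we already applied: safe to ignore.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier event has not arrived yet; do not apply.
            return "out_of_order"
        self.balances[key] = self.balances.get(key, 0) + event["delta"]
        self.last_seq[key] = seq
        return "applied"


projection = AccountProjection()
deposit = {"account": "a", "seq": 1, "delta": 100}
first = projection.apply(deposit)
second = projection.apply(deposit)  # same event delivered again
skipped = projection.apply({"account": "a", "seq": 3, "delta": 5})
```

Applying `deposit` twice leaves the balance unchanged after the first call, which is the idempotency property. The event with `seq` 3 is held back until `seq` 2 arrives, which is the ordering property; a real system would buffer or retry it rather than drop it.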
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.