Data Engineering Weekly #87

The Weekly Data Engineering Newsletter

May 23, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Let’s start this week with this thought.

Ananth Packkildurai@ananthdurai

Two hard problems in data engineering 1. Counting the data. 2. Deleting the data. As the data community, we actively talk about counting but fail to discuss deleting.

9:12 PM · May 20, 2022


There were some fantastic, lively discussions on data deletion, the need for it, and the cultural aspects within teams. The discussions added a few exciting reads to my list.

The IEEE March 2022 edition discusses data errors and privacy.

LinkedIn: One-stop MLOps portal at LinkedIn

Though the ML lifecycle has several phases, it is all part of the developer workflow. A single interface to manage the data/ML lifecycle yields a significant productivity gain. The LinkedIn blog on the One-Stop MLOps portal is an excellent reminder of the importance of developer productivity in the data workflow.

https://engineering.linkedin.com/blog/2022/one-stop-mlops-portal-at-linkedin

Adidas: Adidas Data Mesh Journey - Sharing data efficiently at scale

As data adoption grows, the data engineering challenges of the next few years will be figuring out data contract frameworks and sharing data efficiently, both internally and externally, across different organizations. I started my attempt to solve this problem with Schemata.

Read more about Schemata; I'm looking for contributors.

Adidas writes about its data mesh adoption design, which follows a similar principle of defining contracts between data producers and consumers.

https://medium.com/adidoescode/adidas-data-mesh-journey-sharing-data-efficiently-at-scale-c50ee671fbd7

HubSpot: How to Get Better at Updating Your Data Infrastructure

Infrastructure upgrades are always challenging and require disciplined engineering practices. HubSpot writes an exciting article describing the engineering practices for upgrading its data infrastructure.

https://product.hubspot.com/blog/updating-data-infrastructure

Lyft: Trino - Open Source Infrastructure Upgrading at Lyft

Staying on upgrades, Lyft writes about its Trino upgrade story. The article is an exciting read for understanding Trino internals and the effective use of flame graphs in performance debugging.

https://eng.lyft.com/trino-open-source-infrastructure-upgrading-at-lyft-83f26b099fa

Michael Toy: Designing Malloy — Introduction & The Syntactic Shell

Malloy is a query language for data, an alternative to SQL. It is an exciting framework that I'm actively following and looking forward to trying out. The blog is an exciting read for understanding the evaluation and design thinking behind building a framework.

https://medium.com/@michaeltoy/designing-malloy-0-introduction-88b8809d75d0

https://medium.com/@michaeltoy/designing-malloy-1-the-syntactic-shell-7216bcc9ffdf

Kovid Rathee: Data Quality and Testing Frameworks

Data Quality & testing frameworks play a vital role in establishing data contracts and healthy data exchange between the data producer & consumer. The author writes an excellent comparison blog on data quality & testing frameworks.

https://servian.dev/data-quality-and-testing-frameworks-316c09436ab2

Amit Prakash: The metrics layer has growing up to do

The metric layer is picking up momentum with the dbt metric layer and the introduction of MetricFlow. Discoverable, sharable, and reusable metric definitions offer a significant advantage in streamlining analytical processing. The author discusses various metric query patterns and the logical layer needed to integrate the metric layer.

https://prakasha.substack.com/p/the-metrics-layer-has-growing-up?s=r

Twitter: Scaling data access by moving an exabyte of data to Google Cloud

Twitter started its migration from on-prem to Google BigQuery. The blog narrates the gains realized from the cloud migration and how the automated data ingestion framework reduced the initial productivity and onboarding hiccups.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/scaling-data-access-by-moving-an-exabyte-of-data-to-google-cloud

Rovio Tech: Unlocking interactive dashboards at Rovio with Druid and Spark

Rovio writes about its adoption of Druid and the need to write its own custom Druid ingestion framework. In the past, I had tons of trouble with the Druid middle manager and zombie tasks from unsuccessful segment commits. It is exciting to see Rovio's design without the Druid middle managers.

"How did we find out about Druid? We came across some existing BI tools that supported pivot charts: Turnilo and Superset."

I found this interesting quote in the blog; it highlights the importance of community integrations.

https://medium.com/@Rovio_Tech/unlocking-interactive-dashboards-at-rovio-with-druid-and-spark-40f8fe6a0b05

Adobe: Exploring Kafka Producer’s Internals

Kafka, or a variation of the Kafka protocol implementation, has become a de facto component of data infrastructure. Adobe writes an educational blog that narrates the internals of Kafka producers.

https://medium.com/adobetech/exploring-kafka-producers-internals-37411b647d0f

Cockroach Labs: Idempotency and Ordering in Event-Driven Systems

Idempotency and event ordering guarantees are critical properties to understand in an event-driven architecture. The blog narrates these properties and walks through a few examples.

https://www.cockroachlabs.com/blog/idempotency-and-ordering-in-event-driven-systems/
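
To make the two properties concrete, here is a minimal Python sketch of a consumer that is both idempotent and order-preserving. It is my own illustration, not code from the Cockroach Labs post. It assumes every event carries a per-key sequence number starting at 1; the names `Event` and `OrderedIdempotentConsumer` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    key: str       # entity the event belongs to, e.g. an account id (hypothetical)
    sequence: int  # per-key sequence number starting at 1 (assumed)
    amount: int    # change to apply to the balance


@dataclass
class OrderedIdempotentConsumer:
    balances: dict = field(default_factory=dict)
    next_sequence: dict = field(default_factory=dict)  # key -> next expected sequence
    pending: dict = field(default_factory=dict)        # key -> {sequence: Event}

    def handle(self, event: Event) -> None:
        expected = self.next_sequence.get(event.key, 1)
        # Idempotency: an event that was already applied is a no-op on redelivery.
        if event.sequence < expected:
            return
        # Ordering: hold the event until every earlier event for this key has arrived.
        buffer = self.pending.setdefault(event.key, {})
        buffer[event.sequence] = event
        # Apply every buffered event that is now in order.
        while expected in buffer:
            ready = buffer.pop(expected)
            self.balances[ready.key] = self.balances.get(ready.key, 0) + ready.amount
            expected += 1
        self.next_sequence[event.key] = expected


consumer = OrderedIdempotentConsumer()
# Delivered out of order, with a duplicate of the first event at the end.
events = [Event("acct-1", 2, -30), Event("acct-1", 1, 100), Event("acct-1", 1, 100)]
for e in events:
    consumer.handle(e)
print(consumer.balances)
```

The sequence number does double duty here: anything below the expected value is treated as a duplicate, and anything above it waits in a buffer. A production system would also need to persist this state and bound the buffer, which the sketch leaves out.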

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
