Data Engineering Weekly #72

Weekly Data Engineering Newsletter

Jan 31, 2022

Benn Stancil: Service pressure

Is the data org a service organization? It doesn’t need to be, but there is a lot we can learn from service operations. The article is an exciting read for data teams and for any team that builds an internal platform to support business growth.

My highlight of the blog is:

Data teams should remember that as well. We often chase big projects, like launching a new testing platform, building a new pricing forecast model, or refactoring core financial metrics. But the value of this work doesn’t discount the value of analytical maintenance.

benn.substack

Service pressure


LinkedIn: DARWIN - Data Science and Artificial Intelligence Workbench at LinkedIn

Context switching across multiple tools hampers developer productivity. It is one of my takes on the modern data stack.

Ananth Packkildurai @ananthdurai

@sarahmk125 MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks lives more complicated.

LinkedIn writes an exciting blog about a similar problem and how DARWIN unifies the data science, data exploration, and business analytics workflows.

https://engineering.linkedin.com/blog/2022/darwin--data-science-and-artificial-intelligence-workbench-at-li

Uber: Cost Efficiency @ Scale in Big Data File Format

Uber writes an exciting article comparing the Snappy, GZip, and ZSTD compression algorithms. ZSTD is the clear winner on both compression and vCore-second savings. The sections on column deletion support in Parquet and multi-column reordering are informative.

The experiment is for one table and one partition, with four columns containing the type of UUID as string, timestamps as BIGINT, and lat/long as double, which we sort in different orders. The results show that data size is affected by ordering. Eventually, we see a 32% drop in the data size from no sorting to 4 columns sorting with the proper order (UUID, time_ms, latitude, longitude)
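To get a feel for why row ordering changes the compressed size, here is a minimal sketch on synthetic data. It is not Uber's experiment: the row counts, value ranges, and the use of zlib from the Python standard library (instead of Parquet with Snappy or ZSTD) are all assumptions made so the example runs anywhere.

```python
import random
import struct
import uuid
import zlib


def make_rows(num_rows=20_000, num_ids=50, seed=7):
    """Build synthetic (uuid, time_ms, latitude, longitude) rows (hypothetical data)."""
    rng = random.Random(seed)
    ids = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(num_ids)]
    rows = []
    for _ in range(num_rows):
        rows.append((
            rng.choice(ids),
            1_640_995_200_000 + rng.randrange(86_400_000),  # within one day, in ms
            round(rng.uniform(37.0, 38.0), 4),
            round(rng.uniform(-123.0, -122.0), 4),
        ))
    return rows


def columnar_compressed_size(rows):
    """Lay values out column by column, then compress each column with zlib."""
    uuids, times, lats, lons = zip(*rows)
    columns = [
        "".join(uuids).encode("ascii"),
        struct.pack(f"<{len(times)}q", *times),  # BIGINT as 8-byte integers
        struct.pack(f"<{len(lats)}d", *lats),    # doubles
        struct.pack(f"<{len(lons)}d", *lons),
    ]
    return sum(len(zlib.compress(col, 6)) for col in columns)


rows = make_rows()
unsorted_size = columnar_compressed_size(rows)
# Tuple sorting orders by uuid, then time_ms, then latitude, then longitude.
sorted_size = columnar_compressed_size(sorted(rows))
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting groups identical UUIDs into long runs, which a general-purpose compressor encodes far more cheaply than the same values scattered across the column. The exact savings depend on the data and the codec, so treat the numbers printed here as illustrative only.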

https://eng.uber.com/cost-efficiency-big-data/

PrestoDB: Avoid Data Silos in Presto in Meta - the journey from Raptor to RaptorX

Presto-Raptor is a shared-nothing storage engine for Presto. It is an exciting tool that we attempted to use almost four years back but moved on from because it lacked documentation and momentum. It is exciting to see the evolution of the Raptor architecture into RaptorX, which uses Alluxio as the underlying data locality engine.

https://prestodb.io/blog/2022/01/28/avoid-data-silos-in-presto-in-meta.html

Gradient Flow: What is Graph Intelligence?

The blog highlights the current state of Graph Intelligence and how and why the best companies are adopting Graph Visual Analytics, Graph AI, and Graph Neural Networks. The highlight of the blog:

You don’t need a graph database: none of Meituan’s 30 GNNs use one.

https://gradientflow.com/what-is-graph-intelligence

Iteratively: Who should really own your tracking plan?

The blog is one year old, but ownership of event tracking remains a common source of confusion at many companies. The blog builds a strong case for why product managers should own the tracking plan.

https://iterative.ly/blog/tracking-plan-ownership

Singular: Achieving fast upserts for Apache Druid

Many OLAP engines have their roots in ad-serving analytics, where events are primarily immutable. Upsert (mutability) is often an afterthought, which makes these engines harder to adopt for business process analytics. Singular writes about its workaround to make Apache Druid support upserts. It is one reason why I like the Apache Pinot design, whose row-level upsert support fits business process analytics well.
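For readers new to the term, upsert semantics boil down to "one row per primary key, latest version wins". The sketch below illustrates only that idea in plain Python. It is not Singular's Druid workaround or Pinot's implementation, and the `order_id` and `updated_at` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    order_id: str    # hypothetical primary key
    updated_at: int  # hypothetical comparison column (epoch ms)
    status: str


def latest_by_key(events):
    """Keep one row per primary key: the one with the greatest updated_at."""
    latest = {}
    for event in events:
        current = latest.get(event.order_id)
        if current is None or event.updated_at >= current.updated_at:
            latest[event.order_id] = event
    return latest


events = [
    Event("o-1", 100, "created"),
    Event("o-2", 110, "created"),
    Event("o-1", 150, "shipped"),
    Event("o-1", 120, "paid"),  # late arrival, older than "shipped"
]
view = latest_by_key(events)
print({key: row.status for key, row in view.items()})
# {'o-1': 'shipped', 'o-2': 'created'}
```

An append-only engine stores all four rows, so every query has to do this deduplication itself. An engine with native upsert support does it at ingestion time.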

https://singular-engineering-blog.medium.com/achieving-fast-upserts-for-apache-druid-db6c33fba466

Pinterest: Experiment without the wait - Speeding up the iteration cycle with Offline Replay Experimentation

Could we predict experiment outcomes without even running an experiment? Pinterest writes about its Offline Replay Experimentation Framework, which simulates the performance of new ideas entirely offline using historical data.
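One textbook way to score a new idea against historical logs is the replay estimator: keep only the logged events where the candidate policy would have taken the same action that was actually logged, and average their rewards. The sketch below shows that generic technique. I am not claiming it is how Pinterest's framework works, the log format and policy are hypothetical, and the estimate is only trustworthy when the logged actions were chosen at random.

```python
def replay_estimate(logged_events, candidate_policy):
    """Average reward over logged events where the candidate agrees with the log."""
    matched = [
        reward
        for context, logged_action, reward in logged_events
        if candidate_policy(context) == logged_action
    ]
    if not matched:
        return None  # the candidate never agreed with the log
    return sum(matched) / len(matched)


# Hypothetical log: (user segment, variant shown, 1 if the user engaged).
logs = [
    ("new_user", "A", 1),
    ("new_user", "B", 0),
    ("returning", "A", 0),
    ("returning", "B", 1),
    ("new_user", "A", 0),
]


def candidate(segment):
    """Hypothetical new idea: show A to new users and B to returning users."""
    return "A" if segment == "new_user" else "B"


estimate = replay_estimate(logs, candidate)
print(f"estimated engagement rate: {estimate:.3f}")
```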

https://medium.com/pinterest-engineering/experiment-without-the-wait-speeding-up-the-iteration-cycle-with-offline-replay-experimentation-7a4a95fa674b

Zendesk: Building reliability into uncertain event delivery

Head-of-line blocking is always a challenge when building queueing systems. Uber has written about solving head-of-line blocking with its multi-threaded Kafka Consumer Proxy and out-of-order commit support. Zendesk writes about its job queue system, Event Job Distributor, which introduces SQS as a proxy layer.
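The idea behind out-of-order commits can be sketched in a few lines: workers acknowledge messages in whatever order they finish, and the consumer commits only up to the highest contiguous acknowledged offset. This is a simplified illustration, not Uber's or Zendesk's code, and the `OffsetTracker` class is a hypothetical name.

```python
class OffsetTracker:
    """Track out-of-order acks and expose the highest safely committable offset."""

    def __init__(self, start_offset=0):
        self.next_to_commit = start_offset  # lowest offset not yet acked
        self.acked = set()                  # acked offsets at or above the watermark

    def ack(self, offset):
        """Mark an offset as processed and return the new commit position."""
        if offset >= self.next_to_commit:
            self.acked.add(offset)
        # Advance the watermark over every contiguous acked offset.
        while self.next_to_commit in self.acked:
            self.acked.remove(self.next_to_commit)
            self.next_to_commit += 1
        return self.next_to_commit


tracker = OffsetTracker()
# Offset 0 is the slow message; offsets 1 to 3 finish first.
positions = [tracker.ack(offset) for offset in (2, 1, 3, 0)]
print(positions)  # [0, 0, 0, 4]
```

The slow message at offset 0 no longer stops the others from being processed. It only holds back the commit position, which jumps to 4 once it completes.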

https://medium.com/zendesk-engineering/building-reliability-into-uncertain-event-delivery-a09db0750ef9

Workday: Scaling Multi-tenanted Machine Learning Applications on Kubernetes

Workday writes about its multi-tenant, reusable model-serving infrastructure and the sharding strategies available to scale it. An interesting approach is the bin-packed shared model, with cost functions applied to each tenant/model.
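As a rough illustration of bin packing by cost, here is a first-fit-decreasing sketch. It is a generic heuristic rather than Workday's actual algorithm, and the tenant names, costs, and shard capacity are hypothetical.

```python
def first_fit_decreasing(model_costs, shard_capacity):
    """Pack models onto shards, largest cost first, opening shards as needed."""
    shards = []  # each shard: {"free": remaining capacity, "models": [...]}
    for name, cost in sorted(model_costs.items(), key=lambda kv: -kv[1]):
        if cost > shard_capacity:
            raise ValueError(f"{name} does not fit on any shard")
        for shard in shards:
            if shard["free"] >= cost:
                shard["free"] -= cost
                shard["models"].append(name)
                break
        else:
            # No existing shard has room, so open a new one.
            shards.append({"free": shard_capacity - cost, "models": [name]})
    return [shard["models"] for shard in shards]


# Hypothetical per-tenant model costs (for example, memory in GB).
costs = {"tenant-a": 6, "tenant-b": 5, "tenant-c": 4, "tenant-d": 3, "tenant-e": 2}
placement = first_fit_decreasing(costs, shard_capacity=10)
print(placement)
# [['tenant-a', 'tenant-c'], ['tenant-b', 'tenant-d', 'tenant-e']]
```

In practice the cost function is the interesting part: it could combine memory, request rate, or latency targets per tenant/model, and the packing step stays the same.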

https://medium.com/workday-engineering/scaling-multi-tenanted-machine-learning-applications-on-kubernetes-3f744ae543e2

Joom: Spark on Kubernetes in 2022

Joom writes about the current state of running Apache Spark on Kubernetes and the lessons learned along the way. The blog raises an interesting fact to think about:

With AWS EMR, you pay for the EC2 instances and the EMR itself. You can use spot instances to reduce EC2 costs, but the EMR surcharge can add 50% to the total bill.

I get that EMR supports auto scaling, but other than that, it is simply a package of open-source systems. I still don't understand why AWS adds this "Amazon EMR Price" on top of the EC2 price. For EKS, you pay $0.10 per hour for each Amazon EKS cluster you create.
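A quick back-of-the-envelope calculation shows how a per-instance surcharge can reach 50% once spot pricing shrinks the EC2 part of the bill. Every price below is a hypothetical placeholder except the $0.10 per hour EKS cluster fee mentioned above. I am also assuming the EMR fee is charged per instance-hour regardless of spot discounts, so check the current AWS pricing pages before relying on this.

```python
def hourly_cost(ec2_rate, nodes, surcharge_per_node=0.0, cluster_fee=0.0):
    """Hourly bill: per-node EC2 rate plus per-node surcharge, plus a flat cluster fee."""
    return nodes * (ec2_rate + surcharge_per_node) + cluster_fee


# Hypothetical prices in USD per hour; not real AWS list prices.
spot_rate = 0.10        # assumed spot price for one instance
emr_fee = 0.05          # assumed EMR surcharge for the same instance
eks_cluster_fee = 0.10  # EKS fee per cluster, as quoted above
nodes = 20

ec2_only = hourly_cost(spot_rate, nodes)
emr_total = hourly_cost(spot_rate, nodes, surcharge_per_node=emr_fee)
eks_total = hourly_cost(spot_rate, nodes, cluster_fee=eks_cluster_fee)

emr_overhead = (emr_total - ec2_only) / ec2_only
eks_overhead = (eks_total - ec2_only) / ec2_only
print(f"EMR adds {emr_overhead:.0%}, EKS adds {eks_overhead:.0%} on top of EC2")
```

With these made-up numbers the EMR surcharge scales with the node count, while the EKS fee stays flat per cluster, which is why the gap widens as the cluster grows.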

https://medium.com/@vladimir.prus/spark-on-kubernetes-in-2022-32458999e831

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
