Data Engineering Weekly #72
Weekly Data Engineering Newsletter
Benn Stancil: Service pressure
Is the data org a service organization? It doesn’t need to be, but there is a lot we can learn from service operations. The article is an exciting read for data teams and for any team that builds internal platforms to support business growth.
My highlight from the blog:
Data teams should remember that as well. We often chase big projects, like launching a new testing platform, building a new pricing forecast model, or refactoring core financial metrics. But the value of this work doesn’t discount the value of analytical maintenance.
LinkedIn: DARWIN - Data Science and Artificial Intelligence Workbench at LinkedIn
Context switching across multiple tools hampers developer productivity. It is one of my takes on the modern data stack.
LinkedIn writes an exciting blog about a similar problem and how DARWIN unifies the data science, data exploration, and business analytics workflows.
Uber: Cost Efficiency @ Scale in Big Data File Format
Uber writes an exciting article comparing the Snappy, GZip, and ZSTD compression algorithms. ZSTD is the clear winner on both compression and vCore-second savings. The discussion of column deletion support in Parquet and multi-column reordering is also informative.
The experiment is for one table and one partition, with four columns containing the type of UUID as string, timestamps as BIGINT, and lat/long as double, which we sort in different orders. The results show that data size is affected by ordering. Eventually, we see a 32% drop in the data size from no sorting to 4 columns sorting with the proper order (UUID, time_ms, latitude, longitude)
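A toy illustration of why row order changes compressed size. This is not Uber's experiment: the rows are synthetic, the layout (each column serialized and compressed on its own) is a simplification of a columnar file, and zlib from the Python standard library stands in for ZSTD. All names and parameters below are made up for the sketch.

```python
import random
import struct
import uuid
import zlib


def make_rows(n_rows=5000, n_ids=50, seed=0):
    """Build synthetic (uuid, time_ms, latitude, longitude) rows in random order."""
    rng = random.Random(seed)
    ids = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n_ids)]
    # Hypothetical assumption: each id stays near one "home" location.
    homes = {i: (rng.uniform(-60, 60), rng.uniform(-170, 170)) for i in ids}
    rows = []
    for _ in range(n_rows):
        i = rng.choice(ids)
        lat, lon = homes[i]
        rows.append((
            i,
            1_650_000_000_000 + rng.randrange(86_400_000),  # some time within one day
            lat + rng.uniform(-0.01, 0.01),
            lon + rng.uniform(-0.01, 0.01),
        ))
    return rows


def columnar_compressed_size(rows):
    """Serialize each column separately, compress it, and sum the compressed sizes."""
    columns = (
        "".join(r[0] for r in rows).encode(),            # UUID as string
        b"".join(struct.pack("<q", r[1]) for r in rows),  # timestamp as 64-bit int
        b"".join(struct.pack("<d", r[2]) for r in rows),  # latitude as double
        b"".join(struct.pack("<d", r[3]) for r in rows),  # longitude as double
    )
    return sum(len(zlib.compress(col)) for col in columns)


rows = make_rows()
unsorted_size = columnar_compressed_size(rows)
# Tuple sorting orders by (uuid, time_ms, latitude, longitude).
sorted_size = columnar_compressed_size(sorted(rows))
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting groups identical UUIDs into long runs and puts nearby coordinates next to each other, which gives the compressor more repetition to exploit. The exact savings depend entirely on the data and the codec.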
PrestoDB: Avoid Data Silos in Presto in Meta - the journey from Raptor to RaptorX
Presto-Raptor is a shared-nothing storage engine for Presto. It is an exciting tool that we attempted to use almost four years back but abandoned because it lacked documentation and momentum. It is exciting to see the Raptor architecture evolve into RaptorX, using Alluxio as the underlying data locality engine.
Gradient Flow: What is Graph Intelligence?
The blog highlights the current state of Graph Intelligence and how and why the best companies are adopting Graph Visual Analytics, Graph AI, and Graph Neural Networks. My highlight from the blog:
You don’t need a graph database: none of Meituan’s 30 GNNs use one.
Iteratively: Who should really own your tracking plan?
The blog is a year old, but ownership of event tracking is still a common source of confusion at many companies. The blog builds a strong case for why product managers should own the tracking plan.
Singular: Achieving fast upserts for Apache Druid
Many OLAP engines have their roots in ad-serving analytics, where events are primarily immutable. Upsert (mutability) is often an afterthought, which makes these engines harder to adopt for business process analytics. Singular writes about its workaround to make Apache Druid support upserts. It is one reason why I like the Apache Pinot design, whose row-level upsert support fits business process analytics well.
Pinterest: Experiment without the wait - Speeding up the iteration cycle with Offline Replay Experimentation
Could we predict experiment outcomes without even running an experiment? Pinterest writes about its Offline Replay Experimentation Framework, which simulates the performance of new ideas entirely offline using historical data.
Zendesk: Building reliability into uncertain event delivery
Head-of-line blocking is always a challenge when building queueing systems. Uber wrote about solving head-of-line blocking with its multi-threaded Kafka Consumer Proxy, which supports out-of-order commits. Zendesk writes about its job queue system, Event Job Distributor, which introduces SQS as a proxy layer.
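To make the out-of-order commit idea concrete, here is a minimal sketch of the general technique, not Uber's or Zendesk's implementation. Workers acknowledge messages in any order, and the committed offset only advances across a contiguous run of acknowledged offsets. The class and method names are hypothetical.

```python
class OffsetTracker:
    """Track out-of-order acks and expose the highest safely committable offset."""

    def __init__(self, start=0):
        # Every offset below `committed` has been processed.
        self.committed = start
        self._acked = set()

    def ack(self, offset):
        """Record a processed offset and return the new commit position."""
        if offset < self.committed:
            return self.committed  # duplicate ack, nothing to do
        self._acked.add(offset)
        # Advance only while there is no gap, so a slow message never gets skipped.
        while self.committed in self._acked:
            self._acked.remove(self.committed)
            self.committed += 1
        return self.committed


tracker = OffsetTracker()
tracker.ack(2)  # offsets 0 and 1 are still in flight, commit position stays put
tracker.ack(1)
tracker.ack(0)  # the gap closes and the commit position jumps past all three
print(tracker.committed)
```

A slow message at offset 0 no longer blocks workers from processing offsets 1 and 2; it only delays when the commit position moves.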
Workday: Scaling Multi-tenanted Machine Learning Applications on Kubernetes
Workday writes about its multi-tenant, reusable model-serving infrastructure and the sharding strategies available to scale it. An interesting approach is the bin-packed shared model, with cost functions applied to each tenant/model.
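As a rough illustration of bin packing by cost, here is a first-fit-decreasing sketch. It is a generic heuristic, not Workday's algorithm, and the tenant names, cost units, and capacity are made up; a real cost function could combine memory, request rate, or latency targets.

```python
def pack_models(costs, capacity):
    """Assign models to shared servers using first-fit decreasing.

    costs: mapping of model name to its cost (hypothetical units).
    capacity: total cost one server can hold.
    Returns a list of servers, each a list of model names.
    """
    servers = []  # each entry: {"free": remaining capacity, "models": [...]}
    # Place the most expensive models first; ties keep insertion order.
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        if cost > capacity:
            raise ValueError(f"{name} does not fit on a single server")
        for server in servers:
            if server["free"] >= cost:
                server["models"].append(name)
                server["free"] -= cost
                break
        else:
            # No existing server has room, so open a new one.
            servers.append({"free": capacity - cost, "models": [name]})
    return [server["models"] for server in servers]


costs = {"tenant_a": 6, "tenant_b": 5, "tenant_c": 4, "tenant_d": 3, "tenant_e": 2}
placement = pack_models(costs, capacity=10)
print(placement)
```

First-fit decreasing is not guaranteed to be optimal, but it is simple and usually lands close to the minimum number of servers.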
Joom: Spark on Kubernetes in 2022
Joom writes about the current state of running Apache Spark on Kubernetes and the lessons learned along the way. The blog raises an interesting fact to think about:
With AWS EMR, you pay for the EC2 instances and the EMR itself. You can use spot instances to reduce EC2 costs, but the EMR surcharge can add 50% to the total bill.
I get that EMR supports auto scaling, but other than that, it is simply a package of open-source systems. I still don't understand why AWS adds this "Amazon EMR Price" on top of the EC2 price.
For EKS, you pay $0.10 per hour for each Amazon EKS cluster you create.
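A back-of-the-envelope comparison of the two pricing models quoted above, as a minimal sketch. Only the 50% surcharge figure and the $0.10 per cluster-hour EKS fee come from the quotes. The monthly EC2 bill is a made-up number, the 30-day month is a simplification, and treating the surcharge as a fraction of EC2 spend is my reading of the quote.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def emr_monthly_cost(ec2_cost, surcharge_rate=0.5):
    """EC2 spend plus an EMR surcharge proportional to it (assumed interpretation)."""
    return ec2_cost * (1 + surcharge_rate)


def eks_monthly_cost(ec2_cost, clusters=1, fee_per_cluster_hour=0.10):
    """EC2 spend plus a flat per-cluster hourly fee."""
    return ec2_cost + clusters * fee_per_cluster_hour * HOURS_PER_MONTH


ec2_bill = 10_000.0  # hypothetical monthly EC2 spend in USD
print(f"EMR: ${emr_monthly_cost(ec2_bill):,.2f}")
print(f"EKS: ${eks_monthly_cost(ec2_bill):,.2f}")
```

The point of the sketch is the shape of the two formulas: one overhead grows with the size of the cluster, while the other stays flat per cluster.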
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.