Data Engineering Weekly #83

The Weekly Data Engineering Newsletter

Apr 18, 2022

Mikkel Dengsøe: Data salaries at FAANG companies in 2022

Data is the new oil, but how much are companies paying to extract value from it? The author compares data practitioners' salaries at FAANG companies.

https://medium.com/@mikldd/data-salaries-at-faang-companies-in-2022-29d5b56b2428

Benn Stancil: Has SQL gone too far?

Every time we flirt with some NoSQL alternative, we rebound to SQL. However, is SQL the appropriate abstraction for the emerging semantic & metrics layer? The author raises some interesting questions about the next-generation business query language, comparing dbt & Malloy.

https://benn.substack.com/p/has-sql-gone-too-far

Barr Moses: Is The Modern Data Warehouse Broken?

Is the data warehouse broken? Is the immutable data warehouse the cure? Though the article's framing is the immutable data warehouse, I feel its focal point is contract-driven, collaborative data engineering between data producers and consumers.

https://towardsdatascience.com/is-the-modern-data-warehouse-broken-1c9cbfddec3e

Bill Inmon: Burying Data Warehouse - RIP

Following up on the previous article, this is an interesting rebuttal to the “data warehouse is dead” argument. There have been several rather significant attempts at exterminating and bypassing the data warehouse, from the data lake to the data mesh. The author writes a historical walkthrough of these attempts and points out the data warehouse's resiliency.

https://www.linkedin.com/pulse/buying-data-warehouse-rip-bill-inmon/

LakeFS: Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

Lakehouse architecture is evolving and seeing fast adoption. The article compares the lakehouse table formats Apache Hudi, Apache Iceberg & Databricks Delta Lake. The recommendation is as follows.

Use Apache Iceberg if your primary pain point is the Hive metastore.
Use Apache Hudi if you need high flexibility in handling mutable & immutable data.
Use Delta Lake if you're a Databricks/Spark shop.

https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/

VFisa: An overview of Metric Layer offerings

Benn Stancil called the metrics layer the missing piece of the modern data stack. Fast forward, and multiple commercial and open-source metrics layer systems are emerging. The author writes a comparative study of the metrics layers. Please suggest corrections on the sheet if you spot any inaccuracy.

https://medium.com/@vfisa/an-overview-of-metric-layer-offerings-a9ddcffb446e

EBook: [New Guide] The Big Book of Data Observability

Organizations spend 40% of their time tackling data downtime. Learn how to tackle data trust with best practices from today's data leaders.

Download the Big Book of Data Observability

DoorDash: 3 Principles for Building an ML Platform That Will Sustain Hyper-Growth

Every technology & process breaks at scale. How do you create a continuously sustainable ML platform? DoorDash writes about three principles to follow.

Dream Big, Start Small
1% better every day
Customer Obsession

https://doordash.engineering/2022/04/12/3-principles-for-building-an-ml-platform/

LinkedIn: Open sourcing Feathr – LinkedIn’s feature store for productive machine learning

Preparing and managing features is one of the most time-consuming parts of operating ML applications at scale. Feature stores are emerging as the solution for managing ML feature data. LinkedIn open sources its feature store, named Feathr.

https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m

Even Oldridge: Recommender Systems, Not Just Recommender Models

Recommender systems have wide application. Many publications focus on the recommender model, but a model's scores alone aren’t enough to serve recommendations to users in most real-world contexts. The author writes about the four stages of a recommender system.

The talk on the same topic is an interesting watch.

https://medium.com/nvidia-merlin/recommender-systems-not-just-recommender-models-485c161c755e

Preset: The Case for Dataset-Centric Visualization

The metrics layer, the semantic layer, or the denormalized flat table: what do they all mean for the visualization layer? Preset writes an exciting blog comparing query-centric, semantic-layer-centric & dataset-centric visualization and how Apache Superset approaches dataset-centric visualization.

https://preset.io/blog/dataset-centric-visualization/

Uber: Presto® on Apache Kafka® At Uber Scale

Uber writes about its usage of Presto to query Kafka. The article narrates the challenges of running the Presto connector for Kafka and some of the guardrails that limit network congestion from Presto queries. Ad hoc querying of streaming data sources is underinvested in most companies, and I'm thrilled to see the Presto improvements for Kafka.

An interesting talk to watch on the same topic.

https://eng.uber.com/presto-on-apache-kafka-at-uber-scale/

AWS: Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

S3 is the de facto data storage for bulk data processing in the AWS cloud. AWS writes about a few tuning techniques to optimize EMR and Glue data processing on S3.

https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-data-access-performance-from-amazon-emr-and-aws-glue-to-amazon-s3/

Alluxio: From Zookeeper to Raft: How Alluxio Stores File System State with High Availability and Fault Tolerance

Alluxio implements a virtual distributed file system that lets compute engines like Hadoop or Spark access independent large data stores through a single interface. Alluxio writes about how running external systems like ZooKeeper causes inconsistency, and about its adoption of the Raft implementation from Apache Ratis.

https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
