Data Engineering Weekly #83
The Weekly Data Engineering Newsletter
Sponsored: Joybird’s Warehouse-First Customer Data Stack with Iterable, Snowflake, & RudderStack
Live on April 20th, the Director of Analytics from Joybird, a La-Z-Boy company, joins RudderStack, Snowflake, and Iterable to detail how his team retooled its data stack and reduced the time spent building integrations and managing data pipelines by 93%.
Mikkel Dengsøe: Data salaries at FAANG companies in 2022
Data is the new oil, but how much are companies paying to extract value from it? The author compares data practitioners' salaries at FAANG companies.
Benn Stancil: Has SQL gone too far?
Every time we flirt with some NoSQL alternative, we rebound to SQL. However, is SQL the appropriate abstraction for the emerging semantic & metrics layer? The author raises some interesting questions about the next-gen business query language, comparing dbt & Malloy.
Barr Moses: Is The Modern Data Warehouse Broken?
Is the data warehouse broken? Is the immutable data warehouse the cure? Though the article's title is about the immutable data warehouse, I feel its focal point is contract-driven, collaborative data engineering between data producers and consumers.
Bill Inmon: Burying Data Warehouse - RIP
Following up on the previous article, this is an interesting rebuttal to the “data warehouse is dead” argument. There have been several rather significant attempts at exterminating and bypassing the data warehouse, from the data lake to the data mesh. The author writes a historical walkthrough of these attempts and points out the data warehouse's resiliency.
Sponsored: Firebolt - SQL: Thinking in Lambdas
How do you perform operations on an array of arrays? Or on multiple correlated arrays in a table? Do you keep on UNNESTing, or is there a more easy-to-use, elegant approach to querying it? If anyone can help you navigate the world of arrays with ease – all using SQL and without the need to define new types of objects – it’s Octavian Zarzu.
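To make the UNNEST-versus-lambda contrast concrete, here is a minimal sketch in Python rather than Firebolt SQL. The table, the column names, and the helper functions are all hypothetical and only illustrate the idea: the first function flattens the nested arrays and regroups them, the second applies a lambda to each inner array in place.

```python
# Hypothetical rows: each row holds an array of arrays
# (for example, per-session lists of item prices).
rows = [
    {"id": 1, "sessions": [[10, 20], [5]]},
    {"id": 2, "sessions": [[7], [], [1, 2, 3]]},
]


def transform(arr, fn):
    """Apply fn to every element of arr (an analogue of a lambda-based array function)."""
    return [fn(x) for x in arr]


def totals_unnest_style(rows):
    """Flatten every value to (id, session index, value), then regroup and sum."""
    flat = [
        (r["id"], i, v)
        for r in rows
        for i, session in enumerate(r["sessions"])
        for v in session
    ]
    # Pre-fill zeros so empty inner arrays survive the flatten/regroup round trip.
    totals = {r["id"]: [0] * len(r["sessions"]) for r in rows}
    for row_id, i, v in flat:
        totals[row_id][i] += v
    return totals


def totals_lambda_style(rows):
    """Sum each inner array in place with a lambda; no flattening or regrouping."""
    return {r["id"]: transform(r["sessions"], lambda s: sum(s)) for r in rows}


print(totals_unnest_style(rows))
print(totals_lambda_style(rows))
```

Both functions return the same per-session totals; the lambda version simply never leaves the row, which is the ergonomic point the talk is pitching.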
LakeFS: Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared
The lakehouse architecture is evolving and gaining adoption quickly. The article compares the lakehouse table formats Apache Hudi, Apache Iceberg & Databricks Delta Lake. The recommendation is as follows.
Use Apache Iceberg if your primary pain point is the Hive metastore.
Use Apache Hudi if you need high flexibility in handling mutable & immutable data.
If you're a Databricks/Spark shop, use Delta Lake.
VFisa: An overview of Metric Layer offerings
Benn Stancil described the metrics layer as the missing piece of the modern data stack. Fast forward, and multiple commercial and open-source metrics layer systems are emerging. The author writes a comparative study of the metrics layers. Please suggest corrections on the sheet if you spot any inaccuracy.
EBook: [New Guide] The Big Book of Data Observability
Organizations spend 40% of their time tackling data downtime. Learn how to tackle data trust with best practices from today's data leaders.
DoorDash: 3 Principles for Building an ML Platform That Will Sustain Hyper-Growth
Every technology & process breaks at scale. How do you create a continuously sustainable ML platform? DoorDash writes about three principles to follow.
Dream Big, Start Small
1% better every day
LinkedIn: Open sourcing Feathr – LinkedIn’s feature store for productive machine learning
Preparing and managing features has been one of the most time-consuming parts of operating our ML applications at scale. Feature stores are emerging as the solution to manage the ML feature data. LinkedIn open sources its feature store named Feathr.
Even Oldridge: Recommender Systems, Not Just Recommender Models
Recommender systems have wide application. Many publications focus on the recommender model, but model scores alone aren’t enough to serve recommendations to users in most real-world contexts. The author writes about the four stages of a recommender system.
The talk on the same topic is an interesting watch.
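As a rough illustration of why a scoring model is only one piece, here is a minimal Python sketch of a four-stage pipeline. The stage names (retrieval, filtering, scoring, ordering) are as I recall them from the article; the catalog, the affinity weights, and every function below are hypothetical stand-ins, not the author's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    category: str
    popularity: float
    in_stock: bool


# Hypothetical catalog and user state, for illustration only.
CATALOG = [
    Item("a", "books", 0.9, True),
    Item("b", "books", 0.8, True),
    Item("c", "games", 0.7, False),
    Item("d", "games", 0.6, True),
    Item("e", "music", 0.5, True),
    Item("f", "garden", 0.95, True),
]
AFFINITY = {"books": 1.0, "games": 0.9, "music": 0.4}
SEEN = {"a"}


def retrieve(catalog, affinity, k):
    """Stage 1: cheaply narrow the catalog to k candidates in categories the user cares about."""
    pool = [item for item in catalog if item.category in affinity]
    return sorted(pool, key=lambda item: item.popularity, reverse=True)[:k]


def filter_items(candidates, seen):
    """Stage 2: drop items that must not be shown (out of stock or already seen)."""
    return [c for c in candidates if c.in_stock and c.item_id not in seen]


def score(candidates, affinity):
    """Stage 3: the 'model'; here a stand-in that multiplies affinity by popularity."""
    return [(c, affinity[c.category] * c.popularity) for c in candidates]


def order(scored, n, max_per_category=1):
    """Stage 4: rank by score, then apply a simple diversity cap per category."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    result, counts = [], {}
    for item, _ in ranked:
        if counts.get(item.category, 0) >= max_per_category:
            continue
        counts[item.category] = counts.get(item.category, 0) + 1
        result.append(item.item_id)
        if len(result) == n:
            break
    return result


def recommend(catalog, affinity, seen, k=5, n=2):
    candidates = retrieve(catalog, affinity, k)
    eligible = filter_items(candidates, seen)
    return order(score(eligible, affinity), n)


print(recommend(CATALOG, AFFINITY, SEEN))
```

Only the third function resembles what most papers discuss; the other three decide which items the model ever sees and which of its scores actually reach the user.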
Sponsored: RudderStack - How Does The Data Lakehouse Enhance The Customer Data Stack?
RudderStack explores how the unique technical challenges that come along with customer data make lakehouse architecture a natural fit for its storage and processing.
Preset: The Case for Dataset-Centric Visualization
The metrics layer, the semantic layer, or denormalized flat tables: what does it all mean for the visualization layer? Preset writes an exciting blog comparing query-centric, semantic-layer-centric & dataset-centric visualization and how Apache Superset approaches dataset-centric visualization.
Uber: Presto® on Apache Kafka® At Uber Scale
Uber writes about its usage of Presto to query Kafka. The article narrates the challenges of running the Presto connector for Kafka and some of the guardrails that limit network congestion from Presto queries. Ad hoc querying of streaming data sources is underinvested in most companies, and I'm thrilled to see the Presto improvements for Kafka.
An interesting talk to watch on the same topic.
AWS: Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3
S3 is the de facto data storage for bulk data processing in the AWS cloud. AWS writes about a few tuning techniques to optimize EMR and Glue data processing on S3.
Alluxio: From Zookeeper to Raft: How Alluxio Stores File System State with High Availability and Fault Tolerance
Alluxio implements a virtual distributed file system that lets compute engines like Hadoop or Spark access independent large data stores through a single interface. Alluxio writes about how running an external system like ZooKeeper causes inconsistency, and about its adoption of the Raft implementation from Apache Ratis.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.