Data Engineering Weekly #81
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
George Xing: Building data-driven organizations
The tweet triggered some exciting conversations, where George Xing pointed me to one of his blogs about building data-driven organizations. The blog discusses the broad challenges of adopting data in an organization. Are we making decisions differently based on data? What are the challenges of including data in the decision-making process? How do we operationalize a better decision-making process?
Andy Reagan: Non-SQL in your DBT pipeline
I wrote about the disjointed data workflow between the model and task approaches to data orchestration in Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref.
I'm happy to see that the conversation had already started in the dbt community, in this thread, even before the article.
Lyft: How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue
Lyft writes about democratizing distributed computing through Kubernetes Spark and Fugue. The exciting approaches include describing cluster resources as code, reusable transformation logic with Fugue, and ephemeral clusters for running Hive queries.
A programming model that defines resource usage alongside the application is an exciting approach.
It will be fantastic to see orchestration engines take the next step and integrate serverless computing, where the application can define the CPU, memory, or even cost budget for running a computation.
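To make the idea concrete, here is a minimal sketch of what declaring compute needs in user code could look like. It is not LyftLearn's or any orchestrator's actual API: `ResourceSpec`, `requires`, `TASK_RESOURCES`, `plan`, and the `aggregate_rides` task are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical resource request attached to a unit of work."""
    cpu_cores: float
    memory_gb: float
    max_cost_usd: Optional[float] = None  # optional budget cap (assumption)


# Illustrative registry: task name -> declared resources.
TASK_RESOURCES: Dict[str, ResourceSpec] = {}


def requires(cpu_cores: float, memory_gb: float,
             max_cost_usd: Optional[float] = None) -> Callable:
    """Decorator that records the resources a function asks for."""
    def decorator(func: Callable) -> Callable:
        TASK_RESOURCES[func.__name__] = ResourceSpec(cpu_cores, memory_gb, max_cost_usd)
        return func
    return decorator


@requires(cpu_cores=4, memory_gb=16, max_cost_usd=2.50)
def aggregate_rides(rows):
    # The business logic stays plain Python; a scheduler would read the spec.
    return sum(row["fare"] for row in rows)


def plan(task_name: str) -> str:
    """Stand-in for an orchestrator deciding how to provision the task."""
    spec = TASK_RESOURCES[task_name]
    budget = f", budget ${spec.max_cost_usd:.2f}" if spec.max_cost_usd is not None else ""
    return f"{task_name}: {spec.cpu_cores} CPU, {spec.memory_gb} GB{budget}"


print(plan("aggregate_rides"))
```

The point of the sketch is the separation: the function body never mentions infrastructure, while the declaration next to it gives a scheduler enough information to size (and potentially price) the run.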
Pathways: Asynchronous Distributed Dataflow for ML
An excellent weekend read: Pathways, with its single-controller programming model for richer computing patterns. As with LyftLearn in the previous blog, it is exciting to see the emerging pattern of defining computing needs in user code.
Sponsored: Firebolt - 5 Steps to Debug Your SQL Queries
Substituting code with SQL can often result in long, complex SQL statements that do not perform as fast as expected. If your query is running slowly and you want to understand why, you’ve come to the right place.
Motif Analytics: Why Can’t You “Pull Data Real Quick”?
A challenge of running a data analytics org is the help-desk-style request: "Can you pull the data real quick?" The author explains why one can't simply pull the data quickly, highlighting real-world data problems.
Pinterest: Large Scale Hadoop Upgrade At Pinterest
Upgrading a production system without downtime is always challenging. Pinterest writes about its YARN cluster upgrade and discusses multiple upgrade patterns.
Sponsored: Monte Carlo Data - IMPACT TOUR 2022 - The data leaders event series to learn key strategies to make an IMPACT with your data.
Join 3 virtual keynotes and 3 city stops to learn how data leaders are tackling the biggest challenges in data, from building more reliable stacks to hiring top talent for your team.
Gusto: End to End Feature Development at Gusto
Feature flagging is a core part of the modern software development and release process. Gusto writes about building a feature that empowers product managers to launch experiments and data scientists to iterate on models to improve user engagement.
Dunith Dhanushka: Build a real-time data analytics pipeline with Airbyte, Kafka, and Pinot
Airbyte plays a role for SaaS application data similar to the one Debezium plays for database changes. How do you enable real-time analytics on SaaS CDC pipelines? The author walks through the steps to integrate Airbyte, Kafka, and Apache Pinot.
Sponsored: RudderStack - How RudderStack Core Enabled Us To Build Reverse ETL
Principal Engineer Ranjeet Mishra details how the foresight of RudderStack's founding engineers made it easier to solve some of the unique technical challenges involved in building Reverse ETL.
Beat: Data Build Tool (dbt) - The Beat story
Beat writes about its dbt adoption and the lessons learned along the way. The author describes what could be improved in dbt and the steps to migrate existing workflows to dbt.
Sam Crowder: A (Recent) History of Batch Data
Batch computing has come a long way since the MapReduce programming model, which triggered innovation in file formats, table formats, and query engines. The author takes a brief historical view of how the different segments of this space evolved.
Yotpo Engineering: Outbox with Debezium and Kafka — The hidden challenges
The Outbox pattern is a well-known and widely discussed integration pattern. The blog narrates the hidden technical challenges of adopting the Outbox pattern while implementing Change Data Capture with Debezium.
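For readers new to the pattern, the core idea fits in a few lines: the business row and the event describing it are written in the same database transaction, so neither can exist without the other. The sketch below is not taken from the Yotpo article. It uses Python's built-in `sqlite3` module, and the `orders` and `outbox` tables, their columns, and the `OrderPlaced` event are illustrative assumptions. In a Debezium setup the outbox table's changes would be captured from the database log; here a plain `SELECT` stands in for that reader.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one business table and one outbox table.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER, event_type TEXT, payload TEXT)"
    )


def place_order(conn, order_id, customer, total):
    """Write the business row and its event in one transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
            (order_id, customer, total),
        )
        payload = json.dumps({"order_id": order_id, "customer": customer, "total": total})
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", payload),
        )


def pending_events(conn):
    """Stand-in for the CDC reader: list outbox events in insertion order."""
    rows = conn.execute("SELECT event_type, payload FROM outbox ORDER BY id").fetchall()
    return [(event_type, json.loads(payload)) for event_type, payload in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, 1, "alice", 42.0)
print(pending_events(conn))
```

Because both inserts share a transaction, a failed order write leaves no orphan event behind, which is the guarantee the pattern exists to provide.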
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.