Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
George Xing: Building data-driven organizations
The tweet triggered some exciting conversations, where George Xing pointed me to one of his blogs about building data-driven organizations. The blog discusses the broad challenges of adopting data in an organization. Are we making decisions differently based on data? What are the challenges to including data in the decision-making process? How do we operationalize a better decision-making process?
https://georgexing.substack.com/p/what-it-means-to-be-data-driven
https://georgexing.substack.com/p/why-organizations-fail-to-make-data
https://georgexing.substack.com/p/building-data-driven-organizations
Andy Reagan: Non-SQL in your DBT pipeline
I wrote about the disjointed data workflow between the model and task approaches to data orchestration in Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref.
I'm happy to see that the conversation had started in the dbt community, in this thread, even before the article. https://discourse.getdbt.com/t/representing-non-sql-models-in-a-dbt-dag/2083
The author summarizes the need for running both tasks and models and points to a promising tool, fal. I’m excited to see what is coming out of the fal team.
https://andyreagan.medium.com/non-sql-in-your-dbt-pipeline-c7cef2091619
Lyft: How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue
Lyft writes about democratizing distributed computing through Kubernetes Spark and Fugue. The post covers some exciting approaches: describing cluster resources as code, reusable transformation logic with Fugue, and ephemeral clusters for running Hive queries.
The programming model for defining resource usage alongside the application code is an exciting approach.
It will be fantastic to see orchestration engines take the next step in integrating serverless computing, where the application can define the CPU, memory, or even cost budget to run a computation.
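To make the idea concrete, here is a minimal Python sketch of declaring resource needs next to the code that uses them. It is not LyftLearn's or Fugue's API: the `Resources` class, the `requires` decorator, and the `plan` helper are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource request attached to a task."""
    cpu: float                            # CPU cores requested
    memory_gb: float                      # memory requested, in GB
    max_cost_usd: Optional[float] = None  # optional cost ceiling


def requires(cpu, memory_gb, max_cost_usd=None):
    """Hypothetical decorator: record the resource request on the function."""
    def decorator(fn):
        fn.resources = Resources(cpu, memory_gb, max_cost_usd)
        return fn
    return decorator


@requires(cpu=4, memory_gb=16, max_cost_usd=2.0)
def train_model(rows):
    # Placeholder workload; a real task would do the actual computation.
    return len(rows)


def plan(fn):
    """What an imagined orchestrator could read before scheduling `fn`."""
    r = fn.resources
    return {
        "task": fn.__name__,
        "cpu": r.cpu,
        "memory_gb": r.memory_gb,
        "max_cost_usd": r.max_cost_usd,
    }
```

The point of the design is that the request travels with the function, so an orchestrator could size an ephemeral cluster from `plan(train_model)` without a separate configuration file.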
Pathways: Asynchronous Distributed Dataflow for ML
Pathways is an excellent weekend read. It describes a single-controller programming model for richer computing patterns. As with LyftLearn in the previous blog, it is exciting to see the emerging pattern of defining the computing need in the user code.
https://arxiv.org/abs/2203.12533
Sponsored: Firebolt - 5 Steps to Debug Your SQL Queries
Substituting code with SQL can oftentimes result in complex and long SQL statements that are not performing as fast as expected. If your query is running slow and you want to understand why - you’ve come to the right place.
https://www.firebolt.io/blog/5-steps-to-debug-your-complex-sql-queries-in-firebolt
Motif Analytics: Why Can’t You “Pull Data Real Quick”?
A challenge of running a data analytics org is the help-desk-style request: “Can you pull the data real quick?” The author explains why one can't simply pull the data quickly, highlighting real-world data problems.
https://motifanalytics.medium.com/why-cant-you-pull-data-real-quick-318f90024712
Pinterest: Large Scale Hadoop Upgrade At Pinterest
Upgrading a production system without downtime is always challenging. Pinterest writes about its YARN cluster upgrade and discusses multiple upgrade patterns.
https://medium.com/pinterest-engineering/large-scale-hadoop-upgrade-at-pinterest-a23a112deb73
Sponsored: Monte Carlo Data - IMPACT TOUR 2022 - The data leaders event series to learn key strategies to make an IMPACT with your data.
Join 3 virtual keynotes and 3 city stops to learn how data leaders are tackling the biggest challenges in data, from building more reliable stacks to hiring top talent for your team.
Register Now: https://www.impactdatatour.com
Gusto: End to End Feature Development at Gusto
Feature flagging is a core part of the modern software development and release process. Gusto writes about building a feature that empowers product managers to launch experiments and data scientists to iterate on models to improve user engagement.
https://engineering.gusto.com/end-to-end-feature-development/
Dunith Dhanushka: Build a real-time data analytics pipeline with Airbyte, Kafka, and Pinot
Airbyte brings change data capture (CDC) to SaaS applications, much as Debezium does for database changes. How do we enable real-time analytics on SaaS CDC pipelines? The author walks through the steps to integrate Airbyte, Kafka, and Apache Pinot.
Sponsored: RudderStack - How RudderStack Core Enabled Us To Build Reverse ETL
Principal Engineer Ranjeet Mishra details how the foresight of RudderStack's founding engineers made it easier to solve some of the unique technical challenges involved in building Reverse ETL.
https://www.rudderstack.com/blog/how-rudderstack-core-enabled-us-to-build-reverse-etl
Beat: Data Build Tool (dbt) - The Beat story
Beat writes about its dbt adoption and the lessons learned along the way. The author describes where dbt could improve and the steps to migrate an existing workflow to dbt.
https://build.thebeat.co/data-build-tool-dbt-the-beat-story-a5c09471cf66
Sam Crowder: A (Recent) History of Batch Data
Batch computing has come a long way since the MapReduce programming model, which triggered innovation in file formats, table formats, and query engines. The author takes a brief historical view of how the different segments of this space evolved.
https://www.linkedin.com/pulse/recent-history-batch-data-sam-crowder/
Yotpo Engineering: Outbox with Debezium and Kafka — The hidden challenges
The Outbox pattern is a well-known and widely discussed integration pattern. The blog narrates the hidden technical challenges of adopting the Outbox pattern while implementing Change Data Capture with Debezium.
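For readers new to the pattern, here is a minimal Python sketch of its core idea, using the standard-library sqlite3 module as a stand-in database. It is not Yotpo's implementation: the `orders` and `outbox` tables, their columns, and the `OrderPlaced` event are assumptions made for illustration. The business row and the event row are written in the same transaction, so a CDC tool such as Debezium can later stream the outbox table to Kafka without the risk of a change that was saved but never published.

```python
import json
import sqlite3


def create_schema(conn):
    # Illustrative tables; the names and columns are assumptions.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def place_order(conn, order_id, item):
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", json.dumps({"id": order_id, "item": item})),
        )


def pending_events(conn):
    # In a real deployment a CDC connector reads these rows from the database log.
    return conn.execute(
        "SELECT aggregate_id, event_type, payload FROM outbox ORDER BY id"
    ).fetchall()
```

A usage example: after `create_schema(conn)` and `place_order(conn, 1, "book")` on `sqlite3.connect(":memory:")`, `pending_events(conn)` returns one `OrderPlaced` row for order 1. A second `place_order` with the same id raises `sqlite3.IntegrityError` and leaves both tables unchanged.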
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.