Data Engineering Weekly #81

The Weekly Data Engineering Newsletter

Apr 04, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

George Xing: Building data-driven organizations

Ananth Packkildurai @ananthdurai

A gentle reminder to all data folks; more data != better decisions. A better decision requires data, clear thinking and time.

1:09 AM · Apr 1, 2022


The tweet triggered some exciting conversations, and George Xing pointed me to his blog series on building data-driven organizations. The series discusses the broad challenges of adopting data in an organization. Are we making decisions differently based on data? What are the challenges of including data in the decision-making process? How do we operationalize a better decision-making process?

https://georgexing.substack.com/p/what-it-means-to-be-data-driven

https://georgexing.substack.com/p/why-organizations-fail-to-make-data

https://georgexing.substack.com/p/building-data-driven-organizations

Andy Reagan: Non-SQL in your DBT pipeline

I wrote about the disjointed data workflow caused by the model and task approaches to data orchestration in Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref.

I'm happy to see that the conversation had already started in the dbt community, even before my article, in this thread: https://discourse.getdbt.com/t/representing-non-sql-models-in-a-dbt-dag/2083

The author summarizes the need to run both tasks and models and points to a promising tool, fal. I’m excited to see what comes out of the fal team.

Ananth Packkildurai @ananthdurai

The vision behind integrating python and SQL workload from @fal_ai_data is amazing, you should check this out.

Gorkem Yurtseven @gorkemyurt

https://t.co/hpmXMPyv8D

12:29 AM · Mar 31, 2022


https://andyreagan.medium.com/non-sql-in-your-dbt-pipeline-c7cef2091619

Lyft: How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue

Lyft writes about democratizing distributed computing through Kubernetes Spark and Fugue. The post covers several exciting approaches: describing cluster resources as code, reusable transformation logic with Fugue, and ephemeral clusters for running Hive queries.

The programming model to define the resource usage with the application is an exciting approach.

It will be fantastic to see orchestration engines take the next step in integrating serverless computing, where the application can define the CPU, memory, or even cost budget to run a computation.
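To make the idea concrete, here is a minimal sketch of declaring resource needs next to the code that uses them. Everything in it (`Resources`, `requires`, `plan`, `train_model`) is a hypothetical name invented for illustration; it is not the LyftLearn or Fugue API, only one way such a programming model could look.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource request attached to a task."""
    cpu: float
    memory_gb: float


def requires(cpu: float, memory_gb: float) -> Callable:
    """Decorator that records the resource request on the function itself."""
    def decorator(fn: Callable) -> Callable:
        fn.resources = Resources(cpu, memory_gb)
        return fn
    return decorator


@requires(cpu=4, memory_gb=16)
def train_model(rows: list) -> int:
    # Placeholder workload; a real task would do the actual computation.
    return len(rows)


def plan(fn: Callable) -> dict:
    """What an (assumed) orchestrator could read to size an ephemeral cluster."""
    r = fn.resources
    return {"task": fn.__name__, "cpu": r.cpu, "memory_gb": r.memory_gb}
```

In this sketch the user code carries its own sizing, so an orchestrator would only need to call `plan(train_model)` before provisioning; a cost budget could be added as one more field in the same way.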

https://eng.lyft.com/how-lyftlearn-democratizes-distributed-compute-through-kubernetes-spark-and-fugue-c0875b97c3d9

Pathways: Asynchronous Distributed Dataflow for ML

An excellent weekend read: Pathways uses a single-controller programming model to support richer computing patterns. As with LyftLearn in the previous blog, it is exciting to see the emerging pattern of defining the computing needs in the user code.

https://arxiv.org/abs/2203.12533

Motif Analytics: Why Can’t You “Pull Data Real Quick”?

A challenge of running a data analytics org is the help-desk-style request known as "can you pull the data real quick?" The author explains why one can't simply pull the data quickly, highlighting real-world data problems.

https://motifanalytics.medium.com/why-cant-you-pull-data-real-quick-318f90024712

Pinterest: Large Scale Hadoop Upgrade At Pinterest

A production upgrade without downtime is always challenging. Pinterest writes about its YARN cluster upgrade and discusses multiple upgrade patterns.

https://medium.com/pinterest-engineering/large-scale-hadoop-upgrade-at-pinterest-a23a112deb73

Gusto: End to End Feature Development at Gusto

Feature flagging is a core part of the modern software development and release process. Gusto writes about building a feature that empowers product managers to launch experiments and data scientists to iterate on models to improve user engagement.

https://engineering.gusto.com/end-to-end-feature-development/

Dunith Dhanushka: Build a real-time data analytics pipeline with Airbyte, Kafka, and Pinot

Airbyte is to SaaS applications what Debezium is to database changes: a CDC-style tool for capturing data at the source. How do you enable real-time analytics on SaaS CDC pipelines? The author walks through the steps to integrate Airbyte, Kafka, and Apache Pinot.

https://medium.com/event-driven-utopia/build-a-real-time-data-analytics-pipeline-with-airbyte-kafka-and-pinot-c9ff3c42dcf2

Beat: Data Build Tool (dbt) - The Beat story

Beat writes about its dbt adoption and the lessons learned along the way. The author describes areas where dbt could improve and the steps to migrate an existing workflow to dbt.

https://build.thebeat.co/data-build-tool-dbt-the-beat-story-a5c09471cf66

Sam Crowder: A (Recent) History of Batch Data

Batch computing has come a long way from the MapReduce programming model, which triggered innovation in file formats, table formats, and query engines. The author takes a brief historical view of how the different segments in this space evolved.

https://www.linkedin.com/pulse/recent-history-batch-data-sam-crowder/

Yotpo Engineering: Outbox with Debezium and Kafka — The hidden challenges

The Outbox pattern is a well-known and widely discussed integration pattern. The blog narrates the hidden technical challenges of adopting the Outbox pattern while implementing Change Data Capture with Debezium.

https://medium.com/yotpoengineering/outbox-with-debezium-and-kafka-the-hidden-challenges-998c00487ae4

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
