Data Engineering Weekly

Data Engineering Weekly #81

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Apr 4, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


George Xing: Building data-driven organizations

Ananth Packkildurai @ananthdurai
A gentle reminder to all data folks; more data != better decisions. A better decision requires data, clear thinking and time.
1:09 AM ∙ Apr 1, 2022

The tweet triggered some exciting conversations, in which George Xing pointed me to one of his blog posts about building data-driven organizations. The post discusses the broad challenges of adopting data in an organization. Are we making decisions differently based on data? What are the challenges of including data in the decision-making process? How do we operationalize a better decision-making process?

https://georgexing.substack.com/p/what-it-means-to-be-data-driven

https://georgexing.substack.com/p/why-organizations-fail-to-make-data

https://georgexing.substack.com/p/building-data-driven-organizations


Andy Reagan: Non-SQL in your DBT pipeline

I wrote about the disjointed data workflow between the model and task approaches to data orchestration in Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref.

I'm happy to see that the conversation had started in the dbt community even before that article, in this thread: https://discourse.getdbt.com/t/representing-non-sql-models-in-a-dbt-dag/2083

The author summarizes the need to run both tasks and models in one pipeline and points to a promising tool, fal. I’m excited to see what comes out of the fal team.

Ananth Packkildurai @ananthdurai
The vision behind integrating python and SQL workload from @fal_ai_data is amazing, you should check this out.
Gorkem Yurtseven @gorkemyurt
https://t.co/hpmXMPyv8D
12:29 AM ∙ Mar 31, 2022

https://andyreagan.medium.com/non-sql-in-your-dbt-pipeline-c7cef2091619


Lyft: How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue

Lyft writes about democratizing distributed computing through Kubernetes Spark and Fugue. The post covers several exciting approaches: describing cluster resources as code, reusable transformation logic with Fugue, and ephemeral clusters for running Hive queries.

The programming model, which defines resource usage alongside the application code, is an exciting approach.

It will be fantastic to see orchestration engines take the next step and integrate serverless computing, where the application can define the CPU, memory, or even cost budget for running a computation.

https://eng.lyft.com/how-lyftlearn-democratizes-distributed-compute-through-kubernetes-spark-and-fugue-c0875b97c3d9
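
To make the idea of declaring resources next to the application code concrete, here is a minimal Python sketch. It is not LyftLearn's or Fugue's API: the `requires` decorator, the `Resources` class, the `plan` function, and the pricing numbers are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource request declared alongside a task."""
    cpu: float                             # number of CPU cores
    memory_gb: float                       # memory in GB
    max_cost_usd: Optional[float] = None   # optional spend ceiling


def requires(cpu: float, memory_gb: float,
             max_cost_usd: Optional[float] = None) -> Callable:
    """Attach a resource request to a function as metadata."""
    def decorator(fn: Callable) -> Callable:
        fn.resources = Resources(cpu, memory_gb, max_cost_usd)
        return fn
    return decorator


@requires(cpu=4, memory_gb=16, max_cost_usd=2.0)
def train_model(rows):
    # Stand-in for real work: return the mean of the input values.
    return sum(rows) / len(rows)


def plan(fn: Callable, price_per_cpu_hour: float, est_hours: float) -> dict:
    """Toy scheduler: read the declared resources and enforce the cost ceiling."""
    res = fn.resources
    est_cost = res.cpu * price_per_cpu_hour * est_hours
    if res.max_cost_usd is not None and est_cost > res.max_cost_usd:
        raise ValueError(
            f"estimated cost {est_cost:.2f} exceeds budget {res.max_cost_usd:.2f}"
        )
    return {
        "cpu": res.cpu,
        "memory_gb": res.memory_gb,
        "estimated_cost_usd": est_cost,
    }
```

An orchestrator could call something like `plan(train_model, price_per_cpu_hour=0.05, est_hours=2.0)` before provisioning an ephemeral cluster, so the resource and cost decision lives with the code that knows its own needs.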


Pathways: Asynchronous Distributed Dataflow for ML

An excellent weekend read: Pathways, which uses a single-controller programming model to express richer computing patterns. As with LyftLearn in the previous post, it is exciting to see the emerging pattern of defining computing needs in user code.

https://arxiv.org/abs/2203.12533


Sponsored: Firebolt - 5 Steps to Debug Your SQL Queries

Substituting code with SQL can oftentimes result in complex and long SQL statements that are not performing as fast as expected. If your query is running slow and you want to understand why - you’ve come to the right place.

https://www.firebolt.io/blog/5-steps-to-debug-your-complex-sql-queries-in-firebolt


Motif Analytics: Why Can’t You “Pull Data Real Quick”?

A challenge of running a data analytics org is the help-desk-style request: "Can you pull the data real quick?" The author explains why one can't simply pull the data quickly, highlighting real-world data problems.

https://motifanalytics.medium.com/why-cant-you-pull-data-real-quick-318f90024712


Pinterest: Large Scale Hadoop Upgrade At Pinterest

Upgrading a production system without downtime is always challenging. Pinterest writes about its YARN cluster upgrade and discusses multiple upgrade patterns.

https://medium.com/pinterest-engineering/large-scale-hadoop-upgrade-at-pinterest-a23a112deb73


Sponsored: Monte Carlo Data - IMPACT TOUR 2022 - The data leaders event series to learn key strategies to make an IMPACT with your data.

Join 3 virtual keynotes and 3 city stops, to learn how data leaders are tackling the biggest challenges in data, from building more reliable stacks to hiring top talent for your team.

Register Now: https://www.impactdatatour.com


Gusto: End to End Feature Development at Gusto

Feature flagging is a core part of the modern software development and release process. Gusto writes about building a feature that empowers product managers to launch experiments and data scientists to iterate on models to improve user engagement.

https://engineering.gusto.com/end-to-end-feature-development/


Dunith Dhanushka: Build a real-time data analytics pipeline with Airbyte, Kafka, and Pinot

Airbyte does for SaaS applications what Debezium does for database changes. How do you enable real-time analytics on SaaS CDC pipelines? The author walks through the steps to integrate Airbyte, Kafka, and Apache Pinot.

https://medium.com/event-driven-utopia/build-a-real-time-data-analytics-pipeline-with-airbyte-kafka-and-pinot-c9ff3c42dcf2


Sponsored: RudderStack - How RudderStack Core Enabled Us To Build Reverse ETL

Principal Engineer, Ranjeet Mishra, details how the foresight of RudderStack's founding engineers made it easier to solve some of the unique technical challenges involved in building Reverse ETL.

https://www.rudderstack.com/blog/how-rudderstack-core-enabled-us-to-build-reverse-etl


Beat: Data Build Tool (dbt) - The Beat story

Beat writes about its dbt adoption and the lessons learned along the way. The author describes what could be improved in dbt and the steps to migrate existing workflows to dbt.

https://build.thebeat.co/data-build-tool-dbt-the-beat-story-a5c09471cf66


Sam Crowder: A (Recent) History of Batch Data

Batch computing has come a long way since the MapReduce programming model, which triggered innovation in file formats, table formats, and query engines. The author takes a brief historical view of how different segments evolved in this space.

https://www.linkedin.com/pulse/recent-history-batch-data-sam-crowder/


Yotpo Engineering: Outbox with Debezium and Kafka — The hidden challenges

The Outbox pattern is a well-known and widely discussed integration pattern. The blog narrates the hidden technical challenges of adopting the Outbox pattern while implementing Change Data Capture with Debezium.

https://medium.com/yotpoengineering/outbox-with-debezium-and-kafka-the-hidden-challenges-998c00487ae4
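
For readers new to the Outbox pattern, the core idea is to write the business row and an event row in the same database transaction, so a CDC tool such as Debezium can later publish the event from the outbox table. The sketch below is a minimal illustration using SQLite from the Python standard library, not Yotpo's implementation; the table names, column names, and the `place_order` helper are assumptions made for the example.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Illustrative schema: one business table and one outbox table.
    conn.executescript(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer TEXT NOT NULL,
            total REAL NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_type TEXT NOT NULL,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: int,
                customer: str, total: float) -> None:
    """Insert the order and its event atomically."""
    payload = json.dumps({"id": order_id, "customer": customer, "total": total})
    # `with conn` commits if the block succeeds and rolls back on an exception,
    # so the order row and the outbox row are written together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
            (order_id, customer, total),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            ("order", str(order_id), "OrderPlaced", payload),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, 1, "alice", 42.5)
```

In a real deployment the CDC connector reads the outbox table's change log and forwards each row to Kafka; the application never publishes to Kafka directly, which avoids the dual-write problem.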


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
