Data Engineering Weekly #115

The Weekly Data Engineering Newsletter

Jan 23, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: Update on our blog series

One of the promises I made toward the end of 2022 was to publish more of my thoughts and industry observations on data engineering trends. So far, we have published:

Functional Programming - A Blueprint

The blog retriggered conversations around data modeling, and many data practitioners reached out to discuss and evaluate their data stacks. The response humbles me. Thank you to everyone who reached out and brainstormed about it.

Data Catalog - A broken promise

This blog triggered a few conversations about the Data Catalog and its future. A few data practitioners reached out to say they appreciated the healthy conversation around data catalogs. We’ve seen a similar prediction in the Analyst Predictions 2023 conversation and in the release of embedded data catalogs such as reCap.

François Nguyen | Building Data Teams & Platform @Francois_Nguyen

Like @ananthdurai , my thoughts on data catalog has changed : expensive and users keep declining. « Is Data Catalog a 1980s Solution for 2020’s Problems« … I guess so

dataengineeringweekly.com | Data Catalog - A Broken Promise | A critique on Data Catalog, and the future of knowledge management

A few of the upcoming blogs; stay tuned.

Data Quality: Shift Left, Bring Consumers Closer
Event Source vs. Outbox Pattern vs. CDC - The challenges and opportunities
Data Contract - An Executive Overview
Data Contract - Why Does Everyone Talk about Data Contract Now? A historical walkthrough for Data Engineering Leaders about Data Contract

Ananth’s Talk about Functional Principles in Data Engineering

Last week I had an opportunity to talk about Functional Data Engineering - A Blueprint for adopting functional principles in the data pipeline at the State of Data 2023 Conference.

When I started to talk about Data Contract & Schemata, a few data executives and practitioners approached me and asked, “Ananth, which Data Modeling techniques should we adopt? Is it Kimball techniques, Data Vault, Activity Schema, or 3NF?” The answer is always the classic “it depends.” 😂

However, I then realized that somewhere in these data modeling concepts are the key principles of Data Engineering, that is:

Reproducibility
Re-Computability

If you want to talk more about data modeling and functional principles in data engineering, feel free to pick a slot on my calendar [https://calendly.com/apackkildurai/]

Slides of the Talk: https://speakerdeck.com/vananth22/functional-data-engineering-a-blueprint-for-adopting-functional-principles-in-data-pipeline

Google AI: Google Research, 2022 & Beyond: Language, Vision, and Generative Models

Google AI started a series of blog posts highlighting some of the exciting progress Google made in 2022 and presenting its vision for 2023 and beyond. The first blog post highlights the advancements in language, computer vision, multi-modal models, and generative machine learning models. The blog series is interesting to watch since the advancement of ChatGPT, OpenAI <> Microsoft, and Google has triggered an AI battle. With refreshed investments from big tech companies, 2023 will be an exciting year for AI research and advancements.

https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html

Sergey Gigoyan: Crafting better dbt Projects

Infostrux writes about general best practices for structuring your dbt environment, with examples of structuring config files and organizing the data flow.

https://medium.com/infostrux-solutions/crafting-better-dbt-projects-aa5c48aebfc9

On the list of recommendations, the development environment setup triggers some curiosity.

We can generate dev environments by cloning ingest layer of the PROD environment

One challenge in Data Engineering is setting up the dev environment: with compliance and regulatory requirements, we can’t copy prod data into dev. How do you set up your development environment? Please add your thoughts in the comments on the discussion forum:

https://www.dataengineeringweekly.com/p/how-do-you-setup-your-development/comments

Jacob Baruch: Maximizing Your Data’s Value using Activity Schema Data Model

Activity Schema focuses on structuring all the business activities in a single time series table, which makes it easy to model and understand customer activity across the system. The blog briefly introduces the activity schema, its pros and cons, and further reading.

https://medium.com/@baruchjacob/maximizing-your-datas-value-using-activity-schema-data-model-c796bea41c4f
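To make the single-table idea concrete, here is a minimal sketch in plain Python. The column names (ts, customer, activity, features), the sample rows, and the helper functions are illustrative assumptions, not the full Activity Schema specification and not code from the blog.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Activity:
    """One row of a hypothetical activity stream (simplified columns)."""
    ts: datetime
    customer: str
    activity: str
    features: dict = field(default_factory=dict)


# Every business activity lands in the same time series table.
ACTIVITY_STREAM = [
    Activity(datetime(2023, 1, 2, 9, 0), "c1", "visited_site"),
    Activity(datetime(2023, 1, 2, 9, 5), "c1", "signed_up", {"plan": "free"}),
    Activity(datetime(2023, 1, 3, 10, 0), "c2", "visited_site"),
    Activity(datetime(2023, 1, 5, 12, 0), "c1", "placed_order", {"amount": 40}),
    Activity(datetime(2023, 1, 6, 8, 0), "c2", "signed_up", {"plan": "pro"}),
]


def first_occurrence(stream, activity):
    """Earliest timestamp of `activity` for each customer."""
    first = {}
    for row in sorted(stream, key=lambda r: r.ts):
        if row.activity == activity:
            first.setdefault(row.customer, row.ts)
    return first


def time_between(stream, start_activity, end_activity):
    """Time from each customer's first start activity to their first end activity."""
    starts = first_occurrence(stream, start_activity)
    ends = first_occurrence(stream, end_activity)
    return {
        customer: ends[customer] - starts[customer]
        for customer in starts
        if customer in ends and ends[customer] >= starts[customer]
    }


signup_to_order = time_between(ACTIVITY_STREAM, "signed_up", "placed_order")
print(signup_to_order)
```

Because every activity shares one shape, a question like "how long from signup to first order?" becomes a filter on the activity column rather than a join across several tables.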

Achievers Engineering: Enabling Self-Serve Data Platform with Apache Beam & Cookiecutter

An exciting blog post this week, with a refreshing idea of templated development for building data pipelines. The Cookiecutter approach finds a fine balance between flexibility and autonomy in building the data pipeline.

https://achievers.engineering/enabling-self-serve-data-platform-with-apache-beam-cookiecutter-d94230e1fef9
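As a rough illustration of templated development, the sketch below renders a pipeline config from a shared template. It uses Python's built-in string.Template as a stand-in; Cookiecutter itself renders whole project directories with Jinja2, and the field names here (team, source, sink) are hypothetical rather than taken from the Achievers setup.

```python
from string import Template

# A shared template owned by the platform team (hypothetical fields).
PIPELINE_TEMPLATE = Template(
    'pipeline_name = "$team-$source-to-$sink"\n'
    'source = "$source"\n'
    'sink = "$sink"\n'
    'owner = "$team"\n'
)


def render_pipeline(team, source, sink):
    """Fill in the template; a missing field raises KeyError instead of passing silently."""
    return PIPELINE_TEMPLATE.substitute(team=team, source=source, sink=sink)


config = render_pipeline("growth", "kafka", "bigquery")
print(config)
```

The platform team controls the template, while each team supplies only its own parameters.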

Teads Engineering: BigQuery Ingestion-Time Partitioning and Partition Copy With dbt

We talked about the functional data engineering principle of building DateTime-partitioned tables, and I am super thrilled to see the pattern added to dbt with good performance optimization.

https://medium.com/teads-engineering/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt-cc8a00f373e3
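The partition pattern can be sketched in a few lines of plain Python. This is an in-memory stand-in under stated assumptions: the table is a dict keyed by date, and build_partition and overwrite_partition are hypothetical helpers, not the dbt or BigQuery mechanism described in the Teads blog.

```python
from datetime import date


def build_partition(events, day):
    """Pure transform: the same events and day always give the same result."""
    rows = [e for e in events if e["event_date"] == day]
    return {
        "order_count": len(rows),
        "revenue": sum(e["amount"] for e in rows),
    }


def overwrite_partition(table, events, day):
    """Replace the whole partition for `day` instead of appending to it."""
    table[day] = build_partition(events, day)
    return table


EVENTS = [
    {"event_date": date(2023, 1, 22), "amount": 30},
    {"event_date": date(2023, 1, 23), "amount": 10},
    {"event_date": date(2023, 1, 23), "amount": 15},
]

table = {}
overwrite_partition(table, EVENTS, date(2023, 1, 23))
overwrite_partition(table, EVENTS, date(2023, 1, 23))  # rerun or backfill
print(table)
```

Rerunning the job for the same day leaves the table in the same state, which is what makes backfills and retries safe.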

DV Engineering: Optimizing for DAG and task complexity in Airflow

DV Engineering writes about its migration from Luigi to Airflow, using a file-processing DAG as an example. Two key lessons from the blog:

Finding a fine balance between parallelism and the effectiveness of a system is a challenge on its own. The ease of creating parallel/concurrent tasks sometimes becomes a curse.
A clear and usable UI is a significant differentiator of your system.

https://medium.com/doubleverify-engineering/optimizing-for-dag-and-task-complexity-in-airflow-4fb6501e34d1

Ververica: Flink SQL: Queries and Time

Time is a critical element in data processing. Understanding time-based window processing techniques is essential if you’re starting a career in data engineering. The blog explains how Flink provides time-based window processing capabilities.

https://www.ververica.com/blog/flink-sql-queries-and-time
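For readers new to windowing, here is a minimal plain-Python sketch of a tumbling (fixed-size, non-overlapping) event-time window. It is not Flink code; Flink SQL expresses the same idea with its TUMBLE window function. The function name, the sample events, and the epoch-aligned windows are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window_counts(events, size):
    """Count events per fixed-size window, keyed by the window start time."""
    counts = defaultdict(int)
    for ts, _payload in events:
        # Floor the event time to the start of its window.
        window_index = (ts - EPOCH) // size
        window_start = EPOCH + window_index * size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


EVENTS = [
    (datetime(2023, 1, 23, 10, 1), "click"),
    (datetime(2023, 1, 23, 10, 4), "click"),
    (datetime(2023, 1, 23, 10, 12), "view"),
    (datetime(2023, 1, 23, 10, 19), "click"),
    (datetime(2023, 1, 23, 10, 20), "view"),
]

window_counts = tumbling_window_counts(EVENTS, timedelta(minutes=10))
for start, count in window_counts.items():
    print(start, count)
```

Each event belongs to exactly one window, and the event at 10:20 opens a new window rather than closing the previous one.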

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
