Data Engineering Weekly

Data Engineering Weekly #115

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Jan 23

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: Update on our blog series

One of the promises I made toward the end of 2022 was to publish more of my thoughts and observations on data engineering trends. So far, we have published:

  1. Functional Programming - A Blueprint

The blog retriggered conversations around data modeling; many data practitioners reached out to discuss and evaluate their data stacks. The response humbles me. Thank you to everyone who reached out and brainstormed about it.

  2. Data Catalog - A broken promise

This blog triggered a few conversations about the Data Catalog and its future. A few data practitioners reached out and appreciated the healthy conversation it started around data catalogs. We’ve seen a similar prediction in the Analyst Predictions 2023 conversation and in the release of embedded data catalogs such as reCap.

François Nguyen | Building Data Teams & Platform (@Francois_Nguyen), 8:01 AM ∙ Jan 9, 2023:
“Like @ananthdurai, my thoughts on data catalog has changed: expensive and users keep declining. «Is Data Catalog a 1980s Solution for 2020’s Problems»… I guess so”
Linked post: Data Catalog - A Broken Promise. A critique on Data Catalog, and the future of knowledge management (dataengineeringweekly.com)

A few of the upcoming blogs; stay tuned.

  1. Data Quality: Shift Left, Bring Consumers Closer

  2. Event Source vs. Outbox Pattern vs. CDC - The challenges and opportunities

  3. Data Contract - An Executive Overview

  4. Data Contract - Why Does Everyone Talk about Data Contract Now? A historical walkthrough for Data Engineering Leaders about Data Contract


Ananth’s Talk about Functional Principles in Data Engineering

Last week I had an opportunity to talk about Functional Data Engineering - A Blueprint for adopting functional principles in the data pipeline at the State of Data 2023 Conference.

When I started to talk about Data Contract & Schemata, a few data executives and practitioners approached me and asked, “Ananth, which Data Modeling techniques should we adopt? Is it Kimball techniques, Data Vault, Activity Schema, or 3NF?” The answer is always the classic “it depends.” 😂

However, I then realized that somewhere within these data modeling concepts lie the key principles of Data Engineering, namely:

  1. Reproducibility

  2. Re-Computability
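
To make the two principles concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not code from the talk: the in-memory `warehouse` dictionary, the `transform` and `load_partition` names, and the sample events are all hypothetical. The idea is that a pure transformation gives reproducible output, and overwriting a whole date partition makes a re-run safe.

```python
from datetime import date

# Hypothetical in-memory "warehouse": partition date -> list of output rows.
Warehouse = dict[date, list[dict]]


def transform(events: list[dict]) -> list[dict]:
    """Pure function: the output depends only on the input events."""
    totals: dict[str, int] = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    # Sort so the same input always yields the same output ordering.
    return [{"user": user, "total": total} for user, total in sorted(totals.items())]


def load_partition(warehouse: Warehouse, day: date, events: list[dict]) -> None:
    """Overwrite the whole partition for `day` instead of appending to it."""
    warehouse[day] = transform(events)


warehouse: Warehouse = {}
day = date(2023, 1, 23)
events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
    {"user": "a", "amount": 3},
]

load_partition(warehouse, day, events)
first_run = list(warehouse[day])

# Re-computing the same partition (for example, during a backfill) replaces
# the previous rows rather than duplicating them.
load_partition(warehouse, day, events)
```

Because the load step replaces the partition rather than appending, running it twice leaves the same rows as running it once.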

If you want to talk more about data modeling and functional principles in data engineering, feel free to pick a slot on my calendar [https://calendly.com/apackkildurai/]

Slides of the Talk: https://speakerdeck.com/vananth22/functional-data-engineering-a-blueprint-for-adopting-functional-principles-in-data-pipeline


Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform.

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Google AI: Google Research, 2022 & Beyond: Language, Vision, and Generative Models

Google AI started a series of blog posts highlighting some exciting progress Google made in 2022 and presenting its vision for 2023 and beyond. The first blog post highlights advancements in language, computer vision, multi-modal models, and generative machine learning models. The blog series is interesting to watch since the advancements of ChatGPT, the OpenAI <> Microsoft partnership, and Google have triggered an AI battle. 2023 will be an exciting year for AI research and advancements, with refreshed investments from big tech companies.

https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html


Sergey Gigoyan: Crafting better dbt Projects

Infostrux writes about general best practices for structuring your dbt environment, with an example of how to structure config files and organize the data flow.

https://medium.com/infostrux-solutions/crafting-better-dbt-projects-aa5c48aebfc9

Among the recommendations, the development environment setup sparked my curiosity:

“We can generate dev environments by cloning ingest layer of the PROD environment”

One challenge in Data Engineering is setting up the dev environment, since compliance and regulatory requirements mean we can’t copy prod data into dev. How do you set up your development environment? Please add your thoughts in the comments of the discussion forum:

https://www.dataengineeringweekly.com/p/how-do-you-setup-your-development/comments


Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform

Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.

Get The Guide


Jacob Baruch: Maximizing Your Data’s Value using Activity Schema Data Model

Activity Schema focuses on structuring all business activities in a single time series table, which makes it easy to model and understand customer activity across the system. The blog briefly introduces the activity schema, its pros & cons, and further reading.

https://medium.com/@baruchjacob/maximizing-your-datas-value-using-activity-schema-data-model-c796bea41c4f
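
As a rough illustration of the single time series table idea, the sketch below models an activity stream in plain Python. The `Activity` fields, the `first_after` helper, and the sample rows are assumptions made for this example and are not taken from the blog or from the Activity Schema specification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activity:
    """One row of the (hypothetical) single activity stream table."""
    ts: datetime
    customer: str
    activity: str
    feature: str = ""


stream = [
    Activity(datetime(2023, 1, 1, 9, 0), "c1", "visited_page", "/pricing"),
    Activity(datetime(2023, 1, 1, 9, 5), "c1", "signed_up"),
    Activity(datetime(2023, 1, 1, 9, 7), "c2", "visited_page", "/home"),
    Activity(datetime(2023, 1, 2, 10, 0), "c1", "placed_order", "order-1"),
]


def first_after(rows: list[Activity], customer: str, first: str, then: str) -> Optional[Activity]:
    """Return the first `then` activity after the customer's first `first` activity."""
    mine = sorted((a for a in rows if a.customer == customer), key=lambda a: a.ts)
    anchor = next((a for a in mine if a.activity == first), None)
    if anchor is None:
        return None
    return next((a for a in mine if a.activity == then and a.ts > anchor.ts), None)


order = first_after(stream, "c1", "signed_up", "placed_order")
no_order = first_after(stream, "c2", "signed_up", "placed_order")
```

Every question about customer behavior becomes a filter and an ordering over the same table, which is the appeal of the model; the trade-off is that the one table carries every activity type.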


Achievers Engineering: Enabling Self-Serve Data Platform with Apache Beam & Cookiecutter

An exciting blog post this week, with a refreshing idea of templated development for building data pipelines. The Cookiecutter approach finds a fine balance between flexibility and autonomy in building the data pipeline.

https://achievers.engineering/enabling-self-serve-data-platform-with-apache-beam-cookiecutter-d94230e1fef9
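
To show the general idea of templated pipeline development, here is a small sketch that uses Python's standard `string.Template` as a stand-in. It does not use the Cookiecutter API or Apache Beam, and the template fields, pipeline name, and source and sink URIs are hypothetical values chosen for illustration.

```python
from string import Template

# Hypothetical pipeline definition template; a platform team would own this
# file, and pipeline authors would only supply the parameters.
PIPELINE_TEMPLATE = Template(
    "pipeline: $name\n"
    "source: $source\n"
    "sink: $sink\n"
)


def render_pipeline(name: str, source: str, sink: str) -> str:
    """Fill the shared template with one team's parameters."""
    return PIPELINE_TEMPLATE.substitute(name=name, source=source, sink=sink)


rendered = render_pipeline(
    name="orders_daily",
    source="pubsub://orders",            # made-up URI for illustration
    sink="bigquery://analytics.orders",  # made-up URI for illustration
)
```

The template fixes the structure every pipeline shares, while the parameters leave room for each team's specifics.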


Sponsored: Take Control of Your Customer Data With RudderStack

Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data. 

Take control of your customer data today.


Teads Engineering: BigQuery Ingestion-Time Partitioning and Partition Copy With dbt

We talked about the functional data engineering principle of building DateTime-partitioned tables, and I’m super thrilled to see the pattern added to dbt with good performance optimization.

https://medium.com/teads-engineering/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt-cc8a00f373e3


DV Engineering: Optimizing for DAG and task complexity in Airflow

DV Engineering writes about its migration from Luigi to Airflow, using a file-processing DAG as an example. Two key lessons from the blog:

  1. Finding a fine balance between parallelism and the effectiveness of a system is a challenge on its own. The ease of creating parallel/concurrent tasks sometimes becomes a curse.

  2. A clear and usable UI is a significant differentiator of your system.

https://medium.com/doubleverify-engineering/optimizing-for-dag-and-task-complexity-in-airflow-4fb6501e34d1
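
One common way to reason about the first lesson is to group work into a bounded number of tasks instead of creating one task per input. The sketch below is plain Python with no Airflow API; the `batch` helper, the file names, and the batch size are hypothetical, and it is not presented as the approach the blog takes.

```python
def batch(items: list[str], size: int) -> list[list[str]]:
    """Split `items` into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Ten hypothetical input files; one task per file would mean ten tasks.
files = [f"file_{i:03d}.csv" for i in range(10)]

# With a batch size of 4, the same work fits into three tasks.
batches = batch(files, 4)
```

A larger batch size means fewer tasks for the scheduler and the UI to handle, at the cost of less parallelism and coarser retries.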


Ververica: Flink SQL: Queries and Time

Time is a critical element in data processing. Understanding time-based window processing techniques is essential if you’re starting a career in data engineering. The blog explains how Flink provides time-based window processing capabilities.

https://www.ververica.com/blog/flink-sql-queries-and-time
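
For readers new to windowing, the sketch below shows the idea behind a tumbling (fixed-size, non-overlapping) event-time window in plain Python. It is not Flink SQL and does not use any Flink API; the `tumbling_window_counts` function and the sample events are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def tumbling_window_counts(events, size: timedelta) -> dict:
    """Count events per fixed, non-overlapping window, keyed by window start."""
    epoch = datetime(1970, 1, 1)
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Align each event to the start of the window that contains it.
        offset = (event_time - epoch) % size
        counts[event_time - offset] += 1
    return dict(sorted(counts.items()))


# Events are listed in arrival order, which differs from event-time order.
events = [
    (datetime(2023, 1, 23, 10, 5, 0), "c"),
    (datetime(2023, 1, 23, 10, 0, 30), "a"),
    (datetime(2023, 1, 23, 10, 12, 10), "d"),
    (datetime(2023, 1, 23, 10, 4, 59), "b"),
]

counts = tumbling_window_counts(events, timedelta(minutes=5))
```

Each event is assigned by its own timestamp rather than its arrival order, which is the distinction between event time and processing time. A streaming engine additionally has to decide when a window is complete, which this batch-style sketch leaves out.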


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
