Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Update on our blog series
One of the promises I made toward the end of 2022 was to publish more of my thoughts and observations on data engineering trends. So far, we have published:
The blog rekindled the conversation around data modeling, and many data practitioners reached out to discuss and evaluate their data stacks. The response humbles me. Thank you to everyone who reached out and brainstormed about it.
The blog triggered a few conversations about data catalogs and their future. A few data practitioners reached out to say they appreciated the healthy conversation around data catalogs. We’ve seen a similar prediction in the Analyst Predictions 2023 discussion and in the release of embedded data catalogs such as reCap.
A few upcoming blogs to stay tuned for:
Data Quality: Shift Left, Bring Consumers Closer
Event Source vs. Outbox Pattern vs. CDC - The challenges and opportunities
Data Contract - An Executive Overview
Data Contract - Why Does Everyone Talk about Data Contract Now? A historical walkthrough for Data Engineering Leaders about Data Contract
Ananth’s Talk about Functional Principles in Data Engineering
Last week I had an opportunity to talk about Functional Data Engineering - A Blueprint for adopting functional principles in the data pipeline at the State of Data 2023 Conference.
When I started to talk about Data Contract & Schemata, a few data executives and practitioners approached me and asked, “Ananth, which data modeling technique should we adopt? Is it Kimball, Data Vault, Activity Schema, or 3NF?” The answer is always the classic “it depends.” 😂
However, I then realized that somewhere within these data modeling concepts lie the key principles of data engineering:
Reproducibility
Re-Computability
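One common way to read these two principles is a pure transformation applied to an immutable date partition, with the result overwriting the target partition instead of appending to it. The sketch below uses in-memory dictionaries as a stand-in for a warehouse; the names `transform`, `run_partition`, `raw`, and `target` are hypothetical and are not taken from the talk.

```python
from datetime import date


def transform(raw_rows):
    """Pure function: the output depends only on the input rows."""
    totals = {}
    for row in raw_rows:
        totals[row["user_id"]] = totals.get(row["user_id"], 0) + row["amount"]
    # Sort so that the same input always yields the same output order.
    return [{"user_id": user, "total": total} for user, total in sorted(totals.items())]


def run_partition(raw_store, target_store, ds):
    """Recompute one date partition and overwrite it (never append)."""
    target_store[ds] = transform(raw_store.get(ds, []))
    return target_store[ds]


# Hypothetical raw data for a single date partition.
raw = {
    date(2023, 1, 1): [
        {"user_id": "a", "amount": 5},
        {"user_id": "b", "amount": 3},
        {"user_id": "a", "amount": 2},
    ]
}
target = {}

first = run_partition(raw, target, date(2023, 1, 1))
second = run_partition(raw, target, date(2023, 1, 1))  # re-run after a "failure"
print(first == second)  # the re-run reproduces the same partition
```

Because the task overwrites the partition, running it twice leaves the target in the same state as running it once, which is what makes a backfill or a retry safe.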
If you want to talk more about data modeling and functional principles in data engineering, feel free to pick a slot on my calendar [https://calendly.com/apackkildurai/]
Slides of the Talk: https://speakerdeck.com/vananth22/functional-data-engineering-a-blueprint-for-adopting-functional-principles-in-data-pipeline
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Google AI: Google Research, 2022 & Beyond: Language, Vision, and Generative Models
Google AI started a series of blog posts highlighting some of the exciting progress Google made in 2022 and presenting its vision for 2023 and beyond. The first blog post highlights advancements in language, computer vision, multi-modal models, and generative machine learning models. The series is interesting to watch now that the rise of ChatGPT and the OpenAI <> Microsoft partnership, along with Google, has triggered an AI battle. 2023 will be an exciting year for AI research and advancements, with renewed investments from big tech companies.
https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html
Sergey Gigoyan: Crafting better dbt Projects
Infostrux writes about general best practices to structure your dbt environment, with an example of config file structuring and organizing the data flow.
https://medium.com/infostrux-solutions/crafting-better-dbt-projects-aa5c48aebfc9
Among the recommendations, the development environment setup triggers some curiosity:
“We can generate dev environments by cloning ingest layer of the PROD environment”
One challenge in data engineering is setting up the dev environment: with compliance and regulatory requirements, we can’t copy the prod data into dev. How do you set up your development environment? Please add your thoughts in the comments in the discussion forum.
https://www.dataengineeringweekly.com/p/how-do-you-setup-your-development/comments
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Jacob Baruch: Maximizing Your Data’s Value using Activity Schema Data Model
Activity Schema focuses on structuring all the business activities in a single time series table, which makes it easy to model and understand customer activity across the system. The blog briefly introduces the activity schema, its pros and cons, and further reading.
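To make the "single time series table" idea concrete, here is a minimal sketch in plain Python. The column names (`ts`, `customer`, `activity`, `features`) and the helper functions are illustrative assumptions, not the exact specification from the blog.

```python
from datetime import datetime

# One table for every business activity: who did what, and when.
# Hypothetical rows; activity-specific attributes live in "features".
activity_stream = [
    {"ts": datetime(2023, 1, 1, 9, 0), "customer": "c1",
     "activity": "visited_page", "features": {"page": "/pricing"}},
    {"ts": datetime(2023, 1, 1, 9, 5), "customer": "c1",
     "activity": "signed_up", "features": {}},
    {"ts": datetime(2023, 1, 2, 10, 0), "customer": "c2",
     "activity": "visited_page", "features": {"page": "/"}},
    {"ts": datetime(2023, 1, 3, 8, 0), "customer": "c1",
     "activity": "placed_order", "features": {"amount": 40}},
]


def timeline(stream, customer):
    """Ordered list of activities for one customer."""
    ordered = sorted(stream, key=lambda row: row["ts"])
    return [row["activity"] for row in ordered if row["customer"] == customer]


def first_occurrence(stream, activity):
    """Timestamp of the first time each customer performed an activity."""
    first = {}
    for row in sorted(stream, key=lambda row: row["ts"]):
        if row["activity"] == activity:
            first.setdefault(row["customer"], row["ts"])
    return first


print(timeline(activity_stream, "c1"))
print(first_occurrence(activity_stream, "visited_page"))
```

Both questions are answered by filtering and ordering one table, without joining separate page-view, signup, and order tables.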
Achievers Engineering: Enabling Self-Serve Data Platform with Apache Beam & Cookiecutter
An exciting blog post this week, with a refreshing idea of templated development for building data pipelines. The Cookiecutter approach strikes a fine balance between flexibility and autonomy in building the data pipeline.
Sponsored: Take Control of Your Customer Data With RudderStack
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data.
Take control of your customer data today.
Teads Engineering: BigQuery Ingestion-Time Partitioning and Partition Copy With dbt
We talked about the functional data engineering principle of building DateTime-partitioned tables, and I am super thrilled to see the pattern added to dbt with good performance optimization.
DV Engineering: Optimizing for DAG and task complexity in Airflow
DV Engineering writes about its migration from Luigi to Airflow, using a file-processing DAG as an example. Two key lessons from the blog:
Finding a fine balance between parallelism and the effectiveness of a system is a challenge on its own. The ease of creating parallel or concurrent tasks can sometimes become a curse.
A clear and usable UI is a significant differentiator of your system.
Ververica: Flink SQL: Queries and Time
Time is a critical element in data processing. Understanding time-based window processing techniques is essential if you’re starting a career in data engineering. The blog explains how Flink provides time-based window processing capabilities.
https://www.ververica.com/blog/flink-sql-queries-and-time
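For readers new to windowing, the idea behind a fixed-size (tumbling) event-time window can be sketched in a few lines of plain Python. This is not Flink's API; the function names and the sample events are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window_start(event_time, size):
    """Start of the fixed-size, non-overlapping window containing event_time."""
    elapsed = event_time - EPOCH
    # Integer division of two timedeltas gives the window index.
    return EPOCH + (elapsed // size) * size


def count_per_window(event_times, size):
    """Count events per tumbling window, keyed by window start."""
    counts = Counter(tumbling_window_start(t, size) for t in event_times)
    return dict(counts)


# Hypothetical event timestamps (event time, not arrival time).
events = [
    datetime(2023, 1, 1, 10, 0, 30, tzinfo=timezone.utc),
    datetime(2023, 1, 1, 10, 4, 59, tzinfo=timezone.utc),
    datetime(2023, 1, 1, 10, 5, 0, tzinfo=timezone.utc),
    datetime(2023, 1, 1, 10, 12, 10, tzinfo=timezone.utc),
]

windows = count_per_window(events, timedelta(minutes=5))
for start in sorted(windows):
    print(start.isoformat(), windows[start])
```

Each event lands in exactly one window based on its own timestamp, so the counts do not depend on the order in which events arrive. A streaming engine additionally has to decide when a window is complete, which this sketch leaves out.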
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.