Data Engineering Weekly #120
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Debunking Data Contracts; How Schemata Make Data Contracts Easy, Scalable, and Meaningful
I have talked about Schemata, Data Product Management, and Data Contracts over the last few weeks. Ashwin runs an excellent data community, “Data Heros,” where I joined the community members to discuss Schemata and Data Contracts.
If you’re not yet following the Data Heros community, please do; they host highly productive data engineering conversations. I highly recommend following their LinkedIn group for updates.
I sat down with Scott Hirleman on Data Mesh Radio to talk about how we make Data Contracts easy, scalable, and meaningful. In the conversation, I discussed why collaboration around data is crucial and how data creation is a human-in-the-loop problem. You can listen to the full episode here.
Colin Campbell: The Case for Data Contracts
The author published the case for data contracts, capturing the current state of the data contract marketplace and its potential players. The author points out that data contracts are a technical implementation, not an organizational one. I believe a data contract is a technology solution that drives organizational change, much like Kubernetes is a technology solution that, at the same time, steers system architecture toward certain characteristics. Data contract platforms are the same, so this space is wide open and waiting for disruption.
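To make the "technical implementation" point concrete: at its simplest, a data contract is a machine-checkable schema enforced at the producer boundary. Here is a minimal sketch in Python; the contract fields, types, and events are hypothetical illustrations, not the API of Schemata or any data contract platform.

```python
# Minimal data-contract sketch: a producer-side gate that rejects events
# violating an agreed schema. The fields and types below are hypothetical
# illustrations, not from any specific data contract platform.

CONTRACT = {
    "order_id": str,
    "user_id": str,
    "amount_cents": int,
}

def violations(event: dict) -> list[str]:
    """Return the list of contract violations for an event (empty = valid)."""
    problems = []
    for field, expected_type in CONTRACT.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"bad type for {field}: {type(event[field]).__name__}")
    return problems

good = {"order_id": "o-1", "user_id": "u-9", "amount_cents": 1250}
bad = {"order_id": "o-2", "amount_cents": "1250"}  # missing field + wrong type

print(violations(good))  # empty list: event satisfies the contract
print(violations(bad))   # two violations reported
```

The organizational change follows from the technical gate: once producers cannot ship a violating event, schema conversations have to happen before the change, not after the downstream breakage.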
Matt Turck: The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape
It’s time for Matt’s 2023 MAD landscape, with 1,416 logos, up from 139 in 2012. The addition of the “Fully Managed” category is an exciting space to watch. Perhaps merge Data Lakehouse & Data Warehouse in the next edition? The line between them is blurring.
Guy Fighel: Stop emphasizing the Data Catalog
We have talked in the past about the data catalog as a broken promise: how data catalogs quickly become outdated and may not reflect the current state of the data assets within an organization, and how a data catalog operating in a disjointed workflow leads to usability nightmares. Along similar lines, the author argues that the semantic layer is a more efficient and dynamic approach than the data catalog. This is an interesting observation to keep an eye on.
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn from strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Chase Bank: Achieving Data Autonomy
One of the most exciting case studies comes from Chase, about modernizing its data platform. The five-part series emphasizes that any modernization should bring its people along with a sufficient upskilling program.
600 ETL developers using point-and-click tools could be reskilled to adopt the solution [Code first approach with Spark & Java]
Part 1: Setting the Stage for Change
Part 2: Our Pilot Phase and the Beginning of a Modernization Journey
Part 3: Modernization at Scale — Starting with People
Part 4: Accelerating Data Modernization — Execution Methodology
Part 5: Lessons Learned (So Far)
Virgin Media O2: Riffing: our recipe for iterating fast, failing forward, and achieving success with data
The author writes about the practical difficulties of building data products from raw data and how Virgin Media O2 adopted the Riffing engineering process to navigate them. Riffing is a five-step process that includes:
What is the goal?
Identify and study the raw data.
Test and optimize the output
Productionise into a usable format
Sponsored: Replacing GA4 with Analytics on your Data Cloud
The GA4 migration deadline is fast approaching. If you’re still heavily reliant on Google for data collection and reporting, now is the perfect time to center your data analytics strategy around your data warehouse. Join our webinar to learn how you can replace GA with analytics on your data cloud.
Funding Circle: Data Engineering Culture @ FC
Many companies showcase their technical excellence in their blogs, so I’m thrilled to see Funding Circle write about its data engineering culture. A good data team culture is vital to establishing a data-driven culture across a company, and I hope more companies will write about their data engineering culture.
Yerachmiel Feltzman: Action-Position data quality assessment framework
If your data engineering team has not yet adopted either the “Write-Audit-Publish” or “Audit-Publish-Write” pattern, the time to implement it was yesterday :-). The author published a data quality assessment framework for these pipeline patterns, including monitoring and investigation of data quality issues.
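For readers new to the pattern: Write-Audit-Publish stages data where consumers can't see it, audits it, and only then publishes it. The toy sketch below captures the control flow in plain Python; real pipelines do this with table branches or snapshots in a lakehouse, and the in-memory "tables" and the audit rule here are illustrative assumptions, not the author's framework.

```python
# Toy Write-Audit-Publish sketch. Real implementations use table branches
# or snapshots; the in-memory dicts and the audit rule below are
# illustrative assumptions only.

staging: dict[str, list[dict]] = {}    # invisible to consumers
published: dict[str, list[dict]] = {}  # what consumers actually read

def write(table: str, rows: list[dict]) -> None:
    # 1. Write: land the batch where consumers can't see it yet
    staging[table] = rows

def audit(table: str) -> bool:
    # 2. Audit: example check - no nulls in the primary key column
    return all(row.get("id") is not None for row in staging.get(table, []))

def publish(table: str) -> bool:
    # 3. Publish: atomically promote staged rows only if the audit passes
    if not audit(table):
        return False
    published[table] = staging.pop(table)
    return True

write("orders", [{"id": 1}, {"id": 2}])
print(publish("orders"))       # the clean batch is promoted

write("orders_bad", [{"id": None}])
print(publish("orders_bad"))   # the bad batch never reaches consumers
```

The key property is the last line: a failing audit leaves the published state untouched, which is exactly what makes data quality issues investigable rather than user-facing.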
Ancestry: Scaling Ancestry.com - How to Optimize Updates for Iceberg Tables with 100 Billion Rows
Ancestry writes about using Apache Iceberg and its strategy for optimizing updates to Iceberg tables. The solution mainly focuses on partitioning and compacting the Iceberg tables.
Square: Why You Need an Experimentation Template
Standardizing and tracking experimentation decisions is vital for the success of a data-driven organization. Square writes about why you need an experimentation template and shares their copy for use.
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is nearing, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
Dair-AI: Prompt Engineering Guide
Prompt engineering is a relatively new discipline for developing and optimizing prompts to use language models (LMs) efficiently across a variety of applications and research topics. The repo contains exciting reference materials for learning more about prompt engineering.
MetaAI: Introducing LLaMA: A foundational, 65-billion-parameter large language model
ChatGPT has genuinely increased curiosity about LLMs (Large Language Models), and with that, MetaAI open-sourced LLaMA (Large Language Model Meta AI), a state-of-the-art foundational large language model. I recently switched to using ChatGPT and GitHub Copilot extensively for my coding, so I’m excited about this space and its innovations.
LinkedIn: Sharing LinkedIn’s Responsible AI Principles
The advancement of AI/ML techniques and the emerging LLMs create valid concerns over AI’s impact on privacy and society. LinkedIn shares its Responsible AI principles, which include:
Advance Economic Opportunity
Promote Fairness and Inclusion
It will be interesting to learn more about how LinkedIn will monitor and measure the success of these principles, and how transparent the findings will be. https://engineering.linkedin.com/blog/2023/linkedin-s-responsible-ai-principles-help-meet-the-big-moments-i
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.