Data Engineering Weekly #120
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Debunking Data Contracts; How Schemata Make Data Contracts Easy, Scalable, and Meaningful
I have talked about Schemata, Data Product Management, and Data Contracts over the last few weeks. Ashwin runs an excellent data community, “Data Heros,” where I joined the community members to discuss Schemata and Data Contracts.
If you’re not yet following the Data Heros community, please do; they host highly productive data engineering conversations. I highly recommend following their LinkedIn group for updates.
I sat down with Scott Hirleman on Data Mesh Radio to talk about how we make Data Contracts easy, scalable, and meaningful. In the conversation, I discussed why collaboration around data is crucial and how data creation is a human-in-the-loop problem. You can listen to the full episode here.
Colin Campbell: The Case for Data Contracts
The author published the case for data contracts, capturing the current state of the data contract marketplace and its potential players. The author points out that data contracts are a technical implementation, not an organizational one. I believe a data contract is a technology solution that drives organizational change, much like Kubernetes is a technology solution that, at the same time, steers system architecture toward certain characteristics. Data contract platforms are the same, so this space is wide open and waiting for disruption.
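To make the "technical implementation" point concrete: at its simplest, a data contract is a machine-checkable schema enforced at the producer boundary. Here is a minimal sketch in Python; the contract fields, types, and events are hypothetical illustrations, not the API of Schemata or any data contract platform.

```python
# Minimal data-contract sketch: a producer-side gate that rejects events
# violating an agreed schema. The fields and types below are hypothetical
# illustrations, not from any specific data contract platform.

CONTRACT = {
    "order_id": str,
    "user_id": str,
    "amount_cents": int,
}

def violations(event: dict) -> list[str]:
    """Return the list of contract violations for an event (empty = valid)."""
    problems = []
    for field, expected_type in CONTRACT.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"bad type for {field}: {type(event[field]).__name__}")
    return problems

good = {"order_id": "o-1", "user_id": "u-9", "amount_cents": 1250}
bad = {"order_id": "o-2", "amount_cents": "1250"}  # missing field + wrong type

print(violations(good))  # empty list: event satisfies the contract
print(violations(bad))   # two violations reported
```

The organizational change follows from the technical gate: once producers cannot ship a violating event, schema conversations have to happen before the change, not after the downstream breakage.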
Matt Turck: The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape
It’s time for Matt’s 2023 MAD landscape, with 1,416 logos, up from 139 in 2012. The addition of the “Fully Managed” category is an exciting space to watch. Perhaps merge Data Lakehouse & Data Warehouse in the next edition? The line between them is blurring.
Guy Fighel: Stop emphasizing the Data Catalog
We have talked in the past about the data catalog as a broken promise: how data catalogs quickly become outdated and may not reflect the current state of the data assets within an organization, and how a data catalog operating in a disjointed workflow leads to usability nightmares. Along similar lines, the author argues that the semantic layer is a more efficient and dynamic approach than the data catalog. This is an interesting observation to keep an eye on.
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn from strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Chase Bank: Achieving Data Autonomy
One of the most exciting case studies comes from Chase, about modernizing its data platform. The five-part series emphasizes that any modernization should bring its people along with a sufficient upskilling program.
600 ETL developers using point-and-click tools could be reskilled to adopt the solution [Code first approach with Spark & Java]
Part 1: Setting the Stage for Change
Part 2: Our Pilot Phase and the Beginning of a Modernization Journey
Part 3: Modernization at Scale — Starting with People
Part 4: Accelerating Data Modernization — Execution Methodology
Part 5: Lessons Learned (So Far)
Virgin Media O2: Riffing: our recipe for iterating fast, failing forward, and achieving success with data
The author writes about the practical difficulties of building data products from raw data and how Virgin Media O2 adopted the Riffing engineering process to navigate them. Riffing is a five-step process that includes:
What is the goal?
Identify and study the raw data.
Test and optimize the output
Productionise into a usable format
Sponsored: Replacing GA4 with Analytics on your Data Cloud
The GA4 migration deadline is fast approaching. If you’re still heavily reliant on Google for data collection and reporting, now is the perfect time to center your data analytics strategy around your data warehouse. Join our webinar to learn how you can replace GA with analytics on your data cloud.
Funding Circle: Data Engineering Culture @ FC
Many companies showcase their technical excellence in their blogs, so I’m thrilled to see Funding Circle write about its data engineering culture. A good data team culture is vital to establishing a data-driven culture across a company, and I hope more companies will write about their data engineering culture.
Yerachmiel Feltzman: Action-Position data quality assessment framework
If your data engineering team has not yet adopted either the “Write-Audit-Publish” or “Audit-Publish-Write” pattern, the time to implement it was yesterday :-). The author published a data quality assessment framework for these pipeline patterns, including monitoring and investigation of data quality issues.
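For readers new to the pattern: Write-Audit-Publish stages data where consumers can't see it, audits it, and only then publishes it. The toy sketch below captures the control flow in plain Python; real pipelines do this with table branches or snapshots in a lakehouse, and the in-memory "tables" and the audit rule here are illustrative assumptions, not the author's framework.

```python
# Toy Write-Audit-Publish sketch. Real implementations use table branches
# or snapshots; the in-memory dicts and the audit rule below are
# illustrative assumptions only.

staging: dict[str, list[dict]] = {}    # invisible to consumers
published: dict[str, list[dict]] = {}  # what consumers actually read

def write(table: str, rows: list[dict]) -> None:
    # 1. Write: land the batch where consumers can't see it yet
    staging[table] = rows

def audit(table: str) -> bool:
    # 2. Audit: example check - no nulls in the primary key column
    return all(row.get("id") is not None for row in staging.get(table, []))

def publish(table: str) -> bool:
    # 3. Publish: atomically promote staged rows only if the audit passes
    if not audit(table):
        return False
    published[table] = staging.pop(table)
    return True

write("orders", [{"id": 1}, {"id": 2}])
print(publish("orders"))       # the clean batch is promoted

write("orders_bad", [{"id": None}])
print(publish("orders_bad"))   # the bad batch never reaches consumers
```

The key property is the last line: a failing audit leaves the published state untouched, which is exactly what makes data quality issues investigable rather than user-facing.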
Ancestry: Scaling Ancestry.com - How to Optimize Updates for Iceberg Tables with 100 Billion Rows
Ancestry writes about using Apache Iceberg and its strategy for optimizing updates to Iceberg tables. The solution mainly focuses on partitioning and compacting the Iceberg tables.
Square: Why You Need an Experimentation Template
Standardizing and tracking experimentation decisions is vital for the success of a data-driven organization. Square writes about why you need an experimentation template and shares their copy for use.
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is nearing, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
Dair-AI: Prompt Engineering Guide
Prompt engineering is a relatively new discipline for developing and optimizing prompts to use language models (LMs) efficiently across a variety of applications and research topics. The repo contains exciting reference materials for learning more about prompt engineering.
MetaAI: Introducing LLaMA: A foundational, 65-billion-parameter large language model
ChatGPT has genuinely increased curiosity about LLMs (Large Language Models), and with that, MetaAI open-sourced LLaMA (Large Language Model Meta AI), a state-of-the-art foundational large language model. I recently switched to using ChatGPT and GitHub Copilot extensively for my coding, so I’m excited about this space and its innovations.
LinkedIn: Sharing LinkedIn’s Responsible AI Principles
The advancement of AI/ML techniques and the emerging LLMs create valid concerns over AI’s impact on privacy and society. LinkedIn shares its Responsible AI principles, which include:
Advance Economic Opportunity
Promote Fairness and Inclusion
It will be interesting to learn more about how LinkedIn will monitor and measure the success of these principles, and how transparent the findings will be. https://engineering.linkedin.com/blog/2023/linkedin-s-responsible-ai-principles-help-meet-the-big-moments-i
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.