Data Engineering Weekly #133
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
David Heinemeier Hansson: The luxury of working without metrics
“The metrics you choose as a measure of success inherently shape the essence of the company you aspire to build” - A Data Practitioner.
Sometimes, the metrics you don’t choose have just as much impact. Do you need data or metrics at all? Should metrics define how you operate your business? An interesting take on operating without metrics.
lakeFS: How to Implement Write-Audit-Publish (WAP)
I wrote extensively about the WAP pattern in my latest article, An Engineering Guide to Data Quality - A Data Contract Perspective. Super excited to see a complete guide on implementing the WAP pattern in Iceberg, Hudi, and of course, with lakeFS.
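At its core, WAP means writing new data to an isolated location, auditing it there, and only then atomically publishing it to readers. A minimal sketch in plain Python, assuming an in-memory dict as a stand-in for the table catalog (real implementations use Iceberg branches, Hudi, or lakeFS branches):

```python
# WRITE to staging, AUDIT in isolation, PUBLISH atomically on success.
def audit(rows):
    """Example audit rule: no NULL ids, no negative amounts."""
    return all(r["id"] is not None and r["amount"] >= 0 for r in rows)

def write_audit_publish(warehouse, table, new_rows):
    staging = list(new_rows)              # WRITE to an isolated staging area
    if not audit(staging):                # AUDIT before readers can see it
        raise ValueError("audit failed; staging data not published")
    warehouse[table] = staging            # PUBLISH by atomic pointer swap
    return warehouse

warehouse = {}                            # stand-in for the table catalog
write_audit_publish(warehouse, "orders", [{"id": 1, "amount": 10.0}])
```

The key property: a batch that fails the audit never becomes visible, and the previously published version stays intact.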
Uber: Spark Analysers: Catching Anti-Patterns In Spark Apps
One of the challenges in commoditizing data processing engines like Spark is that they require expert users to understand and operate. Uber writes about automatically detecting Spark anti-patterns to better educate users and help reduce their processing costs.
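The idea can be illustrated with a toy rule-based check in the same spirit: flag an app that scans the same table more than once without caching, a common anti-pattern that inflates cost. The plan representation below is a hypothetical simplification; the real analysers walk Spark query plans.

```python
from collections import Counter

def find_duplicate_scans(plan_ops):
    """Return tables scanned more than once in a (simplified) query plan."""
    scans = Counter(op["table"] for op in plan_ops if op["kind"] == "scan")
    return sorted(table for table, n in scans.items() if n > 1)

plan = [
    {"kind": "scan", "table": "trips"},
    {"kind": "filter", "expr": "city = 'SF'"},
    {"kind": "scan", "table": "trips"},   # duplicate scan: caching candidate
]
```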
Policy Genius: Data Warehouse Testing Strategies for Better Data Quality
Data Testing and Data Observability are widely discussed topics in Data Engineering Weekly. However, both techniques run only after the transformation task is completed. Can we test SQL business logic during the development phase itself? Perhaps unit test the pipeline?
The author writes an exciting article about adopting unit testing in data pipelines by producing sample tables during development. We will see more tooling around unit test frameworks for data pipelines soon. I don’t think testing data quality on every PR against the production database is a cost-effective solution. We can do better than that, tbh.
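As a sketch of the idea, assuming the business logic lives in a SQL statement, a unit test can run the transformation on small fixture tables in an in-memory SQLite database instead of the production warehouse:

```python
import sqlite3

# Hypothetical business logic under test: settled payment totals per customer.
TRANSFORM_SQL = """
SELECT customer_id, SUM(amount) AS total
FROM payments
WHERE status = 'settled'
GROUP BY customer_id
"""

def run_transform(fixture_rows):
    # Build the source table from fixtures and run the business logic on it.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE payments (customer_id INT, amount REAL, status TEXT)")
    con.executemany("INSERT INTO payments VALUES (?, ?, ?)", fixture_rows)
    return con.execute(TRANSFORM_SQL).fetchall()

# Unit test: refunded payments must be excluded from customer totals.
result = run_transform([(1, 10.0, "settled"), (1, 5.0, "refunded"), (2, 7.0, "settled")])
```

The dialect differences between SQLite and your warehouse limit how far this goes, but it catches logic bugs at development time, cheaply and on every commit.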
Sponsored: [New Report] 2023 Data Quality Benchmarks
Data trust is at an all-time low, and teams are feeling the pain. Our latest report highlights the impact of bad data on your bottom line (did you know that poor data quality impacts 31% of revenue?!) and how the best teams are reducing incident resolution times.
Kaushik Muniandi: Text-Based Search - From Elastic Search to Vector Search
Over the last month or so, I’ve experimented with vector search over embeddings. I found this article, which compares search efficiency among Elasticsearch, Azure Search, and custom vector search tables in Delta Lake and Apache Hudi, interesting.
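Underneath all of these backends, the core operation is the same: rank documents by similarity between the query embedding and the document embeddings. A minimal brute-force sketch with toy vectors (real embeddings come from a model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, docs, top_k=2):
    """Brute-force top-k retrieval by cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
```

Dedicated engines replace the brute-force loop with approximate nearest-neighbor indexes, which is where the efficiency differences the article measures come from.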
Jatin Solanki: Vector Database - Concepts and examples
Staying with vector search: a new class of vector databases is emerging in the market to improve semantic search experiences. The author writes an excellent introduction to vector databases and their applications.
KOHO: Handling Schema Evolution in the Data Pipelines at KOHO
Schema management at the data ingestion service, combined with the DLQ (Dead Letter Queue) pattern, is emerging as the standard architecture for event processing. KOHO writes about its architecture for handling DLQs and schema management.
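A simplified sketch of the pattern, with the schema registry and message broker replaced by plain Python stand-ins: each event is validated against the expected schema, and failures are routed to a DLQ for inspection instead of blocking the pipeline.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str}

def ingest(events):
    """Validate events against the schema; route failures to the DLQ."""
    accepted, dlq = [], []
    for e in events:
        ok = all(isinstance(e.get(field), typ) for field, typ in EXPECTED_SCHEMA.items())
        (accepted if ok else dlq).append(e)
    return accepted, dlq

events = [
    {"user_id": 1, "event": "login"},        # conforms to the schema
    {"user_id": "oops", "event": "login"},   # wrong type -> dead letter queue
]
```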
Sponsored: The Evolution of the Customer Data Platform
Like the composable approach, the Warehouse Native CDP solves the data silo problem by building around the warehouse, but it deploys the integration, real-time transformation, and activation layers as a connected, governable, and observable end-to-end system.
Here, the RudderStack team examines prevailing approaches to the Customer Data Platform, including the recent composable CDP movement, and introduces a new approach that they believe best delivers the end goal — easy activation of complete customer profiles.
LlamaIndex: Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation
A query engine can leverage the expressivity of SQL over structured data and join it with unstructured context from a vector database - The future of data processing.
Much real-world data, from medical images to astronomical monitoring, is unstructured. The future we want to build toward is the promise of vector search over that data combined with SQL access to structured information.
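A toy sketch of the combined query pattern, using SQLite for the structured filter and a keyword-overlap score as a stand-in for embedding similarity (a real system would use model embeddings and a vector database; the table and data here are made up for illustration):

```python
import sqlite3

def overlap(query, text):
    """Toy 'semantic' score: number of shared words with the query."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (id INT, age INT, note TEXT)")
con.executemany("INSERT INTO patients VALUES (?, ?, ?)", [
    (1, 70, "mri scan shows stable condition"),
    (2, 72, "follow up on knee surgery"),
    (3, 35, "routine checkup, no issues"),
])

def answer(question, min_age):
    # Structured step: SQL narrows the candidate rows.
    rows = con.execute("SELECT id, note FROM patients WHERE age >= ?", (min_age,)).fetchall()
    # Unstructured step: rerank the surviving notes semantically.
    return max(rows, key=lambda r: overlap(question, r[1]))

best = answer("mri scan results", 65)
```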
Grab: PII masking for privacy-grade machine learning
Grab writes about the shift-left approach, where the data producer tags PII and the tags are carried over to downstream processing systems so they can act on the data appropriately.
I’m thrilled to see this approach reflecting the vision of Schemata with its Protobuf definition. The Schemata Open Contract definition has comprehensive coverage of PII & data classification types.
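The gist of shift-left tagging can be sketched like this: the producer declares which fields are PII in the schema (as Schemata does with Protobuf options), and downstream systems mask tagged values automatically. The schema dict below is a stand-in for real field annotations:

```python
# Producer-declared schema: each field carries a PII tag downstream.
SCHEMA = {
    "user_id": {"pii": False},
    "email":   {"pii": True},
    "city":    {"pii": False},
}

def mask_pii(record, schema):
    """Downstream consumer: mask any field the producer tagged as PII."""
    return {k: ("***" if schema.get(k, {}).get("pii") else v) for k, v in record.items()}

event = {"user_id": 42, "email": "a@b.com", "city": "Singapore"}
```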
Katharine Jarmul: Privacy Enhancing Technologies - An Introduction for Technologists
PII masking is one example of a privacy-enhancing technique, and as a data practitioner, it is essential to understand privacy-enhancing technologies (PETs). The author writes a comprehensive guide to PETs.
Gabe Araujo: Introducing PandasAI: The Generative AI Python Library
I’ve not tried PandasAI, but its promise looks exciting:
import pandas as pd
import pandasai as pdai  # package name assumed; the article aliases PandasAI as pdai
data = pd.read_csv('dataset.csv')
data_cleaned = pdai.impute_missing_values(data)
If data cleaning is as simple as calling a function, working with data will be a delight.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.