Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editor’s Note: DewCon.ai - October 12, Bengaluru, India Update
Hey folks! 📣 Exciting news! We've finalized the agenda for the conference and will launch it in the middle of this week. 🎤 And guess what? We've given our conference website a fresh look. 🌐
Tickets are selling fast; use the code DATAHERO for a special discount. 🎟️ Oh, and if your company's thinking of bulk booking, drop an email to ananth@dataengineeringweekly.com to get some awesome discounts. 📩
Looking forward to seeing you all! 👋🙂
Wes McKinney: The Road to Composable Data Systems - Thoughts on the Last 15 Years and the Future
On LinkedIn, you may frequently hear, "We did this 30 years ago," or "Don't reinvent the wheel." The reality is that software is an abstraction layer that interacts with hardware components like the CPU, network I/O, and memory. Any advancement in this hardware layer significantly influences software architecture, so we often see that "what goes around, comes around."
The author summarizes the last 15 years of data computation infrastructure and the emerging trend of composable data systems.
https://wesmckinney.com/blog/looking-back-15-years/
Meta: Scheduling Jupyter Notebooks at Meta
As a data engineer, I love working with notebooks, but the challenge always comes when running them through standard pipelining practices. Orchestration engines like Airflow and Dagster provide sophisticated pipeline-authoring tooling, which is often hard to integrate with notebooks. I often want to click a "Schedule this Notebook" button and have the Airflow code generated, scheduled, and committed to GitHub automatically. Drawing on a similar experience, Meta writes about its automation process for simplifying notebook scheduling.
https://engineering.fb.com/2023/08/29/security/scheduling-jupyter-notebooks-meta/
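Meta's internal tooling isn't open source, but you can approximate the "schedule this notebook" workflow with papermill inside an Airflow task. A minimal sketch follows; the DAG id, notebook paths, and parameters are hypothetical, not Meta's implementation.

```python
# Hypothetical sketch: schedule a parameterized notebook with Airflow + papermill.
# DAG id, paths, and parameters are made up for illustration.
from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_notebook(ds: str, **_):
    # Execute the notebook headlessly, injecting the run date as a parameter
    # and writing an executed copy for debugging and auditing.
    pm.execute_notebook(
        "/notebooks/daily_metrics.ipynb",
        f"/notebooks/runs/daily_metrics_{ds}.ipynb",
        parameters={"run_date": ds},
    )


with DAG(
    dag_id="daily_metrics_notebook",
    start_date=datetime(2023, 9, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_notebook", python_callable=run_notebook)
```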
Kyle Weller: Delta, Hudi, Iceberg — A Benchmark Compilation
The lakehouse architecture brings the best of the database and the data lake into the data infrastructure. Which one should you choose? Which one runs faster? Following a benchmark comparing Delta and Iceberg, Apache Hudi added its own benchmark to the same repo. It's exciting to see this benchmarking happen in a public GitHub repository, which brings more openness to the ecosystem.
I don't put much faith in benchmarks, since the answer is always "It depends on your workload," and the author rightly mentions this in the article.
One key thing to remember when running TPC-DS benchmarks comparing Hudi, Delta, and Iceberg is that, by default, Delta and Iceberg are optimized for append-only workloads, while Hudi is optimized for mutable workloads.
https://medium.com/@kywe665/delta-hudi-iceberg-a-benchmark-compilation-a5630c69cffc
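As a concrete illustration of that default, here is a hedged PySpark sketch of a Hudi write configured for a mutable (upsert) workload; the table name, key fields, and paths are hypothetical, and the option values should be checked against the Hudi docs for your version.

```python
# Hypothetical sketch of a Hudi upsert write in PySpark; table name, key/precombine
# fields, and paths are made up. Hudi's defaults favor mutable workloads like this
# upsert, while Delta and Iceberg writers default to append-style writes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/staging/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.operation": "upsert",            # mutate existing rows
    "hoodie.datasource.write.recordkey.field": "order_id",    # record key
    "hoodie.datasource.write.precombine.field": "updated_at"  # keep the latest version
}

(orders.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/orders/"))
```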
Sponsored: You're invited to IMPACT - The Data Observability Summit | November 8, 2023
Interested in learning how some of the best teams achieve data & AI reliability at scale? Learn from today's top data leaders and architects at The Data Observability Summit on how to build more trustworthy and reliable data & AI products with the latest technologies, processes, and strategies shaping our industry (yes, LLMs will be on the table).
Weibinzen: GraphAr - A Standard Data File Format for Graph Data Storage and Retrieval
Last week, I came across some interesting file formats built on top of existing columnar formats like Parquet. GraphAr is a standardized file format for graph data, independent of any computation or storage system, and it provides a set of interfaces for generating, accessing, and transforming these formatted files. It piqued my interest in an outstanding question of mine: can we model the data warehouse as a graph rather than a relational model?
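I haven't tried GraphAr's own libraries, so the sketch below only illustrates the underlying idea the format standardizes, vertex and edge tables persisted as columnar files, using plain pyarrow; the schema and file names are hypothetical.

```python
# Hypothetical sketch of the idea GraphAr standardizes: storing a graph as columnar
# vertex and edge tables. This uses plain pyarrow, not GraphAr's API.
import pyarrow as pa
import pyarrow.parquet as pq

# Vertex table: one row per node, with properties as columns.
vertices = pa.table({
    "id": [0, 1, 2],
    "label": ["user", "user", "order"],
    "name": ["alice", "bob", "order-42"],
})

# Edge table: one row per directed edge (an adjacency list in columnar form).
edges = pa.table({
    "src": [0, 1],
    "dst": [2, 2],
    "type": ["placed", "placed"],
})

pq.write_table(vertices, "vertices.parquet")
pq.write_table(edges, "edges.parquet")
```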
Cloud Native Geo: Performance Explorations of GeoParquet (and DuckDB)
GeoParquet is the second data format I came across this week. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet. The author explores the performance of GeoParquet with DuckDB. Combining WebAssembly, DuckDB, and the GeoParquet format enables exciting interactive web applications, like visualizing 1 million building footprints with 7 million total coordinates for the U.S. state of Utah and calculating their areas on the fly.
https://cloudnativegeo.org/blog/2023/08/performance-explorations-of-geoparquet-and-duckdb/
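Here is a hedged sketch of the kind of query the post describes, computing footprint areas straight from a GeoParquet file with DuckDB's spatial extension. The file and column names are hypothetical, and the geometry column is assumed to be WKB-encoded as the GeoParquet spec describes.

```python
# Hypothetical sketch: compute building footprint areas from a GeoParquet file with
# DuckDB's spatial extension. File and column names are made up; GeoParquet stores
# geometries as WKB, so we decode with ST_GeomFromWKB before calling ST_Area.
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

areas = con.execute("""
    SELECT id,
           ST_Area(ST_GeomFromWKB(geometry)) AS area
    FROM read_parquet('utah_buildings.parquet')
    ORDER BY area DESC
    LIMIT 10
""").fetchdf()

print(areas)
```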
Sponsored: How To Create a Customer 360
“Without stitching each unique identifier to the user, systems and teams will wrongly assume that these three transactions are from three distinct users. Worse yet, you are unable to calculate — or inaccurately calculate — important computed traits like user_total_revenue because of the fragmented user identities.”
Customer 360 is, at the end of the day, a data problem. Here, the team at RudderStack details how to solve identity resolution in the warehouse and build user features to create activation-ready customer profiles.
https://www.rudderstack.com/blog/how-to-create-a-customer-360/
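The article walks through RudderStack's own approach; as a generic illustration of the identity-stitching problem it describes, here is a hedged sketch that resolves every identifier observed together on an event to one canonical user via union-find. The identifiers are made up, and real identity graphs in a warehouse usually need an iterative, connected-components style pass over event tables rather than an in-memory structure.

```python
# Hypothetical sketch of identity stitching as connected components (union-find):
# identifiers observed together on the same event get merged into one canonical user.
class IdentityGraph:
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def link(self, *identifiers):
        # All identifiers seen on the same event belong to the same user.
        roots = [self._find(i) for i in identifiers]
        for r in roots[1:]:
            self.parent[r] = roots[0]

    def canonical(self, identifier):
        return self._find(identifier)


graph = IdentityGraph()
graph.link("anon:123", "email:ann@example.com")
graph.link("anon:456", "email:ann@example.com")   # same person, different device
graph.link("device:ios-789", "anon:456")

# All three transactions resolve to one user, so user_total_revenue aggregates correctly.
assert graph.canonical("anon:123") == graph.canonical("device:ios-789")
```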
Areca Data: The definitive guide to debugging dbt
Every system will fail in production with unknown errors. The author writes about a set of common dbt errors and how to debug them.
Related Note: If you're like me and get a kick out of system failures in production environments, you should definitely read the article "Operating Effectively in High Surprise Mode." It's a great resource for anyone who wants to learn how to deal with unexpected events calmly and collectedly.
https://www.arecadata.com/the-definitive-guide-for-debugging-dbt/
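When a run does fail, one quick debugging habit (not from the article, just a hedged sketch) is to pull the error messages out of dbt's run_results.json artifact. The path assumes the default target directory, and the artifact schema may vary between dbt versions.

```python
# Hypothetical sketch: list failed dbt nodes and their error messages from the
# run_results.json artifact written to the default target/ directory.
import json
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())

for node in results["results"]:
    if node["status"] in ("error", "fail"):
        print(f'{node["unique_id"]}: {node["message"]}')
```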
Adevinta: How we matured Fisher, our A/B testing Package
Experimentation is at the core of many online business operations, and we've seen many companies trying to democratize the experimentation platform. Along similar lines, Adevinta writes about Fisher, a Python package that enables data scientists to run straightforward hypothesis tests and produce comprehensive reports with very few lines of code.
https://medium.com/adevinta-tech-blog/how-we-matured-fisher-our-a-b-testing-package-110c99b993fc
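Fisher's API isn't shown here, so the sketch below only illustrates the kind of test such a package typically wraps: a two-sample proportion z-test on conversion counts using statsmodels. The numbers are made up.

```python
# Hypothetical sketch of the kind of test an A/B package like Fisher wraps:
# a two-sample proportion z-test on conversion counts. Numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [410, 465]      # converted users in control / treatment
exposures = [10_000, 10_000]  # users exposed to each variant

stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the variants' conversion rates differ.")
else:
    print("No significant difference detected.")
```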
Sponsored: Great Data Debate–The State of Data Mesh
Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?
Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.
Watch the Recording of the Great Data Debate →
Databricks: Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection for Large Language Models
The adoption of LLMs in enterprises is slowly increasing, and so are the techniques and infrastructure around fine-tuning them. Databricks writes about LoRA, an improved fine-tuning method where, instead of fine-tuning all the weights of the pre-trained large language model, two smaller matrices that approximate the update to the larger weight matrix are fine-tuned.
https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
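A minimal PyTorch sketch of the idea: freeze the pretrained weight W and train only a low-rank update B @ A scaled by alpha / r. The layer shapes and hyperparameters are illustrative, not Databricks' implementation.

```python
# Minimal LoRA sketch in PyTorch: the pretrained layer is frozen and only the
# low-rank factors A and B are trained; the effective weight is W + (alpha/r) * B @ A.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # Frozen path plus the trainable low-rank update.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # ~65K instead of ~16.8M
```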
Oana Olteanu: Fine-Tuning LLMs - learnings from the DeepLearning SF Meetup
In another exciting read on fine-tuning LLMs, the author shares lessons learned from a recent SF deep learning meetup. The author also highlights how LoRA (Low-Rank Adaptation of Large Language Models) can reach the same accuracy as end-to-end fine-tuning while being much cheaper to train, since you update only a small set of parameters.
AWS: Develop a Cost-Aware Culture for Optimized and Efficient SaaS Growth
The Instacart S-1 filing triggered some interesting conversations across many forums.
My take on this is that I admire what Instacart engineering has built on top of Snowflake. Anyone who has worked at a high-growth startup like Instacart can relate to the growth/cost-cutting cycles known as the pendulum effect. Organizations swing their operating principles toward growth at any cost one year and then swing the pendulum toward cost optimization, hoping the churn results in optimal resource usage efficiency.
However, the pendulum effect is not an excuse for inefficient system design. Engineers must be aware of the cost profile of their solutions; I learned that system design lesson during an incident mitigation process.
AWS writes an excellent article on the same topic, discussing how to develop a cost-aware engineering culture.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.