Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Former CEO of Snowflake Bob Muglia joins a packed lineup of data leaders pioneering the technologies & processes shaping data engineering, along with the first Chief Data Scientist of the U.S., the founder of the Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Preset: How the Modern Data Stack is Reshaping Data Engineering
The blog is a comprehensive account of recent trends in data engineering and how the modern data stack is reshaping the field. As the blog notes, the critical trend to watch:
we've seen the Spark ecosystem increasingly become a database through the rise of SparkSQL. Meaning not only are the databases increasingly good at supporting ETL workloads, but some ETL systems are also increasingly good at acting as a database.
My two cents: the explosion of the modern data stack brings "best-of-breed" solutions. However, the core developer workflow still revolves around the ELT systems. It will be exciting to watch how ETL systems emerge to provide an integrated data experience.
https://preset.io/blog/reshaping-data-engineering/
Benn Stancil: A method for measuring analytical work
How do you measure the success of analytical practices? Given that the feedback loop for a business decision is long and the counterfactual often can't be assessed, does it still make sense to measure the analytical practice by its outcome?
"Time To Insight" is the north star metric to measure the success of a data engineering team. Along the same lines, the blog narrates how the "Time To Make Decision" metric measures analytical practice.
https://benn.substack.com/p/method-for-measuring-analytical-work
Twitter: Forecasting SQL query resource usage with machine learning
One of the challenges of data infrastructure is balancing query performance against cost. Twitter writes an exciting blog narrating how it applies machine learning on top of Presto to forecast the resource usage of SQL queries and optimize accordingly.
Confluent: The Future of SQL - Databases Meet Stream Processing
SQL is a powerful language that allows us to express complex questions of our data with ease. How does SQL adapt to work not only with data at rest but also with streaming data? The article narrates how moving from pull to push query execution changes the query complexity from O(number of records in input table) to O(rate of table change).
https://www.confluent.io/blog/databases-meet-stream-processing-the-future-of-sql/
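The push vs. pull contrast can be sketched in a few lines of Python. This is an illustration only, not code from the article: the names `pull_count` and `PushCount` and the `user` field are hypothetical. The pull version rescans the whole table on every query, while the push version does a constant amount of work per change and keeps an always-current result.

```python
from collections import defaultdict


def pull_count(table, user):
    """Pull query: scan every row at query time.

    Work per query grows with the number of records in the table.
    """
    return sum(1 for row in table if row["user"] == user)


class PushCount:
    """Push query: maintain a materialized count per user.

    Work is done once per change event, so total cost follows the
    rate of table change rather than the table size.
    """

    def __init__(self):
        self.counts = defaultdict(int)

    def on_insert(self, row):
        # Incrementally update the result for the affected key only.
        self.counts[row["user"]] += 1

    def result(self, user):
        # Reading the result is a lookup, not a scan.
        return self.counts.get(user, 0)


# Hypothetical stream of change events applied to both models.
events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
table = []
view = PushCount()
for event in events:
    table.append(event)    # data at rest, queried later by scanning
    view.on_insert(event)  # streaming view, updated as changes arrive
```

Both approaches return the same answer for a given user; the difference is when the work happens and how it scales.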
Ryan Gross: Designing Data Platforms to Harness the Power of Fog Computing
The modern data stack predominantly focuses on the concept of a LakeHouse architecture, which takes the best attributes of traditional data warehouses and runs them on platforms with data lake storage architectures. Following on from Confluent's thoughts on streaming SQL, the author raises great questions about the role of Fog computing in the modern data platform.
Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First
Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.
https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first
StarTree: What makes Apache Pinot fast?
StarTree writes about why Pinot is fast, explaining its various indexing techniques & multi-model support. JSON indexing to support semi-structured data analysis and aggregation optimization using star-tree indexing are some of the highlights.
Part 1: https://www.startree.ai/blogs/what-makes-apache-pinot-fast-chapter-1/
Part 2: https://www.startree.ai/blogs/what-makes-apache-pinot-fast-chapter-ii/
Qonto: Scaling Airflow on Kubernetes - lessons learned
Qonto shares its experience scaling Airflow on Kubernetes. Using pod template files to optimize resource consumption for the sensor & task operators, monitoring the lifecycle of a task, and cluster elasticity are some of the exciting reads.
https://medium.com/qonto-way/scaling-airflow-on-kubernetes-lessons-learned-a0d3d0417fc1
Meltwater: Our Journey from Database to Data Lake
Meltwater writes about its journey from a single database to a data lake for its reporting solution. The cost comparison matrix is a fascinating study that shows S3 + Athena is 6X more cost-efficient than the RDS solution.
https://underthehood.meltwater.com/blog/2021/11/05/our-journey-from-database-to-data-lake/
Leev’s: A Practical Guide for Kafka Cost Reduction
A great read with practical tips on reducing Kafka infrastructure cost, focusing on AWS instance types, compression, rack-aware consumers that fetch data from the closest replica, cluster rebalancing & cluster tuning configurations.
https://leevs.dev/kafka-cost-reduction/
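As a rough sketch of two of these tips, the settings below show how rack-aware fetching and compression are typically switched on. This is not taken from the article: the helper name `consumer_config` and the rack, server, and group values are hypothetical, and fetching from the closest replica assumes Kafka 2.4 or later on both brokers and clients.

```python
def consumer_config(bootstrap_servers, group_id, rack):
    """Build a consumer config that advertises the client's rack.

    With `client.rack` set, a broker using the rack-aware replica
    selector can serve fetches from a replica in the same rack or
    availability zone, which avoids cross-zone transfer charges.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "client.rack": rack,
    }


# Broker-side settings needed for follower fetching to take effect.
# The rack value is an assumed AWS availability zone name.
BROKER_SETTINGS = {
    "broker.rack": "us-east-1a",
    "replica.selector.class": (
        "org.apache.kafka.common.replica.RackAwareReplicaSelector"
    ),
}

# Producer-side compression shrinks network and storage footprint.
PRODUCER_SETTINGS = {"compression.type": "zstd"}

# Hypothetical usage: a consumer running in the same zone as the broker.
config = consumer_config("broker-1:9092", "reporting", "us-east-1a")
```

The dictionary can be passed to whichever Kafka client library is in use; the property names themselves are standard Kafka configuration keys.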
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.