Data Engineering Weekly #116
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
It Depends: Andrew Brust on What Will Happen in 2023 in Data, Analytics, and AI
I’m a regular listener of the It Depends podcast, and I find that Sanjeev always draws insightful views out of his guest speakers. I've read and watched a few 2023 data predictions, but found this conversation more practical, with fewer buzzwords and more of the data practitioner's reality. The conversation emphasizes the importance of governance and even points to a new role, "Legal Engineering."
The conversation around data observability points out the growing gap between observing data [i.e., finding the issues] and actually fixing data quality. Data observability solutions need to pay more attention to the remediation workflow so they don’t end up as disjointed as Data Catalogs.
Meta: Tulip - Modernizing Meta’s data platform
Meta writes about the adoption story of Tulip, the data transport and serialization protocol for its data platform. The highlight for me is the debugging tool “loggertail.” The pattern of writing analytics data to a null sink (/dev/null) by default, while being able to dynamically subscribe to and consume it from the command line, is a very powerful one.
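To make the pattern concrete, here is a minimal sketch of the idea as I understand it from the blog: events flow to a null sink unless a debugger dynamically subscribes to the category. The names (NullSink, LogRouter) and structure are my own illustration, not Meta's actual loggertail implementation.

```python
# Hypothetical sketch of the "loggertail" pattern: analytics events are
# discarded (like writing to /dev/null) unless a consumer, such as a CLI
# tail, dynamically subscribes to the logging category at runtime.

class NullSink:
    """Discards events, analogous to writing to /dev/null."""
    def write(self, event):
        pass

class LogRouter:
    def __init__(self):
        self._subscribers = {}  # category -> list of callbacks
        self._null = NullSink()

    def subscribe(self, category, callback):
        """Dynamically attach a consumer (e.g., a CLI tail) to a category."""
        self._subscribers.setdefault(category, []).append(callback)

    def log(self, category, event):
        subs = self._subscribers.get(category)
        if not subs:
            self._null.write(event)  # nobody listening: event is discarded
        else:
            for cb in subs:
                cb(event)  # delivered live to every attached consumer

router = LogRouter()
router.log("search_events", {"query": "lamp"})  # discarded, no subscriber

seen = []
router.subscribe("search_events", seen.append)  # a "tail" attaches
router.log("search_events", {"query": "sofa"})  # now delivered live
```

The appeal is that production pays almost nothing when nobody is debugging, yet an engineer can attach to the live stream without redeploying anything.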
LucianoSphere: Build custom-informed GPT-3-based chatbots for your website with very simple code
ChatGPT is the talk of the town; I found this an exciting article on building custom GPT-3-based chatbots informed by your own content. Please try it out and let us know what you build!!!
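The core idea behind such "custom-informed" bots is to stuff your own content into the prompt so the model answers from it. Here is a minimal, hedged sketch; the documents and prompt wording are illustrative assumptions, and the actual API call (which needs a paid key) is shown commented out rather than as the article's exact code.

```python
# Sketch: build a context-stuffed prompt so a completion-style model
# answers from your own website content. Documents below are made up.

def build_prompt(context_docs, question):
    """Assemble a prompt that grounds the model in the given documents."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Our store ships orders within 2 business days.",
    "Returns are accepted for 30 days after delivery.",
]
prompt = build_prompt(docs, "How fast do you ship?")

# With the OpenAI API you would then send the prompt, roughly:
# import openai
# completion = openai.Completion.create(model="text-davinci-003",
#                                       prompt=prompt, max_tokens=100)
```

The interesting engineering problem, once your content outgrows a single prompt, is selecting which snippets to include per question.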
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Luke Lin: Incorporate data into your next product
The adoption of data products is moving beyond empowering business operations to building customer-facing applications. The author writes, from a product manager's perspective, about what to consider when building data-driven, customer-facing products.
Jeff Chou: Is Databricks’s autoscaling cost efficient?
Autoscaling is both a blessing and a curse. Scaling up and down is a highly preferable architecture for data pipelines to absorb workload spikes at peak hours. The author presents an interesting benchmark comparing Databricks autoscaling against fixed-size compute.
In our analysis, we saw that a fixed cluster could outperform an autoscaled cluster in both runtime and costs for the 3 workloads we looked at by 37%, 28%, and 65%.
As with any benchmark, the results vary depending on your workload and configuration, but overall the findings highlight that autoscaling infrastructure requires more maturity.
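A toy cost model helps show why autoscaling isn't automatically cheaper: while the cluster is under-provisioned, work queues up, stretching the runtime you pay for. All the numbers below are hypothetical assumptions for illustration, not Databricks pricing or the article's benchmark data.

```python
# Toy back-of-the-envelope comparison of fixed vs. autoscaled cluster cost.
# Numbers are hypothetical; cost = nodes * hours * node-hour rate.

NODE_HOUR_COST = 1.0  # $/node/hour, assumed flat rate

def fixed_cluster_cost(nodes, runtime_hours):
    return nodes * runtime_hours * NODE_HOUR_COST

def autoscaled_cost(phases):
    """phases: list of (nodes, hours) pairs, including warm-up time."""
    return sum(n * h * NODE_HOUR_COST for n, h in phases)

# Fixed: 8 nodes finish the job in 1.0 hour -> $8.00, 1.0h runtime.
fixed = fixed_cluster_cost(8, 1.0)

# Autoscaled: starts at 2 nodes, spends 0.5h under-provisioned before
# scaling to 8 nodes for another 0.9h to clear the backlog -> $8.20, 1.4h.
auto = autoscaled_cost([(2, 0.5), (8, 0.9)])
```

In this sketch the autoscaled run is both slower and slightly more expensive, which mirrors the direction (though not the magnitude) of the article's findings: the savings from idle-time scale-down have to beat the cost of scale-up latency.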
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Coinbase: SOON (Spark cOntinuOus iNgestion) for near real-time data at Coinbase - Part 1
Coinbase writes about SOON, its unified data ingestion framework!! The blog highlights the complexity of ingesting from various sources with a mix of CDC and non-CDC events. TIL about Databricks Change Data Feed, and yes, Coinbase uses both Snowflake and Databricks. :-)
Miro: Writing data product pipelines with Airflow
The Data Contract concept aims to establish guarantees and expectations around a production-ready pipeline. While defining those expectations and constraints, developer productivity plays a critical role. Miro writes about building data product pipelines with Airflow, taking a practical approach to implementing data contracts in a batch pipeline.
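At its simplest, a batch data contract is a check that runs before a dataset is published. Here is a minimal sketch of such a check; the contract format and field names are illustrative assumptions, not Miro's implementation.

```python
# Sketch of a batch data contract check, the kind of validation one might
# run as an Airflow task before publishing a data product downstream.
# The contract schema below is a made-up example.

CONTRACT = {
    "required_fields": {"user_id": int, "event_type": str},
    "allowed_event_types": {"signup", "login", "purchase"},
}

def validate_batch(rows, contract=CONTRACT):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        for field, ftype in contract["required_fields"].items():
            if field not in row:
                violations.append(f"row {i}: missing {field}")
            elif not isinstance(row[field], ftype):
                violations.append(f"row {i}: {field} has wrong type")
        if row.get("event_type") not in contract["allowed_event_types"]:
            violations.append(f"row {i}: unknown event_type")
    return violations

good = [{"user_id": 1, "event_type": "signup"}]
bad = [{"user_id": "oops", "event_type": "deleted"}]
```

In an Airflow DAG, you could wrap `validate_batch` in a PythonOperator that raises on a non-empty violation list, so the pipeline fails fast instead of shipping contract-breaking data.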
Sponsored: Webinar: How Khatabook Migrated to RudderStack from Segment
As Khatabook rapidly scaled users in 2022, Segment’s inferior support and unreasonable pricing became increasingly painful. Join this session with Khatabook’s engineering team to find out why they chose to move to RudderStack and get an overview of the migration process.
Etsy: Improving Support for Deep Learning in Etsy's ML Platform
Etsy writes about the changing landscape of its ML platform, from classic gradient-boosted trees to deep learning techniques. The blog walks through a search-ranking example and shows how the team improved its observability and early-feedback tooling.
Dropbox: Accelerating our A/B experiments with machine learning
The A/B experiment pipeline is often the last one to finish in a data infrastructure; the window of confidence (the acceptable data sample) is always a question when making business decisions from A/B test learnings. Dropbox writes about the practical complexity of long-running experiments and how machine learning can support earlier, more intuitive decisions, using its Expected Revenue model as an example.
Flat Pack Tech: Online Analytics Pipeline Re-engineering
Mehdio: 10 Lessons Learned In 10 Years Of Data
Mehdio shares 10 lessons from a decade-long data engineering career. The blog drops some very insightful reflections on past hype: “the cloud will take data engineers’ jobs,” the aspiration to become data scientists, the rush to put notebooks into production, the modern data stack mess [oh, my favorite one], and Rust!!!
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.