Data Engineering Weekly #118
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Launching “Behind the Tech” Series & Data Council 2023
Data Engineering Weekly strives to bring our readers the best curation every week. As a data engineer, I always want to know what goes on behind the scenes of a product and how system design can solve a business problem efficiently. We are opening the Behind the Scene series for startup founders to write in-depth technical articles on system design and efficient product usage.
To begin, we are partnering with Pathway.com to launch a three-part series about unlocking stream processing, in which Pathway talks about applying linear regression and classification in real time. Stay tuned for more.
If you wish to write for the Behind the Scene series, write to us at email@example.com. You can also submit a founder's story or an article suggestion.
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is nearing, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
MotherDuck: Big Data is Dead
All large data sets are generated over time. Time is almost always an axis in a data set. New orders come in every day. New taxi rides. New logging records. New games are being played. But compute needs will likely not change much over time; most analysis is done over recent data.
There is a lot of truth in this statement. Historical data processing is a rare event; 99% of the computing happens over the last 24 hours of data. Big Data may well be dead, but we can’t deny that this is the result of collective advances in data processing techniques.
Dropbox: Balancing quality and coverage with our data validation framework
Data Testing should be part of the data creation lifecycle; it is not a standalone process. I believe the current data testing platforms can’t support the complex nature of data testing.
The Dropbox data team highlights the same problem. It describes how an extended Airflow operator that adopts the Write-Audit-Publish pattern with SQL helps to standardize the data testing strategy.
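For readers new to the pattern, here is a minimal sketch of Write-Audit-Publish in Python using the standard-library sqlite3 module. It is not Dropbox's Airflow operator; the function name write_audit_publish, the staging-table suffix, and the sample tables and audit queries are all assumptions made for illustration.

```python
import sqlite3


def write_audit_publish(conn, staging_sql, audit_sqls, target_table):
    """Write to a staging table, audit it, and publish only if every audit passes.

    Hypothetical helper: each audit query must count violating rows,
    so a result of 0 means the audit passed.
    """
    staging_table = f"{target_table}__staging"
    cur = conn.cursor()

    # Write: materialize the new data away from consumers.
    cur.execute(f"DROP TABLE IF EXISTS {staging_table}")
    cur.execute(f"CREATE TABLE {staging_table} AS {staging_sql}")

    # Audit: run every SQL check against the staging table.
    for name, sql in audit_sqls.items():
        bad_rows = cur.execute(sql.format(table=staging_table)).fetchone()[0]
        if bad_rows:
            cur.execute(f"DROP TABLE {staging_table}")
            conn.commit()
            raise ValueError(f"audit '{name}' failed: {bad_rows} bad rows")

    # Publish: swap the audited staging table into place.
    cur.execute(f"DROP TABLE IF EXISTS {target_table}")
    cur.execute(f"ALTER TABLE {staging_table} RENAME TO {target_table}")
    conn.commit()


# Sample data and audits (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

audits = {
    "no_null_ids": "SELECT COUNT(*) FROM {table} WHERE order_id IS NULL",
    "no_negative_amounts": "SELECT COUNT(*) FROM {table} WHERE amount < 0",
}

# Clean data passes the audits and gets published.
write_audit_publish(conn, "SELECT * FROM raw_orders", audits, "orders")
published = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# A bad row fails the audit, and the published table stays untouched.
conn.execute("INSERT INTO raw_orders VALUES (3, -5.0)")
try:
    write_audit_publish(conn, "SELECT * FROM raw_orders", audits, "orders")
    rejected = False
except ValueError:
    rejected = True
still_published = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The point of the pattern is that consumers only ever see the published table, so a failed audit costs a delayed refresh rather than bad data downstream.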
Mixpanel: Tracking events at millisecond granularity
My first reaction on reading “Historically, Mixpanel used to track events at second-level granularity” was: wait, what? No system is perfect. I admire the Mixpanel team for discussing the complexity of changing the timestamp granularity and the system design behind it.
Sponsored: Fireside Chat: The Future of CDPs
Join this live session with BARK CTO Nari Sitaraman and RudderStack Founder Soumyadeb Mitra on 2/15 at 9 AM PT to make sense of the CDP evolution and get practical advice on how to drive competitive advantage as a data leader in 2023.
Shopify: The Complex Data Models Behind Shopify's Tax Insights Feature
The blog comes at the right time, as the data community frequently talks about the lost art of Data Modeling. Shopify shares its experience designing the Tax Insights feature, the business complexity, and lessons learned.
Picnic: Deploying Data Pipelines using the Saga pattern
An interesting take on the pipeline orchestration engine as a Saga pattern implementation. Picnic writes about how it automates pipeline deployment. The blog definitely piqued my curiosity.
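As a quick refresher on the Saga pattern itself, here is a minimal sketch in Python: each step pairs an action with a compensating action, and a failure triggers the compensations of the completed steps in reverse order. This is a generic illustration, not Picnic's implementation; run_saga and the deployment step names are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all step names below are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for _name, action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Roll back only the steps that actually completed.
            for undo in reversed(completed):
                undo()
            return False
    return True


log: List[str] = []


def failing_step() -> None:
    raise RuntimeError("schedule registration failed")


steps: List[Step] = [
    ("create_table", lambda: log.append("create_table"), lambda: log.append("drop_table")),
    ("deploy_job", lambda: log.append("deploy_job"), lambda: log.append("remove_job")),
    ("register_schedule", failing_step, lambda: log.append("unregister_schedule")),
]

ok = run_saga(steps)
# log now records the two completed actions followed by their compensations
# in reverse order; the failed step's own compensation is never invoked.
```

Unlike a database transaction, nothing is rolled back automatically; each compensation has to be written explicitly, which is why the pattern maps naturally onto multi-system deployments.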
Sponsored: [New] Winning Strategies—2023 Modern Data Leader’s Playbook 🏅
Don't fumble your data strategy in 2023. Learn how other data managers, directors, and other leaders set their teams up for success. See how to drive organizational impact at scale, touching on the technologies, processes, and cultural requirements necessary to succeed in this role.
Atlassian: Data Processing Agreements (DPAs) 101: What app developers need to know
Atlassian continues to write about the importance of data privacy laws and what developers need to know about the regulatory requirements. A must-read for data engineering professionals.
Etsy: Adding Zonal Resiliency to Etsy’s Kafka Cluster
Running Kafka across zones comes with a penalty in cost and latency. Etsy writes about resiliency engineering for its Kafka infrastructure, adding zonal resilience in Google Cloud.
Part 1: https://www.etsy.com/codeascraft/adding-zonal-resiliency-to-etsys-kafka-cluster-part-1
Part 2: https://www.etsy.com/codeascraft/leveraging-zonal-resiliency-to-improve-updates-for-etsys-kafka-cluster-part-2
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Paulo Salem: Building GPT-3 applications — beyond the prompt
I started using ChatGPT for my day-to-day coding; it is a huge productivity booster, and I don’t think I can go back to working without it. I’m surprised by how quickly it became a habit, and I found this article a pretty exciting tutorial on building GPT-3 applications.
Twitter: The data platform cluster operator service for Hadoop cluster management
Speaking of “Big Data is Dead,” Twitter writes about streamlining its Hadoop cluster operations. Twitter has previously written about its move to Google BigQuery; interestingly, Hadoop is still not replaceable internally.
Bruce Momjian: Will Postgres live forever? – Postgres Innovation: Full-Text Search
Any modern database should support storing and processing semi-structured data and free-text search. Expecting a well-defined upfront schema is practically impossible given the variety of data sources we deal with. I found the blog very informative; it talks about the advancements in PostgreSQL that support full-text search.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.