Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
There is a lot of exciting stuff to catch up on in Data Engineering Weekly this week, especially since we took last week off.
dbt Labs: Coalesce - the analytics engineering conference 2021
Coalesce, the analytics engineering conference, is a delight to watch this week. dbt as a metrics layer is an exciting evolution and the key highlight of the conference. If you missed the conference, you can watch all the replays here.
https://coalesce.getdbt.com/replays/
James Le: What I Learned From the Open Source Data Stack Conference 2021
The Open Source Data Stack Conference is another exciting data conference, focused on the modern data stack in the open-source world. The author writes an excellent summary of the conference.
https://jameskle.com/writes/open-source-data-stack-2021
You can catch up on all the replays of the Open Source Data Stack Conference 2021 here.
https://www.opensourcedatastack.com/stage/events
AWS: Top Announcements of AWS re:Invent 2021
AWS published the top announcements from the AWS re:Invent 2021 conference. The top announcements from the data engineering perspective are:
SageMaker Studio Lab, a free service to learn and experiment with ML
The availability of new storage-optimized EC2 instances (Im4gn & Is4gen)
https://aws.amazon.com/blogs/aws/top-announcements-of-aws-reinvent-2021/
LinkedIn: Evolving LinkedIn’s analytics tech stack - Lessons from a large-scale data platform migration
LinkedIn shares the story of moving its analytical stack from Teradata data warehouse systems to open-source big data technologies. The analytical stack includes 1400+ datasets, 900+ data flows, and 2100+ users. The migration strategy, combined with improvements to the data model, is an exciting read.
https://engineering.linkedin.com/blog/2021/evolving-linkedin-s-analytics-tech-stack
Tableau: Top Data books of 2021
Though the title says the top data books, the shortlisted books focus on data visualization or the Tableau platform. Nonetheless, data visualization books are great to read.
https://www.tableau.com/about/blog/2021/12/andy-cotgreave-top-data-books-2021
Sponsored: The Data Stack Show Live - What is the Modern Data Stack?
Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.
https://rudderstack.com/video-library/the-data-stack-show-live-what-is-the-modern-data-stack/
Erik Bernhardsson: Storm in the stratosphere - how the cloud will be reshuffled
Erik Bernhardsson writes an exciting prediction on cloud vendor trends and builds the case on the success of Snowflake over Redshift. It is undoubtedly true in the analytical world, where AWS solutions always package and sell open-source tools but never go further to simplify the developer workflow. A couple of interesting predictions to highlight:
Kubernetes will be some weird thing people loved for five years, just like Hadoop was from 2009-2013, but the world will move on.
YAML will be something old jaded developers bring up after a few drinks. You know it's time to wrap up at the party at that point.
https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html
Shreya Shankar: The Modern ML Monitoring Mess: Rethinking Streaming Evaluation
A streaming sliding window with a finite interval is a go-to ML metrics monitoring strategy. The threshold, window size, and alerts are still defined manually for each metric. The author argues that this procedure for evaluating ML on streams of data is broken, highlighting representation differences, varying sample sizes, and delayed feedback on the sliding window.
https://www.shreya-shankar.com/rethinking-ml-monitoring-1/
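For readers unfamiliar with the setup being criticized, here is a minimal sketch of a sliding-window metric monitor. It is not taken from the article: the class name `SlidingWindowAccuracy`, the choice of accuracy as the metric, and the window size and threshold values are all assumptions made for illustration.

```python
from collections import deque


class SlidingWindowAccuracy:
    """Hypothetical monitor: accuracy over the last `window_size` labeled predictions."""

    def __init__(self, window_size, threshold):
        # deque(maxlen=...) drops the oldest outcome once the window is full
        self.window = deque(maxlen=window_size)
        self.threshold = threshold  # manually chosen, as the summary notes

    def accuracy(self):
        if not self.window:
            raise ValueError("no observations yet")
        return sum(self.window) / len(self.window)

    def update(self, prediction, label):
        """Record one outcome; return True when window accuracy falls below the threshold."""
        self.window.append(prediction == label)
        return self.accuracy() < self.threshold


monitor = SlidingWindowAccuracy(window_size=4, threshold=0.75)
stream = [(1, 1), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]  # (prediction, label) pairs
alerts = [monitor.update(pred, label) for pred, label in stream]
```

Note that the first few updates compute accuracy over fewer than four samples, and every update assumes the label is already known. Those are the varying sample size and delayed feedback issues mentioned above.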
Microsoft: SynapseML - A simple, multilingual, and massively parallel machine learning library
Microsoft announces the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. With SynapseML, developers can build scalable and intelligent systems for solving challenges in domains such as Anomaly Detection, Computer Vision, Deep Learning, Text analytics, etc.
Data@Monzo: Mapping our data journey with column lineage
Monzo writes about its journey to bring column-level lineage to track and understand the scope of changes across the data warehouse and automatically detect unused columns. TIL about ZetaSQL, which helps to parse & analyze BigQuery SQL, and I am looking forward to playing around with it.
https://medium.com/data-monzo/mapping-our-data-journey-with-column-lineage-56209c00606d
PayPal: Building Data Quality into the Enterprise Data Lake
PayPal writes about its Rule Execution Framework, which uses a centralized rule configuration system to manage data quality rules & rulesets. The adoption of SQL to write complex data validation rules and the workflow focused on the domain owners to define the data quality rules are exciting.
https://medium.com/paypal-tech/building-data-quality-into-the-enterprise-data-lake-9dec305c3757
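To make the idea of SQL-defined data quality rules concrete, here is a small sketch. It is not PayPal's framework: the `payments` table, the rule names, the `run_rules` helper, and the use of an in-memory SQLite database are all assumptions chosen so the example runs on its own.

```python
import sqlite3

# Hypothetical ruleset: each SQL query counts the rows that violate the rule.
RULES = {
    "amount_not_negative": "SELECT COUNT(*) FROM payments WHERE amount < 0",
    "currency_present": "SELECT COUNT(*) FROM payments WHERE currency IS NULL OR currency = ''",
}


def run_rules(conn, rules):
    """Run every rule and return {rule_name: violation_count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in rules.items()}


# Illustrative data only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL, currency TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, 10.0, "USD"), (2, -5.0, "EUR"), (3, 7.5, None)],
)

results = run_rules(conn, RULES)
failed = [name for name, count in results.items() if count > 0]
```

Keeping the rules as plain SQL strings in one mapping is a rough stand-in for a centralized rule configuration, where domain owners can add a rule without touching the execution code.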
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.