Welcome to the 31st edition of the data engineering newsletter. This week's release is a new set of articles that focus on Redpoint Ventures' Reverse ETL primer, JP Morgan's data mesh implementation, DBT's modern data stack, ValidIO's ML & Data trends for 2021, Airbnb's visualization of data timeliness, Pinterest's lessons learned from running Kafka at scale, Confluent's 42 things to stop doing once ZooKeeper is gone, LinkedIn's approach to the data integration problem with Apache Gobblin, Facebook's mitigation of silent data corruption, Reddit's scaling of its reporting system, and LinkedIn's GraphQL implementation in DataHub.
Reverse ETL — A Primer
Over the last decade, the cloud and SaaS products changed the way businesses operate. In a modern business, customer data is spread across many SaaS vendors. A single source of truth is a myth in the modern data infrastructure. I often call this "Truth In Motion." I shared a similar thought a year back.
Along the same lines, the blog describes "Reverse ETL," where data flows from the internal data warehouse to SaaS providers like Salesforce, Zendesk, and Intercom. It is an exciting space to watch, as success depends on how well the SaaS vendors simplify ingestion and deliver cost-effective time to value from the customer's data.
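To make the direction of the flow concrete, here is a minimal sketch of a Reverse ETL loop. It is an illustration under assumptions, not any vendor's implementation: an in-memory SQLite database stands in for the warehouse, the `customer_metrics` table and its columns are hypothetical, and `push` is a placeholder for whatever bulk-upsert call a real SaaS API would offer.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def read_customers(conn: sqlite3.Connection) -> Iterator[Record]:
    """Read modeled rows from the (hypothetical) warehouse table."""
    cursor = conn.execute(
        "SELECT email, lifetime_value FROM customer_metrics ORDER BY email"
    )
    for email, lifetime_value in cursor:
        # Map warehouse columns to the field names the destination expects.
        yield {"email": email, "lifetime_value": lifetime_value}


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` items."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def reverse_etl(
    conn: sqlite3.Connection,
    push: Callable[[List[Record]], None],
    batch_size: int = 2,
) -> int:
    """Send warehouse rows to a SaaS destination in batches; return rows sent."""
    sent = 0
    for batch in batched(read_customers(conn), batch_size):
        push(batch)  # assumption: stands in for a real bulk-upsert API call
        sent += len(batch)
    return sent


def demo_warehouse() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the data warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_metrics (email TEXT, lifetime_value REAL)")
    conn.executemany(
        "INSERT INTO customer_metrics VALUES (?, ?)",
        [("a@example.com", 120.0), ("b@example.com", 75.5), ("c@example.com", 10.0)],
    )
    return conn
```

Calling `reverse_etl(demo_warehouse(), received.append)` with an empty list `received` collects the batches that would otherwise go to the destination. A real pipeline would also need rate limiting, retries, and change detection, which this sketch leaves out.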
Implementing a Data Mesh Architecture at JPMC
JPMC talked about the team's thinking and implementation strategy for adopting data mesh principles. The talk is a good narrative on structuring those principles, publishing a taxonomy, and the need for pragmatic compromise during adoption.
The Modern Data Stack: Past, Present, and Future
What does the data engineering world look like beyond the Hadoop ecosystem? The blog from DBT gives a comprehensive overview of the modern data stack, starting from the introduction of Redshift and its impact on the data warehouse. The blog narrates the challenges ahead and reminds us that this space is wide open for innovation over the next decade.
ML & Data Trends: Wrapping up 2020 and looking into 2021 & beyond
The blog narrates how the underlying data infrastructure influences ML development, in line with the recent trends around "Reverse ETL" and the modern cloud-native data stack. The blog also reiterates that we are still in the early stages of MLOps, data quality tooling, and unified data architecture on the path to industrializing ML development.
Visualizing Data Timeliness at Airbnb
Commitment, consistency, and clarity in the data pipeline are the core principles for building trust in data and empowering a data-driven culture. Airbnb writes an exciting blog about SLA Tracker and how it took a data-driven approach to debugging the data pipeline and improving efficiency.
Lessons Learned from Running Apache Kafka at Scale at Pinterest
Pinterest writes about its lessons learned from running Apache Kafka at scale. Broker replacement, partition rebalancing, and cost control are the common challenges of running Kafka at scale, and the blog narrates how automation can help with these tasks. Pinterest Orion is an exciting project to watch.
42 Things You Can Stop Doing Once ZooKeeper Is Gone from Apache Kafka
Confluent writes about how removing the ZooKeeper dependency can improve the Kafka infrastructure in terms of performance, capacity planning, operations, and monitoring. The KIP-500 RFC on replacing ZooKeeper with a self-managed quorum is an exciting read.
Solving the data integration variety problem at scale, with Gobblin
The growing number of niche SaaS applications adds complexity to data ingestion into the data warehouse. LinkedIn writes about Apache Gobblin's unique approach to building data integration at scale. Instead of relying on per-source connectors, the multi-stage protocol and message format architecture seems an elegant solution to a complex problem.
Mitigating the effects of silent data corruption at scale
In a large-scale infrastructure, files are usually compressed when they are not being read and decompressed when a request arrives to read the file. What happens when the decompression fails? How often do such failures occur? Facebook writes an exciting blog about its paper on silent data corruption at scale.
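As a small illustration of why "silent" matters, the sketch below stores a digest of the original bytes next to the compressed payload and checks it again on read. This is a generic end-to-end check, not the technique from Facebook's paper; the `StoredBlob`, `store`, and `load` names are hypothetical, and only the standard `zlib` and `hashlib` modules are used.

```python
import hashlib
import zlib
from dataclasses import dataclass


class CorruptionError(Exception):
    """Raised when stored bytes fail to decompress or do not match their digest."""


@dataclass(frozen=True)
class StoredBlob:
    compressed: bytes
    sha256_hex: str  # digest of the ORIGINAL, uncompressed bytes


def store(data: bytes) -> StoredBlob:
    """Compress data and record a digest of the original content."""
    return StoredBlob(zlib.compress(data), hashlib.sha256(data).hexdigest())


def load(blob: StoredBlob) -> bytes:
    """Decompress and verify; a loud failure or a digest mismatch both raise."""
    try:
        data = zlib.decompress(blob.compressed)
    except zlib.error as exc:
        # The loud case: the stream itself is unreadable.
        raise CorruptionError(f"decompression failed: {exc}") from exc
    if hashlib.sha256(data).hexdigest() != blob.sha256_hex:
        # The silent case: decompression "succeeded" but the bytes are wrong.
        raise CorruptionError("digest mismatch after decompression")
    return data
```

A round trip such as `load(store(b"payload"))` returns the original bytes, while a blob whose payload decompresses cleanly to different content raises `CorruptionError`. The check only helps if the digest is computed before anything can corrupt the data, which is the hard part at scale.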
Scaling Reporting at Reddit
Reddit writes about its journey of scaling the reporting platform from Redis to Apache Druid. The blog discusses the broader limitations of using key-value storage to serve analytics, the overhead on application development, and operational issues with unknown bugs.
DataHub Project Updates (February 2021 Edition)
One of the challenges of adopting a modern data stack is that it is often confined to dashboarding and reporting use cases. It is refreshing to read that the recent LinkedIn DataHub release focuses on adopting GraphQL to ease integration with broader infrastructure components.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.