Welcome to the 15th edition of the data engineering newsletter. This week's edition covers how to structure a data organization, Bulldozer from Netflix, Lime's data catalog, why an autonomous race car crashed, whether it is time for decision scientists, a recap of RecSys, and Spotify's experimentation framework.
How should a company structure its data team? It is a question every growing company faces. The article is an excellent summary of the various models in which a data team can operate. It compares the centralized model, embedded model, full-stack model, pods, and chapter model.
https://medium.com/snaptravel/how-should-our-company-structure-our-data-team-e71f6846024d
Data-driven applications feed the learnings from the data infrastructure back into business process applications. Data warehouse infrastructure is tuned for processing large volumes of data rather than serving latency-sensitive applications. Serving such applications requires moving data from the data warehouse to a global, low-latency, and highly reliable key-value store. Netflix writes about Bulldozer, its self-serve data platform that moves data efficiently from data warehouse tables to key-value stores in batches.
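To make the batch-loading idea concrete, here is a minimal Python sketch of the general pattern. It is not Netflix's implementation: the `InMemoryKeyValueStore` class, the `load_table` function, and the sample rows are all hypothetical stand-ins, and a real system would also deal with retries, schema mapping, and versioning.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class InMemoryKeyValueStore:
    """Hypothetical stand-in for a low-latency key-value store."""

    def __init__(self) -> None:
        self.data: Dict[str, Row] = {}
        self.batch_writes = 0

    def put_many(self, items: List[Tuple[str, Row]]) -> None:
        # One call per batch, mimicking a bulk write to the store.
        self.data.update(items)
        self.batch_writes += 1


def load_table(rows: Iterable[Row], store: InMemoryKeyValueStore,
               key_column: str, batch_size: int = 2) -> int:
    """Copy warehouse rows into the store in batches, keyed by `key_column`."""
    written = 0
    for batch in batched(rows, batch_size):
        store.put_many([(str(row[key_column]), row) for row in batch])
        written += len(batch)
    return written


# Assumed sample data standing in for a warehouse table.
warehouse_rows = [
    {"user_id": 1, "score": 0.9},
    {"user_id": 2, "score": 0.4},
    {"user_id": 3, "score": 0.7},
]
store = InMemoryKeyValueStore()
written = load_table(warehouse_rows, store, key_column="user_id", batch_size=2)
```

Writing in batches keeps the number of round trips to the store small, which is the main reason to prefer it over row-by-row writes when moving a whole table.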
Lime writes about how and why it built its data catalog service, comparing the buy and build approaches. Every data infrastructure is unique, and legacy components make it difficult to adopt a standard data catalog service; the article demonstrates exactly that.
https://medium.com/lime-eng/why-and-how-we-built-a-data-catalog-at-lime-6cb79419b7e2
Data processing tooling is growing in two separate ecosystems: data engineering relies on JVM frameworks, whereas ML engineering relies on Python frameworks. Cylon is an exciting tool that tries to unify the two ecosystems, with APIs in Java and Python on top of a core implementation in C++. The benchmark claims it is 12x faster than Apache Spark. Apache Spark, Flink, and Beam are each trying to achieve a unified interface one way or another, and this space continues to be exciting to watch.
https://supun-kamburugamuve.medium.com/cylon-library-for-fast-scalable-data-engineering-bf74742fe5d1
“Here's Why That Autonomous Race Car Crashed Straight Into a Wall” is a good reminder of how critical data quality is when building data-driven applications. Low data quality is a debt to a system. Even worse, it can cost lives.
https://www.thedrive.com/news/37366/why-that-autonomous-race-car-crashed-straight-into-a-wall
Is it time for the data scientist to become a decision scientist? GoJek writes about how it thinks about data science and how it maps functions and skills across the analyst, data scientist, and decision scientist roles.
https://blog.gojekengineering.com/decision-scientists-at-gojek-the-who-what-why-960d7d27b0d0
Spotify writes about its experimentation platform journey. The article is an excellent read that walks through the common challenges in developing an experimentation platform, such as reducing the time to process metrics, reducing data volume, improving custom metric generation, and automating experiment distribution.
https://engineering.atspotify.com/2020/10/29/spotifys-new-experimentation-platform-part-1/
Etsy writes about its personalized search infrastructure. The challenges and considerations it describes are a good reminder of what personalized recommendations involve: latency, establishing context before serving personalization, cold start, and privacy concerns.
https://codeascraft.com/2020/10/29/bringing-personalized-search-to-etsy/
Mattermost, an open-source messaging and collaboration platform, writes about its data stack. Apache Airflow, Looker, Snowflake, and dbt are slowly becoming the standard tech stack for business analytics.
Criteo writes a great summary of RecSys 2020. It is a good read in case you missed this year’s RecSys conference.
https://medium.com/criteo-labs/highlights-of-recsys-2020-2a07690e0d8c
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.