Welcome to the 12th edition of the data engineering newsletter. This week's edition covers Facebook's data discovery engine, a look back at Amundsen, Nielsen's AWS Lambda design, building AI assistants, and data engineering best practices from Facebook, Airbnb, LinkedIn, Lyft, 1mg, and Nielsen.
Facebook writes about its data discovery engine, Nemo. The article is an exciting read. It highlights the challenges of building a data discovery engine: knowing what phrase to search for, understanding the relationships among datasets, and ranking the datasets. I expect a more machine learning-driven approach in the data discovery space soon.
https://engineering.fb.com/data-infrastructure/nemo/
Continuing with data discovery, the Amundsen team looks back on its first year as an open-source project. The article highlights Amundsen's pluggable architecture, rich connectors, and fantastic community backing.
https://eng.lyft.com/amundsen-1-year-later-7b60bf28602
How can we ensure Data and Code Quality in Data Engineering? The article is an excellent checklist of the top 10 principles to follow, focusing on functional programming principles, documentation, naming conventions, and modularization.
LinkedIn's "People You May Know" is a classic example of building a data-driven network effect to scale the business. The article narrates how the system evolved to handle heterogeneous edges (connection, follow, and subscribe models).
Search is the core business function of Airbnb. The article narrates how Airbnb improved deep learning ranking for Airbnb stays. It focuses on eliminating bias, handling the cold start for new listings, and eliminating the bias of past preferences overwhelming the results.
The growth of Slack-like tools is driving bots and AI assistants to the mass market. The article narrates the five stages of AI assistants: sending simple notifications, answering simple FAQs, engaging in dialogs, offering a personalized experience, and connecting with other AI assistants. The case study on building and deploying an AI bot is an exciting read.
https://www.infoq.com/articles/build-deploy-ai-assistants/
1mg, an online platform that provides services for medical diagnostics, consultation, lab tests, and general healthcare, gives an overview of its data infrastructure. I learned about RudderStack, and it is exciting to see tools like it arriving in the data sourcing space.
Facebook writes about another exciting tool, CG/SQL, which allows developers to write stored procedures in a variant of Transact-SQL (T-SQL) and compile them into C code that uses SQLite's C API to perform the coded operations. It's an exciting approach to bringing type safety to large-scale procedures.
https://engineering.fb.com/open-source/cg-sql/
Nielsen and AWS published a good reference design for using AWS Lambda in a data pipeline. It's an impressive use of Lambdas as dispatchers and workers.
https://aws.amazon.com/blogs/architecture/nielsen-processing-55tb-of-data-per-day-with-aws-lambda/
How to Verify your ETL Result? The article is an excellent checklist of various quality checks, with a good working example that integrates with Airflow DAGs.
https://medium.com/@adiluhungdimas/how-to-verify-your-etl-result-aa4df57b6d9d
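To give a flavor of what such quality checks can look like, here is a minimal sketch in plain Python. It is not taken from the article: the function name `verify_etl_result`, the three checks chosen (row count, null keys, duplicate keys), and the in-memory list-of-dicts inputs are all assumptions made for illustration. In practice, a function like this could be called from a task in an Airflow DAG after the load step.

```python
from collections import Counter


def verify_etl_result(source_rows, target_rows, key):
    """Hypothetical helper: return a list of check failures (empty means all passed)."""
    failures = []

    # Check 1: the load should not drop or add rows.
    if len(source_rows) != len(target_rows):
        failures.append(
            f"row count mismatch: source={len(source_rows)} target={len(target_rows)}"
        )

    # Check 2: the key column must be populated in every target row.
    null_keys = sum(1 for row in target_rows if row.get(key) is None)
    if null_keys:
        failures.append(f"{null_keys} target row(s) have a null {key!r}")

    # Check 3: the key column must be unique in the target.
    counts = Counter(row[key] for row in target_rows if row.get(key) is not None)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        failures.append(f"duplicate {key!r} values: {duplicates}")

    return failures


# Made-up sample data: the target has a duplicated id.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}, {"id": 2}, {"id": 2}]
print(verify_etl_result(source, target, key="id"))
```

Returning a list of failures rather than raising on the first one lets a pipeline report every broken check in a single run; a caller that wants the task to fail can simply raise if the list is non-empty.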
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.