
Very excited to see this discussion happening across a wide breadth of the data community, with everyone bringing their respective lenses. The surface area and complexity of the problem most certainly thwart any single data persona from grasping it in its entirety. I share some of my own perspectives here as an attempt to shine more light on the aspects I have borne witness to.

To mono-DAG or not to mono-DAG?

Managing dependencies in even moderately complex software environments is STILL mostly an unsolved problem. And as we have seen, the data space tends to lag the software space on these fronts. I would argue that data dependencies have additional complexities to overcome. Even if we had Maven-level dataset lineage and declarative dependencies and outputs (think a pom.xml for your DAG nodes), we would still be woefully ill-equipped. I think the mono-DAG question is similar to deciding whether or not to mono-repo; there will be zealots on both sides, decisions will be highly context dependent, and there will be right and wrong ways to do both.
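
To make the "pom.xml for your DAG nodes" idea concrete, here is a minimal sketch in Python of what declarative dataset dependencies might look like. All of the names (NodeSpec, raw.orders, analytics.daily_orders, depends_on) are hypothetical illustrations, not any real orchestrator's API; the point is only that node-to-node dependencies fall out of declared inputs and outputs rather than explicit wiring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Declarative manifest for one DAG node: the datasets it reads and writes."""
    name: str
    inputs: tuple[str, ...] = ()   # upstream datasets, e.g. "raw.orders"
    outputs: tuple[str, ...] = ()  # datasets this node materializes


daily_orders = NodeSpec(
    name="build_daily_orders",
    inputs=("raw.orders", "raw.customers"),
    outputs=("analytics.daily_orders",),
)


def depends_on(consumer: NodeSpec, producer: NodeSpec) -> bool:
    # A node depends on whichever nodes produce its declared inputs,
    # analogous to how Maven resolves artifacts from a pom.xml.
    return any(ds in producer.outputs for ds in consumer.inputs)
```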

Task + Model

I am _mostly_ in agreement with Maxime's call for Functional Data Engineering (https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a). In this world, tasks (i.e. DAG nodes) are idempotent and the model (i.e. data tables) is a type system for describing inputs and outputs. These analogies are far from perfect, and we need to modify and extend them in a number of ways for data paradigms, but they are a reasonable starting point. While most functional programming constructs treat functions themselves as first-class citizens that can be provided as inputs to or be output from other functions, I have generally found using tasks as dependencies of other tasks to be an anti-pattern. This is related to the "upstream task is reliable" fallacy.
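
A hedged sketch of that functional style, assuming a pandas-based task and made-up table and column names: the task is a pure function of its declared input partitions, so re-running it for the same partition deterministically reproduces (and overwrites) the same output. The dependency is on the data, a specific partition of a table, not on another task.

```python
import pandas as pd


def build_daily_revenue(orders: pd.DataFrame, ds: str) -> pd.DataFrame:
    """Deterministically derive one partition of daily_revenue from orders."""
    day = orders[orders["order_date"] == ds]
    out = (
        day.groupby("customer_id", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )
    out["ds"] = ds  # the partition key travels with the data
    return out


# Running this twice for the same `ds` yields the same frame, which the
# loader then uses to replace (not append to) the target partition --
# the INSERT OVERWRITE pattern from Maxime's post.
```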

Task Scope + Data Quality

Strongly believe that data testing & observability belong as part of every task (just as unit testing and instrumentation are part of any micro-service). Also strongly believe that tasks should do continuous testing (think CI/CD for your data sets) - good example from Sam Redai here (https://tabular.io/blog/integrated-audits/). On the plus side, data is generally designed to be single-producer, multiple-consumer, which makes side-by-side comparisons for data observability straightforward (I expect to see this capability baked in and automated in orchestration tools if it is not already).
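
A rough sketch of the write-audit-publish idea behind those integrated audits, in plain Python/pandas rather than the Iceberg API: stage the output, run the data tests against the staged copy, and only promote it to the published location if every audit passes. The audit rules and the staging path scheme here are hypothetical.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> list[str]:
    """Return failed-audit messages; an empty list means all checks passed."""
    failures = []
    if df.empty:
        failures.append("output is empty")
    if df["revenue"].lt(0).any():
        failures.append("negative revenue values found")
    if df["customer_id"].isna().any():
        failures.append("null customer_id values found")
    return failures


def write_audit_publish(df: pd.DataFrame, target: str) -> None:
    staged = f"{target}.staging.parquet"        # hypothetical staging path
    df.to_parquet(staged)                       # write
    failures = audit(pd.read_parquet(staged))   # audit the staged data
    if failures:
        raise ValueError(f"audits failed, not publishing: {failures}")
    pd.read_parquet(staged).to_parquet(f"{target}.parquet")  # publish
```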
