Data Assets versus Data Exploration
- 4 months ago
- Erik Mathiesen-Dreyfus
- 6 min read
This post is partly inspired by a tweet by @pdrmnvd (quoted at the bottom of this post) and the subsequent discussion about Mode's new Datasets feature. It is also something that has been bubbling around in the back of my head for a while, so I wanted to write some words 🧑‍💻
TL;DR of the tweet, the discussion and what I think Mode's Datasets are solving. With the spread of dbt, we have become accustomed to a certain style of development when building new datasets: develop the dbt model > build tests for the new model > create a PR in the repo > wait for the tests to run > wait for the PR to be approved > merge the PR > trigger a dbt run or wait for the scheduler > then share the data with your stakeholder > build a dashboard, perform the analysis or whatever your aim is.
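For concreteness, here is a minimal sketch of what such a dbt model might look like – the model, source and column names are hypothetical, not taken from any real project:

```sql
-- models/marts/fct_orders.sql
-- A minimal illustrative dbt model: it selects from an upstream staging
-- model via {{ ref() }} so dbt can wire up the dependency graph.
select
    order_id,
    customer_id,
    order_date,
    amount_usd
from {{ ref('stg_orders') }}
where order_date is not null
```

Every step in the chain above exists to protect this one file once it becomes a shared asset.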
But sometimes this isn't the right setup. What if you want to create a dataset that will only be used a few times, and that isn't of much use to others without context? And what if you want to share this data with someone, or use it to build an ad-hoc dashboard for a single piece of analysis?
In the current setup, you might put the dataset creation (the SQL mangling) inside your dashboard, or run a one-off query in your SQL IDE and export it as a CSV – but then what if you want to reuse it and/or share it? You probably copy-paste the SQL magic into the next dashboard, paste it into Slack and so on – and that isn't great.
This is (I think) where Mode wants you to use Datasets – similar, in a way, to where you would use LookML if you were a Looker user.
More generally, what we are solving for here is the case of ad-hoc analysis versus creating new data. When you are performing analysis you care about the insight, but less so about how you got there – when you create data the output is the dataset itself, so the process is key.
In other words – words that I like more 😀 – it is data exploration versus asset creation. When you are performing ad-hoc analysis you are exploring the data; when you are building dbt models, or similar, you are creating assets – things that (in theory) will live forever in the data layer and be used by others downstream.
Therefore, when you build a dbt model, you should aim for a slower process, à la software engineering, of testing, reviewing, versioning and the like. When it comes to asset creation we want to run a tight ship, make sure the outputs are kosher and keep the processes strict and scalable.
To me, the test when building a dbt model should be to ask yourself: would I be happy to have someone else use this in its current state? Because once you put something in a pipeline, and from there into the data layer, you should expect anyone to pick it up and start using it for whatever mad Frankenstein application downstream you never even imagined. So it needs to be versioned, well-tested, reviewed, maintained and everything else you would expect of a critical piece of software.
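As a sketch of what "well-tested" can mean in dbt terms: besides generic tests (unique, not_null) declared in YAML, you can add a singular test – a SQL file under tests/ that returns rows whenever something is wrong. The model and column names below are the hypothetical ones from the earlier sketch:

```sql
-- tests/assert_no_negative_amounts.sql
-- Run by `dbt test`; any rows returned are counted as failures.
select
    order_id,
    amount_usd
from {{ ref('fct_orders') }}
where amount_usd < 0
```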
This split between exploration and assets also extends to dashboards (and data apps) themselves. Some dashboards are assets – they are widely shared and used, and a critical part of the business – others are only used for a specific piece of analysis and as a means to share that analysis with a small group of stakeholders.
Staying with the dichotomy of Assets versus Exploration: up until a few years ago most things were treated as exploratory analysis. Most semantic modelling was done in the frontend tool using tool-specific DSLs, like LookML in Looker, or straight in the dashboard, or with a hotchpotch of Python scripts, fickle DAGs and cumbersome orchestration tools.
With better tooling for defining and managing pipelines and assets, in particular with the introduction of dbt, things swung the other way and we found ourselves doing everything by the book. Everything was now an asset – if there wasn't a dbt model for it then it wasn't legit.
The dogma was that "we are now software engineers and we must do things like software engineers do". The Analyst title was out and we called it Analytics Engineering – but does that analogy really hold up? Isn't analytics fundamentally different? Yes, we need to build assets for downstream and repeated use, but a lot of what we do is one-off work that will never be used again. I spent a large part of my career in Excel and never felt the need to follow engineering practices to get the job done.
In many ways it felt like the pendulum swung too far in one direction. We built solutions and best practices for the asset-creation part of the data world but neglected the exploration and ad-hoc analysis part.
It seems there is now a general vibe in the data ecosystem that exploratory data analysis hasn't been solved and new solutions are needed. Lots of new products, including us at Infer 😆, have sprung up to help people perform better, easier-to-share exploratory data analysis. Off the top of my head I can think of ten (and I have surely missed quite a few): Rill, PopSQL, Count, Canvas, HyperQuery, DataDistillr, Equals, Evidence, Whywhywhy, Glean, as well as more established products expanding in that direction, like Mode, Hex, Sisu, Preset et al.
I think this vibe is correct (of course I would, wouldn't I). Performing research (in the traditional sense) or ad-hoc analysis is still too cumbersome in what we call the Modern Data Stack – we just don't have the right tools yet. However, I don't think these tools should be the same ones we use for building assets. Some tools are good for building data apps and dashboards-as-assets but not for doing analysis, and that is okay – by keeping the two use cases separate we can build great tools and processes for both. We are solving for two distinct problems, and having better, specialised tooling for each makes us all happier.
My overall point here is that we aren't necessarily meant to be like software engineers – engineers tend not to write single-use applications! Instead we are part engineers, part scientists: engineers when we build assets, scientists when we do analysis – each with their own ways of working – and I think it would benefit us if we started thinking that way 🤗
Anyone catch Mode’s big launch yesterday? New feature called Datasets. Trying to understand the value of creating datasets in Mode [vs.] say a dbt model. I don’t love having business logic living in multiple tools but maybe I’m missing something? https://t.co/vYkBu3MGaa
— pedram.yml (@pdrmnvd) December 7, 2022
This is definitely an asset, of some kind, and