Thoughts on the new Modern Data Stack
- Erik Mathiesen-Dreyfus
The Modern Data Stack: a chaotic mess of applications
The Data Stack is changing
The data analytics space is changing – in particular, our little subspace known as The Modern Data Stack.
After rapid expansion and a huge influx of VC-backed funding over the last five or so years, we can now easily count hundreds of companies in the space, split into at least two dozen different categories. Many of these categories are new, and hence their potential is unknown – does it address a real need? How big is that need? Is it big enough to build a company around? Is it big enough to build a VC-backed company around? Is it big enough to sustain multiple VC-backed companies?
The answers to some of these questions are now starting to emerge – some categories are starting to look more like features within other categories, and others, although big, may only really support one or two companies in the longer term.
The stack became too complex
At the same time, after an explosion of tools and vendors, users – the actual users – are starting to want consolidation. No one wants to maintain ten separate data tools, each with its own configuration and workflows.
The idea of the "Modern Data Stack" (at least IMO) was to remove the need to build data stacks in-house: to enable companies to easily spin up their data infrastructure and get stuck into the real point of having a data stack – using the data to better understand and optimise your business. But we now have a situation where setting up and maintaining your data stack has become a job in itself, due to the number and complexity of tools and technologies required. It feels like we progressed a lot and then regressed a bit.
As a consequence, we now have new categories of tools for orchestrating and monitoring the data stack, to remove the headache of running data infrastructure. This makes sense for production systems (DevOps), which require scalability, low latency, high uptime and so on – but does it make sense for internal data systems, which often run in batch mode and are built exclusively from third-party systems? If only these third-party systems were better integrated, we would be able to catch, diagnose and monitor data issues natively within the stack, and we would be able to set them up and maintain them without needing another set of tools for orchestration and monitoring. If only things were a bit more consolidated...
I don't think the current situation is sustainable – but I don't think that is a bad thing, at least not if we take the view that we are still in the development phase of the next generation of data infrastructure and are currently at an unstable mid-point, waiting to see which direction we will move in next. What we currently refer to as the Modern Data Stack will end up having been a step towards the actual Modern Data Stack – a set of tools, technologies and best practices that will define data infrastructure for years to come.
Is the current Data Stack a saddle point on the way to finding a
It looks to me like the next step might be consolidation – both in terms of companies and in terms of technologies. On the technology side, potential consolidations include ETL and reverse-ETL merging, the modelling layer moving inside the data layer, data warehouses natively supporting ETL ingestion and transformations, and so on.
On the company side, I think that after this phase of expansion and proliferation, the large players will once again eat up the challengers and consolidate their positions. After a brief stint where it looked like they would be challenged, Google, AWS and Microsoft will probably (as always) come out on top, but through the process a few newer companies will have been added to the list of established players. In particular, Snowflake and FiveTran will be considered established players, as will potentially DataBricks and DBT (if they start acquiring others).
However, whereas Google, AWS and MS will probably all, over time, offer full end-to-end data stack solutions completely within their eco-systems, the new up-and-coming companies typically only address one part of the stack. To compete, they will probably need to offer something more integrated and consolidated – the question is how they will achieve this.
The new Modern Data Stack
Some recent developments point to how the newer players could achieve this: by focusing on integrating with each other, rather than with GCP/AWS/MS, and as a result offering a superior experience when using a particular set of components in the stack. This would be an opinionated but well-integrated alternative to the established GCP/AWS/MS data stacks.
A few recent developments pointing towards this:
- FiveTran announced that they are dropping support for SQL transforms and going forward will rely completely on DBT for data transformations.
- Several companies are now being built exclusively around DBT – LightDash is a great example of this.
- DBT and Snowflake now both support Python models out of the box in a way that makes it incredibly easy to run DBT models on Snowflake.
- Snowflake and Salesforce are teaming up to provide native data integration between the two. I wonder whether other source systems will follow in Salesforce's footsteps, and how that will affect ETL and reverse-ETL vendors.
- Hightouch is integrating more and more closely with Snowflake, as evidenced by the announcements at the Snowflake Summit 2022.
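To make the DBT-on-Snowflake point above a little more concrete: a DBT Python model is just a function `model(dbt, session)` that returns a DataFrame. Here is a minimal sketch – the model name `stg_orders` and the `status` column are illustrative assumptions, not from any real project:

```python
# Minimal sketch of a DBT Python model (would live in e.g. models/completed_orders.py).
# DBT calls model(dbt, session): `dbt` exposes config() and ref(), and `session`
# is the warehouse session (a Snowpark session when running on Snowflake).
def model(dbt, session):
    dbt.config(materialized="table")      # materialise the result as a table
    orders = dbt.ref("stg_orders")        # upstream DBT model as a DataFrame
    # Plain DataFrame-style filtering; DBT writes the returned frame back
    # into the warehouse as the model's output.
    return orders[orders["status"] == "completed"]
```

The filtering above is written pandas-style for illustration; on Snowflake, `dbt.ref` returns a Snowpark DataFrame with a broadly similar API, and DBT takes care of materialising whatever DataFrame the function returns.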
The tighter integrations and the ubiquitous use of DBT will enable a much more coherent and tightly coupled data stack, where DBT will play the role of the modelling and semantic layer across the stack. Whereas we have become used to viewing each layer as a separate entity, with clear separations of concern, each with its own data model, we can now start to think of the stack as a more integrated whole. This should enable better testing and monitoring across the stack.
With everything relying on DBT projects and no transformations performed outside of DBT, lineage and observability will also become "easy", and better-integrated testing will become possible. You will be able to test your data modelling from the ETL layer, through the data layer and the modelling layer, all the way up to the visualisation layer. This could potentially make quality and observability platforms one of those categories that ends up becoming a feature of DBT and the modelling layer.
You could then imagine the "New Modern Data Stack" looking like this:
Although we don't currently have an equivalent full stack within either the GCP, AWS or MS eco-systems, I think this is only a matter of time (GCP will probably be the first to have an equivalent stack) and we will then be in a situation where companies will be able to choose between 3 or 4 equivalently well-integrated, fully featured, powerful data stacks: GCP, AWS, MS or "new MDS" 🤗