Thoughts on Snowflake, dbt and Python
- a year ago
- Erik Mathiesen-Dreyfus
- 6 min read
Very exciting things are happening at the intersection of Snowflake, dbt and Python. The newly introduced Python support in both Snowflake (Snowpark for Python) and dbt (version 1.3's Python models) opens up a ton of new possibilities. We can now write dbt models in Python and execute them within the data layer using Snowpark.
I wanted to share a few early thoughts.
As someone who has spent a lot of time building data and ML infrastructure in Python, I see value in moving more complex transforms and models to Python instead of keeping them in SQL and dbt.
This has obvious advantages: the clarity and conciseness of writing complex transforms in Python rather than SQL, better and easier testing, better reusability, composability and modularity, access to a huge ecosystem of data and analytics packages, etc. In many ways, some of what dbt did for SQL, i.e. bringing core engineering practices to data, will now come for free with Python UDFs.
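To make that concrete, here is a rough sketch of what registering a Python UDF through Snowpark could look like. The `clean_email` function, the table it is used on and the connection settings are all made up for illustration – the point is that the transform is an ordinary Python function you can unit test, package and reuse before it ever touches the warehouse.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import StringType

def clean_email(raw: str) -> str:
    # An ordinary Python function: easy to unit test and reuse outside the warehouse
    return raw.strip().lower() if raw else None

# Hypothetical connection settings – fill in with your own account details
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Register the function as a Python UDF so Snowflake runs it next to the data;
# once registered it can be called from SQL, e.g. SELECT clean_email(email) FROM users
session.udf.register(
    func=clean_email,
    name="clean_email",
    input_types=[StringType()],
    return_type=StringType(),
    replace=True,
)
```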
Of course, implementing and orchestrating data transforms in Python isn't new – it has been happening for a while, especially within the MLOps stack but also in the data space. In the past, I have used a variety of tools and platforms to run data pipelines defined in Python – Airflow, Dagster, Dask, AWS Lambda, PySpark, etc. However, I never felt these tools did a great job of bridging the gap between Python transforms and SQL transforms – that is why I am excited about dbt + Snowflake's Python integration: the ability to better mix SQL and Python models and to execute them natively within the data layer. It feels like it could be a better, more coherent solution.
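For illustration, a dbt Python model on Snowflake could look roughly like the sketch below. The model and column names (stg_orders, customer_id and so on) are made up, but the shape follows dbt 1.3's Python model interface, where dbt.ref() hands you a Snowpark DataFrame and whatever DataFrame you return is materialised in the warehouse.

```python
# models/customer_order_stats.py – a hypothetical Python model
import snowflake.snowpark.functions as F

def model(dbt, session):
    # Configured like any other dbt model
    dbt.config(materialized="table")

    # dbt.ref() returns a Snowpark DataFrame on Snowflake, and the upstream
    # model it points at can just as well be a plain SQL model
    orders = dbt.ref("stg_orders")

    # The transform itself is ordinary, lazily-evaluated DataFrame code,
    # executed inside the warehouse via Snowpark
    return (
        orders.group_by("customer_id")
        .agg(
            F.sum("amount").alias("lifetime_value"),
            F.count("order_id").alias("order_count"),
        )
    )
```

Downstream SQL models can then ref this model like any other table, which is exactly the mixing of SQL and Python I find appealing.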
Once it all settles down, I imagine a common understanding will emerge around what jobs and transforms will be viewed as "should be done in SQL" versus "should be done in Python" and I am curious about what that will be.
It does seem that building transforms in Python will be much less dynamic than in SQL because of how they are executed – as with dbt models, it is about building assets: UDFs, models and tables. Not so much about doing exploratory analysis in the many SQL-based BI tools we have available.
That would point to larger, intentional and reusable components being built in Python and smaller, ad-hoc exploratory work, e.g. one-off analysis, being done in SQL. Similar to the division we see in MLOps between assets generated by pipelines and ad-hoc outputs generated by notebooks.
This is, of course, assuming that performance issues will be ironed out and the performance of Python UDFs will be comparable to their SQL equivalents – that might not be the case, and Python UDFs might then end up being a sort of middle-of-the-road thing: for transforms that are complex but aren't time-critical or run on very large data sets.
I also wonder whether this will mean more SWEs moving into data, and a shift in who owns what in the data stack.
Where dbt models have typically been owned by data teams, will their Pythonic equivalents instead be owned by engineering teams? And, if so, would this solve most of the frustrations at the interface between data and engineering – pipelines maintained by data teams breaking because engineers change data representations upstream without consulting the data team?
Or will data teams have to upskill in Python and learn how to write performant Python code? How will this change the profile of people working in data teams – less focus on business understanding and more on software skills?
Similarly, this feels like it could be another step towards unifying the ML and data stacks. We have seen this trend come up a lot recently, with various data stack products talking about bridging that gap.
I personally don't think we should merge the two – the use cases seem too different to me. ML teams are typically focused on product features, i.e. real-time inference with highly customised modelling. The development cycles are very different, and once a model is deployed it almost becomes a static asset. Rightly or wrongly, we are used to the data stack being much more fluid – we add and remove things, we change the meaning of things and so on, on an almost daily basis.
Merging the two feels to me like the recipe for a bloated data layer, unstable ML product features and rigid internal data analytics.
Python is notoriously poor at package and environment management. Many, many solutions have come and gone (currently I prefer poetry). Getting a consistent development environment that lets you develop pipelines locally and then deploy them, without too much hassle, isn't easy. I can only see this becoming an even bigger problem as we broaden the use of Python in the data stack.
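As a small illustration of why this gets fiddly: as far as I can tell, with Snowpark-backed Python models the warehouse-side packages are declared per model and resolved from Snowflake's Anaconda channel, entirely separately from whatever poetry (or pip, or conda) manages locally. A hypothetical model config might look like this:

```python
# A hypothetical dbt Python model declaring its warehouse-side dependencies
def model(dbt, session):
    dbt.config(
        materialized="table",
        # Resolved from Snowflake's Anaconda channel at run time; this list has to
        # match what the channel actually offers, not what is in your local lockfile
        packages=["pandas", "scikit-learn"],
    )

    # Pull an upstream model into pandas for the kind of work SQL is poor at
    df = dbt.ref("stg_features").to_pandas()
    # ... feature engineering or scoring would go here ...
    return df
```

Keeping that list, your local virtualenv and your CI image in sync is exactly the kind of hassle I mean.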
Finally, I wonder how, or whether, this new Python-native data layer will influence the trend towards an event-based/streaming data stack. There has been a lot of chatter about this recently, with various solutions coming out for building a streaming-based data stack. It feels like the more we move away from SQL, the more feasible this becomes.
In a Python-based data stack we can much more readily imagine real-time data pipelines and flows that support streaming as a first-class concept.
Referring back to the point above, about the ML and data stacks potentially merging, I also don't think a streaming-native data stack is a good idea :-) Do we really have that much need for real-time, streamed insights in the world of data analytics that we can't mimic them with higher-frequency batch methods?
"Is that a Python job running inside Snowflake?" - Wargames was ahead of