
Newly announced at Coalesce, dbt Core and Fusion will allow users to create and manage user-defined functions (UDFs) within a standard dbt project. Gone are the days of creating macros and juggling pre-hook, post-hook, and on-run-start configurations.

However, just because you can doesn’t mean you should. Configuring UDFs in dbt represents a fundamental anti-pattern in the transformation layer of the data platform.

Let’s orient around two features: UDFs and macros. The two operate in distinct ways.

Macros are a long-standing feature of dbt. They serve to standardize repeatable blocks of logic and operations. Results from the data warehouse are not used or returned* by the macro and are not exposed to the underlying query, much like a stored procedure.

UDFs, on the other hand, are brand new to dbt. A UDF can receive input values from each row in a data set and use those values to produce a single result per row. Modern data warehouses (e.g. columnar databases) are not optimized to run a UDF across every row, so you will incur higher compute costs when you incorporate a UDF into your transformations.

*There are cases where you can incorporate the macro’s return value in subsequent operations. This does not (or should not) happen once per row; macros operate in a much more constrained manner.
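
To make the distinction concrete, here is a minimal sketch. The macro is standard dbt Jinja; the UDF is written in Snowflake-style SQL purely for illustration (the names and the exact DDL the new dbt feature would generate are assumptions, not taken from the announcement). The macro expands into the model’s compiled SQL at build time, while the UDF is a persisted warehouse object evaluated per row at query time.

```sql
-- macros/cents_to_dollars.sql (illustrative macro)
-- Expanded into the model's compiled SQL at build time; no object is
-- persisted in the warehouse.
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- A roughly equivalent UDF, shown in Snowflake-style SQL for illustration.
-- It is a persisted database object whose body is evaluated for every row
-- that calls it at query time.
create or replace function analytics.cents_to_dollars(amount_cents number)
returns number
as
$$
    round(amount_cents / 100.0, 2)
$$;
```

In a model, `{{ cents_to_dollars('amount_cents') }}` compiles away entirely, whereas `analytics.cents_to_dollars(amount_cents)` remains a runtime dependency on a shared warehouse object.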

Giving dbt users the ability to author and manage UDFs in the data warehouse is less than ideal, and we view it as an anti-pattern relative to what makes dbt great.

UDFs offer many of the same benefits as macros, with the addition of persisting a database object (more on this later). However, during the transformation process, are there really many, if any, cases where a UDF makes more sense than a macro or another model in the DAG?

UDFs are a shiny new object (in dbt at least), but they don’t provide capabilities that streamline or enhance the transformation process.

For transformation logic that needs to be standardized, a macro is easier to understand. If the purpose of the UDF is to enable and support other teams or data products, a new model with additional dimensions of information makes the results easier for data consumers (both human and machine) to interpret. And when a new model is persisted, data tests add a degree of confidence to the transformed data sets that is much harder to establish around a UDF.
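
As a sketch of that pattern (the model, column, and macro names here are hypothetical), the derived value lives in a model in the DAG and a schema test documents the guarantee, rather than hiding the logic inside a warehouse function:

```sql
-- models/orders_enriched.sql (hypothetical model)
-- The derived column is materialized as a node in the DAG instead of being
-- computed per row by a UDF at query time.
select
    order_id,
    amount_cents,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_orders') }}
```

```yaml
# models/schema.yml (illustrative test on the new model)
version: 2
models:
  - name: orders_enriched
    columns:
      - name: amount_usd
        tests:
          - not_null
```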

There are not many cases where a UDF makes more sense than existing functionality and patterns already available in dbt. And even in complex cases where a UDF unlocks a key transformation or data technique, subsequent nodes in the DAG make auditing the results easier, faster, and more explainable.

Where do your UDFs live today? In modern data warehouses, there are typically governed areas where shared objects are created: UDFs, stored procedures, views, and tables. The governance is key: these are business-critical objects for multiple teams and their processes.

dbt offers a lot of flexibility, and that flexibility comes at a cost. We’ve seen plenty of escalations where a schema change in a model caused a PagerDuty alert to fire. Exposures and contracts help, but these are people-driven processes, and people-driven processes can fall flat, especially when teams are asked to move fast.
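
For example, a model contract (a minimal sketch; the model and column names are hypothetical) catches a breaking schema change at build time rather than in a downstream pager alert, but someone still has to define and maintain it:

```yaml
# models/schema.yml (illustrative contract)
version: 2
models:
  - name: orders_enriched
    config:
      contract:
        enforced: true   # dbt fails the build if the model's columns drift from this spec
    columns:
      - name: order_id
        data_type: integer
      - name: amount_usd
        data_type: numeric
```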

Centralizing shared objects, notably business-critical objects, provides a level of oversight and control to all teams directly touching the data warehouse. It also ensures proper visibility can be maintained, preventing breaking changes from entering production.

Put another way - would you rather track a breaking change to a UDF through many different, interconnected yet fragmented dbt projects, or in a single repository of all data warehouse infrastructure?

UDFs offer new functionality, but introduce a core anti-pattern to dbt projects.

Our point of view is that the greatest benefit of dbt is providing a holistic, shared framework that functions best when it delivers shared understanding from shared process. UDFs violate these tenets and tip the balance between discoverability, maintainability, and speed in the wrong direction.

Many dbt projects operate on a tightrope trying to balance all three points mentioned above. Great dbt projects maintain this balance by knowing when and where to leverage the framework throughout the data platform. Together, this leverage and balance unlock value and increase confidence in the resulting data model.

Answering the “who” (governance) and the “why” is the first step in maintaining this balance and effectively leveraging dbt as a framework. These answers also create a strong foundation to build on.

Just because you can doesn’t mean you should. Maybe a better starting point is with the question: what are you trying to achieve, and is what you’re doing the best approach?

Published:
  • Data Strategy and Governance
  • Data and Analytics Engineering
  • Data Stack Implementation
  • dbt
