We use filters all the time when developing DAX expressions, such as measures, or when writing DAX queries.
But what exactly happens when we apply a filter?
This piece is about precisely that question.
I will start with simple queries and add variants to explore what happens under the hood.
I use DAX Studio and the option to show server timings for each query.
In case you want to learn more about this feature and how to interpret the results, read the first article in the References section at the end of this piece.
Let’s start with the base query:
EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
)

When we activate the server timings and execute the query, we get the execution statistics and the Storage Engine (SE) queries needed to retrieve the data:

As you can see, we need only one Storage Engine (SE) query to retrieve the results.
The query completes in only 47 ms and is served almost entirely by the SE (95.7%).
The larger the share of the work the SE takes on, the better, because the SE is the component that retrieves the data from the underlying tables.
Moreover, the SE can use multiple CPU cores, whereas the Formula Engine (FE) can use only one. We also cannot examine what happens inside the FE as easily as we can inspect SE queries.
You can learn more about the difference between these two engines in the article mentioned above.
A short note:
A few months ago, I wrote an article here with a very similar title. But while that one was only about date filters with Time Intelligence functions, this one goes one step deeper into the rabbit hole and is much more generic.
If you missed it, I added the article link and additional resources on the current topic to the References section below.
Add simple filters
Next, we add a simple filter for the product color red to the query:
EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
    ,'Product'[ColorName] = "Red"
)
Here is the query and the results restricted to the product color red:

When we look at the query statistics, we see this:

As you can see, the entire query is executed in a single SE query.
The filter is in the query’s WHERE clause. Therefore, only the restricted data is retrieved.
This is visible in the “Rows” column, as only 14 rows are returned from this query.
But what happens when we use the FILTER() function to filter the products?
EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
    ,FILTER('Product'
        ,'Product'[ColorName] = "Red")
)
As you might know, using the FILTER() function is not recommended due to how it works.
You can learn more about this topic in the second article linked in the References section below.
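As background, and as far as I understand the documented behavior: a column predicate like 'Product'[ColorName] = "Red" used as a filter argument is shorthand for a FILTER() over ALL() of that single column, not over the whole table:

,FILTER(ALL('Product'[ColorName])
    ,'Product'[ColorName] = "Red")

This is also why the shorthand and FILTER('Product', …) are not generally interchangeable: the shorthand ignores any existing filter on that column, while FILTER('Product') iterates the table under the current filter context.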
The result doesn’t change:

But how does it affect the execution plan and the SE queries?

As you can see, in this case, the SE optimizes the query, yielding the same execution plan as before.
But, as we change our code, we will see that using FILTER() isn’t always a good idea.
Add multiple filters
Now, what happens when we add multiple filters to a query?
EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
    ,'Product'[ColorName] = "Red"
    ,'Geography'[ContinentName] = "Europe"
)
While the result is not that interesting to us, let’s look at the query statistics:

Again, the query can be served by a single SE query that contains both filters.
The query executes so quickly that the FE's share of the time is relatively high, yet the entire query still takes only 6 ms.
When changing the query to use the FILTER() function, the SE query doesn’t change either:

This shows that, with this kind of query, the engine can optimize execution to find the most efficient way to fulfill the DAX query.
Anyway, the result doesn't change. It is identical in both cases, as it should be, because the filter itself stays the same. But please bear with me; I'll come back to the FILTER() function and why it's important to understand its effects in a moment.
Moving filters into measures
Next, let’s see what happens when the filter is moved into the measure.
Until now, the query was built so that the measure [Sum Online Sales] received its filter from outside.
Let’s try this:
DEFINE
    MEASURE 'All Measures'[Online Sales A. Datum] =
        CALCULATE(
            SUMX('Online Sales', ('Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount])
            ,'Product'[BrandName] = "A. Datum"
        )

EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales A. Datum", [Online Sales A. Datum]
    )
)
As you can see, the filter is applied inside the measure [Online Sales A. Datum].
Of course, the resulting number is the same in each row of the result, as the Brand is set as “A. Datum”:

But the execution is slightly different:

This time, we have two SE queries.
- The first query gets the sales for the brand "A. Datum". This query contains the filter for that brand.
- The second query gets the list of all brands in the result set.
The first query is most important to us, because it still shows the filter for the brand set within the measure.
This query can be fully served by the SE with a simple filter in a very efficient way.
But, in most cases, we want to add multiple measures to a query (or a visual in a report).
What happens when we add the [Sum Online Sales] measure to the query?
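For reference, the combined query follows the pattern of the earlier queries (the measure definition is taken from above):

DEFINE
    MEASURE 'All Measures'[Online Sales A. Datum] =
        CALCULATE(
            SUMX('Online Sales', ('Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount])
            ,'Product'[BrandName] = "A. Datum"
        )

EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
        ,"Online Sales A. Datum", [Online Sales A. Datum]
    )
)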
The result is not particularly important, as it shows one column with sales for each brand and another with sales for the filtered brand.
But the query statistics are interesting:

As you can see in the red-marked line in the SE query, the Brand filter is no longer present.
Because the engine recognizes that the filter in the measure is applied to the same column as the one in the query, it moves the filter to the FE and returns the result.
Now, what happens when we filter another column in the measure, for example, the color:
DEFINE
    MEASURE 'All Measures'[Online Sales Red] =
        CALCULATE(
            SUMX('Online Sales', ('Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount])
            ,'Product'[ColorName] = "Red"
        )

EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
        ,"Online Sales Red", [Online Sales Red]
    )
)
Again, the result is not particularly interesting. We are interested in the query statistics:

As you can see, this time we have two SE queries grouped by BrandName: one without and one with the filter for the color.
Both queries return the same number of rows (14) – one for each Brand.
The FE handles combining the two results into a single table.
The entire query is still served mainly by the SE, which is excellent.
But now, let's add the FILTER() function to the filter argument.
For this example, I change the measure to filter for two values with the IN operator:
,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
In this variant, the SE query is like the ones before.
The filter is passed directly into the query’s WHERE clause.
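Spelled out, the modified measure could look like this (the measure name is my assumption; only the filter line is taken from the text above):

MEASURE 'All Measures'[Online Sales Brands] =
    CALCULATE(
        SUMX('Online Sales', ('Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount])
        ,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
    )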
But what happens when I change it to this:
,FILTER('Product'
,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
)
First of all, the result changes:

The reason is that FILTER() works completely differently.
It retains the existing filter context and adds a new one.
I explained this behavior in another article that I added as the second link in the References section below.
Moreover, the SE cannot handle this in one query anymore:

The first two queries retrieve the values for the brand to filter (See the queries marked in pink).
Notice the large number of rows (324 and 2,560) returned by the first two queries. This is the materialization of intermediate results needed to perform the calculation.
The third query uses these intermediate results to filter the data (marked in red).
The result of the third query is only two rows—the two rows we see in the overall result.
As described in my other article, FILTER() must be used with care.
Not only is it considerably slower, but it also works completely differently from a simple filter.
Anyway, I can restore the previous behavior by adding an ALL() in the FILTER() call:
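Based on the measure above, that variant could look like this (note that ALL() is applied at the column level, which matches the documented expansion of a simple predicate filter: the existing filter on [BrandName] is replaced instead of intersected):

,FILTER(ALL('Product'[BrandName])
    ,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
)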

I don’t want to hide the fact that this example is special, as the filter applied affects the same column as used in the query.
When changing the query to filter the country, the engine can optimize the execution and use the simple form again:

As you can see, the engine optimizes the execution of the query and falls back to a simple filter when filtering columns that differ from those used in the DAX query. In the blue inset, you see the results.
I very often see this form of filtering in DAX measures written by less experienced developers.
Using the FILTER() function looks intuitive, but it can yield incorrect or confusing results and is slower than a simple filter. I strongly recommend reading my article linked below about this function, as well as the dax.guide documentation and the articles linked on SQLBI.com.
Additionally, FILTER() requires much more typing than a simple filter.
For a lazy guy like me, that alone is a good reason not to use FILTER() when it's unnecessary.
Add a complex filter
Finally, I want to show what happens when applying a filter using a DAX function, such as CONTAINSSTRING().
EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
    ,CONTAINSSTRING('Online Sales'[SalesOrderNumber], "202402252C")
)
Such a query is executed when you use a slicer in your report to filter for a specific order and retrieve the brands of the purchased products.
As the result is not important at this point, let’s directly look at the query statistics:

While the query took more than 6 seconds to complete, 99.6% of the time was spent by the FE executing the CONTAINSSTRING() function to find the matching rows. This operation is very CPU-intensive, and the FE can use only one core. When I execute this query on my laptop, it takes more than 2 seconds longer.
I deliberately chose a slow function to demonstrate its effects.
The SE was still able to serve the query with a single SE query. However, the positive effect of this is negligible in this case.
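For contrast, and purely as a sketch that is not part of the measurements above: if the slicer value were the complete SalesOrderNumber, a plain equality predicate could be pushed into the SE query's WHERE clause instead of being evaluated row by row in the FE:

EVALUATE
CALCULATETABLE(
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales", [Sum Online Sales]
    )
    ,'Online Sales'[SalesOrderNumber] = "202402252C"
)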
Conclusion
While it is not my intention to give you advice on what to do and what not to do, I wanted to show you the consequences of the different ways to write DAX code and apply filters in your measures or queries.
The DAX engine(s) are very efficient in optimizing the queries, but they have limitations.
Therefore, we must always take care when writing our DAX code.
If the performance is poor or the code written by someone else looks strange, we should analyze it to determine how to improve it.
I wanted to show you how to do it and what to look for when analyzing your DAX code.
Remember:
- The Storage Engine (SE) can use multiple CPU cores.
- The more work is done by the SE, the better.
- The SE can execute only simple aggregations and simple arithmetic (like +, −, ×, and /).
- Try to reduce the workload on the Formula Engine (FE).
- The FE can use only one CPU core.
- Try to reduce the materialization of data (the "Rows" column in the query statistics).
- Try to reduce the number of SE queries.
I know that requirements will sometimes force us to write DAX code that is not optimal.
Even worse, report designers might add logic to the report that causes poor performance.
In such cases, remove that logic and check the response time again. It might be worth creating a dedicated measure for such cases. Remember that it is possible to create local measures in a report that is connected to a semantic model via a live connection.
But most importantly: take your time when writing DAX code. You can spare yourself having to optimize, later and under pressure, DAX code that was written in a hurry. I speak from experience; it is not a pleasant situation.
I hope you learned something new.
References
To learn the details about how to interpret the results of the Server Timings in DAX Studio, read this piece:
Are you curious about how to use the FILTER() function correctly? Read this:
Another DAX function that can harm performance is KEEPFILTERS(). To learn more about the KEEPFILTERS() function, read this piece:
Here is the previously mentioned piece about date filters:
An interesting blog post by Data Mozart about the Storage engine:
Like in my previous articles, I use the Contoso sample dataset. You can download the ContosoRetailDW Dataset for free from Microsoft here.
The Contoso Data can be used freely under the MIT License, as described in this document. I changed the dataset to shift the data to contemporary dates.