Skip to content
Jessica Hill By Jessica Hill Software Engineer I
Encoding categorical data for Power BI: Using label encoded data vs one-hot encoded data in Power BI

Whilst one-hot encoding has its advantages, specifically in the context of machine learning, it may not be the most suitable technique for encoding categorical data for analysis in Power BI.

This blog post will dive deeper into the implications of using both label encoded data and one-hot encoded data in Power BI, bringing to light why label encoded data is the preferred format for analysis of categorical data in Power BI.

One-hot encoded data in Power BI

One-hot encoding creates a separate column for each category, treating the categories as distinct entities. While this may work in certain contexts, it can complicate downstream analysis in Power BI, as it often results in unnecessarily complex, poorly performing DAX measures and increased semantic model size.

To illustrate the above, let's take our survey questions introduced in the previous blog post in this series: "Please select your preferred mode of transportation", with a choice of response options: 'Car', 'Bus' and 'Train', and another question which asks respondents "Please select your favourite colour", with a choice of response options: 'Red', 'Blue' and 'Green'. The one-hot encoded response data would look as follows:

respondent_id transportation_car transportation_bus transportation_train colour_red colour_blue colour_green
R001 1 0 0 0 0 1
R002 0 1 0 1 0 0
R003 0 0 1 0 0 1

The impact of one-hot encoded data on DAX measures

In Power BI, if we want to calculate the percentage of respondents who chose 'Car' as their preferred mode of transportation, the DAX measure must calculate the number of respondents with a value of 1 in the transportation_car column over the total number of respondents with a value of 1 in any of the columns for that survey question: transportation_car, transportation_bus and transportation_train. This total represents all respondents who answered the survey question. The resulting DAX measure would look as follows:

% Car Responses = 
DIVIDE(
    COUNTROWS(
        FILTER(
            'Survey Responses',
            'Survey Responses'[transportation_car] = 1
        )
    ),
    COUNTROWS(
        FILTER(
            'Survey Responses',
            'Survey Responses'[transportation_car] = 1 || 'Survey Responses'[transportation_bus] = 1 || 'Survey Responses'[transportation_train] = 1
        )
    ),
    0
)

One-hot encoding categorical data can complicate DAX calculations because the measure must explicitly include all relevant columns to compute the denominator, making the DAX measures harder to write, maintain and test. Taking our transportation survey question, we can see that the denominator needs to contain all of the columns that compose this singular categorical question. This problem is particularly emphasised if a dataset has a large number of categories. For example, if our transportation survey question had 30+ response options, this would result in a very lengthy DAX measure that is both difficult to write and manage.

One-hot encoding categorical data can also impact DAX query performance. Since each category is stored as a separate column, DAX measures over one-hot encoded data take longer to execute. This is because they require scanning more columns, performing multiple Boolean operations, and are over a larger, less efficient data model. For example, in our % Car Responses DAX measure, the engine must scan all of the columns in the denominator to check which contains a 1; increasing query execution time.

DAX measures over one-hot encoded data also lack flexibility due to their dependence on the current structure of the dataset. Adding a new category to the dataset would require updating the denominator of the DAX measure to include the new column. For example, if we added an extra response option of 'Cycle' to our transportation survey question, this must be accounted for in the denominator of our % Car Responses DAX measure, as well as any other measures that consider the total number of responses to this survey question. This add extra maintenance effort and increases the risk of introducing errors to the DAX calculations.

The rigidity of the DAX measures over one-hot encoded data also introduces problems in terms of scalability. For datasets with multiple categorical questions, each requiring similar calculations, the number of lengthy and rigid DAX measures grows rapidly, making the semantic model much harder to manage.

The impact of one-hot encoded data on the semantic model

One-hot encoding categorical data also impacts semantic model size and memory usage in Power BI. As each category is treated as its own separate entity with its own column, one-hot encoding significantly increases the number of columns in the dataset. This can bloat the size of the semantic model, increasing memory usage and can potentially lead to performance issues. VertiPaq (Power BI's compression engine) compresses fewer large columns much more efficiently than many sparse columns with mostly zeros. One-hot encoding creates many sparse columns, and therefore reduces compression efficiency which makes queries slower.

Power BI Weekly is a collation of the week's top news and articles from the Power BI ecosystem, all presented to you in one, handy newsletter!

One-hot encoding also causes difficulties in terms of defining relationships between tables in the semantic model. One-hot encoding fragments categorical data into multiple columns, which breaks the relational and aggregation-friendly nature of Power BI's relational model.

Label encoded data in Power BI

Label encoding is the preferred technique for encoding categorical data for analysis in Power BI. It simplifies the representation of the categorical data making it much easier to write DAX measures over, improving their performance and decreasing the size of the semantic model.

To illustrate the above, let's take our survey questions, the label encoded response data would look as follows:

respondent_id transportation colour
R001 2 1
R002 1 3
R003 2 1

With the associated categories as reference data:

categorical variable category encoding
transportation Car 1
transportation Bus 2
transportation Train 3
colour Red 1
colour Blue 2
colour Green 3

The impact of label encoded data on DAX measures

In Power BI, if we want to calculate the percentage of respondents who chose 'Car' as their preferred mode of transportation, the DAX calculation would only need to consider the transportation column. This column represents the single categorical question. It would need to calculate the number of respondents that selected 1 in the transportation column over the total number of respondents that answered this survey question, which is simply all of the non-blank responses in this column. The DAX measure would look as follows:

% Car Responses = 
DIVIDE(
    COUNTROWS(
        FILTER(
            'Survey Responses',
            'Survey Responses'[transportation] = 1
        )
    ),
    COUNTROWS(
        FILTER(
            'Survey Responses',
            NOT(ISBLANK('Survey Responses'[transportation]))
        )
    ),
    0
)

This is a more streamlined DAX measure than the DAX measure over the one-hot encoded data. With label encoded data, all of the responses are stored in a single column so we don't have to consider the columns for all of the different response options in the denominator of the DAX measure. This consequently simplifies the development and maintenance of the DAX measures. This also speeds up DAX query execution time as only one column needs to be scanned by the DAX query.

DAX measures over label encoded data are much more flexible as they do not depend on the structure of the underlying dataset. With our transportation survey question example, if a new response option of 'Cycle' was added, this would not require a change to the denominator of the % Car Responses DAX measure. This decreases maintenance effort and mitigates the risk of introducing errors to the DAX calculations.

The impact of label encoded data on the semantic model

Label encoded data in Power BI is more efficient in terms of semantic model size and memory usage. By consolidating all of the response options into a single column, label encoding significantly reduces the number of columns in the dataset. This simplified representation of the data optimises storage and improves overall performance of the semantic model.

A single-column categorical format is also inherently more compatible with Power BI's relational model. It simplifies operations such as filtering, aggregations, and slicer interactions.

With label encoding, you can further simplify and optimise the semantic model by avoiding separate columns for each survey question entirely. Instead, you can structure the data with a single answer_value column containing the label encoded responses for all questions. To associate each response with a specific question, you can include a question_id column that links to a 'Survey Questions' dimension table. This approach reduces the complexity of the semantic model and follows the dimensional modeling approach that is optimal for Power BI.

The 'Survey Responses' table would look as follows:

respondent_id answer_value question_id
R001 2 1
R002 1 1
R003 2 1
R001 1 2
R002 3 2
R003 1 2

The 'Survey Questions' table would look as follows:

question_id question_name
1 transportation
2 colour

The % Car Responses DAX measure over this further optimised dataset would look as follows:

% Car Responses = 
DIVIDE(
    COUNTROWS(
        FILTER(
            'Survey Responses',
            RELATED('Survey Questions'[question_name]) = "transportation" &&
            'Survey Responses'[answer_value] = 1
        )
    ),
    COUNTROWS(
        FILTER(
            'Survey Responses',
            RELATED('Survey Questions'[question_name]) = "transportation"
        )
    ),
    0
)

Quantifying the improvements to semantic model size and DAX performance

As we now know, label encoding instead of one-hot encoding can significantly reduce semantic model size and improve DAX query performance. Let's quantify these improvements by loading one-hot encoded and label encoded survey response data (10 respondents, 30 questions, and 5 response options per question) to Power BI, developing equivalent measures over the data, and analysing the semantic model size and DAX performance.

When comparing the semantic model sizes in Bravo for Power BI, we can see that the one-hot encoded dataset was 1.06 MB, whereas the label encoded dataset was 1.02 MB, saving approximately 4% of memory. While this percentage may seem small, the impact scales significantly with larger datasets. We can also see that label encoding has reduced the column count from 151 to just 3, increasing compression efficiency and therefore improving query performance.

One-hot encoded dataset semantic model size.

Label encoded dataset semantic model size.

When comparing the DAX query performance for the % Q1 Option 1 DAX measure in DAX Studio, we can see that the total query duration over the one-hot encoded dataset was 10 ms, whereas it was just 5 ms over the label encoded dataset. This gives us an improvement in query execution time of 50%.

DAX measure over one-hot encoded dataset query performance.

DAX measure over label encoded dataset query performance.

These statistics show us that using label encoding instead of one hot encoding can lead to substantial improvements to the size and performance of our Power BI reports over categorical data.

Conclusion

One-hot encoding and label encoding are two different techniques for encoding categorical data. While one-hot encoding has its advantages in certain contexts, it is not well-suited for use in Power BI. The proliferation of columns in the one-hot encoded format significantly increases the semantic model size and memory usage. Moreover, it leads to poorly performant, rigid DAX measures that are difficult to write, maintain and test especially when dealing with datasets that contain a high number of categories or when the underlying data evolves.

Discover your Power BI Maturity Score by taking our FREE 5 minute quiz.

Label encoding, on the other hand, provides a cleaner and more efficient approach to handling categorical data in Power BI. By representing all category values in a single column, label encoding reduces the number of columns, optimising semantic model size and performance. Label encoding simplifies the development and maintenance of DAX measures as well as improving their performance. This format also aligns with Power BI's relational model, making it easier to create relationships, aggregations, and slicers.

As data professionals, we might receive data in one-hot encoded format. However, as discussed in this blog post, we know that this is not an optimal format for Power BI. The next blog post in this series will describe how to convert one-hot encoded data to the preferred label encoded format.

Jessica Hill

Software Engineer I

Jessica Hill

Jessica comes from a Biosciences background, having gained a 1st Class Bachelor of Science in Biology from The University of Manchester.

During the lockdown Jessica used the opportunity to explore her interest in technology by taking two Code First Girls online courses; Introduction to Web Development and Python Programming.

This led Jessica to look for job opportunities in the technology sector and joined endjin's 2021 apprenticeship cohort, which had over 200 applicants.