Skip to content
Jessica Hill By Jessica Hill Software Engineer I
Encoding categorical data for Power BI: Label encoding vs one-hot encoding - which encoding technique to use?

Categorical data refers to data that can be grouped into distinct categories rather than measured numerically. Categorical variables are used to capture these groups of categories. For example, the categorical variable 'Colour' could contain values of 'Red,' 'Blue,' and 'Green,' while the categorical variable 'Risk Level' could contain values of 'Low,' 'Medium,' and 'High.'

As data professionals, dealing with categorical data is common, but often this data needs to be transformed into a numerical format so that it is optimal for transfer, storage or analysis. In order to do this, we can use different categorical variable encoding techniques, which involve the encoding of categorical variables, such as 'Colour' or 'Risk Level', into numerical representations. The choice of which encoding technique to use primarily depends on factors such as the nature of the data, the analysis task at hand, and the requirements of downstream processes such as data analysis in Power BI or machine learning algorithms.

The first blog post in this series will explore two commonly used categorical variable encoding techniques: one-hot encoding and label encoding. We'll look at the advantages and disadvantages of these techniques more generally and then with a specific focus on Power BI.

What is label encoding?

Label encoding works by assigning a unique integer label to each category within a categorical variable. To explain label encoding using an example, imagine we have a survey which asks respondents "Please select your preferred mode of transportation", with a choice of response options: 'Car', 'Bus' and 'Train', and another question which asks respondents "Please select your favourite colour", with a choice of response options: 'Red', 'Blue' and 'Green'. The survey questions represent the categorical variables, and the response options represent the categories contained within these categorical variables.

Label encoding creates one column for each categorical variable. Our survey response data would therefore have a transportation column and a colour column. Within each column, the category selected by the respondent for that question is represented by an assigned arbitrary integer value.

respondent_id transportation colour
R001 2 1
R002 1 3
R003 2 1

Given we had assigned the following integer values to each category:

categorical variable category encoding
transportation Car 1
transportation Bus 2
transportation Train 3
colour Red 1
colour Blue 2
colour Green 3

We can see that in the label encoded response data, the respondent with the respondent_id of 'R001' has selected 'Bus' for the transportation question and 'Red' for the favourite colour question as indicated by the values in the transportation and colour columns for this respondent.

Note: The assigned integer values must be unique for each category contained within a categorical variable, but they don't need to be unique across the whole dataset.

What is one-hot encoding?

One-hot encoding works by converting categorical data into binary format. It creates new binary columns for each category of each categorical variable. Only one of the binary columns for each categorical variable will have a value of 1, this indicates the selection of that category, and all of the other columns will have a value of 0, indicating that these categories were not selected. This "one-hot" characteristic, where exactly one column is active (hot) for each categorical variable gives the technique its name.

To explain one-hot encoding using our survey questions example, each response option to each categorical question is represented in the data as a separate binary column e.g. transportation_car, transportation_bus and transportation_train for the transportation question and colour_red, colour_blue and colour_green for the favourite colour question. Only the column for the response option which has been selected by the respondent for that categorical question will have a value of 1, and all of the other columns are encoded 0.

For example, the respondent with the respondent_id of 'R001' selected 'Car' for the transportation question and 'Green' for the favourite colour question as indicated by the value of 1 in these binary columns for this respondent. The binary columns for all of the other unselected response options are encoded 0.

respondent_id transportation_car transportation_bus transportation_train colour_red colour_blue colour_green
R001 1 0 0 0 0 1
R002 0 1 0 1 0 0
R003 0 0 1 0 0 1

Advantages and disadvantages of one-hot encoding

One-hot encoding has many advantages for machine learning. The primary purpose of one-hot encoding is to ensure that categorical data can be effectively used in machine learning models. One-hot encoding maintains categorical independence, treating every category as a separate entity. This is advantageous if we have categorical data with no inherent ordering between the categories as it prevents the model preserving ordinal relationships between the categories. An ordinal relationship is a type of relationship between categories where there is a meaningful order or hierarchy amongst them, but the precise numerical differences between the categories may not be well-defined or consistent. For example, the following categories have an ordinal relationship: 'Strongly Disagree', 'Disagree', 'Neither Agree nor Disagree', 'Agree', 'Strongly Agree'.

If we have categorical data with no inherent ordering between the categories, like in our transportation survey question, there is no relationship between the response options of 'Car', 'Bus' and 'Train', if we were to assign arbitrary numerical values to each category the model might imply a false order or hierarchy amongst the categories or interpret this as ranking, leading to biased predictions. For example, by assigning the value of 1 to 'Car' and 2 to 'Bus' the machine learning algorithm interpret that 'Car' is a preferred transport option over 'Bus'. One-hot encoding mitigates this risk entirely through the maintenance of categorical independence.

One-hot encoding also mitigates the problem of "neighbour categories" that occurs when we assign arbitrary numerical values to each category. This is where the algorithm incorrectly assumes that categories with adjacent numerical values are more similar than those with non-adjacent values. One-hot encoding eliminates these risks by treating each category as a separate, unrelated entity. Finally, one-hot encoding allows the algorithm to learn separate weights for each category, this enhances the model's ability to make accurate predictions and also allows us to observe the impact of each category on the prediction.

Whilst one-hot encoding provides many benefits for machine learning algorithms, it may not be a suitable encoding technique for other use cases, particularly, for downstream analytics in Power BI. As each category is treated as its own separate entity with its own column, this can lead to a proliferation of columns, particularly with high cardinality categorical data (data that has many unique values). In Power BI, this can be problematic for a number of reasons. Firstly, if there are a large number of columns, this can bloat the size of the semantic model, increasing memory usage and potentially leading to performance issues. Secondly, having a column for each category consequently leads to unnecessarily complex DAX measures. Taking our transportation survey question, the DAX calculations over this dataset would need to bake in the various different columns that compose this singular categorical question. This makes makes the DAX measures harder to write, maintain and test. For a more in depth discussion on this, please refer the next blog post in this series.

Advantages Disadvantages
Improved accuracy of machine learning models by preventing the algorithm from assuming an ordinal relationship between the categories. Negatively impacts data analysis in data analytics tools such as Power BI as the complexity of the data consequently leads to complex DAX measures and increased semantic model size.
Improved accuracy of machine learning models by mitigating the problem of "neighbour categories".
Improved accuracy of machine learning models by allowing the algorithm to learn separate weights for each category, and therefore make more accurate predictions.

Advantages and disadvantages of label encoding

Label encoding is a useful technique for use in machine learning algorithms in specific scenarios where you do want ordinal relationships to be maintained. This is because label encoding preserves ordinal relationships between categories. For example, there is a natural order between the following categories: 'Strongly Disagree', 'Disagree', 'Neither Agree nor Disagree', 'Agree', 'Strongly Agree', so label encoding would be good choice of encoding technique to prepare this data for machine learning algorithms. However, it is important to keep in mind that label encoding may introduce ordinal relationships where none exist. For example, take our transportation survey question, as discussed above, if we label encoded the response data by assigning the values of 1 to 'Car', 2 to 'Bus', 3 to 'Train', and used it in a machine learning model, the model might mistakenly imply a false order or hierarchy amongst the categories where there is none, leading to biased predictions.

Label encoding simplifies the representation of the data, making it extremely useful for downstream analytics in Power BI, offering many advantages. Firstly, label encoded data is memory-efficient, by using integers in place of strings, label encoding reduces the size of your semantic model, optimising memory usage and improving the performance of your Power BI reports. Secondly, the consolidation of categorical variables into singular columns, makes the data easier to interpret and analyse. Specifically, the DAX measures over the top of the data can be more streamlined and flexible. For example, to analyse our transportation survey question, the DAX calculations over this dataset would only need to consider one column, the transportation column, as this column represents the single categorical question. For a more in depth discussion on this, please refer the next blog post in this series.

Advantages Disadvantages
Preserves ordinal relationship between the categories, therefore useful encoding technique for machine learning models where there is a natural order to the categories. If there is no natural order to the categories, machine learning models might mistakenly imply a false order or hierarchy amongst the categories, leading to biased predictions.
Positively impacts data analysis in data analytics tools such as Power BI as it simplifies the representation of the data which consequently leads to streamlined and flexible DAX measures and reduced semantic model size.

Conclusion

One-hot encoding and label encoding are two different techniques for encoding categorical data. One-hot encoding is particularly useful in machine learning models over categorical data where there is no inherent relationship between the categories. This is because one-hot encoding maintains categorical independence therefore preventing biased predictions. However one-hot encoding is not a desired technique for encoding categorical data for analysis in Power BI as it increases semantic model size and complicates DAX measures.

By contrast, label encoding is the preferred technique for encoding categorical data for analysis in Power BI as it simplifies the representation of the data, which consequently leads to more flexible DAX measures and reduced semantic model size. It is also useful in machine learning algorithms over categorical data where we want to preserve ordinal relationships between the categories. However, if the data has no natural order amongst the categories, label encoding may cause biased predictions.

Ultimately, the choice between the two encoding techniques depends on factors such as the nature of the data, the analysis task at hand, and the requirements of downstream processes, whether that be machine learning algorithms or data analysis in Power BI.

Now we've looked at the advantages and disadvantages of one-hot encoding and label-encoding in a wider context, the next blog post in this series will dive deeper into the specific implications of using one-hot encoded data and label-encoded data in Power BI. Using specific examples, we will see why label encoding is the preferred technique for encoding categorical data for analysis in Power BI.

Jessica Hill

Software Engineer I

Jessica Hill

Jessica comes from a Biosciences background, having gained a 1st Class Bachelor of Science in Biology from The University of Manchester.

During the lockdown Jessica used the opportunity to explore her interest in technology by taking two Code First Girls online courses; Introduction to Web Development and Python Programming.

This led Jessica to look for job opportunities in the technology sector and joined endjin's 2021 apprenticeship cohort, which had over 200 applicants.