Master ordering and sorting categorical data in Pandas for efficient machine learning. Learn to leverage the power of the Categorical data type for better performance.

Ordering and Sorting Categorical Data in Pandas

Categorical data describes variables that take a limited, fixed set of possible values, referred to as categories. Examples include gender, country, education level, or product rating. Pandas offers a Categorical data type that is usually more memory-efficient, and often faster, than standard object or string types for such data.

This guide provides a comprehensive overview of how to leverage Pandas' Categorical data type for ordering, reordering, and sorting categorical data.

What is Categorical Data?

Categorical data represents variables whose values fall into distinct categories. These categories can be:

  • Nominal: Categories without an inherent order (e.g., "Male", "Female", "Other").

  • Ordinal: Categories with a defined, meaningful order (e.g., "Low", "Medium", "High"; "Freshman", "Sophomore", "Junior", "Senior").

Utilizing Pandas' category dtype can lead to significant memory savings and performance improvements, especially when dealing with columns containing many repeated string values.
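As a rough illustration of the memory claim, the sketch below compares the same repeated labels stored as object and as category. The data is made up for the example, and the exact byte counts depend on the Pandas version and platform, so only the relative difference is meaningful.

```python
import pandas as pd

# Hypothetical data: 30,000 rows built from three repeated labels
labels = ["Low", "Medium", "High"] * 10_000
s_object = pd.Series(labels, dtype="object")
s_category = s_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = s_object.memory_usage(deep=True)
category_bytes = s_category.memory_usage(deep=True)

print("object dtype bytes:  ", object_bytes)
print("category dtype bytes:", category_bytes)
```

The category version stores each distinct label once plus a small integer code per row, which is why the gap grows with the number of repeated values.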

Ordering Categorical Data in Pandas

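By default, Pandas treats categorical data as unordered. On an unordered categorical, min() and max() raise a TypeError, and so do inequality comparisons such as < and >; only equality checks (== and !=) are supported.

To enable these operations, the categorical must be flagged as ordered. The .cat.as_ordered() method does this while keeping the current sequence of categories as the order, so when that sequence is not the one you want, set it explicitly (for example with .cat.set_categories() or .cat.reorder_categories(), described below).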

Example: Creating an Ordered Categorical Series

import pandas as pd

## Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype("category")

## Convert to an ordered categorical series
s_ordered = s.cat.as_ordered()

print("Ordered Categorical Series:\n", s_ordered)
print("Minimum value:", s_ordered.min())
print("Maximum value:", s_ordered.max())

Output:

Ordered Categorical Series:
0    a
1    b
2    c
3    a
4    a
5    a
6    b
7    b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
Minimum value: a
Maximum value: c

Once a categorical series is marked as ordered, min() and max() follow the sequence of its categories. Here the categories were inferred from the data in sorted order (a, b, c), so the category order coincides with alphabetical order.
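For ordinal data whose order is not alphabetical, one option is to declare the order up front with CategoricalDtype. The following is a minimal sketch using a hypothetical Low/Medium/High scale; it also shows that the same data without the ordered flag rejects min() instead of guessing.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordinal scale; the list order defines Low < Medium < High
level_dtype = CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
levels = pd.Series(["Medium", "High", "Low", "Medium"], dtype=level_dtype)

print(levels.min(), levels.max())

# Comparisons against a category use the declared order, not string order
at_least_medium = levels >= "Medium"
print(at_least_medium.tolist())

# Dropping the ordered flag makes min() raise a TypeError
unordered = levels.cat.as_unordered()
try:
    unordered.min()
    unordered_min_failed = False
except TypeError:
    unordered_min_failed = True
print("min() on unordered raised TypeError:", unordered_min_failed)
```

Declaring the dtype once also makes it easy to reuse the same order across several columns.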

Reordering and Setting Categories

Sometimes, the default order (often alphabetical) is not the desired order, or you need to modify the set of categories present in a column. Pandas provides two primary methods for this:

  • .cat.reorder_categories(new_categories): This method reorders the existing categories within a categorical series or column without changing the underlying data values or introducing new categories. The new_categories argument must be a list containing all the original categories in the desired order.

  • .cat.set_categories(new_categories): This method allows you to redefine the entire set of categories. You can change the order, add new categories, or remove existing ones. The new_categories argument is a list of the desired categories, and their order in this list defines the new categorical order. If the original data contains values not present in new_categories, they will be converted to NaN.

Example: Reordering and Setting Categories

import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

## Reorder existing categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)
print("Reordered Categories:\n", s_reordered)

## Set new categories, potentially adding or removing
s_new = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)
print("\nNew Categories Set:\n", s_new)

## Example with a value not in new categories
s_with_extra = pd.Series(["b", "a", "c", "e"], dtype="category")
s_new_limited = s_with_extra.cat.set_categories(["b", "a"], ordered=True)
print("\nSet with limited new categories:\n", s_new_limited)

Output:

Reordered Categories:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']

New Categories Set:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']

Set with limited new categories:
0      b
1      a
2    NaN
3    NaN
dtype: category
Categories (2, object): ['b' < 'a']

Key Points for set_categories:

  • If ordered=True is specified, the new categories will be treated as ordered.

  • Ensure the new_categories list contains all unique values from the original series if you don't want them to become NaN.
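The difference between the two methods shows up when the new list omits an existing category. In this short sketch (variable names are illustrative), reorder_categories refuses the incomplete list, while set_categories accepts it and replaces the missing value with NaN, so counting missing values afterwards is a cheap sanity check.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# reorder_categories needs exactly the existing categories, so this fails
try:
    s.cat.reorder_categories(["b", "a"])
    reorder_failed = False
except ValueError:
    reorder_failed = True
print("reorder_categories raised ValueError:", reorder_failed)

# set_categories accepts the shorter list and turns "c" into NaN
s_limited = s.cat.set_categories(["b", "a"], ordered=True)
dropped = int(s_limited.isna().sum())
print("values converted to NaN:", dropped)
```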

Sorting Categorical Data

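Sorting operations on categorical data follow the order of the categories, whether or not the Categorical type is flagged as ordered.

  • Unordered Categories: Sorting still follows the order of the categories, not the lexical order of the values. When the categories are inferred from the data (for example via astype("category")), they are stored in sorted order, so the result looks alphabetical.

  • Ordered Categories: Sorting follows the same category order; in addition, min(), max(), and inequality comparisons become available.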

Example: Sorting with and without Category Order

import pandas as pd

## Unordered: sorting follows the category order, inferred here as a, b, c
s_unordered = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")
print("Sorted by Inferred Category Order (Unordered):\n", s_unordered.sort_values())

## Ordered: sorting follows the explicitly defined category order
s_ordered = s_unordered.cat.set_categories(['c', 'a', 'b'], ordered=True)
print("\nSorted with Defined Order (Ordered):\n", s_ordered.sort_values())

Output:

Sorted by Inferred Category Order (Unordered):
0    a
3    a
4    a
5    a
1    b
6    b
7    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorted with Defined Order (Ordered):
2    c
0    a
3    a
4    a
5    a
1    b
6    b
7    b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']

As seen in the example, sort_values() follows the category order in both cases. With the explicitly defined order it places c before a before b; on the original series the categories were inferred as a, b, c, so the result happens to match alphabetical order.
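To see that sorting depends on the category order rather than on the ordered flag, the sketch below builds an unordered categorical whose categories are deliberately listed in a non-alphabetical sequence.

```python
import pandas as pd

# Unordered categorical with an explicit, non-alphabetical category list
s = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["c", "a", "b"], ordered=False)
)

# sort_values() uses the position of each value in the category list
sorted_values = s.sort_values().tolist()
print(sorted_values)  # ['c', 'a', 'a', 'b']
```

The ordered flag is therefore not needed for sorting; it matters for min(), max(), and inequality comparisons.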

Multi-Column Sorting with Categorical Data

When sorting a DataFrame by several columns, Pandas applies each column's own ordering: categorical columns are sorted by their category order and other columns by their natural order. This allows multi-level sorting that follows the logical sequence of the categories.

Example: Sorting by Multiple Categorical Columns

import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

sorted_df = df.sort_values(by=["A", "B"])
print("Sorted DataFrame:\n", sorted_df)

Output:

Sorted DataFrame:
    A  B
2  Y  1
3  Y  2
5  Z  1
6  Z  2
0  X  1
7  X  1
1  X  2
4  X  2

In this example, the DataFrame is first sorted by column "A" according to its defined categorical order (Y, then Z, then X). Within each group of identical "A" values, the rows are then sorted by column "B" in ascending numerical order.
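The same idea extends to several categorical columns and to mixed sort directions. The sketch below uses hypothetical "size" and "priority" columns and passes a list to the ascending parameter, so that sizes run from smallest to largest while the highest priority comes first within each size.

```python
import pandas as pd

# Hypothetical columns; each list of categories defines that column's order
df = pd.DataFrame({
    "size": pd.Categorical(
        ["M", "S", "L", "S", "M", "L"],
        categories=["S", "M", "L"], ordered=True),
    "priority": pd.Categorical(
        ["high", "low", "low", "high", "low", "high"],
        categories=["low", "high"], ordered=True),
})

# size ascending (S, M, L); priority descending (high before low)
result = df.sort_values(by=["size", "priority"], ascending=[True, False])
print(result)
```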

Conclusion

Effectively using Pandas' Categorical data type is essential for efficient and logically sound data manipulation. By converting relevant columns to this type and explicitly defining category order where necessary, you can achieve:

  • Memory Savings: Significantly reduce memory footprint for columns with repeated string values.

  • Performance Boosts: Accelerate operations that involve these columns.

  • Correct Sorting and Comparisons: Ensure that operations like sorting, min(), max(), and comparisons respect the inherent order of your data.

Key Methods for Categorical Data:

| Method | Description |
| :--- | :--- |
| astype("category") | Convert an existing column or series to the categorical type. |
| .cat.as_ordered() | Mark an existing unordered categorical series as ordered. |
| .cat.reorder_categories() | Reorder the existing categories within a categorical series. |
| .cat.set_categories() | Define a new set of categories, allowing for additions or removals. |
| sort_values() | Sorts data, respecting the defined category order if present. |

Embrace the Categorical dtype when working with string data that represents discrete groups, ordinal values, or hierarchical labels to optimize your data analysis workflow.