Master ordering and sorting categorical data in Pandas for efficient machine learning. Learn to leverage the power of the Categorical data type for better performance.
Ordering and Sorting Categorical Data in Pandas
Categorical data is a fundamental data type for variables with a limited and fixed number of possible values, often referred to as categories. Examples include gender, country, education levels, or product ratings. Pandas offers a powerful Categorical data type, which is typically more memory-efficient and often faster than standard object or string types for such data.
This guide provides a comprehensive overview of how to leverage Pandas' Categorical data type for ordering, reordering, and sorting categorical data.
What is Categorical Data?
Categorical data represents variables whose values fall into distinct categories. These categories can be:
Nominal: Categories without an inherent order (e.g., "Male", "Female", "Other").
Ordinal: Categories with a defined, meaningful order (e.g., "Low", "Medium", "High"; "Freshman", "Sophomore", "Junior", "Senior").
Utilizing Pandas' category dtype can lead to significant memory savings and performance improvements, especially when dealing with columns containing many repeated string values.
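As a rough illustration of the memory claim, the sketch below compares the same column stored as object and as category. The data (three labels repeated 100,000 times) is an assumption made up for this example, and the exact byte counts will vary with the Pandas version and platform.

```python
import pandas as pd

# Illustrative data (an assumption, not from the article): 300,000 rows, three distinct strings.
as_object = pd.Series(["Low", "Medium", "High"] * 100_000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects themselves, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
print(f"ratio: {object_bytes / category_bytes:.1f}x")
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.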
Ordering Categorical Data in Pandas
By default, Pandas treats categorical data as unordered. On an unordered categorical, min() and max() raise a TypeError, and only equality comparisons (== and !=) between values are allowed.
To enable ordered operations, you must explicitly mark the categories as ordered. The .cat.as_ordered() method does this using the current order of the categories. For data converted with astype("category"), that is the sorted order of the unique values (alphabetical for strings).
Example: Creating an Ordered Categorical Series
import pandas as pd
## Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype("category")
## Convert to an ordered categorical series
s_ordered = s.cat.as_ordered()
print("Ordered Categorical Series:\n", s_ordered)
print("Minimum value:", s_ordered.min())
print("Maximum value:", s_ordered.max())
Output:
Ordered Categorical Series:
0 a
1 b
2 c
3 a
4 a
5 a
6 b
7 b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
Minimum value: a
Maximum value: c
Once a categorical series is marked as ordered, operations like finding the minimum or maximum value will respect the defined sequence of categories, rather than relying on alphabetical order.
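To make the difference between unordered and ordered categoricals concrete, here is a minimal sketch. It assumes a recent Pandas version, in which min() on an unordered categorical raises a TypeError; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# On an unordered categorical, min() is rejected rather than guessed.
try:
    s.min()
    unordered_min_error = None
except TypeError as exc:
    unordered_min_error = exc
print("Unordered min() raised:", type(unordered_min_error).__name__)

# as_ordered() keeps the existing category order and sets the ordered flag.
s_ordered = s.cat.as_ordered()
print("Is ordered:", s_ordered.cat.ordered)
print("Categories:", list(s_ordered.cat.categories))

# With the flag set, inequality comparisons against a category are allowed.
above_a = s_ordered > "a"
print(above_a.tolist())
```

Note that as_ordered() does not choose an order for you. If the existing category order is not the one you want, set it explicitly before or while marking the series as ordered.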
Reordering and Setting Categories
Sometimes, the default order (often alphabetical) is not the desired order, or you need to modify the set of categories present in a column. Pandas provides two primary methods for this:
.cat.reorder_categories(new_categories): This method reorders the existing categories within a categorical series or column without changing the underlying data values or introducing new categories. The new_categories argument must be a list containing all the original categories in the desired order.
.cat.set_categories(new_categories): This method allows you to redefine the entire set of categories. You can change the order, add new categories, or remove existing ones. The new_categories argument is a list of the desired categories, and their order in this list defines the new categorical order. If the original data contains values not present in new_categories, they will be converted to NaN.
Example: Reordering and Setting Categories
import pandas as pd
s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")
## Reorder existing categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)
print("Reordered Categories:\n", s_reordered)
## Set new categories, potentially adding or removing
s_new = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)
print("\nNew Categories Set:\n", s_new)
## Example with a value not in new categories
s_with_extra = pd.Series(["b", "a", "c", "e"], dtype="category")
s_new_limited = s_with_extra.cat.set_categories(["b", "a"], ordered=True)
print("\nSet with limited new categories:\n", s_new_limited)
Output:
Reordered Categories:
0 b
1 a
2 c
3 a
4 b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']
New Categories Set:
0 b
1 a
2 c
3 a
4 b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']
Set with limited new categories:
0 b
1 a
2 NaN
3 NaN
dtype: category
Categories (2, object): ['b' < 'a']
Key Points for set_categories:
If ordered=True is specified, the new categories will be treated as ordered.
Ensure the new_categories list contains all unique values from the original series if you don't want them to become NaN.
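The contrast between the two methods can be sketched as follows. The pre-check at the end is only one possible way to catch values that would be lost, and the names used (new_categories, missing) are illustrative rather than part of any Pandas API.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# reorder_categories() requires exactly the existing categories; omitting "c" is rejected.
try:
    s.cat.reorder_categories(["b", "a"])
    reorder_error = None
except ValueError as exc:
    reorder_error = exc
print("reorder_categories raised:", type(reorder_error).__name__)

# set_categories() accepts the shorter list, but values outside it become NaN.
s_limited = s.cat.set_categories(["b", "a"], ordered=True)
nan_count = int(s_limited.isna().sum())
print("Values turned into NaN:", nan_count)

# Checking the data against the new list first avoids silent NaN values.
new_categories = ["b", "a"]
missing = sorted(set(s.unique()) - set(new_categories))
print("Values not covered by new_categories:", missing)
```

In short, reorder_categories() is the safer choice when you only want to change the order, because it refuses a list that would drop or add categories.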
Sorting Categorical Data
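Sorting a categorical in Pandas follows the order of its categories, not the lexical order of the values. This holds whether or not the Categorical type is marked as ordered.
Default Categories: When categories are inferred, for example with astype("category"), they are stored in sorted order, so sorting looks lexicographic (alphabetical).
Custom Categories: When you define the category order explicitly, sorting follows that order. Setting ordered=True additionally enables min(), max(), and comparisons such as < and >.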
Example: Sorting with and without Category Order
import pandas as pd
## Unordered sorting (lexical)
s_unordered = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")
print("Lexical Sorting (Unordered):\n", s_unordered.sort_values())
## Ordered sorting
s_ordered = s_unordered.cat.set_categories(['c', 'a', 'b'], ordered=True)
print("\nSorted with Defined Order (Ordered):\n", s_ordered.sort_values())
Output:
Lexical Sorting (Unordered):
0 a
3 a
4 a
5 a
1 b
6 b
7 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
Sorted with Defined Order (Ordered):
2 c
0 a
3 a
4 a
5 a
1 b
6 b
7 b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']
As seen in the example, sort_values() on the series with explicitly set categories respects the defined order (c before a before b), while the series with default, inferred categories sorts alphabetically (a before b before c) because its categories are stored in that order.
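One detail is easy to miss: sort_values() uses the category order even when the ordered flag is not set. The sketch below builds the same custom order twice, once unordered and once ordered, to show that the sort result is identical. The variable names are illustrative only.

```python
import pandas as pd

values = ["a", "b", "c", "a", "b"]

# Same custom category order, once without and once with the ordered flag.
unordered = pd.Series(pd.Categorical(values, categories=["c", "a", "b"], ordered=False))
ordered = pd.Series(pd.Categorical(values, categories=["c", "a", "b"], ordered=True))

unordered_sorted = unordered.sort_values().tolist()
ordered_sorted = ordered.sort_values().tolist()
print(unordered_sorted)
print(ordered_sorted)

# Descending sort simply reverses the category order.
descending = ordered.sort_values(ascending=False).tolist()
print(descending)
```

The ordered flag therefore matters for min(), max(), and inequality comparisons, not for whether sorting follows the category order.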
Multi-Column Sorting with Categorical Data
When sorting a DataFrame containing multiple categorical columns, Pandas applies the sorting logic for each column based on its defined order. This allows for sophisticated, multi-level sorting that adheres to the logical sequence of categories.
Example: Sorting by Multiple Categorical Columns
import pandas as pd
df = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})
sorted_df = df.sort_values(by=["A", "B"])
print("Sorted DataFrame:\n", sorted_df)
Output:
Sorted DataFrame:
A B
2 Y 1
3 Y 2
5 Z 1
6 Z 2
0 X 1
7 X 1
1 X 2
4 X 2
In this example, the DataFrame is first sorted by column "A" according to its defined categorical order (Y, then Z, then X). Within each group of identical "A" values, the rows are then sorted by column "B" in ascending numerical order.
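The same idea extends to a DataFrame in which both sort keys are categorical. The following is a minimal sketch with made-up data: the column names (size, priority) and their category orders are assumptions for illustration. It uses pd.CategoricalDtype to declare each order once and a per-column ascending list to mix directions.

```python
import pandas as pd

# Hypothetical category orders, chosen only for this example.
size_dtype = pd.CategoricalDtype(["S", "M", "L"], ordered=True)
priority_dtype = pd.CategoricalDtype(["low", "high"], ordered=True)

df = pd.DataFrame({
    "size": pd.Series(["L", "S", "M", "S", "L", "M"], dtype=size_dtype),
    "priority": pd.Series(["low", "high", "low", "low", "high", "high"],
                          dtype=priority_dtype),
})

# Ascending by size (S, M, L); descending by priority (high before low) within each size.
sorted_df = df.sort_values(by=["size", "priority"], ascending=[True, False])
print(sorted_df)
```

Each column is sorted by its own category order, so "M" lands between "S" and "L" even though it would come between "L" and "S" alphabetically.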
Conclusion
Effectively using Pandas' Categorical data type is essential for efficient and logically sound data manipulation. By converting relevant columns to this type and explicitly defining category order where necessary, you can achieve:
Memory Savings: Significantly reduce memory footprint for columns with repeated string values.
Performance Boosts: Accelerate operations that involve these columns.
Correct Sorting and Comparisons: Ensure that operations like sorting, min(), max(), and comparisons respect the inherent order of your data.
Key Methods for Categorical Data:
| Method | Description |
| :--- | :--- |
| astype("category") | Convert an existing column or series to the categorical type. |
| .cat.as_ordered() | Mark an existing unordered categorical series as ordered. |
| .cat.reorder_categories() | Reorder the existing categories within a categorical series. |
| .cat.set_categories() | Define a new set of categories, allowing for additions or removals. |
| sort_values() | Sorts data, respecting the defined category order if present. |
Embrace the Categorical dtype when working with string data that represents discrete groups, ordinal values, or hierarchical labels to optimize your data analysis workflow.