OneHotEncoder
`OneHotEncoder` is a transformer that converts categorical variables into a binary matrix, where each unique category becomes a new column containing 1s and 0s. Unlike ordinal encoding, it does not assume a natural order between categories, making it ideal for nominal data like colors or city names.
Essential Parameters
- `sparse_output`: (Boolean, default=True) Determines whether the result is returned as a memory-efficient sparse matrix or a dense NumPy array. Note: in older versions, this parameter was simply named `sparse`.
- `handle_unknown`: Set to `'ignore'` to handle categories in the test set that were not seen during training; this will result in a row of all zeros for that feature.
- `drop`: Used to avoid multicollinearity (the “dummy variable trap”) by dropping one category per feature. Common values include `'first'` or `'if_binary'`.
- `min_frequency` / `max_categories`: Allow you to group infrequent categories into a single “Other” column to prevent the feature space from exploding.
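As a quick sketch of the two most commonly used of these options (hypothetical data, separate from the basic example below), `handle_unknown='ignore'` and `drop='first'` behave as follows:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
test = pd.DataFrame({'Color': ['Blue', 'Purple']})  # 'Purple' was never seen during fit

# handle_unknown='ignore': unseen categories become all-zero rows instead of raising an error
ignore_enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train)
print(ignore_enc.transform(test))
# [[1. 0. 0.]   <- Blue
#  [0. 0. 0.]]  <- Purple (unknown)

# drop='first' removes the first category per feature to avoid the dummy variable trap
drop_enc = OneHotEncoder(sparse_output=False, drop='first').fit(train)
print(drop_enc.get_feature_names_out())  # ['Color_Green' 'Color_Red'] ('Color_Blue' dropped)
```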
Basic Implementation
```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Data with 2 nominal features
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'], 'City': ['NY', 'LDN', 'NY']})
# Instantiate with dense output for visibility
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform
encoded_data = encoder.fit_transform(df)
# Retrieve column names for the new binary features
feature_names = encoder.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_data, columns=feature_names)
```

When to Use
- Nominal Data: When there is no inherent ranking (e.g., “Cat,” “Dog,” “Bird”).
- Linear Models / SVMs: These algorithms often require one-hot encoding because they cannot handle categorical strings directly and need numeric binary inputs.
- Low to Medium Cardinality: For features with thousands of unique values (high cardinality), consider Target Encoding or Hashing to keep the dataset size manageable.
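Before switching techniques, the `min_frequency` option described above offers a middle ground by grouping rare categories; a small sketch with hypothetical data:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# 'Purple' and 'Teal' each appear only once
df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue', 'Blue', 'Purple', 'Teal']})

# Categories seen fewer than 2 times are grouped into one "infrequent" column
enc = OneHotEncoder(sparse_output=False, min_frequency=2).fit(df)
print(enc.get_feature_names_out())
# ['Color_Blue' 'Color_Red' 'Color_infrequent_sklearn']
```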
For more details, visit the scikit-learn official API doc [1].
OrdinalEncoder
`OrdinalEncoder` is a preprocessing transformer used to convert categorical features into numerical integers. It is specifically designed for features with a natural ranking (e.g., "low," "medium," "high").
Unlike LabelEncoder, which is meant for the target variable (𝑦), OrdinalEncoder is optimized for input features (𝑋) and can transform multiple columns simultaneously.
Key Parameters and Usage
- `categories`: By default (`'auto'`), it determines categories from the data and sorts them alphabetically. To specify a custom order, pass a list of lists where each inner list represents the ordered categories for a column.
- `handle_unknown`: Set to `'use_encoded_value'` to handle categories not seen during training. You must also provide an `unknown_value` (e.g., -1).
- `encoded_missing_value`: Allows you to explicitly set the value used to encode missing data (i.e., `np.nan` inputs).
- `min_frequency` / `max_categories`: Used to group infrequent categories into a single “other” category, reducing dimensionality (added for OrdinalEncoder in Scikit-learn 1.3+).
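For instance, a minimal sketch (hypothetical data) of handling unseen categories with `handle_unknown='use_encoded_value'`:

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

train = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})
test = pd.DataFrame({'Size': ['Medium', 'XL']})  # 'XL' was never seen during fit

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

print(encoder.transform(test))
# [[ 1.]   <- 'Medium' (alphabetical order: Large=0, Medium=1, Small=2)
#  [-1.]]  <- 'XL' (unknown)
```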
**Python Example**: This example demonstrates how to set a custom order for a “Size” column:
```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Sample data
df = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Medium']})
# Define the order: Small=0, Medium=1, Large=2
categories = [['Small', 'Medium', 'Large']]
encoder = OrdinalEncoder(categories=categories)
# Fit and transform
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])
print(df)
```

When to Use
- Ordinal Data: Use when the relative order matters (e.g., Education levels: High School < Bachelor’s < PhD).
- Tree-based Models: Random Forests and Gradient Boosting models often perform better with `OrdinalEncoder` than `OneHotEncoder` because it keeps the feature space compact.
- Avoid for Linear Models: If there is no inherent order, using `OrdinalEncoder` can introduce a fake relationship that confuses linear models; use `OneHotEncoder` instead.
For more detailed technical specifications, refer to the official Scikit-learn OrdinalEncoder documentation [2].
Target Encoder
`TargetEncoder` is a preprocessing transformer designed to encode categorical features into numerical values based on the target variable. It is specifically useful for high-cardinality features where techniques like one-hot encoding would create too many sparse columns.
Core Functionality
- Encoding Logic: Each category is replaced with a shrunk mean of the target variable for that specific category.
- Regression: Replaced by the mean value of the target for the category.
- Classification: Replaced by the conditional probability (expected value) of the target given that category.
- Smoothing: It applies a shrinkage/smoothing parameter to move categorical means toward the global target mean, which helps handle rare categories and prevents extreme values.
- Automatic Overfitting Prevention: In `fit_transform`, it uses an internal cross-fitting scheme. It splits the training data into 𝑘 folds and encodes each fold using the values learned from the other 𝑘−1 folds, preventing the model from “memorizing” labels.
Usage Example: Regression
```python
from sklearn.preprocessing import TargetEncoder
import pandas as pd
# Sample data with categories and a target
X = pd.DataFrame({'city': ['New York', 'London', 'New York', 'Paris', 'London']})
y = [10, 20, 15, 30, 25]
# Initialize with 'auto' smoothing and an explicit continuous target type
encoder = TargetEncoder(smooth="auto", target_type="continuous")
# Use fit_transform on training data to enable cross-fitting
X_encoded = encoder.fit_transform(X, y)
# Use transform on new data (uses the per-category encodings learned on the full training set)
new_data = pd.DataFrame({'city': ['New York', 'Tokyo']})
new_encoded = encoder.transform(new_data) # Tokyo will use the global mean
print(new_encoded)
# Output
# array([[12.94117647],
# [20. ]])
```

Here is the exact breakdown of how those numbers are calculated.
- The Value for ‘Tokyo’ (20.0)
- Logic: ‘Tokyo’ is a category that the encoder never saw during the training phase.
- Formula: When a category is unknown (count = 0), the encoder defaults entirely to the Global Mean.
- Calculation:
  - Target `y` values: [10, 20, 15, 30, 25]
  - Sum: 100
  - Count: 5
  - Global Mean = 100 / 5 = 20
- The Value for ‘New York’ (12.9411…)
- This value comes from mixing the Local Mean of New York with the Global Mean. This is done to prevent overfitting (e.g., if New York only had 1 data point, we wouldn’t want to trust it 100%).
- Sklearn uses this formula for the encoding:

  encoding = λ × (category mean) + (1 − λ) × (global mean), where λ = n / (n + m)

- Where:
  - n: The count of the category (for ‘New York’, n = 2).
  - Category mean: The average of the ‘New York’ target values [10, 15], so 12.5.
  - Global mean: 20 (calculated above).
  - m: The smoothing parameter. (Since you set `smooth="auto"`, sklearn calculated a specific value for m based on the variance of your data.)
Usage Example: Classification
```python
from sklearn.preprocessing import TargetEncoder
import pandas as pd
import numpy as np
# 1. Generate a robust synthetic dataset (300 rows)
np.random.seed(42)
n_samples = 300
# Create categories with different underlying probabilities of success (1)
# High: ~80% success, Medium: ~50% success, Low: ~20% success
categories = np.random.choice(['High_Prob', 'Medium_Prob', 'Low_Prob'], size=n_samples)
y = []
for cat in categories:
if cat == 'High_Prob':
y.append(np.random.choice([0, 1], p=[0.2, 0.8]))
elif cat == 'Medium_Prob':
y.append(np.random.choice([0, 1], p=[0.5, 0.5]))
else: # Low_Prob
y.append(np.random.choice([0, 1], p=[0.8, 0.2]))
X = pd.DataFrame({'category': categories})
y = np.array(y)
print("--- Data Summary ---")
print(X['category'].value_counts())
print(f"Global Target Mean: {np.mean(y):.4f}\n")
# 2. Initialize TargetEncoder
# With sufficient data, we can use the default cv=5 safely.
# This splits data into 5 folds: training on 4, encoding the 5th.
encoder = TargetEncoder(smooth="auto", cv=5)
# 3. Fit and Transform (Training Phase)
# The values here represent the probability of the target being 1.
# Because of CV=5, different rows of the same category will get slightly different values
# depending on which fold they fell into.
X_encoded = encoder.fit_transform(X, y)
# 4. Analyze the Results
# TargetEncoder returns a 2D array (n_samples, 1). We need to flatten it for the DataFrame.
results = pd.DataFrame({
'Category': X['category'],
'Target': y,
'Encoded_Value': X_encoded.ravel()
})
print("--- Comparison: Raw Mean vs Encoded Value ---")
# We group by category to see how the encoder handled each group
summary = results.groupby('Category').agg(
Count=('Target', 'count'),
Raw_Mean=('Target', 'mean'), # The actual average in the data
Encoded_Mean=('Encoded_Value', 'mean'), # The average of the encoded values
Encoded_Min=('Encoded_Value', 'min'), # SHOWS VARIATION
Encoded_Max=('Encoded_Value', 'max') # SHOWS VARIATION
)
print(summary)
print("\nNotice: 'Encoded_Min' and 'Encoded_Max' are different for the same category.")
print("This confirms that fit_transform produced different values for the same category")
print("because it calculated means using different Cross-Validation folds.\n")
# 5. Transform New Data (Test Phase)
# 'Unknown_City' is a category the model has never seen.
new_data = pd.DataFrame({'category': ['High_Prob', 'Medium_Prob', 'Low_Prob', 'Unknown_City']})
new_encoded = encoder.transform(new_data)
print("--- New Data Transformation ---")
# When using .transform(), the variance disappears. Every 'High_Prob' gets the same value.
new_data['encoded_value'] = new_encoded.ravel()
print(new_data)
print("\nNote: During .transform(), the encoder uses the full training set average,")
print("so there is no variance for the same category anymore.")Output
--- Data Summary ---
category
Low_Prob 107
High_Prob 99
Medium_Prob 94
Name: count, dtype: int64
Global Target Mean: 0.4933
--- Comparison: Raw Mean vs Encoded Value ---
Count Raw_Mean Encoded_Mean Encoded_Min Encoded_Max
Category
High_Prob 99 0.868687 0.866779 0.856706 0.877825
Low_Prob 107 0.214953 0.216307 0.197529 0.246293
Medium_Prob 94 0.414894 0.414476 0.393589 0.442245
Notice: 'Encoded_Min' and 'Encoded_Max' are different for the same category.
This confirms that fit_transform produced different values for the same category
because it calculated means using different Cross-Validation folds.
--- New Data Transformation ---
category encoded_value
0 High_Prob 0.866965
1 Medium_Prob 0.415696
2 Low_Prob 0.216699
3 Unknown_City 0.493333
Note: During .transform(), the encoder uses the full training set average,
so there is no variance for the same category anymore.
```
Key Advantages & Disadvantages
- Pros:
  - Keeps the feature space compact (retains one column instead of 𝑁 columns).
  - Captures a direct relationship between the feature and the target.
- Cons:
  - High risk of data leakage if not used with internal cross-fitting (handled automatically by `fit_transform` in Scikit-learn).
  - Can lead to overfitting if categorical cardinality is extremely high relative to the number of samples.
Related Libraries

For advanced target encoding methods (like Leave-One-Out, James-Stein, or M-estimate), many practitioners use the Category Encoders library, a Scikit-learn-contrib package.
Label Encoder
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. `y`, and not the input `X`.

LabelEncoder converts categorical text labels (like “cat”, “dog”) into unique numbers (0, 1, 2…), which is essential for machine learning models that need numerical input. It assigns each category a distinct integer without implying order, but it is best reserved for encoding the target variable (y) rather than input features (X), where the assigned numbers could be misread as an ordering.
How it works
- Encoding: It maps each unique category to an integer from 0 to n-1 (where ‘n’ is the number of classes).
- `fit_transform()`: Combines learning the categories and applying the transformation.
- `classes_` attribute: Shows the original categories learned.
- `inverse_transform()`: Converts numbers back to the original labels.
Example (Python)
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ["cat", "dog", "bird", "cat"]
encoded_labels = le.fit_transform(labels)
print(encoded_labels)  # Output: [1 2 0 1] (order depends on alphabetical sort)
```

- Note: It’s typically for the target variable (y) because assigning numbers to features (X) can wrongly imply order (e.g., 2 > 1), which might confuse some models. For features, One-Hot Encoding is often preferred.
Scikit-learn’s LabelEncoder sorts labels in alphabetical (lexicographical) order before assigning integer values, starting from 0.
For example, if you have the labels ['cat', 'dog', 'bird']:
- `'bird'` will be assigned 0
- `'cat'` will be assigned 1
- `'dog'` will be assigned 2
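Continuing this example, the learned mapping and the reverse mapping can be inspected via `classes_` and `inverse_transform()`:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["cat", "dog", "bird", "cat"])

# classes_ holds the sorted original labels; the index is the encoded integer
print(le.classes_)                   # ['bird' 'cat' 'dog']

# inverse_transform maps integers back to the original labels
print(le.inverse_transform([0, 2]))  # ['bird' 'dog']
```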
For more details, visit the official scikit-learn doc [3].
Summary Comparison
| Encoder | Used For | Feature Type | Cardinality | Purpose | Why? |
|---|---|---|---|---|---|
| OrdinalEncoder | Input Features (𝑋) | Ordinal (Ordered) | Any | Ranking categories (Low=0, High=2) | Preserves meaningful hierarchy (e.g., Cold < Warm < Hot). |
| OneHotEncoder | Input Features (𝑋) | Nominal (No order) | Low (< 15) | Creating binary columns (0s and 1s) | Prevents model from assuming a fake rank without creating too many columns. |
| TargetEncoder | Input Features (𝑋) | Nominal (No order) | High (> 15) | Replacing categories with target means | Efficiently handles many categories (e.g., Zip Codes, bank account numbers) by linking them to the outcome (𝑦). |
| LabelEncoder | Only Target (𝑦) | Nominal | Low | Simple integer mapping (0, 1, 2) | Standardizes the output labels into integers (0, 1, 2) for classification algorithms. |
High Cardinality: If you have millions of unique identifiers (like bank account numbers), even `TargetEncoder` can struggle. In those cases, Feature Engineering (extracting parts of the ID) or Feature Hashing is preferred.
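For illustration, a minimal feature-hashing sketch (hypothetical account identifiers) that keeps the column count fixed no matter how many unique IDs exist:

```python
from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Hypothetical high-cardinality identifiers
accounts = pd.Series(['ACC-483920', 'ACC-118271', 'ACC-900014'])

# Each identifier is hashed into a fixed number of columns (here 8)
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[acc] for acc in accounts])
print(hashed.toarray().shape)  # (3, 8): dimensionality stays fixed
```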