OneHotEncoder
`OneHotEncoder` is a transformer that converts categorical variables into a binary matrix, where each unique category becomes a new column containing 1s and 0s. Unlike ordinal encoding, it does not assume a natural order between categories, making it ideal for nominal data like colors or city names.
Essential Parameters
- `sparse_output`: (Boolean, default=True) Determines whether the result is returned as a memory-efficient sparse matrix or a dense NumPy array. Note: in older versions, this parameter was simply named `sparse`.
- `handle_unknown`: Set to `'ignore'` to handle categories in the test set that were not seen during training; this will result in a row of all zeros for that feature.
- `drop`: Used to avoid multicollinearity (the “dummy variable trap”) by dropping one category per feature. Common values include `'first'` or `'if_binary'`.
- `min_frequency` / `max_categories`: Allow you to group infrequent categories into a single “Other” column to prevent the feature space from exploding.
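As a quick sketch of the two most commonly used of these options (hypothetical data, separate from the basic example below), `handle_unknown='ignore'` and `drop='first'` behave as follows:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
test = pd.DataFrame({'Color': ['Blue', 'Purple']})  # 'Purple' was never seen during fit

# handle_unknown='ignore': unseen categories become all-zero rows instead of raising an error
ignore_enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train)
print(ignore_enc.transform(test))
# [[1. 0. 0.]   <- Blue
#  [0. 0. 0.]]  <- Purple (unknown)

# drop='first' removes the first category per feature to avoid the dummy variable trap
drop_enc = OneHotEncoder(sparse_output=False, drop='first').fit(train)
print(drop_enc.get_feature_names_out())  # ['Color_Green' 'Color_Red'] ('Color_Blue' dropped)
```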
Basic Implementation
```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Data with 2 nominal features
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'], 'City': ['NY', 'LDN', 'NY']})
# Instantiate with dense output for visibility
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform
encoded_data = encoder.fit_transform(df)
# Retrieve column names for the new binary features
feature_names = encoder.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_data, columns=feature_names)
```

When to Use
- Nominal Data: When there is no inherent ranking (e.g., “Cat,” “Dog,” “Bird”).
- Linear Models / SVMs: These algorithms often require one-hot encoding because they cannot handle categorical strings directly and need numeric binary inputs.
- Low to Medium Cardinality: For features with thousands of unique values (high cardinality), consider Target Encoding or Hashing to keep the dataset size manageable.
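Before switching techniques, the `min_frequency` option described above offers a middle ground by grouping rare categories; a small sketch with hypothetical data:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# 'Purple' and 'Teal' each appear only once
df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue', 'Blue', 'Purple', 'Teal']})

# Categories seen fewer than 2 times are grouped into one "infrequent" column
enc = OneHotEncoder(sparse_output=False, min_frequency=2).fit(df)
print(enc.get_feature_names_out())
# ['Color_Blue' 'Color_Red' 'Color_infrequent_sklearn']
```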
For more details, visit the scikit-learn official API doc [1].
OrdinalEncoder
`OrdinalEncoder` is a preprocessing transformer used to convert categorical features into numerical integers. It is specifically designed for features with a natural ranking (e.g., "low," "medium," "high").
Unlike LabelEncoder, which is meant for the target variable (𝑦), OrdinalEncoder is optimized for input features (𝑋) and can transform multiple columns simultaneously.
Key Parameters and Usage
- `categories`: By default (`'auto'`), it determines categories from the data and sorts them alphabetically. To specify a custom order, pass a list of lists where each inner list represents the ordered categories for a column.
- `handle_unknown`: Set to `'use_encoded_value'` to handle categories not seen during training. You must also provide an `unknown_value` (e.g., -1).
- `encoded_missing_value`: Allows you to explicitly set the value used to encode missing data (i.e., `np.nan` inputs).
- `min_frequency` / `max_categories`: Used to group infrequent categories into a single “other” category, reducing dimensionality (added for OrdinalEncoder in Scikit-learn 1.3+).
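For instance, a minimal sketch (hypothetical data) of handling unseen categories with `handle_unknown='use_encoded_value'`:

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

train = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})
test = pd.DataFrame({'Size': ['Medium', 'XL']})  # 'XL' was never seen during fit

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

print(encoder.transform(test))
# [[ 1.]   <- 'Medium' (alphabetical order: Large=0, Medium=1, Small=2)
#  [-1.]]  <- 'XL' (unknown)
```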
**Python Example**: This example demonstrates how to set a custom order for a “Size” column:
```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Sample data
df = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Medium']})
# Define the order: Small=0, Medium=1, Large=2
categories = [['Small', 'Medium', 'Large']]
encoder = OrdinalEncoder(categories=categories)
# Fit and transform
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])
print(df)
```

When to Use
- Ordinal Data: Use when the relative order matters (e.g., Education levels: High School < Bachelor’s < PhD).
- Tree-based Models: Random Forests and Gradient Boosting models often perform better with `OrdinalEncoder` than `OneHotEncoder` because it keeps the feature space compact.
- Avoid for Linear Models: If there is no inherent order, using `OrdinalEncoder` can introduce a fake relationship that confuses linear models; use `OneHotEncoder` instead.
For more detailed technical specifications, refer to the official Scikit-learn OrdinalEncoder documentation [2].
Target Encoder
`TargetEncoder` is a preprocessing transformer designed to encode categorical features into numerical values based on the target variable. It is specifically useful for high-cardinality features where techniques like one-hot encoding would create too many sparse columns.
Core Functionality
- Encoding Logic: Each category is replaced with a shrunk mean of the target variable for that specific category.
- Regression: Replaced by the mean value of the target for the category.
- Classification: Replaced by the conditional probability (expected value) of the target given that category.
- Smoothing: It applies a shrinkage/smoothing parameter to move categorical means toward the global target mean, which helps handle rare categories and prevents extreme values.
- Automatic Overfitting Prevention: In `fit_transform`, it uses an internal cross-fitting scheme. It splits the training data into 𝑘 folds and encodes each fold using the values learned from the other 𝑘−1 folds, preventing the model from “memorizing” labels.
Usage Example: Regression
```python
from sklearn.preprocessing import TargetEncoder
import pandas as pd
# Sample data with categories and a target
X = pd.DataFrame({'city': ['New York', 'London', 'New York', 'Paris', 'London']})
y = [10, 20, 15, 30, 25]
# Initialize with 'auto' smoothing and an explicit continuous target type
encoder = TargetEncoder(smooth="auto", target_type="continuous")
# Use fit_transform on training data to enable cross-fitting
X_encoded = encoder.fit_transform(X, y)
# Use transform on new data (uses the per-category encodings learned on the full training set)
new_data = pd.DataFrame({'city': ['New York', 'Tokyo']})
new_encoded = encoder.transform(new_data) # Tokyo will use the global mean
print(new_encoded)
# Output
# array([[12.94117647],
# [20. ]])
```

Here is the exact breakdown of how those numbers are calculated.
- The Value for ‘Tokyo’ (20.0)
- Logic: ‘Tokyo’ is a category that the encoder never saw during the training phase.
- Formula: When a category is unknown (count = 0), the encoder defaults entirely to the Global Mean.
- Calculation:
  - Target `y` values: [10, 20, 15, 30, 25]
  - Sum: 100
  - Count: 5
  - Global Mean = 100 / 5 = 20
- The Value for ‘New York’ (12.9411…)
- This value comes from mixing the Local Mean of New York with the Global Mean. This is done to prevent overfitting (e.g., if New York only had 1 data point, we wouldn’t want to trust it 100%).
- Sklearn uses this formula for the encoding:

  encoding = λ × (category mean) + (1 − λ) × (global mean), where λ = n / (n + m)

- Where:
  - n: The count of the category (for ‘New York’, n = 2).
  - Category mean: The average of the ‘New York’ target values [10, 15], so 12.5.
  - Global mean: 20 (calculated above).
  - m: The smoothing parameter. (Since you set `smooth="auto"`, sklearn calculated a specific value for m based on the variance of your data.)
Usage Example: Classification
```python
from sklearn.preprocessing import TargetEncoder
import pandas as pd
import numpy as np
# 1. Generate a robust synthetic dataset (300 rows)
np.random.seed(42)
n_samples = 300
# Create categories with different underlying probabilities of success (1)
# High: ~80% success, Medium: ~50% success, Low: ~20% success
categories = np.random.choice(['High_Prob', 'Medium_Prob', 'Low_Prob'], size=n_samples)
y = []
for cat in categories:
if cat == 'High_Prob':
y.append(np.random.choice([0, 1], p=[0.2, 0.8]))
elif cat == 'Medium_Prob':
y.append(np.random.choice([0, 1], p=[0.5, 0.5]))
else: # Low_Prob
y.append(np.random.choice([0, 1], p=[0.8, 0.2]))
X = pd.DataFrame({'category': categories})
y = np.array(y)
print("--- Data Summary ---")
print(X['category'].value_counts())
print(f"Global Target Mean: {np.mean(y):.4f}\n")
# 2. Initialize TargetEncoder
# With sufficient data, we can use the default cv=5 safely.
# This splits data into 5 folds: training on 4, encoding the 5th.
encoder = TargetEncoder(smooth="auto", cv=5)
# 3. Fit and Transform (Training Phase)
# The values here represent the probability of the target being 1.
# Because of CV=5, different rows of the same category will get slightly different values
# depending on which fold they fell into.
X_encoded = encoder.fit_transform(X, y)
# 4. Analyze the Results
# TargetEncoder returns a 2D array (n_samples, 1). We need to flatten it for the DataFrame.
results = pd.DataFrame({
'Category': X['category'],
'Target': y,
'Encoded_Value': X_encoded.ravel()
})
print("--- Comparison: Raw Mean vs Encoded Value ---")
# We group by category to see how the encoder handled each group
summary = results.groupby('Category').agg(
Count=('Target', 'count'),
Raw_Mean=('Target', 'mean'), # The actual average in the data
Encoded_Mean=('Encoded_Value', 'mean'), # The average of the encoded values
Encoded_Min=('Encoded_Value', 'min'), # SHOWS VARIATION
Encoded_Max=('Encoded_Value', 'max') # SHOWS VARIATION
)
print(summary)
print("\nNotice: 'Encoded_Min' and 'Encoded_Max' are different for the same category.")
print("This confirms that fit_transform produced different values for the same category")
print("because it calculated means using different Cross-Validation folds.\n")
# 5. Transform New Data (Test Phase)
# 'Unknown_City' is a category the model has never seen.
new_data = pd.DataFrame({'category': ['High_Prob', 'Medium_Prob', 'Low_Prob', 'Unknown_City']})
new_encoded = encoder.transform(new_data)
print("--- New Data Transformation ---")
# When using .transform(), the variance disappears. Every 'High_Prob' gets the same value.
new_data['encoded_value'] = new_encoded.ravel()
print(new_data)
print("\nNote: During .transform(), the encoder uses the full training set average,")
print("so there is no variance for the same category anymore.")Output
--- Data Summary ---
category
Low_Prob 107
High_Prob 99
Medium_Prob 94
Name: count, dtype: int64
Global Target Mean: 0.4933
--- Comparison: Raw Mean vs Encoded Value ---
Count Raw_Mean Encoded_Mean Encoded_Min Encoded_Max
Category
High_Prob 99 0.868687 0.866779 0.856706 0.877825
Low_Prob 107 0.214953 0.216307 0.197529 0.246293
Medium_Prob 94 0.414894 0.414476 0.393589 0.442245
Notice: 'Encoded_Min' and 'Encoded_Max' are different for the same category.
This confirms that fit_transform produced different values for the same category
because it calculated means using different Cross-Validation folds.
--- New Data Transformation ---
category encoded_value
0 High_Prob 0.866965
1 Medium_Prob 0.415696
2 Low_Prob 0.216699
3 Unknown_City 0.493333
Note: During .transform(), the encoder uses the full training set average,
so there is no variance for the same category anymore.
```
Key Advantages & Disadvantages
- Pros:
  - Keeps the feature space compact (retains one column instead of 𝑁 columns).
  - Captures a direct relationship between the feature and the target.
- Cons:
  - High risk of data leakage if not used with internal cross-fitting (handled automatically by `fit_transform` in Scikit-learn).
  - Can lead to overfitting if categorical cardinality is extremely high relative to the number of samples.
Related Libraries

For advanced target encoding methods (like Leave-One-Out, James-Stein, or M-estimate), many practitioners use the Category Encoders library, a Scikit-learn-contrib package.
Label Encoder
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. `y`, and not the input `X`.

LabelEncoder converts categorical text labels (like “cat”, “dog”) into unique numbers (0, 1, 2…), which is essential for machine learning models that need numerical input. It assigns each category a distinct integer without implying order, but it is best reserved for encoding the target variable (y) rather than input features (X), where the assigned numbers could be misread as an ordering.
How it works
- Encoding: It maps each unique category to an integer from 0 to n-1 (where ‘n’ is the number of classes).
- `fit_transform()`: Combines learning the categories and applying the transformation.
- `classes_` attribute: Shows the original categories learned.
- `inverse_transform()`: Converts numbers back to the original labels.
Example (Python)
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ["cat", "dog", "bird", "cat"]
encoded_labels = le.fit_transform(labels)
print(encoded_labels)  # Output: [1 2 0 1] (order depends on alphabetical sort)
```

- Note: It’s typically for the target variable (y) because assigning numbers to features (X) can wrongly imply order (e.g., 2 > 1), which might confuse some models. For features, One-Hot Encoding is often preferred.
Scikit-learn’s LabelEncoder sorts labels in alphabetical (lexicographical) order before assigning integer values, starting from 0.
For example, if you have the labels ['cat', 'dog', 'bird']:
- `'bird'` will be assigned 0
- `'cat'` will be assigned 1
- `'dog'` will be assigned 2
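Continuing this example, the learned mapping and the reverse mapping can be inspected via `classes_` and `inverse_transform()`:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["cat", "dog", "bird", "cat"])

# classes_ holds the sorted original labels; the index is the encoded integer
print(le.classes_)                   # ['bird' 'cat' 'dog']

# inverse_transform maps integers back to the original labels
print(le.inverse_transform([0, 2]))  # ['bird' 'dog']
```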
For more details, visit the official scikit-learn doc [3].
Summary Comparison
| Encoder | Used For | Feature Type | Cardinality | Purpose | Why? |
|---|---|---|---|---|---|
| OrdinalEncoder | Input Features (𝑋) | Ordinal (Ordered) | Any | Ranking categories (Low=0, High=2) | Preserves meaningful hierarchy (e.g., Cold < Warm < Hot). |
| OneHotEncoder | Input Features (𝑋) | Nominal (No order) | Low (< 15) | Creating binary columns (0s and 1s) | Prevents model from assuming a fake rank without creating too many columns. |
| TargetEncoder | Input Features (𝑋) | Nominal (No order) | High (> 15) | Replacing categories with target means | Efficiently handles many categories (e.g., Zip Codes, bank account numbers) by linking them to the outcome (𝑦). |
| LabelEncoder | Only Target (𝑦) | Nominal | Low | Simple integer mapping (0, 1, 2) | Standardizes the output labels into integers (0, 1, 2) for classification algorithms. |
High Cardinality: If you have millions of unique identifiers (like bank account numbers), even `TargetEncoder` can struggle. In those cases, Feature Engineering (extracting parts of the ID) or Feature Hashing is preferred.
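For illustration, a minimal feature-hashing sketch (hypothetical account identifiers) that keeps the column count fixed no matter how many unique IDs exist:

```python
from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Hypothetical high-cardinality identifiers
accounts = pd.Series(['ACC-483920', 'ACC-118271', 'ACC-900014'])

# Each identifier is hashed into a fixed number of columns (here 8)
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[acc] for acc in accounts])
print(hashed.toarray().shape)  # (3, 8): dimensionality stays fixed
```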