What is a Cartesian Product?
The Cartesian product of two sets $A$ and $B$, denoted $A \times B$, is the set of all ordered pairs $(a, b)$ where $a \in A$ and $b \in B$.
Example: Let $A = \{1, 2\}$ and $B = \{x, y\}$.
Then $A \times B = \{(1, x), (1, y), (2, x), (2, y)\}$.
Its size grows multiplicatively: if $|A| = m$ and $|B| = n$, then $|A \times B| = m \cdot n$.
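The definition and the size rule can be checked with a minimal sketch using Python's standard `itertools.product`. The sets `A` and `B` are illustrative values, not taken from any dataset.

```python
from itertools import product

# Illustrative sets, chosen only for this sketch.
A = [1, 2]
B = ["x", "y"]

# product(A, B) yields every ordered pair (a, b) with a in A and b in B.
pairs = set(product(A, B))

# The size of the product equals the product of the sizes.
assert len(pairs) == len(A) * len(B)
print(sorted(pairs))
```

Because the pairs are ordered, `(1, "x")` is in the result but `("x", 1)` is not.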
How is the Cartesian Product Used in Categorical Data: Feature Crosses?
🧠 Feature crosses (also called combinatorial features) are new features created by combining two or more categorical features into a single categorical feature whose levels are the joint levels (interactions) of the originals.
🚀 Example:
Suppose you have two categorical features:
• Country = {US, UK}
• Device = {Mobile, Desktop}
A feature cross (i.e., Cartesian product) would generate:
{(US, Mobile), (US, Desktop), (UK, Mobile), (UK, Desktop)}
These pairs become the levels of a new categorical feature:
• US_Mobile
• US_Desktop
• UK_Mobile
• UK_Desktop
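In code, crossing two columns can be as simple as joining their values row by row. The following is a minimal sketch in plain Python; the rows, the `cross` helper, and the `country_x_device` column name are hypothetical and exist only for illustration.

```python
# Hypothetical rows using the Country and Device features from the example.
rows = [
    {"country": "US", "device": "Mobile"},
    {"country": "UK", "device": "Desktop"},
    {"country": "US", "device": "Desktop"},
]

def cross(row, keys, sep="_"):
    """Join the values of the given keys into one crossed category."""
    return sep.join(str(row[k]) for k in keys)

# Add the crossed feature to every row.
for row in rows:
    row["country_x_device"] = cross(row, ["country", "device"])

crossed_values = [row["country_x_device"] for row in rows]
print(crossed_values)
```

Only combinations that occur in the data show up as values here, while the full Cartesian product defines the largest vocabulary the crossed feature can have.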
You can then:
• One-hot encode these combined categories
• Feed them to tree-based models, embedding layers, or wide & deep models
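One-hot encoding the crossed feature could look like the sketch below. It assumes the vocabulary is the full Cartesian product of the two features; the `vocab` mapping and the `one_hot` helper are illustrative names, not part of any library.

```python
from itertools import product

countries = ["US", "UK"]
devices = ["Mobile", "Desktop"]

# One index per (country, device) combination, in product order.
vocab = {f"{c}_{d}": i for i, (c, d) in enumerate(product(countries, devices))}

def one_hot(value):
    """Return a vector with a single 1 at the index of the crossed category."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec

encoded = one_hot("UK_Mobile")
print(encoded)
```

The vector length equals the number of crossed levels, which is why high-cardinality crosses quickly become wide and sparse.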
Why use Feature Crosses?
- Capture interactions between categories that have joint effects (e.g., “users in US on mobile” behave differently from “users in UK on desktop”).
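- Improve model accuracy, especially for linear models such as logistic regression, whose score is a sum of per-feature terms and so cannot represent interactions unless they are supplied as features. Neural nets can learn interactions on their own, but explicit crosses can still help.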
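A short worked argument shows why an additive model needs the cross. It assumes a linear score $s$ with a bias $b$ and one weight $w$ per category; these symbols, and the target used below, are introduced only for illustration.

$$
s(c, d) = b + w_c + w_d
\quad\Longrightarrow\quad
s(\text{US}, \text{Mobile}) + s(\text{UK}, \text{Desktop})
= s(\text{US}, \text{Desktop}) + s(\text{UK}, \text{Mobile})
$$

Both sides equal $2b + w_{\text{US}} + w_{\text{UK}} + w_{\text{Mobile}} + w_{\text{Desktop}}$. Suppose the target is 1 for (US, Mobile) and (UK, Desktop) and 0 for the other two pairs: the left side would have to be 2 and the right side 0, so no choice of weights fits. Adding one weight $w_{c,d}$ per crossed level removes the constraint.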
⚠️ Trade-Offs:
Pros | Cons |
---|---|
Can improve expressiveness | Cardinality explosion (the number of levels multiplies) |
Captures important patterns | May lead to overfitting or sparsity |
Especially useful in Wide & Deep Models | Often needs embeddings or hashing to stay manageable |
In Practice:
• Manual Crosses: You select features to cross based on domain knowledge.
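• Automated Crosses: Library helpers such as tf.feature_column.crossed_column (TensorFlow; the tf.feature_column API is deprecated in recent releases in favor of Keras preprocessing layers such as tf.keras.layers.HashedCrossing), or interactions learned implicitly by embedding layers in deep models.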
• Hashed Crosses: Avoid exploding dimensionality by hashing crossed values into a fixed number of buckets, at the cost of occasional collisions.
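The hashing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scheme any particular library uses: the `hashed_cross` helper and the bucket count of 16 are hypothetical, and SHA-256 stands in for whatever hash function a real implementation applies.

```python
import hashlib

def hashed_cross(values, num_buckets):
    """Map a tuple of category values to a bucket index in [0, num_buckets)."""
    key = "_".join(values).encode("utf-8")
    # Use a stable hash; Python's built-in hash() of str is salted per process.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

bucket = hashed_cross(("US", "Mobile"), num_buckets=16)
print(bucket)
```

The feature width is fixed at `num_buckets` regardless of how many combinations exist, but distinct combinations may share a bucket, so the bucket count trades memory against collision rate.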
Resources
- Google > Machine Learning > Crash Course > Working with categorical data > Feature Crosses