Feature Engineering (Categorical Encoding)

Categorical encoding refers to replacing category strings with a numerical representation. This is done because most machine learning algorithms require numerical input and cannot train directly on category strings.
Types of Categorical Encoding Techniques
- Traditional techniques
  - One hot encoding
  - Count/frequency encoding
  - Ordinal/Label encoding
- Monotonic relationship
  - Ordered label encoding
  - Mean encoding
  - Weight of evidence
- Alternative techniques
  - Binary encoding
  - Feature hashing
  - Others
Monotonic Relationship
A relationship between a variable and the target is monotonic when the target consistently moves in one direction as the variable increases:
- either the target also increases,
- or the target decreases.
One hot encoding
One hot encoding consists of encoding each categorical variable with a set of boolean variables that take the value 0 or 1, indicating whether a category is present for each observation. A categorical variable can be encoded by creating k-1 binary variables, where k is the number of distinct categories: the remaining category is implied when all k-1 variables are 0.
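The k-1 scheme can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name `one_hot_encode`, the `drop_first` flag, and the sample data are assumptions made for the example.

```python
def one_hot_encode(values, drop_first=True):
    """Return (column_names, rows); each row is a list of 0/1 flags."""
    categories = sorted(set(values))
    # Dropping one category leaves k-1 columns; the dropped category
    # is then represented by a row of all zeros.
    kept = categories[1:] if drop_first else categories
    rows = [[1 if value == cat else 0 for cat in kept] for value in values]
    return kept, rows


colours = ["red", "green", "blue", "green"]
columns, encoded = one_hot_encode(colours)
print(columns)  # ['green', 'red']
print(encoded)  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

Here "blue" is the dropped category, so it appears as `[0, 0]`.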
Label, Integer, or Ordinal Encoding
This encoding consists of replacing the categories with integers from 1 to n or from 0 to n-1, depending on the implementation, where n is the number of distinct categories of the variable. In this technique the numbers are assigned arbitrarily, so their order carries no meaning.
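A minimal sketch of this mapping, using the 0 to n-1 convention. The function name `label_encode` and the sample data are hypothetical, and alphabetical order is used only as one arbitrary way to assign the integers.

```python
def label_encode(values):
    """Map each distinct category to an integer from 0 to n-1."""
    # Alphabetical order is an arbitrary choice; it has no relation to the target.
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return mapping, [mapping[value] for value in values]


cities = ["Paris", "Lima", "Oslo", "Lima"]
mapping, codes = label_encode(cities)
print(mapping)  # {'Lima': 0, 'Oslo': 1, 'Paris': 2}
print(codes)    # [2, 0, 1, 0]
```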
Count or frequency encoding
Categories are replaced by the count or percentage of observations that belong to that category in the dataset.
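Both variants can be sketched with a single counter. The function name `count_encode` and the `normalize` flag are assumptions for this example.

```python
from collections import Counter


def count_encode(values, normalize=False):
    """Replace each category by its count, or by its share of the observations."""
    counts = Counter(values)
    total = len(values)
    if normalize:
        return [counts[value] / total for value in values]
    return [counts[value] for value in values]


colours = ["red", "green", "blue", "green"]
print(count_encode(colours))                  # [1, 2, 1, 2]
print(count_encode(colours, normalize=True))  # [0.25, 0.5, 0.25, 0.5]
```

Note that "red" and "blue" receive the same value because they occur equally often, so the encoding cannot distinguish between them.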
Ordered Ordinal Encoding
Categories are replaced by integers from 1 to K, where K is the number of distinct categories in the variable, but the numbering is informed by the mean of the target for each category: the categories are sorted by their target mean and numbered in that order.
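A minimal sketch of the idea, assuming ascending order of the target mean (the direction is a choice made for this example). The function name `ordered_ordinal_encode` and the toy data are hypothetical.

```python
from collections import defaultdict


def ordered_ordinal_encode(values, target):
    """Number categories from 1 to K in ascending order of their target mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, t in zip(values, target):
        sums[value] += t
        counts[value] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Sort categories by target mean, then assign ranks starting at 1.
    ordered = sorted(means, key=means.get)
    mapping = {cat: rank for rank, cat in enumerate(ordered, start=1)}
    return mapping, [mapping[value] for value in values]


colours = ["red", "green", "blue", "green", "red", "blue"]
target = [1, 0, 1, 0, 1, 0]
mapping, codes = ordered_ordinal_encode(colours, target)
print(mapping)  # {'green': 1, 'blue': 2, 'red': 3}
print(codes)    # [3, 1, 2, 1, 3, 2]
```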
Mean or Target Encoding
Mean encoding means replacing each category with the average target value for that category.
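As a rough illustration, the per-category averages can be computed in one pass. The function name `mean_encode` and the toy data are assumptions for the example.

```python
from collections import defaultdict


def mean_encode(values, target):
    """Replace each category by the mean of the target within that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, t in zip(values, target):
        sums[value] += t
        counts[value] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, [means[value] for value in values]


colours = ["red", "green", "blue", "green", "red", "blue"]
target = [1, 0, 1, 0, 1, 0]
means, encoded = mean_encode(colours, target)
print(means)    # {'red': 1.0, 'green': 0.0, 'blue': 0.5}
print(encoded)  # [1.0, 0.0, 0.5, 0.0, 1.0, 0.5]
```

Because the encoding uses the target, the means would normally be learned on the training data only and then applied to new data.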
Weight of Evidence
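WOE = ln(Distribution of Goods / Distribution of Bads)
The value is computed per category for a binary target. The distribution of goods is the share of all "good" observations that fall in the category, and the distribution of bads is the share of all "bad" observations that fall in it. Each category is then replaced by its WOE value.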
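The formula can be worked through on a small example. This sketch assumes a binary target where 1 marks a "good" and 0 a "bad"; the function name `weight_of_evidence` and the data are hypothetical. A category with no goods or no bads has an undefined WOE, so the sketch simply raises an error in that case.

```python
import math
from collections import Counter


def weight_of_evidence(values, target):
    """Return {category: ln(share of goods / share of bads)}."""
    goods = Counter(v for v, t in zip(values, target) if t == 1)
    bads = Counter(v for v, t in zip(values, target) if t == 0)
    total_goods = sum(goods.values())
    total_bads = sum(bads.values())
    woe = {}
    for cat in set(values):
        if goods[cat] == 0 or bads[cat] == 0:
            # ln(0) and division by zero are undefined.
            raise ValueError(f"WOE undefined for category {cat!r}")
        dist_goods = goods[cat] / total_goods
        dist_bads = bads[cat] / total_bads
        woe[cat] = math.log(dist_goods / dist_bads)
    return woe


values = ["A", "A", "A", "A", "B", "B", "B", "B"]
target = [1, 1, 1, 0, 1, 0, 0, 0]
woe = weight_of_evidence(values, target)
# A holds 3 of 4 goods and 1 of 4 bads, so WOE(A) = ln(3); WOE(B) = ln(1/3).
print(woe)
```

A positive value indicates a category with relatively more goods than bads, and a negative value the opposite.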
Rare Labels Encoding
Rare labels are those that appear in only a tiny proportion of the observations in a dataset. They are commonly grouped into a single new category (for example "Rare"), and that category can also be used to encode missing values.
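A common way to handle rare labels is to group them under a single placeholder category. A minimal sketch, where the function name `group_rare_labels`, the 10% threshold, and the "Rare" label are arbitrary choices for illustration:

```python
from collections import Counter


def group_rare_labels(values, threshold=0.1, rare_label="Rare"):
    """Replace categories whose share of observations is below the threshold."""
    counts = Counter(values)
    total = len(values)
    return [
        value if counts[value] / total >= threshold else rare_label
        for value in values
    ]


labels = ["a"] * 10 + ["b"] * 9 + ["c"]
grouped = group_rare_labels(labels)
# "c" covers 1 of 20 observations (5%), so it becomes "Rare".
print(Counter(grouped))
```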
Binary Encoding and Feature Hashing
Both techniques use a collection of 0s and 1s to encode the variable. The derived variables, taken individually, lack human-readable meaning.
Hashing method: choose the number of output variables → hash each category → use the hash value to determine the encoding
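Both ideas can be sketched in a few lines. This is a simplified illustration rather than how any particular library does it: the function names, the use of MD5, and the default of 8 output variables are assumptions for the example.

```python
import hashlib


def binary_encode(values):
    """Map each category to the binary digits of an integer code."""
    categories = sorted(set(values))
    # Assign integer codes starting at 1, then write each code in base 2.
    codes = {cat: idx for idx, cat in enumerate(categories, start=1)}
    width = max(codes.values()).bit_length()
    return {
        cat: [int(bit) for bit in format(code, f"0{width}b")]
        for cat, code in codes.items()
    }


def hash_encode(value, n_features=8):
    """Hash a category into one of n_features columns."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_features
    row = [0] * n_features
    row[index] = 1
    return row


colours = ["red", "green", "blue", "yellow", "pink"]
binary = binary_encode(colours)
print(binary["blue"])    # [0, 0, 1]
print(binary["yellow"])  # [1, 0, 1]
print(hash_encode("red"))
```

Binary encoding needs only 3 columns for these 5 categories. With hashing, the number of columns is fixed in advance, so two different categories can land in the same column.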