My Notes

Feature Engineering (Categorical Encoding)

feature engineering categorical encoding

Categorical encoding refers to replacing the category strings by a numerical representation. This is done because machines can’t use categories while training machine learning models.

Types of categorical Encoding Techniques

  • Traditional techniques
    • One hot encoding
    • Count/frequency encoding
    • Ordinal/Label encoding
  • Monotonic Relationship
    • Ordered label encoding
    • Mean encoding
    • Weight of evidence
  • Alternative techniques
    • Binary Encoding
    • Feature hashing
    • Others

Monotonic Relationship

When the variable increases  and the target also increases

When the variable increases and the target decreases. 

One hot encoding

One hot encoding consists in encoding each categorical variable with a set of boolean variables which take values 0 or 1, indicating if a category is present for each observation. A categorical variable should be encoded by created k-1 binary variables, where k is the number of distinct categories.

Label or integers or Ordinal Encoding

Encoding consists in replacing the categories by digits from 1 to n or 0 to n-1, depending on the implementation, where n is the number of distinct categories of the variable. In this technique the numbers are assigned arbitrarily.

Count or frequency encoding

Categories are replaced by the count or percentage of observations that show that category in the dataset.

Ordered Ordinal Encoding

Categories are replaced by integers from 1 to K, where K is the number of distinct categories in the variable, but this numbering is informed by the mean of the target for each category.

Mean or Target Encoding

Mean encoding implies replacing the category by the average target value for that category.

Weight of Evidence

 WOE = ln(Distribution of Goods / Distribution of Bads)

Rare Labels Encoding

Are those labels that appear only in a tiny proportion of the observations in a dataset. These rare values are utilized to encode the missing values.

Binary Encoding and Feature Hashing

Use a collection of 0s and 1s, to encode the meaning of the variable. Individually derived variables lack human readable meaning.

Hashing method: number of variable → hashing → encoding