My Notes

Feature Engineering (Missing Value Handling)

Missing data, or missing values, occur when no value is stored for a variable in a given observation. They are one of the most common problems in real datasets. For example, suppose we collect the name, age, address and so on of 1000 people, and 55 of them decline to share their address. Those 55 data points are treated as missing data.

  • A value can be missing because it was forgotten, lost or not stored properly.
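As a first step it usually helps to measure how much is missing in each variable. The sketch below is a minimal illustration in plain Python; the records, the use of None to mark a missing value, and the helper name missing_fraction are all assumptions made for this example.

```python
# Hypothetical records; None marks a value that was not stored.
people = [
    {"name": "Ana", "age": 34, "address": "12 Hill St"},
    {"name": "Ben", "age": None, "address": None},
    {"name": "Cara", "age": 29, "address": None},
    {"name": "Dev", "age": 41, "address": "7 Lake Rd"},
]


def missing_fraction(rows, column):
    """Return the fraction of rows whose value in `column` is missing (None)."""
    missing = sum(1 for row in rows if row[column] is None)
    return missing / len(rows)


# Fraction of missing values per variable.
fractions = {column: missing_fraction(people, column) for column in people[0]}
print(fractions)
```

These fractions are what the "no more than 5%" rules of thumb later in these notes refer to.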

Missing Data Mechanisms:

  • MCAR (Missing Completely at Random)
  • MAR (Missing at Random)
  • MNAR (Missing Not at Random)

Missing Data Completely at Random (MCAR)

Missing completely at random (MCAR) is a missing data mechanism in which the probability of a value being missing is unrelated to both the observed data and the missing data.

The probability of being missing is the same for all observations. There is no relationship between the missingness and any other values in the dataset, observed or missing, so disregarding those cases would not bias the inferences made.

Missing Data At Random (MAR)

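The probability of an observation being missing depends on available information, that is, on other observed variables in the dataset, but not on the missing value itself. For example, if older respondents are less likely to report their income, income is MAR given age.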

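Missing Data Not At Random (MNAR)

There is a mechanism or a reason why missing values are introduced in the dataset: the probability of a value being missing depends on the missing value itself or on information that was not recorded. For example, people with high incomes may be less likely to disclose them.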
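The difference between the three mechanisms is easier to see when the missingness is simulated. The following is only a sketch under stated assumptions: the survey data is randomly generated, and the probabilities (0.2, 0.4, 0.1) and thresholds are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical survey: every person has an age and an income (in thousands).
people = [
    {"age": rng.randint(20, 70), "income": rng.randint(20, 120)}
    for _ in range(1000)
]


def mask_income(rows, prob_missing):
    """Copy rows, hiding income (None) with probability prob_missing(row)."""
    masked = []
    for row in rows:
        hidden = rng.random() < prob_missing(row)
        masked.append({**row, "income": None if hidden else row["income"]})
    return masked


# MCAR: the same probability for every observation.
mcar = mask_income(people, lambda row: 0.2)

# MAR: the probability depends only on an observed variable (age).
mar = mask_income(people, lambda row: 0.4 if row["age"] > 50 else 0.1)

# MNAR: the probability depends on the value that goes missing (income).
mnar = mask_income(people, lambda row: 0.4 if row["income"] > 90 else 0.1)


def missing_share(rows):
    """Fraction of rows whose income is missing."""
    return sum(row["income"] is None for row in rows) / len(rows)


print(missing_share(mcar), missing_share(mar), missing_share(mnar))
```

In the MNAR case the observed incomes under-represent high earners, which is why simply dropping or averaging over those rows can bias the result.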

Missing Data Imputation

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models.

Missing data can be imputed in different ways for different types of data. For numerical variables, the most common imputation techniques are:

  • Mean / median imputation
  • Arbitrary value imputation
  • End of the tail imputation

For categorical variables, the common imputation techniques are:

  • Frequent category imputation
  • Adding a “missing” category

Some techniques can be used for both types of variables:

  • Complete case analysis
  • Adding a “missing” indicator
  • Random sample imputation

Complete Case Analysis

Complete case analysis (CCA), also called “list-wise deletion” of cases, consists of discarding observations where the value of any variable is missing. In other words, we analyze only those observations for which there is information in all of the variables in the dataset. This is suitable for categorical and numerical variables.

The assumption for this technique is that the data is missing completely at random (MCAR). It works best when no more than 5% of the observations contain missing data.
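Complete case analysis can be sketched in a few lines. This is a minimal plain-Python illustration; the records and the helper name complete_cases are hypothetical, and None is assumed to mark a missing value.

```python
def complete_cases(rows):
    """Keep only rows with no missing (None) value in any variable."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


# Hypothetical dataset with missing values in two different variables.
people = [
    {"age": 34, "city": "Leeds"},
    {"age": None, "city": "York"},
    {"age": 29, "city": None},
    {"age": 41, "city": "Bath"},
]

kept = complete_cases(people)

# Check how much data the deletion costs before committing to it.
share_dropped = 1 - len(kept) / len(people)
print(len(kept), share_dropped)
```

Here half of the rows are discarded, far above the 5% guideline, so CCA would be a poor choice for this toy dataset. With pandas, DataFrame.dropna() performs the same row-wise deletion.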

Mean & Median Imputation

Mean or median imputation consists of replacing all the occurrences of missing values within a variable by the mean or median of that variable. This is suitable for numerical variables.

Note that if the variable is normally distributed, the mean and the median are approximately the same, so either can be used. If the variable is skewed, the median is a better representation.

Assumption

  • Data is missing at random.
  • The missing observations most likely look like the majority of the observations in the variable.

When to use:

  • Data is missing completely at random.
  • No more than 5% of the values in the variable are missing.
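A minimal sketch of mean and median imputation, using only the standard library. The function name impute_numeric, the strategy parameter, and the sample ages are assumptions made for illustration.

```python
from statistics import mean, median


def impute_numeric(values, strategy="median"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical, right-skewed variable: one very large value (90).
ages = [22, 25, None, 31, 29, None, 90]

by_median = impute_numeric(ages, strategy="median")
by_mean = impute_numeric(ages, strategy="mean")
print(by_median)
print(by_mean)
```

The outlier pulls the mean well above the median, which illustrates why the median is preferred for skewed variables. When training a model, the fill value would normally be computed on the training set only and then reused on the test set.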

Arbitrary Value Imputation

Arbitrary value imputation consists of replacing all occurrences of missing values (NA) within a variable by an arbitrary value. Typical arbitrary values are 0, -1, 999, -999 or other combinations of 9s.

This type of imputation is suitable for numerical and categorical variables. For a categorical variable, the arbitrary value can be “missing”.

Assumptions:

Data is not missing at random.

Limitation

  • Distortion of the original variable distribution
  • Distortion of the original variance
  • Distortion of the covariance with the remaining variables of the dataset.
  • Need to be careful not to choose an arbitrary value too similar to the mean or median.
  • The higher the percentage of NA, the higher the distortions.
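The distortion of the variance listed above can be checked directly. The following sketch assumes None marks a missing value; the helper name impute_arbitrary, the fill value 999, and the sample data are illustrative choices.

```python
from statistics import pvariance


def impute_arbitrary(values, fill_value):
    """Replace every None with a fixed, user-chosen value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical numerical variable with two missing values.
ages = [22, 25, None, 31, 29, None, 35]
observed = [v for v in ages if v is not None]
filled = impute_arbitrary(ages, 999)

# The arbitrary value sits far from the data, so the variance inflates.
print(pvariance(observed), pvariance(filled))

# The same helper works for a categorical variable.
colours = impute_arbitrary(["red", None, "blue"], "missing")
print(colours)
```

The imputed values are easy to spot precisely because 999 lies far outside the observed range, which is the intent of this technique when the data is not missing at random.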

End of Tail Imputation

End of tail imputation is equivalent to arbitrary value imputation, except that the arbitrary value is selected automatically at the end of the variable's distribution. For example, if the variable is normally distributed, we can use the mean plus or minus 3 times the standard deviation; if the variable is skewed, we can use the IQR proximity rule. This type of imputation is suitable for numerical variables.
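Both rules can be sketched with the standard library. The function names and sample data below are hypothetical, and the IQR factor of 1.5 is an assumption (a larger factor such as 3 is sometimes used to land further out in the tail).

```python
from statistics import mean, quantiles, stdev


def tail_value_gaussian(observed, k=3):
    """Right-tail value for a roughly normal variable: mean + k * std."""
    return mean(observed) + k * stdev(observed)


def tail_value_iqr(observed, factor=1.5):
    """Right-tail value for a skewed variable: Q3 + factor * IQR."""
    q1, _, q3 = quantiles(observed, n=4)
    return q3 + factor * (q3 - q1)


def impute_end_of_tail(values, skewed=False):
    """Replace None with a value at the right end of the distribution."""
    observed = [v for v in values if v is not None]
    fill = tail_value_iqr(observed) if skewed else tail_value_gaussian(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with one missing value.
ages = [22, 25, 31, None, 29, 35, 28, 30]

filled_gaussian = impute_end_of_tail(ages)
filled_iqr = impute_end_of_tail(ages, skewed=True)
print(filled_gaussian[3], filled_iqr[3])
```

Only the right tail is shown here; the left tail would use the mean minus k standard deviations, or Q1 minus the factor times the IQR.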

Frequent Category Imputation

Frequent category imputation, or mode imputation, consists of replacing all occurrences of missing values (NA) within a variable by the mode, that is, the most frequent value. This is suitable for categorical variables.

Assumptions

  • Data is missing at random.
  • The missing observations most likely look like the mode.

Limitation

  • Distortion of the relation of the most frequent label with other variables in the dataset.
  • May lead to an over-representation of the most frequent label if there is a large number of NA.
  • The higher the percentage of NA, the higher the distortions are.
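A minimal sketch of mode imputation that also shows the over-representation effect. The helper name impute_mode and the sample data are assumptions for illustration; None marks a missing value.

```python
from collections import Counter


def impute_mode(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode_value, _ = Counter(observed).most_common(1)[0]
    return [mode_value if v is None else v for v in values]


# Hypothetical categorical variable with two missing values.
colours = ["red", "blue", None, "red", None, "green"]
filled = impute_mode(colours)

# Compare how often the most frequent label appears before and after.
before = Counter(v for v in colours if v is not None)
after = Counter(filled)
print(before["red"], after["red"])
```

The count of the most frequent label grows by exactly the number of missing values, so the more NA there are, the more that label dominates.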

Missing Category Imputation

This method consists of treating missing data as an additional label or category of the variable. The missing observations are then grouped under the newly created label “missing”.

Limitation

If the number of NA is too small, creating an additional category is in essence adding another rare label to the variable.
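The rare-label concern can be checked by looking at the share of each category after imputation. This is a sketch only; the helper names, the label "missing", and the sample data are illustrative assumptions.

```python
from collections import Counter


def add_missing_category(values, label="missing"):
    """Treat missing values (None) as their own category."""
    return [label if v is None else v for v in values]


def category_shares(values):
    """Return the share of observations falling in each category."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


# Hypothetical variable: 1 missing value out of 20 observations.
colours = ["red"] * 12 + ["blue"] * 7 + [None]
filled = add_missing_category(colours)
shares = category_shares(filled)
print(shares)
```

Here the new label covers only 5% of the observations, so it behaves like any other rare category and may need the same treatment as other rare labels.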

Random Sample Imputation

Random sample imputation consists in taking a random observation from the pool of available observations of the variable and using that randomly extracted value to fill the NA.

Assumption

Data is missing at random.

The idea is to replace the population of missing values with a population of values with the same distribution of the original variable.
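A minimal sketch of random sample imputation. The function name, the seed parameter, and the sample data are assumptions for illustration; the seed is fixed only so that the result is reproducible.

```python
import random


def impute_random_sample(values, seed=0):
    """Fill each None with a value drawn at random from the observed ones."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Works the same way for numerical and categorical variables.
ages = [22, None, 31, 29, None, 35]
colours = ["red", None, "blue", "red"]

filled_ages = impute_random_sample(ages)
filled_colours = impute_random_sample(colours)
print(filled_ages)
print(filled_colours)
```

Because every fill value is drawn from the observed pool, the imputed variable keeps roughly the same distribution as the original, unlike mean or mode imputation, which pile values onto a single point.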

Missing Indicator

A missing indicator is an additional binary variable, which indicates whether the data was missing for an observation (1) or not (0). This technique is suitable for both numerical and categorical variables.

  • The missing indicator is used together with methods that assume data is missing at random, for example mean, median, mode or random sample imputation.

Assumption

  • Data is NOT missing at random
  • Missing data are predictive.
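The sketch below combines a missing indicator with median imputation, as described above. The variable names (ages, age_na) and the helper name are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import median


def add_missing_indicator(values):
    """Return 1 where the value was missing (None) and 0 otherwise."""
    return [1 if v is None else 0 for v in values]


# Hypothetical numerical variable with two missing values.
ages = [22, None, 31, 29, None]

# Build the indicator BEFORE imputing, otherwise the information is lost.
age_na = add_missing_indicator(ages)

# Then impute the original variable, here with the median.
fill = median([v for v in ages if v is not None])
ages_imputed = [fill if v is None else v for v in ages]

# Both columns would be passed to the model together.
print(list(zip(ages_imputed, age_na)))
```

The model can then learn from the indicator whether missingness itself carries signal, while the imputed column stays complete.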