Variable Types & Characteristics

Variable
A variable is any characteristic, number, or quantity that can be measured or counted.
Variables are of two types:
- Numerical
- Categorical
Numerical variables can be classified into two categories.
- Discrete numerical: A variable whose values are whole numbers (counts) is called a discrete numerical variable. Examples: number of items bought by a customer in a supermarket, number of active bank accounts, number of pets in the family, number of children in the family, etc.
- Continuous numerical: A variable that may take any value within some range is called continuous; its values can include fractions. Examples: amount paid by a customer, house price, total debt as a percentage of total income.
The values of categorical variables are selected from a group of categories, also called labels. Examples: marital status, intended use of loan, mobile network provider, gender, etc.
Categorical variables can be classified into three categories:
- Nominal: Categorical variables whose labels have no intrinsic order. Examples: country of birth, postcode, vehicle make.
- Ordinal: Categorical variables whose categories can be meaningfully ordered. Examples: student's grade in an exam, days of the week, educational level, etc.
- Date/time: Datetime variables take dates and/or times as values. Examples: date of birth, date of application, time of accident, payment date.
Special cases: categorical variables whose categories are encoded as numbers (e.g. gender may be coded as 0 for male and 1 for female).
Id variables: numbers that uniquely identify each observation.
Mixed Variables: variables that contain both numbers and categories. There are two types of mixed variable:
- Numbers or labels in different observations: each observation shows either a number or a category among its values. Examples: number of credit accounts or number of missed payments, where values may be numbers (1-3) or codes (D-A).
- Numbers and labels in the same observation: each observation shows both numbers and categories in its values. Examples: cabin, ticket, vehicle registration.
It is possible to enrich a dataset dramatically by extracting information from the label part and the number part of a mixed variable, as in the sketch below.
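For instance, here is a minimal pandas sketch, assuming a hypothetical Titanic-style 'cabin' column that mixes a deck letter with a number:

```python
import pandas as pd

# Hypothetical mixed variable: 'cabin' values combine a letter and a number
df = pd.DataFrame({'cabin': ['C123', 'E46', 'B5', None, 'A10']})

# Split into a categorical part (deck letter) and a numerical part
df['cabin_label'] = df['cabin'].str.extract(r'([A-Za-z]+)', expand=False)
df['cabin_number'] = df['cabin'].str.extract(r'(\d+)', expand=False).astype(float)

print(df)
```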
It is possible to enrich a dataset dramatically by extracting information from the datetime variable.
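A quick sketch of the kind of features that can be derived, assuming a hypothetical 'date_of_application' column:

```python
import pandas as pd

# Hypothetical application dates
df = pd.DataFrame({'date_of_application': pd.to_datetime(
    ['2020-01-15', '2020-03-02', '2020-07-29'])})

s = df['date_of_application']
df['year'] = s.dt.year
df['month'] = s.dt.month
df['day_of_week'] = s.dt.dayofweek      # 0 = Monday
df['is_weekend'] = s.dt.dayofweek >= 5

print(df)
```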
Variable Characteristics
- Missing Data
- Categorical Variables
- Linear Model Assumption
- Distributions
- Outliers
- Feature magnitude
Missing Data
Missing data, or missing values, occur when no value is stored for a given observation in a variable. It is one of the most common occurrences in real datasets. For example, say we collected information for 1000 people: their name, age, address, etc. If 55 of them decline to share their address, those 55 data points are missing data.
- A value can be missing because it was forgotten, lost, or not stored properly.
Missing Data Mechanisms:
- MCAR (Missing Completely At Random)
- MAR (Missing At Random)
- MNAR (Missing Not At Random)
Missing Completely At Random (MCAR)
Missing completely at random (MCAR) is a missing data mechanism in which the probability of a value being missing is unrelated to both the observed data and the missing data.
The probability of being missing is the same for all observations: there is no relationship between the missing data and any other values, observed or missing, within the dataset. Disregarding those cases therefore does not bias the inferences made.
Missing At Random (MAR)
The probability of an observation being missing depends on available (observed) information. For example, if men are less likely to disclose their weight than women, missingness in 'weight' depends on the observed variable 'gender'.
Missing Not At Random (MNAR)
There is a mechanism or a reason why missing values are introduced in the dataset, related to the missing values themselves. For example, people with very high debt may be less willing to disclose their total debt.
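To make these ideas concrete, here is a small pandas sketch (with made-up data) that quantifies missing data per variable:

```python
import numpy as np
import pandas as pd

# Made-up survey data with some missing ages and addresses
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'age': [34, np.nan, 29, np.nan],
    'address': ['Hamburg', None, None, 'Berlin'],
})

# Fraction of missing values per variable
print(df.isnull().mean())

# Flag observations that have at least one missing value
df['has_missing'] = df.isnull().any(axis=1)
```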
Cardinality
The values of a categorical variable are selected from a group of categories and are also called labels. The number of different labels is also known as cardinality.
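In pandas, cardinality can be checked with `nunique()`; a tiny sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Hamburg', 'Berlin', 'Hamburg', 'Suderburg', 'Berlin']})

# Cardinality: the number of distinct labels in the variable
print(df['city'].nunique())   # 3
print(df['city'].unique())    # ['Hamburg' 'Berlin' 'Suderburg']
```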
Rare Labels
Rare labels are those that appear only in a tiny proportion of the observations in a dataset. For example, for the variable 'city' (where a German citizen lives), Hamburg is a frequent category while Suderburg is a rare one (few people live there).
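A minimal pandas sketch for flagging rare labels, using made-up city data and an arbitrary 10% frequency threshold:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Hamburg'] * 8 + ['Berlin'] * 6 + ['Suderburg']})

# Relative frequency of each label
freq = df['city'].value_counts(normalize=True)

# Labels that appear in less than 10% of observations (threshold is a choice)
rare_labels = freq[freq < 0.10].index.tolist()
print(rare_labels)   # ['Suderburg']
```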
Linear Model Assumptions
Linear models make the following assumptions about the independent variables (Xs):
- Linear relationship between the variables and the target
- Multivariate normality
- No or little collinearity
- Homoscedasticity
Normality
Variables follow a Gaussian distribution, which can be statistically tested, for example with the Kolmogorov-Smirnov test.
When a variable is not normally distributed, a non-linear transformation (for example, a log transformation) may fix the issue.
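As an illustration, a sketch using `scipy.stats.kstest` on a simulated skewed variable, before and after a log transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, non-Gaussian

# Kolmogorov-Smirnov test against a normal distribution
# (loc/scale are estimated from the data, so the p-value is only approximate)
stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std()))
print(f'raw: p = {p:.4f}')   # tiny p -> reject normality

# A log transformation often brings a right-skewed variable closer to Gaussian
x_log = np.log(x)
stat, p = stats.kstest(x_log, 'norm', args=(x_log.mean(), x_log.std()))
print(f'log: p = {p:.4f}')
```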
Multicollinearity
Multicollinearity occurs when the independent variables are correlated with each other. This can be assessed with a correlation matrix or the variance inflation factor (VIF).
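A sketch using statsmodels' `variance_inflation_factor` on simulated data, where `x2` is deliberately constructed to be correlated with `x1`:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=200)})
X['x2'] = 0.9 * X['x1'] + rng.normal(scale=0.3, size=200)  # correlated with x1
X['x3'] = rng.normal(size=200)                             # independent

Xc = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)   # values well above ~5-10 flag problematic collinearity
```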
Homoscedasticity
The error term, that is, the noise in the relationship between the independent variables X and the dependent variable y, has the same variance across all values of the independent variables.
In other words, the errors have the same finite variance at every level of the independent variables, also known as homogeneity of variance.
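One common check is the Breusch-Pagan test; here is a sketch with statsmodels on simulated data whose error variance deliberately grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = 2 * x + rng.normal(scale=1 + 0.5 * x)   # noise grows with x: heteroscedastic

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value is evidence of heteroscedasticity
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print(f'p-value: {lm_p:.4f}')
```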
Probability Distributions
A probability distribution is a function that describes the likelihood of obtaining the possible values that a variable can take. For example, for the variable height, the probability distribution describes how often we can get a value of 161 cm or 174 cm or 200 cm, etc.
Probability distributions are of two types:
- Discrete
  - Binomial
  - Poisson
- Continuous
  - Gaussian
  - Skewed
  - Many others
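As a quick illustration, NumPy can draw samples from these distributions (the variable names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete
heads = rng.binomial(n=10, p=0.5, size=5)    # successes out of 10 coin flips
calls = rng.poisson(lam=3, size=5)           # e.g. calls per hour

# Continuous
height = rng.normal(loc=170, scale=10, size=5)    # Gaussian, e.g. height in cm
income = rng.lognormal(mean=10, sigma=1, size=5)  # right-skewed, e.g. income

print(heads, calls, height.round(1), income.round(0), sep='\n')
```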
Outliers
An outlier is a data point that is significantly different from the remaining data. Depending on the case, outliers should either be given special attention or ignored completely. One of the most common approaches to detecting outliers is to calculate the quartiles and then the interquartile range (IQR):
IQR = 75th quartile - 25th quartile
Upper limit = 75th quartile + 1.5 * IQR
Lower limit = 25th quartile - 1.5 * IQR
Note: for extreme outliers, we can multiply the IQR by 3 instead of 1.5.
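A minimal NumPy sketch of the IQR rule on made-up data:

```python
import numpy as np

x = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])  # 95 looks suspicious

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25

lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)   # [95]
```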
Feature Magnitude
The regression coefficient is directly influenced by the scale of the variable.
Variables with bigger magnitude / value range dominate over the ones with smaller magnitude / value range.
Gradient descent converges faster when features are on similar scales.
Feature scaling helps decrease the time to find support vectors for SVMs.
Euclidean distances are sensitive to feature magnitude.
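For illustration, a sketch with scikit-learn's `StandardScaler` and `MinMaxScaler` on made-up income/age data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: income (tens of thousands) and age (tens)
X = np.array([[45_000, 23], [120_000, 54], [80_000, 31], [60_000, 40]], dtype=float)

# Standardisation: zero mean, unit variance per feature
print(StandardScaler().fit_transform(X))

# Min-max scaling: squeeze each feature into [0, 1]
print(MinMaxScaler().fit_transform(X))
```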