my notes: Missing data & Data imputation

Cheryl · 3 min read · Jan 14, 2020

1. Do nothing

  • Use algorithms that handle missing data natively (a short sketch follows each method below)
  • XGBoost: at each split, learns the best direction to send missing values based on the training loss reduction
  • LightGBM: has an input parameter (use_missing=false) to ignore missing values
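
A minimal sketch of the "do nothing" route on a toy NumPy array (the data is made up): XGBoost accepts NaN directly, so no imputation step is needed.

```python
import numpy as np
import xgboost as xgb

# Toy data containing NaN; no imputation step is performed.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# XGBoost treats NaN as "missing" and learns, per split, which branch
# missing values should follow based on the training loss reduction.
model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```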

2. Delete

  • Remove records (rows) with missing data when the percentage of affected rows is low (< 5%)
  • Remove an entire column when the percentage of missing data in it is high (> 50%) and the column is not an essential feature for the model
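
A minimal pandas sketch of both deletion strategies, with made-up column names and the thresholds from the bullets above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 28],
    "income": [50_000, 62_000, np.nan, 58_000, np.nan],
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Drop rows with any missing cell (use when the affected fraction is small).
df_rows_dropped = df.dropna(axis=0)

# Drop columns where more than 50% of the values are missing.
df_cols_dropped = df.loc[:, df.isna().mean() <= 0.5]
```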

3. Mean/ Median Imputation

  • Calculate the mean/median of each column from the non-missing values and fill the missing cells in that column with it
  • Fast and easy
  • Works only for numerical data
  • For data with outliers, use the median instead of the mean
  • Ignores interrelationships between features
  • Diminishes correlations between features (affects multivariate analysis)
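
A sketch with scikit-learn's SimpleImputer on toy data; the 9999 outlier illustrates why the median can be the safer choice:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 220.0],
              [4.0, 9999.0]])

# The mean of column 2 is pulled up by the 9999 outlier;
# the median is robust to it.
mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

X_mean = mean_imputer.fit_transform(X)
X_median = median_imputer.fit_transform(X)
```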

4. Mode/ Constant value Imputation

  • Mode: replace the missing cells with the most frequent value in the column
  • Constant: replace the missing cells with 0 or a user-defined constant value
  • Fast and easy
  • Works for numerical and categorical data
  • Diminishes correlations between features (affects multivariate analysis)
  • Can introduce bias
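
The same SimpleImputer also covers mode and constant imputation; a sketch on made-up categorical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red", "S"],
              ["blue", np.nan],
              [np.nan, "M"],
              ["red", "M"]], dtype=object)

# Mode: fill each column with its most frequent value.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

# Constant: fill with a user-defined placeholder value.
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_const = const_imputer.fit_transform(X)
```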

5. K-Nearest Neighbour Imputation

  • Missing values are predicted based on their similarity (feature similarity) to neighbouring points in the data set
  • The imputed value is computed from the k nearest neighbours: an average or distance-weighted average for continuous data, the most frequent value for categorical data
  • Rows with too many missing cells may have to be removed
  • The number of training samples must be greater than the number of nearest neighbours (k)
  • When more than one feature is missing, all the remaining features are used to find the neighbour donors
  • Works better for numerical data than categorical data
  • Creates a single model that can be used across all features
  • Slow for large data (scans all data to find the most similar points)
  • Computationally expensive (the entire dataset is stored in memory)
  • Requires choosing k
  • Sensitive to outliers
  • Accuracy decreases in higher dimensions, as the difference between the nearest and farthest neighbours diminishes
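
scikit-learn ships this as KNNImputer; a sketch on toy data, with k and the weighting scheme chosen arbitrarily:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing cell is filled from the k nearest rows, measured on the
# features that are present; weights="distance" gives closer rows more say.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
```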

6. Random Forest Imputation

  • Uses a random forest model to impute missing data, treating each column with missing values as the target variable and the remaining columns as predictors
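
scikit-learn has no dedicated random-forest imputer, but one common way to sketch the idea is IterativeImputer with a RandomForestRegressor as its estimator (a missForest-style setup, not necessarily what any specific implementation uses):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each column with missing values is modelled in turn as the target of a
# random forest fitted on the remaining columns.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```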

7. Multivariate Imputation by Chained Equation (MICE) Imputation

  • Multiple regression models are created sequentially, each using a different column with missing values as the target variable
  • A regression model is fitted on the predictor columns to impute the missing data in the current target
  • Once that target is imputed, another column with missing values is chosen as the target, and a model is fitted on the original plus imputed data
  • The process repeats until, at the end of one cycle, all columns with missing values have been filled by predictions from the regression models
  • The cycle is repeated n times (where n is user defined) or until the coefficients of the regression models converge
  • High accuracy
  • Works for Numerical and Categorical data
  • Measures uncertainty of missing values
  • Able to handle complexities in data such as bounds or skip patterns
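
scikit-learn's IterativeImputer is described in its docs as inspired by MICE; a sketch that draws several imputations with sample_posterior=True to get at the uncertainty mentioned above (toy data, arbitrary seeds):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# sample_posterior=True draws imputations from the predictive distribution;
# repeating with different seeds gives multiple imputations, whose spread
# reflects the uncertainty of the missing values.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10,
                     random_state=seed).fit_transform(X)
    for seed in range(5)
]
```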

8. Deep Learning Imputation

  • Build a neural network model to impute missing values for both categorical and numerical features
  • Preferred choice for categorical data
  • The model can handle categorical data through a feature encoder
  • Imputation is done on one column at a time, by specifying the features used for training and the target column to impute
  • Slow for large data
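
One library that follows this one-column-at-a-time pattern is Datawig; the sketch below assumes its SimpleImputer API and uses made-up column names, so treat it as illustrative rather than canonical:

```python
import pandas as pd
import datawig  # assumed installed; API recalled from its docs, verify before use

# Made-up training data: complete rows for fitting the imputer.
df_train = pd.DataFrame({
    "age": [25, 32, 41, 28, 36],
    "occupation": ["engineer", "teacher", "engineer", "nurse", "teacher"],
    "income": [72_000, 48_000, 85_000, 55_000, 51_000],
})
df_missing = df_train.drop(columns="income").assign(income=None)

# One column is imputed at a time: pick the input features and the target.
imputer = datawig.SimpleImputer(
    input_columns=["age", "occupation"],  # features used for training
    output_column="income",               # the column being imputed
    output_path="imputer_model",          # where model artifacts are saved
)
imputer.fit(train_df=df_train, num_epochs=20)
df_imputed = imputer.predict(df_missing)
```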

9. Extrapolation/ Interpolation

  • Interpolation estimates a missing value from other observations within the range of the known data points
  • Extrapolation estimates beyond the range of the data and requires stronger assumptions
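
A pandas sketch on a toy series; note that extrapolating past the last known point requires an explicit assumption (here, last observation carried forward):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Interpolation: fill the gap between the known points 1.0 and 4.0.
filled = s.interpolate(method="linear")

# Extrapolation past the last known point needs an explicit assumption;
# here, last observation carried forward.
filled = filled.ffill()
```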

10. Regression Imputation

  • Predict missing values (as the target variable) using a regression line fitted on relevant features from the data as predictors
  • Assumes a linear relationship between features
  • May restrict the variability and distribution of the data, since every imputed value lies exactly on the regression line
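
A sketch with synthetic data and scikit-learn's LinearRegression; every imputed value lands exactly on the fitted line, which is the variability restriction noted above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[::10] = np.nan  # knock out every 10th value

observed = ~np.isnan(y)
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Deterministic regression imputation: replace each missing y with its
# fitted value on the regression line.
y_imputed = y.copy()
y_imputed[~observed] = model.predict(x[~observed].reshape(-1, 1))
```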

11. Stochastic Regression Imputation

  • Similar to regression imputation with an additional residual term added to each prediction
  • The residual term is normally distributed with mean 0 and variance equal to the residual variance of the regression model
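
The same sketch as above with the stochastic residual added; the noise variance is estimated from the regression residuals (toy data as before):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[::10] = np.nan

observed = ~np.isnan(y)
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Estimate the residual variance from the observed data.
residuals = y[observed] - model.predict(x[observed].reshape(-1, 1))
sigma = residuals.std()

# Add N(0, sigma^2) noise to each prediction to preserve variability.
missing = ~observed
y_imputed = y.copy()
y_imputed[missing] = (model.predict(x[missing].reshape(-1, 1))
                      + rng.normal(scale=sigma, size=missing.sum()))
```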

12. Hot-Deck Imputation

  • Find a sample of points that are similar to the record with the missing value on the other variables, then randomly choose a donor value from that sample
  • Imputed values are restricted to the range observed in the sample
  • The random component adds variability back into the data
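
A hand-rolled pandas sketch (there is no standard library function for this); rows are grouped on a made-up "region" variable and a donor is drawn at random within each group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [52_000, np.nan, 48_000, 45_000, np.nan],
})

# For each missing cell, draw a random donor from rows that match on the
# grouping variable; imputed values stay within the observed range.
def hot_deck(group: pd.Series) -> pd.Series:
    donors = group.dropna()
    return group.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["income"] = df.groupby("region")["income"].transform(hot_deck)
```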
