1. Do nothing
- Use algorithms that handle missing data natively
- XGBoost: learns a default direction for missing values at each split, chosen by the training loss reduction
- LightGBM: provides an input parameter (use_missing=false) to ignore missing values
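A minimal sketch of "do nothing" in practice, assuming xgboost is installed; the toy data is illustrative and the NaNs are passed straight to the model:

```python
import numpy as np
import xgboost as xgb

# Toy data with missing cells left as NaN
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 6.0]])
y = np.array([0, 1, 0, 1])

# XGBoost learns a default branch for missing values at each split,
# picking the direction that gives the larger training-loss reduction
model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```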
2. Delete
- Remove records (rows) with missing data when the percentage of affected rows is low (< 5%)
- Remove entire columns when the percentage of missing data in the column is high (> 50%) and the column is not an essential feature for the model
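A sketch of both deletion rules with pandas; the file name `data.csv` and the exact thresholds are illustrative:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Fraction of missing values per column
missing_frac = df.isna().mean()

# Drop columns that are mostly missing (> 50%) and non-essential
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)

# Then drop the remaining rows with any missing cell
# (sensible when they make up < 5% of all rows)
df = df.dropna()
```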
3. Mean/ Median Imputation
- Calculate the mean/median of each column from the non-missing values and place it into the missing cells of that column (see the sketch after this list)
- Fast and easy
- Works only for numerical data
- For data with outliers, use the median instead of the mean
- Ignores interrelationships between features
- Diminishes correlations between features (affects multivariate analysis)
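A minimal sketch using scikit-learn's `SimpleImputer` on toy data; `strategy="median"` is the robust choice here because of the outlier 100.0:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [100.0, 40.0]])  # 100.0 is an outlier

# Median imputation: each NaN is replaced by its column's median
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
```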
4. Mode/ Constant value Imputation
- Replace the missing cells with the most frequent value occurring within the column
- Replace missing cells with 0 or a user-defined constant value (see the sketch after this list)
- Fast and easy
- Works for Numerical and Categorical data
- Diminishes correlations between features (affects multivariate analysis)
- Can introduce bias
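The same `SimpleImputer` covers both variants; the toy object array below is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red", 1.0],
              ["blue", np.nan],
              [np.nan, 3.0]], dtype=object)

# Mode imputation: works for categorical and numerical columns alike
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

# Constant imputation: fill every missing cell with a user-defined value
const_imputer = SimpleImputer(strategy="constant", fill_value=0)
X_const = const_imputer.fit_transform(X)
```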
5. K-Nearest Neighbour Imputation
- Missing values are predicted based on the record's similarity (feature similarity) to neighbouring points in the data set (see the sketch after this list)
- Values from the k neighbouring points are combined by an average or distance-weighted average (for continuous data) or the most frequent value (for categorical data)
- Rows may have to be removed when too many of their cells are missing
- The number of training records must be greater than k, the number of nearest neighbours
- When more than one feature is missing, all the remaining features are used to find the neighbouring donors
- Works better for numerical data than categorical data
- Creates a single model that can be used across all features
- Slow for large data (scan all data to find most similar)
- Computationally expensive (entire dataset is stored in memory)
- Requires choosing the value of k
- Sensitive to outliers
- Accuracy decreases in high dimensions, as the difference between the nearest and farthest neighbours diminishes (the curse of dimensionality)
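A sketch with scikit-learn's `KNNImputer` on toy numerical data; k and the weighting scheme are the knobs mentioned above:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# k = 2 (< number of rows); weights="distance" gives closer
# neighbours more influence on the imputed value
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
```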
6. Random Forest Imputation
- Uses a random forest model to impute missing data, treating each column with missing values in turn as the target variable (as in the missForest algorithm)
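One way to realise this in scikit-learn (a sketch on toy data): `IterativeImputer` with a `RandomForestRegressor` as the estimator, which mirrors the missForest approach:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each column with missing values becomes the target of a random
# forest trained on the remaining columns
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```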
7. Multivariate Imputation by Chained Equation (MICE) Imputation
- Multiple regression models are built sequentially, each using a different column with missing values as the target variable
- A regression model is fitted on the other columns (the predictors) to impute the missing data in the current target
- Once the target column is imputed, another column with missing values is chosen as the target and a model is fitted on the original plus previously imputed data
- This continues until, at the end of one cycle, all columns with missing values have been filled with predictions from the regression models
- The cycle is repeated n times (where n is user-defined) or until the regression coefficients converge (see the sketch after this list)
- High accuracy
- Works for Numerical and Categorical data
- Measures uncertainty of missing values
- Able to handle complexities in data such as bounds or skip patterns
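A sketch using scikit-learn's `IterativeImputer`, which implements MICE-style chained equations; the toy data and parameter values are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Columns with missing values are regressed on the others in round-robin
# cycles: max_iter is the user-defined n, tol the convergence threshold.
# sample_posterior=True draws imputations from the predictive distribution,
# which is what lets MICE express the uncertainty of missing values.
mice = IterativeImputer(max_iter=10, tol=1e-3,
                        sample_posterior=True, random_state=0)
X_imputed = mice.fit_transform(X)
```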
8. Deep Learning Imputation
- Build a neural network model to impute missing values for both categorical and numerical features
- Preferred choice for categorical data
- The model handles categorical data using a feature encoder
- Imputation is done on a single column at a time, specifying which features are used for training (with that column as the target variable); see the sketch after this list
- Slow for large data
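A sketch assuming the Datawig library; the column names (`feature_a`, `feature_b`, `feature_with_missing`) and the DataFrames `df_train`/`df_test` are hypothetical:

```python
import datawig

# Datawig trains a neural network per output column and encodes
# categorical input features automatically
imputer = datawig.SimpleImputer(
    input_columns=["feature_a", "feature_b"],  # features used for training
    output_column="feature_with_missing",      # the single column to impute
    output_path="imputer_model",               # where the model is stored
)
imputer.fit(train_df=df_train)         # df_train: assumed pandas DataFrame
df_imputed = imputer.predict(df_test)  # df_test: assumed pandas DataFrame
```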
9. Extrapolation/ Interpolation
- Interpolation is estimating the missing value based on other observations within the range of a set of known data points
- Extrapolation is estimating beyond the range of the data and requires more assumptions
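A minimal interpolation sketch with pandas on a toy series; true extrapolation beyond the observed range would need a fitted model instead (e.g. `scipy.interpolate.interp1d` with `fill_value="extrapolate"`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# Linear interpolation estimates each gap from its neighbours
s_filled = s.interpolate(method="linear")  # -> [1.0, 2.0, 3.0, 5.0, 7.0]
```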
10. Regression Imputation
- Predict the missing values (as the target variable) using a fitted regression line, with relevant features from the data as predictors
- Assumes a linear relationship between the features
- May affect/restrict variability and distribution of data
11. Stochastic Regression Imputation
- Similar to regression imputation with an additional residual term added to each prediction
- The residual term is normally distributed with mean 0 and variance equal to the residual variance of the fitted regression model (see the sketch below, which covers both variants)
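A sketch covering items 10 and 11 on synthetic data: fit a line on the complete cases, impute deterministically, then add a residual term drawn from the model's residual variance for the stochastic variant:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))            # predictor (fully observed)
y = 2 * X[:, 0] + rng.normal(size=100)   # target, about to lose some values
y[rng.choice(100, size=10, replace=False)] = np.nan

observed = ~np.isnan(y)
model = LinearRegression().fit(X[observed], y[observed])

# Regression imputation: plug predictions straight into the gaps
y_det = y.copy()
y_det[~observed] = model.predict(X[~observed])

# Stochastic regression imputation: add noise ~ N(0, residual variance)
residual_var = np.var(y[observed] - model.predict(X[observed]))
y_stoch = y.copy()
y_stoch[~observed] = (model.predict(X[~observed])
                      + rng.normal(0.0, np.sqrt(residual_var),
                                   size=(~observed).sum()))
```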
12. Hot-Deck Imputation
- Find a sample of records that are similar to the record with the missing value on the other variables, then randomly choose a donor value from that sample (see the sketch after this list)
- Imputation is restricted to the range of the sample
- The random component increases variability in the data
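A minimal hot-deck sketch with pandas; the `group` column stands in for the matching variables and the data is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, np.nan, 10.0, np.nan, 12.0],
})

# For each missing cell, draw a random donor from the observed
# values within the same group of similar records
def hot_deck(s: pd.Series) -> pd.Series:
    donors = s.dropna().to_numpy()
    return s.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["value"] = df.groupby("group")["value"].transform(hot_deck)
```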