my notes: Missing data & Data imputation

Cheryl · 3 min read · Jan 14, 2020

1. Do nothing

  • Use algorithms that handle missing data natively (a short sketch follows each method below)
  • XGBoost: at each split, learns the best direction to send missing values based on the training loss reduction
  • LightGBM: has an input parameter (use_missing=false) to ignore missing values
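
A minimal sketch of the "do nothing" route on a toy NumPy array (the data is made up): XGBoost accepts NaN directly, so no imputation step is needed.

```python
import numpy as np
import xgboost as xgb

# Toy data containing NaN; no imputation step is performed.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# XGBoost treats NaN as "missing" and learns, per split, which branch
# missing values should follow based on the training loss reduction.
model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```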

2. Delete

  • Remove records (rows) with missing data when the percentage of affected rows is low (< 5%)
  • Remove an entire column when the percentage of missing data in it is high (> 50%) and the column is not an essential feature for the model
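
A minimal pandas sketch of both deletion strategies, with made-up column names and the thresholds from the bullets above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 28],
    "income": [50_000, 62_000, np.nan, 58_000, np.nan],
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Drop rows with any missing cell (use when the affected fraction is small).
df_rows_dropped = df.dropna(axis=0)

# Drop columns where more than 50% of the values are missing.
df_cols_dropped = df.loc[:, df.isna().mean() <= 0.5]
```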

3. Mean/ Median Imputation

  • Calculate the mean/median of each column from the non-missing values and fill the missing cells in that column with it
  • Fast and easy
  • Works only for numerical data
  • For data with outliers, use the median instead of the mean
  • Ignores interrelationships between features
  • Diminishes correlations between features (affects multivariate analysis)
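
A sketch with scikit-learn's SimpleImputer on toy data; the 9999 outlier illustrates why the median can be the safer choice:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 220.0],
              [4.0, 9999.0]])

# The mean of column 2 is pulled up by the 9999 outlier;
# the median is robust to it.
mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

X_mean = mean_imputer.fit_transform(X)
X_median = median_imputer.fit_transform(X)
```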

4. Mode/ Constant value Imputation

  • Mode: replace the missing cells with the most frequent value in the column
  • Constant: replace the missing cells with 0 or a user-defined constant value
  • Fast and easy
  • Works for numerical and categorical data
  • Diminishes correlations between features (affects multivariate analysis)
  • Can introduce bias
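
The same SimpleImputer also covers mode and constant imputation; a sketch on made-up categorical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red", "S"],
              ["blue", np.nan],
              [np.nan, "M"],
              ["red", "M"]], dtype=object)

# Mode: fill each column with its most frequent value.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

# Constant: fill with a user-defined placeholder value.
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_const = const_imputer.fit_transform(X)
```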

5. K-Nearest Neighbour Imputation

  • Missing values are predicted based on their similarity (feature similarity) to neighbouring points in the data set
  • The imputed value is computed from the k nearest neighbours: an average or distance-weighted average for continuous data, the most frequent value for categorical data
  • Rows with too many missing cells may have to be removed
  • The number of training samples must be greater than the number of nearest neighbours (k)
  • When more than one feature is missing, all the remaining features are used to find the neighbour donors
  • Works better for numerical data than categorical data
  • Creates a single model that can be used across all features
  • Slow for large data (scans all data to find the most similar points)
  • Computationally expensive (the entire dataset is stored in memory)
  • Requires choosing k
  • Sensitive to outliers
  • Accuracy decreases in higher dimensions, as the difference between the nearest and farthest neighbours diminishes
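
scikit-learn ships this as KNNImputer; a sketch on toy data, with k and the weighting scheme chosen arbitrarily:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing cell is filled from the k nearest rows, measured on the
# features that are present; weights="distance" gives closer rows more say.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
```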

6. Random Forest Imputation

  • Uses a random forest model to impute missing data, treating each column with missing values as the target variable and the remaining columns as predictors
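
scikit-learn has no dedicated random-forest imputer, but one common way to sketch the idea is IterativeImputer with a RandomForestRegressor as its estimator (a missForest-style setup, not necessarily what any specific implementation uses):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each column with missing values is modelled in turn as the target of a
# random forest fitted on the remaining columns.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```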

7. Multivariate Imputation by Chained Equation (MICE) Imputation

  • Multiple regression models are created sequentially, each using a different column with missing values as the target variable
  • A regression model is fitted on the predictor columns to impute the missing data in the current target
  • Once that target is imputed, another column with missing values is chosen as the target, and a model is fitted on the original plus imputed data
  • The process repeats until, at the end of one cycle, all columns with missing values have been filled by predictions from the regression models
  • The cycle is repeated n times (where n is user defined) or until the coefficients of the regression models converge
  • High accuracy
  • Works for Numerical and Categorical data
  • Measures uncertainty of missing values
  • Able to handle complexities in data such as bounds or skip patterns
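
scikit-learn's IterativeImputer is described in its docs as inspired by MICE; a sketch that draws several imputations with sample_posterior=True to get at the uncertainty mentioned above (toy data, arbitrary seeds):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# sample_posterior=True draws imputations from the predictive distribution;
# repeating with different seeds gives multiple imputations, whose spread
# reflects the uncertainty of the missing values.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10,
                     random_state=seed).fit_transform(X)
    for seed in range(5)
]
```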

8. Deep Learning Imputation

  • Build a neural network model to impute missing values for both categorical and numerical features
  • Preferred choice for categorical data
  • The model can handle categorical data through a feature encoder
  • Imputation is done on one column at a time, by specifying the features used for training and the target column to impute
  • Slow for large data
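
One library that follows this one-column-at-a-time pattern is Datawig; the sketch below assumes its SimpleImputer API and uses made-up column names, so treat it as illustrative rather than canonical:

```python
import pandas as pd
import datawig  # assumed installed; API recalled from its docs, verify before use

# Made-up training data: complete rows for fitting the imputer.
df_train = pd.DataFrame({
    "age": [25, 32, 41, 28, 36],
    "occupation": ["engineer", "teacher", "engineer", "nurse", "teacher"],
    "income": [72_000, 48_000, 85_000, 55_000, 51_000],
})
df_missing = df_train.drop(columns="income").assign(income=None)

# One column is imputed at a time: pick the input features and the target.
imputer = datawig.SimpleImputer(
    input_columns=["age", "occupation"],  # features used for training
    output_column="income",               # the column being imputed
    output_path="imputer_model",          # where model artifacts are saved
)
imputer.fit(train_df=df_train, num_epochs=20)
df_imputed = imputer.predict(df_missing)
```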

9. Extrapolation/ Interpolation

  • Interpolation estimates a missing value from other observations within the range of the known data points
  • Extrapolation estimates beyond the range of the data and requires stronger assumptions
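
A pandas sketch on a toy series; note that extrapolating past the last known point requires an explicit assumption (here, last observation carried forward):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Interpolation: fill the gap between the known points 1.0 and 4.0.
filled = s.interpolate(method="linear")

# Extrapolation past the last known point needs an explicit assumption;
# here, last observation carried forward.
filled = filled.ffill()
```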

10. Regression Imputation

  • Predict missing values (as the target variable) using a regression line fitted on relevant features from the data as predictors
  • Assumes a linear relationship between features
  • May restrict the variability and distribution of the data, since every imputed value lies exactly on the regression line
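
A sketch with synthetic data and scikit-learn's LinearRegression; every imputed value lands exactly on the fitted line, which is the variability restriction noted above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[::10] = np.nan  # knock out every 10th value

observed = ~np.isnan(y)
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Deterministic regression imputation: replace each missing y with its
# fitted value on the regression line.
y_imputed = y.copy()
y_imputed[~observed] = model.predict(x[~observed].reshape(-1, 1))
```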

11. Stochastic Regression Imputation

  • Similar to regression imputation with an additional residual term added to each prediction
  • The residual term is normally distributed with mean 0 and variance equal to the residual variance of the regression model
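
The same sketch as above with the stochastic residual added; the noise variance is estimated from the regression residuals (toy data as before):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[::10] = np.nan

observed = ~np.isnan(y)
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Estimate the residual variance from the observed data.
residuals = y[observed] - model.predict(x[observed].reshape(-1, 1))
sigma = residuals.std()

# Add N(0, sigma^2) noise to each prediction to preserve variability.
missing = ~observed
y_imputed = y.copy()
y_imputed[missing] = (model.predict(x[missing].reshape(-1, 1))
                      + rng.normal(scale=sigma, size=missing.sum()))
```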

12. Hot-Deck Imputation

  • Find a sample of points that are similar to the record with the missing value on the other variables, then randomly choose a donor value from that sample
  • Imputed values are restricted to the range observed in the sample
  • The random component adds variability back into the data
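
A hand-rolled pandas sketch (there is no standard library function for this); rows are grouped on a made-up "region" variable and a donor is drawn at random within each group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [52_000, np.nan, 48_000, 45_000, np.nan],
})

# For each missing cell, draw a random donor from rows that match on the
# grouping variable; imputed values stay within the observed range.
def hot_deck(group: pd.Series) -> pd.Series:
    donors = group.dropna()
    return group.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["income"] = df.groupby("region")["income"].transform(hot_deck)
```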
