study notes: Handling Skewed data for Machine Learning models

Cheryl
2 min read · Jan 21, 2020


Data is skewed when its distribution curve is asymmetrical (compared to a normal distribution curve, which is perfectly symmetrical), and skewness is the measure of that asymmetry. The skewness of a normal distribution is 0.

There are two types of skew in data: left (negative) and right (positive).

Effects of skewed data: skew degrades a model's ability (especially for regression-based models) to describe typical cases, because the model has to deal with rare cases at extreme values. For example, a model trained on right-skewed data will predict better on data points with lower values than on those with higher values. Skewed data also does not work well with many statistical methods. However, tree-based models are not affected.

To ensure that the machine learning model's capability is not affected, skewed data has to be transformed to approximate a normal distribution. The method used to transform the skewed data depends on the characteristics of the data.

To check for skew in data:

df.skew().sort_values(ascending=False)
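A minimal sketch of this check on a toy DataFrame (the column names and generated data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: an exponential column is strongly right-skewed,
# a normal column should have skewness near 0.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=2.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

skews = df.skew().sort_values(ascending=False)
print(skews)
```

Columns with skewness well above 0 (or well below 0) are candidates for the transformations below.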

Dealing with skewed data:

1. Log transformation: transforms a right-skewed distribution toward a normal distribution

  • Not able to log 0 or negative values (add a constant to all values first so that every value is positive)
# Log transform a single column
df['col1'] = np.log(df['col1'])

# Log transform multiple columns in a dataframe
df[['col1', 'col2']] = df[['col1', 'col2']].apply(np.log)
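One way to handle the zero-value caveat above is np.log1p, which computes log(1 + x) and so keeps zeros finite. A small sketch (the column name is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0.0, 1.0, 10.0, 100.0]})

# log1p(x) = log(1 + x): zero maps to zero instead of -inf.
# For data containing negative values, shift by a constant first
# so that everything passed in is non-negative.
df["col1_log"] = np.log1p(df["col1"])
```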

2. Remove outliers
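One common recipe for this (not prescribed by the notes above) is the 1.5 × IQR rule; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 2, 3, 3, 3, 4, 4, 100]})

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["col1"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = df[df["col1"].between(lower, upper)]
```

Here the extreme value 100 falls outside the upper fence and is dropped.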

3. Normalize (min-max)
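Min-max normalization rescales a column into [0, 1]; a minimal sketch (column name made up):

```python
import pandas as pd

df = pd.DataFrame({"col1": [5.0, 10.0, 15.0, 20.0]})

# (x - min) / (max - min) maps the smallest value to 0, the largest to 1
rng = df["col1"].max() - df["col1"].min()
df["col1_scaled"] = (df["col1"] - df["col1"].min()) / rng
```

Note that min-max scaling changes the range but not the shape of the distribution, so it does not remove skew on its own.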

4. Cube root: useful when values are very large; can be applied to negative values

5. Square root: applied only to positive values

6. Reciprocal

7. Square: apply to left-skewed data
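Transforms 4-7 are all one-liners in NumPy; a quick sketch with made-up values:

```python
import numpy as np

x = np.array([-8.0, 0.0, 1.0, 8.0])   # cube root handles negatives and zero
pos = np.array([1.0, 4.0, 9.0])       # sqrt and reciprocal need positive values

cbrt_x = np.cbrt(x)       # 4. cube root: compresses large magnitudes
sqrt_pos = np.sqrt(pos)   # 5. square root: positive values only
recip = 1.0 / pos         # 6. reciprocal: note it reverses the ordering
squared = pos ** 2        # 7. square: stretches the right tail (for left skew)
```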

8. Box-Cox transformation: transforms non-normal data to approximate a normal distribution using eqn 1 below, with λ in [-5, 5]. The λ that gives the best approximation to a normal distribution is selected.

eqn 1: y(λ) = (y^λ − 1) / λ for λ ≠ 0, and y(λ) = log(y) for λ = 0

eqn 1 only works for positive values of y. Use eqn 2 when there are negative values of y.

eqn 2 (two-parameter Box-Cox): y(λ) = ((y + λ₂)^λ₁ − 1) / λ₁ for λ₁ ≠ 0, and y(λ) = log(y + λ₂) for λ₁ = 0, where the shift λ₂ is chosen so that y + λ₂ > 0
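A self-contained sketch of eqn 1 plus the λ selection described above, using a plain grid search over [-5, 5] that picks the λ with skewness closest to 0 (libraries such as SciPy select λ by maximum likelihood instead; the data here are made up):

```python
import numpy as np
import pandas as pd

def boxcox(y, lam):
    # eqn 1: (y**lam - 1)/lam for lam != 0, log(y) for lam == 0; needs y > 0
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=1000) + 0.01   # strictly positive, right-skewed

# Pick the lambda in [-5, 5] whose transform is closest to symmetric
lams = np.linspace(-5, 5, 101)
best = min(lams, key=lambda l: abs(pd.Series(boxcox(y, l)).skew()))
transformed = boxcox(y, best)
```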

Skewness in the target variable (i.e. class imbalance in classification): use undersampling, oversampling or SMOTE.
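SMOTE lives in the imbalanced-learn package; a dependency-free version of the simpler option, random oversampling of the minority class, can be sketched in plain pandas (the column names and 8:2 split here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,   # 8:2 imbalance
})

# Resample minority-class rows (with replacement) up to the majority count
counts = df["target"].value_counts()
minority, majority = counts.idxmin(), counts.idxmax()
extra = df[df["target"] == minority].sample(
    n=counts[majority] - counts[minority], replace=True, random_state=0
)
balanced = pd.concat([df, extra], ignore_index=True)
```

Unlike SMOTE, this only duplicates existing minority rows rather than synthesizing new ones, so it can encourage overfitting on those rows.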
