In boosting, an ensemble of weak learners, e.g. a model that simply predicts the mean or a shallow decision tree (DT), is used to form a strong learner.
These weak learners are restricted in depth and size, which reduces their complexity. Each model learns from the error (residual) made by the previous model. An ensemble consists of n weak learners, and the final model is a function of all of them. The two main ensemble techniques combine weak learners differently: in bagging, the predictions of independently trained weak learners are averaged, while in boosting, the weak learners are trained sequentially, each one correcting the errors of the model so far. Ensemble techniques reduce the high variance on new data that is characteristic of individual DTs.
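To make the "average vs. sequential" contrast concrete, here is a minimal bagging sketch; scikit-learn's DecisionTreeRegressor and the toy data are assumptions for illustration, and the boosting side is developed step by step in the rest of this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # made-up explanatory variables
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # made-up target

# Bagging: fit each shallow tree on a bootstrap sample, then average the predictions.
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap resample
    trees.append(DecisionTreeRegressor(max_depth=2).fit(X[idx], y[idx]))
bagging_prediction = np.mean([t.predict(X) for t in trees], axis=0)

# Boosting instead adds sequentially trained trees, each fit to the previous
# residuals; that procedure is walked through below.
```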
The model is trained on the explanatory variables and predicts the target variable (Age).
Using this weak learner, the initial prediction is taken to be the mean of the target variable. From this predicted value, the residual can be calculated as residual = y − ŷ (it can be positive or negative).
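As a concrete illustration of this first step, here is a small sketch with made-up Age values (the actual table in this article differs); it computes the initial prediction as the mean and then Residual 1.

```python
import numpy as np

# Hypothetical toy target values, for illustration only.
age = np.array([13, 25, 32, 48, 60], dtype=float)

mean_age = age.mean()            # initial prediction: mean of the target
residual_1 = age - mean_age      # residual = y - y_hat (positive or negative)

print(mean_age)                  # 35.6
print(residual_1)                # [-22.6 -10.6  -3.6  12.4  24.4]
```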
A second weak learner will be built using Residual 1 as the target variable.
A scaling factor, called the learning rate, is applied to each of these models. The learning rate (between 0 and 1) prevents high variance in the prediction by scaling (limiting) the contribution of each tree.
The predicted results are shown in the column Prediction 2. Combining these with the column Mean Age gives the predicted results in the column Combine.
Note that the values in Prediction 2 are obtained by taking the mean of the values (if there are several) that fall within the same terminal node.
The new residual values are calculated as Age − Combine to obtain the column Residual 2. A new weak learner is then created and trained on Residual 2.
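One boosting step along these lines might look like the sketch below. The feature column, the Age values, the learning rate of 0.1, and the use of scikit-learn's DecisionTreeRegressor are all assumptions for illustration, not the article's actual table.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical rows: one made-up feature column and the Age target from above.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
age = np.array([13, 25, 32, 48, 60], dtype=float)

mean_age = age.mean()
residual_1 = age - mean_age
learning_rate = 0.1

# Second weak learner: a shallow tree trained on Residual 1.
tree_1 = DecisionTreeRegressor(max_depth=2).fit(X, residual_1)
prediction_2 = tree_1.predict(X)    # terminal-node means of Residual 1 ("Prediction 2")

combine = mean_age + learning_rate * prediction_2   # "Combine" column
residual_2 = age - combine                           # "Residual 2", the next tree's target
```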
The steps are repeated until the maximum number of iterations is reached, the error function stops changing (i.e. additional trees fail to improve the fit), or the residuals become very small.
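Putting the steps together, a minimal from-scratch boosting loop under the assumptions above (shallow scikit-learn trees as weak learners, a 0.1 learning rate, a made-up tolerance for the stopping checks) could look like this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1, tol=1e-4):
    """Illustrative gradient-boosting loop for regression."""
    initial = y.mean()
    prediction = np.full(len(y), initial)
    trees = []
    prev_error = np.inf
    for _ in range(n_trees):                        # stop at max iterations...
        residual = y - prediction
        if np.abs(residual).max() < tol:            # ...or when residuals are very small...
            break
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        new_prediction = prediction + learning_rate * tree.predict(X)
        error = np.mean((y - new_prediction) ** 2)
        if prev_error - error < tol:                # ...or when another tree fails to improve the fit
            break
        prediction, prev_error = new_prediction, error
        trees.append(tree)
    return initial, trees

def predict(initial, trees, X, learning_rate=0.1):
    """Combine the mean with the scaled contribution of each weak learner."""
    pred = np.full(len(X), initial)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```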
Note: the weak learner built at each iteration can be different (i.e. the splits and features used can differ).
Hyperparameters to tune (a usage sketch follows this list):
- min_samples_split: minimum number of samples a node must have before it can be split
- min_samples_leaf: minimum number of samples in a terminal/leaf node
- max_depth: maximum depth of the tree; a high value can lead to overfitting, so tune it with cross-validation
- max_leaf_nodes: maximum number of terminal/leaf nodes; use either this or max_depth, usually between 8 and 32
- max_features: number of features considered (chosen at random) when splitting, usually the square root of the total number of features (or 30–40%); the tree is not built using all features, only a subset at each split
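These names match scikit-learn's tree and gradient-boosting parameters; assuming that library, a tuning sketch could look like the following. The grid values are illustrative assumptions, not recommendations, and X_train / y_train are placeholders for your data.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Two sub-grids so that either max_depth or max_leaf_nodes is tuned, not both.
common = {
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 0.3],    # square root of feature count, or ~30% of features
}
param_grid = [
    {"max_depth": [2, 3, 4], **common},
    {"max_leaf_nodes": [8, 16, 32], **common},
]

search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5)
# search.fit(X_train, y_train)       # placeholders: fit on your own training data
```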