The trees in AdaBoost (and classical boosting) are built sequentially: each new tree focuses on the training examples that the previous tree misclassified, by giving those examples more weight. Essentially, each sequential tree works on reducing the errors of its predecessors, and the learning is carried forward as we move through the iterations.
The results of the final model encompass the learnings of all the weak learners.
Gradient boosting works a little differently. In GBM, learning happens by optimising a loss function (just like how a regression model tries to minimise the sum of squared errors). Too much jargon!
Maybe an example might help in this case:
Distance in miles (mi), time in hours (h)

Distance | Time | Speed |
---|---|---|
5 mi | 1 h | 5 mi/h |
6 mi | 2 h | 3 mi/h |
3 mi | 0.5 h | 6 mi/h |
Let’s say we don’t know how to calculate speed given the time and distance of a vehicle, so we have to take the help of the gradient boosting algorithm (yeah! because calculators aren’t available; can any scenario be more hypothetical than this!!)
As a first step, let’s estimate the speed with a single constant value, say 4 (in practice, GBM typically initialises with the mean of the target).
Distance | Time | Speed | Estimated Speed |
---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h |
6 mi | 2 h | 3 mi/h | 4 mi/h |
3 mi | 0.5 h | 6 mi/h | 4 mi/h |
From this estimate, we’ll calculate the first set of residuals (actual - predicted):
Distance | Time | Speed | Estimated Speed | First Residual |
---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 |
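The residual step is just an elementwise subtraction; a minimal sketch in Python, using the values straight from the table:

```python
# First residuals: actual speed minus the current (constant) estimate.
speed = [5.0, 3.0, 6.0]        # actual speeds
estimated = [4.0, 4.0, 4.0]    # initial constant estimate
first_residual = [a - e for a, e in zip(speed, estimated)]
print(first_residual)  # [1.0, -1.0, 2.0]
```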
Now, as the next step, GBM will use the first residual as the target, i.e. it will use the Distance and Time columns to predict the First Residual column. This is the first residual model, and Predicted Residual 1 is that model’s prediction of the residual column.
Distance | Time | Speed | Estimated Speed | First Residual | Predicted Residual 1 |
---|---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 | 1.5 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 | -1.2 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 | 2.4 |
After this first model, the new estimated speed will be:
New Estimate = Estimated Speed + learning rate * Predicted Residual 1
If the learning rate is 0.1, then the new estimate for the first row will be 4 + 0.1 * 1.5 = 4.15, which is a little closer to the actual speed of 5 than 4 is.
Distance | Time | Speed | Estimated Speed | First Residual | Predicted Residual 1 | New Estimate |
---|---|---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 | 1.5 | 4.15 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 | -1.2 | 3.88 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 | 2.4 | 4.24 |
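The update step in the table can be sketched directly; the predicted residuals here are the hypothetical residual-model outputs from the example:

```python
# New Estimate = Estimated Speed + learning_rate * Predicted Residual 1
learning_rate = 0.1
estimated = [4.0, 4.0, 4.0]
predicted_residual_1 = [1.5, -1.2, 2.4]   # hypothetical tree outputs
new_estimate = [e + learning_rate * p
                for e, p in zip(estimated, predicted_residual_1)]
print([round(x, 2) for x in new_estimate])  # [4.15, 3.88, 4.24]
```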
Use the New Estimate values to calculate the second residuals (Speed - New Estimate).
Distance | Time | Speed | Estimated Speed | First Residual | Predicted Residual 1 | New Estimate | Second Residual |
---|---|---|---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 | 1.5 | 4.15 | 0.85 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 | -1.2 | 3.88 | -0.88 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 | 2.4 | 4.24 | 1.76 |
As you can see, the second residuals are smaller (in absolute value) than the first residuals.
Now, for the second residual model, use the second residual as the target and predict a value.
Distance | Time | Speed | Estimated Speed | First Residual | Predicted Residual 1 | New Estimate | Second Residual | Predicted Residual 2 |
---|---|---|---|---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 | 1.5 | 4.15 | 0.85 | 0.8 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 | -1.2 | 3.88 | -0.88 | -0.6 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 | 2.4 | 4.24 | 1.76 | 2 |
Again, use New Estimate + learning rate * Predicted Residual 2 to update the estimated speed; let’s call it New Estimate 2.
Distance | Time | Speed | Estimated Speed | First Residual | Predicted Residual 1 | New Estimate | Second Residual | Predicted Residual 2 | New Estimate 2 |
---|---|---|---|---|---|---|---|---|---|
5 mi | 1 h | 5 mi/h | 4 mi/h | 1 | 1.5 | 4.15 | 0.85 | 0.8 | 4.23 |
6 mi | 2 h | 3 mi/h | 4 mi/h | -1 | -1.2 | 3.88 | -0.88 | -0.6 | 3.82 |
3 mi | 0.5 h | 6 mi/h | 4 mi/h | 2 | 2.4 | 4.24 | 1.76 | 2 | 4.44 |
Again, find the Third Residual by subtracting New Estimate 2 from Speed, and use the third residual as the target that your third residual model will predict.
This keeps going until a stopping criterion is reached, such as the number of iterations, the depth of a tree, a time limit, etc.
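The whole loop above can be sketched in plain Python. This is a toy version under stated assumptions: the “weak learner” is a single hard-coded stump that splits on Time at a hypothetical threshold of 1.5 h (a real GBM would fit a small regression tree on all features at every step), and the stopping criterion is a fixed number of iterations:

```python
# Toy gradient-boosting loop for squared-error loss.
# Assumption: the weak learner is a hard-coded stump on Time < 1.5 h;
# a real GBM fits a fresh regression tree to the residuals each round.
times = [1.0, 2.0, 0.5]        # hours
speeds = [5.0, 3.0, 6.0]       # actual speeds
learning_rate = 0.1
estimates = [4.0, 4.0, 4.0]    # initial constant guess

def fit_stump(times, residuals, threshold=1.5):
    """Return a one-split model predicting the mean residual on each side."""
    left = [r for t, r in zip(times, residuals) if t < threshold]
    right = [r for t, r in zip(times, residuals) if t >= threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda t: left_mean if t < threshold else right_mean

for _ in range(200):                                   # stopping criterion
    residuals = [y - f for y, f in zip(speeds, estimates)]
    stump = fit_stump(times, residuals)                # residual model
    estimates = [f + learning_rate * stump(t)          # additive update
                 for f, t in zip(estimates, times)]

print([round(e, 2) for e in estimates])  # [5.5, 3.0, 5.5]
```

Note that with this single fixed split, the two short trips can’t be told apart, so they both converge to their shared mean of 5.5; deeper or varied trees would separate them further.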
Essentially, in this algorithm, we are boosting along the gradient (through the learning rate and the new residuals calculated at each step; for squared-error loss, the residuals are exactly the negative gradient of the loss), and this is what drives the overall error down.
Also, as we can see, it is an additive model:
Estimated Speed + learning rate * Predicted Residual 1 + learning rate * Predicted Residual 2 + ...
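For the first row, the additive prediction after two rounds is just this sum (values taken from the tables above):

```python
# Final prediction = initial estimate + lr * (sum of residual-model outputs)
initial = 4.0
learning_rate = 0.1
predicted_residuals = [1.5, 0.8]   # Predicted Residual 1 and 2, first row
prediction = initial + sum(learning_rate * r for r in predicted_residuals)
print(round(prediction, 2))  # 4.23
```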
That’s how Gradient Boosting works!! Quite simple :P