Ensemble Methods in ML

A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

Suppose you ask thousands of random people a complex question and then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd.

Likewise, if you aggregate the predictions of a group of predictors (e.g. a decision tree classifier, an SVM, a logistic regression model), you will often get better predictions than with the best individual predictor.

Avengers Ensemble!



We'll cover three popular ensemble methods:

  1. Voting Classifier
  2. Bagging and Random Forests
  3. Gradient Boosting

Method #1: Voting Classifier

A voting classifier aggregates the predictions of each classifier and predicts the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

Here's how it's done in Scikit-Learn:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# Gather a set of predictors
estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]

# Initialize the Voting Classifier
voting_clf = VotingClassifier(estimators=estimators,
                              voting='hard')

# Fit the data
voting_clf.fit(X_train, y_train)
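
To check whether the ensemble actually helps, you can compare each classifier's accuracy on a held-out test set. Here's a minimal sketch, assuming X_test and y_test come from the same train/test split as X_train and y_train:

from sklearn.metrics import accuracy_score

# Fit each individual classifier and the ensemble, then compare test accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))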

Extra: Using Soft Voting

If all classifiers are able to estimate class probabilities (i.e. they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.

This is called soft voting, and it often achieves higher performance than hard voting because it gives more weight to highly confident votes.

To perform soft voting, all you need to do is replace voting='hard' with voting='soft' and ensure that all classifiers can estimate class probabilities.

Note: The SVC class can't estimate class probabilities by default, so you'll need to set its probability hyperparameter to True. This makes the SVC class use cross-validation to estimate class probabilities (which slows down training) and adds a predict_proba() method.
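
Here's a minimal sketch of the soft-voting version (same predictors as before, with probability=True on the SVC so that it exposes predict_proba()):

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)  # enables predict_proba() (slower training)

estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]

# Soft voting: average the predicted class probabilities across classifiers
voting_clf = VotingClassifier(estimators=estimators,
                              voting='soft')
voting_clf.fit(X_train, y_train)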


Method #2: Bagging and Random Forests

Another approach is to use the same training algorithm (e.g. logistic regression) for every predictor, but to train them on different random subsets of the training set.

For example (where lr_predictor refers to an instance of logistic regression):

  • lr_predictor_1 trained on subset_1_of_training_set
  • lr_predictor_2 trained on subset_2_of_training_set
  • lr_predictor_3 trained on subset_3_of_training_set
  • ...

This is called bagging, which is short for bootstrap aggregating.

For bagging, sampling is performed with replacement.

When sampling is performed without replacement, it is called pasting.

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
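
In Scikit-Learn, both variants are available through the BaggingClassifier class: bootstrap=True gives bagging and bootstrap=False gives pasting. Here's a minimal sketch, using logistic regression as the base predictor as in the example above:

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Bagging: each predictor is trained on a random subset sampled with replacement
bag_clf = BaggingClassifier(LogisticRegression(),
                            n_estimators=10,
                            max_samples=0.5,   # each subset = 50% of the training set
                            bootstrap=True,
                            n_jobs=-1)

# Pasting: same idea, but the subsets are sampled without replacement
paste_clf = BaggingClassifier(LogisticRegression(),
                              n_estimators=10,
                              max_samples=0.5,
                              bootstrap=False,
                              n_jobs=-1)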

The advantages of bagging are that:

  • The bagging ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
  • It scales very well. The individual predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel.

Armed with the above knowledge, let's look into Random Forests, a specific instance of bagging.


A Random Forest is an ensemble of Decision Trees that are generally trained via the bagging method, typically with max_samples set to the size of the training set.

In other words:
 
Random Forests = Bagging of Decision Trees

However, instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier, which is more convenient and optimized for Decision Trees.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, 
                                 max_leaf_nodes=16, 
                                 n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
              DecisionTreeClassifier(splitter='random',
                                     max_leaf_nodes=16),
              n_estimators=500,
              max_samples=1.0,
              bootstrap=True,
              n_jobs=-1)

Extra: Feature Importance

One of the great qualities of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.

You can access the result via the feature_importances_ attribute. Here's an example:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

pd.DataFrame({'feature': iris['feature_names'],
              'importance': rnd_clf.feature_importances_}).set_index('feature')
                   importance
feature
sepal length (cm)    0.096814
sepal width (cm)     0.022546
petal length (cm)    0.433662
petal width (cm)     0.446977

Method #3: Gradient Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

Gradient Boosting is by far the most popular boosting method. It works by:

  • Sequentially adding predictors to an ensemble, each one correcting its predecessor.
  • Fitting each new predictor to the residual errors made by the previous predictor.

To better illustrate this, let's go through a simple regression example using Decision Trees as the base predictors.

First, let’s fit a DecisionTreeRegressor to the training set:

from sklearn.tree import DecisionTreeRegressor
tree_reg_1 = DecisionTreeRegressor(max_depth=5)
tree_reg_1.fit(X, y)

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y - tree_reg_1.predict(X)
tree_reg_2 = DecisionTreeRegressor(max_depth=5)
tree_reg_2.fit(X, y2)

Then we train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg_2.predict(X)
tree_reg_3 = DecisionTreeRegressor(max_depth=5)
tree_reg_3.fit(X, y3)

We now have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

trees = [tree_reg_1, tree_reg_2, tree_reg_3]
y_pred = sum(t.predict(X_new) for t in trees)

A Simpler Way

We can use Scikit-Learn's GradientBoostingRegressor to condense the above code:

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=5, 
                                 n_estimators=3, 
                                 learning_rate=1.0)
gbrt.fit(X, y)

Extra: Finding the Optimal Number of Trees

The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y)

# Initialize and fit GBRT
gbrt = GradientBoostingRegressor(max_depth=5, n_estimators=120)
gbrt.fit(X_train, y_train)

# Measure the validation error at each stage of training
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]

# Find the optimal number of trees (argmin returns a 0-based index, so add 1)
best_n_estimators = np.argmin(errors) + 1

# Train another GBRT ensemble using the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=5,
                                      n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)

Extra: Stochastic Gradient Boosting

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly.
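
For instance, here's a minimal sketch that reuses the regression setup from above, with each tree trained on a random 25% of the training instances:

# Stochastic Gradient Boosting: each tree sees only a random 25% of the training set
gbrt_stochastic = GradientBoostingRegressor(max_depth=5,
                                            n_estimators=120,
                                            subsample=0.25)
gbrt_stochastic.fit(X_train, y_train)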

Advantages of using Stochastic Gradient Boosting:

  • Trades a higher bias for a lower variance.
  • Speeds up training considerably.

In a Nutshell

  1. Voting Classifier
    • Voting aggregates the predictions of each classifier (e.g. logistic regression, SVC, etc.) and predicts the class that gets the most votes.
    • Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes (it requires all classifiers to be able to estimate class probabilities).
  2. Bagging and Random Forests
    • Bagging
      • Bagging uses the same training algorithm for every predictor and trains them on different random subsets of the training set.
      • Compared to a single predictor trained on the original training set, a bagging ensemble results in a similar bias but a lower variance.
      • Bagging can be trained in parallel. Predictions can be made in parallel too.
    • Random Forests
      • The Random Forests algorithm is a specific instance of bagging.
      • One of the great qualities of Random Forests is that they make it easy to measure the relative importance of each feature.
  3. Gradient Boosting
    • Gradient Boosting (GBRT) sequentially adds (weak) predictors to an ensemble, each one correcting its predecessor.
    • Each new predictor is fitted to the residual errors made by the previous predictor.
    • GBRT cannot be trained in parallel due to its sequential nature.

If you enjoyed this post and want to buy me a cup of coffee...

The thing is, I'll always accept a cup of coffee. So feel free to buy me one.

Cheers! ☕️