Exploring hyper-parameter of Random forest

In this blog post I explore the n_estimators and max_features hyper-parameters to find which values are best for model training, and I show how we can select the best values.

Jaykumar Patel
2021-06-11

Creating and evaluating a simple random forest model

Importing Python Library

import pandas as pd

Importing R library

Read the Data

df <- read.csv(file = here("_python/2021-11-02-random-forest", "house_data.csv"))
head(df)
  bedrooms bathrooms m2_living floors m2_above m2_basement
1        3      1.50       124    1.5      124           0
2        5      2.50       339    2.0      313          26
3        3      2.00       179    1.0      179           0
4        3      2.25       186    1.0       93          93
5        4      2.50       180    1.0      106          74
6        2      1.00        82    1.0       82           0
  m2_lot view quality yr_built renovated_last_5 city
1    735    0       3     1961                0   37
2    841    4       5     1927                1   36
3   1110    0       4     1972                1   19
4    746    0       4     1969                1    4
5    975    0       4     1982                0   32
6    593    0       3     1944                0   36
  statezip   price
1       63  313000
2       59 2384000
3       27  342000
4        8  420000
5       32  550000
6       55  490000

Check Null Value

r.df.isnull().sum()
bedrooms            0
bathrooms           0
m2_living           0
floors              0
m2_above            0
m2_basement         0
m2_lot              0
view                0
quality             0
yr_built            0
renovated_last_5    0
city                0
statezip            0
price               0
dtype: int64

Separate the response variable and the explanatory variables

y = r.df['price']
X = r.df.drop('price', axis = 1)

Train and test split

from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=0)
 
print('Shape of X_train = ', X_train.shape)
Shape of X_train =  (3680, 13)
print('Shape of y_train = ', y_train.shape)
Shape of y_train =  (3680,)
print('Shape of X_validation = ', X_validation.shape)
Shape of X_validation =  (920, 13)
print('Shape of y_validation = ',y_validation.shape)
Shape of y_validation =  (920,)

Create a random forest model

from sklearn.ensemble import RandomForestRegressor
 
regressor1 = RandomForestRegressor(random_state= 0)
regressor1.fit(X_train, y_train)
RandomForestRegressor(random_state=0)

Model evaluation

train_pred = regressor1.predict(X_train)
valid_pred = regressor1.predict(X_validation)

from sklearn.metrics import mean_absolute_error

train_mae = mean_absolute_error(y_train, train_pred)
valid_mae = mean_absolute_error(y_validation, valid_pred)
                                
print(f'Validation set mean absolute error is {round(valid_mae,2)}')
Validation set mean absolute error is 142365.36

Exploring the n_estimators hyper-parameter

Model training

train_mae2 = [] #stores MAE for training set for each n-estimators
valid_mae2 = [] #stores MAE for validation set for each n-estimators

for nesti in range(1,31):
    regressor2 = RandomForestRegressor(n_estimators=nesti, random_state= 0)
    regressor2.fit(X_train, y_train)
    train_pred2 = regressor2.predict(X_train)
    valid_pred2 = regressor2.predict(X_validation)
    train_mae2.append(mean_absolute_error(y_train, train_pred2))
    valid_mae2.append(mean_absolute_error(y_validation, valid_pred2))
RandomForestRegressor(n_estimators=1, random_state=0)
RandomForestRegressor(n_estimators=2, random_state=0)
RandomForestRegressor(n_estimators=3, random_state=0)
RandomForestRegressor(n_estimators=4, random_state=0)
RandomForestRegressor(n_estimators=5, random_state=0)
RandomForestRegressor(n_estimators=6, random_state=0)
RandomForestRegressor(n_estimators=7, random_state=0)
RandomForestRegressor(n_estimators=8, random_state=0)
RandomForestRegressor(n_estimators=9, random_state=0)
RandomForestRegressor(n_estimators=10, random_state=0)
RandomForestRegressor(n_estimators=11, random_state=0)
RandomForestRegressor(n_estimators=12, random_state=0)
RandomForestRegressor(n_estimators=13, random_state=0)
RandomForestRegressor(n_estimators=14, random_state=0)
RandomForestRegressor(n_estimators=15, random_state=0)
RandomForestRegressor(n_estimators=16, random_state=0)
RandomForestRegressor(n_estimators=17, random_state=0)
RandomForestRegressor(n_estimators=18, random_state=0)
RandomForestRegressor(n_estimators=19, random_state=0)
RandomForestRegressor(n_estimators=20, random_state=0)
RandomForestRegressor(n_estimators=21, random_state=0)
RandomForestRegressor(n_estimators=22, random_state=0)
RandomForestRegressor(n_estimators=23, random_state=0)
RandomForestRegressor(n_estimators=24, random_state=0)
RandomForestRegressor(n_estimators=25, random_state=0)
RandomForestRegressor(n_estimators=26, random_state=0)
RandomForestRegressor(n_estimators=27, random_state=0)
RandomForestRegressor(n_estimators=28, random_state=0)
RandomForestRegressor(n_estimators=29, random_state=0)
RandomForestRegressor(n_estimators=30, random_state=0)

Plotting Model Performance graph

Performance on training set

ToDataframe <- function(data) {
  # Convert a Python list (passed via reticulate) of MAE values into an
  # R data frame suitable for ggplot2.
  errors <- data.frame(matrix(unlist(data), nrow = length(data), byrow = TRUE))
  names(errors)[1] <- "error"
  # Row index doubles as the hyper-parameter value (n_estimators / max_features).
  errors$enu <- seq.int(nrow(errors))
  errors
}
train_mae <- ToDataframe(py$train_mae2)
train_mae %>%
  ggplot(aes(x=enu, y=error))+
  geom_line() +
  labs(title = "Distribution of MAE over n_estimators for training set",
         x = "n_estimators",
         y = "mean absolute error") +
    theme_minimal()+
  themes()

Performance on validation set

valid_mae <- ToDataframe(py$valid_mae2)

valid_mae %>%
  ggplot(aes(x=enu, y=error))+
  geom_line() +
  labs(title = "Distribution of MAE over n_estimators for validation set",
         x = "n_estimators",
         y = "mean absolute error") +
    theme_minimal()+
  themes()

Best model performance

minimum2 = valid_mae2.index(min(valid_mae2))
print(f'Minimum Mean Absolute error is {round(min(valid_mae2),2)}\nMinimum error is at {minimum2 + 1} n_estimators ')
Minimum Mean Absolute error is 138483.76
Minimum error is at 4 n_estimators 

Overall observation

  1. Which value of n_estimators gives the best results for the validation set?

    • 4 n_estimators gives the best results
  2. How I decided that this value for n_estimators gave the best results?

    • I created model for n_estimators from 1 to 30 and stored value in the list valid_mae2.

    • Then I compared all the results and looked for the minimum error; that is how I decided the value of n_estimators.

Exploring the max_features hyper-parameter

Model training


train_mae3 = [] #stores MAE for training set for each max_features
valid_mae3 = [] #stores MAE for validation set for each max_features

for max_fea in range(1,14):
    regressor3 = RandomForestRegressor(n_estimators=4, max_features=max_fea, random_state= 0) 
    regressor3.fit(X_train, y_train)
    train_pred3 = regressor3.predict(X_train)
    valid_pred3 = regressor3.predict(X_validation)
    train_mae3.append(mean_absolute_error(y_train, train_pred3))
    valid_mae3.append(mean_absolute_error(y_validation, valid_pred3))
RandomForestRegressor(max_features=1, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=2, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=3, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=4, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=5, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=6, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=7, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=8, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=9, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=10, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=11, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=12, n_estimators=4, random_state=0)
RandomForestRegressor(max_features=13, n_estimators=4, random_state=0)

Plotting Model Performance graph

train_mae <- ToDataframe(py$train_mae3)
train_mae %>%
  ggplot(aes(x=enu, y=error))+
  geom_line() +
  labs(title = "Distribution of MAE over Max_features for training set",
         x = "# of features",
         y = "mean absolute error") +
    theme_minimal()+
  themes()

valid_mae <- ToDataframe(py$valid_mae3)
valid_mae %>%
  ggplot(aes(x=enu, y=error))+
  geom_line() +
  labs(title = "Distribution of MAE over Max_features for validation set",
         x = "# of features",
         y = "mean absolute error") +
    theme_minimal()+
  themes()

## Best model performance

minimum3 = valid_mae3.index(min(valid_mae3))
print(f'Minimum Mean Absolute error is {round(min(valid_mae3),2)}\nMinimum error is at {minimum3 + 1} max_features ')
Minimum Mean Absolute error is 138483.76
Minimum error is at 13 max_features 

Citation

For attribution, please cite this work as

patel (2021, June 11). Jaykumar Patel: Exploring hyper-parameter of Random forest. Retrieved from https://jaykumar-patel.netlify.app/python/2021-11-02-random-forest/

BibTeX citation

@misc{patel2021exploring,
  author = {patel, Jaykumar},
  title = {Jaykumar Patel: Exploring hyper-parameter of Random forest},
  url = {https://jaykumar-patel.netlify.app/python/2021-11-02-random-forest/},
  year = {2021}
}