Estimators
Overview
- Object-Oriented Programming (OOP)
- Inheritance (OOP)
- Estimators
- Transformers
- Custom Estimators
- Pipeline
- Common Scikit-learn modules
Prerequisite:
- Basic understanding of numpy

Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
In this tutorial, we will be using one of the most popular classification datasets in machine learning: the iris dataset. Scikit-learn provides a load_iris function in the sklearn.datasets module to retrieve it.
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> X.shape
(150, 4)
>>> y.shape
(150,)
According to the Scikit-learn documentation, it provides dozens of built-in machine learning algorithms and models. These models (a.k.a. estimators) are implemented as classes using the OOP paradigm, and they provide common methods for processing data. The RandomForestClassifier from the sklearn.ensemble module is one such estimator for classification problems.
>>> from sklearn.ensemble import RandomForestClassifier
Each estimator receives different arguments during instantiation depending on the requirements of the algorithm, so it is handy to keep the online documentation for these estimators close by. Fortunately, most estimator classes in Scikit-learn provide sensible defaults, so we can start using the models without worrying too much about which arguments to pass in.
>>> # Instantiate RandomForestClassifier estimator
>>> estimator = RandomForestClassifier(random_state=0)
Note: Some Scikit-learn estimators accept an optional random_state argument during instantiation. It is recommended to set this argument to a constant int throughout your program. This ensures a consistent result when you run your program multiple times.
Every estimator in Scikit-learn implements a fit method that accepts training data for learning.
>>> # X contains the features of your training set
>>> # y contains the labels for your training set
>>> estimator.fit(X, y)
Since it is best practice to set aside some of the data to evaluate the model, we will split the data (X, y) into training and test sets. Scikit-learn provides utilities for working with your data; the one we will use to split ours is the train_test_split function in the sklearn.model_selection module.
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> X_train.shape, X_test.shape
((112, 4), (38, 4))
>>> y_train.shape, y_test.shape
((112,), (38,))
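By default, train_test_split reserves about 25% of the samples for the test set, which is why we ended up with 38 test samples above. The split can be controlled explicitly with the test_size parameter, as this small sketch shows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 reserves 20% of the 150 samples for testing,
# leaving the remaining 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Passing random_state here plays the same role as it does for estimators: the shuffled split is reproducible across runs.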
Now we can call the RandomForestClassifier.fit method on X_train and y_train.
>>> estimator = RandomForestClassifier(random_state=0)
>>> estimator.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=0, verbose=0,
warm_start=False)
After training the classifier on the training set, we can make predictions with the predict method.
As an example, let's make a prediction on the first 5 elements of the test set, and compare with the actual results.
>>> y_pred = estimator.predict(X_test[0:5])
>>> y_pred # predictions
array([1, 2, 1, 0, 1])
>>> y_test[0:5] # true result
array([1, 2, 1, 0, 1])
Note: Scikit-learn uses numpy arrays under the hood for working with data. Therefore it is advisable to be familiar with the basics of numpy before starting out with Scikit-learn.
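Because predictions come back as numpy arrays, comparing them against the true labels is an ordinary elementwise numpy operation. A minimal sketch, using the prediction values from above:

```python
import numpy as np

y_pred = np.array([1, 2, 1, 0, 1])  # predictions from the tutorial above
y_true = np.array([1, 2, 1, 0, 1])  # corresponding true labels

# == compares elementwise and yields a boolean array;
# .all() then checks whether every position matched
matches = y_pred == y_true
print(matches)        # [ True  True  True  True  True]
print(matches.all())  # True
```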
Our RandomForestClassifier estimator predicted the first 5 elements correctly. Often, though, we are not interested in the predictions themselves; we just want to know how well our model performed overall.
The RandomForestClassifier provides a score method to determine how accurate our model is on a test set.
>>> estimator.score(X_test, y_test)
0.9736842105263158
It appears that our model predicted about 97% of our test set correctly.
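For classifiers, score computes mean accuracy: the fraction of test samples predicted correctly. The sketch below (reproducing the tutorial's setup) checks this against accuracy_score from sklearn.metrics, which computes the same quantity from predictions and true labels:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# score() on a classifier returns mean accuracy on the given test data:
# the fraction of samples whose predicted label matches the true label
score = estimator.score(X_test, y_test)
acc = accuracy_score(y_test, estimator.predict(X_test))
print(score, acc)
```

Other estimator types define score differently (regressors, for example, return R²), so it is worth checking the documentation of the estimator you are using.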
I would encourage you to take a look at the Scikit-learn documentation to get familiar with the many models and functions the package provides. You can refer to the chart below if you are not sure which estimator to use.

Figure 1: Scikit-learn Machine Learning Map
In the next tutorial, we will take a look at Transformers and how to use them for data preprocessing.
Prev - Inheritance (OOP) | Next - Transformers