Custom Estimators
In the last two parts of this series we took a look at some of the estimators that Scikit-learn provides out of the box.
Note: Transformers are estimators as well.
However, Scikit-learn also allows us to define our own custom estimators by inheriting from the base classes in the sklearn.base module.
One thing to keep in mind is that every estimator, regardless of its purpose and functionality, inherits from the sklearn.base.BaseEstimator class.
Creating an estimator is as easy as:
>>> from sklearn.base import BaseEstimator
>>> class MyEstimator(BaseEstimator):
...     pass
Congratulations, you have just created your own custom estimator. However, at this stage it does nothing.
In fact, the methods available on MyEstimator at this point include:
- __setstate__(self, state)
- get_params(self, deep=True)
- set_params(self, **params)
These are inherited from sklearn.base.BaseEstimator and are mostly used internally by Scikit-learn, so we rarely interact with them directly.
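Even though these methods are rarely called by hand, a small sketch can show what get_params and set_params do. The ScaledEstimator class and its factor argument below are hypothetical names used only for illustration; they are not part of Scikit-learn.

```python
from sklearn.base import BaseEstimator


class ScaledEstimator(BaseEstimator):
    # Hypothetical estimator with a single constructor argument
    def __init__(self, factor=1.0):
        self.factor = factor


est = ScaledEstimator(factor=2.0)
params = est.get_params()    # reads the arguments declared in __init__
est.set_params(factor=3.0)   # updates them in place and returns the estimator
updated = est.get_params()
print(params, updated)
```

Both methods rely on the constructor storing each argument on self under the same name, which is why the sketch assigns self.factor = factor without modifying it.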
In order to create a useful estimator, we must first recognize its purpose to identify the category it fits in. Some categories include:
- Transformation
- Regression
- Classification
- Clustering
The sklearn.base module exposes other base classes in addition to BaseEstimator for specific types of estimators: the TransformerMixin class for transformers, RegressorMixin for regressors, ClassifierMixin for classifiers, etc.
For example, sklearn.base.TransformerMixin can be inherited in tandem with sklearn.base.BaseEstimator to define a custom Transformer estimator. This is how all built-in Scikit-learn estimators operate.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> class MyTransformer(BaseEstimator, TransformerMixin):
...     pass
Note: The BaseEstimator class must be inherited as well. This is how Scikit-learn knows that a class is indeed an estimator.
Each *Mixin class requires a different set of methods to be defined in its subclasses to function appropriately.
As an example, the TransformerMixin class requires its subclasses to define the following methods:
- .fit(self, X, y=None)
- .transform(self, X, y=None)
In return, subclasses inherit the .fit_transform(self, X, y=None) method, which executes the .fit and .transform methods in sequence.
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> seed = 0
>>> np.random.seed(seed)
>>> X, y = load_iris(return_X_y=True)
>>> class MyTransformer(BaseEstimator, TransformerMixin):
...     def fit(self, X, y=None):
...         '''
...         1. learns from the data (fits the data)
...         2. returns self
...         '''
...         return self
...     def transform(self, X, y=None):
...         '''
...         return a transformation of the input X data.
...         '''
...         return X
>>> my_transformer = MyTransformer()
>>> X_transformed = my_transformer.fit_transform(X)
Other *Mixin classes include:
- RegressorMixin: Regression estimators
- ClassifierMixin: Classification estimators
- ClusterMixin: Clustering estimators
- OutlierMixin: Outlier detection estimators
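To illustrate one of the mixins listed above, here is a minimal sketch of a regressor built on RegressorMixin. The MeanRegressor class and the toy data are hypothetical, made up for this example. The mixin is listed before BaseEstimator here, which is the order Scikit-learn's own estimators use.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(RegressorMixin, BaseEstimator):
    # Hypothetical regressor: always predicts the mean of the training targets
    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Toy data made up for this sketch
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

reg = MeanRegressor().fit(X_demo, y_demo)
predictions = reg.predict(X_demo)
r2 = reg.score(X_demo, y_demo)  # .score (R^2) is provided by RegressorMixin
print(predictions[:3], r2)
```

Just as TransformerMixin supplies .fit_transform, RegressorMixin supplies a .score method once .fit and .predict are defined.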
In the last tutorial, we used sklearn.preprocessing.LabelEncoder
to encode our iris target
column to numeric values.
Let's implement our own custom LabelEncoder.
But first, we'll load the data.
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> seed = 0
>>> np.random.seed(seed)
>>> iris = load_iris()
>>> labels = iris['target_names'][iris['target']]
>>> columns = iris['feature_names'] + ['target']
>>> values = np.c_[iris['data'], labels]
>>> df = pd.DataFrame(values, columns=columns)
>>> df.sample(5, random_state=seed)
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
114 | 5.8 | 2.8 | 5.1 | 2.4 | virginica |
62 | 6 | 2.2 | 4 | 1 | versicolor |
33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
107 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
7 | 5 | 3.4 | 1.5 | 0.2 | setosa |
Our first step now is to identify the type of estimator we are creating, and use the appropriate *Mixin class.
Hint: Transformer
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> class MyLabelEncoder(BaseEstimator, TransformerMixin):
...     pass
Next, we define the required methods. TransformerMixin requires both .fit(self, X, y=None) and .transform(self, X, y=None), where .fit returns the current estimator object, and .transform returns the transformed data.
>>> class MyLabelEncoder(BaseEstimator, TransformerMixin):
...     def fit(self, X, y=None):
...         ''' Not yet implemented '''
...         return self
...     def transform(self, X, y=None):
...         ''' Not yet implemented '''
...         pass
In the transformation phase, we want to swap the target names with their corresponding integer pair:
target names | integer |
setosa | 0 |
versicolor | 1 |
virginica | 2 |
However, we need to compute the corresponding integer pairs in the .fit method when we first receive the training data. Because np.unique returns the unique values in sorted order, the first flower name alphabetically will be encoded as 0, the second as 1, and the third as 2.
>>> class MyLabelEncoder(BaseEstimator, TransformerMixin):
...     def fit(self, X, y=None):
...         # Get the unique values from array X
...         unique_values = np.unique(X) # ['setosa', 'versicolor', 'virginica']
...         # create a dictionary that maps unique_values to integers
...         # mapping = { 'setosa': 0, 'versicolor': 1, 'virginica': 2 }
...         mapping = dict()
...         for integer, value in enumerate(unique_values):
...             mapping[value] = integer
...         # save the mapping on the current object to be used in the .transform method
...         self.mapping = mapping
...         # Scikit-learn expects the .fit method to return the current object
...         return self
...     def transform(self, X, y=None):
...         mapping = self.mapping
...         # Swap each occurrence of the unique values with their integer pairs
...         transformed_X = []
...         for iris_name in X:
...             integer = mapping[iris_name] # get the flower's corresponding integer
...             transformed_X.append(integer)
...         # return the transformed data as a numpy array
...         return np.array(transformed_X)
Let's compare our MyLabelEncoder with sklearn.preprocessing.LabelEncoder:
>>> from sklearn.preprocessing import LabelEncoder
>>> label_encoder = LabelEncoder()
>>> my_label_encoder = MyLabelEncoder()
>>> values = df['target'].values
>>> sklearn_encoding = label_encoder.fit_transform(df['target'])
>>> custom_encoding = my_label_encoder.fit_transform(df['target']) # .fit_transform inherited from TransformerMixin
>>> np.all(sklearn_encoding == custom_encoding) # all elements in both encoded arrays are equal
Our custom MyLabelEncoder produces the same result as sklearn.preprocessing.LabelEncoder.
Note: The .fit method is for extracting information from the data passed in (usually the training set) on how to transform subsequent data. Therefore, it should only be used once, and on the training set only. Similarly, the .fit_transform method should be used only once on the training set because it calls .fit internally. For every subsequent attempt to transform the data, use .transform.
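The fit-once rule can be sketched with a train/test split. This example uses the built-in sklearn.preprocessing.LabelEncoder so that it runs on its own; the small flowers array and the split settings are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy label column, made up for this sketch
flowers = np.array(['setosa', 'versicolor', 'virginica'] * 10)
train, test = train_test_split(
    flowers, test_size=0.3, random_state=0, stratify=flowers
)

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train)  # learn the mapping once, on the training set
test_encoded = encoder.transform(test)        # reuse the learned mapping, no refitting
print(encoder.classes_, test_encoded)
```

Calling .fit_transform again on the test set would relearn the mapping from the test labels, which could silently assign different integers to the same names.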
As a bonus, let's create a custom classification estimator that randomly classifies instances in a dataset.
>>> # 1. Identify estimator type
>>> # classification estimators inherit from ClassifierMixin
>>> from sklearn.base import BaseEstimator, ClassifierMixin
>>> # 2. Define required methods
>>> # ClassifierMixin requires .fit and .predict from its subclasses
>>> class MyRandomClassifier(BaseEstimator, ClassifierMixin):
...     def fit(self, X, y=None):
...         ''' Not yet implemented '''
...         return self
...     def predict(self, X):
...         ''' Not yet implemented '''
...         pass
The ClassifierMixin requires a .predict method defined in its subclasses, and provides us with a .score method.
Since we are randomly classifying instances of the data, we might want to allow users of our MyRandomClassifier estimator to provide a random_state value for reproducing the results of the model.
The random_state is not a parameter that the estimator learns; rather, it is explicitly provided by users of our estimator.
Parameters that are not learned from the data during the .fit method are called hyperparameters and are accepted in the __init__(...) method.
>>> class MyRandomClassifier(BaseEstimator, ClassifierMixin):
...     def __init__(self, random_state=None):
...         self.random_state = random_state
...     def fit(self, X, y):
...         ''' Extract labels from the training set, in the `y` parameter.
...         '''
...         self.labels = np.unique(y)
...         return self # 👈 required
...     def predict(self, X):
...         ''' Randomly classifies the rows in the data.
...         In order to reproduce the random results via random_state,
...         we use a np.random.Generator object which implements the .choice method similar to np.random.choice
...         '''
...         generator = np.random.default_rng(self.random_state) # gets np.random.Generator object
...         labels = self.labels
...         predictions = generator.choice(labels, size=len(X)) # generate random predictions
...         return predictions
>>> clf = MyRandomClassifier(random_state=seed)
>>> clf.fit(X, y)
>>> clf.predict(X)
array([2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 0, 2,
2, 0, 1, 2, 1, 0, 2, 2, 2, 0, 0, 2, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 0,
1, 2, 2, 1, 1, 0, 1, 1, 2, 2, 0, 2, 1, 1, 2, 1, 0, 0, 2, 1, 1, 1,
2, 1, 0, 2, 0, 0, 2, 1, 0, 0, 1, 2, 1, 2, 0, 0, 2, 2, 0, 0, 2, 1,
1, 0, 2, 1, 2, 2, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2, 2,
2, 0, 2, 2, 0, 1, 1, 0, 1, 1, 2, 2, 1, 1, 1, 2, 0, 1])
>>> # Accuracy Score
>>> clf.score(X, y) # implemented by ClassifierMixin
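To make the inherited .score method less of a black box, the sketch below checks that it matches sklearn.metrics.accuracy_score computed by hand. The class is re-declared under the hypothetical name RandomClassifierSketch so the block runs on its own, and the mixin is listed before BaseEstimator, the order Scikit-learn's own estimators use.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score


class RandomClassifierSketch(ClassifierMixin, BaseEstimator):
    # Same logic as MyRandomClassifier, repeated to keep the sketch self-contained
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        self.labels = np.unique(y)
        return self

    def predict(self, X):
        # A fresh generator per call makes repeated predictions identical for a fixed seed
        generator = np.random.default_rng(self.random_state)
        return generator.choice(self.labels, size=len(X))


X_iris, y_iris = load_iris(return_X_y=True)
sketch = RandomClassifierSketch(random_state=0).fit(X_iris, y_iris)

mixin_score = sketch.score(X_iris, y_iris)                      # from ClassifierMixin
manual_score = accuracy_score(y_iris, sketch.predict(X_iris))   # computed by hand
print(mixin_score, manual_score)
```

With three balanced classes, a random classifier like this should land somewhere around one third accuracy, though the exact value depends on the seed.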
Understanding how Scikit-learn estimators work under the hood will help you write cleaner code that interacts nicely with the Scikit-learn API, especially pipelines, which we will be taking a look at next.