Transformers
Overview
- Object-Oriented Programming (OOP)
- Inheritance (OOP)
- Estimators
- Transformers
- Custom Estimators
- Pipeline
- Common Scikit-learn modules
Prerequisite:
- Basic understanding of numpy
Last time we were introduced to Scikit-learn estimators. These are classes that implement machine-learning algorithms and are trained on our data using the .fit method. We also saw how to make predictions using the .predict method. We trained the RandomForestClassifier estimator on the iris dataset to classify the type of flower, and we got a 97% accuracy.
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>>
>>> # Load data
>>> X, y = load_iris(return_X_y=True)
>>> X.shape # 150 flowers, 4 features
(150, 4)
>>> y.shape
(150,)
>>>
>>> # Split data into training and test sets
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>>
>>> # Instantiate model
>>> clf = RandomForestClassifier(random_state=0)
>>>
>>> # Train model
>>> clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=0, verbose=0,
warm_start=False)
>>>
>>> # Evaluate model accuracy
>>> clf.score(X_test, y_test)
0.9736842105263158
In the example above, the iris dataset was already prepared to be trained on using the RandomForestClassifier.fit method. However, most data in the real world is not so clean. In fact, datasets may contain missing information, outliers, and other issues. On the other hand, most estimators in Scikit-learn expect to receive data in a particular format before they can perform any kind of procedure. Most, if not all, estimators in Scikit-learn expect the data to be numeric, and they cannot function with data that contains missing values.
Take a look at the following dataset:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>>
>>> def make_dirty(iris_data):
... """ a utility function to dirty up the iris dataset
... to make it unpresentable to most Scikit-learn estimators.
... Changes:
... 1. Transformed iris target values into their respective text names
... ('setosa', 'versicolor', 'virginica')
... 2. Scaled dimensions of 'petal length' and 'petal width'
... """
...
... features = iris_data['data']
... target = iris_data['target']
... columns = iris_data['feature_names'] + ['target']
...
... df = pd.DataFrame(np.c_[features, target], columns=columns)
...
... # change target from number to text
... df['target'] = iris_data['target_names'][target]
...
... # make petal features with dimensions in meters
... df[['petal length', 'petal width']] = df[['petal length (cm)', 'petal width (cm)']] * .01
...
... return df[['sepal length (cm)', 'sepal width (cm)', 'petal length', 'petal width', 'target']]
>>>
>>> iris = load_iris()
>>> df = make_dirty(iris)
>>> df.sample(5, random_state=0)
|     | sepal length (cm) | sepal width (cm) | petal length | petal width | target     |
|-----|-------------------|------------------|--------------|-------------|------------|
| 114 | 5.8               | 2.8              | 0.051        | 0.024       | virginica  |
| 62  | 6                 | 2.2              | 0.04         | 0.01        | versicolor |
| 33  | 5.5               | 4.2              | 0.014        | 0.002       | setosa     |
| 107 | 7.3               | 2.9              | 0.063        | 0.018       | virginica  |
| 7   | 5                 | 3.4              | 0.015        | 0.002       | setosa     |
The target column is an issue for most Scikit-learn estimators because its datatype is not numeric. Luckily, Scikit-learn provides some classes that implement certain procedures to transform your data into a format that is more compatible with your model. One of these is the LabelEncoder class from the sklearn.preprocessing module.
>>> from sklearn.preprocessing import LabelEncoder
>>> label_encoder = LabelEncoder()
All transformers in Scikit-learn implement the following methods:
- .fit extracts the essential information from the provided data that is needed to transform subsequent data.
- .transform returns a transformation of the data.
You may recall that the .fit method was also present in the RandomForestClassifier. Transformers are essentially estimators. We established last time that estimators train on provided data to make predictions. Well, transformers also train on provided data, but rather than output a prediction, they return a transformation of the input data. In fact, making predictions follows the same concept: you receive input data, apply one or more transformations to the data, and return the result.
>>> target = df['target']
>>> # .fit phase
>>> label_encoder.fit(target)
LabelEncoder()
>>> # .transform phase
>>> result = label_encoder.transform(target)
>>> df['target'] = result
>>> df.sample(5, random_state=0)
|     | sepal length (cm) | sepal width (cm) | petal length | petal width | target |
|-----|-------------------|------------------|--------------|-------------|--------|
| 114 | 5.8               | 2.8              | 0.051        | 0.024       | 2      |
| 62  | 6                 | 2.2              | 0.04         | 0.01        | 1      |
| 33  | 5.5               | 4.2              | 0.014        | 0.002       | 0      |
| 107 | 7.3               | 2.9              | 0.063        | 0.018       | 2      |
| 7   | 5                 | 3.4              | 0.015        | 0.002       | 0      |
Note: LabelEncoder should only be used on the column being predicted, not on feature columns. If you are looking to encode feature columns, consider other encoding classes such as sklearn.preprocessing.OneHotEncoder or sklearn.preprocessing.OrdinalEncoder. Alternatively, consider the category_encoders package.
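To make that distinction concrete, here is a minimal sketch of the two feature encoders mentioned in the note, applied to a made-up color column (the colors array is purely illustrative and not part of the iris data):
>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
>>>
>>> # A made-up categorical feature column; feature encoders expect 2-D input
>>> colors = np.array([['red'], ['green'], ['blue'], ['green']])
>>>
>>> # OrdinalEncoder maps each category to an integer (categories are sorted internally)
>>> OrdinalEncoder().fit_transform(colors)
array([[2.],
       [1.],
       [0.],
       [1.]])
>>>
>>> # OneHotEncoder creates one binary column per category (blue, green, red)
>>> OneHotEncoder().fit_transform(colors).toarray()
array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])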
What Happened?
In simplest terms, all occurrences of setosa, versicolor, and virginica were replaced with 0, 1, and 2 respectively.
| target names | encoded int |
|--------------|-------------|
| setosa       | 0           |
| versicolor   | 1           |
| virginica    | 2           |
During the .fit phase, the label_encoder extracted the unique values within the target array (setosa, versicolor, and virginica) and stored them internally along with their corresponding mapping. The unique values are stored in sorted order, which is why setosa was paired with the integer 0, versicolor with 1, and virginica with 2.
In the .transform phase, the label_encoder returned a new array with each occurrence of setosa, versicolor, and virginica swapped with its paired integer.
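If you want to double-check the mapping, the fitted encoder exposes it directly. A quick sketch, assuming the label_encoder fitted above is still in scope:
>>> # The learned classes; their positions are the integers they map to
>>> label_encoder.classes_
array(['setosa', 'versicolor', 'virginica'], dtype=object)
>>>
>>> # Map encoded integers back to the original labels
>>> label_encoder.inverse_transform([0, 1, 2])
array(['setosa', 'versicolor', 'virginica'], dtype=object)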
Another, more subtle, issue with this data is the inconsistent scale of the features.
|      | sepal length (cm) | sepal width (cm) | petal length | petal width |
|------|-------------------|------------------|--------------|-------------|
| mean | 5.84333           | 3.05733          | 0.03758      | 0.0119933   |
| std  | 0.828066          | 0.435866         | 0.017653     | 0.00762238  |
The mean and std of the sepal and petal features vary by a great deal.
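You can reproduce a summary like the one above with pandas. A quick sketch, assuming the df built by make_dirty earlier (the exact display formatting may differ):
>>> # Mean and standard deviation of each feature column
>>> feature_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length', 'petal width']
>>> df[feature_cols].describe().loc[['mean', 'std']]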
Depending on the algorithm you're working with, the difference in scale between features can negatively affect the performance of the model.
The example below trains a Support Vector Machine classifier on the current dataset and evaluates its accuracy.
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC # Support Vector Classifier
>>> X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length', 'petal width']]
>>> y = df['target']
>>> X.shape
(150, 4)
>>> y.shape
(150,)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>>
>>> clf = SVC(random_state=0)
>>> clf.fit(X_train, y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
verbose=False)
>>> clf.score(X_test, y_test)
0.7631578947368421
Applying scaling to each feature in the dataset using the StandardScaler transformer from the sklearn.preprocessing module before fitting the model yields a better accuracy.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(X_train)
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> X_train_transformed = scaler.transform(X_train)
>>> # X_train_transformed = scaler.fit_transform(X_train) # composes the previous two steps
>>>
>>> clf = SVC(random_state=0)
>>> clf.fit(X_train_transformed, y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
verbose=False)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9736842105263158
As you can see, the accuracy went up from 76% all the way to 97%. The boost in performance is a result of all the features now being on a similar scale.
|      | sepal length (cm) | sepal width (cm) | petal length | petal width |
|------|-------------------|------------------|--------------|-------------|
| mean | -0.0498882        | 0.0127753        | -0.0214369   | -0.0306981  |
| std  | 0.954636          | 1.00373          | 0.984748     | 0.979828    |
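To inspect the scaled features yourself, you can wrap the transformed array back into a DataFrame. A quick sketch, assuming the X_test_transformed array and the X DataFrame from above (the numbers should be close to the table):
>>> # StandardScaler returns a plain numpy array, so rebuild a DataFrame to inspect it
>>> scaled_df = pd.DataFrame(X_test_transformed, columns=X.columns)
>>> scaled_df.describe().loc[['mean', 'std']]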
Note: A Support Vector Machine is used in this tutorial to illustrate the significance of scaling for certain machine-learning algorithms, since Random Forest models are not affected by the scale of the features.
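If you want to verify that claim, a quick sketch is to train the RandomForestClassifier on both the raw and the scaled training data and compare the two scores; they should be essentially identical:
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Random Forest on the raw (unscaled) features
>>> rf_raw = RandomForestClassifier(random_state=0)
>>> rf_raw.fit(X_train, y_train).score(X_test, y_test)
>>>
>>> # Random Forest on the standard-scaled features
>>> rf_scaled = RandomForestClassifier(random_state=0)
>>> rf_scaled.fit(X_train_transformed, y_train).score(X_test_transformed, y_test)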
Hopefully, this example sheds light on the effectiveness of transformers for manipulating your data.
For further reading, you can refer to the Scikit-learn documentation on Dataset transformations.
In the next tutorial we will be creating our own estimators and transformers using what we have learned so far in this series.
Prev - Estimators | Next - Custom Estimators