Scikit Learn to Pandas: Data types shouldn't be this hard

by Dr. Phil Winder, CEO

Nearly everyone using Python for Data Science has used or is using the Pandas Data Analysis/Preprocessing library. It is as much of a mainstay as Scikit-Learn. Despite this, one continuing bugbear is the different core data types used by each: pandas.DataFrame and numpy.ndarray. Wouldn’t it be great if we didn’t have to worry about converting DataFrames to numpy types and back again? Yes, it would. Step forward Sklearn Pandas.

Sklearn Pandas

Sklearn Pandas, part of the scikit-learn-contrib collection, adds some syntactic sugar for moving DataFrames into sklearn pipelines and back again.

Let’s take the example in the README. It gives us some simple data containing categorical and numeric features:

import pandas as pd

# Toy data from the README: one categorical feature and two numeric ones.
data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})
data['pet'] = data['pet'].astype("category")

Now we can use the library to create a mapper that lets us use our Pandas DataFrame with sklearn:

from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper

# Map each column (or list of columns) to the sklearn transformer that should handle it.
mapper = DataFrameMapper([
    ('pet', preprocessing.LabelBinarizer()),
    (['children'], preprocessing.StandardScaler())
])
mapper.fit_transform(data.copy())

We’re using the new DataFrameMapper class to map the input, data, to whatever the sklearn transformers declared in the list produce. Note that the class conforms to the standard sklearn fit/transform API. When we run this we get:

array([[ 1.        ,  0.        ,  0.        ,  0.20851441],
       [ 0.        ,  1.        ,  0.        ,  1.87662973],
       [ 0.        ,  1.        ,  0.        , -0.62554324],
       [ 0.        ,  0.        ,  1.        , -0.62554324],
       [ 1.        ,  0.        ,  0.        , -1.4596009 ],
       [ 0.        ,  1.        ,  0.        , -0.62554324],
       [ 1.        ,  0.        ,  0.        ,  1.04257207],
       [ 0.        ,  0.        ,  1.        ,  0.20851441]])

The first thing to note is that the output is a NumPy array. This was a little surprising for a library that is supposed to map back and forth between Pandas and sklearn.

The second thing to notice is that the new DataFrameMapper looks very similar to sklearn’s pipeline.Pipeline class. In fact, I would go so far as to say that it duplicates the functionality of the Pipeline class.

Also, and this is a gripe with the Pipeline class too, I don’t like the use of (name, transformer) tuples. It would have been much cleaner to treat this like what it really is: a functional pipeline. Passing in a lambda that maps data via an sklearn class/function would be much cleaner and far more reusable.
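To illustrate what I mean, here is a rough sketch of that functional style. The compose, one_hot and scale helpers below are purely my own illustration, not part of sklearn or Sklearn Pandas:

from functools import reduce

import pandas as pd
from sklearn import preprocessing

def scale(columns):
    # Return a function that standardises the given numeric columns of a DataFrame.
    def _scale(df):
        df = df.copy()
        df[columns] = preprocessing.StandardScaler().fit_transform(df[columns])
        return df
    return _scale

def one_hot(column):
    # Return a function that one-hot encodes a single categorical column via pandas.
    return lambda df: pd.get_dummies(df, columns=[column])

def compose(*steps):
    # Apply each step to the DataFrame in turn, left to right.
    return lambda df: reduce(lambda acc, step: step(acc), steps, df)

transform = compose(one_hot("pet"), scale(["children", "salary"]))
transform(data)  # returns a DataFrame with 'pet' encoded and the numerics scaled

Each step is just a function from DataFrame to DataFrame, so steps can be reused, tested and composed freely.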

Scikit-learn’s Pipeline is All You Need

These ideas aren’t just mine. John Ramey presents a simple adapter class that selects the right data type for the operation (Ramey, 2018). Tom de Ruijter developed the same idea too (Ruijter, 2017).

Essentially, what they do is create a class that filters for specific features (note how we’re still using functional language here). In the example below we filter on a data type, but we could just as easily filter on other parameters, like the name of the feature (a name-based sketch follows the pipeline example below).

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TypeSelector(BaseEstimator, TransformerMixin):
    """Select only the DataFrame columns that match the given dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

We can use this filter in front of the transformers to ensure they receive the right type. For a feature that is categorical, for example, we can now create a standard sklearn pipeline like this:

from sklearn import pipeline, preprocessing

pipeline.make_pipeline(
    TypeSelector("category"),
    preprocessing.OneHotEncoder()
)
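As mentioned, nothing stops us filtering on the feature name instead of the dtype. Here is a minimal sketch of that variant; the ColumnSelector class and the chosen column are my own illustration, not part of sklearn:

# Name-based selector; reuses the imports from the TypeSelector example above.
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X[self.columns]

pipeline.make_pipeline(
    ColumnSelector(["children"]),
    preprocessing.StandardScaler()
)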

All we need to do now is repeat this pattern for each data type or feature and then merge them back together again. Here it is in action:

import numpy as np

pipe = pipeline.make_union(
    pipeline.make_pipeline(
        TypeSelector("category"),
        preprocessing.OneHotEncoder()
    ),
    pipeline.make_pipeline(
        TypeSelector(np.number),
        preprocessing.StandardScaler()
    )
)
pipe.fit_transform(data.copy()).toarray()
array([[ 1.        ,  0.        ,  0.        ,  0.20851441,  2.27500192],
       [ 0.        ,  1.        ,  0.        ,  1.87662973, -0.87775665],
       [ 0.        ,  1.        ,  0.        , -0.62554324,  0.07762474],
       [ 0.        ,  0.        ,  1.        , -0.62554324, -0.73444944],
       [ 1.        ,  0.        ,  0.        , -1.4596009 , -0.49560409],
       [ 0.        ,  1.        ,  0.        , -0.62554324,  0.79416078],
       [ 1.        ,  0.        ,  0.        ,  1.04257207, -0.30452782],
       [ 0.        ,  0.        ,  1.        ,  0.20851441, -0.73444944]])

There we have it: almost the same functionality as the library, with fewer lines of code, using standard sklearn methods. The only thing we haven’t done that the library does is maintain the feature metadata at the end of the pipeline; the result of the code above is still a NumPy array.
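If we do want the column names back, we can rebuild a DataFrame ourselves. The sketch below assumes a reasonably recent scikit-learn (0.20+), where make_pipeline names steps after the lowercased class and a fitted OneHotEncoder exposes categories_; treat it as illustrative rather than definitive:

# Restore column names after the FeatureUnion above.
transformed = pipe.fit_transform(data.copy()).toarray()

# The fitted encoder in the first sub-pipeline knows which categories it produced.
encoder = pipe.transformer_list[0][1].named_steps["onehotencoder"]
pet_columns = ["pet_" + str(c) for c in encoder.categories_[0]]

result = pd.DataFrame(transformed, columns=pet_columns + ["children", "salary"])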

Conclusion: Extra Complexity You Don’t Need

The Sklearn Pandas library also has some helper wrappers that override the sklearn implementation, like a wrapper for cross-validation and a vectorised function mapper. Again, I think these are superfluous. You can do the same very easily with standard sklearn and numpy methods or a bit of functional Python.
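For example, sklearn’s own FunctionTransformer already gives you a vectorised function mapper, and cross_val_score works on any pipeline directly. The log transform and the LinearRegression target below are my own illustrative choices, not something from the library:

import numpy as np
from sklearn import pipeline, preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# A 'vectorised function mapper' is just FunctionTransformer from sklearn itself.
log_numerics = pipeline.make_pipeline(
    TypeSelector(np.number),
    preprocessing.FunctionTransformer(np.log1p, validate=False),
)

# Cross-validation needs no wrapper either: the FeatureUnion built above
# drops straight into a model pipeline and cross_val_score.
model = pipeline.make_pipeline(pipe, LinearRegression())
scores = cross_val_score(model, data[["pet", "children"]], data["salary"], cv=2)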

Considering how simple it should be, I’m also worried about the cyclomatic complexity of the library. The _transform method has a cyclomatic complexity of 18 (a score of 21 is considered high; Subandri and Sarno, 2017).

I wouldn’t recommend the use of this library as it currently stands. I think it would be wiser to utilise sklearn’s Pipeline, or a functional library with some wrapper methods/classes.

But this leads me to the question: considering that these two libraries are some of the most popular Data Science libraries in the world, why is the integration between them so poor?

References
