Scikit Learn to Pandas: Data types shouldn't be this hard
by Dr. Phil Winder, CEO
Nearly everyone using Python for Data Science has used or is using the Pandas data analysis/preprocessing library. It is as much of a mainstay as Scikit-Learn. Despite this, one continuing bugbear is the different core data types used by each: pandas.DataFrame and np.array. Wouldn’t it be great if we didn’t have to worry about converting DataFrames to numpy types and back again? Yes, it would. Step forward Sklearn Pandas.
Sklearn Pandas
Sklearn Pandas, part of the Scikit-Learn Contrib collection, adds some syntactic sugar for using DataFrames in sklearn pipelines and back again.
Let’s take the example from the README. This gives us some simple data that contains both categorical and numeric features:
import numpy as np
import pandas as pd

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary': [90., 24, 44, 27, 32, 59, 36, 27]})
data['pet'] = data['pet'].astype("category")
Now we can use the library to create a mapper that allows us to use our Pandas DataFrame with sklearn:
from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('pet', preprocessing.LabelBinarizer()),
    (['children'], preprocessing.StandardScaler())
])
mapper.fit_transform(data.copy())
We’re using the new DataFrameMapper class, which maps an input, data, to whatever is the output of the sklearn transformers declared in the list. Note that the class conforms to the standard sklearn fit/transform API. When we run this we get:
array([[ 1. , 0. , 0. , 0.20851441],
[ 0. , 1. , 0. , 1.87662973],
[ 0. , 1. , 0. , -0.62554324],
[ 0. , 0. , 1. , -0.62554324],
[ 1. , 0. , 0. , -1.4596009 ],
[ 0. , 1. , 0. , -0.62554324],
[ 1. , 0. , 0. , 1.04257207],
[ 0. , 0. , 1. , 0.20851441]])
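As an aside, because the mapper conforms to that fit/transform API, we can also fit and transform in two separate steps, which is what you would do to reuse a mapper fitted on training data. A minimal sketch, where the train/test split is my own illustration:

# Fit on one part of the data, then reuse the fitted mapper on the rest.
# The split below is purely illustrative.
train, test = data.iloc[:6].copy(), data.iloc[6:].copy()
mapper.fit(train)                     # learns binarizer classes and scaler statistics
transformed = mapper.transform(test)  # applies them to unseen rows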
The first thing to note is that the output is a numpy array. This was a little surprising, since it is supposed to be a library that can map back and forth between Pandas and sklearn.
The second thing to notice is that the new DataFrameMapper looks very similar to sklearn’s pipeline.Pipeline class. In fact, I would go so far as to say that it duplicates the functionality of the Pipeline class.
Also, and this is a gripe with the Pipeline class too, I don’t like the use of named tuples. It would have been much cleaner to treat this like what it really is: a functional pipeline. Passing in a lambda to map data through an sklearn class/function would make it much cleaner and far more reusable.
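To make that point concrete, here is a rough sketch of what a lambda-based version could look like. The transforms list and the hstack merge are my own illustration, not an API from either library:

import numpy as np
from sklearn import preprocessing

# Each feature transform is a plain function of the DataFrame, so every
# step is reusable and testable in isolation.
transforms = [
    lambda df: preprocessing.LabelBinarizer().fit_transform(df['pet']),
    lambda df: preprocessing.StandardScaler().fit_transform(df[['children']]),
]

# Apply every transform and concatenate the results column-wise.
result = np.hstack([t(data) for t in transforms])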
Scikit-learn’s Pipeline is All You Need
These ideas aren’t just mine. John Ramey presents a simple adapter class that selects the right data type for the operation (Ramey, 2018). Tom de Ruijter developed the same idea too (Ruijter, 2017).
Essentially what they do is create a class that filters for specific features (note how we’re still using functional language here). In the example below we filter on data type, but we could just as easily filter on other parameters, like the name of the feature; a sketch of that variant follows the class below.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TypeSelector(BaseEstimator, TransformerMixin):
    """Select only the DataFrame columns that match a given dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Stateless transformer; nothing to learn.
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
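And here is the name-based variant mentioned above. The ColumnSelector name and behaviour are my own sketch, following the same pattern (and sharing the imports above):

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select DataFrame columns by name rather than by dtype."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X[self.columns]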
We can use this filter in front of the transformers to ensure they receive the right type. For categorical features, for example, we can now create a standard sklearn pipeline like this:
from sklearn import pipeline

pipeline.make_pipeline(
    TypeSelector("category"),
    preprocessing.OneHotEncoder()
)
All we need to do now is repeat this pattern for each data type or feature and then merge the results back together again. Here it is in action:
pipe = pipeline.make_union(
    pipeline.make_pipeline(
        TypeSelector("category"),
        preprocessing.OneHotEncoder()
    ),
    pipeline.make_pipeline(
        TypeSelector(np.number),
        preprocessing.StandardScaler()
    )
)
pipe.fit_transform(data.copy()).toarray()
array([[ 1. , 0. , 0. , 0.20851441, 2.27500192],
[ 0. , 1. , 0. , 1.87662973, -0.87775665],
[ 0. , 1. , 0. , -0.62554324, 0.07762474],
[ 0. , 0. , 1. , -0.62554324, -0.73444944],
[ 1. , 0. , 0. , -1.4596009 , -0.49560409],
[ 0. , 1. , 0. , -0.62554324, 0.79416078],
[ 1. , 0. , 0. , 1.04257207, -0.30452782],
[ 0. , 0. , 1. , 0.20851441, -0.73444944]])
There we have it: almost the same functionality as the library, with fewer lines of code, using standard methods. The only thing that we haven’t done that the library does is maintain the feature metadata at the end of the pipeline. The result of the code above is a numpy array.
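If you need the metadata back, it is easy enough to rebuild a DataFrame around the result yourself. A minimal sketch, where the column names are reconstructed by hand from the union above (one-hot categories in alphabetical order, then the scaled numeric columns):

# Nothing in the pipeline provides column names, so we list them manually.
transformed = pipe.fit_transform(data.copy()).toarray()
columns = ['pet_cat', 'pet_dog', 'pet_fish', 'children', 'salary']
result = pd.DataFrame(transformed, columns=columns, index=data.index)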
Conclusion: Extra Complexity You Don’t Need
The scikit pandas library also has some helper wrapper methods that override the sklearn implementation, like a wrapper for cross validation and a vectorised function mapper. Again, I think these are superfluous. You can do this very easily with standard numpy methods or a bit of functional Python.
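For example, sklearn’s standard cross validation already accepts a DataFrame directly, so the wrapper adds little. A minimal sketch, where the regression task is purely illustrative:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Predict salary from children; cross_val_score happily takes DataFrames,
# so no sklearn-pandas wrapper is needed.
X = data[['children']]
y = data['salary']
model = pipeline.make_pipeline(preprocessing.StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=2)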
Considering how simple it should be, I’m also worried about the cyclomatic complexity of the library. The _transform method has a cyclomatic complexity of 18 (21 is considered high; Subandri and Sarno, 2017).
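You can check figures like this yourself. A minimal sketch using the radon package (my choice of tool, and the file path is an assumption about the package layout):

from radon.complexity import cc_visit

# cc_visit parses Python source and returns one result per function or
# method, each carrying a .complexity score.
with open('sklearn_pandas/dataframe_mapper.py') as f:
    source = f.read()

for block in cc_visit(source):
    print(block.name, block.complexity)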
I wouldn’t recommend the use of this library as it currently stands. I think it would be wise to utilise sklearn’s Pipeline or a functional library with some wrapper methods/classes.
But this leads me to a question: considering that these two libraries are some of the most popular Data Science libraries in the world, why is there such poor integration?
References
- Ruijter, Tom de. “Integrating Pandas and Scikit-Learn with Pipelines.” Bigdatarepublic (blog), November 21, 2017. https://medium.com/bigdatarepublic/integrating-pandas-and-scikit-learn-with-pipelines-f70eb6183696.
- Subandri, Muhammad Asep, and Riyanarto Sarno. “Cyclomatic Complexity for Determining Product Complexity Level in COCOMO II.” Procedia Computer Science, 4th Information Systems International Conference 2017, ISICO 2017, 6-8 November 2017, Bali, Indonesia, 124 (January 1, 2017): 478–86. https://doi.org/10.1016/j.procs.2017.12.180.