
PatsyTransformer in inference #505

Closed
petrhrobar opened this issue Apr 4, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@petrhrobar

petrhrobar commented Apr 4, 2022

Hey Vincent,
I am a huge fan of your sklego project, especially the patsy implementation within sklearn.

However, there is one thing I would still like your opinion on: how do you use a pipeline containing a PatsyTransformer for inference only?

As pickling is not yet supported on the patsy side, I came up with a workaround.

import seaborn as sns
from joblib import dump, load
from sklego.preprocessing import PatsyTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Load the data
data = sns.load_dataset("tips")

# Basic pipeline
pipe = Pipeline([
    ("patsy", PatsyTransformer("tip + C(day)")),
    ("model", LinearRegression())
])

# Train the pipeline
pipe.fit(data, data['total_bill'])

from sklearn.base import BaseEstimator, TransformerMixin

# Class for inference with a pre-trained model (fit only passes, no training happens)
class Model_Inferencer(BaseEstimator, TransformerMixin):
    """
    Applies a pre-trained model within a pipeline setting.
    """

    def __init__(self, pre_trained_model=None):
        self.pre_trained_model = pre_trained_model

    def transform(self, X):
        return self.pre_trained_model.predict(X)

    def predict(self, X):
        return self.pre_trained_model.predict(X)

    def fit(self, X, y=None, **fit_params):
        return self

pipe.predict(data)[:10]

# Save the model
dump(pipe['model'], 'model_github.joblib')

# Load The model
loaded_model = load('model_github.joblib')

# Create Inference Pipeline
pipe_inference = Pipeline([
    ("patsy", PatsyTransformer("tip + C(day)")),
    ("inferencer", Model_Inferencer(loaded_model))
])


# The inference pipeline still needs to be fitted before it can predict
# pipe_inference.fit(data)

# Make predictions (works only when fitted)
pipe_inference.predict(data)

I'm still not sure, though, whether something may go wrong - e.g. the dummy coding inside patsy ending up different from when the model was trained (first pipeline). Say some batch contains only 3 levels of a categorical variable while the original data had 10 of them...

What approach would you use when patsy is needed in production (inference only)?

Thanks

@petrhrobar petrhrobar added the enhancement New feature or request label Apr 4, 2022
@koaning
Owner

koaning commented Apr 4, 2022

> As the picking is not yet supported on the patsy side

Do you mean "pickling"? Or something else?

@koaning
Owner

koaning commented Apr 4, 2022

Awh heck! It seems we're not applying our standard tests to the PatsyTransformer.

@petrhrobar
Author

petrhrobar commented Apr 4, 2022

Yes, I mean pickling the pipeline and then loading it and using it only for predictions on new data. Pickling is not yet supported by patsy.

Perhaps saving design_info_ via h5?

I figured this would be very helpful to a lot of people, which is why I raised the question.

@koaning
Owner

koaning commented Apr 4, 2022

You're the first person to bring it up 😅 . I'll look into it though. Storing the design info might work, but you'd need to do that manually.
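
For illustration, a minimal sketch of what reusing stored design info looks like with plain patsy (variable names are illustrative; this is the mechanism that PatsyTransformer's design_info_ relies on, using the data and formula from the opening example):

from patsy import dmatrix, build_design_matrices

# Build the design matrix once on the training data; its design_info
# records the coding of every factor, including all levels of C(day)
train_matrix = dmatrix("tip + C(day)", data, return_type="dataframe")
design_info = train_matrix.design_info

# At inference time, reuse the stored design_info so that a batch containing
# only a few of the original levels still gets the full set of dummy columns
new_matrix = build_design_matrices([design_info], data.head(5), return_type="dataframe")[0]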

@petrhrobar
Author

petrhrobar commented Apr 4, 2022

I am trying to play around with it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the design info of a fitted PatsyTransformer into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_patsy(patsy_step, filename):
    """Attach the saved design info to a PatsyTransformer."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


save_patsy(pipe['patsy'], "clf.h5")

returns an error :(

 Object dtype dtype('O') has no native HDF5 equivalent

Sorry for adding another issue to an already large stack. However, I really feel like this is a big one.

@koaning
Owner

koaning commented Apr 4, 2022

You could also try orjson there instead; if I recall correctly it can serialize NumPy arrays too. Sure, .h5 files load a lot quicker, but we're not talking about huge data here.
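
For reference, a tiny sketch of the orjson route for plain NumPy arrays (the array here is made up; design_info itself holds richer Python objects, so it would still need manual conversion first):

import numpy as np
import orjson

coefs = np.array([1.5, -0.3, 2.0])

# orjson serializes NumPy arrays natively when this option is passed
payload = orjson.dumps(coefs, option=orjson.OPT_SERIALIZE_NUMPY)

# loading gives back a plain Python list, so re-wrap it as an array
restored = np.asarray(orjson.loads(payload))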

@koaning
Owner

koaning commented Apr 20, 2022

Come to think of it, maybe something like https://pypi.org/project/dill/ would work for your pipeline.
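
A rough sketch of that suggestion, using the fitted pipe from the opening post and assuming dill can handle everything patsy keeps inside the fitted step (which is exactly what would need verifying):

import dill

# Dump the whole fitted pipeline, PatsyTransformer step included
with open("pipe.dill", "wb") as f:
    dill.dump(pipe, f)

# Later, in the inference process
with open("pipe.dill", "rb") as f:
    pipe_loaded = dill.load(f)

pipe_loaded.predict(data)[:10]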

@koaning
Owner

koaning commented Apr 20, 2022

I think it's best to close this issue for now. The issue won't be fixed in our library, and anybody who is interested in this feature should collaborate here.

@koaning koaning closed this as completed Apr 20, 2022
@petrhrobar
Author

Just following up with another package, formulaic, which does exactly the same thing. Wrapping it inside sklearn should not be super hard.

And pickling should also be supported.

@petrhrobar
Author

petrhrobar commented Apr 26, 2022

Yes, the library seems to do exactly what the issue mentions.

I put together a simple, quick-and-dirty overview:

import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from formulaic import Formula


df = sns.load_dataset('tips')

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    """
    Builds a design matrix from a formulaic formula inside a pipeline.
    """

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fits the estimator."""
        y, X = Formula(self.formula).get_model_matrix(X)
        return self

    def transform(self, X, y=None):
        """Transforms the data with the formula."""
        y, X = Formula(self.formula).get_model_matrix(X)
        return X.values

pipe = Pipeline([
    ("formula", FormulaicTransformer('total_bill ~ tip + C(sex) + C(day)')),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

pipe.fit(df, df['total_bill'])

from joblib import dump, load

dump(pipe, 'model_github.joblib')

# Restart the kernel before loading
pipe_infer = load('model_github.joblib')

pipe_infer.predict(df)
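
To address the earlier concern about category levels at inference time, here is a minimal sketch of a stateful variant that remembers the spec learned during fit (the class name is illustrative, and it assumes formulaic's ModelMatrix.model_spec / ModelSpec.get_model_matrix API):

from formulaic import Formula
from sklearn.base import BaseEstimator, TransformerMixin

class StatefulFormulaicTransformer(TransformerMixin, BaseEstimator):
    """Remembers the model spec learned at fit time so that transform()
    reuses the same dummy coding on new data."""

    def __init__(self, formula):
        # right-hand-side formula, e.g. "tip + C(sex) + C(day)"
        self.formula = formula

    def fit(self, X, y=None):
        # the model spec records the column structure and the category
        # levels seen during fit
        self.model_spec_ = Formula(self.formula).get_model_matrix(X).model_spec
        return self

    def transform(self, X, y=None):
        # reusing the stored spec keeps the columns identical even if a
        # batch only contains a subset of the original category levels
        return self.model_spec_.get_model_matrix(X).values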

Do you think it is worth implementing it into the package?
