
PatsyTransformer in inference #505

Closed
petrhrobar opened this issue Apr 4, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@petrhrobar

petrhrobar commented Apr 4, 2022

Hey Vincent,
I am a huge fan of your sklego project, especially the patsy implementation within sklearn.

However, there is one thing I would still like your opinion on: how do you use a pipeline containing a PatsyTransformer for inference only?

As pickling is not yet supported on the patsy side, I came up with a workaround.

import seaborn as sns
from joblib import dump, load
from sklego.preprocessing import PatsyTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Load the data
data = sns.load_dataset("tips")

# Basic pipeline
pipe = Pipeline([
    ("patsy", PatsyTransformer("tip + C(day)")),
    ("model", LinearRegression())
])

# Train the pipeline
pipe.fit(data, data['total_bill'])

from sklearn.base import BaseEstimator, TransformerMixin

# Class for inference with a pre-trained model (fit only passes, no training happens)
class Model_Inferencer(BaseEstimator, TransformerMixin):
    """
    Applies a pre-trained model within a pipeline setting.
    """

    def __init__(self, pre_trained_model=None):
        self.pre_trained_model = pre_trained_model

    def transform(self, X):
        return self.pre_trained_model.predict(X)

    def predict(self, X):
        return self.pre_trained_model.predict(X)

    def fit(self, X, y=None, **fit_params):
        return self

pipe.predict(data)[:10]

# Save the model
dump(pipe['model'], 'model_github.joblib')

# Load The model
loaded_model = load('model_github.joblib')

# Create Inference Pipeline
pipe_inference = Pipeline([
    ("patsy", PatsyTransformer("tip + C(day)")),
    ("inferencer", Model_Inferencer(loaded_model))
])


# The inference pipeline still needs to be fitted before it can predict
# pipe_inference.fit(data)

# Make predictions (works only when fitted)
pipe_inference.predict(data)

I'm still not sure, though, whether something may go wrong - e.g. the dummy coding inside patsy ending up different from when the model was trained (first pipeline). Say some batch contains only 3 levels of a categorical variable while the original data had 10 of them...

What approach would you use when patsy is needed in production (inference only)?

Thanks

@petrhrobar petrhrobar added the enhancement New feature or request label Apr 4, 2022
@koaning
Owner

koaning commented Apr 4, 2022

> As the picking is not yet supported on the patsy side

Do you mean "pickling"? Or something else?

@koaning
Owner

koaning commented Apr 4, 2022

Awh heck! It seems we're not applying our standard tests to the PatsyTransformer.

@petrhrobar
Author

petrhrobar commented Apr 4, 2022

Yes, I mean pickling the pipeline and then loading it and using it only for predictions on new data. Pickling is not yet supported by patsy.

Perhaps saving design_info_ via h5?

I figured this would be very helpful to a lot of people, which is why I raised the question.

@koaning
Owner

koaning commented Apr 4, 2022

You're the first person to bring it up 😅 . I'll look into it though. Storing the design info might work, but you'd need to do that manually.
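
For illustration, a minimal sketch of what reusing stored design info looks like with plain patsy (variable names are illustrative; this is the mechanism that PatsyTransformer's design_info_ relies on, using the data and formula from the opening example):

from patsy import dmatrix, build_design_matrices

# Build the design matrix once on the training data; its design_info
# records the coding of every factor, including all levels of C(day)
train_matrix = dmatrix("tip + C(day)", data, return_type="dataframe")
design_info = train_matrix.design_info

# At inference time, reuse the stored design_info so that a batch containing
# only a few of the original levels still gets the full set of dummy columns
new_matrix = build_design_matrices([design_info], data.head(5), return_type="dataframe")[0]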

@petrhrobar
Author

petrhrobar commented Apr 4, 2022

I am trying to play around with it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the design info of a fitted PatsyTransformer into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_patsy(patsy_step, filename):
    """Attach the saved design info to a PatsyTransformer."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


save_patsy(pipe['patsy'], "clf.h5")

returns an error :(

 Object dtype dtype('O') has no native HDF5 equivalent

Sorry for adding another issue to an already large stack. However, I really feel like this is a big one.

@koaning
Owner

koaning commented Apr 4, 2022

You could also try orjson there instead; if I recall correctly it can serialize NumPy arrays too. Sure, .h5 files load a lot quicker, but we're not talking about huge data here.
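
For reference, a tiny sketch of the orjson route for plain NumPy arrays (the array here is made up; design_info itself holds richer Python objects, so it would still need manual conversion first):

import numpy as np
import orjson

coefs = np.array([1.5, -0.3, 2.0])

# orjson serializes NumPy arrays natively when this option is passed
payload = orjson.dumps(coefs, option=orjson.OPT_SERIALIZE_NUMPY)

# loading gives back a plain Python list, so re-wrap it as an array
restored = np.asarray(orjson.loads(payload))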

@koaning
Owner

koaning commented Apr 20, 2022

Come to think of it, maybe something like https://pypi.org/project/dill/ would work for your pipeline.
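
A rough sketch of that suggestion, using the fitted pipe from the opening post and assuming dill can handle everything patsy keeps inside the fitted step (which is exactly what would need verifying):

import dill

# Dump the whole fitted pipeline, PatsyTransformer step included
with open("pipe.dill", "wb") as f:
    dill.dump(pipe, f)

# Later, in the inference process
with open("pipe.dill", "rb") as f:
    pipe_loaded = dill.load(f)

pipe_loaded.predict(data)[:10]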

@koaning
Owner

koaning commented Apr 20, 2022

I think it's best to close this issue for now. The issue won't be fixed in our library, and anybody who is interested in this feature should collaborate here.

@koaning koaning closed this as completed Apr 20, 2022
@petrhrobar
Author

Just following up with another package, formulaic, which does exactly the same thing. Wrapping it inside sklearn should not be super hard.

And pickling should also be supported.

@petrhrobar
Author

petrhrobar commented Apr 26, 2022

Yes, the library seems to do exactly what the issue mentions.

I put together a simple, quick-and-dirty overview:

import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from formulaic import Formula


df = sns.load_dataset('tips')

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    """
    Builds a design matrix from a formulaic formula inside a pipeline.
    """

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fits the estimator."""
        y, X = Formula(self.formula).get_model_matrix(X)
        return self

    def transform(self, X, y=None):
        """Transforms the data with the formula."""
        y, X = Formula(self.formula).get_model_matrix(X)
        return X.values

pipe = Pipeline([
    ("formula", FormulaicTransformer('total_bill ~ tip + C(sex) + C(day)')),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

pipe.fit(df, df['total_bill'])

from joblib import dump, load

dump(pipe, 'model_github.joblib')

# Restart the kernel before loading
pipe_infer = load('model_github.joblib')

pipe_infer.predict(df)
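
To address the earlier concern about category levels at inference time, here is a minimal sketch of a stateful variant that remembers the spec learned during fit (the class name is illustrative, and it assumes formulaic's ModelMatrix.model_spec / ModelSpec.get_model_matrix API):

from formulaic import Formula
from sklearn.base import BaseEstimator, TransformerMixin

class StatefulFormulaicTransformer(TransformerMixin, BaseEstimator):
    """Remembers the model spec learned at fit time so that transform()
    reuses the same dummy coding on new data."""

    def __init__(self, formula):
        # right-hand-side formula, e.g. "tip + C(sex) + C(day)"
        self.formula = formula

    def fit(self, X, y=None):
        # the model spec records the column structure and the category
        # levels seen during fit
        self.model_spec_ = Formula(self.formula).get_model_matrix(X).model_spec
        return self

    def transform(self, X, y=None):
        # reusing the stored spec keeps the columns identical even if a
        # batch only contains a subset of the original category levels
        return self.model_spec_.get_model_matrix(X).values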

Do you think it is worth implementing it into the package?
