PatsyTransformer in inference #505
Comments
Do you mean "pickling"? Or something else? |
Awh heck! It seems we're not applying our standard tests to the |
Yes, I mean pickling the pipeline, then loading it and using it only for predictions on new data. Pickling is not yet supported by patsy. Perhaps saving I figured this would be very helpful to a lot of people, which is why I raised this question. |
You're the first person to bring it up 😅 . I'll look into it though. Storing the design info might work, but you'd need to do that manually. |
I am trying to play around with it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the coefficients of a linear model into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_coefficients(patsy_step, filename):
    """Attach the saved coefficients to a linear model."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info

save_patsy(pipe['patsy'], "clf.h5")

returns an error :(

Object dtype dtype('O') has no native HDF5 equivalent

Sorry for adding another issue to an already large stack. However, I really feel like this is a big one. |
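The error message above says that an arbitrary Python object has no HDF5 representation. One possible direction, sketched below, is to persist only plain data (strings and lists) in a format such as JSON. This is a minimal sketch and not patsy's API: the helper names `save_encoding_state` and `load_encoding_state` are hypothetical, the example levels are illustrative values, and whether a formula string plus category levels is enough to rebuild a given patsy design is an assumption that the thread does not confirm.

```python
import json
import os
import tempfile


def save_encoding_state(formula, levels, filename):
    """Write the formula string and per-column category levels as JSON."""
    state = {"formula": formula, "levels": levels}
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def load_encoding_state(filename):
    """Read back the state written by save_encoding_state."""
    with open(filename, "r", encoding="utf-8") as fh:
        state = json.load(fh)
    return state["formula"], state["levels"]


# Illustrative values only (assumed, not taken from a fitted patsy object).
formula = "total_bill ~ tip + C(sex) + C(day)"
levels = {"sex": ["Male", "Female"], "day": ["Thur", "Fri", "Sat", "Sun"]}

path = os.path.join(tempfile.mkdtemp(), "encoding_state.json")
save_encoding_state(formula, levels, path)
restored_formula, restored_levels = load_encoding_state(path)
```

Because the saved state contains only strings and lists, it avoids the object-dtype problem, at the cost of having to rebuild the transformer from that state at load time.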
You can also try |
Come to think of it. Maybe something like https://pypi.org/project/dill/ would work for your pipeline. |
I think it's best to close this issue for now. The issue won't be fixed in our library and anybody who is interested in this feature should collaborate here. |
Just following up with another package, formulaic, which does exactly the same thing. Wrapping it inside sklearn should not be super hard, and pickling should also be supported. |
Yes, the library seems to be doing exactly what the issue mentions. I put together a simple quick-and-dirty overview:

import seaborn as sns
import pandas
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from formulaic import Formula

df = sns.load_dataset('tips')

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    """Apply a formulaic formula as a scikit-learn transformer."""

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fits the estimator"""
        # Note: nothing is stored here; the formula is re-evaluated in transform.
        y, X = Formula(self.formula).get_model_matrix(X)
        return self

    def transform(self, X, y=None):
        """Transforms the data into a model matrix"""
        y, X = Formula(self.formula).get_model_matrix(X)
        return X.values

pipe = Pipeline([
    ("formula", FormulaicTransformer('total_bill ~ tip + C(sex) + C(day)')),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])
pipe.fit(df, df['total_bill'])

from joblib import dump, load
dump(pipe, 'model_github.joblib')

# Restart the kernel before loading
pipe_infer = load('model_github.joblib')
pipe_infer.predict(df)

Do you think it is worth implementing in the package? |
Hey Vincent,
I am a huge fan of your sklego project, especially the patsy implementation within sklearn.
However, there is one thing I would still like your opinion on: how do you use a pipeline containing PatsyTransformer only for inference?
As pickling is not yet supported on the patsy side, I came up with a workaround.
Still, I am not sure whether something may go wrong, e.g. wrong dummy coding inside patsy compared to when the model was trained (the first pipeline). Let's say some batch contains only 3 levels of a categorical variable while the original data had 10 of them...
What approach would you use when you need patsy in production (inference only)?
Thanks
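To make the dummy-coding concern concrete, here is a minimal plain-Python sketch of why the category levels have to be fixed at training time and reused at inference. It does not use patsy; the functions `fit_levels` and `dummy_encode` are hypothetical names, and treating the first sorted level as the reference category is an assumption made for illustration. In patsy itself, the documented route for reusing a stored design on new data is `build_design_matrices`.

```python
def fit_levels(values):
    """Record the category levels seen in the training data, in sorted order."""
    return sorted(set(values))


def dummy_encode(values, levels):
    """Treatment-code values against fixed levels; the first level is the reference."""
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"levels not seen during training: {sorted(unknown)}")
    # One column per non-reference level, regardless of what the batch contains.
    return [[int(value == level) for level in levels[1:]] for value in values]


# Training data contains four levels.
train_days = ["Thur", "Fri", "Sat", "Sun", "Sat", "Thur"]
levels = fit_levels(train_days)

# An inference batch that contains only two of the four levels.
batch = ["Sat", "Sat", "Fri"]
encoded = dummy_encode(batch, levels)

# Re-deriving the levels from the batch would change the column count.
refit = dummy_encode(batch, fit_levels(batch))
```

With the stored levels the batch still produces three columns, matching the training layout, while re-deriving levels from the batch yields a single column. That mismatch is the failure mode described above.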