KeyError [...77789,77790] index not found error when I put AutoML in sklearn pipeline and run it #517

Open
busekoseoglu opened this issue Apr 14, 2022 · 3 comments

Comments

@busekoseoglu

Code:

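from flaml import AutoML
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[('one_hot', OneHotEncoder())])
categorical_features = ['merchant_category', 'merchant_group', 'name_in_email']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])

# fill_missing and outlier_filling are custom transformers (definitions not shown)
clf = Pipeline(steps=[('missing', fill_missing()),
                      ('outlier', outlier_filling()),
                      ('preprocessor', preprocessor),
                      ('classifier', AutoML())])

clf.fit(X_train, y_train)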

Note: It works when AutoML is replaced with RandomForestClassifier.

@sonichi
Contributor

sonichi commented Apr 14, 2022

@busekoseoglu I can't reproduce this problem with my synthetic data for testing. Could you please share an example dataset to reproduce this problem?
BTW, you don't have to use one hot encoding before AutoML.fit(). It often works better without this encoding.

@busekoseoglu
Author

Of course, I'm attaching an example CSV file. I ran it without one-hot encoding, but I'm wondering whether it will work as well that way.
sampledf.csv

@sonichi
Contributor

sonichi commented Apr 15, 2022

This works for me:

from flaml import AutoML
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("https://github.com/microsoft/FLAML/files/8496779/sampledf.csv")
X = df.drop(columns="has_paid")
y = df["has_paid"]

categorical_transformer = Pipeline(steps=[('one_hot', OneHotEncoder())])
categorical_features = ['merchant_category', 'merchant_group', 'name_in_email']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', AutoML())])

clf.fit(X, y)

I removed the first two steps in your pipeline because they are undefined.

('missing', fill_missing()),
('outlier', outlier_filling()),
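
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the working example above fits on X and y with a default 0..n-1 index, while the original report fits on X_train and y_train. If y_train came from a shuffled split, its index holds the original row labels, and any label-based lookup by row position would raise a KeyError listing the missing integers. The sketch below uses only pandas and made-up values (the labels and index are hypothetical, not taken from sampledf.csv) to show that behaviour and how resetting the index avoids it.

```python
import pandas as pd

# Hypothetical stand-in for y_train after a shuffled split:
# the index carries original row labels, not 0..n-1.
y_train = pd.Series([1, 0, 1, 0], index=[77789, 12, 77790, 5])

# Row positions, as code working on a plain array would use them.
positions = [0, 1, 2, 3]


def lookup_fails(series, keys):
    """Return True if a label-based lookup of `keys` raises KeyError."""
    try:
        series.loc[keys]
        return False
    except KeyError:
        return True


# Positions 0..3 are not labels of y_train, so the lookup fails.
failed_before = lookup_fails(y_train, positions)

# After dropping the old labels, positions and labels coincide.
y_reset = y_train.reset_index(drop=True)
failed_after = lookup_fails(y_reset, positions)

print(failed_before, failed_after)
```

If this matches the failing setup, passing y_train.reset_index(drop=True) (or y_train.to_numpy()) to clf.fit might be enough to get past the error; whether that is the actual cause here has not been verified in this thread.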
