[FLAML Crash] [Classification] ValueError: Categorical categories must be unique #548

Open
mossadhelali opened this issue May 13, 2022 · 7 comments

@mossadhelali

Hey, thanks for the great system.

I am experiencing a crash with a specific dataset. I get the following error when I fit FLAML on the Higgs dataset, a binary classification dataset used in the FLAML paper:

  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
    self._search()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
    self._search_sequential()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
    use_ray=False,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
    result = training_function(trial_to_run.config)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
    self.fit_kwargs,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 323, in compute_estimator
    log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
    estimator.fit(X_train, y_train, budget, **fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 731, in fit
    X_train = self._preprocess(X_train)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 689, in _preprocess
    lambda x:
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/frame.py", line 8740, in apply
    return op.apply()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 688, in apply
    return self.apply_standard()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 692, in <lambda>
    for c in x.cat.categories]))
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/accessor.py", line 93, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2631, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 1053, in rename_categories
    cat.categories = new_categories
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 733, in categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 183, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 337, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 540, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique

Here is my script:

import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

df = pd.read_csv('higgs.csv')
X, y = df.drop('class', axis=1), df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='classification',
                 time_budget=300,
                 retrain_full='budget',
                 verbose=0, metric='macro_f1')

Your feedback is appreciated.
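
Note: the last frames of the traceback above can be reproduced with pandas alone. The sketch below assumes (this is not confirmed anywhere in the thread) that the failing lambda converts some category labels to strings, so that a column holding both the float 1.0 and the string "1.0" ends up with two identical labels. The column contents and the helper name rename_as_strings are made up for illustration.

```python
import pandas as pd

# Hypothetical column that mixes floats and strings, as can happen when
# a CSV column is parsed with mixed types.
col = pd.Series([1.0, "1.0", 2.0], dtype=object).astype("category")


def rename_as_strings(series):
    # Converting every label to str maps 1.0 and "1.0" to the same name,
    # which pandas rejects because category labels must be unique.
    new_names = [str(c) for c in series.cat.categories]
    return series.cat.rename_categories(new_names)


try:
    rename_as_strings(col)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this assumption holds, any column with mixed numeric and string values could trigger the same error, not only the Higgs file.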

@sonichi
Contributor

sonichi commented May 13, 2022

Could you share the .csv file? A few lines are enough as long as this error can be reproduced. Also, could you let me know the flaml version?

@mossadhelali
Author

Thanks for your reply, @sonichi. Here is the .csv file (higgs). I am using FLAML v0.6.3.

@sonichi
Contributor

sonichi commented May 15, 2022

Thanks. I received a warning when reading the csv:

sys:1: DtypeWarning: Columns (20,21,22,23,24,25,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.

I then found that the last row contains "?". After I removed the "?", the warning went away and I no longer got the error.
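
Note: if the "?" entries in a file really do stand for missing values, one way to avoid hand-editing the CSV is to declare that at load time with the na_values parameter of pandas.read_csv. A minimal sketch, using a small inline stand-in for the real file (the column names f1 and f2 and all values are made up):

```python
import io

import pandas as pd

# Made-up stand-in for a CSV whose last row uses "?" for missing values.
csv_text = "class,f1,f2\n1,0.5,1.2\n0,0.7,0.9\n1,?,?\n"

# Default parsing: the "?" entries force f1 and f2 to be non-numeric columns.
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring "?" as a missing-value marker keeps the columns numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes["f1"], clean.dtypes["f1"])
print(clean["f1"].isna().sum())
```

For the script in this issue, that would amount to pd.read_csv('higgs.csv', na_values='?'). This is only appropriate when "?" is known to mean "missing" in that dataset.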

@mossadhelali
Author

Thanks for your reply, @sonichi. This worked for the Higgs dataset, but I can imagine the error appearing again with other datasets. Are there any plans to fix this in a future release?

@sonichi
Contributor

sonichi commented May 17, 2022

Your suggestion is welcome here. I don't know how common it is to use "?" for missing data, or how we are supposed to infer that without an explicit hint from users. For example, we can't simply replace all "?" with "" because "?" could be a legitimate value. What would you recommend to address this kind of ambiguity?

@mossadhelali
Author

I have seen multiple OpenML datasets where NaN values are stored as "?". I understand that blindly replacing "?" with NaN might not be desirable in some situations, but how about analyzing the data types of the column values? If, for example, 90%+ of the values are integers, then "?" can be interpreted as NaN. If FLAML already has column data type checks, this could be integrated into them.
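
Note: the heuristic described above could be sketched roughly as follows. This is not FLAML code; the helper name placeholder_to_nan, its parameters, and the 0.9 default are assumptions for illustration, and the check is relaxed from "integers" to "anything that parses as a number".

```python
import pandas as pd


def placeholder_to_nan(col, placeholder="?", min_numeric_share=0.9):
    """Hypothetical helper: treat `placeholder` as NaN in a mostly numeric column."""
    is_placeholder = col.astype(str).str.strip() == placeholder
    # Blank out the placeholder, then parse; unparsable values become NaN.
    parsed = pd.to_numeric(col.mask(is_placeholder), errors="coerce")
    numeric_share = parsed.notna().mean()
    # Every value must be a number, the placeholder, or already missing.
    explained = parsed.notna() | is_placeholder | col.isna()
    if numeric_share >= min_numeric_share and explained.all():
        return parsed
    return col


mostly_numeric = pd.Series([str(i) for i in range(19)] + ["?"])
text_column = pd.Series(["red", "green", "?"])

converted = placeholder_to_nan(mostly_numeric)  # "?" becomes NaN
untouched = placeholder_to_nan(text_column)     # returned unchanged

print(converted.dtype, converted.isna().sum())
print(untouched.tolist())
```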

@sonichi
Contributor

sonichi commented May 17, 2022

Interesting idea. Do you suggest replacing "?" with NaN automatically if 90%+ of the values are integers and the remainder are "?"? What if I use "0" and "?" to represent two categories and 90% of my data is "0"?
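
Note: the ambiguity raised here can be made concrete. In the sketch below, two columns have identical contents (90% "0", 10% "?"), so no heuristic based on the values alone can tell them apart. One possible way out is an explicit, per-column hint from the user. The helper replace_missing_markers and the column names are hypothetical and not part of FLAML.

```python
import pandas as pd


def replace_missing_markers(df, markers):
    """Hypothetical helper: `markers` maps a column name to the string that means missing."""
    out = df.copy()
    for name, marker in markers.items():
        # mask() replaces the matching entries with NaN.
        cleaned = out[name].mask(out[name] == marker)
        # Use a numeric dtype if every remaining value parses as a number;
        # otherwise keep the cleaned column as is.
        try:
            out[name] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            out[name] = cleaned
    return out


# Both columns hold 90% "0" and 10% "?", but only the user knows the intent.
values = ["0"] * 9 + ["?"]
df = pd.DataFrame({"sensor": values, "answer": values})

# Only "sensor" is declared to use "?" for missing data;
# in "answer", "?" stays a legitimate category.
result = replace_missing_markers(df, {"sensor": "?"})

print(result.dtypes)
print(result["answer"].value_counts())
```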
