RegexGenerator can't support complex regular expressions and will throw an error when used on non-id type columns. #2208

Closed
jalr4ever opened this issue Sep 9, 2024 · 6 comments
Labels
  • bug: Something isn't working
  • resolution:duplicate: This issue or pull request already exists
  • resolution:WAI: The software is working as intended

Comments

@jalr4ever

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.16.1
  • Python version: 3.8.19
  • Operating System: macOS Sonoma 14.5 (M3 MAX)

Error Description

Hello, SDV. My request: I want to customize my simulation data so that a specific column outputs data that matches my regular expression rules.

I found that SDV provides RegexGenerator, and the documentation states that the column just needs to be of text type. I manually updated my column to text type using metadata.update_column. However, when running it, I encountered two issues:
Issue 1: It only supports ID column types.
Issue 2: It throws an error with complex regular expressions.

Regarding Issue 1, this might not be a bug; it could be an issue with how I'm using the API? As for Issue 2, is it a bug? How can I meet my requirements? Does SDV provide any relevant solutions?

Steps to reproduce

Issue 1: It only supports ID column types - code snippet

import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)

# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')

# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)

# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)

# Update transformers for anonymization and pseudo-anonymization
simple_regex = '^User_[A-Za-z0-9]{4}$'
# simple_regex = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'

synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),

    # 'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
    'SocialSecurityNumber': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
})

# Preprocess data
processed_data = synthesizer.preprocess(real_data)

# Train the model
synthesizer.fit_processed_data(processed_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)

# Print results
print("\nSynthetic Data:")
print(synthetic_data)

error:

  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/sdv/data_processing/data_processor.py", line 664, in update_transformers
    self._hyper_transformer.update_transformers(column_name_to_transformer)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/hyper_transformer.py", line 522, in update_transformers
    raise InvalidConfigError(
rdt.errors.InvalidConfigError: Column 'SocialSecurityNumber' is a pii column, which is incompatible with the 'RegexGenerator' transformer.

Issue 2: It throws an error with complex regular expressions - code snippet

import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)

# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')

# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)

# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)

# Update transformers for anonymization and pseudo-anonymization
# simple_regex = '^User_[A-Za-z0-9]{4}$'
# Raw string so that the backslash in \. reaches the regex engine unchanged
simple_regex = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'

synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
    'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
})

# Preprocess data
processed_data = synthesizer.preprocess(real_data)

# Train the model
synthesizer.fit_processed_data(processed_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)

# Print results
print("\nSynthetic Data:")
print(synthetic_data)

error:

  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/text.py", line 142, in reset_randomization
    self.generator, self.generator_size = strings_from_regex(self.regex_format)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 153, in strings_from_regex
    generator, size = _GENERATORS[option](args, max_repeat)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 54, in _max_repeat
    _, size = _GENERATORS[option](args, max_repeat)
KeyError: SUBPATTERN
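
The KeyError above names a regex parse-tree node type. As a rough diagnostic sketch, and assuming RDT walks the tree produced by Python's own regex parser (the traceback suggests this but does not prove it), the standard library can list the node types each pattern contains. The helper name `collect_opcode_names` is hypothetical and is not part of RDT or SDV.

```python
try:
    # Python 3.11+ ships the regex parser as a private submodule of re.
    from re import _parser as sre_parser
except ImportError:
    # Older interpreters, such as the Python 3.8 used in this report.
    import sre_parse as sre_parser

# The parser's opcodes and constants share one int-based type; use it to spot them.
OPCODE_TYPE = type(sre_parser.SUBPATTERN)


def collect_opcode_names(pattern):
    """Return the names of all opcodes and constants in the parsed pattern."""
    found = set()

    def walk(node):
        if isinstance(node, OPCODE_TYPE):
            found.add(repr(node))
        elif isinstance(node, (sre_parser.SubPattern, tuple, list)):
            for item in node:
                walk(item)

    walk(sre_parser.parse(pattern))
    return found


if __name__ == '__main__':
    simple = r'^User_[A-Za-z0-9]{4}$'
    complex_ = (r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}'
                r'(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$')
    # Node types that only the complex pattern needs.
    print(sorted(collect_opcode_names(complex_) - collect_opcode_names(simple)))
```

The capture group `( ... )*` parses to a SUBPATTERN node nested inside a repeat, and the lookahead `(?= ... )` parses to an ASSERT node. The simple pattern contains neither.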
@jalr4ever added the bug and new labels Sep 9, 2024
@npatki
Contributor

npatki commented Sep 9, 2024

Hi @jalr4ever nice to meet you! It may be helpful to separate out your two observed problems into separate GitHub tickets, as the discussions and answers for them are different. To that point, I will respond to the first issue below. For the second issue, I have created a separate ticket here, where we may continue the discussion.

Issue 1: It only supports ID column types - code snippet

Here is the documentation for sdtypes. I'm not sure where you are reading that sdtype text is compatible with a regex format. From what I see, only sdtype id is compatible with the regex. Could you point us to which documentation page you are referring to that is mentioning that text is ok?

Note that the id sdtype is meant to be used for identifiers such as primary keys, foreign keys, product codes, etc. that inherently do not have meaning of their own. Looking at your data, it seems that UserID fits this definition.

For something like SocialSecurityNumber, you have two options:

  1. (Recommended) Just make the sdtype of this column ssn. SDV will then ensure the synthetic data has social security numbers. No other configuration is required.
  2. Or you can mark the sdtype as id, which will allow you to supply a regex format for it.

And for something like CreditCardNumber, the recommendation would be to use sdtype credit_card_number. To see all options for sdtypes, please see this documentation.
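
A minimal sketch of the recommendation above, expressed as metadata updates. It reuses the `update_column` method and the simple regex from the original report. The `COLUMN_OVERRIDES` dictionary and the `apply_column_overrides` helper are hypothetical names introduced only for illustration, and the `regex_format` keyword for `id` columns is taken from the SDV documentation rather than from this thread.

```python
# Per-column settings following the recommendation in this comment.
COLUMN_OVERRIDES = {
    # Identifier with no meaning of its own: sdtype id plus a regex format.
    'UserID': {'sdtype': 'id', 'regex_format': '^User_[A-Za-z0-9]{4}$'},
    # Option 1 (recommended): a dedicated sdtype, no regex needed.
    # Option 2 would instead be {'sdtype': 'id', 'regex_format': ...}.
    'SocialSecurityNumber': {'sdtype': 'ssn'},
    'CreditCardNumber': {'sdtype': 'credit_card_number'},
}


def apply_column_overrides(metadata, overrides):
    """Apply each override through metadata.update_column."""
    for column_name, settings in overrides.items():
        metadata.update_column(column_name=column_name, **settings)
```

With the `metadata` object from the report, this would be called as `apply_column_overrides(metadata, COLUMN_OVERRIDES)` after `detect_from_dataframe`, replacing the three `update_column` calls that set `text` and `id`.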

@npatki added the under discussion label and removed the new label Sep 9, 2024
@jalr4ever
Author

Hi, @npatki , thanks for your reply!

1. Which documentation page says that the text type works with RegexGenerator?

This page shows RegexGenerator as compatible with the text type: https://docs.sdv.dev/rdt/transformers-glossary/text-id/regexgenerator

2. Why do I need to generate data from regular expressions?

In short: the preset functions provided by Faker and SDV do not support the anonymized data format of my sensitive columns.

The sample code is a simulated test scenario of mine. It was just put forward to help you understand my needs. In actual business data, relying solely on the predefined sensitive PII sdtype from SDV may not be sufficient. Through testing, I found that the preset PII sdtype only supports functions built into Faker.
However, one of my columns of sensitive data might contain complex Chinese characters, such as some descriptions or addresses in Chinese. These pieces of information do not actually match the format of the functions built into Faker. That's why I thought perhaps I could pass a regex to the field column so that it can directly generate anonymized data that conforms to my defined format?

3. One last question
I see that your proposal is to set the column where I want to pass the regular expression as the id column. I would like to understand how SDV handles the id column throughout the simulation process: what other impacts might there be in table data if multiple columns are set as id? For example, could it cause errors in multi-table simulations or affect the performance and effectiveness of simulated data?

@npatki
Contributor

npatki commented Sep 10, 2024

Hi @jalr4ever thanks for your responses!

This page shows RegexGenerator as compatible with the text type: https://docs.sdv.dev/rdt/transformers-glossary/text-id/regexgenerator

Got it. The RDT library uses a slightly different system of sdtypes than SDV, which I can understand will lead to some confusion. I have filed an issue here to better align the sdtypes between the two libraries. Since you are using an SDV synthesizer, I would recommend referring to the SDV documentation. Sorry for the confusion!

However, one of my columns of sensitive data might contain complex Chinese characters, such as some descriptions or addresses in Chinese.

Very interested to hear more about this. Do note that you can set a locale in the synthesizer, for example zh_CN for mainland China. This will set the mode of all Faker objects, and should create Chinese characters.

# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata, locales=['zh_CN'])

But I understand that your usage may be complicated. If you are able to share more information with us, we'd appreciate it! Can you provide any examples where a Regex would be fine, but Faker won't help? It will help us develop better features for SDV.

I would like to understand how SDV handles the id column throughout the simulation process

SDV will not learn any properties about the ID values, since they are only used to identify rows. Your synthetic data will contain random, newly created IDs based on the regex. If you have multi-tables, the primary and foreign key IDs will match up. So because there is no learning, this should have minimal impact on performance.
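
To illustrate the point that ID values are created from the format alone, here is a small standard-library sketch. It is not SDV's implementation. The function name `generate_user_ids` and its `seed` parameter are hypothetical, and the format is hard-coded to the `User_[A-Za-z0-9]{4}` shape from the original report.

```python
import random
import string

ALPHANUMERIC = string.ascii_letters + string.digits  # the class [A-Za-z0-9]
SUFFIX_LENGTH = 4


def generate_user_ids(count, seed=None):
    """Create `count` distinct IDs shaped like User_[A-Za-z0-9]{4}.

    No real data is consulted: the values depend only on the format.
    """
    if count > len(ALPHANUMERIC) ** SUFFIX_LENGTH:
        raise ValueError('The format cannot produce that many unique IDs.')

    rng = random.Random(seed)
    ids = []
    seen = set()
    while len(ids) < count:
        suffix = ''.join(rng.choice(ALPHANUMERIC) for _ in range(SUFFIX_LENGTH))
        candidate = 'User_' + suffix
        if candidate not in seen:  # enforce uniqueness
            seen.add(candidate)
            ids.append(candidate)
    return ids
```

Because nothing is fitted to the original column, the cost of such a column depends on how many rows are sampled rather than on the training data.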

Hope that answers your questions -- and appreciate your responses!

@jalr4ever
Author

Hi @npatki, thank you for your responses; they clear up my confusion!

Can you provide any examples where a Regex would be fine, but Faker won't help? It will help us develop better features for SDV.

Yes, I can give an example. For instance, in a certain column, I might have a categorical variable and I only want it to generate my fixed types of Chinese strings. Overall, there are probably twenty or thirty such formats, but Faker does not provide this kind of format. That's why I'm thinking of using regex; I can roughly define the shape of the data so that it appears in the simulated data. Here are some examples of those Chinese strings:

  • 正常分发
  • 延迟分发工资一
  • 延迟分发工资九
  • 漏税补交
  • ...

For another example, when Faker generates a Chinese mobile phone number, one possible generated instance could be 13633254431. However, to better align with the format of the original data, it may be necessary to add the country code to the phone number, such as +86(13633254431).
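
The two target shapes described above could be written down as regular expressions roughly like this. This is only a sketch of the formats, checked with Python's `re` module. It does not show that RegexGenerator accepts these patterns, and the numeral range 一 to 九 is an assumption, since only 一 and 九 appear in the examples. The names `STATUS_PATTERN`, `PHONE_PATTERN` and `add_country_code` are hypothetical.

```python
import re

# Fixed category strings; the numeral class 一..九 is an assumed range.
STATUS_PATTERN = re.compile(r'正常分发|延迟分发工资[一二三四五六七八九]|漏税补交')

# An 11-digit mainland mobile number wrapped as +86(...).
PHONE_PATTERN = re.compile(r'\+86\(1[0-9]{10}\)')


def add_country_code(mobile_number):
    """Wrap a bare mobile number, e.g. one produced by Faker, as +86(number)."""
    return '+86({})'.format(mobile_number)


if __name__ == '__main__':
    for status in ['正常分发', '延迟分发工资一', '延迟分发工资九', '漏税补交']:
        print(status, bool(STATUS_PATTERN.fullmatch(status)))
    formatted = add_country_code('13633254431')
    print(formatted, bool(PHONE_PATTERN.fullmatch(formatted)))
```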

@npatki
Contributor

npatki commented Sep 11, 2024

No problem, and thanks for the examples @jalr4ever.

Just as an FYI our SDV Enterprise package will allow you to maintain the phone number format/other properties, as it contains additional transformer options (see AnonymizedGeoExtractor). I understand that is only available in a paid plan.

Let me know if there are any additional Qs around this, or I can mark this as resolved. The RegexGenerator being able to generate more complex strings is in a separate issue that we will keep open for tracking.

@jalr4ever
Author

@npatki Oh yeah for sure, this issue has been resolved. I submitted an inquiry for a purchase of the SDV commercial version today and plan to try it out, hoping to receive a response soon. 🧐

@npatki added the resolution:duplicate and resolution:WAI labels and removed the under discussion label Sep 19, 2024
@npatki closed this as completed Oct 18, 2024