RegexGenerator can't support complex regular expressions and will throw an error when used on non-id type columns. #2208

Closed
@jalr4ever

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.16.1
  • Python version: 3.8.19
  • Operating System: macOS Sonoma 14.5 (M3 MAX)

Error Description

Hello, SDV. I want to customize my synthetic data so that a specific column contains values that match a regular expression I provide.

I found that SDV provides RegexGenerator, and the documentation states that the column only needs to be of text type, so I manually set my column to text type using metadata.update_column. However, when I ran it, I hit two issues:

  • Issue 1: RegexGenerator appears to support only ID column types.
  • Issue 2: It throws an error with complex regular expressions.

Regarding Issue 1, this might not be a bug; it could be a problem with how I'm using the API. As for Issue 2, is it a bug? How can I meet my requirement? Does SDV provide a solution for this?

Steps to reproduce

Issue 1: It only supports ID column types - code snippet

import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)

# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')

# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)

# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)

# Update transformers for anonymization and pseudo-anonymization
simple_regex = '^User_[A-Za-z0-9]{4}$'
# simple_regex = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'

synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),

    # 'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
    'SocialSecurityNumber': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
})

# Preprocess data
processed_data = synthesizer.preprocess(real_data)

# Train the model
synthesizer.fit_processed_data(processed_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)

# Print results
print("\nSynthetic Data:")
print(synthetic_data)

error:

  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/sdv/data_processing/data_processor.py", line 664, in update_transformers
    self._hyper_transformer.update_transformers(column_name_to_transformer)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/hyper_transformer.py", line 522, in update_transformers
    raise InvalidConfigError(
rdt.errors.InvalidConfigError: Column 'SocialSecurityNumber' is a pii column, which is incompatible with the 'RegexGenerator' transformer.

Issue 2: It throws an error with complex regular expressions - code snippet

import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)

# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')

# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)

# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)

# Update transformers for anonymization and pseudo-anonymization
# simple_regex = '^User_[A-Za-z0-9]{4}$'
# Raw string so that '\.' is not treated as an invalid string escape
simple_regex = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'

synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
    'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
})

# Preprocess data
processed_data = synthesizer.preprocess(real_data)

# Train the model
synthesizer.fit_processed_data(processed_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)

# Print results
print("\nSynthetic Data:")
print(synthetic_data)

error:

  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/text.py", line 142, in reset_randomization
    self.generator, self.generator_size = strings_from_regex(self.regex_format)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 153, in strings_from_regex
    generator, size = _GENERATORS[option](args, max_repeat)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 54, in _max_repeat
    _, size = _GENERATORS[option](args, max_repeat)
KeyError: SUBPATTERN
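
For context on the KeyError above: SUBPATTERN is the name Python's own regex parser gives to a parenthesized group. The sketch below uses only the standard library to list which parser node types each of the two patterns from this report produces. It does not touch RDT. The idea that RDT's `_GENERATORS` table is keyed on these node types is only inferred from the traceback, and `collect_ops` is a hypothetical helper written for this illustration.

```python
# Illustrative sketch only: inspects Python's regex parser, not RDT internals.
try:
    from re import _parser as sre_parser  # Python 3.11+
except ImportError:
    import sre_parse as sre_parser  # older Pythons, e.g. the 3.8 used in this report


def collect_ops(node, found=None):
    """Recursively collect the names of structural node types in a parsed pattern.

    Items inside a character class (e.g. RANGE) are intentionally not collected.
    """
    if found is None:
        found = set()
    if isinstance(node, sre_parser.SubPattern):
        for op, args in node.data:
            found.add(op.name)
            collect_ops(args, found)
    elif isinstance(node, (list, tuple)):
        for item in node:
            collect_ops(item, found)
    return found


SIMPLE = r'^User_[A-Za-z0-9]{4}$'
COMPLEX = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'

simple_ops = collect_ops(sre_parser.parse(SIMPLE))
complex_ops = collect_ops(sre_parser.parse(COMPLEX))

print("simple pattern node types:", sorted(simple_ops))
print("only in complex pattern:  ", sorted(complex_ops - simple_ops))
```

The complex pattern adds, among others, a SUBPATTERN node (the `(...)` group) and an ASSERT node (the `(?=...)` lookahead), neither of which appears in the simple pattern. If a generator handles only a fixed set of node types, a pattern containing any other node type would fail in the way the traceback shows.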

Labels

  • bug (Something isn't working)
  • resolution:WAI (The software is working as intended)
  • resolution:duplicate (This issue or pull request already exists)
