Description
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: 1.16.1
- Python version: 3.8.19
- Operating System: macOS Sonoma 14.5 (M3 MAX)
Error Description
Hello, SDV team. My request: I want to customize my synthetic data so that a specific column contains values that match my regular expression.
I found that SDV provides RegexGenerator, and the documentation states that the column only needs to be of the text sdtype. I manually set my columns to the text sdtype using metadata.update_column. However, when running the code, I encountered two issues:
Issue 1: RegexGenerator appears to work only on columns with the id sdtype.
Issue 2: RegexGenerator throws an error with complex regular expressions.
Regarding Issue 1, this might not be a bug; I may simply be using the API incorrectly. As for Issue 2, is it a bug? How can I meet my requirements, and does SDV provide a solution for this?
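As a possible stopgap for the second pattern, matching values can be generated outside SDV with plain Python and checked against the same rule. This is a minimal sketch, not an SDV feature: the helper names (random_label, random_hostname, unique_hostnames) and the size limits are assumptions made for illustration, and the regex is the complex one used in the Issue 2 snippet.

```python
import random
import re
import string

# The complex rule from the Issue 2 snippet, written as a raw string.
HOSTNAME_RULE = re.compile(
    r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'
)
ALNUM = string.ascii_letters + string.digits


def random_label(rng, max_tail=10):
    """Build one label: an alphanumeric head plus 0..max_tail chars that may include '-'."""
    head = rng.choice(ALNUM)
    tail = ''.join(rng.choices(ALNUM + '-', k=rng.randint(0, max_tail)))
    return head + tail


def random_hostname(rng, max_labels=3):
    """Join 1..max_labels labels with dots, retrying until the rule accepts the result."""
    while True:
        labels = [random_label(rng) for _ in range(rng.randint(1, max_labels))]
        candidate = '.'.join(labels)
        # The regex stays the single source of truth (it also enforces the 3..255 length).
        if HOSTNAME_RULE.fullmatch(candidate):
            return candidate


def unique_hostnames(count, seed=0):
    """Return `count` distinct values that all satisfy HOSTNAME_RULE."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < count:
        seen.add(random_hostname(rng))
    return sorted(seen)


if __name__ == '__main__':
    for value in unique_hostnames(5):
        print(value)
```

The resulting list could then be written into the sampled DataFrame, for example synthetic_data['UserID'] = unique_hostnames(len(synthetic_data)), which bypasses RegexGenerator entirely and so gives up whatever guarantees the transformer would otherwise provide.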
Steps to reproduce
Issue 1: It only supports ID column types - code snippet
import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)
# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')
# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)
# Update transformers for anonymization and pseudo-anonymization
simple_regex = '^User_[A-Za-z0-9]{4}$'
# simple_regex = '^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'
synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
    # 'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
    'SocialSecurityNumber': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
})
# Preprocess data
processed_data = synthesizer.preprocess(real_data)
# Train the model
synthesizer.fit_processed_data(processed_data)
# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)
# Print results
print("\nSynthetic Data:")
print(synthetic_data)
Error:
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/sdv/data_processing/data_processor.py", line 664, in update_transformers
    self._hyper_transformer.update_transformers(column_name_to_transformer)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/hyper_transformer.py", line 522, in update_transformers
    raise InvalidConfigError(
rdt.errors.InvalidConfigError: Column 'SocialSecurityNumber' is a pii column, which is incompatible with the 'RegexGenerator' transformer.
Issue 2: It throws an error with complex regular expressions - code snippet
import pandas as pd
from rdt.transformers import AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
# Create sample data
data = {
    'UserID': [1, 2, 3, 4, 5],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'CreditCardNumber': ['1234-5678-9012-3456', '2345-6789-0123-4567', '3456-7890-1234-5678', '4567-8901-2345-6789', '5678-9012-3456-7890'],
    'SocialSecurityNumber': ['123-45-6789', '987-65-4321', '555-55-5555', '666-66-6666', '777-77-7777']
}
real_data = pd.DataFrame(data)
# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column('CreditCardNumber', sdtype='text')
metadata.update_column('SocialSecurityNumber', sdtype='text')
metadata.update_column('UserID', sdtype='id')
# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
# Automatically assign column transformers
synthesizer.auto_assign_transformers(real_data)
# Update transformers for anonymization and pseudo-anonymization
# simple_regex = '^User_[A-Za-z0-9]{4}$'
# Raw string so that the '\.' escape reaches the regex engine unchanged
simple_regex = r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'
synthesizer.update_transformers(column_name_to_transformer={
    'CreditCardNumber': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', cardinality_rule='unique'),
    'UserID': RegexGenerator(regex_format=simple_regex, enforce_uniqueness=True),
    'SocialSecurityNumber': PseudoAnonymizedFaker(provider_name='ssn', function_name='ssn')
})
# Preprocess data
processed_data = synthesizer.preprocess(real_data)
# Train the model
synthesizer.fit_processed_data(processed_data)
# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)
# Print results
print("\nSynthetic Data:")
print(synthetic_data)
Error:
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/text.py", line 142, in reset_randomization
    self.generator, self.generator_size = strings_from_regex(self.regex_format)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 153, in strings_from_regex
    generator, size = _GENERATORS[option](args, max_repeat)
  File "/app/miniconda3/envs/sdv-tool-example/lib/python3.8/site-packages/rdt/transformers/utils.py", line 54, in _max_repeat
    _, size = _GENERATORS[option](args, max_repeat)
KeyError: SUBPATTERN
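The KeyError is raised from _max_repeat, which suggests the failure is tied to a grouped sub-expression nested inside a repeat (the `(...)*` part of the pattern), although this has not been checked against RDT's source. As a rough check, Python's standard-library regex parser can list which constructs each pattern contains. This is only a diagnostic sketch: regex_constructs is a hypothetical helper, not part of SDV or RDT, and it relies on the parser module that Python treats as private (re._parser) or deprecated (sre_parse).

```python
try:
    # Python 3.11+ keeps the parser under re._parser; sre_parse is deprecated there.
    from re import _parser as sre_parse
except ImportError:
    # Older interpreters (including the 3.8 used in this report).
    import sre_parse


def regex_constructs(pattern):
    """Return the names of the parse-tree node types found anywhere in a pattern."""
    found = set()

    def walk(node):
        if isinstance(node, sre_parse.SubPattern):
            # Each entry is an (opcode, arguments) pair.
            for op, args in node.data:
                found.add(op.name)
                walk(args)
        elif isinstance(node, (list, tuple)):
            # Arguments may nest further sub-patterns (groups, repeats, lookaheads).
            for item in node:
                walk(item)

    walk(sre_parse.parse(pattern))
    return found


simple = r'^User_[A-Za-z0-9]{4}$'
complex_pattern = (
    r'^(?=.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}'
    r'(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})*$'
)

if __name__ == '__main__':
    # Constructs that appear only in the complex pattern.
    print(sorted(regex_constructs(complex_pattern) - regex_constructs(simple)))
```

The complex pattern contains both a capturing group (SUBPATTERN) and a lookahead (ASSERT), while the simple one contains neither, which is consistent with only the complex pattern failing.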