RegexGenerator can't support complex regular expressions and will throw an error when used on non-id type columns. #2208
Comments
Hi @jalr4ever nice to meet you! It may be helpful to separate out your two observed problems into separate GitHub tickets, as the discussions and answers for them are different. To that point, I will respond to the first issue below. For the second issue, I have created a separate ticket here, where we may continue the discussion.
Here is the documentation for sdtypes. I'm not sure where you are reading that sdtype Note that the id sdtype is meant to be used for identifiers such as primary keys, foreign keys, product codes, etc. that inherently do not have meaning of their own. Looking at your data, it seems that For something like
And for something like
Hi @npatki, thanks for your reply!
1. Which documentation page am I referring to that mentions that text is OK with
This link shows that RegexGenerator is compatible with
2. Why do I need to pass data generated by regular expressions?
In short: the preset functions provided by Faker and SDV do not support the anonymous data format of my sensitive columns. The sample code is a simulated test scenario of mine, put forward only to help you understand my needs. In actual business data, relying solely on SDV's predefined sensitive PII sdtypes may not be sufficient. Through testing, I found that the preset PII sdtypes only support functions built into Faker.
3. One last question
Hi @jalr4ever thanks for your responses!
Got it. The RDT library uses a slightly different system of sdtypes than SDV, which I can understand will lead to some confusion. I have filed an issue here to better align the sdtypes between the two libraries. Since you are using an SDV synthesizer, I would recommend referring to the SDV documentation. Sorry for the confusion!
Very interested to hear more about this. Do note that you can set a locale in the synthesizer, for example:
# Create GaussianCopulaSynthesizer instance
synthesizer = GaussianCopulaSynthesizer(metadata=metadata, locales=['zh_CN'])
But I understand that your usage may be complicated. If you are able to share more information with us, we'd appreciate it! Can you provide any examples where a Regex would be fine, but Faker won't help? It will help us develop better features for SDV.
SDV will not learn any properties about the ID values, since they are only used to identify rows. Your synthetic data will contain random, newly created IDs based on the regex. If you have multiple tables, the primary and foreign key IDs will match up. So because there is no learning, this should have minimal impact on performance. Hope that answers your questions -- and appreciate your responses!
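To make the ID-plus-regex idea concrete, here is a minimal sketch. It uses a plain dictionary laid out the way SDV's single-table metadata is commonly serialized (the `sdtype` and `regex_format` keys are an assumption about that layout), and it makes no SDV calls. The column names, the pattern, and the helper `id_regexes` are hypothetical.

```python
import re

# Hypothetical table metadata as a plain dict. Assumption: the layout mirrors
# SDV's single-table metadata, with "sdtype" and "regex_format" per column.
metadata_dict = {
    "columns": {
        "user_id": {"sdtype": "id", "regex_format": "U_[0-9]{6}"},  # hypothetical column
        "age": {"sdtype": "numerical"},
    },
    "primary_key": "user_id",
}


def id_regexes(metadata):
    """Return {column name: compiled regex} for id columns that declare a regex."""
    return {
        name: re.compile(spec["regex_format"])
        for name, spec in metadata["columns"].items()
        if spec.get("sdtype") == "id" and "regex_format" in spec
    }


patterns = id_regexes(metadata_dict)

# Check whether a candidate ID has the declared shape.
print(bool(patterns["user_id"].fullmatch("U_004217")))
```

Compiling the pattern up front is also a cheap way to catch a malformed regex before handing the metadata to a synthesizer, although a pattern that Python's `re` accepts is not guaranteed to be one the generator supports.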
Hi @npatki , Thank you for your responses, it answers my confusion!
Yes, I can give an example. For instance, in a certain column, I might have a categorical variable and I only want it to generate my fixed types of Chinese strings. Overall, there are probably twenty or thirty such formats, but Faker does not provide this kind of format. That's why I'm thinking of using regex; I can roughly define the shape of the data so that it appears in the simulated data. Here are some examples of those Chinese strings:
For another example, when Faker generates a Chinese mobile phone number, one possible generated instance could be 13633254431. However, to better align with the format of the original data, it may be necessary to add a country code to the phone number, such as +86(13633254431).
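One possible workaround for the phone number case is to post-process the generated column rather than generate the final shape directly. The sketch below assumes the target format is exactly the one in the example above; `add_country_code` and `TARGET` are hypothetical names and are not part of Faker or SDV.

```python
import re

# Assumed target shape, taken from the example in this thread:
# +86(<11-digit mobile number starting with 1>)
TARGET = re.compile(r"\+86\(1[0-9]{10}\)")


def add_country_code(number: str, country_code: str = "86") -> str:
    """Wrap a bare mobile number as +<country_code>(<digits>)."""
    digits = re.sub(r"\D", "", number)  # drop spaces, dashes and other non-digits
    return f"+{country_code}({digits})"


formatted = add_country_code("13633254431")
print(formatted)  # +86(13633254431)
print(bool(TARGET.fullmatch(formatted)))  # True
```

Applied to a whole column of generated numbers, a function like this keeps the random digits from the generator and only changes the surface format.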
No problem, and thanks for the examples @jalr4ever. Just as an FYI, our SDV Enterprise package will allow you to maintain the phone number format and other properties, as it contains additional transformer options (see AnonymizedGeoExtractor). I understand that is only available in a paid plan. Let me know if there are any additional questions around this, or I can mark this as resolved. The RegexGenerator being able to generate more complex strings is in a separate issue that we will keep open for tracking.
@npatki Oh yeah for sure, this issue has been resolved. I submitted an inquiry for a purchase of the SDV commercial version today and plan to try it out, hoping to receive a response soon. 🧐
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
Hello, SDV. My request: I want to customize my simulation data so that a specific column outputs data that matches my regular expression rules.
I found that SDV provides RegexGenerator, and the documentation states that the column just needs to be of text type. I manually updated my column to text type using metadata.update_column. However, when running it, I encountered two issues:
Issue 1: It only supports ID column types.
Issue 2: It throws an error with complex regular expressions.
Regarding Issue 1, this might not be a bug; it could be an issue with how I'm using the API? As for Issue 2, is it a bug? How can I meet my requirements? Does SDV provide any relevant solutions?
Steps to reproduce
Issue 1: It only supports ID column types - code snippet
error:
Issue 2: It throws an error with complex regular expressions - code snippet
error: