Anonymize
The Anonymize operation allows you to protect sensitive data by replacing original values with anonymized ones. This operation is applied during the transformation phase of the pipeline.
Overview
The Anonymize operation enables you to:
- Protect sensitive personally identifiable information (PII)
- Replace original data with realistic fake data
- Apply different anonymization techniques to different fields
- Maintain data utility while ensuring privacy compliance
- Work with various data types (strings, numbers, dates)
Configuration
Field-Level Anonymization
The Anonymize operation is configured at the field level, allowing you to specify different anonymization methods for each sensitive field. When configuring field-level anonymization, you'll see a pen icon (✎) in the actions column that opens the Edit Function panel for more detailed configuration.
Each field can be configured with one of the following anonymization methods, which are grouped in the dropdown menu:
Use Fake Data
Replace values with realistic fake data based on field labels. When you select this option, a label dropdown appears that lets you choose which standard fake data generator to use. The dropdown lists labels grouped by type (Language, Date, Global, Custom), and the chosen label determines what kind of fake data is generated. For example, a field labeled "person/name" is replaced with fake names, while a field labeled "contact/email" is replaced with fake email addresses.
Fake data options
- Language-based labels (e.g., "person/name", "contact/email")
- Date format labels (e.g., "date/yyyy-mm-dd")
- Global labels (e.g., "global/url")
- Custom labels (e.g., "custom/IBAN")
Functions
Built-in anonymization functions that can be applied to fields.
Masking
Replace parts of values with mask characters while preserving format. For example, a credit card number "1234-5678-9012-3456" might become "****-****-****-3456".
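As a rough illustration of format-preserving masking, the sketch below hides every letter or digit except the last four and leaves separators in place. The function name `maskValue` and its parameters are hypothetical and are not part of the product; the built-in Masking function may behave differently.

```javascript
// Hypothetical helper: mask every letter or digit except the last `visible`
// ones, leaving separators such as "-" untouched so the format is preserved.
function maskValue(value, visible = 4, maskChar = "*") {
  const chars = Array.from(String(value));
  // Count how many alphanumeric characters the value contains.
  const total = chars.filter((c) => /[A-Za-z0-9]/.test(c)).length;
  let seen = 0;
  return chars
    .map((c) => {
      if (!/[A-Za-z0-9]/.test(c)) return c; // keep separators as-is
      seen += 1;
      return seen <= total - visible ? maskChar : c;
    })
    .join("");
}

console.log(maskValue("1234-5678-9012-3456")); // ****-****-****-3456
```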
Shuffling
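Randomly reorder a field's values across the records of the dataset. The set of values, and therefore their distribution, stays the same, but each value is no longer attached to its original record.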
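One way to picture shuffling is a Fisher-Yates shuffle applied to a single column. This is only a sketch under that assumption; `shuffleField` and the sample records are hypothetical, and the product's own shuffling may be implemented differently.

```javascript
// Hypothetical helper: shuffle the values of one field across records using
// the Fisher-Yates algorithm. `random` is injectable so runs can be repeated.
function shuffleField(records, field, random = Math.random) {
  const values = records.map((r) => r[field]);
  for (let i = values.length - 1; i > 0; i -= 1) {
    const j = Math.floor(random() * (i + 1));
    [values[i], values[j]] = [values[j], values[i]];
  }
  // Return new records; the input array is left unmodified.
  return records.map((r, i) => ({ ...r, [field]: values[i] }));
}

const customers = [
  { id: 1, city: "Lyon" },
  { id: 2, city: "Porto" },
  { id: 3, city: "Lyon" },
];
const shuffled = shuffleField(customers, "city");
// The same cities are present, possibly attached to different ids.
console.log(shuffled.map((r) => r.city).sort()); // [ 'Lyon', 'Lyon', 'Porto' ]
```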
List
Replace values by picking randomly from a predefined list of values.
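The idea can be sketched as a uniform random pick from a fixed array. `pickFromList` and the sample list are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: replace a value with a random entry from a fixed list.
function pickFromList(allowedValues, random = Math.random) {
  if (allowedValues.length === 0) {
    throw new Error("The replacement list must not be empty");
  }
  return allowedValues[Math.floor(random() * allowedValues.length)];
}

const departments = ["Sales", "Support", "Finance"];
console.log(pickFromList(departments)); // one of the three entries
```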
Delete field
Completely remove the field from the output dataset.
Blank field
Replace all values with null/empty values.
Saved Functions
Use a previously created and saved custom function. Saved functions come from your Project Functions, which can be reused across different models within the same project.
Custom Function
Write your own anonymization function using JavaScript code. For more information on creating custom functions, see Custom Functions.
No Action
Keep the original values unchanged (useful for testing or when certain fields don't need anonymization).
Edit Function Options
When you click the pen icon (✎) for a field with the "Fake data" anonymization method, you'll see several configuration options:
Locale: Specify the locale to use for generating fake data. This affects the cultural characteristics of the generated data such as names, addresses, and phone numbers. For example, using locale "es-ES" will generate Spanish names and addresses, while "en-US" will generate American ones. The locale is automatically set based on the selected label's locale but can be overridden.
Text Format: Control the format of the generated fake data. Options include:
- None: Keep the original formatting from the generator
- UPPERCASE: Convert all text to uppercase
- lowercase: Convert all text to lowercase
- Title Case: Capitalize the first letter of each word
- Snake_case: Convert spaces to underscores
- Kebab-case: Convert spaces to hyphens
Prefix: Add a custom prefix to all generated fake data values. Enable the prefix option with the checkbox, then enter your desired prefix in the text field. For example, with prefix "TEST_" a generated name "John Doe" would become "TEST_John Doe".
Suffix: Add a custom suffix to all generated fake data values. Enable the suffix option with the checkbox, then enter your desired suffix in the text field. For example, with suffix "_USER" a generated name "John Doe" would become "John Doe_USER".
Dictionary: Control how replacement values are mapped and reused. This option determines the scope in which generated values are stored and reused for consistency. For detailed information about dictionary modes, see Dictionary Functions.
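The Text Format, Prefix, and Suffix options can be pictured as small string transformations applied to each generated value. The sketch below assumes the text format is applied first and the prefix and suffix are then added unchanged; that ordering, the option keys, and the name `applyOptions` are assumptions made for illustration, not documented behavior.

```javascript
// Hypothetical formatters keyed by the Text Format option names.
const textFormats = {
  none: (s) => s,
  uppercase: (s) => s.toUpperCase(),
  lowercase: (s) => s.toLowerCase(),
  titleCase: (s) =>
    s.toLowerCase().replace(/(^|\s)\S/g, (m) => m.toUpperCase()),
  snakeCase: (s) => s.replace(/\s+/g, "_"),
  kebabCase: (s) => s.replace(/\s+/g, "-"),
};

// Assumption: the format is applied first, then prefix and suffix are added
// unchanged around the formatted value.
function applyOptions(value, { format = "none", prefix = "", suffix = "" } = {}) {
  const formatter = textFormats[format];
  if (!formatter) throw new Error(`Unknown text format: ${format}`);
  return `${prefix}${formatter(value)}${suffix}`;
}

console.log(applyOptions("John Doe", { prefix: "TEST_" })); // TEST_John Doe
console.log(applyOptions("John Doe", { suffix: "_USER" })); // John Doe_USER
console.log(applyOptions("John Doe", { format: "snakeCase" })); // John_Doe
```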
Dictionary Modes
When anonymizing data, you can control how replacement values are mapped using different dictionary modes:
Inherit from rule
Use the default dictionary behavior defined at the rule level.
Skip dictionary
Don't maintain consistent mapping between original and replacement values.
Label scope
Maintain consistent mapping within fields that have the same label.
Fieldname scope
Maintain consistent mapping within fields that have the same name.
Entity/Field scope
Maintain consistent mapping within the same entity and field combination.
Global scope
Maintain consistent mapping across all entities and fields.
User-defined scope
Define your own scope for consistent mapping using a custom scope string. When selected, you can specify a custom scope name in the provided text field.
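To make the scopes concrete, the sketch below models each dictionary as a map from original value to replacement, selected by a scope key. It is an illustration only: the mode identifiers, `scopeKey`, `anonymize`, and the context fields are hypothetical, and "Inherit from rule" is assumed to be resolved to one of the other modes before this step.

```javascript
// Hypothetical scope-key builder: values anonymized under the same key share
// one dictionary, so the same original maps to the same replacement.
function scopeKey(mode, ctx) {
  switch (mode) {
    case "skip":
      return null; // no dictionary: every call generates a fresh value
    case "label":
      return `label:${ctx.label}`;
    case "fieldname":
      return `field:${ctx.field}`;
    case "entityField":
      return `entity:${ctx.entity}/${ctx.field}`;
    case "global":
      return "global";
    case "userDefined":
      return `user:${ctx.customScope}`;
    default:
      throw new Error(`Unknown dictionary mode: ${mode}`);
  }
}

const dictionaries = new Map(); // scope key -> Map(original -> replacement)

function anonymize(original, mode, ctx, generate) {
  const key = scopeKey(mode, ctx);
  if (key === null) return generate();
  if (!dictionaries.has(key)) dictionaries.set(key, new Map());
  const dict = dictionaries.get(key);
  if (!dict.has(original)) dict.set(original, generate());
  return dict.get(original);
}

// Example generator that returns a new placeholder name on every call.
let counter = 0;
const fakeName = () => `Person ${++counter}`;

const ctxA = { entity: "customer", field: "name", label: "person/name" };
const ctxB = { entity: "order", field: "buyer", label: "person/name" };
console.log(anonymize("John Smith", "label", ctxA, fakeName)); // Person 1
console.log(anonymize("John Smith", "label", ctxB, fakeName)); // Person 1
console.log(anonymize("John Smith", "fieldname", ctxB, fakeName)); // Person 2
```

In this sketch the two fields share a replacement under label scope because they carry the same label, while fieldname scope keeps a separate dictionary for "buyer".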
Examples
Basic Anonymization
To anonymize customer data:
- Run a discover operation first to identify sensitive fields
- Select the customer entity
- For the "name" field, choose "Fake data" with a name label such as "person/name"
- For the "email" field, choose "Fake data" with an email label such as "contact/email"
- For the "phone" field, choose "Masking" to preserve format while hiding real numbers
Consistent Anonymization
To ensure the same customer name always gets replaced with the same fake name:
- Select "Label scope" dictionary mode for name fields
- This ensures that whenever "John Smith" appears in any field that carries the same name label, it is always replaced with the same fake name, such as "Jane Doe"
Custom Anonymization Function
To apply a custom anonymization algorithm:
- Select a field and choose "Custom function"
- Write JavaScript code that takes the original value and returns an anonymized version
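As a sketch of what such a function might look like, the example below keeps an email's domain and replaces the local part with a short repeatable token. The function name `anonymizeEmail` and its single-argument signature are assumptions; the entry point the platform actually expects is described in Custom Functions. The hash used here is simple and repeatable but not cryptographically secure, so treat it as an illustration rather than a recommended technique.

```javascript
// Hypothetical custom function: keep the email domain, replace the local part
// with a short deterministic token. The entry-point name and signature the
// platform expects are not shown in this document, so adapt as needed.
function anonymizeEmail(value) {
  if (typeof value !== "string" || !value.includes("@")) return value;
  const at = value.lastIndexOf("@");
  const local = value.slice(0, at);
  const domain = value.slice(at + 1);
  // FNV-1a 32-bit hash: simple and repeatable, but NOT cryptographically secure.
  let hash = 0x811c9dc5;
  for (let i = 0; i < local.length; i += 1) {
    hash ^= local.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return `user_${hash.toString(16).padStart(8, "0")}@${domain}`;
}

console.log(anonymizeEmail("john.smith@example.com"));
```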
This operation helps ensure your data complies with privacy regulations while maintaining realistic data characteristics for testing and development purposes.