Dictionary
What is a Dictionary?
A Dictionary in Gigantics stores how each original value was transformed, so the same value can be anonymized the same way every time it appears. Think of it as a memory that keeps anonymization consistent across your entire project.
Why Dictionaries Matter
When you anonymize data, you often need the same original value to become the same anonymized value every time it appears. This property is called referential integrity: relationships between data are maintained even after anonymization.
Example Scenario:
Dictionaries ensure that "John Smith" always becomes "Mark Johnson" and "john@example.com" always becomes "mark@email.com" throughout your entire project, maintaining data relationships and consistency.
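The lookup-or-create behaviour described above can be sketched in a few lines of Python. This is a conceptual illustration only, not Gigantics code: the `anonymize` helper and the pool of replacement names are hypothetical.

```python
import random

# Hypothetical pool of replacement values (illustration only).
FAKE_NAMES = ["Mark Johnson", "Laura Diaz", "Peter Novak", "Anna Keller"]


def anonymize(value: str, dictionary: dict, rng: random.Random) -> str:
    """Return the stored replacement for value, creating one on first sight."""
    if value not in dictionary:
        # First occurrence: pick a replacement and remember it.
        dictionary[value] = rng.choice(FAKE_NAMES)
    # Every later occurrence reuses the remembered replacement.
    return dictionary[value]


rng = random.Random()
dictionary = {}
first = anonymize("John Smith", dictionary, rng)
second = anonymize("John Smith", dictionary, rng)
print(first == second)  # the same input always maps to the same output
```

Without the dictionary, each call could pick a different name and the link between rows referring to the same person would be lost.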
Where Dictionaries Are Used in Gigantics
Dictionaries are integrated throughout the Gigantics application in several key areas:
1. Project Dictionary Page
The main dictionary management interface is located at:
Navigation Path:
This is where you can:
- View all dictionary entries
- Import/Export dictionaries
- Search and filter entries
- Clear the entire dictionary
- View summary statistics by scope
2. Rule Configuration
Dictionaries are configured when creating or editing Rules and Pipelines:
Navigation Path:
Here you configure:
- Dictionary mode (Field, Label, Global, None)
- Cache dictionary option
- Store new transformations
- Overwrite existing dictionary
3. Field-Level Transformations
When configuring individual field transformations in anonymize operations:
Navigation Path:
Each field can have its own dictionary settings:
- Dictionary mode override
- Replace label option
- Custom scope definition
- With options toggle
- Nulls handling
4. Pipeline Configuration
When setting up automated pipelines:
Navigation Path:
Pipelines inherit dictionary settings that will be used for all job executions.
Dictionary UI Components
Dictionary Main Page
The dictionary page (/projects/dictionary) provides a comprehensive interface for managing dictionary entries:
Toolbar Actions:
| Button | Icon | Function | When to Use |
|---|---|---|---|
| View Summary | Eye | Shows count of entries by scope | To get overview of dictionary structure |
| Export | Export | Downloads dictionary as CSV or JSON | To backup or migrate dictionary |
| Import | Import | Uploads dictionary from file | To restore or merge dictionaries |
| Clear Dictionary | Clear | Removes all entries | To start fresh or reset dictionary |
Search Functionality:
- Search by key (MD5 hash)
- Search by scope name
- Search by transformed value
- Real-time filtering as you type
Sorting Options:
- Sort by Scope (default)
- Sort by Key
- Sort by Value
Import Dictionary Modal
Import Options:
| Option | Description | Use Case |
|---|---|---|
| Append | Adds new entries, updates matching keys | Merging dictionaries or updating specific entries |
| Overwrite | Replaces entire dictionary | Restoring from backup or complete replacement |
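The difference between the two import options can be sketched as follows. This is a minimal illustration, assuming entries are matched on the combination of scope and key; the source only says that Append "updates matching keys", so the exact matching rule and the `import_entries` function are assumptions.

```python
def import_entries(existing, incoming, mode):
    """Combine incoming entries with existing ones according to the import mode."""
    if mode == "overwrite":
        # Overwrite: the imported file replaces the whole dictionary.
        return [dict(entry) for entry in incoming]
    if mode == "append":
        # Append: keep existing entries, update matches, add the rest.
        # Assumption: an entry matches when both scope and key are equal.
        merged = {(e["scope"], e["key"]): dict(e) for e in existing}
        for entry in incoming:
            merged[(entry["scope"], entry["key"])] = dict(entry)
        return list(merged.values())
    raise ValueError(f"unknown import mode: {mode}")


existing = [
    {"key": "k1", "val": "old", "scope": "email"},
    {"key": "k2", "val": "keep", "scope": "name"},
]
incoming = [
    {"key": "k1", "val": "new", "scope": "email"},
    {"key": "k3", "val": "added", "scope": "email"},
]
appended = import_entries(existing, incoming, "append")
overwritten = import_entries(existing, incoming, "overwrite")
```

With Append the result has three entries (k1 updated, k2 kept, k3 added); with Overwrite only the two imported entries remain.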
File Format:
- JSON format required
- Each entry must have: key, val, scope
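A sketch of building and checking an import file with the three required fields. Several details here are assumptions rather than documented behaviour: that the file is a top-level JSON array, that the key is the MD5 hash of the original value encoded as UTF-8 (the search section only says keys are MD5 hashes), and the scope name "email". The helper names are hypothetical.

```python
import hashlib
import json

REQUIRED_FIELDS = ("key", "val", "scope")


def make_entry(original, replacement, scope):
    """Build one entry. Assumption: key = MD5 of the original value."""
    key = hashlib.md5(original.encode("utf-8")).hexdigest()
    return {"key": key, "val": replacement, "scope": scope}


def validate_entries(entries):
    """Return a list of problems; an empty list means every entry is complete."""
    problems = []
    for index, entry in enumerate(entries):
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return problems


entries = [make_entry("john@example.com", "mark@email.com", "email")]
payload = json.dumps(entries, indent=2)  # text that would be saved to the file
print(validate_entries(json.loads(payload)))
```

Running a check like this before importing catches incomplete entries early, which is cheaper than discovering them after an Overwrite import.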
Export Dictionary Modal
Export Options:
| Option | Description | When to Use |
|---|---|---|
| Full Dictionary | Exports all entries | Complete backup or migration |
| Select Scope | Exports specific scopes | Partial backup or scope-specific analysis |
| CSV Format | Comma-separated values | Spreadsheet analysis or external tools |
| JSON Format | JSON structure | Programmatic use or re-import |
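The export options combine two independent choices: which scopes to include and which format to write. A minimal sketch of that combination, assuming a CSV column order of key, val, scope and using shortened placeholder keys instead of real hashes; `export_dictionary` is a hypothetical helper, not a Gigantics API.

```python
import csv
import io
import json


def export_dictionary(entries, fmt="json", scopes=None):
    """Serialize entries as JSON or CSV, optionally limited to some scopes."""
    # scopes=None means "Full Dictionary"; a set of names means "Select Scope".
    selected = [e for e in entries if scopes is None or e["scope"] in scopes]
    if fmt == "json":
        return json.dumps(selected, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["key", "val", "scope"])
        writer.writeheader()
        writer.writerows(selected)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


entries = [
    {"key": "k1", "val": "Mark Johnson", "scope": "name"},
    {"key": "k2", "val": "mark@email.com", "scope": "email"},
]
full_json = export_dictionary(entries)
email_csv = export_dictionary(entries, fmt="csv", scopes={"email"})
```

Since the import modal requires JSON, a JSON export is the one to keep for backups that may need to be restored; CSV is better suited to inspection in a spreadsheet.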
Dictionary Summary Modal
Shows a breakdown of dictionary entries by scope:
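The same breakdown can be reproduced from an exported dictionary. The sketch below assumes entries shaped like the import format (key, val, scope) and uses placeholder keys and hypothetical scope names.

```python
from collections import Counter


def summarize_by_scope(entries):
    """Count dictionary entries per scope."""
    return dict(Counter(entry["scope"] for entry in entries))


entries = [
    {"key": "k1", "val": "Mark Johnson", "scope": "name"},
    {"key": "k2", "val": "Laura Diaz", "scope": "name"},
    {"key": "k3", "val": "mark@email.com", "scope": "email"},
]
summary = summarize_by_scope(entries)
print(summary)
```

A count like this is also a quick way to spot which scope is responsible when a dictionary grows larger than expected.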
Rule Configuration - Dictionary Options
When configuring rules, you'll see the Dictionary section:
Configuration Options:
| Option | Description | Impact |
|---|---|---|
| Mode: None | Disables dictionary usage | Maximum randomness, no consistency |
| Mode: Field | Reuse per entity+field combination | Different values in different fields |
| Mode: Label | Reuse per label type | Consistent across same data types |
| Mode: Global | Reuse everywhere | Maximum consistency, single scope |
| Cache dictionary | Store in memory for faster access | Better performance, uses more memory |
| Store new transformations | Save new transformations for future use | Dictionary grows, enables reuse across jobs |
| Overwrite existing | Clear dictionary before job starts | Fresh start, removes old entries |
Field-Level Dictionary Configuration
When configuring individual field transformations:
Field-Level Options:
| Option | Description | Example Use Case |
|---|---|---|
| Inherit from rule | Uses rule-level dictionary settings | Default behavior, consistent with rule |
| Skip dictionary | Bypasses dictionary for this field | Maximum randomness for sensitive fields |
| Label scope | Uses field's label for scoping | Standard consistency within data type |
| Fieldname scope | Uses field name across entities | Consistent for fields with same name |
| Entity/Field scope | Field-specific scope | Different values per field |
| Global scope | Project-wide consistency | Maximum consistency |
| User-defined scope | Custom scope name | Custom grouping logic |
| Replace Label | Override automatic label detection | Treat field as different type |
| Scope | Custom scope identifier | Custom grouping when using user-defined |
| With options | Include function options in key | Different transformations for same value with different params |
| Nulls handling | Store and reuse null transformations | Consistent null value handling |
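The "With options" toggle is easiest to understand through its effect on the lookup key. The sketch below is an assumption about how such a key could be derived (MD5 over the value, plus the serialized function options when the toggle is on); the actual key derivation in Gigantics is not described in this page, and `dictionary_key` is a hypothetical name.

```python
import hashlib
import json


def dictionary_key(value, options=None, with_options=False):
    """Derive a lookup key; optionally mix the function options into it."""
    material = value
    if with_options and options:
        # Sorted keys make the serialization stable for equal option sets.
        material += "|" + json.dumps(options, sort_keys=True)
    return hashlib.md5(material.encode("utf-8")).hexdigest()


# Toggle off: the options are ignored, so both calls share one entry.
plain_en = dictionary_key("John", {"locale": "en"})
plain_es = dictionary_key("John", {"locale": "es"})

# Toggle on: different options produce different keys, so different entries.
opt_en = dictionary_key("John", {"locale": "en"}, with_options=True)
opt_es = dictionary_key("John", {"locale": "es"}, with_options=True)
```

Under this assumption, turning the toggle on lets the same original value keep separate transformations for each parameter set.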
Dictionary Modes Explained
Mode: None (Disabled)
What it does: Dictionary is completely disabled for this rule or field.
Behavior:
- No transformations are stored
- No lookups are performed
- Each transformation is independent
- Maximum randomness
When to use:
- When you want maximum randomization
- For one-time transformations
- When consistency is not required
- Testing or exploration scenarios
Example:
Mode: Field (Entity + Field)
What it does: Reuses transformations within the same entity and field combination only.
Behavior:
- Same value in same field → same output
- Same value in different field → different output
- Same value in different entity → different output
When to use:
- When fields should have independent transformations
- When same value means different things in different fields
- Testing field-specific anonymization
Example:
Mode: Label (Default)
What it does: Reuses transformations for fields with the same label, regardless of entity or field name.
Behavior:
- Same value + same label → same output
- Works across different entities
- Works across different field names
- Most common mode for data consistency
When to use:
- Maintaining referential integrity
- When labels represent data types (configured during discovery)
- Standard anonymization workflows
- Recommended for most use cases
Example:
Mode: Global
What it does: All transformations share a single project-wide dictionary.
Behavior:
- Same value anywhere → same output
- Maximum consistency
- Single shared scope
- Works across all entities, fields, and labels
When to use:
- Maximum referential integrity
- When you want identical values to always transform identically
- Simple, global consistency requirements
- When label detection is unreliable (check discovery settings)
Example:
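The four modes differ only in which scope a value is looked up under. A minimal side-by-side sketch, where the scope string formats ("entity.field", the bare label, "global") and the table and column names are hypothetical, chosen only to make the comparison visible:

```python
from typing import Optional


def scope_for(mode: str, entity: str, field: str, label: str) -> Optional[str]:
    """Return the scope used for lookups, or None when the dictionary is off."""
    if mode == "none":
        return None  # no lookups, no storage
    if mode == "field":
        return f"{entity}.{field}"  # one scope per entity + field
    if mode == "label":
        return label  # one scope per label
    if mode == "global":
        return "global"  # a single project-wide scope
    raise ValueError(f"unknown mode: {mode}")


# Two fields in different entities that both carry the label "email".
a = ("customers", "email", "email")
b = ("employees", "contact_email", "email")

for mode in ("none", "field", "label", "global"):
    shared = mode != "none" and scope_for(mode, *a) == scope_for(mode, *b)
    print(f"{mode:6} -> same scope for both fields: {shared}")
```

Two fields reuse each other's transformations only when they resolve to the same scope: never in None or Field mode here, and always in Label or Global mode because both fields carry the same label.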
When and Why to Use Dictionaries
Use Dictionaries When:
- Maintaining Referential Integrity
  - You need the same person/company/identifier to map consistently across multiple tables (configured via schema)
  - Foreign key relationships must be preserved
  - Data relationships matter for testing or analytics
- Consistent Anonymization Across Jobs
  - You run jobs multiple times (using pipelines)
  - You want deterministic results
  - You need to compare results over time
- Cross-Database Consistency
  - Same data appears in multiple databases (configured via sinks)
  - You need consistent anonymization across all sources
  - Migrations between environments
- Realistic Test Data
  - Generated data needs to look realistic
  - Relationships must make sense
  - Consistency improves data quality
- Compliance and Auditing
  - Trackable anonymization patterns
  - Reproducible transformations
  - Audit trail of transformations
Don't Use Dictionaries When:
- Maximum Randomization Needed
  - Security testing
  - Privacy-critical scenarios
  - When uniqueness is more important than consistency
- One-Time Transformations
  - Single-use data exports
  - No future reuse needed
  - Disposable test environments
- Different Contexts Require Different Values
  - When "John" in Customer table should differ from "John" in Employee table
  - Context-dependent anonymization
  - Field-specific privacy requirements
Strategies for Using Dictionaries in Gigantics
Strategy 1: Label-Based Consistency (Recommended)
Best for: Most standard anonymization workflows
Setup:
- Configure rule with Dictionary Mode: Label
- Ensure fields are properly labeled (person/name, email, phone, etc.)
- Enable "Store new transformations in the dictionary"
- Enable "Cache dictionary" for performance
Benefits:
- Automatic consistency across related data types
- Works across multiple entities
- Maintains referential integrity
- Easy to configure
Example Workflow:
Strategy 2: Progressive Dictionary Building
Best for: Iterative development and refinement
Setup:
- Start with "Store new transformations" enabled
- Run initial job with smaller dataset
- Review dictionary entries
- Export dictionary for backup
- Run the full job; the dictionary already contains the entries created by the initial run
Benefits:
- Build consistency over time
- Test with smaller datasets first
- Can refine and re-import dictionary
- Incremental approach
Workflow:
Strategy 3: Scope-Specific Dictionaries
Best for: Complex projects with different consistency requirements
Setup:
- Use User-defined scope mode for specific fields
- Define custom scopes (e.g., "customer-identifiers", "financial-data")
- Group related fields under same scope
- Different scopes maintain separate dictionaries
Benefits:
- Fine-grained control
- Different consistency rules per data type
- Flexible grouping
- Can export/import specific scopes
Example:
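A sketch of how user-defined scopes group fields. The scope names come from the setup above; the entity and field names, the `ScopedDictionary` class, and the generated replacement values are hypothetical and only illustrate that each scope keeps its own mapping.

```python
import itertools
from collections import defaultdict

# Hypothetical assignment of fields to user-defined scopes.
FIELD_SCOPES = {
    ("customers", "tax_id"): "customer-identifiers",
    ("invoices", "customer_tax_id"): "customer-identifiers",
    ("payments", "iban"): "financial-data",
}


class ScopedDictionary:
    """Keeps one independent value mapping per scope."""

    def __init__(self):
        self._entries = defaultdict(dict)
        self._ids = itertools.count(1)

    def transform(self, entity, field, value):
        scope = FIELD_SCOPES[(entity, field)]
        store = self._entries[scope]
        if value not in store:
            # Placeholder replacement; a real job would call a transformation.
            store[value] = f"{scope}-{next(self._ids)}"
        return store[value]


d = ScopedDictionary()
a = d.transform("customers", "tax_id", "X123")
b = d.transform("invoices", "customer_tax_id", "X123")
c = d.transform("payments", "iban", "X123")
print(a == b, a == c)
```

The two fields grouped under customer-identifiers share a transformation for "X123", while the same value under financial-data gets its own.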
Strategy 4: Pipeline with Dictionary Reuse
Best for: Scheduled jobs and automation
Setup:
- Configure pipeline with dictionary settings
- Enable "Store new transformations"
- Disable "Overwrite existing dictionary"
- Schedule pipeline to run regularly
Benefits:
- Dictionary grows over time
- Consistency across scheduled runs
- Automated consistency
- Can export dictionary between runs
Workflow:
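The effect of the two settings across scheduled runs can be simulated in a few lines. This is a simplified model, not Gigantics code: `run_job`, its parameter names, and the `anon-N` replacement values are hypothetical stand-ins for a pipeline execution.

```python
import itertools

_ids = itertools.count(1)


def run_job(values, dictionary, store_new=True, overwrite_existing=False):
    """Simulate one job execution against a persistent dictionary."""
    if overwrite_existing:
        dictionary.clear()  # "Overwrite existing": start from an empty dictionary
    output = []
    for value in values:
        if value in dictionary:
            output.append(dictionary[value])  # reuse the earlier transformation
            continue
        replacement = f"anon-{next(_ids)}"
        if store_new:
            dictionary[value] = replacement  # "Store new transformations"
        output.append(replacement)
    return output


dictionary = {}
run1 = run_job(["alice", "bob"], dictionary)
run2 = run_job(["bob", "carol"], dictionary)
print(run1, run2, len(dictionary))
```

In this model "bob" receives the same replacement in both runs and the dictionary grows from two to three entries. Passing `overwrite_existing=True` on a later run would clear the dictionary first and break that continuity, which is why the setup above disables it.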
Strategy 5: Dictionary Import/Export Workflow
Best for: Multi-environment deployment and migration
Setup:
- Develop dictionary in development environment
- Export dictionary after testing
- Import dictionary to staging/production
- Use same dictionary across environments
Benefits:
- Consistent anonymization across environments
- Can test dictionary before production
- Reproducible deployments
- Backup and restore capability
Workflow:
Strategy 6: Field-Level Overrides
Best for: Mixing consistency and randomness
Setup:
- Rule-level: Dictionary Mode: Label (default)
- Most fields: Inherit from rule
- Specific fields: Override with "Skip dictionary" or different mode
Benefits:
- Default consistency for most fields
- Specific control for sensitive fields
- Flexible per-field configuration
- Best of both worlds
Example:
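A sketch of how a field's effective mode could be resolved from the rule default and a per-field override. The field paths, the override keywords ("inherit", "skip"), and the `effective_mode` function are hypothetical; they mirror the options listed in the field-level table rather than any documented configuration format.

```python
RULE_MODE = "label"  # rule-level default

# Hypothetical per-field settings; fields not listed inherit from the rule.
FIELD_OVERRIDES = {
    "customers.email": "inherit",
    "customers.password_hint": "skip",
    "customers.notes": "field",
}


def effective_mode(field_path, rule_mode=RULE_MODE, overrides=FIELD_OVERRIDES):
    """Resolve the dictionary mode that applies to one field."""
    choice = overrides.get(field_path, "inherit")
    if choice == "inherit":
        return rule_mode  # use the rule-level setting
    if choice == "skip":
        return "none"  # bypass the dictionary for this field
    return choice  # an explicit mode override


for path in ("customers.email", "customers.password_hint", "customers.notes"):
    print(path, "->", effective_mode(path))
```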
Strategy 7: Null Handling Strategy
Best for: Datasets with many null values
Setup:
- Enable "Nulls handling" in dictionary options
- Nulls will be consistently transformed
- Useful for maintaining data patterns
Benefits:
- Consistent null value anonymization
- Preserves null patterns in data
- Can transform nulls to consistent placeholder
Example:
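A sketch of what consistent null handling could look like. The behaviour when the option is disabled (nulls pass through untouched), the "N/A" placeholder, and the `transform` function are all assumptions made for illustration; this page does not specify them.

```python
NULL_PLACEHOLDER = "N/A"  # hypothetical placeholder for transformed nulls


def transform(value, dictionary, handle_nulls, make_replacement):
    """Transform one value, storing nulls in the dictionary only when enabled."""
    if value is None and not handle_nulls:
        return None  # assumption: nulls are left as-is when the option is off
    if value not in dictionary:
        if value is None:
            dictionary[value] = NULL_PLACEHOLDER
        else:
            dictionary[value] = make_replacement(value)
    return dictionary[value]


def reverse(text):
    """Toy transformation used only for this example."""
    return text[::-1]


with_nulls = {}
enabled = [transform(v, with_nulls, True, reverse) for v in [None, "abc", None]]

without_nulls = {}
disabled = [transform(v, without_nulls, False, reverse) for v in [None, "abc", None]]
```

With the option on, every null maps to the same stored placeholder; with it off, nulls never reach the dictionary.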
Best Practices
1. Start with Label Mode
- Most versatile and useful mode
- Works automatically with discovery labels
- Provides good balance of consistency and flexibility
2. Enable Caching for Performance
- Cache dictionary option improves lookup speed
- Especially important for large dictionaries
- Uses more memory but makes lookups significantly faster
3. Store Transformations for Reuse
- Enable "Store new transformations" unless you need one-time jobs
- Builds dictionary over time
- Enables consistency across job runs
4. Export Regularly
- Export dictionary as backup
- Export before major changes
- Export for migration between environments
5. Use Appropriate Scope Granularity
- Too broad (Global): May cause unintended consistency
- Too narrow (Field): May miss relationships
- Just right (Label): Balances consistency and flexibility
6. Monitor Dictionary Size
- Large dictionaries may impact performance
- Use "View Summary" to monitor by scope
- Consider scope-specific exports if too large
7. Test Before Production
- Build dictionary in development
- Test with sample datasets
- Export and import to staging (via sinks)
- Verify consistency
8. Document Custom Scopes
- Document user-defined scopes
- Keep scope naming consistent
- Document why certain fields use custom scopes
Common Use Cases
Use Case 1: Customer Database Anonymization
Scenario: Anonymize customer data while maintaining relationships
Configuration:
- Dictionary Mode: Label
- Store new transformations: Yes
- Cache dictionary: Yes
Result:
- Customer "John Smith" → "Mark Johnson" everywhere
- Email "john@example.com" → "mark@email.com" everywhere
- Relationships preserved across tables
Use Case 2: Multi-Database Consistency
Scenario: Same data in multiple databases, need consistent anonymization
Configuration:
- Dictionary Mode: Global
- Store new transformations: Yes
- Export dictionary after first run
- Import into subsequent database jobs
Result:
- Identical anonymization across all databases
- Can share dictionary between projects
Use Case 3: Incremental Data Processing
Scenario: Process new data periodically, maintain consistency with historical data
Configuration:
- Dictionary Mode: Label
- Store new transformations: Yes
- Overwrite existing: No
- Run pipeline on schedule
Result:
- New data uses existing dictionary
- New entries added to dictionary
- Growing consistency over time
Use Case 4: Selective Consistency
Scenario: Some fields need consistency, others need randomness
Configuration:
- Rule Default: Label mode
- Specific fields: Skip dictionary or Field mode
Result:
- Important fields: Consistent
- Sensitive fields: Random
- Flexible per-field control
Troubleshooting
Dictionary Not Working
Problem: Transformations are different each run
Solutions:
- Check dictionary mode is not "None"
- Verify "Store new transformations" is enabled
- Check if "Overwrite existing" is clearing dictionary
- Ensure cache is enabled for performance
Performance Issues
Problem: Job runs slowly with dictionary enabled
Solutions:
- Enable "Cache dictionary" option
- Check dictionary size - may need to clear old entries
- Consider scope-specific dictionaries
- Monitor with dictionary summary
Inconsistent Results
Problem: Same value transforming differently
Solutions:
- Check if using correct dictionary mode
- Verify labels are consistent across fields (check discovery results)
- Check if field-level overrides are set in rule configuration
- Review scope settings
Dictionary Too Large
Problem: Dictionary has too many entries
Solutions:
- Use "View Summary" to identify large scopes
- Export specific scopes only
- Clear dictionary and rebuild if needed
- Consider splitting into multiple scopes
Summary
Dictionaries are a powerful feature in Gigantics that enable:
- Consistent anonymization across jobs and databases
- Referential integrity preservation
- Flexible configuration from global to field-level
- Import/Export for backup and migration
- Performance optimization through caching
- Fine-grained control through modes and scopes
Start with Label mode for most scenarios, enable caching and storage for best results, and use export/import for backup and migration workflows.