Planning Disk Space
Disk requirements depend on the types of data Gigantics stores and the volumes of data handled at your site. Using Gigantics does not necessarily mean storing data locally: anonymization and/or synthesis can run within a single database, or between two or more data sources (taps and sinks).
If no significant data is moved into Gigantics, the basic metadata will typically fit in a MongoDB database of under 1GB; only data and logs from the load, pump, and dump jobs are stored.
In total, Gigantics stores five types of data:
| Data Type | Description | Estimated Size | Required/Optional |
|---|---|---|---|
| Metadata | Users, datasource connection data, schemas, rules, audits | <1GB | Required |
| Logs | Job and pipeline logs, data dump/load/pump details, event logging | 100MB-10GB | Required |
| Dictionaries | Hashed data fields that are anonymized | 10MB-10GB | Optional |
| Datasets | Data dumps - i.e. anonymized data from production databases that are loaded on-demand into other datasources | 1GB-100GB+ | Optional |
| Backups | Datasets that hold secure original data (before modification). Used only if critical production databases are used for in-place anonymization | 1GB-100GB+ | Optional |
Estimated sizes are typical values and may not match your installation.
Guidelines for Estimating Disk Space Requirements
Planning requires careful analysis of your datasources. Use the following guidelines to estimate your disk space requirements:
- Start with the minimum: Allocate at least 10GB for basic metadata and logs, even if you plan to use optional features only occasionally.
- Analyze your data sources:
  - Count the number of databases and their approximate sizes
  - Identify which databases will be used for anonymization/synthesis
  - Determine whether you'll be using in-place anonymization (which requires backups)
- Estimate optional storage needs:
  - For dictionaries: plan for 1-2% of your total data volume
  - For datasets: plan for 10-50% of your largest datasource if you'll be creating copies
  - For backups: plan for 100% of any database where you'll do in-place anonymization
- Account for log retention:
  - Default log retention is 30 days
  - Adjust the allocation if you need longer retention periods
- Add a safety margin: Always add a 20-30% buffer to your calculated requirements to account for growth and unexpected needs.
- Consider concurrent operations: If you run multiple jobs simultaneously, ensure adequate space for temporary files during processing.
Example Calculation
For a setup with two datasources (10GB and 50GB), using Gigantics for dictionary creation and occasional dataset exports:
- Metadata and logs: 2GB
- Dictionaries: ~1.2GB (2% of 60GB total)
- Datasets: 20GB (exporting ~40% of the largest datasource)
- Total recommended allocation: ~23GB (with a 30% safety margin: ~30GB)
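The example arithmetic can be sketched as a small Python helper. The function name and the default ratios (2% for dictionaries, 40% export share, 30% margin) are illustrative assumptions taken from this example, not Gigantics settings:

```python
def estimate_disk_gb(datasource_sizes_gb,
                     metadata_and_logs_gb=2.0,
                     dictionary_ratio=0.02,
                     dataset_export_ratio=0.40,
                     safety_margin=0.30):
    """Rough disk-space estimate following the guidelines above (illustrative only)."""
    total_data = sum(datasource_sizes_gb)
    dictionaries = dictionary_ratio * total_data                 # ~2% of total data volume
    datasets = dataset_export_ratio * max(datasource_sizes_gb)   # share of the largest source
    subtotal = metadata_and_logs_gb + dictionaries + datasets
    return subtotal * (1 + safety_margin)                        # add the safety buffer

# Two datasources of 10GB and 50GB, as in the example above:
print(f"{estimate_disk_gb([10, 50]):.1f} GB")   # roughly 30 GB recommended
```

Adjust the ratios to match your own retention, export, and backup plans; the structure of the calculation stays the same.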