Discover
In this document you will learn how to:
- Create a new discover job.
- Identify fields and sensitive data of your tap.
- Classify fields with system or custom labels.
The discovery job is the most important step in the context of a model. We recommend labelling the data as accurately as possible, because the labels are the basis for anonymizing the tap or generating synthesized data automatically.
What is a discovery
A discovery is a process in which Gigantics analyzes the tap data and classifies each field, giving each candidate label a probability (as a percentage) that the values of the field belong to that label.
Labels can be modified, deleted or customized to fine-tune the labelling process.
These labels are useful because:
- They are used to classify your data and analyze the risk of the fields in your tap.
- Some of the transformations later applied by the rules rely on the field labels to assign values that resemble the real ones.
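To make the idea of a per-label probability concrete, here is a minimal sketch. It is not how Gigantics implements discovery: the `LABEL_PATTERNS` table and the `label_probabilities` function are hypothetical, and the regular expressions are deliberately simple. It only illustrates scoring a field as the percentage of its values that look like a given label.

```python
import re

# Hypothetical label patterns, for illustration only (not Gigantics' real labels).
LABEL_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{6,14}\d"),
}


def label_probabilities(values):
    """Return, per label, the percentage of non-empty values that fully match its pattern."""
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return {label: 0.0 for label in LABEL_PATTERNS}
    return {
        label: 100.0 * sum(1 for v in sample if pattern.fullmatch(v)) / len(sample)
        for label, pattern in LABEL_PATTERNS.items()
    }


# Three of the four values look like e-mail addresses.
column = ["ana@example.com", "bob@example.org", "n/a", "carol@example.net"]
probabilities = label_probabilities(column)
print(probabilities)
```

In this sketch a field can score against several labels at once, which is why a later step has to decide which label, if any, is actually assigned.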
Create new discover
You will find two types of discover:
- Complete discover: Takes longer, because it analyzes the database and assigns labels to each of the fields according to the stored data.
- Scan only: Performs only a quick scan to get the schema of your tap. After this job finishes, you can view the schema on the Schema page.
In both cases, Gigantics will create a new job that you can find on the Jobs page of the model.
If you scan your database first, during the discover setup you can select the entities you want to discover.
Complete discover
Entities
Allows you to include or exclude the entities you want to discover. The more entities you include, the more complete the final report will be. You can choose which entities are part of the analysis, or which are left out, by selecting tables or by using a regular expression.
This lets you discover your tap partially. To use this step, the tap must have been scanned previously.
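As an illustration of selecting entities with a regular expression, the following sketch filters a list of scanned entity names. The `select_entities` function, its parameters and the entity names are hypothetical; the exact matching rules Gigantics applies may differ.

```python
import re


def select_entities(entities, include=None, exclude=None):
    """Keep entities that match `include` (if given) and do not match `exclude` (if given)."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for name in entities:
        if include_re and not include_re.search(name):
            continue  # not part of the analysis
        if exclude_re and exclude_re.search(name):
            continue  # explicitly left out
        selected.append(name)
    return selected


# Hypothetical entity names obtained from a previous scan.
scanned = ["customers", "customer_addresses", "orders", "audit_log", "tmp_import"]
selected = select_entities(scanned, include=r"^(customer|orders)", exclude=r"^tmp_")
print(selected)
```

Trying a pattern against the scanned entity list first is a cheap way to check that it selects what you expect before launching a long discovery.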
Configuration
In this step, you can configure your discover job settings.
- Merge with previous discover: If a previous discovery exists, its existing labels are not overwritten.
- Rate limit: Limits the number of rows processed per second to avoid overloading the servers.
- Concurrency: Specifies the number of times Gigantics will analyze the table columns. The higher the concurrency, the more accurate the classification will be, but the longer it will take. Recommended values: 1, 2 or 3.
- Row limit: Sets the limit of the row sample to be analyzed. The larger the sample, the more accurate the analysis will be, but also the longer it will take.
- Labels probability: Sets the probability threshold beyond which the system automatically assigns a label to the field.
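The interaction between the "Labels probability" and "Merge with previous discover" settings can be sketched as follows. This is an assumption-laden illustration, not the product's logic: `assign_labels`, its arguments and the inclusive comparison (`>=`) are all hypothetical.

```python
def assign_labels(probabilities, threshold, previous=None, merge=True):
    """Assign each field its most probable label when it reaches `threshold` (in %).

    probabilities: {field: {label: percentage}}
    previous: {field: label} from an earlier discovery, if any
    """
    previous = previous or {}
    assigned = {}
    for field, scores in probabilities.items():
        if merge and field in previous:
            # Merge mode: labels from the previous discovery are kept as they are.
            assigned[field] = previous[field]
            continue
        if not scores:
            continue
        label, score = max(scores.items(), key=lambda item: item[1])
        # Assumption: the threshold is inclusive.
        if score >= threshold:
            assigned[field] = label
    return assigned


probabilities = {
    "email": {"email": 92.0, "name": 3.0},
    "notes": {"name": 40.0},
    "phone": {"phone": 85.0},
}
previous = {"phone": "custom_contact"}
result = assign_labels(probabilities, threshold=80, previous=previous)
print(result)
```

Here `notes` stays unlabelled because its best score is under the threshold, and `phone` keeps the label from the earlier discovery because merging is enabled.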
Schedule
Allows you to schedule the analysis for a specific date and time, or to run it immediately. Either way, the process runs as a job that you can consult at any time in the Jobs window.
Scan only
Scan the tap to get the schema and the list of entities and fields.
Heatmap
The heatmap is a visual representation of the risk of the database in case of a database leak.
This representation is based on the labels automatically assigned by Gigantics. If you want to know more about how these labels work, you can see more information here.
Each label contains two parameters that represent the risk of an entity:
- PII Field: Indicates if the field contains sensitive data.
- Severity: Shows the sensitivity of the field based on the data it contains.
Based on these two parameters, the entity is drawn with a color ranging from green (no risk) to strong red (very high risk).
From the heatmap, you can select one or several entities and edit the labels that were assigned in the discover process.
If you ran a partial discovery that did not include all entities, the undiscovered entities are displayed in gray.
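One way to picture how the two label parameters could turn into a color is the sketch below. The severity scale (0 to 5), the rule that the riskiest PII field drives the color, and the `entity_color` function are all assumptions made for illustration; the documentation does not specify the actual formula.

```python
def entity_color(fields, max_severity=5):
    """Return a hex color for an entity.

    fields: list of (is_pii, severity) tuples, or None if the entity was not discovered.
    """
    if fields is None:
        return "#808080"  # gray: entity not included in the discovery
    # Assumption: only PII fields contribute, and the riskiest field drives the color.
    worst = max((severity for is_pii, severity in fields if is_pii), default=0)
    risk = min(max(worst / max_severity, 0.0), 1.0)
    # Linear blend from green (risk 0) to red (risk 1).
    red = round(255 * risk)
    green = round(255 * (1 - risk))
    return f"#{red:02x}{green:02x}00"


print(entity_color([(False, 3), (False, 0)]))  # no PII fields
print(entity_color([(True, 5), (False, 1)]))   # a PII field with the highest severity
print(entity_color(None))                      # undiscovered entity
```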
Edit labels
Before confirming an entity, you can modify the labels of each of its fields.
By default, each label has two parameters: whether the field contains sensitive data (e.g. a personal identification or a complete address), and a severity level.
These parameters can be modified on this page, but the changes only apply to that discover. If you want to change the default values, you can do so from Config -> Labels.
It is up to the user to evaluate the severity and the risk of the field and set it. Gigantics adds default values to facilitate this task.
Confirm entities
After editing the field labels, you can confirm that the entity is correct. Be careful: after confirmation, the field labels cannot be changed without giving a reason.
By confirming the entities, we prevent other users from changing the labeling and manipulating the data risk. This is important for external audits in the future.
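The confirmation behavior can be modelled roughly as shown below. The `Entity` class, its methods and the choice of exception are hypothetical, and this is not Gigantics' data model. The sketch only shows the idea that label changes on a confirmed entity require a reason, and that every change is recorded for later audits.

```python
class Entity:
    """Hypothetical model of an entity whose labels can be confirmed."""

    def __init__(self, name, labels):
        self.name = name
        self.labels = dict(labels)  # {field: label}
        self.confirmed = False
        self.history = []           # (field, old_label, new_label, reason)

    def confirm(self):
        self.confirmed = True

    def set_label(self, field, label, reason=None):
        # Once confirmed, a change is only accepted together with a reason.
        if self.confirmed and not reason:
            raise PermissionError(
                f"{self.name} is confirmed: a reason is required to change labels"
            )
        self.history.append((field, self.labels.get(field), label, reason))
        self.labels[field] = label


entity = Entity("customers", {"contact": "phone"})
entity.set_label("contact", "email")  # allowed: not confirmed yet
entity.confirm()
try:
    entity.set_label("contact", "name")
    rejected = False
except PermissionError:
    rejected = True
entity.set_label("contact", "name", reason="Column repurposed after migration")
print(rejected, entity.labels, len(entity.history))
```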