Discovery

Labels

Labels in Gigantics are classifications assigned to database fields during the discovery process. These labels determine how data will be handled during anonymization and synthesis operations.

System Labels

Gigantics ships with a set of predefined system labels that discovery uses to detect common data types automatically, including many kinds of sensitive data. The labels are grouped into categories by the type of information they identify:

Business Information

Label | Description | PII Status | Risk Level | Detection Method
business/company | Company names and organizations | No | Low | Column name patterns, contextual data analysis
business/department | Department names within organizations | No | Low | Column name patterns, contextual data analysis
business/job_title | Professional job titles | No | Low | Column name patterns, predefined lists of job titles

Datetime Information

Label | Description | PII Status | Risk Level | Detection Method
datetime/date/format1 to datetime/date/format12 | Various date formats (MM/DD/YYYY, DD/MM/YYYY, etc.) | Conditionally | Low to Medium | Pattern matching against multiple date format regexes
datetime/time_zone | Time zone identifiers | No | Low | Column name patterns, predefined time zone lists
datetime/time | Time values | Conditionally | Low | Pattern matching, column name analysis
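
Matching a value against several date format regexes can be pictured with the following minimal Python sketch. The pattern names and the three regexes are illustrative assumptions; the actual definitions behind datetime/date/format1 to datetime/date/format12 are not documented here.

```python
import re

# Hypothetical patterns, for illustration only. They are NOT the real
# format1..format12 definitions used by Gigantics.
DATE_PATTERNS = {
    "slash_separated": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),  # MM/DD/YYYY or DD/MM/YYYY
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),             # YYYY-MM-DD
    "dot_separated": re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}"),  # DD.MM.YYYY
}


def matching_date_formats(value):
    """Return the names of all patterns that match the whole value."""
    text = value.strip()
    return [name for name, pattern in DATE_PATTERNS.items() if pattern.fullmatch(text)]
```

A regex of this kind checks shape only: it cannot tell MM/DD/YYYY from DD/MM/YYYY for a value such as 01/02/2023, which is one reason a detector might keep several format labels and compare how many sample values fit each.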

Financial Information

Label | Description | PII Status | Risk Level | Detection Method
finance/bitcoin | Bitcoin addresses | Conditionally | High | Pattern matching using Bitcoin address format regex
finance/creditcard_type | Credit card type identifiers | No | Medium | Column name patterns, predefined credit card type lists
finance/creditcard | Credit card numbers | Yes | Very High | Pattern matching using Luhn algorithm validation
finance/currency_code | Currency codes (USD, EUR, etc.) | No | Low | Column name patterns, predefined currency code lists
finance/currency | Currency names and symbols | No | Low | Column name patterns, predefined currency lists
finance/ethereum | Ethereum addresses | Conditionally | High | Pattern matching using Ethereum address format regex
finance/iban | International Bank Account Numbers | Yes | Very High | Pattern matching using IBAN format validation
finance/money | Monetary values | No | Medium | Pattern matching, column name analysis
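
For readers unfamiliar with the checks named in the table, here is a minimal Python sketch of the Luhn checksum and of the standard IBAN mod-97 check. The function names and the accepted length ranges are assumptions made for illustration. Whether Gigantics applies the mod-97 step as part of its "IBAN format validation" is also an assumption.

```python
def luhn_valid(number):
    """Check a card-like number with the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    # Assumed length range of 13 to 19 digits, typical for payment cards.
    if not (cleaned.isascii() and cleaned.isdigit() and 13 <= len(cleaned) <= 19):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban):
    """Check an IBAN with the mod-97 rule (structure per country is not checked)."""
    cleaned = iban.replace(" ", "").upper()
    if not (cleaned.isascii() and cleaned.isalnum() and 15 <= len(cleaned) <= 34):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = cleaned[4:] + cleaned[:4]
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1
```

Checksums like these reduce false positives: a random 16-digit number passes the Luhn check only about one time in ten, so a column where nearly every value passes is a much stronger signal than a digit-count match alone.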

Health Information

Label | Description | PII Status | Risk Level | Detection Method
health/drug | Drug names and medications | No | Medium | Column name patterns, predefined drug name databases

Identifiers

Label | Description | PII Status | Risk Level | Detection Method
identifier/dea | DEA (Drug Enforcement Administration) numbers | Yes | High | Pattern matching using DEA format validation
identifier/dni | National identity document (DNI) numbers | Yes | Very High | Pattern matching using DNI format validation
identifier/isbn | International Standard Book Numbers | No | Low | Pattern matching using ISBN format validation
identifier/nhs | National Health Service numbers | Yes | High | Pattern matching using NHS number format validation
identifier/nino | National Insurance Numbers | Yes | High | Pattern matching using NINO format validation
identifier/ssn | Social Security Numbers | Yes | Very High | Pattern matching using SSN format validation
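
As an illustration of what format validation for an identifier can involve, the sketch below checks an NHS number using its published modulus 11 check digit. The function name is hypothetical, and the table above states only that NHS numbers are matched by "format validation", so treating the check digit as part of that validation is an assumption.

```python
def nhs_number_valid(value):
    """Check a 10-digit NHS number using the modulus 11 check digit."""
    cleaned = value.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit() and len(cleaned) == 10):
        return False
    digits = [int(char) for char in cleaned]
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    weighted_sum = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (weighted_sum % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is never valid.
    return check != 10 and check == digits[9]
```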

Location Information

Label | Description | PII Status | Risk Level | Detection Method
location/address | Physical street addresses | Yes | High | Pattern matching, column name analysis
location/city | City names | Conditionally | Medium | Column name patterns, predefined city name databases
location/city/de | German city names | Conditionally | Medium | Language-specific city databases
location/city/es | Spanish city names | Conditionally | Medium | Language-specific city databases
location/country_code | Country codes (US, UK, DE, etc.) | No | Low | Column name patterns, predefined country code lists
location/country/ar | Arabic country names | No | Low | Language-specific country databases
location/country/en | English country names | No | Low | Language-specific country databases
location/country/es | Spanish country names | No | Low | Language-specific country databases
location/latitude | Geographic latitude coordinates | Conditionally | Low | Pattern matching, column name analysis
location/longitude | Geographic longitude coordinates | Conditionally | Low | Pattern matching, column name analysis
location/phone | Phone numbers (general) | Yes | High | Pattern matching, column name analysis
location/phone/format1 to location/phone/format4 | Different phone number formats | Yes | High | Format-specific pattern matching
location/state/US/abbr | US state abbreviations | Conditionally | Low | Pattern matching, predefined state lists
location/state/US/full | Full US state names | Conditionally | Low | Pattern matching, predefined state lists
location/zip_code | ZIP/postal codes | Conditionally | Medium | Pattern matching, column name analysis

Personal Information

Label | Description | PII Status | Risk Level | Detection Method
person/gender | Gender identifiers | Yes | Low | Column name patterns, predefined gender lists
person/name/en/first | English first names | Yes | High | Pattern matching against English name databases
person/name/en/full | Full English names | Yes | High | Multi-word pattern matching
person/name/en/last | English last names | Yes | High | Pattern matching against English surname databases
person/name/es | Spanish names | Yes | High | Language-specific name databases
person/name/fr | French names | Yes | High | Language-specific name databases
person/race | Race/Ethnicity identifiers | Yes | High | Column name patterns, predefined race lists

Technical Information

Label | Description | PII Status | Risk Level | Detection Method
tech/email | Email addresses | Yes | Medium | Pattern matching using email regex validation
tech/guid | Globally Unique Identifiers | Conditionally | Low | Pattern matching using GUID format regex
tech/hex_color | Hexadecimal color codes | No | Low | Pattern matching using hex color format regex
tech/ipv4 | IPv4 addresses | Conditionally | Medium | Pattern matching using IPv4 format validation
tech/ipv6 | IPv6 addresses | Conditionally | Medium | Pattern matching using IPv6 format validation
tech/locale | Locale/Regional settings | No | Low | Pattern matching, predefined locale lists
tech/mac | MAC addresses | Conditionally | Low | Pattern matching using MAC address format regex
tech/md5 | MD5 hash values | Conditionally | Low | Pattern matching using MD5 format regex
tech/mime_type | MIME type identifiers | No | Low | Pattern matching, predefined MIME type lists
tech/sha1 | SHA1 hash values | Conditionally | Low | Pattern matching using SHA1 format regex
tech/sha256 | SHA256 hash values | Conditionally | Low | Pattern matching using SHA256 format regex
tech/url | Web URLs | Conditionally | Low | Pattern matching using URL regex validation
tech/user_agent | Browser user agent strings | Conditionally | Low | Pattern matching, predefined user agent patterns
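
Several of the technical labels can be told apart by shape alone. The following Python sketch is one possible approach, not the product's implementation: it relies on the standard library `ipaddress` module for IP addresses and on hexadecimal string length for hashes. The function name and the order of the checks are assumptions.

```python
import ipaddress
import re

# Hex digest lengths: MD5 is 32 hex characters, SHA1 is 40, SHA256 is 64.
HASH_LENGTHS = {32: "tech/md5", 40: "tech/sha1", 64: "tech/sha256"}
HEX_RE = re.compile(r"[0-9a-fA-F]+")


def classify_technical(value):
    """Return a technical label name for the value, or None if nothing fits."""
    text = value.strip()
    try:
        address = ipaddress.ip_address(text)
        return "tech/ipv4" if address.version == 4 else "tech/ipv6"
    except ValueError:
        pass  # Not an IP address; fall through to the hash check.
    if HEX_RE.fullmatch(text) and len(text) in HASH_LENGTHS:
        return HASH_LENGTHS[len(text)]
    return None
```

Length alone cannot prove which algorithm produced a hex string (a GUID written without dashes is also 32 hex characters), so a shape match like this is best treated as a candidate label, with column names and other evidence raising or lowering confidence.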

Miscellaneous

Label | Description | PII Status | Risk Level | Detection Method
misc/ar | Arabic words and phrases | No | Low | Language-specific pattern matching
misc/common | Common words | No | Low | Pattern matching against common word lists
misc/en | English words | No | Low | Pattern matching against English word lists
misc/es | Spanish words | No | Low | Pattern matching against Spanish word lists
misc/fr | French words | No | Low | Pattern matching against French word lists
misc/numbers | Numeric patterns | No | Low | Pattern matching, data type analysis

Label Properties

Each label has two key properties:

PII Field

Indicates whether the field contains Personally Identifiable Information:

  • True: Field contains sensitive personal data
  • False: Field does not contain sensitive personal data

Severity

Represents the risk level if the data were exposed:

  • Low: Minimal risk (e.g., Gender)
  • Medium: Moderate risk (e.g., Email addresses)
  • High: Significant risk (e.g., Names, Addresses)
  • Very High: Critical risk (e.g., SSN, Credit Cards)
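
The two properties can be modeled as a small data structure. The Python sketch below is illustrative only: the class and function names are hypothetical and do not reflect how Gigantics stores labels internally. The sample values are taken from the tables above, using only labels whose PII status is listed as Yes or No.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Risk levels, ordered so that they can be compared."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


@dataclass(frozen=True)
class Label:
    name: str
    pii: bool
    severity: Severity


LABELS = [
    Label("person/gender", True, Severity.LOW),
    Label("tech/email", True, Severity.MEDIUM),
    Label("location/address", True, Severity.HIGH),
    Label("identifier/ssn", True, Severity.VERY_HIGH),
    Label("finance/currency_code", False, Severity.LOW),
]


def high_risk_pii(labels, threshold=Severity.HIGH):
    """Return names of PII labels at or above the given severity."""
    return [label.name for label in labels if label.pii and label.severity >= threshold]
```

With the sample data, `high_risk_pii(LABELS)` returns `["location/address", "identifier/ssn"]`, the kind of shortlist one might prioritize when configuring anonymization rules.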

Label Assignment Process

During discovery, Gigantics automatically assigns labels using a multi-layered approach:

  1. Column names: Matching against known patterns (e.g., "email", "phone")
  2. Data patterns: Analyzing sample values for format matches using regex and validation algorithms
  3. Dictionary lookup: Comparing against predefined sensitive data dictionaries with thousands of entries
  4. Machine learning: Using trained neural network models to recognize complex patterns
  5. Contextual analysis: Examining data in context with related fields for more accurate classification

Confidence levels are displayed as percentages indicating how certain the system is about the label assignment.
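
To make the idea of a layered score concrete, the Python sketch below combines the first two layers (column name and data pattern) into a single percentage. Everything in it is a hypothetical illustration: the rule tables, the 30/70 weighting, and the function name are assumptions, and the product's actual scoring, including its dictionary, machine learning, and contextual layers, is not described in this document.

```python
import re

# Hypothetical rules for a single label; real rule sets are not documented here.
COLUMN_NAME_HINTS = {"tech/email": re.compile(r"e[-_]?mail", re.IGNORECASE)}
VALUE_PATTERNS = {"tech/email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")}


def score_label(label, column_name, samples, name_points=30):
    """Return a confidence percentage (0 to 100) for one label on one column.

    name_points is the assumed share of the score awarded for a column-name
    match; the remainder is scaled by the fraction of sample values that
    match the label's value pattern.
    """
    name_score = name_points if COLUMN_NAME_HINTS[label].search(column_name) else 0
    non_empty = [s.strip() for s in samples if s and s.strip()]
    if non_empty:
        hits = sum(1 for s in non_empty if VALUE_PATTERNS[label].fullmatch(s))
        match_ratio = hits / len(non_empty)
    else:
        match_ratio = 0.0
    return name_score + (100 - name_points) * match_ratio
```

For a column named `user_email` in which three of four sample values look like email addresses, this sketch yields 30 + 70 × 0.75 = 82.5. Weighting the value evidence more heavily than the name is a design choice: column names are cheap to check but often misleading, while sampled values reflect what the field actually holds.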

Managing Labels

After discovery, you can:

Labels are essential for ensuring accurate anonymization and data synthesis in subsequent steps.
