Creating a New Pipeline

The New Pipeline wizard guides you through a three-step process to create automated data processing pipelines, covering the pipeline's job type, execution settings, and scheduling preferences.

Pipeline Creation Overview

The pipeline creation process uses a multi-step wizard with the following stages:

  1. General Settings - Configure pipeline name and job type
  2. Configuration - Set concurrency, load options, transform options, and logging
  3. Scheduling - Define execution timing and frequency

Step 1: General Settings

Pipeline Name

The basic identifier for your pipeline. This name will appear in the pipeline list and be used to identify the pipeline in logs and API calls.

Job Type Selection

The job type determines what kind of data processing the pipeline will perform. Each type enables different configuration options:

  • Scan: Scans the datasource looking for changes. Use case: schema monitoring and change detection.
  • Discover: Creates a new discovery by labeling datasource fields. Use case: field discovery and analysis.
  • Dump: Creates a new dataset from a tap using a rule. Use case: data extraction with transformation.
  • Pump: Puts tap data into a sink using a rule. Use case: direct data streaming with transformation.
  • Load: Loads an existing dataset into a sink. Use case: dataset replication and migration.

Job Type-Specific Options

For Dump Operations

  • Rule Selection: Choose a transformation rule to apply (optional - defaults to "no rule")
    • Selecting "no rule" allows direct data dumping without sensitive data protection
    • Rules enable data masking, anonymization, and transformation

For Pump Operations

  • Rule Selection: Choose a transformation rule to apply to streaming data
  • Destination: Choose where to send the transformed data:
    • Sink: Load data to an external sink destination
    • Overwrite Tap: Update the source tap directly

For Load Operations

  • Dataset Selection: Choose an existing dataset to load
  • Rule Selection: Apply transformation rules (optional)
  • Destination: Select target (Sink or Overwrite Tap)

Destination Options

When Sink is selected as destination:

  • Sink Selection: Choose from available data sinks
  • Each sink displays its driver type (database, file system, etc.)

Make sure to create your sinks before creating a pipeline that requires a sink destination. Without pre-configured sinks, the sink selection dropdown will be empty and you won't be able to complete the pipeline configuration.
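The choices made in Step 1 reduce to a small set of fields: a name, a job type, and (depending on the type) a rule, a dataset, and a destination. As a rough illustration only, the Python sketch below shows how such a definition might be assembled and checked before moving on; the field names and validation are assumptions, not the product's actual schema or API.

    # Hypothetical sketch of the Step 1 choices as a plain data structure.
    # Field names are illustrative assumptions, not the product's real schema.

    VALID_JOB_TYPES = {"scan", "discover", "dump", "pump", "load"}

    pipeline = {
        "name": "mask-customers-nightly",   # Pipeline Name
        "job_type": "pump",                 # one of VALID_JOB_TYPES
        "rule": "mask-pii",                 # optional transformation rule
        "destination": "sink",              # "sink" or "overwrite_tap"
        "sink": "test-postgres",            # required when destination == "sink"
    }

    # Minimal validation mirroring the wizard's behaviour: a sink destination
    # only makes sense if a sink has been created and selected beforehand.
    assert pipeline["job_type"] in VALID_JOB_TYPES
    if pipeline.get("destination") == "sink" and not pipeline.get("sink"):
        raise ValueError("Create and select a sink before finishing the pipeline")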

Step 2: Configuration Options

Concurrency Options

Configure how the pipeline processes data in parallel:

  • Concurrency: Number of concurrent processing streams (integer, minimum 1).
  • Continue Streaming On Fail: Keep processing if an entity fails during streaming (checkbox, enabled by default).

Load Options

Available for Pump and Load operations when rules are applied:

  • Target Entities: Action to take when the sink already contains data.
    • Truncate: Clear existing data
    • Drop: Drop and recreate entities
    • Append: Add to existing data
    • Update: Update existing records
    • Merge: Merge with existing data
  • Stream Content: Type of content shown in job logs.
    • Data and metadata: Full information
    • Data only: Just data records
    • Metadata only: Just metadata
  • Read Batch Size: Number of records read from the source per cycle (integer, shown with a "Records" suffix).
  • Write Batch Size: Number of records written to the destination per cycle (integer, shown with a "Records" suffix).

Note: Available target entity options depend on the sink's supported write modes. Some options like Truncate and Drop are disabled when overwriting a tap.
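The batch size settings control how many records move per cycle between the source and the destination. The sketch below illustrates that read/write batching pattern in general terms; it is a conceptual example, not the product's implementation.

    # Conceptual sketch of read/write batching, not the product's implementation.
    # Records are read from the source in read_batch_size chunks and written to
    # the destination in write_batch_size chunks.

    from itertools import islice

    def batches(iterable, size):
        it = iter(iterable)
        while chunk := list(islice(it, size)):
            yield chunk

    read_batch_size = 1000    # "Read Batch Size" (records per read cycle)
    write_batch_size = 500    # "Write Batch Size" (records per write cycle)

    source = range(3200)      # stand-in for the tap or dataset being read
    buffer = []
    for read_chunk in batches(source, read_batch_size):
        buffer.extend(read_chunk)
        while len(buffer) >= write_batch_size:
            to_write, buffer = buffer[:write_batch_size], buffer[write_batch_size:]
            print(f"writing {len(to_write)} records")   # stand-in for the sink write
    if buffer:
        print(f"writing {len(buffer)} records")         # final partial batch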

Transform Options

Available when a transformation rule is selected:

Determinism

  • Random: Different results for each rule execution
  • Deterministic: Same results guaranteed for each execution (requires seed value)

Seed Value

  • Used to initialize pseudo-random number generator
  • Required when Deterministic mode is selected
  • Range: 1 to 9,999,999
  • Generate button creates random seed value
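The seed plays the standard role it has with any pseudo-random number generator: the same seed produces the same sequence of values, which is what makes deterministic masking repeatable across runs. The sketch below illustrates the idea with Python's standard random module; it is not the product's masking engine.

    import random

    def mask_digits(value, rng):
        # Replace every digit with a pseudo-random digit drawn from rng.
        return "".join(str(rng.randint(0, 9)) if c.isdigit() else c for c in value)

    seed = 4_201_337                      # any value in the allowed 1 to 9,999,999 range

    run_1 = mask_digits("555-867-5309", random.Random(seed))
    run_2 = mask_digits("555-867-5309", random.Random(seed))
    assert run_1 == run_2                 # Deterministic: identical results on every run

    run_3 = mask_digits("555-867-5309", random.Random())
    # Random mode: no fixed seed, so run_3 will almost certainly differ from run_1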

Foreign Key Checking

  • Auto: Automatically determine if foreign key checking is needed
  • Do not check: Skip foreign key validation (faster execution)
  • Force check: Always validate foreign keys (consistent but slower)

Dictionary Settings

Controls reuse of previously masked values for consistent transformations:

  • No dictionary: No reuse of previous transformations.
  • Reuse values on the same entity+field: Consistent values within the same record field.
  • Reuse values with the same label or same entity+field: Consistency based on labeling or field location.
  • Reuse values in every field: Global consistency across all transformations.

Dictionary Options (disabled when "No dictionary" selected):

  • Cache dictionary: Store dictionary in memory for faster access
  • Store new transformations in the dictionary: Save new masked values for future reuse
  • Overwrite existing dictionary: Replace stored values with new transformations
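Conceptually, the dictionary is a lookup table from original values to previously generated masked values, and the dictionary mode decides how wide the lookup key is. The sketch below illustrates that idea with an in-memory cache; the key shapes and mode names are assumptions (the label-based mode is omitted), and the real feature may be persisted and scoped differently.

    # Illustrative sketch of dictionary reuse; keys and behaviour are assumptions.

    dictionary = {}   # (scope, original_value) -> masked_value

    def masked(entity, field, original, mode, generate):
        if mode == "no_dictionary":
            return generate(original)              # never reuse previous results
        if mode == "same_entity_field":
            scope = (entity, field)                # reuse within one entity+field
        else:
            scope = "global"                       # "every field": reuse everywhere
        key = (scope, original)
        if key not in dictionary:
            dictionary[key] = generate(original)   # "store new transformations"
        return dictionary[key]

    gen = lambda v: v[::-1]                        # toy transformation
    print(masked("customers", "email", "ana@example.com", "same_entity_field", gen))
    print(masked("customers", "email", "ana@example.com", "same_entity_field", gen))  # reused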

Logging Options

Configure pipeline execution logging:

  • Logging Type: Detail level shown on the job details page.
    • Normal: Standard logging
    • Debug: Additional debugging information
    • Trace: Maximum detail for troubleshooting
  • Logging Level: Level of detail in job logs (higher values produce more traces).
    • 1, 2, 3: Numeric detail levels
    • max: Maximum available detail

Dataset Options (Dump Operations Only)

Dataset Naming

  • Dataset Name: Custom name for the created dataset (text input)

Run Mode Options

  • Create new dataset: Generate a new dataset with a unique name on each run.
  • Overwrite the same dataset: Replace the existing dataset contents on each run.
  • Merge into dataset: Combine new data with an existing dataset (requires selection).

For merge mode, select an existing dataset to merge data into.

Step 3: Scheduling

Execution Type

  • Manual Execution: Run on demand via the web interface or an API call.
  • Repeat: Automatic execution on a recurring schedule.

Manual Execution

  • Pipeline runs when user clicks Run button in the web interface
  • Can also be triggered via API endpoint (generated in pipeline management)
  • No automatic scheduling
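For API-triggered runs, the actual endpoint is generated through the Share option in pipeline management (see After Creation below). The example below is only a hypothetical illustration of such a call; the URL path, pipeline id, and authentication header are placeholders, not the product's documented API.

    # Hypothetical example of triggering a manual pipeline run over HTTP.
    # Endpoint path, host, and auth header are placeholders.
    import requests

    response = requests.post(
        "https://your-instance.example.com/api/pipelines/42/run",  # from the Share dialog
        headers={"Authorization": "Bearer <api-token>"},
        timeout=30,
    )
    response.raise_for_status()
    print("Run accepted:", response.json())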

Repeat Execution

Next Run Date (Optional)

  • Next run on: Date and time for first execution
  • Date picker with time selection (YYYY-MM-DD HH:mm format)
  • Past dates and times cannot be selected; the picker validates the selection before accepting it

Recurrence Settings

  • Repeat every: Numeric value for frequency (1-365)
  • Frequency Unit:
    • Hour: Every X hours
    • Day: Every X days
    • Month: Every X months
    • Year: Every X years
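The recurrence settings amount to "start at the next-run date, then add the chosen interval repeatedly." The sketch below shows that arithmetic for the hour and day units only (month and year steps need calendar-aware date math); it is an illustration of the scheduling concept, not the scheduler itself.

    from datetime import datetime, timedelta

    # Hypothetical helper: compute the next few run times for a Repeat schedule.
    def upcoming_runs(next_run_on, every, unit, count=3):
        step = {"hour": timedelta(hours=every), "day": timedelta(days=every)}[unit]
        runs, current = [], next_run_on
        for _ in range(count):
            runs.append(current)
            current += step
        return runs

    first = datetime(2024, 7, 1, 2, 30)   # "Next run on" (YYYY-MM-DD HH:mm)
    for run in upcoming_runs(first, every=12, unit="hour"):
        print(run.strftime("%Y-%m-%d %H:%M"))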

Pipeline Creation Workflow

The interface provides intelligent form behavior:

  1. Dynamic Options: Configuration sections appear/disappear based on selected job type
  2. Validation: Required fields are validated before proceeding to next step
  3. Form Reset: Changing job type clears dependent selections (rules, sinks, datasets)
  4. Progress Tracking: Sidebar shows current step and allows navigation between steps
  5. Smart Defaults: Sensible default values are pre-selected for most settings

After Creation

Once created, pipelines appear in the main pipeline list with the following management options:

  • Enable/Disable: Toggle active status
  • Run: Execute immediately (manual pipelines)
  • Share: Generate API endpoints for external execution
  • Edit: Modify pipeline configuration
  • Delete: Remove pipeline permanently

Example Pipeline Configurations

Data Masking Pipeline

  • Job Type: Pump
  • Rule: Data transformation rule
  • Destination: Sink (test database)
  • Load Options: Update mode
  • Transform: Deterministic with seed for consistent masking
  • Schedule: Every Day (for fresh test data)
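Put together, the data masking example above might be captured in a definition like the following. The field names and structure are illustrative assumptions rather than the product's export format.

    # Hypothetical end-to-end definition for the data masking example above.
    import json

    data_masking_pipeline = {
        "name": "mask-prod-to-test",
        "job_type": "pump",
        "rule": "pii-masking-rule",
        "destination": {"type": "sink", "sink": "test-database"},
        "load_options": {"target_entities": "update"},
        "transform_options": {"determinism": "deterministic", "seed": 123456},
        "schedule": {"type": "repeat", "every": 1, "unit": "day"},
    }

    print(json.dumps(data_masking_pipeline, indent=2))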

Schema Monitoring Pipeline

  • Job Type: Scan
  • No additional options required
  • Logging: Debug mode for detailed change tracking
  • Schedule: Every Hour (continuous monitoring)

Dataset Export Pipeline

  • Job Type: Load
  • Dataset: Existing production dataset
  • Destination: Sink (file system)
  • Load Options: Append mode
  • Schedule: Manual (on-demand exports)