Creating a New Pipeline
The New Pipeline wizard guides you through a three-step process to create automated data processing pipelines, covering the pipeline's job type, execution settings, and scheduling preferences.
Pipeline Creation Overview
The pipeline creation process uses a multi-step wizard with the following stages:
- General Settings - Configure pipeline name and job type
- Configuration - Set concurrency, load options, transform options, and logging
- Scheduling - Define execution timing and frequency
Step 1: General Settings
Pipeline Name
The basic identifier for your pipeline. This name will appear in the pipeline list and be used to identify the pipeline in logs and API calls.
Job Type Selection
The job type determines what kind of data processing the pipeline will perform. Each type enables different configuration options:
| Job Type | Description | Use Case |
|---|---|---|
| Scan | Scans the datasource for changes | Schema monitoring and change detection |
| Discover | Creates a new discovery by labeling datasource fields | Field discovery and analysis |
| Dump | Creates a new dataset from a tap using a rule | Data extraction with transformation |
| Pump | Puts tap data into a sink using a rule | Direct data streaming with transformation |
| Load | Loads an existing dataset into a sink | Dataset replication and migration |
Job Type-Specific Options
For Dump Operations
- Rule Selection: Choose a transformation rule to apply (optional - defaults to "no rule")
  - Selecting "no rule" allows direct data dumping without sensitive data protection
  - Rules enable data masking, anonymization, and transformation
For Pump Operations
- Rule Selection: Choose a transformation rule to apply to streaming data
- Destination: Choose where to send the transformed data:
  - Sink: Load data to an external sink destination
  - Overwrite Tap: Update the source tap directly
For Load Operations
- Dataset Selection: Choose an existing dataset to load
- Rule Selection: Apply transformation rules (optional)
- Destination: Select target (Sink or Overwrite Tap)
Destination Options
When Sink is selected as destination:
- Sink Selection: Choose from available data sinks
- Each sink displays its driver type (database, file system, etc.)
Note: Create your sinks before creating a pipeline that requires a sink destination. Without pre-configured sinks, the sink selection dropdown will be empty and you won't be able to complete the pipeline configuration.
Step 2: Configuration Options
Concurrency Options
Configure how the pipeline processes data in parallel:
| Setting | Description | Range/Options |
|---|---|---|
| Concurrency | Number of concurrent processing streams | Minimum: 1 (integer) |
| Continue Streaming On Fail | Keep processing if an entity fails during streaming | Checkbox (default: enabled) |
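Conceptually, these two settings map onto a worker pool whose size is the concurrency value, with per-entity error handling decided by the checkbox. The sketch below is illustrative only; the `process_entity` function is a hypothetical stand-in, not part of the product:

```python
from concurrent.futures import ThreadPoolExecutor

def process_entity(entity):
    """Hypothetical stand-in for streaming one entity from tap to sink."""
    ...

def run_pipeline(entities, concurrency=1, continue_on_fail=True):
    # Concurrency: number of entities processed in parallel (minimum 1).
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(process_entity, e): e for e in entities}
        for future, entity in futures.items():
            try:
                future.result()
            except Exception as err:
                if continue_on_fail:
                    # Continue Streaming On Fail: log the failure and move on.
                    print(f"entity {entity!r} failed: {err}")
                else:
                    raise
```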
Load Options
Available for Pump and Load operations when rules are applied:
| Setting | Description | Options |
|---|---|---|
| Target Entities | Action when the sink already contains data | Truncate: clear existing data; Drop: drop and recreate entities; Append: add to existing data; Update: update existing records; Merge: merge with existing data |
| Stream Content | Type of content shown in job logs | Data and metadata: full information; Data only: just data records; Metadata only: just metadata |
| Read Batch Size | Records read from source per cycle | Integer with "Records" suffix |
| Write Batch Size | Records written to destination per cycle | Integer with "Records" suffix |
Note: Available target entity options depend on the sink's supported write modes. Some options like Truncate and Drop are disabled when overwriting a tap.
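As a rough mental model, the batch sizes control a read/write loop like the one below; the `source` and `sink` objects and their methods are assumptions for illustration, not the product's API:

```python
def copy_entity(source, sink, read_batch=10_000, write_batch=1_000,
                target_mode="append"):
    """Illustrative batching loop; not the actual load implementation."""
    if target_mode == "truncate":
        sink.truncate()            # Truncate: clear existing data
    elif target_mode == "drop":
        sink.drop_and_recreate()   # Drop: drop and recreate the entity

    buffer = []
    for rows in source.read(batch_size=read_batch):   # Read Batch Size
        buffer.extend(rows)
        while len(buffer) >= write_batch:              # Write Batch Size
            sink.write(buffer[:write_batch])
            buffer = buffer[write_batch:]
    if buffer:
        sink.write(buffer)                             # flush the remainder
```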
Transform Options
Available when a transformation rule is selected:
Determinism
- Random: Different results for each rule execution
- Deterministic: Same results guaranteed for each execution (requires seed value)
Seed Value
- Used to initialize the pseudo-random number generator
- Required when Deterministic mode is selected
- Range: 1 to 9,999,999
- Generate button creates a random seed value
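The role of the seed can be shown with a generic pseudo-random generator; this is a minimal sketch of the idea, not the product's masking engine:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def mask_name(value, seed=None):
    # Deterministic mode: seeding with (seed, value) makes the output
    # repeatable across runs. Random mode passes no seed.
    rng = random.Random(f"{seed}:{value}" if seed is not None else None)
    return "user_" + "".join(rng.choices(ALPHABET, k=8))

print(mask_name("Alice", seed=42))  # same output on every run
print(mask_name("Alice", seed=42))  # identical to the line above
print(mask_name("Alice"))           # changes from run to run
```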
Foreign Key Checking
- Auto: Automatically determine if foreign key checking is needed
- Do not check: Skip foreign key validation (faster execution)
- Force check: Always validate foreign keys (consistent but slower)
Dictionary Settings
Controls reuse of previously masked values for consistent transformations:
| Dictionary Mode | Description |
|---|---|
| No dictionary | No reuse of previous transformations |
| Reuse values on the same entity+field | Consistent values within the same record field |
| Reuse values with the same label or same entity+field | Consistent based on labeling or field location |
| Reuse values in every field | Global consistency across all transformations |
Dictionary Options (disabled when "No dictionary" is selected):
- Cache dictionary: Store dictionary in memory for faster access
- Store new transformations in the dictionary: Save new masked values for future reuse
- Overwrite existing dictionary: Replace stored values with new transformations
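The dictionary is essentially a lookup table consulted before masking. Below is a minimal sketch of the "entity+field" reuse mode; the storage format and function names are assumptions, not the product's implementation:

```python
dictionary = {}  # in-memory stand-in; "Cache dictionary" keeps it in memory

def transform(entity, field, value, mask_fn, store_new=True, overwrite=False):
    key = (entity, field, value)
    if key in dictionary and not overwrite:
        return dictionary[key]       # reuse a previous transformation
    masked = mask_fn(value)
    if store_new or overwrite:
        dictionary[key] = masked     # store new / overwrite existing value
    return masked
```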
Logging Options
Configure pipeline execution logging:
| Setting | Description | Options |
|---|---|---|
| Logging Type | Detail level shown in the job details page | Normal: standard logging; Debug: additional debugging information; Trace: maximum detail for troubleshooting |
| Logging Level | Level of detail in job logs (higher = more traces) | 1, 2, 3: numeric detail levels; max: maximum available detail |
Dataset Options (Dump Operations Only)
Dataset Naming
- Dataset Name: Custom name for the created dataset (text input)
Run Mode Options
| Mode | Description |
|---|---|
| Create new dataset | Generate a new dataset with a unique name on each run |
| Overwrite the same dataset | Replace the existing dataset contents on each run |
| Merge into dataset | Combine new data with existing dataset (requires selection) |
For merge mode, select an existing dataset to merge data into.
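In effect, the run mode only changes how the target dataset name is resolved on each run. A hypothetical sketch:

```python
from datetime import datetime

def resolve_dataset(base_name, run_mode, merge_target=None):
    """Illustrative naming behavior for the three run modes."""
    if run_mode == "create_new":
        # A fresh dataset with a unique name on every run.
        return f"{base_name}_{datetime.now():%Y%m%d_%H%M%S}"
    if run_mode == "merge":
        return merge_target    # an existing dataset chosen in the wizard
    return base_name           # overwrite the same dataset each run
```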
Step 3: Scheduling
Execution Type
| Type | Description |
|---|---|
| Manual Execution | Run on-demand via web interface or API call |
| Repeat | Automatic execution on a recurring schedule |
Manual Execution
- Pipeline runs when user clicks Run button in the web interface
- Can also be triggered via an API endpoint (generated in pipeline management)
- No automatic scheduling
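From a script, triggering a manual pipeline is an ordinary HTTP call to the endpoint generated by the Share option. The URL, path, and authentication scheme below are placeholders; use the exact values shown for your pipeline:

```python
import requests

# Placeholders: copy the real endpoint and credentials from the
# pipeline's Share dialog in your own deployment.
ENDPOINT = "https://your-server.example.com/api/pipelines/<pipeline-id>/run"
TOKEN = "<api-token>"

response = requests.post(ENDPOINT, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()
print("Pipeline triggered, status:", response.status_code)
```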
Repeat Execution
Next Run Date (Optional)
- Next run on: Date and time for first execution
- Date picker with time selection (YYYY-MM-DD HH:mm format)
- Cannot select past dates or times
- Validation ensures the selected date and time are in the future
Recurrence Settings
- Repeat every: Numeric value for frequency (1-365)
- Frequency Unit:
  - Hour: Every X hours
  - Day: Every X days
  - Month: Every X months
  - Year: Every X years
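For the Hour and Day units, the recurrence is plain date arithmetic, as in this sketch (Month and Year need calendar-aware arithmetic, e.g. `dateutil.relativedelta`):

```python
from datetime import datetime, timedelta

def next_run(previous_run, repeat_every, unit):
    """Sketch of the Hour/Day recurrence; repeat_every is 1-365."""
    if unit == "hour":
        return previous_run + timedelta(hours=repeat_every)
    if unit == "day":
        return previous_run + timedelta(days=repeat_every)
    raise ValueError("month/year require calendar-aware date math")

print(next_run(datetime(2024, 6, 1, 8, 0), 6, "hour"))  # 2024-06-01 14:00:00
```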
Pipeline Creation Workflow
The interface provides intelligent form behavior:
- Dynamic Options: Configuration sections appear/disappear based on selected job type
- Validation: Required fields are validated before proceeding to the next step
- Form Reset: Changing job type clears dependent selections (rules, sinks, datasets)
- Progress Tracking: Sidebar shows current step and allows navigation between steps
- Smart Defaults: Sensible default values are pre-filled for most settings
After Creation
Once created, pipelines appear in the main pipeline list with the following management options:
- Enable/Disable: Toggle active status
- Run: Execute immediately (manual pipelines)
- Share: Generate API endpoints for external execution
- Edit: Modify pipeline configuration
- Delete: Remove pipeline permanently
Example Pipeline Configurations
Data Masking Pipeline
- Job Type: Pump
- Rule: Data transformation rule
- Destination: Sink (test database)
- Load Options: Update mode
- Transform: Deterministic with seed for consistent masking
- Schedule: Every Day (for fresh test data)
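Written out as a plain data structure, the same choices might look like this (the field and value names are illustrative, not the product's import/export format):

```python
data_masking_pipeline = {
    "name": "mask-prod-for-test",
    "job_type": "pump",
    "rule": "customer-data-masking",          # hypothetical rule name
    "destination": {"type": "sink", "sink": "test-database"},
    "load_options": {"target_entities": "update"},
    "transform_options": {"determinism": "deterministic", "seed": 424242},
    "schedule": {"type": "repeat", "every": 1, "unit": "day"},
}
```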
Schema Monitoring Pipeline
- Job Type: Scan
- No additional options required
- Logging: Debug mode for detailed change tracking
- Schedule: Every Hour (continuous monitoring)
Dataset Export Pipeline
- Job Type: Load
- Dataset: Existing production dataset
- Destination: Sink (file system)
- Load Options: Append mode
- Schedule: Manual (on-demand exports)