Creating a New Pipeline
The New Pipeline wizard guides you through a three-step process to create automated data processing pipelines, covering the pipeline's job type, execution settings, and scheduling preferences.
Pipeline Creation Overview
The pipeline creation process uses a multi-step wizard with the following stages:
- General Settings - Configure pipeline name and job type
- Configuration - Set concurrency, load options, transform options, and logging
- Scheduling - Define execution timing and frequency
Step 1: General Settings
Pipeline Name
The basic identifier for your pipeline. This name will appear in the pipeline list and be used to identify the pipeline in logs and API calls.
Job Type Selection
The job type determines what kind of data processing the pipeline will perform. Each type enables different configuration options:
| Job Type | Description | Use Case |
|---|---|---|
| Scan | Scans the datasource for changes | Schema monitoring and change detection |
| Discover | Creates a new discovery by labeling datasource fields | Field discovery and analysis |
| Dump | Creates a new dataset from a tap using a rule | Data extraction with transformation |
| Pump | Puts tap data into a sink using a rule | Direct data streaming with transformation |
| Load | Loads an existing dataset into a sink | Dataset replication and migration |
Job Type-Specific Options
For Dump Operations
- Rule Selection: Choose a transformation rule to apply (optional - defaults to "no rule")
  - Selecting "no rule" allows direct data dumping without sensitive data protection
  - Rules enable data masking, anonymization, and transformation
For Pump Operations
- Rule Selection: Choose a transformation rule to apply to streaming data
- Destination: Choose where to send the transformed data:
  - Sink: Load data to an external sink destination
  - Overwrite Tap: Update the source tap directly
For Load Operations
- Dataset Selection: Choose an existing dataset to load
- Rule Selection: Apply transformation rules (optional)
- Destination: Select target (Sink or Overwrite Tap)
Destination Options
When Sink is selected as destination:
- Sink Selection: Choose from available data sinks
- Each sink displays its driver type (database, file system, etc.)
Note: Create your sinks before creating a pipeline that requires a sink destination. Without pre-configured sinks, the sink selection dropdown will be empty and you won't be able to complete the pipeline configuration.
Step 2: Configuration Options
Concurrency Options
Configure how the pipeline processes data in parallel:
| Setting | Description | Range/Options |
|---|---|---|
| Concurrency | Number of concurrent processing streams | Minimum: 1 (integer) |
| Continue Streaming On Fail | Keep processing if an entity fails during streaming | Checkbox (default: enabled) |
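Conceptually, these two settings map onto a worker pool whose size is the concurrency value, with per-entity error handling decided by the checkbox. The sketch below is illustrative only; the `process_entity` function is a hypothetical stand-in, not part of the product:

```python
from concurrent.futures import ThreadPoolExecutor

def process_entity(entity):
    """Hypothetical stand-in for streaming one entity from tap to sink."""
    ...

def run_pipeline(entities, concurrency=1, continue_on_fail=True):
    # Concurrency: number of entities processed in parallel (minimum 1).
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(process_entity, e): e for e in entities}
        for future, entity in futures.items():
            try:
                future.result()
            except Exception as err:
                if continue_on_fail:
                    # Continue Streaming On Fail: log the failure and move on.
                    print(f"entity {entity!r} failed: {err}")
                else:
                    raise
```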
Load Options
Available for Pump and Load operations when rules are applied:
| Setting | Description | Options |
|---|---|---|
| Target Entities | Action when the sink already contains data | Truncate: clear existing data; Drop: drop and recreate entities; Append: add to existing data; Update: update existing records; Merge: merge with existing data |
| Stream Content | Type of content shown in job logs | Data and metadata: full information; Data only: just data records; Metadata only: just metadata |
| Read Batch Size | Records read from source per cycle | Integer with "Records" suffix |
| Write Batch Size | Records written to destination per cycle | Integer with "Records" suffix |
Note: Available target entity options depend on the sink's supported write modes. Some options like Truncate and Drop are disabled when overwriting a tap.
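As a rough mental model, the batch sizes control a read/write loop like the one below; the `source` and `sink` objects and their methods are assumptions for illustration, not the product's API:

```python
def copy_entity(source, sink, read_batch=10_000, write_batch=1_000,
                target_mode="append"):
    """Illustrative batching loop; not the actual load implementation."""
    if target_mode == "truncate":
        sink.truncate()            # Truncate: clear existing data
    elif target_mode == "drop":
        sink.drop_and_recreate()   # Drop: drop and recreate the entity

    buffer = []
    for rows in source.read(batch_size=read_batch):   # Read Batch Size
        buffer.extend(rows)
        while len(buffer) >= write_batch:              # Write Batch Size
            sink.write(buffer[:write_batch])
            buffer = buffer[write_batch:]
    if buffer:
        sink.write(buffer)                             # flush the remainder
```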
Transform Options
Available when a transformation rule is selected:
Determinism
- Random: Different results for each rule execution
- Deterministic: Same results guaranteed for each execution (requires seed value)
Seed Value
- Used to initialize the pseudo-random number generator
- Required when Deterministic mode is selected
- Range: 1 to 9,999,999
- Generate button creates a random seed value
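The role of the seed can be shown with a generic pseudo-random generator; this is a minimal sketch of the idea, not the product's masking engine:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def mask_name(value, seed=None):
    # Deterministic mode: seeding with (seed, value) makes the output
    # repeatable across runs. Random mode passes no seed.
    rng = random.Random(f"{seed}:{value}" if seed is not None else None)
    return "user_" + "".join(rng.choices(ALPHABET, k=8))

print(mask_name("Alice", seed=42))  # same output on every run
print(mask_name("Alice", seed=42))  # identical to the line above
print(mask_name("Alice"))           # changes from run to run
```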
Foreign Key Checking
- Auto: Automatically determine if foreign key checking is needed
- Do not check: Skip foreign key validation (faster execution)
- Force check: Always validate foreign keys (consistent but slower)
Dictionary Settings
Controls reuse of previously masked values for consistent transformations:
| Dictionary Mode | Description |
|---|---|
| No dictionary | No reuse of previous transformations |
| Reuse values on the same entity+field | Consistent values within the same record field |
| Reuse values with the same label or same entity+field | Consistent based on labeling or field location |
| Reuse values in every field | Global consistency across all transformations |
Dictionary Options (disabled when "No dictionary" is selected):
- Cache dictionary: Store dictionary in memory for faster access
- Store new transformations in the dictionary: Save new masked values for future reuse
- Overwrite existing dictionary: Replace stored values with new transformations
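The dictionary is essentially a lookup table consulted before masking. Below is a minimal sketch of the "entity+field" reuse mode; the storage format and function names are assumptions, not the product's implementation:

```python
dictionary = {}  # in-memory stand-in; "Cache dictionary" keeps it in memory

def transform(entity, field, value, mask_fn, store_new=True, overwrite=False):
    key = (entity, field, value)
    if key in dictionary and not overwrite:
        return dictionary[key]       # reuse a previous transformation
    masked = mask_fn(value)
    if store_new or overwrite:
        dictionary[key] = masked     # store new / overwrite existing value
    return masked
```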
Logging Options
Configure pipeline execution logging:
| Setting | Description | Options |
|---|---|---|
| Logging Type | Detail level shown in the job details page | Normal: standard logging; Debug: additional debugging information; Trace: maximum detail for troubleshooting |
| Logging Level | Level of detail in job logs (higher = more traces) | 1, 2, 3: numeric detail levels; max: maximum available detail |
Dataset Options (Dump Operations Only)
Dataset Naming
- Dataset Name: Custom name for the created dataset (text input)
Run Mode Options
| Mode | Description |
|---|---|
| Create new dataset | Generate a new dataset with a unique name on each run |
| Overwrite the same dataset | Replace the existing dataset contents on each run |
| Merge into dataset | Combine new data with existing dataset (requires selection) |
For merge mode, select an existing dataset to merge data into.
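In effect, the run mode only changes how the target dataset name is resolved on each run. A hypothetical sketch:

```python
from datetime import datetime

def resolve_dataset(base_name, run_mode, merge_target=None):
    """Illustrative naming behavior for the three run modes."""
    if run_mode == "create_new":
        # A fresh dataset with a unique name on every run.
        return f"{base_name}_{datetime.now():%Y%m%d_%H%M%S}"
    if run_mode == "merge":
        return merge_target    # an existing dataset chosen in the wizard
    return base_name           # overwrite the same dataset each run
```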
Step 3: Scheduling
Execution Type
| Type | Description |
|---|---|
| Manual Execution | Run on-demand via web interface or API call |
| Repeat | Automatic execution on a recurring schedule |
Manual Execution
- Pipeline runs when user clicks Run button in the web interface
- Can also be triggered via an API endpoint (generated in pipeline management)
- No automatic scheduling
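From a script, triggering a manual pipeline is an ordinary HTTP call to the endpoint generated by the Share option. The URL, path, and authentication scheme below are placeholders; use the exact values shown for your pipeline:

```python
import requests

# Placeholders: copy the real endpoint and credentials from the
# pipeline's Share dialog in your own deployment.
ENDPOINT = "https://your-server.example.com/api/pipelines/<pipeline-id>/run"
TOKEN = "<api-token>"

response = requests.post(ENDPOINT, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()
print("Pipeline triggered, status:", response.status_code)
```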
Repeat Execution
Next Run Date (Optional)
- Next run on: Date and time for first execution
- Date picker with time selection (YYYY-MM-DD HH:mm format)
- Cannot select past dates or times
- Validation ensures the selected date and time are in the future
Recurrence Settings
- Repeat every: Numeric value for frequency (1-365)
- Frequency Unit:
  - Hour: Every X hours
  - Day: Every X days
  - Month: Every X months
  - Year: Every X years
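For the Hour and Day units, the recurrence is plain date arithmetic, as in this sketch (Month and Year need calendar-aware arithmetic, e.g. `dateutil.relativedelta`):

```python
from datetime import datetime, timedelta

def next_run(previous_run, repeat_every, unit):
    """Sketch of the Hour/Day recurrence; repeat_every is 1-365."""
    if unit == "hour":
        return previous_run + timedelta(hours=repeat_every)
    if unit == "day":
        return previous_run + timedelta(days=repeat_every)
    raise ValueError("month/year require calendar-aware date math")

print(next_run(datetime(2024, 6, 1, 8, 0), 6, "hour"))  # 2024-06-01 14:00:00
```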
Pipeline Creation Workflow
The interface provides intelligent form behavior:
- Dynamic Options: Configuration sections appear/disappear based on selected job type
- Validation: Required fields are validated before proceeding to the next step
- Form Reset: Changing job type clears dependent selections (rules, sinks, datasets)
- Progress Tracking: Sidebar shows current step and allows navigation between steps
- Smart Defaults: Sensible default values are pre-filled for most settings
After Creation
Once created, pipelines appear in the main pipeline list with the following management options:
- Enable/Disable: Toggle active status
- Run: Execute immediately (manual pipelines)
- Share: Generate API endpoints for external execution
- Edit: Modify pipeline configuration
- Delete: Remove pipeline permanently
Example Pipeline Configurations
Data Masking Pipeline
- Job Type: Pump
- Rule: Data transformation rule
- Destination: Sink (test database)
- Load Options: Update mode
- Transform: Deterministic with seed for consistent masking
- Schedule: Every Day (for fresh test data)
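Written out as a plain data structure, the same choices might look like this (the field and value names are illustrative, not the product's import/export format):

```python
data_masking_pipeline = {
    "name": "mask-prod-for-test",
    "job_type": "pump",
    "rule": "customer-data-masking",          # hypothetical rule name
    "destination": {"type": "sink", "sink": "test-database"},
    "load_options": {"target_entities": "update"},
    "transform_options": {"determinism": "deterministic", "seed": 424242},
    "schedule": {"type": "repeat", "every": 1, "unit": "day"},
}
```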
Schema Monitoring Pipeline
- Job Type: Scan
- No additional options required
- Logging: Debug mode for detailed change tracking
- Schedule: Every Hour (continuous monitoring)
Dataset Export Pipeline
- Job Type: Load
- Dataset: Existing production dataset
- Destination: Sink (file system)
- Load Options: Append mode
- Schedule: Manual (on-demand exports)