New Job modal
The New Job modal launches whenever you click the New Job button on the Jobs page (or when other workflows defer to the job engine). It is a multi-panel form that adapts to your selections to produce either a one-off job or a reusable pipeline.
┌──────────────────────────────────────────────────────────────────┐
│ New Job                                                          │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ From ▾ | To ▾ | Rule ▾ | Load options ▾ | Transform ▾ | …    │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ Panels collapse once configured so you can scan the summary rows │
└──────────────────────────────────────────────────────────────────┘
When Manual execution or Repeat is selected in the Schedule panel, the footer reveals a required Pipeline name field and the primary button switches to Save as Pipeline. In all other cases the button reads Run Now or Run Later depending on the selected schedule.
Panel overview
| Panel | Purpose | Summary text when collapsed |
|---|---|---|
| From | Choose the source tap or a saved dataset. | Shows the tap (with driver/env badges) or the dataset name. |
| To | Route the data into a sink, tap, or dataset. | Lists the sink or dataset action (new, overwrite, merge). |
| Rule | Attach an optional masking rule and schema version. | Displays the selected rule (or “No rule”). |
| Load options | Tune batches and write mode for the selected destination. | Condenses the selected write mode and read/write batch sizes. |
| Transform options (conditional) | Configure determinism, dictionary behaviour, and foreign-key handling when a rule is active. | Shows determinism + dictionary state. |
| Schedule | Decide when and how often to run the job. | Shows “Now”, the scheduled timestamp, or pipeline cadence. |
The following sections break down every configurable field.
From panel
| Field | Control | Description | Dependencies & defaults |
|---|---|---|---|
| Source selector | Radio buttons (Tap, Dataset) | Choose whether to stream data directly from your source database (Tap) or start from an existing dataset. | Defaults to Tap. Selecting Dataset enables the Dataset dropdown. |
| Tap summary | Read-only display with badges | Shows the name of your source database, along with environment and driver information. | Always visible to confirm your source environment before launching. |
| Dataset | Dropdown selection | Choose an existing dataset to use as your data source. | Required when Dataset source is selected. Disabled when Tap source is selected. |
Internal behaviour:
- Choosing Tap keeps your job type aligned with the target (pump/dump).
- Selecting a dataset stores its ID for use in later steps (merge and overwrite operations require it).
To panel
| Destination option | Extra controls | Description | Dependencies & defaults |
|---|---|---|---|
| Sink | Dropdown selection | Stream the output to one of your configured destinations. | Required when selected. Sets the run type to load-to-sink. Job type becomes pump (tap source) or load (dataset source). |
| Overwrite current tap | None | Write the data back directly into your source database (only available for certain database types). | Disabled unless your source database supports overwrite operations. Forces write mode to update. |
| New Dataset | Text input field | Create a new dataset to store your results; leave blank to auto-generate a name. | Available for all source types. Sets job type to dump for tap sources, or pump for dataset copies. |
| Overwrite dataset | Dropdown selection | Replace all data in an existing dataset with your new results. | Required when selected. Keeps job type in sync with the source (dump/pump). |
| Merge into dataset | Dropdown selection | Add new data to an existing dataset, combining records based on your model configuration. | Required when selected. Available for both tap and dataset sources. |
Behind the scenes, the system automatically adjusts job settings whenever you change the source or destination, preventing incompatible combinations. For example, when you overwrite your source database, destructive write modes such as Truncate and Drop are automatically disabled.
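If it helps to picture that guard, the sketch below shows one way such destination-driven filtering could work. It is illustrative only: the function and mode names are hypothetical, not the product's actual API.

```python
# Hypothetical sketch of a destination-driven write-mode guard (not the product's API).
ALL_WRITE_MODES = {"Truncate", "Drop", "Append", "Update", "Merge"}

def allowed_write_modes(destination: str) -> set:
    """Return the write modes compatible with the chosen destination."""
    if destination == "Overwrite current tap":
        # Writing back into the source database only supports in-place updates,
        # so destructive modes such as Truncate and Drop are never offered.
        return {"Update"}
    return ALL_WRITE_MODES

print(allowed_write_modes("Overwrite current tap"))  # {'Update'}
```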
Rule panel
| Field | Control | Description | Notes |
|---|---|---|---|
| Rule | Dropdown selection | Attach a transformation or anonymisation rule to modify your data during processing. | Defaults to none. Selecting a rule will pre-fill related options in the Load and Transform panels. |
| Schema version | Dropdown selection | Choose which version of your data schema to use during execution. | Shows all available schema versions. The most recent version is labelled as "Latest". |
| Concurrency | Configuration option | Set limits on how many data entities can be processed simultaneously. | Helps balance job performance against system resource usage. |
Selecting a rule unlocks the Transform options panel described below.
Transform options (conditional)
| Field | Description | Key details |
|---|---|---|
| Determinism | Choose between Random and Deterministic processing modes. | Deterministic mode requires a seed value and guarantees repeatable outputs (illustrated in the sketch after this table). |
| Seed | Numeric input field with "Generate" button. | Only available in deterministic mode. Accepts values between 1 and 9,999,999. |
| Check foreign keys | Dropdown selection (Auto, Do not check, Force). | Balances processing speed against data integrity guarantees. |
| Dictionary | Radio buttons (No dictionary, Field, Label, Global) with Cache, Keep, and Overwrite toggle switches. | Configure how dictionary values are handled during processing. Disabled when no rule is selected. |
| Continue streaming on fail | Checkbox | Keeps processing data even when individual records fail (enabled by default). |
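To make the determinism and seed settings concrete, here is a minimal sketch of what a deterministic, seeded transformation means in practice. The hashing approach and the `mask_value` function are illustrative assumptions, not the product's actual masking algorithm.

```python
# Illustrative only: the same input plus the same seed always produces the same
# masked output, so repeated runs (and related tables) stay consistent.
import hashlib

def mask_value(value: str, seed: int) -> str:
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return digest[:12]  # shortened pseudonym

# Same seed -> identical output on every run; a different seed -> different output.
assert mask_value("alice@example.com", seed=42) == mask_value("alice@example.com", seed=42)
assert mask_value("alice@example.com", seed=42) != mask_value("alice@example.com", seed=7)
```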
Load options panel
The Load options panel appears when your destination requires writing to a sink or database. For simple dataset copies, this panel is automatically hidden.
| Field | Control | Description | Behaviour & defaults |
|---|---|---|---|
| Target entities | Dropdown selection | Choose how to handle existing data: Truncate, Drop, Append, Update, or Merge. | Defaults to Truncate. Available options depend on your selected destination type. When overwriting your source database, only Update mode is available. |
| Stream content | Dropdown selection | Control what type of data is loaded into the database. | Options: Data and metadata (default), Data only, or Metadata only. |
| Read batch size | Number input | How many records to retrieve from your source in each batch. | Defaults to 32,768 records per batch. Transformation rules may override this setting. |
| Write batch size | Number input | How many records to write to your destination in each batch. | Defaults to 32,768 records per batch. |
| Concurrent workers | Number input | Set how many workers can process data simultaneously. | Defaults are determined by your selected transformation rule or system settings. |
Stream content options
The Stream content setting determines what type of information is written to your database during job execution:
- Data and metadata (default): Loads both your actual data records and additional metadata about the processing, including schema information and job statistics.
- Data only: Loads only your actual data records without any additional metadata information.
- Metadata only: Loads only metadata information about the job processing without your actual data records.
Write batch size impact
The Write batch size affects how data is committed to your database:
- Smaller batch sizes: Provide more frequent commits, which can be helpful for monitoring progress but may result in slower overall processing due to increased transaction overhead.
- Larger batch sizes: Reduce transaction overhead and can improve overall throughput, but provide less frequent progress updates and may consume more memory during processing.
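As a rough, back-of-the-envelope illustration of that trade-off (the numbers are made up, not benchmarks):

```python
# Each committed batch pays a fixed transaction overhead, so larger write batches
# mean fewer commits (better throughput) but less frequent progress updates.
def estimated_commits(total_rows: int, write_batch_size: int) -> int:
    return -(-total_rows // write_batch_size)  # ceiling division

for batch_size in (1_000, 32_768, 100_000):
    print(batch_size, "->", estimated_commits(1_000_000, batch_size), "commits")
# 1000 -> 1000 commits
# 32768 -> 31 commits
# 100000 -> 10 commits
```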
Concurrent workers behaviour
The Concurrent workers setting controls parallel processing but with an important limitation:
- Each worker processes a different table in parallel, allowing multiple tables to be handled simultaneously.
- A single table is never updated in parallel by multiple workers to maintain data consistency.
- Increasing workers can speed up jobs with multiple tables but won't improve performance for single-table operations.
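A minimal sketch of that rule, assuming a simple pool in which each worker owns one table at a time (the function and table names are illustrative, not the product's internals):

```python
# Tables are distributed across workers, but any single table is always handled
# by exactly one worker, so a single-table job gains nothing from extra workers.
from concurrent.futures import ThreadPoolExecutor

def load_table(table: str) -> str:
    # ...read, transform, and write this table's rows sequentially...
    return f"{table} loaded"

tables = ["customers", "orders", "invoices"]

with ThreadPoolExecutor(max_workers=3) as pool:   # "Concurrent workers" = 3
    for result in pool.map(load_table, tables):   # one table per worker at a time
        print(result)
```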
Batch size and concurrency visualization
[Visualization: how read and write batch sizes interact with concurrent workers]
Understanding backpressure
When the read batch size is significantly larger than the write batch size, backpressure can occur. This means that data is being read from the source faster than it can be written to the destination, causing a buildup of data in memory.
For example, if you set:
- Read batch size: 100,000 records
- Write batch size: 1,000 records
- Concurrent workers: 3
Each worker pulls 100,000 records into memory at once but flushes them in 1,000-record writes, so every read batch needs 100 write commits before its memory can be released. Processed data piles up in a buffer while it waits to be written, consuming memory; the system may slow down as it manages this backlog and, in extreme cases, could run out of memory.
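The sketch below reproduces that imbalance with a bounded in-memory buffer. All names, sizes, and delays are illustrative, not the product's internals: a bounded buffer forces the reader to wait once it fills, whereas an unbounded backlog simply grows until memory runs out.

```python
import queue, threading, time

READ_BATCH, WRITE_BATCH = 100_000, 1_000
buffer = queue.Queue(maxsize=2)              # bounded in-memory buffer (in batches)

def reader():
    for _ in range(3):                       # three read batches of 100,000 rows
        buffer.put(list(range(READ_BATCH)))  # blocks while the buffer is full
    buffer.put(None)                         # end-of-stream marker

def writer():
    while (batch := buffer.get()) is not None:
        for i in range(0, len(batch), WRITE_BATCH):
            _ = batch[i:i + WRITE_BATCH]     # commit 1,000 rows at a time
            time.sleep(0.001)                # simulate per-commit overhead
    print("done: 100 commits per read batch")

threading.Thread(target=reader).start()
writer()
```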
To optimize performance and memory usage:
- Keep read and write batch sizes relatively balanced
- If you need larger read batches, consider increasing write batch sizes proportionally
- Monitor your system's memory usage during job execution
Schedule panel
| Choice | Additional fields | Outcome | Primary button label |
|---|---|---|---|
| Run now | None | Job starts immediately. | Run Now |
| Schedule | Select run date and time | Creates a one-time scheduled job; appears in the Scheduled tab until execution. | Run Later |
| Manual execution | Pipeline name (in footer) | Saves the configuration as a reusable pipeline template; no job is queued. | Save as Pipeline |
| Repeat | Select next run date/time, Repeat interval (number), Repeat unit (Hour/Day/Month/Year), Pipeline name | Creates a recurring pipeline that automatically runs at set intervals. | Save as Pipeline |
Schedule summaries when collapsed
When you've configured your schedule options, the panel will collapse and show a summary of your selection:
┌────────────┬─────────────────────────────────────────────────┐
│ Run now    │ Summary shows: "Run: Now"                       │
├────────────┼─────────────────────────────────────────────────┤
│ Schedule   │ Shows your selected date and time               │
├────────────┼─────────────────────────────────────────────────┤
│ Manual     │ Displays "Run pipeline manually"                │
├────────────┼─────────────────────────────────────────────────┤
│ Repeat     │ Shows next run date/time + repeat interval      │
│            │ (e.g. "Next: 2024-07-24 18:00 — Every 1 day")   │
└────────────┴─────────────────────────────────────────────────┘
Putting it together
- Start with From and To. These two panels determine how your data will flow and automatically configure appropriate settings for the next steps.
- Optionally select a Rule to apply masking or transformation logic to your data; doing so will enable additional configuration options.
- Use Load options and Transform options to fine-tune performance settings like batch sizes, determinism, and dictionary handling.
- Finish in the Schedule panel to decide when your job should run. If you're creating a reusable pipeline, provide a unique name in the footer before saving.
- Review the collapsed panel summaries to ensure all required fields are filled. Each header will highlight missing required fields so misconfigurations are easy to spot.
Closing the modal clears all your selections so each new job starts with fresh default settings.