
New Job modal

The New Job modal launches whenever you click the New Job button on the Jobs page (or when other workflows defer to the job engine). It is a multi-panel form that adapts to your selections to produce either a one-off job or a reusable pipeline.

┌───────────────────────────────────────────────────────────────────────┐
│ New Job                                                               │
│ ┌───────────────────────────────────────────────────────────────┐     │
│ │ From ▾ | To ▾ | Rule ▾ | Load options ▾ | Transform ▾ | …     │     │
│ └───────────────────────────────────────────────────────────────┘     │
│ Panels collapse once configured so you can scan the summary rows      │
└───────────────────────────────────────────────────────────────────────┘

When Manual execution or Repeat is selected in the Schedule panel, the footer reveals a required Pipeline name field and the primary button switches to Save as Pipeline. In all other cases the button reads Run Now or Run Later depending on the selected schedule.

Panel overview

| Panel | Purpose | Summary text when collapsed |
| --- | --- | --- |
| From | Choose the source tap or a saved dataset. | Shows the tap (with driver/env badges) or the dataset name. |
| To | Route the data into a sink, tap, or dataset. | Lists the sink or dataset action (new, overwrite, merge). |
| Rule | Attach an optional masking rule and schema version. | Displays the selected rule (or “No rule”). |
| Load options | Tune batches and write mode for the selected destination. | Condenses the selected write mode and read/write batch sizes. |
| Transform options (conditional) | Configure determinism, dictionary behaviour, and foreign-key handling when a rule is active. | Shows determinism + dictionary state. |
| Schedule | Decide when and how often to run the job. | Shows “Now”, the scheduled timestamp, or pipeline cadence. |

The following sections break down every configurable field.

From panel

| Field | Control | Description | Dependencies & defaults |
| --- | --- | --- | --- |
| Source selector | Radio buttons (Tap, Dataset) | Choose whether to stream data directly from your source database (Tap) or start from an existing dataset. | Defaults to Tap. Selecting Dataset enables the dataset selection dropdown. |
| Tap summary | Read-only display with badges | Shows the name of your source database, along with environment and driver information. | Always visible to confirm your source environment before launching. |
| Dataset | Dropdown selection | Choose an existing dataset to use as your data source. | Required when Dataset source is selected. Disabled when Tap source is selected. |

Internal behaviour:

  • Choosing Tap keeps your job type aligned with the target (pump/dump).
  • Selecting a dataset stores its ID for use in later steps (merge and overwrite operations require it).

To panel

| Destination option | Extra controls | Description | Dependencies & defaults |
| --- | --- | --- | --- |
| Sink | Dropdown selection | Stream the output to one of your configured destinations. | Required when selected. Sets the run type to load-to-sink. Job type becomes pump (tap source) or load (dataset source). |
| Overwrite current tap | None | Write the data back directly into your source database (only available for certain database types). | Disabled unless your source database supports overwrite operations. Forces write mode to update. |
| New Dataset | Text input field | Create a new dataset to store your results; leave blank to auto-generate a name. | Available for all source types. Sets job type to dump for tap sources, or pump for dataset copies. |
| Overwrite dataset | Dropdown selection | Replace all data in an existing dataset with your new results. | Required when selected. Keeps job type in sync with the source (dump/pump). |
| Merge into dataset | Dropdown selection | Add new data to an existing dataset, combining records based on your model configuration. | Required when selected. Available for both tap and dataset sources. |

Behind the scenes the system automatically adjusts job settings whenever you change your source or destination. This prevents incompatible combinations (for example, when overwriting your source database, certain write modes like Truncate or Drop are automatically disabled).
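
As a rough sketch of that adjustment logic, the snippet below encodes the job-type and write-mode mapping described in the From and To tables. The function and option names are invented for illustration and are not the product's actual API:

```python
# Illustrative sketch only: names are invented; the mapping mirrors the
# From/To tables above, not an internal implementation.

ALL_WRITE_MODES = ["Truncate", "Drop", "Append", "Update", "Merge"]

def resolve_job_settings(source: str, destination: str) -> dict:
    """source: 'tap' or 'dataset'; destination: 'sink', 'overwrite_tap', or a dataset action."""
    if destination == "sink":
        job_type = "pump" if source == "tap" else "load"
    elif destination == "overwrite_tap":
        job_type = "pump"   # assumption: the table above does not state this case
    else:                   # new_dataset / overwrite_dataset / merge_dataset
        job_type = "dump" if source == "tap" else "pump"

    # Overwriting the source database forces Update and removes Truncate/Drop.
    write_modes = ["Update"] if destination == "overwrite_tap" else list(ALL_WRITE_MODES)
    return {"job_type": job_type, "write_modes": write_modes}

# A tap streamed to a sink becomes a pump job with every write mode available.
print(resolve_job_settings("tap", "sink"))
```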

Rule panel

| Field | Control | Description | Notes |
| --- | --- | --- | --- |
| Rule | Dropdown selection | Attach a transformation or anonymisation rule to modify your data during processing. | Defaults to none. Selecting a rule will pre-fill related options in the Load and Transform panels. |
| Schema version | Dropdown selection | Choose which version of your data schema to use during execution. | Shows all available schema versions. The most recent version is labelled as "Latest". |
| Concurrency | Configuration option | Set limits on how many data entities can be processed simultaneously. | Helps balance job performance against system resource usage. |

Selecting a rule unlocks the Transform options panel described below.

Transform options (conditional)

| Field | Description | Key details |
| --- | --- | --- |
| Determinism | Choose between Random and Deterministic processing modes. | Deterministic mode requires a seed value and guarantees repeatable outputs. |
| Seed | Numeric input field with "Generate" button. | Only available in deterministic mode. Accepts values between 1 and 9,999,999. |
| Check foreign keys | Dropdown selection (Auto, Do not check, Force). | Balances processing speed against data integrity guarantees. |
| Dictionary | Radio buttons (No dictionary, Field, Label, Global) with Cache, Keep, and Overwrite toggle switches. | Configure how dictionary values are handled during processing. Disabled when no rule is selected. |
| Continue streaming on fail | Checkbox | Keeps processing data even when individual records fail (enabled by default). |
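
To make the Determinism and Seed options concrete, here is a minimal sketch of why a fixed seed yields repeatable masked values while Random mode does not. The hashing scheme is invented for the example and is not the product's masking algorithm:

```python
import hashlib
import random

def mask_value(value, seed=None):
    """Replace a value with a masked token; a seed makes the result repeatable."""
    if seed is None:
        rng = random.Random()                      # Random mode: new output every run
    else:
        digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
        rng = random.Random(digest)                # Deterministic mode: same seed, same output
    return f"user_{rng.randrange(10**8):08d}"

print(mask_value("alice@example.com", seed=4221))  # identical on every run
print(mask_value("alice@example.com"))             # differs between runs
```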

Load options panel

The Load options panel appears when your destination requires writing to a sink or database. For simple dataset copies, this panel is automatically hidden.

| Field | Control | Description | Behaviour & defaults |
| --- | --- | --- | --- |
| Target entities | Dropdown selection | Choose how to handle existing data: Truncate, Drop, Append, Update, or Merge. | Defaults to Truncate. Available options depend on your selected destination type. When overwriting your source database, only Update mode is available. |
| Stream content | Dropdown selection | Control what type of data is loaded into the database. | Options: Data and metadata (default), Data only, or Metadata only. |
| Read batch size | Number input | How many records to retrieve from your source in each batch. | Defaults to 32768 records per batch. Transformation rules may override this setting. |
| Write batch size | Number input | How many records to write to your destination in each batch. | Defaults to 32768 records per batch. |
| Concurrent workers | Number input | Set how many workers can process data simultaneously. | Defaults are determined by your selected transformation rule or system settings. |

Stream content options

The Stream content setting determines what type of information is written to your database during job execution:

  • Data and metadata (default): Loads both your actual data records and additional metadata about the processing, including schema information and job statistics.
  • Data only: Loads only your actual data records without any additional metadata information.
  • Metadata only: Loads only metadata information about the job processing without your actual data records.

Write batch size impact

The Write batch size affects how data is committed to your database:

  • Smaller batch sizes: Provide more frequent commits, which can be helpful for monitoring progress but may result in slower overall processing due to increased transaction overhead.
  • Larger batch sizes: Reduce transaction overhead and can improve overall throughput, but provide less frequent progress updates and may consume more memory during processing.
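
The trade-off can be sketched as a simple batching loop. The insert_rows and commit callables below are hypothetical stand-ins for whatever your destination driver provides:

```python
def write_in_batches(rows, write_batch_size, insert_rows, commit):
    """Write rows to the destination, committing once per full batch."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= write_batch_size:
            insert_rows(batch)   # one transaction per batch
            commit()             # smaller batches mean more commits and more overhead
            batch.clear()
    if batch:                    # flush the final partial batch
        insert_rows(batch)
        commit()

# With write_batch_size=1_000 a 1,000,000-row table commits 1,000 times;
# with write_batch_size=32_768 it commits about 31 times but holds more rows in memory.
```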

Concurrent workers behaviour

The Concurrent workers setting controls parallel processing but with an important limitation:

  • Each worker processes a different table in parallel, allowing multiple tables to be handled simultaneously.
  • A single table is never updated in parallel by multiple workers to maintain data consistency.
  • Increasing workers can speed up jobs with multiple tables but won't improve performance for single-table operations.
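
A minimal sketch of this parallelism model, where copy_table is a hypothetical stand-in for processing one table end to end:

```python
from concurrent.futures import ThreadPoolExecutor

def copy_table(table_name: str) -> None:
    """Stand-in for reading, transforming, and writing a single table."""
    ...

def run_job(tables: list[str], max_workers: int = 3) -> None:
    # One task per table: tables run in parallel, but no single table is
    # ever split across multiple workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for future in [pool.submit(copy_table, t) for t in tables]:
            future.result()      # surface any per-table failure

run_job(["users", "orders", "events"])   # three tables can proceed concurrently
run_job(["users"], max_workers=8)        # extra workers do not help a single table
```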

Batch size and concurrency visualization

Below is a visualization showing how read and write batch sizes interact with concurrent workers:

┌───────────┐                ┌──────────────────────┐                ┌───────────┐
│ Source DB │  Read batch:   │  Concurrent workers  │  Write batch:  │ Target DB │
│           │ ───32768─────▶ │   W1    W2    W3     │ ───32768─────▶ │           │
└───────────┘                │  Max concurrent: 3   │                └───────────┘
                             └──────────────────────┘

Understanding backpressure

When the read batch size is significantly larger than the write batch size, backpressure can occur. This means that data is being read from the source faster than it can be written to the destination, causing a buildup of data in memory.

For example, if you set:

  • Read batch size: 100,000 records
  • Write batch size: 1,000 records
  • Concurrent workers: 3

Each worker processes 100,000 records before moving to the next batch, but only writes 1,000 records at a time. This creates a buffer where processed data waits to be written, consuming memory. The system may slow down as it tries to manage this backlog, and in extreme cases, could run out of memory.
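
A back-of-envelope estimate for this example, assuming roughly 200 bytes per record (the figure is illustrative, and the engine's real buffering model may differ):

```python
read_batch = 100_000
write_batch = 1_000
workers = 3
bytes_per_record = 200                    # assumed average record size

# Records a worker may hold while waiting for writes to drain:
buffered_per_worker = read_batch - write_batch
total_mb = buffered_per_worker * workers * bytes_per_record / 1_000_000
print(f"~{total_mb:.0f} MB buffered at peak")   # ~59 MB for these settings
```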

To optimize performance and memory usage:

  • Keep read and write batch sizes relatively balanced
  • If you need larger read batches, consider increasing write batch sizes proportionally
  • Monitor your system's memory usage during job execution

Schedule panel

| Choice | Additional fields | Outcome | Primary button label |
| --- | --- | --- | --- |
| Run now | None | Job starts immediately. | Run Now |
| Schedule | Select run date and time | Creates a one-time scheduled job; appears in the Scheduled tab until execution. | Run Later |
| Manual execution | Pipeline name (in footer) | Saves the configuration as a reusable pipeline template; no job is queued. | Save as Pipeline |
| Repeat | Select next run date/time, Repeat interval (number), Repeat unit (Hour/Day/Month/Year), Pipeline name | Creates a recurring pipeline that automatically runs at set intervals. | Save as Pipeline |
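
As an illustration of how a Repeat cadence advances, the sketch below computes the next run from an interval and unit. Month and year lengths are approximated here; the real scheduler may use calendar-aware arithmetic:

```python
from datetime import datetime, timedelta

UNIT_TO_DELTA = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Month": timedelta(days=30),    # approximation for this sketch
    "Year": timedelta(days=365),    # approximation for this sketch
}

def next_run(current_run: datetime, interval: int, unit: str) -> datetime:
    return current_run + interval * UNIT_TO_DELTA[unit]

# "Every 1 day" starting 2024-07-24 18:00 next runs on 2024-07-25 18:00.
print(next_run(datetime(2024, 7, 24, 18, 0), 1, "Day"))
```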

Schedule summaries when collapsed

When you've configured your schedule options, the panel will collapse and show a summary of your selection:

┌────────────┬─────────────────────────────────────────────────┐
│ Run now    │ Summary shows: "Run: Now"                       │
├────────────┼─────────────────────────────────────────────────┤
│ Schedule   │ Shows your selected date and time               │
├────────────┼─────────────────────────────────────────────────┤
│ Manual     │ Displays "Run pipeline manually"                │
├────────────┼─────────────────────────────────────────────────┤
│ Repeat     │ Shows next run date/time + repeat interval      │
│            │ (e.g. "Next: 2024-07-24 18:00 — Every 1 day")   │
└────────────┴─────────────────────────────────────────────────┘

Putting it together

  1. Start with From and To. These two panels determine how your data will flow and automatically configure appropriate settings for the next steps.
  2. Optionally select a Rule to apply masking or transformation logic to your data; doing so will enable additional configuration options.
  3. Use Load options and Transform options to fine-tune performance settings like batch sizes, determinism, and dictionary handling.
  4. Finish in the Schedule panel to decide when your job should run. If you're creating a reusable pipeline, provide a unique name in the footer before saving.
  5. Review the collapsed panel summaries to ensure all required fields are filled. Each header will highlight missing required fields so misconfigurations are easy to spot.

Closing the modal clears all your selections so each new job starts with fresh default settings.