New Job modal
The New Job modal launches whenever you click the New Job button on the Jobs page (or when other workflows defer to the job engine). It is a multi-panel form that adapts to your selections to produce either a one-off job or a reusable pipeline.
┌──────────────────────────────────────────────────────────────────┐
│ New Job                                                          │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ From ▾ | To ▾ | Rule ▾ | Load options ▾ | Transform ▾ | …    │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ Panels collapse once configured so you can scan the summary rows │
└──────────────────────────────────────────────────────────────────┘
When Manual execution or Repeat is selected in the Schedule panel, the footer reveals a required Pipeline name field and the primary button switches to Save as Pipeline. In all other cases the button reads Run Now or Run Later depending on the selected schedule.
Panel overview
| Panel | Purpose | Summary text when collapsed |
|---|---|---|
| From | Choose the source tap or a saved dataset. | Shows the tap (with driver/env badges) or the dataset name. |
| To | Route the data into a sink, tap, or dataset. | Lists the sink or dataset action (new, overwrite, merge). |
| Rule | Attach an optional masking rule and schema version. | Displays the selected rule (or “No rule”). |
| Load options | Tune batches and write mode for the selected destination. | Condenses the selected write mode and read/write batch sizes. |
| Transform options (conditional) | Configure determinism, dictionary behaviour, and foreign-key handling when a rule is active. | Shows determinism + dictionary state. |
| Schedule | Decide when and how often to run the job. | Shows “Now”, the scheduled timestamp, or pipeline cadence. |
The following sections break down every configurable field.
From panel
| Field | Control | Description | Dependencies & defaults |
|---|---|---|---|
| Source selector | Radio buttons (Tap, Dataset) | Choose whether to stream data directly from your source database (Tap) or start from an existing dataset. | Defaults to Tap. Selecting Dataset enables the Dataset dropdown. |
| Tap summary | Read-only display with badges | Shows the name of your source database, along with environment and driver information. | Always visible to confirm your source environment before launching. |
| Dataset | Dropdown selection | Choose an existing dataset to use as your data source. | Required when Dataset source is selected. Disabled when Tap source is selected. |
Internal behaviour:
- Choosing Tap keeps your job type aligned with the target (pump/dump).
- Selecting a dataset stores its ID for use in later steps (merge and overwrite operations require it).
To panel
| Destination option | Extra controls | Description | Dependencies & defaults |
|---|---|---|---|
| Sink | Dropdown selection | Stream the output to one of your configured destinations. | Required when selected. Sets the run type to load-to-sink. Job type becomes pump (tap source) or load (dataset source). |
| Overwrite current tap | None | Write the data back directly into your source database (only available for certain database types). | Disabled unless your source database supports overwrite operations. Forces write mode to update. |
| New Dataset | Text input field | Create a new dataset to store your results; leave blank to auto-generate a name. | Available for all source types. Sets job type to dump for tap sources, or pump for dataset copies. |
| Overwrite dataset | Dropdown selection | Replace all data in an existing dataset with your new results. | Required when selected. Keeps job type in sync with the source (dump/pump). |
| Merge into dataset | Dropdown selection | Add new data to an existing dataset, combining records based on your model configuration. | Required when selected. Available for both tap and dataset sources. |
Behind the scenes, the system automatically adjusts job settings whenever you change the source or destination, preventing incompatible combinations. For example, when you overwrite your source database, destructive write modes such as Truncate and Drop are automatically disabled.
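If it helps to picture that guard, the sketch below shows one way such destination-driven filtering could work. It is illustrative only: the function and mode names are hypothetical, not the product's actual API.

```python
# Hypothetical sketch of a destination-driven write-mode guard (not the product's API).
ALL_WRITE_MODES = {"Truncate", "Drop", "Append", "Update", "Merge"}

def allowed_write_modes(destination: str) -> set:
    """Return the write modes compatible with the chosen destination."""
    if destination == "Overwrite current tap":
        # Writing back into the source database only supports in-place updates,
        # so destructive modes such as Truncate and Drop are never offered.
        return {"Update"}
    return ALL_WRITE_MODES

print(allowed_write_modes("Overwrite current tap"))  # {'Update'}
```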
Rule panel
| Field | Control | Description | Notes |
|---|---|---|---|
| Rule | Dropdown selection | Attach a transformation or anonymisation rule to modify your data during processing. | Defaults to none. Selecting a rule will pre-fill related options in the Load and Transform panels. |
| Schema version | Dropdown selection | Choose which version of your data schema to use during execution. | Shows all available schema versions. The most recent version is labelled as "Latest". |
| Concurrency | Configuration option | Set limits on how many data entities can be processed simultaneously. | Helps balance job performance against system resource usage. |
Selecting a rule unlocks the Transform options panel described below.
Transform options (conditional)
| Field | Description | Key details |
|---|---|---|
| Determinism | Choose between Random and Deterministic processing modes. | Deterministic mode requires a seed value and guarantees repeatable outputs (illustrated in the sketch after this table). |
| Seed | Numeric input field with "Generate" button. | Only available in deterministic mode. Accepts values between 1 and 9,999,999. |
| Check foreign keys | Dropdown selection (Auto, Do not check, Force). | Balances processing speed against data integrity guarantees. |
| Dictionary | Radio buttons (No dictionary, Field, Label, Global) with Cache, Keep, and Overwrite toggle switches. | Configure how dictionary values are handled during processing. Disabled when no rule is selected. |
| Continue streaming on fail | Checkbox | Keeps processing data even when individual records fail (enabled by default). |
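To make the determinism and seed settings concrete, here is a minimal sketch of what a deterministic, seeded transformation means in practice. The hashing approach and the `mask_value` function are illustrative assumptions, not the product's actual masking algorithm.

```python
# Illustrative only: the same input plus the same seed always produces the same
# masked output, so repeated runs (and related tables) stay consistent.
import hashlib

def mask_value(value: str, seed: int) -> str:
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return digest[:12]  # shortened pseudonym

# Same seed -> identical output on every run; a different seed -> different output.
assert mask_value("alice@example.com", seed=42) == mask_value("alice@example.com", seed=42)
assert mask_value("alice@example.com", seed=42) != mask_value("alice@example.com", seed=7)
```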
Load options panel
The Load options panel appears when your destination requires writing to a sink or database. For simple dataset copies, this panel is automatically hidden.
| Field | Control | Description | Behaviour & defaults |
|---|---|---|---|
| Target entities | Dropdown selection | Choose how to handle existing data: Truncate, Drop, Append, Update, or Merge. | Defaults to Truncate. Available options depend on your selected destination type. When overwriting your source database, only Update mode is available. |
| Stream content | Dropdown selection | Control what type of data is loaded into the database. | Options: Data and metadata (default), Data only, or Metadata only. |
| Read batch size | Number input | How many records to retrieve from your source in each batch. | Defaults to 32,768 records per batch. Transformation rules may override this setting. |
| Write batch size | Number input | How many records to write to your destination in each batch. | Defaults to 32,768 records per batch. |
| Concurrent workers | Number input | Set how many workers can process data simultaneously. | Defaults are determined by your selected transformation rule or system settings. |
Stream content options
The Stream content setting determines what type of information is written to your database during job execution:
- Data and metadata (default): Loads both your actual data records and additional metadata about the processing, including schema information and job statistics.
- Data only: Loads only your actual data records without any additional metadata information.
- Metadata only: Loads only metadata information about the job processing without your actual data records.
Write batch size impact
The Write batch size affects how data is committed to your database:
- Smaller batch sizes: Provide more frequent commits, which can be helpful for monitoring progress but may result in slower overall processing due to increased transaction overhead.
- Larger batch sizes: Reduce transaction overhead and can improve overall throughput, but provide less frequent progress updates and may consume more memory during processing.
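As a rough, back-of-the-envelope illustration of that trade-off (the numbers are made up, not benchmarks):

```python
# Each committed batch pays a fixed transaction overhead, so larger write batches
# mean fewer commits (better throughput) but less frequent progress updates.
def estimated_commits(total_rows: int, write_batch_size: int) -> int:
    return -(-total_rows // write_batch_size)  # ceiling division

for batch_size in (1_000, 32_768, 100_000):
    print(batch_size, "->", estimated_commits(1_000_000, batch_size), "commits")
# 1000 -> 1000 commits
# 32768 -> 31 commits
# 100000 -> 10 commits
```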
Concurrent workers behaviour
The Concurrent workers setting controls parallel processing but with an important limitation:
- Each worker processes a different table in parallel, allowing multiple tables to be handled simultaneously.
- A single table is never updated in parallel by multiple workers to maintain data consistency.
- Increasing workers can speed up jobs with multiple tables but won't improve performance for single-table operations.
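A minimal sketch of that rule, assuming a simple pool in which each worker owns one table at a time (the function and table names are illustrative, not the product's internals):

```python
# Tables are distributed across workers, but any single table is always handled
# by exactly one worker, so a single-table job gains nothing from extra workers.
from concurrent.futures import ThreadPoolExecutor

def load_table(table: str) -> str:
    # ...read, transform, and write this table's rows sequentially...
    return f"{table} loaded"

tables = ["customers", "orders", "invoices"]

with ThreadPoolExecutor(max_workers=3) as pool:   # "Concurrent workers" = 3
    for result in pool.map(load_table, tables):   # one table per worker at a time
        print(result)
```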
Batch size and concurrency visualization
[Visualization: how read and write batch sizes interact with concurrent workers]
Understanding backpressure
When the read batch size is significantly larger than the write batch size, backpressure can occur. This means that data is being read from the source faster than it can be written to the destination, causing a buildup of data in memory.
For example, if you set:
- Read batch size: 100,000 records
- Write batch size: 1,000 records
- Concurrent workers: 3
Each worker pulls 100,000 records into memory at once but flushes them in 1,000-record writes, so every read batch needs 100 write commits before its memory can be released. Processed data piles up in a buffer while it waits to be written, consuming memory; the system may slow down as it manages this backlog and, in extreme cases, could run out of memory.
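The sketch below reproduces that imbalance with a bounded in-memory buffer. All names, sizes, and delays are illustrative, not the product's internals: a bounded buffer forces the reader to wait once it fills, whereas an unbounded backlog simply grows until memory runs out.

```python
import queue, threading, time

READ_BATCH, WRITE_BATCH = 100_000, 1_000
buffer = queue.Queue(maxsize=2)              # bounded in-memory buffer (in batches)

def reader():
    for _ in range(3):                       # three read batches of 100,000 rows
        buffer.put(list(range(READ_BATCH)))  # blocks while the buffer is full
    buffer.put(None)                         # end-of-stream marker

def writer():
    while (batch := buffer.get()) is not None:
        for i in range(0, len(batch), WRITE_BATCH):
            _ = batch[i:i + WRITE_BATCH]     # commit 1,000 rows at a time
            time.sleep(0.001)                # simulate per-commit overhead
    print("done: 100 commits per read batch")

threading.Thread(target=reader).start()
writer()
```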
To optimize performance and memory usage:
- Keep read and write batch sizes relatively balanced
- If you need larger read batches, consider increasing write batch sizes proportionally
- Monitor your system's memory usage during job execution
Schedule panel
| Choice | Additional fields | Outcome | Primary button label |
|---|---|---|---|
| Run now | None | Job starts immediately. | Run Now |
| Schedule | Select run date and time | Creates a one-time scheduled job; appears in the Scheduled tab until execution. | Run Later |
| Manual execution | Pipeline name (in footer) | Saves the configuration as a reusable pipeline template; no job is queued. | Save as Pipeline |
| Repeat | Select next run date/time, Repeat interval (number), Repeat unit (Hour/Day/Month/Year), Pipeline name | Creates a recurring pipeline that automatically runs at set intervals. | Save as Pipeline |
Schedule summaries when collapsed
When you've configured your schedule options, the panel will collapse and show a summary of your selection:
┌────────────┬─────────────────────────────────────────────────┐
│ Run now    │ Summary shows: "Run: Now"                       │
├────────────┼─────────────────────────────────────────────────┤
│ Schedule   │ Shows your selected date and time               │
├────────────┼─────────────────────────────────────────────────┤
│ Manual     │ Displays "Run pipeline manually"                │
├────────────┼─────────────────────────────────────────────────┤
│ Repeat     │ Shows next run date/time + repeat interval      │
│            │ (e.g. "Next: 2024-07-24 18:00 — Every 1 day")   │
└────────────┴─────────────────────────────────────────────────┘
Putting it together
- Start with From and To. These two panels determine how your data will flow and automatically configure appropriate settings for the next steps.
- Optionally select a Rule to apply masking or transformation logic to your data; doing so will enable additional configuration options.
- Use Load options and Transform options to fine-tune performance settings like batch sizes, determinism, and dictionary handling.
- Finish in the Schedule panel to decide when your job should run. If you're creating a reusable pipeline, provide a unique name in the footer before saving.
- Review the collapsed panel summaries to ensure all required fields are filled. Each header will highlight missing required fields so misconfigurations are easy to spot.
Closing the modal clears all your selections so each new job starts with fresh default settings.