# Predictor API Schema

## Predictor response

| Key                 | Value type - Required/Optional                   | Description: Value options  | Example   |
|--------------|--------------|-------------------------------|--------------|
| `predictor_name` | `string`- Required           | Unique identifier for the Predictor. Constructed automatically in `config.py` by appending the container's build timestamp (read from Apptainer's `/.singularity.d/labels.json`) to the model's base name. The format is `{ModelName}_{YYYYMMDD-HHMMSS}_{TZ}`. In development mode (outside a container), `_dev` is appended instead. This ensures every container rebuild produces a unique, sortable identifier, allowing Evaluators to distinguish between builds even when the model name has not changed. <br>  <br> This is especially important for Predictors that have undergone updates that led to different predictions.                                                                        | `"predictor_name": "deBoerTestModel_20260127-171101_PST"`      |
| `matcher_version` | `string`- Optional           | If a Matcher was used by the Predictor, the `matcher_version` returned by the Matcher should be passed through to the Evaluator. This is the build-timestamped Matcher name, following the same versioning convention as `predictor_name`.                                                                                                                                                                         | `"matcher_version": "Matcher_20260127-171101_PST"`      |
| `bin_size` | `integer` - Required for track-based models | Resolution of the model's predictions, in base pairs per bin. | `"bin_size": 1` |
| `prediction_tasks`  | `array of objects` - Required        | Each object must contain the following keys: `name`, `type_requested`, `type_actual`, `cell_type_requested`, `cell_type_actual`, `species_requested`, `species_actual`, `predictions`, `scale_prediction_requested` (optional), `scale_prediction_actual` (optional), `aggregation` (optional object).| "prediction_tasks": [<br> {<br>   "name": "task1",<br>   "type_requested": "expression",<br>  "type_actual": ["expression"],<br>  "cell_type_requested": "K562",<br>  "cell_type_actual": "bone_marrow_cell_line",<br>  "species_requested": "homo_sapiens",<br>  "species_actual": "homo_sapiens",<br>  "scale_prediction_requested": "linear", <br>  "scale_prediction_actual": "linear",<br>  "aggregation": {"bins": "mean"},<br>   "predictions": { <br>     "seq1": [12.2, 5, 6, ..],<br>     "seq2": [1.1, 12, 0.00, ..],<br>    "random_seq": [100.1, 50, 0.5, ..],<br>    "enhancer": [4, 3.0, 0.001, ..],<br>    "control": [0, 0, 0, ..] <br>   }<br> }<br>]|
| `name`  | `string` - Required        | Unique identifier for each prediction task, matching the task name sent by the Evaluator.| `"name": "task_for_model"`|
| `type_requested`  | `string` - Required        | Prediction type requested: [`"accessibility"`, `"binding_{molecule}"`, `"expression"`, `"conformation_{isoform}"`, e.g. `"conformation_chromatin"`]. `"binding_{molecule}"` can be used for any type of binding assay (e.g. ChIP-seq, H3K27ac), and the text after the `_` should be all lowercase. | `"type_requested": "expression"` |
| `type_actual`  |`array of string(s)` - Required        | Prediction type(s) completed by the Predictor. In many cases this is the assay the model predicted. If multiple tracks were averaged in a multi-task model, all of them should be listed, e.g. `["dnase", "atac-seq"]`. | `"type_actual": ["expression"]`|
|`cell_type_requested`       | `string`- Required | Cell type requested by the Evaluator.                                   | `"cell_type_requested": "HEPG2"`|
| `cell_type_actual`       | `string`- Required | Cell type returned by the Predictor. The Predictor can choose to use the Matcher module, which returns the closest matching cell type that the Predictor supports.| `"cell_type_actual": "HEPG2"` |
| `species_requested`        | `string` - Required       | Species requested by the Evaluator.  | `"species_requested": "homo_sapiens"` |
|`species_actual`        | `string` - Required       | Species used by the Predictor. | `"species_actual": "homo_sapiens"`|
| `scale_prediction_requested` | `string` - Optional            | Scaling requested by the Evaluator for the predictions: ["linear", "log"]. | `"scale_prediction_requested": "log"`       |
| `scale_prediction_actual` | `string` - Optional            | Scaling the Predictor applied to the predictions, if any: ["linear", "log"]. | `"scale_prediction_actual": "log"`    |
|`aggregation`      | `object`- Optional           | Describes how replicates, bins and/or tracks were aggregated. Each value can be any descriptive string, and Predictor builders only need to include the keys for the aggregations they actually performed. | "aggregation": {<br>   "replicates": "mean",<br>   "bins": "mean",<br>  "tracks": "special mathematical formula"<br> }  |
| `predictions`      | `object`- Required    | Object of key-value pairs where keys are sequence ID strings and values are arrays of floats/integers. Each array of predictions can be a single value, a list of values for track predictions, or nested lists (numpy arrays for msgpack-numpy responses). We suggest encoding interaction matrices as numpy arrays. The Predictor automatically matches the sequence ID keys to the Evaluator's sequence ID keys. |"predictions": {<br>   "seq1": [12.2, 5, 6, ..],<br>   "seq2": [1.1, 12, 0.00, ..],<br>  "random_seq": [100.1, 50, 0.5, ..],<br>  "enhancer": [4, 3.0, 0.001, ..],<br>  "control": [0, 0, 0, ..] <br> } |
| `trim_upstream` | `object` - Conditional | Returned only for `track` readout requests. A collection of key-value pairs mapping sequence IDs to integers. Each integer specifies the number of base pairs in the first predicted bin that fall upstream of the actual sequence or requested `prediction_range`. <br> <br> Evaluators use this offset to align the model's binned predictions back to the original genomic coordinates. | "trim_upstream": {<br>   "seq1": 5 ,<br>   "seq2": 0,<br>  "random_seq": 2,<br>  "enhancer": 1 ,<br>  "control":  0 <br> } |
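
As an illustration of the table above, an Evaluator could run a light structural check on an incoming response before using it. The sketch below is not part of the API: the function name `check_response`, the regular expression, and the choice of checks are all assumptions made for this example, and it covers only the required keys and the documented `predictor_name` shape.

```python
import re

# Documented shape: {ModelName}_{YYYYMMDD-HHMMSS}_{TZ}, or {ModelName}_dev
# outside a container. The exact character classes are an assumption.
PREDICTOR_NAME_RE = re.compile(r"^.+_(\d{8}-\d{6}_[A-Za-z0-9+-]+|dev)$")

# Keys the schema marks as required inside each prediction task.
REQUIRED_TASK_KEYS = {
    "name", "type_requested", "type_actual",
    "cell_type_requested", "cell_type_actual",
    "species_requested", "species_actual", "predictions",
}


def check_response(response: dict) -> list:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    name = response.get("predictor_name")
    if not isinstance(name, str) or not PREDICTOR_NAME_RE.match(name):
        problems.append("predictor_name missing or not in the documented format")
    tasks = response.get("prediction_tasks")
    if not isinstance(tasks, list):
        problems.append("prediction_tasks must be an array of objects")
        return problems
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            problems.append(f"task {i}: not an object")
            continue
        missing = REQUIRED_TASK_KEYS - set(task)
        if missing:
            problems.append(f"task {i}: missing keys {sorted(missing)}")
        if "type_actual" in task and not isinstance(task["type_actual"], list):
            problems.append(f"task {i}: type_actual must be an array of strings")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the response is fully valid against the schema.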

## Note on Binned Predictions and Sequence-Length Alignment

Predictors that return **binned predictions** often include **"N" bases** in flanking bins. These can skew results when performing **base-pair (bp)–level evaluation**.

When an Evaluator makes a `track` readout request:

- The **expanded bp-level prediction** (for binned outputs) **must match the length of the input sequence**.
- The start of a prediction should be aligned with the first bp of the sequence.
- By default, if no `trim_upstream` parameter is returned, the Evaluator should **crop the predictions only at the downstream end**.
- If a `trim_upstream` parameter *is* returned, the Evaluator should:
  1. **Crop upstream** by the amount specified in `trim_upstream`.
  2. **Crop the remaining required amount downstream** to ensure the final prediction length equals the sequence length.

This ensures consistent evaluation and avoids artifacts introduced by binned predictions.
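
The cropping rules above can be sketched on the Evaluator side as follows. This is a minimal illustration, not the reference implementation: the function name `align_to_sequence` is hypothetical, and expanding a bin by repeating its value `bin_size` times is an assumption about how bp-level predictions are derived from binned output.

```python
def align_to_sequence(binned, bin_size, seq_len, trim_upstream=0):
    """Expand binned predictions to bp resolution and crop to seq_len.

    trim_upstream defaults to 0, which reproduces the documented default
    of cropping only at the downstream end.
    """
    # Assumption: each bin value is repeated once per base pair it covers.
    expanded = [value for value in binned for _ in range(bin_size)]
    # Step 1: crop upstream by the amount the Predictor reported.
    cropped = expanded[trim_upstream:]
    if len(cropped) < seq_len:
        raise ValueError("prediction is shorter than the sequence after cropping")
    # Step 2: crop the remainder downstream so lengths match exactly.
    return cropped[:seq_len]


# Three bins of 4 bp cover 12 bp; the sequence is 9 bp long and the first
# bin starts 2 bp upstream of it.
aligned = align_to_sequence([1.0, 2.0, 3.0], bin_size=4, seq_len=9, trim_upstream=2)
print(aligned)  # [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
```

In this example, 2 bp are removed upstream and the remaining 1 bp of excess is removed downstream, so the result has exactly one value per base pair of the input sequence.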