
Batch jobs

Use batch jobs when you have large manifests (JSONL) in cloud storage (S3 or GCS) and want Refinery to create tasks at scale.

For smaller workloads, submitting tasks inline (`POST /v1/tasks` or `POST /v1/tasks/batch`) may be simpler.

When to use jobs vs inline API

| Use case | API |
| --- | --- |
| Few hundred tasks from app servers | `POST /v1/tasks`, `POST /v1/tasks/batch` |
| Large files in S3, recurring ETL | `POST /v1/jobs` + manifest |

Manifest format (JSONL)

One JSON object per line:

```json
{"data_url": "https://cdn.example.com/1.jpg", "metadata": {"sku": "A1"}}
{"data_url": "https://cdn.example.com/2.jpg", "metadata": {"sku": "A2"}}
```

Each line must include a valid data_url. Optional metadata is stored for your traceability.
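Manifests can be sanity-checked client-side before upload. A minimal sketch (not part of the API — Refinery performs its own validation when it processes the manifest, and the http(s)-only check on `data_url` is an assumption):

```python
import json

def validate_manifest_lines(lines):
    """Check JSONL manifest lines before upload.

    Returns (valid_count, errors), where errors is a list of
    (line_number, message) tuples. 'data_url' is required per line;
    'metadata' is optional.
    """
    valid, errors = 0, []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        url = obj.get("data_url")
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            errors.append((i, "missing or invalid data_url"))
            continue
        valid += 1
    return valid, errors
```

Running this over the example manifest above reports both lines valid; a line without `data_url` or with malformed JSON is flagged with its line number so you can fix it before the job runs.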

Create a job — POST /v1/jobs

:::info Security
S3/GCS credentials are encrypted at rest using AES-256-GCM before storage. They are never returned in API responses (`GET /v1/jobs/{id}` omits credentials). For production, set a unique `ASG_ENCRYPTION_KEY` (generate one with `openssl rand -hex 32`).
:::

Body (simplified):

| Field | Description |
| --- | --- |
| `manifest_url` | Cloud URI to the JSONL manifest: `s3://...` or `gs://...` |
| `label_spec` | Shared question / options for all tasks in the job |
| `consensus_threshold` | 2–10, default 3 |
| `callback_url` | Optional webhook target for job-level notifications |
| `task_type` | e.g. `image_classification` |
| `credentials` | For S3: `access_key_id`, `secret_access_key`, `region`; for GCS: `service_account_json` |
| `delivery` | Optional `{ "type": "s3", "bucket": "my-results-bucket" }` |

S3 Manifest

```bash
curl -sS -X POST https://api.asgrefinery.io/v1/jobs \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "manifest_url": "https://my-bucket.s3.amazonaws.com/manifest.jsonl",
    "label_spec": {
      "question": "Animal?",
      "options": ["cat", "dog"]
    },
    "consensus_threshold": 3,
    "task_type": "image_classification",
    "credentials": {
      "access_key_id": "AKIA...",
      "secret_access_key": "...",
      "region": "us-east-1"
    },
    "delivery": { "type": "s3", "bucket": "my-results-bucket" }
  }'
```

GCS Manifest

```bash
curl -sS -X POST https://api.asgrefinery.io/v1/jobs \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "manifest_url": "gs://my-bucket/batch/manifest.jsonl",
    "label_spec": {
      "question": "Animal?",
      "options": ["cat", "dog"]
    },
    "credentials": {
      "service_account_json": "{\"type\":\"service_account\",...}"
    }
  }'
```
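The required credential fields differ by backend, so a client can verify the payload before submitting. A hypothetical helper based on the body table above (`required_credentials` and `missing_credentials` are illustrative, not part of any SDK):

```python
def required_credentials(manifest_url):
    """Return the credential keys expected for the manifest's backend.

    gs:// manifests need a GCS service account; s3:// (and S3-style
    https URLs) need AWS-style keys. Field names follow the
    POST /v1/jobs body table.
    """
    if manifest_url.startswith("gs://"):
        return {"service_account_json"}
    return {"access_key_id", "secret_access_key", "region"}

def missing_credentials(manifest_url, credentials):
    """List required credential fields absent from the supplied dict."""
    return sorted(required_credentials(manifest_url) - credentials.keys())
```

Checking the payload locally turns a confusing server-side job failure into an immediate, named error.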

Accepted (202):

```json
{
  "job_id": "job_...",
  "status": "accepted",
  "message": "Job accepted. Manifest will be processed asynchronously."
}
```

Job lifecycle

Typical states: `accepted` → `processing` → `completed` (or `failed` on fatal errors).

Partial per-task failures may increment tasks_failed while others complete.

Monitor — GET /v1/jobs/{id}

```bash
curl -sS -H "Authorization: Bearer $API_KEY" \
  https://api.asgrefinery.io/v1/jobs/job_xxx
```

Example (200):

```json
{
  "job_id": "job_xxx",
  "status": "processing",
  "total_tasks": 1000,
  "tasks_done": 240,
  "tasks_failed": 3,
  "progress_percent": 24.3,
  "created_at": "2026-04-09T10:00:00Z",
  "completed_at": null
}
```
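In this example, `progress_percent` matches `(tasks_done + tasks_failed) / total_tasks` ((240 + 3) / 1000 = 24.3%), i.e. failed tasks also count toward progress. A sketch that recomputes it client-side under that assumption:

```python
def progress_percent(status):
    """Recompute job progress from task counts.

    Assumes progress counts both done and failed tasks, which matches
    the example response (240 done + 3 failed of 1000 -> 24.3%).
    """
    total = status.get("total_tasks") or 0
    if total == 0:
        return 0.0
    settled = status.get("tasks_done", 0) + status.get("tasks_failed", 0)
    return round(100.0 * settled / total, 1)
```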

Export results — GET /v1/jobs/{id}/export

Returns JSONL (application/x-ndjson) of settled task results when the job is completed.

409 if not completed yet:

```json
{
  "error": "job is not completed yet"
}
```

S3 delivery

When `delivery.type` is `s3`, results are written to your bucket using the supplied credentials. Rotate keys regularly; credentials are stored (encrypted, as noted above) on the job row for processing.

:::tip
Use `s3://` URIs for AWS S3, MinIO, and S3-compatible storage. Use `gs://` URIs for Google Cloud Storage.
:::

Error handling

  • Manifest line errors may increment `tasks_failed` without failing the whole job.
  • Not found: job IDs belonging to other customers return `404`.
  • Retry exports and status polls with exponential backoff on `5xx` responses.
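The last point can be sketched as a small retry wrapper. This is illustrative, not an SDK function; `ServerError` is a stand-in for however your HTTP client surfaces 5xx responses:

```python
import random
import time

class ServerError(Exception):
    """Stand-in for an HTTP 5xx response from the API."""

def with_backoff(fn, retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(), retrying on ServerError with exponential backoff + jitter.

    Other exceptions propagate immediately: a 409 ("job is not
    completed yet") should be handled by polling job status, not by
    blind retries.
    """
    for attempt in range(retries):
        try:
            return fn()
        except ServerError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            delay = min(cap, base * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # full + half jitter
```

Wrap your status poll or export call, e.g. `with_backoff(lambda: fetch_export(job_id))`, where `fetch_export` is your own HTTP call that raises `ServerError` on 5xx.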