# Batch jobs
Use batch jobs when you have large manifests (JSONL) in cloud storage (S3 or GCS) and want Refinery to create tasks at scale.
For smaller workloads, submitting tasks inline (`POST /v1/tasks` or `POST /v1/tasks/batch`) may be simpler.
## When to use jobs vs the inline API
| Use case | API |
|---|---|
| Few hundred tasks from app servers | `POST /v1/tasks`, `POST /v1/tasks/batch` |
| Large files in S3, recurring ETL | `POST /v1/jobs` + manifest |
## Manifest format (JSONL)
One JSON object per line:

```jsonl
{"data_url": "https://cdn.example.com/1.jpg", "metadata": {"sku": "A1"}}
{"data_url": "https://cdn.example.com/2.jpg", "metadata": {"sku": "A2"}}
```
Each line must include a valid `data_url`. The optional `metadata` object is stored with the task for traceability.
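As an illustration, a manifest like the one above can be generated with a short script. The helper name and output path are ours, not part of the API:

```python
import json

def write_manifest(rows, path):
    """Write one JSON object per line; every row must carry a data_url."""
    with open(path, "w") as f:
        for row in rows:
            if "data_url" not in row:
                raise ValueError("each manifest line must include data_url")
            f.write(json.dumps(row) + "\n")

write_manifest(
    [
        {"data_url": "https://cdn.example.com/1.jpg", "metadata": {"sku": "A1"}},
        {"data_url": "https://cdn.example.com/2.jpg", "metadata": {"sku": "A2"}},
    ],
    "manifest.jsonl",
)
```

Upload the resulting file to your S3 or GCS bucket before creating the job.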
## Create a job — `POST /v1/jobs`
:::info Security
S3/GCS credentials are encrypted at rest using AES-256-GCM before storage.
They are never returned in API responses (GET /v1/jobs/{id} omits credentials).
For production, set a unique ASG_ENCRYPTION_KEY (generate with openssl rand -hex 32).
:::
Body (simplified):

| Field | Description |
|---|---|
| `manifest_url` | Cloud URI to the JSONL manifest: `s3://...` or `gs://...` |
| `label_spec` | Shared question / options for all tasks in the job |
| `consensus_threshold` | 2–10, default 3 |
| `callback_url` | Optional webhook target for job-level notifications |
| `task_type` | e.g. `image_classification` |
| `credentials` | For S3: `access_key_id`, `secret_access_key`, `region`; for GCS: `service_account_json` |
| `delivery` | Optional `{ "type": "s3", "bucket": "my-results-bucket" }` |
### S3 Manifest

```bash
curl -sS -X POST https://api.asgrefinery.io/v1/jobs \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "manifest_url": "s3://my-bucket/manifest.jsonl",
    "label_spec": {
      "question": "Animal?",
      "options": ["cat", "dog"]
    },
    "consensus_threshold": 3,
    "task_type": "image_classification",
    "credentials": {
      "access_key_id": "AKIA...",
      "secret_access_key": "...",
      "region": "us-east-1"
    },
    "delivery": { "type": "s3", "bucket": "my-results-bucket" }
  }'
```
### GCS Manifest

```bash
curl -sS -X POST https://api.asgrefinery.io/v1/jobs \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "manifest_url": "gs://my-bucket/batch/manifest.jsonl",
    "label_spec": {
      "question": "Animal?",
      "options": ["cat", "dog"]
    },
    "credentials": {
      "service_account_json": "{\"type\":\"service_account\",...}"
    }
  }'
```
Accepted (202):

```json
{
  "job_id": "job_...",
  "status": "accepted",
  "message": "Job accepted. Manifest will be processed asynchronously."
}
```
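The same request body can be assembled programmatically before sending it with your HTTP client of choice. A minimal sketch; the helper name is ours, and the field names follow the table above:

```python
import json

def build_job_payload(manifest_url, question, options,
                      consensus_threshold=3, task_type="image_classification",
                      credentials=None, delivery=None):
    """Assemble a POST /v1/jobs body; credentials and delivery are optional."""
    payload = {
        "manifest_url": manifest_url,
        "label_spec": {"question": question, "options": options},
        "consensus_threshold": consensus_threshold,
        "task_type": task_type,
    }
    if credentials is not None:
        payload["credentials"] = credentials
    if delivery is not None:
        payload["delivery"] = delivery
    return json.dumps(payload)

body = build_job_payload(
    "s3://my-bucket/manifest.jsonl",
    "Animal?",
    ["cat", "dog"],
    credentials={"access_key_id": "AKIA...", "secret_access_key": "...",
                 "region": "us-east-1"},
    delivery={"type": "s3", "bucket": "my-results-bucket"},
)
```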
## Job lifecycle
Typical states: `accepted` → `processing` → `completed` (or `failed` on fatal errors).
Partial per-task failures may increment `tasks_failed` while other tasks complete.
## Monitor — `GET /v1/jobs/{id}`
```bash
curl -sS -H "Authorization: Bearer $API_KEY" \
  https://api.asgrefinery.io/v1/jobs/job_xxx
```
Example (200):

```json
{
  "job_id": "job_xxx",
  "status": "processing",
  "total_tasks": 1000,
  "tasks_done": 240,
  "tasks_failed": 3,
  "progress_percent": 24.3,
  "created_at": "2026-04-09T10:00:00Z",
  "completed_at": null
}
```
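A polling loop over this endpoint can be sketched as follows. Here `fetch_status` stands in for whatever HTTP client call you use to GET the job; the helper itself is ours, not part of the API:

```python
import time

def wait_for_job(fetch_status, poll_interval=5.0, timeout=3600.0):
    """Poll until the job reaches a terminal state (completed or failed).

    fetch_status: zero-arg callable returning the GET /v1/jobs/{id} JSON
    as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError("job did not settle within timeout")
        time.sleep(poll_interval)
```

Keep the poll interval generous (seconds, not milliseconds) for large manifests; progress updates as tasks settle.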
## Export results — `GET /v1/jobs/{id}/export`
Returns JSONL (`application/x-ndjson`) of settled task results when the job is `completed`.

409 if not completed yet:

```json
{
  "error": "job is not completed yet"
}
```
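Once the job is completed, the export body can be consumed line by line. A small parser, assuming nothing beyond the NDJSON framing; the result field names in the usage example are illustrative, so inspect your own export for the actual schema:

```python
import json

def parse_export(ndjson_text):
    """Parse an application/x-ndjson export body into a list of result dicts."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

results = parse_export(
    '{"task_id": "t1", "label": "cat"}\n'
    '{"task_id": "t2", "label": "dog"}\n'
)
```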
## S3 delivery

When `delivery.type` is `s3`, results are written to your bucket using the supplied credentials. Rotate keys regularly; credentials are stored on the job row for processing.

- Use `s3://` URIs for AWS S3, MinIO, and other S3-compatible storage.
- Use `gs://` URIs for Google Cloud Storage.
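A quick client-side guard for the two supported schemes (the helper name is ours):

```python
def storage_scheme(manifest_url):
    """Map a manifest URI to its storage backend: s3:// or gs://."""
    if manifest_url.startswith("s3://"):
        return "s3"
    if manifest_url.startswith("gs://"):
        return "gcs"
    raise ValueError("manifest_url must use an s3:// or gs:// URI")
```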
## Error handling

- Manifest line errors may increment `tasks_failed` without failing the whole job.
- Auth / not found: `404` for other customers' job IDs.
- Retry exports and status polls with backoff on 5xx.
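The backoff advice above can be sketched as a small wrapper. Here `call` stands in for any status poll or export request that returns a `(status_code, body)` pair; the wrapper itself is an assumption, not part of the API:

```python
import time

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry a request on 5xx responses with exponential backoff."""
    for attempt in range(max_attempts):
        status_code, body = call()
        if status_code < 500:
            return status_code, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    # All attempts exhausted; surface the last 5xx to the caller.
    return status_code, body
```

Note that 4xx responses (such as the 409 from a premature export) are returned immediately; only server-side errors are retried.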