job-crawler API base · checking…

job-crawler API

On-demand job search across Craigslist, Indeed & Monster. Results are ranked by a weighted blend of five signals — relevance (OpenAI embeddings), occupation match, pay, freshness, and distance. All endpoints return JSON; a failing source degrades to a diagnostic, never an error.

Query & listing titles are mapped to canonical O*NET occupations so synonyms collapse to one node ("SWE" = "Software Engineer" = "Application Developer" → Software Developers) — beating fuzzy keyword/embedding matching. The response returns the matched node as query_occupation and per-result occupation; confident matches also expand the crawl to alias terms for recall.

Pay is normalized to an hourly equivalent assuming a 40-hour work week (8 h/day, 2080 h/year), returned as salary.hourly and used for ranking so jobs are comparable regardless of how each source quotes pay. Every result carries a score (0–1) and a score_breakdown of the per-signal contributions (semantic · pay · recency · proximity).

GET/health

Liveness

DB connectivity, enabled sources, embedding backend.

loading…
GET/health/sources

Source health

Per-source circuit state, last success, rolling ok/err counts.

loading…
POSTGET /webhooks/{source}

Webhook ingest

Schema-less receiver for inbound provider events. Accepts any method, content-type, and body — JSON, form-encoded, XML, plain text, or empty — and stores it verbatim, so unknown payloads can be inspected and reverse-engineered. Acks 200 immediately (500 only on a DB-write failure, so the sender retries); 1 MB cap. Unauthenticated by design for now — signature headers are captured so HMAC verification can be added once the provider's signing scheme is known.

FieldTypeDescription
sourcepathLogical sender, e.g. workstream. Each source is stored & queried independently.

Live endpoint: POST https://workstream.gatherat.ai/webhooks/workstream. Each delivery is persisted with the raw body, best-effort body_json, extracted event_type, all headers (incl. signatures), query, method, content_type, and the client IP. A GET carrying a challenge/hub.challenge param is echoed back for verification handshakes.

GET/webhooks/{source}/recent?limit=N

Most-recent deliveries, newest first (id, time, method, content_type, event_type, body length).

GET/webhooks/{source}/event/{id}

One full stored delivery — headers, query, raw body, and parsed JSON.

loading recent…