# OpenAI Capability Map

Status: Complete with Firecrawl evidence; Tavily blocked by missing `TAVILY_API_KEY`.

Last updated: 2026-05-07

Decision supported: use OpenAI for the MVP. Make Realtime voice plus Responses API Structured Outputs the core demo path.

## Evidence Collection

Firecrawl was used to search and scrape official OpenAI documentation and supporting Hugging Face pages. Raw evidence is stored in `.firecrawl/`.

Key Firecrawl evidence files:

- `.firecrawl/platform.openai.com-docs-guides-realtime.md`
- `.firecrawl/platform.openai.com-docs-guides-speech-to-text.md`
- `.firecrawl/platform.openai.com-docs-guides-structured-outputs.md`
- `.firecrawl/platform.openai.com-docs-guides-text.md`
- `.firecrawl/platform.openai.com-docs-guides-images-vision.md`
- `.firecrawl/platform.openai.com-docs-guides-file-inputs.md`
- `.firecrawl/platform.openai.com-docs-models.md`
- `.firecrawl/platform.openai.com-docs-guides-latest-model.md`
- `.firecrawl/openai-api-research-search-openai-broad.json`

Tavily was required by `docs/openaiapiresearch.md`, but it could not be used because no Tavily CLI was available and `TAVILY_API_KEY` was not present in the shell environment. Do not claim Tavily-backed evidence until the key is added and the search pass is rerun.

## Core Workflow Capability Map

| Core workflow step | OpenAI capability | Status | Score | Why it fits | Evidence |
|---|---|---:|---:|---|---|
| Thai voice check-in | Realtime API with a realtime speech model | Core | 5 | OpenAI docs describe Realtime as low-latency multimodal communication with native speech-to-speech interaction, audio/image/text input, and audio/text output. This is the most emotionally clear hackathon interaction. | [Realtime API](https://platform.openai.com/docs/guides/realtime), [Models](https://platform.openai.com/docs/models) |
| Transcript fallback | Audio transcription API with `gpt-4o-transcribe` or `gpt-4o-mini-transcribe` | Core fallback | 5 | If live voice is unreliable, uploaded or recorded audio can be transcribed instead. The docs identify transcription models with higher quality than `whisper-1`, support streaming of transcription results for completed recordings, and note that Realtime handles transcription of ongoing audio. | [Speech to text](https://platform.openai.com/docs/guides/speech-to-text) |
| Structured incident log | Responses API plus Structured Outputs | Core | 5 | Structured Outputs makes model responses follow a supplied JSON Schema and is recommended over JSON mode when available. This directly supports care event cards and timeline storage; a schema sketch follows this table. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs), [Text generation](https://platform.openai.com/docs/guides/text) |
| Caregiver burden signal | Responses API with structured classification fields | Core | 5 | The model can convert narrative context into fields such as stress score, sleep hours, risk level, and help request. The schema must restrict outputs to wellness/support language. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs) |
| Safe next step | System instructions plus schema fields plus UI copy | Core | 4 | OpenAI can produce constrained guidance, but the application must hard-code safety boundaries: no diagnosis, no medication advice, and emergency escalation when appropriate. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs), [Text generation](https://platform.openai.com/docs/guides/text) |
| Family handoff summary | Responses API text generation in Thai | Core | 5 | OpenAI text models can generate human-like prose and structured JSON. The handoff should be a schema field so the UI can display and copy it reliably. | [Text generation](https://platform.openai.com/docs/guides/text) |
| Weekly trend | Responses API over stored care logs | Core for demo if seeded | 4 | Weekly trend reasoning can be done over seeded JSON logs. It should not require training or retrieval for v1. | [Text generation](https://platform.openai.com/docs/guides/text) |
| Optional care-card or hospital-instruction context | Vision and file input via Responses API | Later | 3 | Docs support image analysis and file inputs, including PDFs. For the hackathon, this is useful only after voice + structured handoff works. | [Images and vision](https://platform.openai.com/docs/guides/images-vision), [File inputs](https://platform.openai.com/docs/guides/file-inputs) |
| Tool calling | Function tools, only if saving logs server-side | Later | 2 | Tool calling is not needed for a first clickable demo if the app can store JSON directly after response parsing. Add later for production workflows. | [Text generation](https://platform.openai.com/docs/guides/text) |
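
As referenced in the structured incident log row, here is a minimal sketch of the extraction step, assuming the Node `openai` SDK. The `stress_score`, `sleep_hours`, `risk_level`, and `help_request` fields come from the capability map above; `incident_summary`, `handoff_message_th`, the model choice, and `transcriptText` are illustrative placeholders:

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Schema shape per the Structured Outputs guide; strict mode requires
// additionalProperties: false and every property listed in `required`.
const careLogSchema = {
  type: "object",
  properties: {
    incident_summary: { type: "string" },
    stress_score: { type: "integer" }, // 0-10 scale enforced in app code
    sleep_hours: { type: "number" },
    risk_level: { type: "string", enum: ["low", "medium", "high"] },
    help_request: { type: "string" },
    handoff_message_th: { type: "string" }, // Thai family handoff text
  },
  required: [
    "incident_summary",
    "stress_score",
    "sleep_hours",
    "risk_level",
    "help_request",
    "handoff_message_th",
  ],
  additionalProperties: false,
};

const transcriptText = "..."; // transcript from Realtime or the STT fallback

const response = await client.responses.create({
  model: "gpt-4o", // illustrative choice, not a recommendation
  input: [
    {
      role: "system",
      content:
        "Extract a care log from a Thai caregiver check-in. " +
        "Wellness and support language only: no diagnosis, no medication advice.",
    },
    { role: "user", content: transcriptText },
  ],
  text: {
    format: {
      type: "json_schema",
      name: "care_log",
      schema: careLogSchema,
      strict: true,
    },
  },
});

const careLog = JSON.parse(response.output_text); // schema-valid by contract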

## Capability Decision

### Core

- Realtime API for live Thai voice check-in.
- Audio transcription fallback for reliability.
- Responses API for reasoning and message generation.
- Structured Outputs for schema-valid care logs and handoff data.

### Optional After Core Works

- Vision input for care cards, labels, and routine sheets.
- File input for hospital instructions or PDFs.
- Weekly trend over stored logs.

### Not Needed For V1

- Fine-tuning.
- Custom Hugging Face model deployment.
- Broad agent/tool orchestration.
- Medical diagnosis workflows.

## Voice Architecture Decision

Recommended path:

```text
Browser voice UI
-> OpenAI Realtime API using gpt-realtime-1.5
-> Thai caregiver conversation
-> transcript or summarized turn
-> Responses API with Structured Outputs
-> care_log JSON
```
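
A minimal server-side sketch of the Realtime leg, assuming the WebSocket transport and event names described in the Realtime guide plus the `ws` package; the beta header, session fields, and event types should be re-verified against the current docs:

```ts
import WebSocket from "ws";

// Model name taken from the diagram above.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-1.5",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  },
);

ws.on("open", () => {
  // Constrain the session before any audio flows.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["audio", "text"],
        instructions:
          "Speak Thai. Supportive caregiver check-in only: " +
          "no diagnosis, no medication advice.",
      },
    }),
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Accumulate the spoken-reply transcript for the Structured Outputs step.
  if (event.type === "response.audio_transcript.delta") {
    process.stdout.write(event.delta);
  }
});
```

The Realtime guide also documents a browser-side WebRTC path with ephemeral keys; this sketch covers only the server-side shape.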

Demo fallback:

```text
Seeded Thai transcript or recorded audio
-> gpt-4o-transcribe / gpt-4o-mini-transcribe if audio is used
-> Responses API with Structured Outputs
-> same care_log JSON
```
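
A sketch of the fallback transcription call, again assuming the Node `openai` SDK; the file path is hypothetical, and the `language` hint should be checked against the speech-to-text guide for the chosen model:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

// Swap in gpt-4o-mini-transcribe for a cheaper pass.
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream("checkin-th.m4a"), // hypothetical recording
  model: "gpt-4o-transcribe",
  language: "th",
});

// transcription.text feeds the same Responses API extraction step.
console.log(transcription.text);
```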

Reason: Realtime gives the strongest judge-facing moment, but a transcript fallback protects the demo from microphone, network, or latency failures.

## Structured Extraction Decision

Use the Responses API plus Structured Outputs, not free-form chat. Per the official Structured Outputs guide, JSON mode only guarantees syntactically valid JSON, while Structured Outputs also guarantees adherence to the supplied schema. This matters because the UI needs stable fields for the elements below; a type sketch follows the list:

- incident cards
- risk badges
- caregiver burden chart
- handoff message
- weekly trend
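
A hypothetical TypeScript shape for those fields, mirroring the `care_log` schema sketch earlier (names beyond the stress/sleep/risk/help fields are placeholders):

```ts
// One property per UI element listed above.
interface CareLog {
  incident_summary: string;              // incident cards
  risk_level: "low" | "medium" | "high"; // risk badges
  stress_score: number;                  // caregiver burden chart
  sleep_hours: number;                   // caregiver burden chart
  help_request: string;                  // safe next step copy
  handoff_message_th: string;            // handoff message (Thai)
}
```

The weekly trend is computed over stored `CareLog[]` records, so it needs no extra field here.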

## Multimodal Decision

Vision and file inputs should be labeled `Later`, not `Core`.

They are relevant to the future product because caregivers may have care cards, medication labels, discharge PDFs, or routine instructions. They are risky for v1 because they invite clinical interpretation of medications and instructions. If included in the demo, the UI must say:

> Context only. This does not diagnose, prescribe, or change medication instructions.
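
If this path is exercised later, a minimal sketch of image input through the Responses API, using the `input_text` / `input_image` content parts from the vision guide; the model, URL, and prompt are illustrative:

```ts
import OpenAI from "openai";

const client = new OpenAI();

const visionResponse = await client.responses.create({
  model: "gpt-4o", // illustrative
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "Describe this care card as context only. Do not interpret medication.",
        },
        {
          type: "input_image",
          image_url: "https://example.com/care-card.jpg", // hypothetical
          detail: "auto",
        },
      ],
    },
  ],
});

console.log(visionResponse.output_text);
```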

## Source Notes

- OpenAI Realtime docs: https://platform.openai.com/docs/guides/realtime
- OpenAI Speech-to-text docs: https://platform.openai.com/docs/guides/speech-to-text
- OpenAI Structured Outputs docs: https://platform.openai.com/docs/guides/structured-outputs
- OpenAI Text generation / Responses API docs: https://platform.openai.com/docs/guides/text
- OpenAI Images and vision docs: https://platform.openai.com/docs/guides/images-vision
- OpenAI File inputs docs: https://platform.openai.com/docs/guides/file-inputs
- OpenAI Models docs: https://platform.openai.com/docs/models
