# Hugging Face Dementia And Thai Speech Assets

Status: Complete with Firecrawl evidence; Tavily blocked by missing `TAVILY_API_KEY`.

Last updated: 2026-05-07

Decision supported: Hugging Face is useful for background research, future evaluation, and Thai ASR awareness. It should not replace OpenAI for the v1 demo and should not be used for medical claims without deeper validation.

## Evidence Collection

Firecrawl was used for broad discovery and for scraping selected Hugging Face pages.

Raw evidence files:

- `.firecrawl/openai-api-research-search-hf-dementia-broad.json`
- `.firecrawl/openai-api-research-search-hf-thai-broad.json`
- `.firecrawl/openai-api-research-search-hf-dementia-models.json`
- `.firecrawl/openai-api-research-search-hf-thai-asr-models.json`
- `.firecrawl/huggingface.co-datasets-MearaHe-dementiabank.md`
- `.firecrawl/huggingface.co-datasets-AzizSouiai-Alzheimer.md`
- `.firecrawl/huggingface.co-datasets-SEACrowd-thai_elderly_speech.md`
- `.firecrawl/huggingface.co-datasets-mcshao-EThai-ASR.md`
- `.firecrawl/huggingface.co-collections-tawankri-thai-asr.md`
- `.firecrawl/huggingface.co-nectec-Pathumma-whisper-th-large-v3.md`
- `.firecrawl/huggingface.co-biodatlab-distill-whisper-th-large-v3.md`
- `.firecrawl/huggingface.co-biodatlab-whisper-th-medium-combined.md`

Tavily could not be used because `TAVILY_API_KEY` is missing.

## Dataset Table

| Asset | URL | Modality | Language | License / access notes | Relevance | Risk | Use decision |
|---|---|---|---|---|---|---|---|
| MearaHe/dementiabank | https://huggingface.co/datasets/MearaHe/dementiabank | Text/conversation preview | English | License not clearly visible in scraped card | Dementia-related language samples; relevant to dementia conversation research | DementiaBank data can have access/consent restrictions in original sources; not Thai; not caregiver-wellness-specific | Background only. Do not train or demo with it until license/provenance is verified. |
| AzizSouiai/Alzheimer | https://huggingface.co/datasets/AzizSouiai/Alzheimer | Text QA/instruction examples | English | License not clearly visible in scraped card | Contains Alzheimer Q&A style rows, including symptoms, wandering, caregiver support topics | Medical Q&A quality and source provenance unclear; not Thai; includes treatment/medication topics that could encourage unsafe product scope | Avoid for v1. At most use as a checklist of topics to guard against. |
| SEACrowd/thai_elderly_speech | https://huggingface.co/datasets/SEACrowd/thai_elderly_speech | Speech | Thai (elderly speakers) | Dataset card requires `seacrowd`; license section exists but needs manual confirmation before use | Highly relevant to Thai elderly speech recognition research and future ASR evaluation | Dataset viewer dependency issue; not dementia-specific; voice recordings may raise privacy and consent concerns | Future evaluation only after license and consent review. Not needed for v1 demo. |
| mcshao/EThai-ASR | https://huggingface.co/datasets/mcshao/EThai-ASR | Speech labels / ASR corpus metadata | Thai | Apache-2.0 shown in scraped card | Relevant to Thai ASR and model evaluation; card mentions 14k hours of curated Thai speech data and label refinement | Not dementia/caregiver-specific; may be too broad for caregiver wellness; still requires dataset handling review | Useful for future Thai ASR research. Not needed for OpenAI-powered v1. |
| Thai ASR collection by tawankri | https://huggingface.co/collections/tawankri/thai-asr | Collection | Thai | Collection, not a dataset | Useful map of Thai ASR models and recent Thai Whisper variants | Collection quality varies; each model needs separate license review | Reference only. Use to understand Thai ASR landscape. |

## Model Table

| Asset | URL | Task | Language | License | Relevance | Risk | Use decision |
|---|---|---|---|---|---|---|---|
| nectec/Pathumma-whisper-th-large-v3 | https://huggingface.co/nectec/Pathumma-whisper-th-large-v3 | Automatic speech recognition | Thai, English | Apache-2.0 shown in scraped card | Strong Thai ASR reference; appears in Thai ASR collection; useful to compare against OpenAI transcription fallback | External model deployment adds complexity; not needed if OpenAI transcription works | Reference/future benchmark only. |
| biodatlab/distill-whisper-th-large-v3 | https://huggingface.co/biodatlab/distill-whisper-th-large-v3 | Automatic speech recognition | Thai | MIT shown in scraped card | Thai-specific ASR model; card says it is tailored for Thai speech and trained with Thai datasets including Thai Elderly Speech Corpus | ASR only, not caregiver safety; may vary by dialect/accent | Reference/future benchmark only. |
| biodatlab/whisper-th-medium-combined | https://huggingface.co/biodatlab/whisper-th-medium-combined | Automatic speech recognition | Thai | Apache-2.0 shown in scraped card | Thai Whisper model with practical pipeline example | External inference and hosting add demo risk | Reference/future fallback only, not v1 core. |
| Dementia/Alzheimer detection models from broad search | N/A (search results were mostly generic medical-imaging model pages) | Classification / medical imaging | Mostly English or not applicable (imaging) | Varies | Low relevance to caregiver wellness workflow | High medical risk; diagnosis-adjacent; not focused on Thai family care | Avoid for v1. Do not use diagnosis models. |

## Findings

### Finding 1: Hugging Face has relevant Thai ASR assets.

Thai ASR assets are the strongest Hugging Face fit. They may help future benchmarking if OpenAI transcription struggles with Thai accents, elderly voices, or noisy home environments.

Product implication:

- Keep OpenAI voice/transcription as v1 core.
- Use Thai ASR assets later to design evaluation clips or fallback experiments.
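A later transcription comparison needs a metric. Written Thai does not separate words with spaces, so character error rate (CER) is often easier to compute than word error rate, which needs a word segmenter first. The following is a minimal sketch, not an agreed evaluation method for this project; the function names and the choice to strip whitespace are assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted per code point."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumption: apply NFC and drop all whitespace so spacing
    # differences between transcripts are not counted as errors.
    return "".join(unicodedata.normalize("NFC", text).split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Calling `cer(reference_text, asr_output)` on the same clips for each transcription system gives comparable scores. Because the distance is counted per Unicode code point, Thai vowel signs and tone marks each count as one character, which is a simplification.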

### Finding 2: Dementia datasets exist, but they do not directly fit the Thai caregiver wellness MVP.

DementiaBank-style data is dementia-related but not Thai and may have access/provenance limitations. Alzheimer Q&A datasets can introduce medical-answering risk and are not caregiver-wellness-specific.

Product implication:

- Do not fine-tune on these for v1.
- Do not use them to make diagnostic claims.
- Use them only as background reference, and only after license review.

### Finding 3: No clear Thai dementia caregiver burden dataset was found in this pass.

The search found Thai ASR and dementia datasets separately, but not a clearly licensed Thai dementia caregiver burden dataset.

Product implication:

- Use seeded synthetic demo care logs based on the earlier research package.
- Treat the MVP as a workflow prototype, not trained clinical intelligence.

## Safety And Licensing Rules

- Do not use dementia diagnosis/classification datasets for product claims.
- Do not use unclear-license data in a public demo.
- Do not train on caregiver or patient data without consent.
- Do not use Thai elderly voice datasets without checking license, consent, and redistribution rules.
- Do not imply that an ASR model understands caregiver safety; ASR only converts speech to text.
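The rules above could be turned into a simple pre-use check. This is a hedged sketch only: the `Asset` fields, the `blockers` function, and the `ALLOWED_LICENSES` set are hypothetical names introduced for illustration, and which licenses are acceptable remains a project and legal decision rather than something this note settles.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical allow-list for illustration only.
ALLOWED_LICENSES = {"apache-2.0", "mit"}


@dataclass(frozen=True)
class Asset:
    name: str
    license: Optional[str]    # None when no license was visible on the card
    license_verified: bool    # manually confirmed, not just read from a scrape
    consent_reviewed: bool    # consent and redistribution terms were reviewed
    diagnosis_adjacent: bool  # diagnosis/classification asset


def blockers(asset: Asset) -> List[str]:
    """Return the reasons an asset must not be used; empty means no blocker found."""
    reasons = []
    if asset.diagnosis_adjacent:
        reasons.append("diagnosis/classification asset")
    if asset.license is None or asset.license.lower() not in ALLOWED_LICENSES:
        reasons.append("license unclear or not allow-listed")
    elif not asset.license_verified:
        reasons.append("license not manually verified")
    if not asset.consent_reviewed:
        reasons.append("consent not reviewed")
    return reasons


# Example: a dataset whose license was not visible in the scraped card.
print(blockers(Asset("MearaHe/dementiabank", None, False, False, False)))
```

An empty list only means none of these rules fired. It does not replace a manual license and consent review.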

## Decision

For v1:

> Use OpenAI APIs only. Do not depend on Hugging Face models or datasets.

For future research:

- Use Thai ASR assets to evaluate transcription quality.
- Search again for Thai caregiver burden datasets if the product moves beyond the hackathon MVP.
- Only consider fine-tuning after legal, safety, and source-quality review.

## Source Links

- https://huggingface.co/datasets/MearaHe/dementiabank
- https://huggingface.co/datasets/AzizSouiai/Alzheimer
- https://huggingface.co/datasets/SEACrowd/thai_elderly_speech
- https://huggingface.co/datasets/mcshao/EThai-ASR
- https://huggingface.co/collections/tawankri/thai-asr
- https://huggingface.co/nectec/Pathumma-whisper-th-large-v3
- https://huggingface.co/biodatlab/distill-whisper-th-large-v3
- https://huggingface.co/biodatlab/whisper-th-medium-combined
