Project Status

Yaduha · LLM-RBMT for Owens Valley Paiute

Fine-tuning a 3B open model to produce schema-constrained structured output that outperforms gpt-4o-mini on the cheat-proof comparator metric — a proof-of-concept for endangered-language translation via small, deployable local models.

At a glance:

- Fine-tune COMET_c (150-sentence held-out eval): 0.492
- gpt-4o-mini COMET_c (for comparison): 0.461
- Aggregate improvement over gpt-4o-mini: +7%
- Parse failures with unconstrained decoding: 0 / 150
- Total cost: ~$0.40 & ~1 GPU-hour, fine-tuning Qwen2.5-3B-Instruct

Overview

Yaduha is a type-safe framework for structured language translation: the LLM emits a Pydantic-modelled grammar tree (subject, verb, object, nominalizer, tense, proximity, etc.), and a deterministic renderer converts it to surface form. The lead target language is Owens Valley Paiute (OVP), a no-resource language with ~40 documented nouns and ~37 verbs in the current dictionary.
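To make the grammar-tree-plus-renderer split concrete, here is a minimal sketch. The field names (subject, verb, object, nominalizer, tense, proximity) come from the overview above; everything else is an illustrative stand-in: plain dataclasses instead of the project's Pydantic models, and a gloss string instead of real OVP surface form, which requires the project's dictionary and rendering rules.

```python
from dataclasses import dataclass
from typing import Optional

# Dependency-free stand-in for the project's Pydantic grammar tree.
# Field names follow the overview; the real schema is richer and validated.
@dataclass
class Sentence:
    subject: str
    verb: str
    tense: str
    object: Optional[str] = None
    nominalizer: Optional[str] = None
    proximity: Optional[str] = None

def render(s: Sentence) -> str:
    """Toy deterministic renderer: emits a gloss, not real OVP."""
    parts = [f"SUBJ={s.subject}"]
    if s.object:
        parts.append(f"OBJ={s.object}")
    parts.append(f"VERB={s.verb}.{s.tense}")
    if s.nominalizer:
        parts.append(f"NMLZ={s.nominalizer}")
    return " ".join(parts)

print(render(Sentence(subject="dog", verb="run", tense="past")))
# SUBJ=dog VERB=run.past
```

The point of the split is that the LLM only has to choose structure; the surface form is deterministic, so every rendered output is well-formed by construction.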

The experimental pipeline evaluates a forward model (English → structured JSON) against a fixed strong model (gpt-4o-mini) serving as the decoder. Scoring is done on two arms:

- the backwards arm (COMET_b), in which the strong decoder translates the structured JSON back into English, and
- the comparator arm (COMET_c), the cheat-proof metric that penalizes the decoder for simply reading English lemmas out of the structured JSON instead of relying on structural choices.

Methodology

Synthetic data generation

~$0.30 of gpt-4o-mini calls produces a ~4,400-pair SFT dataset covering the behaviours we want the small model to learn:

- single-clause structures, with out-of-vocabulary substitutions
- multi-clause records
- proper-noun coreference pairs

Training

LoRA fine-tune of Qwen/Qwen2.5-3B-Instruct (r=16, α=32, all linear projections, bf16) via HF TRL + PEFT — no Unsloth or bitsandbytes (which pin against older torch), so the pipeline runs cleanly on torch 2.11 + CUDA 13. 1 epoch, effective batch 8, cosine schedule, completion-only loss. Runs in ~45–60 min on a single RTX 5000 Ada.
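The completion-only loss mentioned above can be sketched in a few lines: prompt tokens are masked with the ignore index so cross-entropy is computed only on the structured-JSON completion. This is a dependency-free illustration of what TRL's completion-only collation does, with toy token IDs; the real pipeline operates on tokenizer output.

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores targets with this value

def completion_only_labels(prompt_ids, completion_ids):
    """Build labels so loss is computed only on the completion
    (the structured-JSON target), never on the English prompt."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

ids, labels = completion_only_labels([101, 7, 42], [9, 9, 2])
# labels: [-100, -100, -100, 9, 9, 2]
```

Masking the prompt keeps the model from wasting capacity on reproducing English inputs, which matters for a 3B model trained for a single epoch.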

Structured output is preserved, not replaced

Training shifts the model's preferred-token distribution toward schema-valid outputs without changing the inference stack: the EnglishToSentencesTool still passes response_format to the Pydantic SentenceList schema. The fine-tune makes the schema constraint a guardrail (near-no-op) rather than a forcing function — 97% of the time the raw generation already parses.
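The guardrail behaviour can be sketched as a parse-first, constrain-second loop. The `constrained_decode` callback and the required-key check below are illustrative stand-ins; in the project the validation step is Pydantic parsing against the `SentenceList` schema via `response_format`.

```python
import json

REQUIRED_KEYS = {"subject", "verb", "tense"}  # illustrative subset of the schema

def parse_or_fallback(raw: str, constrained_decode):
    """Try the raw generation first; with the fine-tune this succeeds
    ~97% of the time, so constrained decoding is a rarely-taken fallback."""
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            return obj, False          # raw output already schema-valid
    except json.JSONDecodeError:
        pass
    return constrained_decode(), True  # fall back to schema-constrained decoding

obj, fell_back = parse_or_fallback(
    '{"subject": "dog", "verb": "run", "tense": "past"}', lambda: {}
)
# fell_back is False: the raw generation parsed on the first try
```

Keeping the constraint in place costs nothing when the model already emits valid JSON, but guarantees well-formed output on the remaining ~3% of generations.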

Results

Evaluated on the full 150-sentence eval set (6 sentence types × 25 each). COMET scores are over the cheat-proof comparator arm (higher = better structural choices that preserve meaning without leaning on English passthrough).

Aggregate comparator scores (all sentence types)

Model                      Parse errs  COMET_b  COMET_c  Params
llama3.2:1b                9           0.787    0.306    1B
qwen2.5:3b (base)          15          0.771    0.430    3B
llama3.2:3b                5           0.861    0.413    3B
qwen2.5:7b                 3           0.834    0.430    7B
llama3.1:8b                9           0.821    0.389    8B
gpt-4o-mini                0           0.865    0.461    ~? (API)
ft-qwen2.5:3b (this work)  0           0.792    0.492    3B + LoRA
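The headline +7% follows directly from the comparator column of the table above:

```python
# Comparator COMET_c values from the aggregate table
ft, baseline = 0.492, 0.461

improvement = (ft - baseline) / baseline
print(f"{improvement:+.1%}")  # +6.7%, reported as +7% in the summary
```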

Per-type comparator COMET vs gpt-4o-mini

Sentence type        gpt-4o-mini  ft-qwen2.5:3b  Δ
subject-verb         0.709        0.710          ~tie
subject-verb-object  0.344        0.441          +28%
two-verb             0.533        0.593          +11%
two-clause           0.365        0.398          +9%
complex              0.340        0.370          +9%
nominalization       0.474        0.442          −7%
[Figure: COMET scores by sentence type, across all forward models and both arms. Per-type COMET medians, all models; the lower panel is the comparator (cheat-proof) arm.]

[Figure: Backwards-minus-comparator gap per model. The placeholder-cheating gap: a wider gap means the strong decoder is reading English lemmas out of the structured JSON; a narrower gap means the forward model is encoding meaning structurally, which is what we want.]

[Figure: Overall median scores per model and metric. Aggregate medians per model across BLEU, chrF, chrF++, and COMET.]
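The placeholder-cheating gap in the second figure can be recomputed from the aggregate table above (COMET_b minus COMET_c). A subset of models, values taken from the table:

```python
# (COMET_b, COMET_c) from the aggregate comparator table
scores = {
    "qwen2.5:3b (base)": (0.771, 0.430),
    "gpt-4o-mini":       (0.865, 0.461),
    "ft-qwen2.5:3b":     (0.792, 0.492),
}

gaps = {model: round(b - c, 3) for model, (b, c) in scores.items()}
# The fine-tune has the narrowest gap (0.300), i.e. the least
# reliance on the decoder reading English out of the JSON.
```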

Iteration arc

Each dataset revision targeted a specific failure class, and each one cut the parse-error count:

Version  What was added                                 Errors / 150  COMET_c
v1       single-clause structures + OOV subs            6             0.494
v2       + multi-clause records (fixed two-clause errs) 5             0.495
v3       + proper-noun coreference pairs                0             0.492

Caveats & honesty

Next steps

Reproduce

cd yaduha-ovp

# 1. Generate the training dataset (~$0.40 at gpt-4o-mini, ~10 min)
N_STRUCTURES=750 bash experiments/run_datagen.sh

# 2. Train the LoRA adapter (~45-60 min on a single RTX 5000 Ada)
uv run --project yaduha-ovp python experiments/finetune/scripts/train.py \
    --direction forward --epochs 1 --max-seq-length 2304

# 3. Evaluate on the 150-sentence held-out set (~8 min)
uv run --project yaduha-ovp python experiments/finetune/scripts/run_finetune_eval.py \
    --adapter experiments/finetune/adapters/qwen2.5-3b-instruct-forward \
    --tag ft-qwen2.5_3b__gpt-4o-mini

# 4. Score + regenerate plots
uv run --project yaduha-ovp python experiments/run_metrics.py \
    --input experiments/results/ft-qwen2.5_3b__gpt-4o-mini.jsonl
uv run --project yaduha-ovp python experiments/analyze.py

References