Project Status

Yaduha · LLM-RBMT for Owens Valley Paiute

Fine-tuning a 3B open model to produce schema-constrained structured output that outperforms gpt-4o-mini on the cheat-proof comparator metric — a proof-of-concept for endangered-language translation via small, deployable local models.

Qwen2.5-3B-Instruct · ~$0.40 & 1 GPU-hour

0.492    Fine-tune COMET_c (150-sentence held-out eval)
0.461    gpt-4o-mini COMET_c (for comparison)
+7%      Aggregate improvement over gpt-4o-mini
0 / 150  Parse failures with unconstrained decoding

Overview

Yaduha is a type-safe framework for structured language translation: the LLM emits a Pydantic-modelled grammar tree (subject, verb, object, nominalizer, tense, proximity, etc.), and a deterministic renderer converts it to surface form. The lead target language is Owens Valley Paiute (OVP), a no-resource language with ~40 documented nouns and ~37 verbs in the current dictionary.
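To make the division of labour concrete, here is a minimal sketch of the idea, using a stdlib dataclass to stand in for the real Pydantic models (field names are illustrative; the actual SentenceList schema carries more fields such as nominalizer and proximity, and the real renderer targets OVP morphology rather than an English gloss):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToySentence:
    """Toy stand-in for one node of the Pydantic grammar tree."""
    subject: str
    verb: str
    object: Optional[str] = None
    tense: str = "present"

def render(s: ToySentence) -> str:
    """Deterministic renderer: structured record -> surface form."""
    parts = [s.subject, s.verb]
    if s.object is not None:
        parts.append(s.object)
    return " ".join(parts)

print(render(ToySentence(subject="the dog", verb="runs")))  # the dog runs
```

The point of the split is that the LLM only ever chooses structure; every surface decision is deterministic and auditable.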

The experimental pipeline evaluates a forward model (English → structured JSON) against a fixed strong model (gpt-4o-mini) serving as the decoder. Scoring is done on two arms: a backwards arm (COMET_b), in which the strong decoder renders the structured JSON back into English, and a cheat-proof comparator arm (COMET_c), which prevents the forward model from scoring well simply by passing English lemmas through the structure.

Methodology

Synthetic data generation

~$0.30 of gpt-4o-mini usage produces a ~4,400-pair SFT dataset covering the behaviours we want the small model to learn: single- and multi-clause structures, OOV substitutions, and proper-noun coreference patterns.
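A hedged sketch of the shape of one datagen step (function names here are illustrative, not the real yaduha API; in the actual pipeline the English side is rendered by gpt-4o-mini rather than a toy template):

```python
import random

def toy_render_english(record):
    # Stand-in for the strong-LLM rendering step.
    return f"{record['subject']} {record['verb']}"

def sample_iter(n, seed=0):
    # Stand-in for sampling from the registered Sentence classes.
    rng = random.Random(seed)
    subjects, verbs = ["the dog", "a child"], ["runs", "sleeps"]
    for _ in range(n):
        yield {"subject": rng.choice(subjects), "verb": rng.choice(verbs)}

def make_sft_pairs(n):
    # One SFT pair = (canonical English prompt, structured completion).
    return [{"prompt": toy_render_english(r), "completion": r}
            for r in sample_iter(n)]

pairs = make_sft_pairs(4)
```

Because both sides of each pair are derived from the same sampled structure, the dataset is self-labelling: no human annotation is needed.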

Training

LoRA fine-tune of Qwen/Qwen2.5-3B-Instruct (r=16, α=32, all linear projections, bf16) via HF TRL + PEFT; no Unsloth or bitsandbytes (both pin older torch versions), so the pipeline runs cleanly on torch 2.11 + CUDA 13. One epoch, effective batch size 8, cosine schedule, completion-only loss. Runs in ~45–60 min on a single RTX 5000 Ada.
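As a rough configuration sketch, the setup described above maps onto peft/trl approximately as follows (values taken from the text; the real train.py may name arguments differently):

```python
from peft import LoraConfig
from trl import SFTConfig

# Sketch only, not the project's actual training script.
peft_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # all linear projections
    task_type="CAUSAL_LM",
)
train_cfg = SFTConfig(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch 8
    lr_scheduler_type="cosine",
    bf16=True,
)
```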

Structured output is preserved, not replaced

Training shifts the model's preferred-token distribution toward schema-valid outputs without changing the inference stack: the EnglishToSentencesTool still passes the Pydantic SentenceList schema via response_format. The fine-tune turns the schema constraint into a guardrail (a near-no-op) rather than a forcing function: 97% of the time the raw generation already parses.
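The guardrail behaviour can be sketched as a parse-first fallback (a hypothetical helper, not the project's actual code; the validate and constrained-decode callables stand in for the schema check and the constrained inference path):

```python
import json

def parse_or_constrain(raw: str, validate, constrained_decode):
    """Try the raw generation first; fall back to constrained decoding
    only when it fails to parse (the rare ~3% case after fine-tuning)."""
    try:
        return validate(json.loads(raw))
    except (json.JSONDecodeError, TypeError, ValueError):
        return constrained_decode()

# Toy stand-ins for schema validation and constrained decoding:
ok = parse_or_constrain('{"subject": "dog"}', dict, lambda: "constrained")
bad = parse_or_constrain("not json", dict, lambda: "constrained")
```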

Language-agnostic by design

None of the six datagen steps knows it is dealing with OVP. They call sample_iter() on whatever Sentence classes are registered, render canonical English via the strong LLM, and synthesize transforms, OOV substitutions, and coreference patterns entirely on the English side. The OOV-masking hook has been lifted into yaduha core as Sentence.masked_copy(); a new language package only has to implement that (plus the standard __str__ and get_examples) to plug into the same run_datagen.sh → train.py → run_finetune_eval.py pipeline. The generalization claim we'd like to prove next: this recipe produces a usable translator for any Yaduha-compatible target language.
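A sketch of the contract a new language package would implement (the three method names come from the text above; the signatures and the placeholder token are assumptions):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace

class Sentence(ABC):
    @abstractmethod
    def __str__(self) -> str:
        """Render the structured record to surface form."""

    @abstractmethod
    def masked_copy(self) -> "Sentence":
        """Return a copy with OOV lemmas replaced by placeholders."""

    @classmethod
    @abstractmethod
    def get_examples(cls) -> list:
        """Few-shot examples for the forward prompt."""

# Toy implementation showing the shape of a language package.
@dataclass
class ToySentence(Sentence):
    subject: str
    verb: str

    def __str__(self) -> str:
        return f"{self.subject} {self.verb}"

    def masked_copy(self) -> "ToySentence":
        return replace(self, subject="[NOUN]")  # illustrative placeholder

    @classmethod
    def get_examples(cls) -> list:
        return [cls("the dog", "runs")]
```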

Results

Evaluated on the full 150-sentence eval set (6 sentence types × 25 each). COMET scores are over the cheat-proof comparator arm (higher = better structural choices that preserve meaning without leaning on English passthrough).

Aggregate comparator scores (all sentence types)

Model                       Parse errs   COMET_b   COMET_c   Params
llama3.2:1b                 9            0.787     0.306     1B
qwen2.5:3b (base)           15           0.771     0.430     3B
llama3.2:3b                 5            0.861     0.413     3B
qwen2.5:7b                  3            0.834     0.430     7B
llama3.1:8b                 9            0.821     0.389     8B
gpt-4o-mini                 0            0.865     0.461     ? (API)
ft-qwen2.5:3b (this work)   0            0.792     0.492     3B + LoRA

Per-type comparator COMET vs gpt-4o-mini

Sentence type          gpt-4o-mini   ft-qwen2.5:3b   Δ
subject-verb           0.709         0.710           ~tie
subject-verb-object    0.344         0.441           +28%
two-verb               0.533         0.593           +11%
two-clause             0.365         0.398           +9%
complex                0.340         0.370           +9%
nominalization         0.474         0.442           −7%
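The Δ column is just the relative change between the two COMET_c columns, rounded to whole percent:

```python
# COMET_c per sentence type: (gpt-4o-mini, ft-qwen2.5:3b), from the table.
per_type = {
    "subject-verb": (0.709, 0.710),
    "subject-verb-object": (0.344, 0.441),
    "two-verb": (0.533, 0.593),
    "two-clause": (0.365, 0.398),
    "complex": (0.340, 0.370),
    "nominalization": (0.474, 0.442),
}
delta_pct = {t: round(100 * (ft - gpt) / gpt) for t, (gpt, ft) in per_type.items()}
```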
Figure: COMET scores by sentence type, across all forward models and both arms. Per-type COMET medians, all models; the lower panel is the comparator (cheat-proof) arm.
Figure: Backwards-minus-comparator gap per model (the placeholder-cheating gap). A wider gap means the strong decoder is reading English lemmas out of the structured JSON; a narrower gap means the forward model is encoding meaning structurally, which is what we want.
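Using the aggregate table above, the gap is simply COMET_b − COMET_c per model, and the fine-tuned model has the narrowest gap:

```python
# (COMET_b, COMET_c) per model, from the aggregate table.
scores = {
    "llama3.2:1b": (0.787, 0.306),
    "qwen2.5:3b (base)": (0.771, 0.430),
    "llama3.2:3b": (0.861, 0.413),
    "qwen2.5:7b": (0.834, 0.430),
    "llama3.1:8b": (0.821, 0.389),
    "gpt-4o-mini": (0.865, 0.461),
    "ft-qwen2.5:3b": (0.792, 0.492),
}
gaps = {m: round(b - c, 3) for m, (b, c) in scores.items()}
narrowest = min(gaps, key=gaps.get)  # the fine-tune, at 0.300
```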
Figure: Overall median scores per model and metric. Aggregate medians per model across BLEU, chrF, chrF++, and COMET.

Iteration arc

Each dataset revision targeted a specific failure class; each delivered:

Version   What was added                                   Errors / 150   COMET_c
v1        single-clause structures + OOV subs              6              0.494
v2        + multi-clause records (fixed two-clause errs)   5              0.495
v3        + proper-noun coreference pairs                  0              0.492

Caveats & honesty

Next steps

Reproduce

Full instructions (hardware, env, per-stage commands, expected COMET targets) live in yaduha-ovp/REPRODUCE.md. Quick tour:

cd yaduha-ovp

# Sanity check that the pipeline wires up (~2 min, <$0.05) — no training.
bash experiments/reproduce_smoke.sh

# Full reproduction: ~10 min datagen + ~45-60 min training + ~8 min eval.
N_STRUCTURES=750 SEED=0 bash experiments/run_datagen.sh
uv run python experiments/finetune/scripts/train.py \
    --direction forward --epochs 1 --max-seq-length 2304 --seed 42
uv run python experiments/finetune/scripts/run_finetune_eval.py \
    --adapter experiments/finetune/adapters/qwen2.5-3b-instruct-forward \
    --tag ft-qwen2.5_3b__gpt-4o-mini
uv run python experiments/run_metrics.py \
    --input experiments/results/ft-qwen2.5_3b__gpt-4o-mini.jsonl
uv run python experiments/analyze.py

References