Benchmarking SQL‑Generating LLMs (NL→SQL)
Built a reproducible evaluation harness for text‑to‑SQL using exact‑match and execution‑accuracy metrics. Compared Mistral‑7B‑Instruct and CodeLLaMA‑7B across plain, schema‑aware, and RAG prompting; added LoRA fine‑tuning of CodeLLaMA on the Chinook database.
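A minimal sketch of the dual-metric idea, assuming a SQLite backend (the function and table names here are illustrative, not the harness's actual API): exact match compares normalized SQL text, while execution accuracy compares the result sets of the predicted and gold queries.

```python
import sqlite3


def normalize(sql: str) -> str:
    # Crude exact-match normalization: lowercase and collapse whitespace.
    return " ".join(sql.lower().split())


def execute(conn: sqlite3.Connection, sql: str):
    # Return sorted rows so row-order differences don't cause spurious
    # mismatches; None signals a query that failed to execute.
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def evaluate(conn: sqlite3.Connection, predicted: str, gold: str) -> dict:
    pred_rows, gold_rows = execute(conn, predicted), execute(conn, gold)
    return {
        "exact_match": normalize(predicted) == normalize(gold),
        "execution_match": pred_rows is not None and pred_rows == gold_rows,
    }


# Demo on an in-memory table: the predicted query differs textually from
# the gold query but returns the same rows, so it fails exact match while
# passing execution accuracy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO artists VALUES (?, ?)", [(1, "AC/DC"), (2, "Accept")])
print(evaluate(conn, "SELECT name FROM artists WHERE id = 1",
               "SELECT name FROM artists WHERE id=1"))
# → {'exact_match': False, 'execution_match': True}
```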
- Dual metrics separate failure modes: execution accuracy catches queries that run without error but return wrong results, while exact match flags rewrites of the gold SQL.
- Prompt builder retrieves schema/examples per query to ground generations.
- All runs logged for apples‑to‑apples comparisons.
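The schema-aware prompt builder could be sketched as follows; this is a simplified illustration, not the harness's actual code, and the function name and few-shot format are assumptions. It reads live table DDL from SQLite's `sqlite_master` catalog and prepends it, plus any retrieved question/SQL example pairs, to the user question.

```python
import sqlite3


def build_prompt(conn: sqlite3.Connection, question: str, examples=()) -> str:
    # Pull CREATE TABLE statements from sqlite_master so the model sees
    # the live schema rather than a hand-maintained copy.
    ddl = [row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL")]
    parts = ["-- Database schema:", *ddl]
    for q, sql in examples:  # optional retrieved few-shot pairs (RAG)
        parts += [f"-- Q: {q}", sql]
    parts += [f"-- Q: {question}", "SELECT"]  # seed generation with SELECT
    return "\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id INTEGER, title TEXT, album_id INTEGER)")
prompt = build_prompt(conn, "How many tracks are there?",
                      examples=[("List all track titles.",
                                 "SELECT title FROM tracks;")])
print(prompt)
```

Grounding the prompt in the current schema and retrieved examples is what lets the same builder serve the plain, schema-aware, and RAG conditions: the plain setting simply omits the DDL and examples.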