🚨 Excited to share Pragmatic Reasoning via Self-Training (PragReST), a method for LLM self-improvement on pragmatic reasoning. PragReST improves Qwen3-8B and Qwen3-14B by +5.37% and +5.50%, respectively, across pragmatics benchmarks with no human annotations or teacher models.
LLMs still struggle w/ pragmatics: understanding what a speaker means, not just what they literally said. They often default to literal interpretations and miss implicature, intent, or context-dependent meaning. To close this gap, we started with a key question: Can we treat pragmatics as an LLM reasoning task?
➡️ Following a long line of work in pragmatics (e.g., RSA, IBR), PragReST treats pragmatic understanding as counterfactual reasoning. Instead of teaching the model to ask “is this interpretation compatible with the words?”, we teach it to reason about questions like “if the speaker had meant something else, what would they have said instead?”
➡️ PragReST is self-improving: it self-generates pragmatic QA data, self-filters noisy examples, learns counterfactual reasoning traces via SFT, and further improves with GRPO using a self-judged correctness reward.
➡️ Error analysis shows that gains correlate with increased counterfactual reasoning. This suggests PragReST’s improvements are tied to reasoning over communicative alternatives, rather than simply more pragmatic data or more training.
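For readers new to RSA, here is a minimal sketch of the counterfactual step on the classic "some" vs. "all" example. It is textbook RSA with a uniform prior and rationality alpha = 1, not the PragReST pipeline; the worlds, utterances, and function names are illustrative assumptions.

```python
# Toy RSA model (textbook version, NOT the PragReST training code).
# Assumptions: two worlds, two utterances, uniform prior over worlds.
WORLDS = ["some_but_not_all", "all"]
UTTERANCES = ["some", "all"]

# Literal truth conditions: "some" is literally true in both worlds.
TRUE_IN = {
    "some": {"some_but_not_all", "all"},
    "all": {"all"},
}


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    """L0: uniform belief over the worlds where the utterance is literally true."""
    return normalize(
        {w: 1.0 if w in TRUE_IN[utterance] else 0.0 for w in WORLDS}
    )


def speaker(world, alpha=1.0):
    """S1: prefers utterances that lead a literal listener to the intended world."""
    return normalize(
        {u: literal_listener(u)[world] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """L1: asks which world would have made the speaker choose this utterance."""
    return normalize({w: speaker(w, alpha)[utterance] for w in WORLDS})


if __name__ == "__main__":
    print("literal:  ", literal_listener("some"))
    print("pragmatic:", pragmatic_listener("some"))
```

Under these assumptions, hearing "some" moves the listener from 0.5 to 0.75 on "some but not all": a speaker who meant "all" would more likely have said "all".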
🧵👇

