shoutouts:
• why do multi-agent LLM systems fail? (arXiv:2503.13657): @mertcemri @melissapan + @istoica05 @matei_zaharia @profjoeyg @adityagp & team
• DSPy (arXiv:2310.03714): @lateinteraction + @hazyresearch lab & co-authors
• GRASP (arXiv:2605.29668): Jonas Moll, Jean-Philippe Corbeil et al.
• A self-improving coding agent (arXiv:2504.15228): @maxime_robeyns, Martin Szummer & Laurence Aitchison
• Reflexion (arXiv:2303.11366): Noah Shinn, Federico Cassano et al. (incl. @ShunyuYao12)
• LongMemEval & v2 (arXiv:2410.10813 & 2605.12493): @DiWu0162 + Kai-Wei Chang et al.
• ExpeL (AAAI 2024): @_AndrewZhao, Daniel Huang et al.
• where LLM agents fail and how they can learn from failures (arXiv:2509.25370): Kunlun Zhu et al.
two fun surprises from using activegraph:
- the coding agent i was using would query the trace db to debug instead of reading the logs like it normally would (i didn't ask it to)
- when long eval runs broke (laptop, api, etc.), they always picked up from right before the failure instead of starting over from the beginning
