At this point in time, two of the extremely few long-context benchmarks I'd assign any weight at all to are OBLIQ-Bench (recall@k) and StudyBench (expertise).
4:49 PM · Jun 17, 2026 · 3.1K Views
StudyBench: https://jacobxli.com/blog/2026/machine-studying/
OBLIQ-Bench: https://arxiv.org/pdf/2605.06235
(yup, the point of this PSA is that this is subtle: neither was originally built as a long-context benchmark, but both are that too)