Some early results from playing around with search on a subsampled version of OBLIQ-bench.
Mixedbread's reranker is a beast of a model! It gets a pretty strong MRR on the task, and even edges out GPT 5.5 on some metrics while being significantly faster.
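For anyone unfamiliar with MRR (mean reciprocal rank): it averages 1/rank of the first relevant result across queries. A minimal sketch, assuming each query has a ranked list of doc ids and a set of relevant ids (the function name and data layout are mine for illustration, not the benchmark's own eval code):

```python
from typing import Iterable, List, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[List[str]],
    relevant: Sequence[Set[str]],
) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if none found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data: hit at rank 2, hit at rank 1, and a complete miss.
toy_rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
toy_relevant = [{"b"}, {"x"}, {"missing"}]
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevant)
```

On the toy data the reciprocal ranks are 1/2, 1, and 0, so a missed query drags the average down hard, which is part of why the metric is unforgiving on difficult benchmarks.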
I was mostly interested in more traditional retrieve-and-rerank setups here, since agentic approaches felt a little too obvious / boring. But I came away convinced that this benchmark is hard enough that (1) even today’s best embedding + reranker models will only get so far, and (2) higher cost / latency is tolerable.
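By retrieve-and-rerank I mean the usual two-stage shape: a cheap scorer narrows the corpus to a shortlist, then a more expensive scorer reorders only that shortlist. A rough sketch with toy stand-in scorers (the token-overlap retriever and the length-based "reranker" are placeholders I made up, not the actual models used here):

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def token_overlap(query: str, doc: str) -> float:
    """Toy first-stage score: fraction of query tokens present in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    retrieve_score: Scorer,
    rerank_score: Scorer,
    k_retrieve: int = 10,
    k_final: int = 3,
) -> List[str]:
    # Stage 1: cheap scoring over the whole corpus, keep a shortlist.
    shortlist = sorted(
        corpus, key=lambda doc: retrieve_score(query, doc), reverse=True
    )[:k_retrieve]
    # Stage 2: expensive scoring over the shortlist only.
    reranked = sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:k_final]


corpus = ["the cat sat", "cat food reviews for picky cats", "dog training tips"]
# Placeholder reranker that prefers shorter docs; swap in a real model's score.
prefer_short: Scorer = lambda query, doc: -float(len(doc))
top = retrieve_then_rerank(
    "cat food", corpus, token_overlap, prefer_short, k_retrieve=2, k_final=1
)
```

The cost / latency knob is mostly `k_retrieve`: a bigger shortlist gives the reranker more chances to rescue a relevant doc, at the price of more second-stage calls per query.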
Going to see how far I can push the cost / latency vs quality tradeoff here.