How do we evaluate models ourselves?
How do we evaluate local models ourselves? I'm seeing inconsistencies in code outputs across different models of the same size, relative to what the benchmarks suggest. Some of these models should be better than others, which is fine, but even the ones that score very similarly on benchmarks produce code that isn't syntactically correct.
I'm curious how other local AI engineers evaluate models in their own settings and then decide which model is the best one to use locally.
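For context, here's a minimal sketch of the kind of spot-check I have in mind: run the same prompts against each model and count how many replies even parse as valid Python. This assumes an OpenAI-compatible local endpoint (llama.cpp's server or Ollama expose one); the URL, model names, and prompts below are placeholders, not a real eval suite.

```python
# Minimal sketch: compare local models on syntax validity of generated Python.
# Assumes an OpenAI-compatible chat endpoint (llama.cpp server, Ollama, etc.).
import ast
import json
import re
import urllib.request

BASE_URL = "http://localhost:11434/v1/chat/completions"  # assumption: Ollama's default port
MODELS = ["model-a-7b", "model-b-7b"]                     # placeholder model names
PROMPTS = [
    "Write a Python function that reverses a linked list.",
    "Write a Python function that merges two sorted lists.",
]

def generate(model: str, prompt: str) -> str:
    """Call the local endpoint and return the raw completion text."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def extract_code(text: str) -> str:
    """Pull the first fenced code block, or fall back to the whole reply."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def is_valid_python(code: str) -> bool:
    """Syntax check only -- says nothing about correctness, just that it parses."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

for model in MODELS:
    passed = sum(
        is_valid_python(extract_code(generate(model, p))) for p in PROMPTS
    )
    print(f"{model}: {passed}/{len(PROMPTS)} syntactically valid")
```

That only catches the obvious failures, though, so I'd like to hear what people actually use beyond this.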