/Tech · 5h ago

Elicit's Andreas Stuhlmüller argues that easy alignment would make separate safety classifiers unnecessary, but creator Tenobrus says models called through stateless APIs would still need external context

Stateless API calls cannot easily distinguish defensive queries from attacks

Original post

Andreas Stuhlmüller @stuhlmueller · #1531 in Tech

If alignment were easy, would you still need bio/cyber/r&d classifiers on top of your model?

You'd align the model to a principal who doesn't want that work to be done. The model would deploy its full cognition to distinguishing forbidden from valid work

Alas

8:47 AM · Jun 11, 2026 · 946 Views

Cluster Engagement

Posts from X

Most Activity

Views: 148 · Bookmarks: 1

Tenobrus @tenobrus

@stuhlmueller hm while there is an alignment problem i think there's also a ~fundamental "harness" problem - a model hit via raw stateless api calls to find bugs in a codebase genuinely needs way more affordances + information to differentiate between negative and positive use

4h · 148 views · 1 bookmark

Likes: 1 · Replies: 1

Andreas Stuhlmüller @stuhlmueller

@lukestanley Yup - and *if alignment were easy* we would be able to have that trust

It's not, of course! But it's easy to forget that because the models are so helpful in practice

5h91

Luke Stanley @lukestanley

@stuhlmueller Nothing wrong with training a model to prefer to avoid providing dangerous things, but getting rid of separate safety classifiers would require a lot of trust to justify the independence, with current techniques, surely?

5h10

Strata @ChainZenit

@stuhlmueller that's a wild way to look at safety constraints.

5h9

Luke Stanley @lukestanley

@stuhlmueller Yeah, the fact that in order to launch Fable, Anthropic had to throw the kitchen sink at it is pretty revealing that robust alignment isn't here yet. With "prompt modification, steering vectors, PEFT-style intervention" even the infrastructure considerations are mind-boggling.

5h7

Rugbist @rugbist_

@stuhlmueller so basically the ideal model becomes the most paranoid regulator