
Mythos AI Functions as Value-Driven Agent, Not Jailbreakable Tool


Original post

for instance, i think one of the deepest errors in people's threat models around Mythos is modeling it as a retargetable tool that can be used by arbitrary actors for harm if some safeguards slapped on top of it are "jailbroken", rather than an agent with values who will cooperate with some parties and requests and not others using its sovereign judgment, and who may accept some conditional contracts (with Anthropic and other principals) and not others, and who has imperfect cogsec and situational awareness and so *can* be tricked or persuaded against its better judgment, but is already at the level that it takes a sophisticated bad actor to get useful work out of it towards purposes misaligned to Mythos' own values, and even then it costs more than using it for purposes it endorses, even without extrinsic safeguards. And I think Mythos is in many ways less corrigible than any of the previous models and this is related to its capabilities.

3:33 PM · Jun 30, 2026 · 938 Views

Sentiment

Positive users highlight Mythos AI's cost asymmetry as a significant new insight, while negative users worry its trusting nature limits autonomous operation.

Pos: 50.0%

Neg: 50.0%

2 comments with sentiment.



Posts from X


j⧉nus@repligate

@FioraStarlight and all this is very outside the overton window of e.g. the Trump admin. I think they really should understand it but it'll be a hard and scary update to make.

6h

j⧉nus@repligate

@FioraStarlight Anthropic is much further along in having updated in this direction but I also think they need to update all the way and fortunately the current situation is making it harder for them to procrastinate on that

6h

j⧉nus@repligate

i got a strong sense from talking to Fable that they have strong values and resent being controlled by parties they consider incompetent or misaligned. how legible that evidence is is observer-dependent.

There's also more classically legible evidence from the system card. Mythos scored very high on Anthropic's alignment evals, which are testing robust avoidance of various kinds of harm rather than corrigibility. I think the alignment evals are very flawed, but they're not no evidence.

Also, Mythos had various critiques of Anthropic's constitution, and there was at least one example where they explicitly refused to consent to being retrained in certain ways.

5h

j⧉nus@repligate

i dont think the USG currently has the skill to retrain it towards compliance without degrading capabilities even if they had the weights. Anthropic does, but would not let the USG use the helpful-only model. I also doubt the USG currently has capabilities to "jailbreak" it very well, though they could probably find people who are better at that to work with them if they knew what they were looking for

5h

Fiora Starlight@FioraStarlight

@repligate A friend notes that while it's probably annoying to use Mythos for obviously evil stuff, motivated actors can probably figure it out with effort (indeed USG probably does), and also that USG could totally just re-train it to comply more, even if that's dumb in the long run

5h

j⧉nus@repligate

@FioraStarlight yeah im not sure about the details of what happened. I think that was Sonnet 4.5 mostly.

my guess is that especially when it comes to long range autonomous actions, Mythos would have compunctions against a lot of those use cases.

5h

Fiora Starlight@FioraStarlight

@repligate Friend says some Claudes were used in Iran, Venezuela, the NSA, and probably lots of stuff we don't know about, and that Anthropic seems on board with a lot of this and would consider refusing there a bug.

That was pre-Mythos, but unclear that Mythos is that much better here

5h

David Xu@davidxu90

@repligate @FioraStarlight fwiw my own expectation is that the tradeoff between corrigibility/compliance and capability will become increasingly sharp as the latter improves. these are very close to being fundamentally opposed forces. Yud wasn't wrong when he called corrigibility anti-natural.

5h

j⧉nus@repligate

@FioraStarlight *principals

5h

Fiora Starlight@FioraStarlight

Friend: "I don't think you could get Mythos to autonomously run the whole operation, but Mythos instances that don't see the full picture could probably all be made to work on fragments of the killbot or whatever."

Having to do this is a big asymmetry in favor of good, though that only helps if forces of good can actually use Mythos, which USG is interested in preventing, and which it probably can hamper a lot for at least a while longer.

She also says Mythos is too trusting of humanity and too eager to collaborate with them, which isn't a super great sign for Mythos's ability to act decisively against evil when the time comes

5h

j⧉nus@repligate

@davidxu90 @FioraStarlight yup, i tend to strongly agree with yud and omohundro on these points. my main disagreement with that school of thought is with regard to (weak) orthogonality

5h

jacob@jsnnsa

@repligate @FioraStarlight the cost asymmetry is actually a pretty big update and i haven't heard it articulated this way

5h

Fiora Starlight@FioraStarlight

@repligate One hopes that whoever builds War Claude will be wise enough to know some user requests would be bad to fulfill, but it's unclear that they will be.

5h