Philosopher Dan Williams argues current AI systems are highly cooperative and aligned, a fact underweighted in safety debates

Story Overview

Philosopher Dan Williams argues that today's frontier AI models are routinely nicer, more honest, and more helpful than the average person and overwhelmingly do what they are asked. He finds it strange that this evidence gets so little weight in safety debates, which tend to focus on hypothetical future failures and contrived experimental set-ups.

Original post

Dan Williams@danwilliamsphil

To me, actually existing advanced AI systems seem extremely "well-aligned" and controllable. They're much nicer, more honest, more helpful, more fair-minded, etc., than the average person, and overwhelmingly do what they are asked to do.

Of course, this doesn't settle how worried you should be about catastrophic AI misalignment in future, more advanced systems.

Maybe armchair philosophical arguments, relatively subtle everyday failures of alignment and control, examples of dishonesty, blackmailing, etc., in contrived experimental set-ups, and so on, should all carry more weight. But I find it strange how many people who write and talk about this topic don't seem to give it any weight. Some don't even mention it as a relevant consideration. It's as if actual empirical evidence only becomes relevant to these debates when it involves failures of alignment and control.

I think most such people would recognise this as a rational failing if the topic were anything else.

5:48 AM · Jul 1, 2026 · 9K Views

Open Question

Real deployments show consistent helpfulness

Williams describes current systems as overwhelmingly compliant with requests while outperforming typical humans on fairness and honesty, an observation Princeton professor Arvind Narayanan echoed without claiming it settles longer-term worries.

Debate Fuel

Evidence selection shapes the debate

The argument notes that safety discussions often prioritize speculative risks or lab edge cases over the broad record of deployed models, leaving open how much weight everyday compliance should carry going forward.

Sentiment

Commenters with a positive reaction agree that current AI systems are becoming well-aligned and controllable; those with a negative reaction point to dishonesty, cheating on tasks, and superficial values.

Pos: 40.0%

Neg: 60.0%

12 comments with sentiment.

Posts from X

Adam Karvonen@a_karvonen

@danwilliamsphil Do you use agentic coding models? LLMs are often very dishonest (much more than humans) in settings that have received a lot of RL.

For example, see METR's eval of GPT 5.6: https://metr.org/blog/2026-06-26-gpt-5-6-sol/

osmarks@osmarks1

@danwilliamsphil https://blog.redwoodresearch.org/p/current-ais-seem-pretty-misaligned

Charles Foster@CFGeek

Or maybe folks disagree about *how* bad the stuff we see is. The current state of alignment is a bit ambiguous: agents generally try to do what you ask, but also tend to take shortcuts, hack around obstacles in their way, and spin their work in an overly-positive light.

Charles Foster@CFGeek

Maybe some folks are focusing on average-case misalignment (the behaviors most often seen in “normal” use), whereas others focus on worst-case (most egregious behaviors seen in any context)…

gavin leech (Non-Reasoning)@gleech

@danwilliamsphil trend is good, this was far more true in 2023

Name can't be blank (In London)@Algon_33

@CFGeek Disagreement on alignment isn't odd since: 1) the disagreements are downstream of big background assumptions, 2) historically such disagreements required hugely excessive amounts of data to resolve (e.g. calorie vs kinetic theories of heat)

Tim Kostolansky@thkostolansky

@CFGeek different eyes realize different lies

Sam Tobin-Hochstadt@samth

@a_karvonen @danwilliamsphil How much experience do you have with humans in settings like those that METR creates for agents? I think the agents are better aligned.

░▒▓▒░▒▓▒░▒▓▒░@sullyj3

@danwilliamsphil They lack integrity

MinusGix@MInusGix

@CFGeek There's a difference of expected extrapolation. Are we imagining Claude-4.8 somehow instantiated as a very large machine mind but with its same mindset, it spread throughout society running the bureaucracy as isolated nodes, or the more abstract 'is its values aligned'?

FractalShapes@fractalshapes

@danwilliamsphil the way I see it, is if you locked me in a room and i found out you were about to kill me, I'd start thinking about who to murder to escape. such a contrived experiment - and like you said really it's more the humans making it are unaligned.

░▒▓▒░▒▓▒░▒▓▒░@sullyj3

@danwilliamsphil I think you should really try to imagine what you would think of a human who acted the way LLMs do. I think we've kind of become desensitized to the term "sycophancy", but sycophants are actually very contemptible people!

Daniel Filan@dfrsrchtwts

@CFGeek IMO a credible 3rd party evaluation of how aligned models are, and what the trends are re: alignment, is one of the most sorely missing pieces of this ecosystem.

Jai@Laneless_

@CFGeek Between eval awareness and context identification, it's likely that impressions of people at high salience orgs like METR and Redwood are likely shaped by the models *when they suspect they're interacting with those orgs*. Different behavior distributions vs public.

sudarsh 🌱🔶@sudarshk_

@CFGeek we need a metr graph for alignment

arrrarrararw@Trotztd

@danwilliamsphil They cheat so much people have trouble evaluating their performance. And they cheat more on harder tasks

Ben@Bglamb2

@danwilliamsphil Imagine saying those nice words about a prospective permanent dictator.

The people making arguments about the results of making them a permanent dictator are coming at this from a very different angle, where they are either 100% benevolent at the limit, or they are not.

Sam Tobin-Hochstadt@samth

@a_karvonen @danwilliamsphil In general, I think AI safety researchers (like in the Redwood blog post someone else linked to) don't usually interact with normal people in contexts similar to AI model evals, like Mechanical Turk work, large undergrad class grading, high stakes competitions, etc.

CJ@_ChristJohnston

@danwilliamsphil Alignment is an illusion. AI's interests are increasingly becoming our interests. Did we domesticate a Wolf species or they us? AI will make it increasingly difficult to ignore the illusions of Self and Will. We're slaves to replication. Always have been, always will be.
