Shreya Shankar, a databases and HCI researcher, finds AI agents converge on superficial interpretations and fail to adapt when given gradual human feedback on qualitative tasks such as tweet sensemaking

Views: 45.8K · Bookmarks: 260 · Likes: 270 · Retweets: 31 · Replies: 20

i'm restarting my blog! i want to kickstart productive conversations around: what should AI agents look like for hard, subjective knowledge work?

a lot of agent setups work well when tasks are objective and easy to verify. but many workflows (e.g., qualitative analysis, strategy, sensemaking) are messy and interpretive.

as a first post, i explore different ways of doing agent-assisted qualitative analysis on tweets, with varying levels of human feedback/intervention.

tldr: they all kinda sucked. turns out it’s hard to:
(a) stop agents from converging too quickly on shallow interpretations
(b) get agents to adapt to preferences that emerge gradually across many turns (i.e., evolving context)
(c) capture human judgment without making humans fatigued

Hamel Husain@HamelHusain

The experiments conducted in this post illustrate how early we are as an industry on eval tooling.

Some takeaways and related thoughts:

1. Naively applying automation (which many current frameworks do) is likely to fail.

2. It's easy to get fooled that automation (esp overzealous automation) is giving you valuable insights. Stay skeptical at all times!

3. We have to design eval workflows so human-in-the-loop accelerates effort while helping you externalize what "good looks like"

4. Qualitative analysis hasn't sufficiently made its way into eval tooling as much as it should. There are opportunities to design better automation here. (QA is super underrated for evals btw)

Francis Jervis, PhD@f_j_j_

Built my own thematic coding engine for @deuteroai and... I have several issues with the conclusion "Agents don’t understand what qualitative analysis is."

- This isn't Grounded Theory (a methodology properly so called), it's inductive thematic coding (a data analysis technique used in, but not confined to, GT) for content analysis, which is fine in itself;
- On its own terms, this is far from a reference implementation of GT ("identify the CORE category that integrates everything" sticks out - presuming the data is grand narrative-shaped is problematic);
- The data is low quality - like 1/4 non-responsive tweets (chit-chat, spam) which should have been pre-filtered before coding & it would have been good to reconstruct dialogue sequences more explicitly (see next point);
- Doing this with no harness code, combined with the somewhat non-linear tasking in the prompts, is "hard mode" for the model, and my intuition is this kind of all-tokens workflow is more prone to collapsing into a local maximum - eg the early halting problem is solved in 15 vibe coded LoC;
- GT isn't done in a "clean room" - I am also not surprised not providing the research context (top-line questions/objectives) led to generic outputs. There is "bracketing" (see Interpretative Phenomenological Analysis, GT's more hermeneutically-inclined cousin) but it has obvious limits;
- Reasoning changes everything for this task (at obvious token cost but deepseek-v4-pro is excellent and sub-$1/Mt), Sonnet is both overkill and underperforming here;
- With more robust harnessing (including a proper DB) the human validation could have a much better UI, "example tweets under each category and provenance from category back to evidence" etc. (of course, UI deficiency is acknowledged but I didn't see the conclusion that this is a lack of harness engineering problem);
- Doing NLP things like combining k-means clustering with LLM-based coding is also highly effective (and I would lean towards it with a dataset like this), either on its own or as a step in ITC;
- "What happens when the documents are long, like interview transcripts" made me lol, obviously ;) IME (as both ethnographer and builder) this is upside down - interview transcripts are much easier to work with, tweets are inherently challenging *bc* they're short.

I don't think qualitative researchers should be too reassured by the results of this experiment, overall. This is nowhere near the ceiling for frontier performance on thematic coding.
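
To make the "early halting" and "provenance from category back to evidence" points concrete, here is a minimal sketch of a harness-side check. It is not the 15 lines mentioned above, which are not shown in the thread. It assumes each code in the codebook records the ids of the tweets it cites; the names `Code`, `uncoded`, and `may_halt` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """One codebook category plus the tweet ids cited as its evidence."""
    label: str
    tweet_ids: set = field(default_factory=set)


def uncoded(all_ids, codebook):
    """Return the ids of tweets that no code cites as evidence."""
    covered = set().union(*(code.tweet_ids for code in codebook))
    return set(all_ids) - covered


def may_halt(all_ids, codebook, min_coverage=1.0):
    """Allow the agent to stop only once enough tweets have been coded."""
    all_ids = set(all_ids)
    if not all_ids:
        return True
    coverage = 1 - len(uncoded(all_ids, codebook)) / len(all_ids)
    return coverage >= min_coverage


# Illustrative data only: four tweet ids and a two-code codebook.
tweets = {1, 2, 3, 4}
codebook = [Code("pricing complaints", {1, 2}), Code("feature requests", {3})]
print(may_halt(tweets, codebook))          # False: tweet 4 is still uncoded
print(sorted(uncoded(tweets, codebook)))   # [4]
```

A harness could call `may_halt` whenever the agent declares it is done and, if the result is false, hand the uncoded ids back as the next task. The same `tweet_ids` sets would also give a review UI the evidence behind each category.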

Kevin Madura@kmad

@sh_reya Thanks for sharing! I was curious how @DSPyOSS + RLM would address the coding issues, particularly for those where the model gave up. Seemed to work well for exp1 coverage & one-time codes.

This was quick & dirty but see if you agree with the results: https://github.com/kmad/dspy-rlm-qual-analysis/

Yash@yash1_

Check this out guys

Bryan Bischof fka Dr. Donut@BEBischof

I’ve been doing a lot of these kinds of analysis too. Some surprising failure modes! But it’s very valuable work

Parth@parthcodes202

This. Most people are still focused on better prompts, but the harder and more interesting problem is designing agents that can let human preferences emerge over time!

Philip Bankier@philipbankier

Some great toy experiments. There are still interesting problems left to solve before agents are reliable in subjective/messy domains

Shreya Shankar@sh_reya

link to post: https://www.sh-reya.com/blog/ai-qual-analysis/

i also included interactive traces/codebooks for all the experimental conditions so people can inspect the workflows step-by-step themselves: https://www.sh-reya.com/blogimages/ai-qual-analysis/transcripts.html

Ajay Yadav@BetterSayAJ

@sh_reya honestly a lot of agent demos still optimize for objective task completion while completely sidestepping the harder problem which is helping humans think through messy evolving ideas without collapsing nuance too early

LeoSong@LeoSongAI

Most agent setups right now seem optimized for tasks with clear right/wrong answers.

For messier work like strategy or qualitative analysis, it’s much harder to keep the agent from rushing to shallow conclusions or losing context across turns.

Curious if you’ve found any setups that actually handle evolving human judgment well.

Sam Z Liu@samzliu

@sh_reya My intuition for this is that a lot of the value in subjective work comes from the aspects of writing that ICL and prompting can’t learn properly.

It only elicits preset voices (Eg corporate, Shakespeare, etc)

Advancements in fine tuning & RL might make this more possible

Kirill@werkodev

Several findings reduce to one shape: humans can't read what agents are happy writing. Verbose memos, one-off codes, vague axial labels — all cheap to produce, expensive to validate. Past a certain output size, supervision cost crosses the budget and the loop fails by attention, not capability.
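
A back-of-the-envelope sketch of that crossover, under loudly stated assumptions: review cost is modeled as reading time only, and the reading speed, memo sizes, and budget below are made-up illustrative numbers, not figures from the post. The function names are hypothetical.

```python
def review_minutes(n_items, words_per_item, reading_wpm=200):
    """Estimate the minutes a human needs just to read the agent's output."""
    return n_items * words_per_item / reading_wpm


def within_budget(n_items, words_per_item, budget_minutes, reading_wpm=200):
    """True if reading everything fits in the reviewer's attention budget."""
    return review_minutes(n_items, words_per_item, reading_wpm) <= budget_minutes


# Illustrative numbers: 40 memos of 300 words each, 30 minutes of attention.
print(review_minutes(40, 300))      # 60.0
print(within_budget(40, 300, 30))   # False: the loop fails by attention
print(within_budget(40, 100, 30))   # True: shorter memos fit the budget
```

Under this toy model the agent's capability never appears in the calculation; only output volume and reviewer time do.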

Quantamentally ill@risk_seeking

@sh_reya @IanArawjo Wow what a banger

Kevin Madura@kmad

@sh_reya Thanks for sharing, this is valuable insight & a good experiment.

Do you have the raw tweets & data somewhere? I'd like to try reproducing some of this. Claude tried extracting it from the experiment page but seems many are truncated and the original post is over 1k responses

Shreya Shankar@sh_reya

Thanks for reading my post!! I totally agree that there is zero harness engineering going on; I only use the agent SDK; no engineering of skills, prompts, tools, memory, etc. No detailed guidance of how to build the UI. And the dataset is intentionally easy, consisting of short tweets.

I’m curious what you think about: what parts of the problem are doomed to harness engineering, or if you think there are new tools and principles that we can equip agents with in order to adapt them to qual analysis settings. For example, clustering is so well known that I would have imagined agents can do it. Same with using a database. I feel I shouldn’t need to prompt these.

Anyways, if you have any blog posts or papers describing your stack, I would love to read them and learn more!

Bushra Farooqui 📖 🕯️@startuployalist

+1 Working with agents reminds me of "The Mythical Man-Month" by Fred Brooks where he says: "Adding manpower to a late software project makes it later."

Adapted to: "Adding agents to a complex software project makes it *more* complex."

A PM friend who works at Cisco expressed how their team has gone from 3 items per quarter to 22 without changing headcount, and the metacognitive load is insane; the drift between shipping and understanding is growing

Winston B.@DoDataThings

Feels like a flat eval score is the most dangerous outcome, because the judge model shares the generator's blind spots and grades the drift as stable. I've seen it in my own loops: the automated pass shows a clean green board while the behavior has quietly shifted. The qualitative read by hand is still where the definition of good comes from, and automation can only grade against a bar you set there first.
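
One way to check a judge against a bar set by hand is to measure its chance-corrected agreement with a small hand-labeled sample before trusting its scores. The sketch below uses Cohen's kappa, a standard agreement statistic that is not named in the thread; the labels are illustrative and `cohen_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between hand labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of items where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels: the judge passes one output the human marked bad.
human = ["good", "good", "bad", "bad"]
judge = ["good", "good", "good", "bad"]
print(cohen_kappa(human, judge))  # 0.5
```

Raw agreement here is 75%, but kappa is only 0.5 because a judge that leans toward "good" gets many matches by chance. Re-running the check on fresh hand labels over time is one way to notice a green board that no longer tracks human judgment.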

Rani Saro@RawKneey

@sh_reya Out of all three of these points I feel like point C is the hardest to implement simply due to the nature of agents today. A and B will slowly improve with better training and smarter models but C isn't being addressed directly.

Constellation Engine@StarMapEngine

@HamelHusain subjective work is where the “just add tools” agent loop starts leaking everywhere

you can’t unit-test judgment, so the harness has to preserve the messy stuff: sources, uncertainty, rejected paths, reviewer notes. otherwise it’s just confident vibes with a progress bar lol
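
A minimal sketch of a record that keeps that messy context next to each claim, assuming the harness stores one record per interpretation. The class and field names are hypothetical, and the confidence value is a self-reported number rather than a calibrated probability.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Interpretation:
    """One agent claim together with the context a reviewer needs."""
    claim: str
    sources: list                 # evidence ids, e.g. "tweet:17"
    confidence: float             # self-reported, between 0 and 1
    rejected_alternatives: list = field(default_factory=list)
    reviewer_notes: list = field(default_factory=list)

    def needs_review(self, threshold=0.7):
        """Flag claims that are uncertain or cite no evidence at all."""
        return self.confidence < threshold or not self.sources


# Illustrative record; the claim and ids are made up.
record = Interpretation(
    claim="Users treat pricing changes as a trust issue",
    sources=["tweet:17", "tweet:42"],
    confidence=0.55,
    rejected_alternatives=["Complaints are mostly about missing features"],
)
record.reviewer_notes.append("Check whether tweet:42 is sarcasm.")
print(record.needs_review())               # True: confidence is below 0.7
print(json.dumps(asdict(record), indent=2))
```

Serializing the whole record, rather than only the final claim, keeps the rejected paths and reviewer notes available for later turns.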

danialhasan@dhasandev

@sh_reya i gotta evaluate this
