Built my own thematic coding engine for @deuteroai and... I have several issues with the conclusion "Agents don’t understand what qualitative analysis is."
- This isn't Grounded Theory (a methodology properly so called), it's inductive thematic coding (a data analysis technique used in, but not confined to, GT) for content analysis, which is fine in itself;
- On its own terms, this is far from a reference implementation of GT ("identify the CORE category that integrates everything" sticks out - presuming the data is grand narrative-shaped is problematic);
- The data is low quality - roughly 1/4 of the tweets are non-responsive (chit-chat, spam) and should have been filtered out before coding, & it would have been good to reconstruct dialogue sequences more explicitly (see next point);
- Doing this with no harness code, combined with the somewhat non-linear tasking in the prompts, is "hard mode" for the model. My intuition is that this kind of all-tokens workflow is more prone to collapsing into a local maximum - e.g. the early halting problem is solved in 15 vibe-coded LoC;
- GT isn't done in a "clean room" - so I'm not surprised that withholding the research context (top-line questions/objectives) led to generic outputs. There is "bracketing" (see Interpretative Phenomenological Analysis, GT's more hermeneutically-inclined cousin) but it has obvious limits;
- Reasoning changes everything for this task (at obvious token cost but deepseek-v4-pro is excellent and sub-$1/Mt), Sonnet is both overkill and underperforming here;
- With more robust harnessing (including a proper DB) the human validation could have a much better UI, "example tweets under each category and provenance from category back to evidence" etc. (of course, the UI deficiency is acknowledged, but I didn't see it concluded that this is a harness-engineering problem);
- Doing NLP things like combining k-means clustering with LLM-based coding is also highly effective (and I would lean towards it with a dataset like this), either on its own or as a step in inductive thematic coding (ITC);
- "What happens when the documents are long, like interview transcripts" made me lol, obviously ;) IME (as both ethnographer and builder) this is upside down - interview transcripts are much easier to work with, tweets are inherently challenging *bc* they're short.
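To make the harness point concrete, here is a minimal sketch of the kind of loop I mean: the code, not the model, decides when coding is finished. It is an illustration under assumptions, not anyone's actual implementation. `code_all`, `code_batch` and `flaky_coder` are hypothetical names, and `flaky_coder` is a stand-in for a model call that stops early.

```python
from typing import Callable, Dict, List


def code_all(
    items: Dict[str, str],
    code_batch: Callable[[Dict[str, str]], Dict[str, List[str]]],
    batch_size: int = 20,
    max_rounds: int = 3,
) -> Dict[str, List[str]]:
    """Re-dispatch uncoded items until everything has codes or rounds run out."""
    coded: Dict[str, List[str]] = {}
    for _ in range(max_rounds):
        pending = [i for i in items if i not in coded]
        if not pending:
            break
        for start in range(0, len(pending), batch_size):
            batch = {i: items[i] for i in pending[start:start + batch_size]}
            result = code_batch(batch)
            # Accept only ids we actually asked about, with non-empty codes.
            coded.update({i: c for i, c in result.items() if i in batch and c})
    missing = [i for i in items if i not in coded]
    if missing:
        raise RuntimeError(f"{len(missing)} items still uncoded: {missing[:5]}")
    return coded


def flaky_coder(batch: Dict[str, str]) -> Dict[str, List[str]]:
    # Hypothetical stand-in for an LLM call that "halts early":
    # it only codes the first half of whatever it is given.
    ids = list(batch)
    keep = ids[: max(1, len(ids) // 2)]
    return {i: ["placeholder-code"] for i in keep}


tweets = {f"t{n}": f"tweet text {n}" for n in range(10)}
codes = code_all(tweets, flaky_coder, batch_size=4, max_rounds=5)
print(len(codes))
```

The design choice is simply that coverage is tracked outside the prompt, so a model that stops early gets asked again about exactly the items it skipped.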
I don't think qualitative researchers should be too reassured by the results of this experiment, overall. This is nowhere near the ceiling for frontier performance on thematic coding.
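For the clustering point above, a toy sketch of "cluster first, then code per cluster". Big caveats: this is a dependency-free, cosine-based (spherical-style) k-means over bag-of-words counts with deterministic seeding, purely to show the shape of the step. In practice I would assume proper embeddings and a library implementation. All function names and the sample texts are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def vectorize(text: str) -> Counter:
    # Bag-of-words counts; a real pipeline would likely use embeddings.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vectors: List[Counter], k: int, iters: int = 10) -> List[int]:
    # Deterministic farthest-point seeding (an assumption, chosen so the
    # sketch is reproducible without a random seed).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Recompute each centroid as the summed counts of its members
        # (same direction as the mean, which is all cosine cares about).
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                total: Counter = Counter()
                for m in members:
                    total.update(m)
                centroids[j] = total
    return labels


def group_by_cluster(texts: List[str], labels: List[int]) -> Dict[int, List[str]]:
    groups: Dict[int, List[str]] = defaultdict(list)
    for text, lab in zip(texts, labels):
        groups[lab].append(text)
    return dict(groups)


texts = [
    "shipping was slow and late",
    "late shipping again so slow",
    "love the new app design",
    "the app design looks great love it",
]
labels = kmeans([vectorize(t) for t in texts], k=2)
clusters = group_by_cluster(texts, labels)
print(labels)  # [0, 0, 1, 1]
```

Each entry in `clusters` is then what you would hand to the LLM coder, so it names and codes one coherent group at a time instead of the whole pile.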