Jan Kulveit advances Convergent Abstraction Hypothesis for AI systems
Jan Kulveit posted an analysis of abstractions, such as goodness, learned by AI systems. He distinguished three possibilities (natural, convergent, or shallow) and advanced the Convergent Abstraction Hypothesis: that these abstractions converge broadly across models and training processes. Seth Lazar responded by proposing a further category of socially constructed conceptions of good. Tan Zhi-Xuan reposted the thread.
This is cool, and bears on something we've been thinking about too (cc @danielmurfet). I think there's probably a third kind besides convergent and natural (or perhaps it's a subset of convergent), which would be some sort of socially constructed, constructionist conception of good. So, not a natural kind, in the sense of not really being there in the world (not reducible to naturalistic properties), but also not merely a convergent representation that the models happen to arrive at: something of genuine normative significance.
Worth saying that all these possibilities have technical names in metaphysics that I'm ignorant of, but which would probably be quite useful for giving a little scaffolding to the discussion. I wonder if @LedermanHarvey could do a bit of translation/parsing.
One of the more important questions about AIs is to what extent the abstractions of goodness they learn are natural, versus convergent, versus shallow. My guess is "broadly convergent", but I need to explain what I mean by that, so: "Convergent Abstraction Hypothesis".