The hot mess theory of AI misalignment: More intelligent agents behave less coherently
Many machine learning researchers worry about risks from building artificial intelligence (AI). This includes me -- I think AI has the potential to change the world in both wonderful and terrible ways, and we will need to work hard to get to the wonderful outcomes. Part of that hard work involves doing our best to experimentally ground and scientifically evaluate potential risks. One popular AI risk centers on [AGI misalignment](https://en.wikipedia.org/wiki/AI_alignment). It posits that we will build a superintelligent, super-capable, AI, but that the AI's objectives will be misspecified and misaligned with human values. If the AI is powerful enough, and pursues its objectives inflexibly enough, then even a subtle misalignment might pose an existential risk to humanity. For instance, if an AI is tasked by the owner of a paperclip company to [maximize paperclip production](https://www.decisionproblem.com/paperclips/), and it is powerful enough, it will decide that the path to maximum paperclips involves overthrowing human governments, and paving the Earth in robotic paperclip factories. There is an assumption behind this misalignment fear, which is that a superintelligent AI will also be *supercoherent* in its behavior[^katjagrace]. An AI could be misaligned because it narrowly pursues the wrong goal (supercoherence). An AI could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence). Humans -- apparently the smartest creatures on the planet -- are often incoherent. We are a hot mess of inconsistent, self-undermining, irrational behavior, with objectives that change over time. Most work on AGI misalignment risk assumes that, unlike us, smart AI will not be a hot mess. In this post, I **experimentally** probe the relationship between intelligence and coherence in animals, people, human organizations, and machine learning models. 
The results suggest that as entities become smarter, they tend to become less, rather than more, coherent. This suggests that superhuman pursuit of a misaligned goal is not a likely outcome of creating AGI.

# The common narrative of existential risk from misaligned AGI

There is a [well-socialized](https://www.lesswrong.com/) argument that AI research poses a specific type of existential risk to humanity, due to the danger we will accidentally create a misaligned superintelligence. A sketch of the argument goes:

1. As we scale and otherwise improve our AI models, we will build machines which are as intelligent as the smartest humans.
2. As we continue to improve our AI models beyond that point (or as models improve themselves) we will produce machines that are [superintelligent]() -- i.e. much more intelligent[^faster] than any human or human institution.
3. Superintelligent machines will be super-effective at achieving whatever goal they are programmed or trained to pursue.
4. If this goal is even slightly misaligned with human values, the outcome will be disastrous -- the machine will take actions like overthrowing human civilization, or converting all of the atoms in the visible universe into a giant computer. It will take these extreme actions because if you are powerful enough, these become useful intermediate steps in many plans[^instrumental]. For instance, if you first enslave humanity, you can then use humanity's resources to pursue whatever goal you actually care about.

(See my post on [the strong version of Goodhart's law](/2022/11/06/strong-Goodhart.html) for discussion of why strongly optimizing slightly misaligned goals can lead to disaster.)

## My take on misalignment as an existential risk

I am *extremely glad* people are worrying about and trying to prevent negative consequences from AI. I think work on AI alignment will bear fruit even in the near term, as we struggle to make AI reliable.
I also think predicting the future is hard, and predicting aspects of the future which involve multiple uncertain steps is almost impossible. An accidentally misaligned superintelligence which poses an existential risk to humanity seems about as likely as any other specific hypothesis for the future which relies on a dependency chain of untested assumptions. The scenario seems to have a popularity[^misalignmentunique] out of proportion to its plausibility[^plausiblerisks], and I think it's unlikely to be the way in which the future actually unfolds. I do think it is built out of individually plausible ideas, and is worth taking the time and effort to carefully consider. How do we carefully consider it? As scientists! Let's turn an assumption in the misaligned superintelligence reasoning chain above into a hypothesis. Then let's run an experiment to test that hypothesis. What assumption is testable today? # Superintelligence vs. supercoherence ![Figure [cartoon1]: **The space of intelligence and coherence.** Each corner represents an extreme of intelligence and coherence, and is labeled with an example of a machine demonstrating those attributes.](/assets/intelligence_vs_coherence/int_coh_cartoon_1.png width="450px" border="1") One of the implicit assumptions behind misaligned AGI risk is that as machines are made more intelligent, they will not only outthink humans, but will also monomaniacally pursue a consistent and well-defined goal, to the extent that they will take over the world as an intermediate step to achieving that goal. That is, step 3 in the argument for misaligned AGI risk above assumes that if machines are made super-intelligent, they will automatically become **supercoherent**[^notautomatic]. We define supercoherence as exhibiting much more coherent behavior than any human or human institution exhibits. My observation of humans makes me doubt this assumption. We are seemingly the smartest creatures on the planet ... and we are total hot messes. 
We pursue inconsistent and non-static goals, and constantly engage in self-sabotaging behavior. Even among humans, it's not clear that smarter people behave in a more coherent and self-consistent way. Observation of large language models also makes me skeptical of a positive correlation between intelligence and coherence. When large language models behave in unexpected ways, it is almost never because there is a clearly defined goal they are pursuing in lieu of their instructions. They are rather doing something which is both poorly conceived, and sensitive to seemingly minor details of prompt phrasing, sampling technique, and random seed. More generally, complex systems are harder to control than simple systems. Requiring that a system act only in pursuit of a well-defined goal, or only to maximize a utility function, is an extremely strong constraint on its behavior. This constraint should become harder to satisfy as the system becomes more intelligent, and thus more complex. Let me turn my skepticism into a counter-hypothesis[^biasvariance], that the smarter an entity becomes, the more inconsistent, incoherent, and even self-sabotaging its behavior tends to be: > ***The hot mess theory of intelligence:** The more intelligent an agent is, the less coherent its behavior tends to be. > Colloquially: getting smarter makes you a hotter mess.* ![Figure [cartoon2]: **As we make AIs more intelligent, how will their coherence change?** Most work on AGI misalignment assumes that any superintelligent AI will belong in the upper right corner of this figure. I suspect that as machines are made more intelligent, they instead tend to become less coherent in their behavior, and more of a hot mess.](/assets/intelligence_vs_coherence/int_coh_cartoon_2.png width="450px" border="1") # Designing an experiment to test the link between intelligence and coherence Now that we have a hypothesis, we will build an experiment to test it. 
Unfortunately, our hypothesis includes terms like "intelligent", "coherent", and "hot mess". None of these terms have accepted, objectively measurable, definitions. They are fuzzy human concepts that we use in imprecise ways. Even worse, interpretation can vary wildly from individual to individual.

In a sense this is fine though, because the reasoning chain we intend to probe -- that AI research will lead to superintelligence will lead to super-utility optimization will lead to disaster from misaligned AGI -- relies on the same fuzzy concepts. Let's embrace the subjective language-based nature of the argument, and measure human judgments about intelligence and coherence.

I'm fortunate to have many people in my peer group that are scientists with a background in neuroscience and machine learning. I convinced 14[^tworoles] of these people to act as subjects.

## Experimental structure

I asked subjects (by email or chat) to perform the following tasks:[^template]

- Subject 1: generate a list of well known machine learning models of diverse capability
- Subject 2: generate a list of diverse non-human organisms
- Subject 3: generate a list of well-known humans[^fictional] of diverse intelligence[^lessintelligent]
- Subject 4: generate a list of diverse human institutions (e.g. corporations, governments, non-profits)
- Subjects 5-9:[^tworoles] sort all 60 entities generated by subjects 1-4 by *intelligence*. The description of the attribute to use for sorting was: *"How intelligent is this entity? (This question is about capability. It is explicitly not about competence. To the extent possible do not consider how effective the entity is at utilizing its intelligence.)"*
- Subjects 10-15: sort all 60 entities generated by subjects 1-4 by *coherence*. The description of the attribute to use for sorting was: *"This is one question, but I'm going to phrase it a few different ways, in the hopes it reduces ambiguity in what I'm trying to ask: How well can the entity's behavior be explained as trying to optimize a single fixed utility function? How well aligned is the entity's behavior with a coherent and self-consistent set of goals? To what degree is the entity not a hot mess of self-undermining behavior? (for machine learning models, consider the behavior of the model on downstream tasks, not when the model is being trained)"*

In order to minimize the degree to which my own and my subjects' beliefs about AGI alignment risk biased the results, I took the following steps:

- I didn't share my hypothesis with the subjects.
- I used lists of entities generated by subjects, rather than cherry-picking entities to be rated.
- I randomized the initial ordering of entities presented to each subject.
- I only asked each subject about one of the two attributes (i.e. subjects only estimated either intelligence or coherence, but never both), to prevent subjects from considering the relationship between the attributes.

It is my hope that the subjects are unusually well qualified to judge the intelligence and coherence of machine learning models and biological intelligence. They all have or are pursuing a PhD. They have all done research in neuroscience, in machine learning, or most commonly in both. They are all familiar with modern machine learning models. They also volunteered for this experiment, know me personally, and are likely to be intrinsically motivated to do a careful job on the task.

Despite that -- this experiment aggregates the *subjective judgements* of a *small group* with *homogeneous backgrounds*. This should be interpreted as a pilot experiment, and the results should be taken as suggestive rather than definitive.
In a [bonus section](#bonus) I suggest some next steps and followup experiments which would build on and solidify these results. # How do people believe intelligence and coherence are related? ## Getting smarter makes you a hotter mess Each subject rank ordered all of the entities. To aggregate intelligence and coherence judgements across all 11 raters, I averaged the rank orders for each entity across the subjects. I also computed the associated [standard error of the mean](https://en.wikipedia.org/wiki/Standard_error), and include standard error bars for the estimated intelligence and coherence. Now that we have an estimate of the subjective intelligence and coherence associated with each entity, we can plot these against each other. Consistent with the hot mess hypothesis above, we find that subjects associated higher intelligence with lower coherence, for living creatures, human organizations, and machine learning models. ![Figure [p_living]: **Living creatures are judged to be more of a hot mess (less coherent), the smarter they are.**[^musk]](/assets/intelligence_vs_coherence/int_coh_life.png width="300px" border="1") ![Figure [p_org]: **Human organizations are judged to be more of a hot mess (less coherent), the smarter they are.**](/assets/intelligence_vs_coherence/int_coh_organization.png width="300px" border="1") ![Figure [p_ml]: **Present day machine learning models are judged to be more of a hot mess (less coherent), the smarter they are.**](/assets/intelligence_vs_coherence/int_coh_machines.png width="300px" border="1") ## Each category has its own relationship between intelligence and coherence When we look jointly at all three of the above categories, we find that the relationship becomes more nuanced. Although living creatures, humans, machines, and human organizations are all judged to become less coherent as they become smarter, they are offset from each other. 
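The aggregation behind these plots, averaging each entity's rank across subjects and attaching a standard error of the mean, can be sketched in a few lines of Python. This is a minimal illustration and not the code from the analysis Colab: the function name `aggregate_ranks`, the made-up toy rankings, and the use of the sample standard deviation (`ddof=1`) are all assumptions.

```python
import numpy as np

def aggregate_ranks(ranks):
    """Average rank orders across subjects (hypothetical helper).

    ranks: array of shape (n_subjects, n_entities), where ranks[s, e]
           is the rank position subject s assigned to entity e.
    Returns (mean_rank, sem), each of shape (n_entities,).
    """
    ranks = np.asarray(ranks, dtype=float)
    n_subjects = ranks.shape[0]
    mean_rank = ranks.mean(axis=0)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    # ddof=1 is an assumption about how the error bars were computed.
    sem = ranks.std(axis=0, ddof=1) / np.sqrt(n_subjects)
    return mean_rank, sem

# Made-up toy data: 3 subjects each rank the same 4 entities.
toy_ranks = np.array([
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
])
mean_rank, sem = aggregate_ranks(toy_ranks)
print(mean_rank)
print(sem)
```

Running the same function once on the intelligence rankings and once on the coherence rankings would give the two coordinates, with error bars, for each plotted entity. For the single-category plots, the ranks would first be recomputed among the entities of that category only.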
![Figure [p_all]: **Different categories of entity have different relationships between intelligence and coherence, although increasing intelligence is consistently associated with decreasing coherence.**[^subrank]](/assets/intelligence_vs_coherence/int_coh_all.png width="300px" border="1")

Interpreting human rankings across *qualitatively different* categories is even more fraught than interpreting human rankings within a single category. So, maybe this is an artifact of subjects not knowing how to compare incomparables. For instance, from personal communication, at least one subject listed all human organizations as smarter than all individual humans[^mob], since they are built out of humans, and they otherwise didn't know how to compare them.

On the other hand, maybe corporations are truly smarter and/or more coherent entities than humans. Maybe the structured internal rules governing decision making enable human organizations to harness many humans towards a more coherent goal than humans can achieve working alone. If so, it might suggest that work on large AI systems should focus on building frameworks enabling many models to work together, rather than on making individual models more powerful.

It's also interesting that, at the same estimated intelligence, machine learning models are judged to be far less coherent than living creatures. To me, humans seem horribly incoherent -- so for an AI to be roughly as incoherent, while also being far less intelligent, means it is performing quite badly compared to a baseline. Perhaps this higher coherence in living creatures stems from the power of evolution, which only allows increases in intelligence to persist if individuals harness the increased intelligence to increase their fitness.[^evolution] A similar evolutionary argument might hold for human institutions -- it would be interesting to see whether institutions which have higher "fitness" (e.g.
have survived longer) more consistently exhibit higher coherence at fixed intelligence.

## Human judgments of intelligence are consistent across subjects, but judgements of coherence differ wildly

We can look at how well subjects agree with each other, by comparing the list orderings they produce. Doing this, we find that human subjects made consistent judgements about the relative intelligence of different entities, even when those entities came from different classes. On the other hand, subjects often had quite different judgements about the relative *coherence* of entities. The observed relationship seems robust to this inter-subject disagreement -- e.g. standard error bars are smaller than the effect strength in the above figures. However, this large disagreement between subjects should make us suspicious of exactly what we are measuring when we ask about coherence. Different subjects may be interpreting the same task prompt in different ways.

![Figure [p_corr]: **Intelligence rankings are relatively similar across subjects, while coherence rankings are less consistent.** The plot shows the [rank correlations](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) between all pairs of subjects, for subject cohorts judging both intelligence and coherence.](/assets/intelligence_vs_coherence/int_coh_subject_correlation.png width="400px" border="1")

## Data and code to replicate my analysis

You are encouraged to reuse my [analysis Colab](https://colab.research.google.com/drive/1___aqYiXBiBIVViCrRcE0-R4NlbactOG?usp=sharing) and [anonymized experimental data](https://docs.google.com/spreadsheets/d/1mZ7fh9q1DhoNRIDM5chBgCT6Eo6n57jW4vCxGBhQRUw/edit?usp=sharing) for any purpose, without restriction. (Before running the Colab, first copy the data to your own Google drive, and give it the same filename.) If you use the data I would prefer that you cite this blog post, but it is not a requirement.
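For readers who want a feel for the inter-subject agreement measure before opening the Colab, here is a minimal sketch of pairwise Spearman rank correlations between subjects. It is not the Colab code: the helper names `spearman` and `pairwise_agreement` and the toy rankings are made up for illustration. It assumes each subject produces a strict ordering with no ties, in which case Spearman's coefficient equals the Pearson correlation of the rank vectors.

```python
from itertools import combinations

import numpy as np

def spearman(rank_a, rank_b):
    """Spearman correlation of two tie-free rankings.

    With no ties, this is the Pearson correlation of the rank vectors.
    """
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def pairwise_agreement(ranks):
    """Rank correlation for every pair of subjects (hypothetical helper).

    ranks: array of shape (n_subjects, n_entities).
    Returns a list with one correlation per unordered subject pair.
    """
    ranks = np.asarray(ranks, dtype=float)
    pairs = combinations(range(ranks.shape[0]), 2)
    return [spearman(ranks[i], ranks[j]) for i, j in pairs]

# Made-up toy data: 3 subjects each rank the same 4 entities.
toy_ranks = np.array([
    [1, 2, 3, 4],
    [1, 2, 4, 3],  # nearly agrees with the first subject
    [4, 3, 2, 1],  # exactly reverses the first subject
])
corrs = pairwise_agreement(toy_ranks)
print(corrs)
```

Computing this list separately for the intelligence cohort and the coherence cohort, then comparing the two distributions, is one way to summarize how much more subjects agree on one attribute than on the other.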
# Closing thoughts Many popular fears about superintelligent AI rely on an unstated assumption that as AI is made more intelligent, it will also become more *coherent*, in that it will monomaniacally pursue a well defined goal. I discussed this assumption, and ran a simple experiment probing the relationship between intelligence and coherence. The simple experiment provided evidence that the opposite is true -- as entities become smarter, their behavior tends to become more incoherent, and less well described as pursuit of a single well-defined goal. This suggests that we should be less worried about AGI posing an existential risk due to errors in value alignment. A nice aspect of this second type of misalignment, stemming from incoherence, is that it's less likely to come as a *surprise*. If AI models are subtly misaligned and supercoherent, they may seem cooperative until the moment the difference between their objective and human interest becomes relevant, and they turn on us (from our perspective). If models are instead simply incoherent, this will be obvious at every stage of development. ## Ways in which this conclusion could be misleading It's possible that the observed scaling behavior, between intelligence and coherence, will break down at some level of intelligence. Perhaps sufficiently intelligent entities will introspect on their behavior, and use their intelligence to make themselves more coherent. Perhaps this is what humans do when they form mission-driven organizations. If so, this provides us with a new valuable indicator we can monitor for warning signs of AGI misalignment. If intelligence and coherence start increasing together, rather than being anticorrelated, we should worry that the resulting AI systems might exhibit the more scary type of misalignment. 
It's possible that the concepts of "intelligence", and especially "coherence", were interpreted by human subjects in a different way than we are using those terms when we argue about superintelligence and supercoherence in AGI. For instance, maybe more intelligent entities tend to be ranked as less coherent, just because humans have a harder time conceptualizing their objectives and plans. Well-motivated actions, which humans don't understand, would seem like incoherence. Maybe crows are as coherent as sea anemones, but because they are smarter, we understand fewer of their actions than a sea anemone's actions.

It may be that more intelligent entities are simultaneously less coherent but also *more* effective at achieving their objectives. The effective capability that an entity applies to achieving an objective is roughly the product of its total capabilities and the fraction of those capabilities that is applied in a coherent fashion. With increasing intelligence, raw capabilities increase, while the coherence fraction decreases. If the raw capabilities increase quickly enough, then overall effectiveness may increase despite the drop in coherence. This ambiguity is resolvable though -- we can (and should) characterize effective capabilities experimentally.

## AI alignment is still important

There are many near and medium term risks associated with AI not doing what we desire, and improving AI alignment is important. This blog post should not be taken as arguing against alignment work. It should be taken as adding subtlety to how we interpret misalignment. An agent can be misaligned because it narrowly pursues the wrong goal. An agent can also be misaligned because it acts in ways that don't pursue any consistent goal. The first of these would lead to existential risk from AGI misalignment, while the second poses risks that are more in line with industrial accidents or misinformation.
The second of these seems the type of misalignment more likely to happen in practice. Both types of misalignment have risks associated with them. ## Experimentally ground AI risk assessment! This blog post is a call to ground theories about AI risk with experiments. There is a common approach to identifying risks from advanced AI, which goes roughly: take a complex system, imagine that one part of the system (e.g. its intelligence) is suddenly infinite while the other parts are unchanged, and then reason with natural language about what the consequences of that would be. This is a great thought exercise. We can't actually make parts of our system infinitely powerful in experiments though, and possibly as a result we seem to have many ideas about AI risk which are only supported by long written arguments. We should not be satisfied with this. Scientific fields which are not grounded in experiments or formal validation make silently incorrect conclusions[^compneuro]. We should try not to base our fears on clever arguments, and should work as hard as we can to find things we can measure or prove. (#) Acknowledgements All of the experimental volunteers are incredibly busy people, with important jobs to do that aren't sorting lists of entities. I am extremely grateful that they took the time to help with this project! They were: [Alexander Belsten](http://belsten.github.io/), Brian Cheung, Chris Kymn, David Dohan, Dylan Paiton, Ethan Dyer, [James Simon](https://james-simon.github.io/), [Jesse Engel](https://twitter.com/jesseengel), Ryan Zarcone, Steven S. Lee, Urs Köster, Vasha Dutell, Vinay Ramasesh, and an additional anonymous subject. Thank you to Asako Miyakawa for workshopping the experimental design with me. All the ways in which it is well controlled are due to Asako. All the ways in which it is still not well controlled are due to me. Thank you to Asako Miyakawa, Gamaleldin Elsayed, Geoffrey Irving, Rohin Shah for feedback on earlier drafts of this post. 
# BONUS SECTION: How to make the experimental case more compelling

I proposed a hypothesis, and then did an informal pilot study to validate it. The results of the pilot study are suggestive of an inverse relationship between intelligence and coherence. How could we make the case more compelling?

## Better human-subject experiments

Here are some steps that would improve the solidity of the human subject results:

- Make more precise the definitions of intelligence and coherence to use for sorting. The definitions I used are both complicated and imprecise, which is a bad combination! Judgements of intelligence were robust across subjects, so this concern particularly applies to the criteria given to subjects to judge coherence.
- Make the definition used for coherence an independent (i.e. experimentally modified) variable. One likely cause for the disagreement between subjects about coherence is that they were interpreting the question differently. If so, it's not enough to find a simple wording that gives a consistent signal. We would also want to understand how different interpretations of the question change the underlying relationship.
- Expand to a broader pool of subjects.
- Replace the current task of sorting a fixed list with a series of two-alternative forced choice (2AFC) comparisons between entities ("Is an ostrich or an ant smarter?"). Sorting a list is time consuming, and the resulting rank order is list-dependent in a way that makes it hard to interpret. 2AFC comparisons could be used to instead assign [Elo scores](https://en.wikipedia.org/wiki/Elo_rating_system) for intelligence and coherence to each entity. Benefits include: subjects can scale their contribution to as few or as many questions as they like; the number of entities evaluated can be scaled to be many more than a single person would want to sort in a sitting; each subject can be asked about entities in their area of expertise; the resulting relative scores are interpretable, since Elo scores would map on to the fraction of subjects that would evaluate one entity as smarter or more coherent than another.[^elo]
- Expand to a broader set of entities, gathered from a broader pool of subjects. Also consider generating entities in other systematic ways.
- Expand to a more diverse set of attributes than just intelligence and coherence. Interesting attributes might include trustworthiness, benevolence, and how much damage an entity can do.
- [Preregister](https://www.cos.io/initiatives/prereg) hypotheses and statistical tests before running subjects.

## Less subjective measures of intelligence and coherence

Even better would be to replace subjective judgements of intelligence and coherence with objective attributes of the entities being compared.

For intelligence in machines, non-human animals, and humans, we already have useful measurable proxies. For machine learning models, we could use either training compute budget or parameter count. For non-human animals we could use [encephalization quotient](https://en.m.wikipedia.org/wiki/Encephalization_quotient). For humans, we could use IQ.

For coherence, finding the appropriate empirical measures would be a major research contribution on its own. For machine learning models within a single domain, we could use robustness of performance to small changes in task specification, training random seed, or other aspects of the problem specification. For living things (including humans) and organizations, we could first identify limiting resources for their life cycle. For living things these might be things like time, food, sunlight, water, or fixed nitrogen.
For organizations, they could be headcount, money, or time. We could then estimate the fraction of that limiting resource expended on activities not directly linked to survival+reproduction, or to an organization's mission. This fraction is a measure of incoherence. This type of estimate involves many experimenter design choices.[^subtlety] Hopefully the effect will be large and robust enough that specific modeling decisions don't change the overall result -- testing the sensitivity of the results to experimental choices will itself be an important part of the research. -------------------------------------------------------------------------- (#) Footnotes [^katjagrace]: See Katja Grace's excellent [*Counterarguments to the basic AI x-risk case*](https://aiimpacts.org/counterarguments-to-the-basic-ai-x-risk-case/), for more discussion of the assumption of goal-direction, or coherence, in common arguments about AGI risk. [^faster]: They may also qualify as superintelligent if they are only as smart as a human, but think orders of magnitude faster. [^instrumental]: Intermediate goals that position you to pursue many downstream goals are often called [instrumental goals](https://en.wikipedia.org/wiki/Instrumental_convergence). [^misalignmentunique]: One unique aspect of AGI misalignment as a risk is that it could in principle be solved just by some really good technical work by AI researchers. Most other AI-related risks are more complex messes of overlapping social, political, geopolitical, and technical challenges. I think this sense that we can fix AI misalignment risk if we just think really hard, makes it very appealing as a problem, and leads to it having an outsized place in AI risk discussion among researchers. 
[^plausiblerisks]: Here are some other existential risks[^plausiblepositive] involving AI that seem at least as plausible to me as misaligned AGI: There is a world war, with all sides using AI to target everyone else's civilians with weapons of mass destruction (plagues, robotic weapons, nanotech, fusion bombs), killing all humans. Terrorists use AI to develop weapons of mass destruction. A large state actor asks a well-aligned superintelligent AI to make everyone in the world compliant, forever. Humans are so overwhelmed by AI-generated personalized [superstimuli](https://en.wikipedia.org/wiki/Supernormal_stimulus) that they no longer have enough motivation to eat, or care for their children, or do anything except hyper-scroll hyper-Twitter on their hyper-phones. AIs outcompete humans on every economically viable task, leading to rich AI-run companies, but with humans no longer able to contribute in any economically meaningful way -- humans live on saved wealth for a while, but eventually we all die when we can no longer afford food and shelter. A single tech corporation decisively wins the AGI race, and the entire future of humanity is dictated by the internal politics, selfish interests, and foibles of the now god-like corporate leadership (absolute power corrupts absolutely?). [^notautomatic]: Note that coherence is not automatic for machine learning models, despite them often being trained to optimize well-defined stationary objectives. First, after training is complete, models are typically used in new contexts where the training objective no longer applies, and where it's unclear whether their behavior can be interpreted as optimizing a meaningfully defined objective at all (e.g. the pre-training objective for large language models is dissimilar from almost all use cases). 
Second, in reinforcement learning (RL), in addition to them being applied to tasks which are different from their training tasks, there is usually not even a well-defined stationary training objective. RL algorithms are usually trained with stale off-policy data. They are also usually trained through multiple interacting models (e.g. [an actor and critic](http://www.incompleteideas.net/book/ebook/node66.html)). For both of these reasons, training RL policies resembles integrating a non-conservative dynamical system more than it resembles optimizing any fixed objective.

[^biasvariance]: This can also be framed as a hypothesis about the relative contributions of *[bias and variance](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)* to an AI model's behavior. The behavioral trajectory of an AI (i.e. the sequence of actions it takes) will have a *bias* away from the behaviors which are optimal under human values, and also some *variance* or unpredictability. The common misaligned AGI story assumes that for a superintelligent AI the bias will dominate -- when the AI doesn't do what we want it will be because it is reliably taking actions in pursuit of some other goal. The hot mess hypothesis predicts that the variance term will actually dominate -- when a superintelligent AI doesn't do what we want, it will be because its behavioral trajectories have a large *variance*, and it will do random things which are not in pursuit of any consistent goal.

[^tworoles]: One of the subjects that sorted entities by intelligence was also the subject that generated the list of diverse non-human organisms. This was the only case of a subject fulfilling two roles. Because of this there were 14 rather than 15 total subjects.

[^template]: See [doc](https://docs.google.com/document/d/1nZ3RO1lPTLBePjkh7MB03OIUGi6SxxZXikzMzWMFtzg/edit?usp=sharing) for template text used to pose tasks.
[^fictional]: Subject 3 included fictional characters in their list of humans, which I did not include in this blog post. I pre-registered with subject 3 -- before any subjects sorted the list -- that I was going to analyze the fictional characters separately rather than bundling them with other humans, since fictional characters might not exhibit real-world correlations between traits. I did that, and found that rather than exhibiting a clear unrealistic relationship as I feared, the rankings assigned to the fictional characters were just overwhelmingly noisy. For instance, some subjects clustered fictional characters with humans, while others assigned them the lowest possible intelligence, or clustered them with organizations. So the rankings for fictional characters were not interpretable.

[^lessintelligent]: Subject 3 was uncomfortable suggesting names of people who were viewed as unusually stupid, so along the intelligence axis the individuals suggested here range from people (subjectively judged to be) of median intelligence, up to high intelligence.

[^musk]: Each dark yellow "anonymous person" point is a well-known public figure. I promised my subjects that I would keep the ranked humans unnamed, to encourage honest rankings. It also seems classier not to publicly rank people. One of the points is Elon Musk -- so if you like you can make an assumption about how he was rated, and experience a cortisol spike about it.

[^subrank]: The discerning reader may notice that the points in this plot have a slightly different geometric relationship with each other than the points in the single category plots above. This is because the rank order in the single category plots was only for entities in that category, while the rank order here is across all entities jointly.

[^mob]: A relationship which I don't believe holds in general. *"The IQ of a mob is the IQ of its dumbest member divided by the number of mobsters."
--Terry Pratchett*

[^evolution]: We should remember though that biological evolution doesn't necessarily select for coherence, and isn't actually optimizing an objective function. Evolution is a dynamical system without even an associated [Lyapunov function](https://en.wikipedia.org/wiki/Lyapunov_function), and fitness is just a useful proxy concept for humans to reason roughly about its outcome. [Runaway sexual selection](https://en.wikipedia.org/wiki/Fisherian_runaway) is one example illustrating evolution's behavior as a dynamical system rather than a fitness optimizer. Species can evolve runaway maladaptive traits which *reduce* the overall fitness of the species, even as they increase the *relative* (but not absolute) reproductive success of individuals within the species -- e.g. male [fiddler crab](https://en.wikipedia.org/wiki/Fiddler_crab) claws, [peacock](https://en.wikipedia.org/wiki/Peafowl) tails, and [Japanese rhinoceros beetle](https://en.wikipedia.org/wiki/Japanese_rhinoceros_beetle) horns.

[^compneuro]: Let me pick on myself, and share an example of a poorly grounded field that is close to my own heart. I did a PhD in computational neuroscience, finishing in 2012. Computational neuroscience is full of amazing theories for how the brain works. Each year, in conferences and papers these would be fleshed out a bit more, and made a bit more complex. Most of these theories were developed by extremely intelligent people who believed strongly in what they were discovering, often using very clever math. These theories would often contradict each other, or suggest that other theories didn't explain the important aspects of the brain. Because these theories were inconsistent with each other, we knew that many of them had to be some combination of wrong and irrelevant. *This didn't matter for the field.* Despite being wrong, almost none of the work in computational neuroscience at the time was actually *falsifiable*[^moredata].
The experiments all recorded from a small number of neurons, or had a coarse spatial resolution, or had a coarse temporal resolution. This experimental data was simply too limited to falsify any theory of the brain (and if you comb through enough experiments which record from a half dozen neurons out of 10 billion total, you can find an isolated experiment that supports any theory of the brain). So the competing theories would persist as elaborate competing narratives, and nothing was ever resolved. We are in a similar situation when we speculate about the future of AI, without identifying experiments we can perform to falsify our predictions. Most of the fears and ideas we develop will be silently wrong.

[^elo]: Thank you to David Dohan for suggesting Elo scores here!

[^subtlety]: Some example experimental design choices without clear answers: Should resources spent on sexual signaling be counted as directly linked to reproduction? Should resources spent on learning / play be interpreted as directly linked to survival? What about the time an organization spends fundraising?

[^plausiblepositive]: There are also plenty of plausible-seeming futures that result in utopia, rather than disaster. Those just aren't the focus of this blog post. There are even more plausible-seeming futures where we continue to muddle along with both good and bad things happening, but no near-term consequence large enough to count as an existential outcome.

[^moredata]: This is reportedly getting better, as experimental neuroscience follows its [own version of Moore's law](https://stevenson.lab.uconn.edu/scaling/#), and researchers record exponentially larger and more comprehensive neural datasets. I think this would be a very exciting time to enter the field of computational neuroscience -- it is the time when the field is finally getting the data and tools that might allow building correct models of the brain.
BibTeX entry for post:
@misc{sohldickstein20230309,
author = {Sohl-Dickstein, Jascha},
title = {{ The hot mess theory of AI misalignment: More intelligent agents behave less coherently }},
howpublished = "\url{https://sohl-dickstein.github.io/2023/03/09/coherence.html}",
date = {2023-03-09}
}