AI threat model
As someone who sees mitigating AI-based (X/S-)risks as the most important problem of our age, in this post I'll briefly go through the core of my AI threat model.
Foreword
Much has been written about the risks posed by Artificial Intelligence, and I have neither the technical brilliance nor the narrative finesse of, say, Eliezer Yudkowsky, Rational Animations or Duncan Sabien. The risks posed by the rapid and uncontrolled development of frontier AI models range from the merely irresponsible to downright existential risks for all of humanity. Since Duncan covers most of this in his far better post linked above, and Olli J masterfully walks through the main arguments in detail on his website, this post is meant as a short personal memorandum.
The Problem
The beginning. The first widespread experimental use of AI systems was probably social media. I have had an innate skepticism towards social media since the early days of the current era, i.e. Facebook and its successors. This excellent lecture from Tristan Harris and Aza Raskin overlaps well with my model of social media. Although these platforms are used by billions of people, a handful of companies dictate, largely unchecked, what kind of content is fed to our brains. Social media in its current form is a civilizational failure: a prelude to later developments and a first step on the path of losing control to automated systems.
Breakthroughs. After several crucial technological breakthroughs in deep learning, such as convolutional neural networks (CNNs) and transformers, the stakes are getting higher and higher. We already live in a reality where
AI agents can be helpful when planning lethal attacks,[1]
We don't understand what is happening inside them, and
The current systems are making significant contributions to ground-breaking scientific research (such as AlphaFold).
Opaqueness. The existing frontier AI models are Alien Minds: even their own creators do not understand their internal decision-making mechanisms. And while it is good that the tech corporations keep their exact architectures and training methods as company secrets rather than opening them to public scrutiny, they also do not allow any thorough red-teaming that would get to the core of these systems.[2]
Lack of control. There is very little oversight of the biggest companies, they face little accountability for the negative externalities they impose on the world, their security measures are limited at best, and competitive dynamics keep accelerating the pace of development.
Multiple negative scenarios. There is also a path where we never reach AGI-level capabilities, but a single actor instead achieves such a technological breakthrough that it can effectively control the world with an advanced yet sub-AGI system. This scenario makes it even harder to convince (state) actors that we should change our current approach.
No precedent. If[3] we are able to develop a smarter-than-human entity, we have absolutely no way of predicting what will happen after that.[4] You don’t gamble with such stakes.
The Solution
Alignment. The biggest recent change in my threat model has been the hopelessness of alignment work. It is not only precariously difficult[5] and utterly under-resourced, but also potentially dangerous. We don’t have any good alignment plan at the moment (see for example Soares), and we probably shouldn't continue publishing such plans while we lack any serious and thorough plan about the big picture. There are two main reasons for this:
The near-miss scenarios could include S-risk.
The fact that even "perfect alignment" (what exactly would that mean, given the breadth of conflicting human values?) would not take away the risk of misuse.
Voice of the masses. I am skeptical about the potential of mass movements, even though groups such as PauseAI have some visibility and Joep has solid takes. I doubt that even very large crowds can make a difference through protests and the like. While it has happened historically, our zeitgeist is different: the average citizen of Western countries is more lobotomized,[6] and the dynamic between Big Tech and the consumer masses is unlike any relationship of the past. Pausing AI development[7] is what I would endorse, but I don’t believe it can happen.
International governance systems. The only concrete option I see as more hopeful than a lottery ticket is international[8] co-operation on the governance of AI systems. We have precedent: the numerous nuclear non-proliferation agreements of the Cold War era. The polarizing statements about the decline of US-China relations are downright dangerous in that they are misleading and create a false narrative about our possible paths forward. While the US is definitely shifting its long-term focus to the Pacific and the commercial relationship between the two countries is somewhat hostile, from a global great-power perspective the situation is not worse than it was during the Cold War. We have yet to see anything comparable to the Cuban Missile Crisis, even with the war in Ukraine approaching its third anniversary.
End state. The preferred end state in the relevant countries is to limit the development of potentially existentially dangerous models and to install mandatory off-switches in current and forthcoming powerful systems. To achieve this, national (or preferably international) legislation must mandate safety-testing protocols, backed by the threat of extremely heavy penalties. This includes, for example, external, unbiased and authoritative red-teaming.
Current state of governance and policy
The AI governance and policy field is developing quickly, but not quickly enough compared to the speed of the AI labs - a perpetual challenge in how legislation keeps pace with emerging technologies, but potentially far graver in this case. The two major powers of the AI field - at least currently - are the US and China, while the EU has the most extensive AI legislation at the moment. The US, China and the EU hold fundamentally different positions.
The United States is concerned about global power politics and uses its old trick - trade and money - as its primary tool, trying to rein in its competitors through chip technology. The Biden administration issued an Executive Order in 2023 obligating anyone training AI models to report and operate under certain safety conditions - but only above a high compute (FLOP) threshold, which no actor currently reaches. Executive Orders can also be easily revoked, should the presidential party change after an election. In the US, the battle between the "e/acc" movement (techno-optimists and fierce defenders of open source, among other things) and those more concerned about safety rages harder than elsewhere. Additionally, all of the tech giants lobby aggressively on Capitol Hill. It should also be mentioned that the first major bill that would have made the giants accountable for catastrophic risks - SB 1047 - was recently vetoed by California's governor.
The European Union chose a different path, as the AI Act has finally been accepted across the Union's legislative bodies. The AI Act is far more restrictive than anything else currently in the world, but its weakness is that it targets many harmless or even beneficial actors alongside the relevant ones. It classifies models by their level of risk and imposes additional obligations on foundation models. As a wild card, the Act appears to apply to many non-EU actors as well, since it covers any end product used inside the Union. One risk the Act faces is the functioning of the AI Office. While on paper the Office looks very promising, it will face big challenges in attracting top talent and in having enough authority to do what is necessary, should a situation arise.
China seems to be the least worrisome actor if the perspective is purely X- and S-risks. China has AI legislation both in force and in preparation, but its focus is on state control. China lags far behind at the model frontier, and apparently its SOTA models are copies from the West, obtained through open sourcing and industrial espionage. Of course, as with the US, one would be wise to expect China to work extensively on using AI in warfare - such news rarely breaks the surface of media coverage, but we should assume these things are definitely happening.
The UK is likely the most relevant additional actor, being the hub of the AI scene in Europe, with established organizations doing safety research and a state-backed "task force" (AISI) for safety work. Other countries, such as India, have developed national AI strategies; however, their resources are not currently sufficient to make them main focal points.
To conclude, my take on the key challenges in governance and policy work at the moment:
A sensible common definition of AI.
Politicians' willingness to impose strict enough restrictions, complicated by the tech giants' lobbying and the compromise-centered structure of Western politics.
The willingness to give oversight institutions real power.
The willingness of countries to make compromises on national defense.
[1] Such as building bombs, conducting cyberattacks against critical infrastructure and, of course, automated warfare.
[2] Following the Bletchley Park AI Safety Summit in November 2023, major AI labs agreed to allow independent testing of their AI systems before release. However, some of the companies have not allowed the testing, and it remains disputed how closely those that have actually listen to the results.
[3] While we should take the words of the leading tech labs - who say it will happen “within a few thousand days” - with a grain of salt, the broad consensus in the field has shifted to significantly shorter timelines in the last few years. Even a few decades would be too soon if our safety measures keep being neglected in the current manner.
[4] However, I would encourage you to spend a couple of minutes thinking about what usually happens when two agents with a major power imbalance meet.
[5] Even proposals and takes that get major attention, such as Aschenbrenner’s Situational Awareness, completely dismiss the technical difficulties of alignment work. See, for example, Thane Ruthenis.
[6] More intelligent, more educated and wealthier, yes; but also with more to lose, complacent and lulled by the shackles of a fairly comfortable life. Orwell’s and Huxley’s dystopias are often presented as opposite realities, but I see both existing on our planet at the same time. A medieval peasant who revolted faced the hangman, or at least some combination of violence and poverty, but his daily living conditions were also far more miserable, and thus he had much less to lose.
[7] Pausing as in halting all frontier AI development until we know how to do it safely, regardless of how long that takes.
[8] Not global, at least not yet, but collective for the countries currently developing frontier models. As development becomes cheaper and more accessible, and as trained models improve and open-source models become increasingly dangerous, global governance will likely become necessary.