Last updated: 09.30.2024
Published: 09.30.2024

(Part 1, Otherness) Extended audio/transcript from my conversation with Dwarkesh Patel

I recently had the pleasure of recording a podcast with Dwarkesh Patel. The final episode was ~2.5 hours, but we actually spoke for about six hours – four about my “Otherness and control in the age of AGI” series, and two about the basic AI takeover story. I thought it might be worth making the extra content public in some form, in case it might be of interest to anyone; so with Dwarkesh’s permission, I’m posting the full conversation on my audio feed here, along with transcripts on my website.1

I’ve split the conversation into two parts, according to the topics mentioned above. This post is for the first part we recorded, on the Otherness series. Extra content includes:

(Note: the time-stamps in the transcript are approximate, due to my having added a bit of intro to the audio.)

Transcript

Dwarkesh Patel (00:00:00):

Today I have the pleasure of chatting with Joe Carlsmith. He is a philosopher — in my opinion, a capital-G great philosopher. You can find his stuff at joecarlsmith.com, and today we’re going to be discussing his recent series, “Otherness and control in the age of AGI.” Joe, welcome to the podcast.

Joe Carlsmith (00:00:19):

Great to be here. Thanks for having me.

Why might we regret alignment?

Dwarkesh Patel (00:00:20):

First question. If, in a hundred years’ time, we look back on alignment and conclude it was a huge mistake, that we should have just tried to build the most raw, powerful AI systems we could have, what would bring about such a judgment? What’s the most likely scenario?

Joe Carlsmith (00:00:35):

I think I want to first name a few scenarios that are nearby to that one that are quite salient to me, but which I don’t think quite fit the bill. So one scenario I think about a lot is one in which it just turns out that various goals related to alignment that we have are just fairly easy to achieve. So maybe fairly basic measures are enough to ensure, for example, that AIs don’t cause catastrophic harm, don’t seek power in problematic ways, et cetera. And it could turn out that we learn it was easy in a way such that we have regrets, that we wish we had prioritized differently. We end up thinking: oh, I wish we could have cured cancer sooner, we could have handled some geopolitical dynamic differently, we could have just generally allocated our resources differently.

(00:01:33):

Now, I think this is a real possibility. I think it’s the kind of thing you sign up for whenever you act to mitigate a risk that isn’t a hundred percent. So maybe you invest in responding well to the next pandemic over some period and then there isn’t some pandemic and so you’re like, oh, in principle, you could have used those resources differently. You could have prioritized differently. But it was still worth it ahead of time. But that’s a scenario I think about.

(00:01:58):

There’s another scenario where we end up looking back at some period of our history and how we thought about AIs, how we treated our AIs, and we end up looking back with a kind of moral horror at what we were doing. So we were thinking about these things centrally as products, as tools, but in fact we should have been foregrounding much more the sense in which they might be moral patients, or were moral patients at some level of sophistication. That we were treating them in the wrong way. We were just acting like we could do whatever we want. We could delete them, subject them to arbitrary experiments, alter their minds in arbitrary ways, and then we end up looking back in the light of history at that as a serious and grave moral error. So I think about that too.

(00:02:47):

Now, notably, that’s not distinctive to alignment. So if you look at how we do capabilities right now, it also involves deleting the AIs, grinding them through eons of training. So I don’t think concerns about AI moral patienthood are distinctive to alignment. And people who are skeptical of alignment aren’t skeptical on the basis of thinking, “oh, we need to be prioritizing moral patienthood more,” right? They just assume that alignment is easy and they want to, in some sense, just make sure we’re exploiting AI labor to the fullest possible extent. But that’s a real scenario.

(00:03:30):

But again, so those are scenarios I think about a lot in which we have regrets. I don’t think they quite fit the bill of what you just said. It sounds to me like the thing you’re thinking of is something more like: we end up feeling like, gosh, we wish we had paid no attention to the motives of our AIs, that we’d thought not at all about their impact on our society as we incorporated them, and instead we had pursued a “maximize for brute power” option, which is just: make a beeline for whatever is the most powerful AI you can, and don’t think about anything else. Okay, so I’m very skeptical that that’s what we’re going to wish.

Dwarkesh Patel (00:04:14):

Can I give an example of what this could have looked like? I’m not making a strong case that this is historically accurate, just to jog an intuition: if you’re an early 20th century economist, it might’ve been best if you just didn’t think about inequality and class differences at all, in the sense that of course there was a problem, but somehow just pressing on that would’ve opened a can of worms that was much more destructive than pressing forward with techno-capital.

Joe Carlsmith (00:04:48):

Yeah, I think that’s an interesting sort of case. Sometimes there’s something that’s worth paying attention to, but in practice it doesn’t lead to good results to pay attention to it.

(00:05:01):

Maybe it’s worth distinguishing between various ways you could end up feeling good about a maximize for brute power route to AGI. So one is there’s this question that comes up: what exactly is that route? What sort of training process did it involve? What do we have in mind here? There’s one version where it just so happens that that route leads to an AI, or a set of AIs, or a paradigm of AI, such that the AIs in question end up nice, non-destructive, integrating into our society in a healthy way, et cetera. But it just so happens that we got lucky. So there are different ways you could end up creating a powerful AI in principle. I have a friend who sometimes talks about how it could be the case that the way to get the most powerful AI is to train it in a murdering-humans simulator, right? All of its tasks involve murdering humans, or simulated humans, or something like that. You might worry about that AI, once you have it, because that’s a scary type of training. You might worry that it’s picked up some stuff about murdering humans that you didn’t want, even though it ends up really powerful. So I don’t actually expect that to be the fastest path, but there’s a question of how robust was the niceness that you got out of this, and maybe you got lucky, maybe whatever this fastest path was happened to be nice. That’s one way.

(00:06:32):

Another way you could feel good about the “maximize for brute power option” is: it very convergently leads to niceness, because of something like: the AI, when it’s intelligent, it reflects and it becomes morally enlightened, and it sees the true light, as we all will if we are smart enough. I’m skeptical of that, but that’s a view some people have.

(00:07:02):

And then there’s a third option, which is my least favorite, which is something like: to worship brute power in itself. So, to be happy with whatever AI, maybe even to be happy if it murders you and your children, and then whatever it goes off to do, maybe it maximizes torture or something, maybe it’s not conventionally nice at all, but you don’t care, because all that matters to you is that it’s smarter than you, and powerful. And so in that case, you’re not having a hypothesis that power or smartness will lead to some other type of value. You’re just such that you worship power and smartness on their own, wherever they lead. I think that view is sort of morally repugnant, but I think it’s worth distinguishing between these.

(00:07:54):

The one that I think is most likely, if I had to guess, is the first one: if it’s the case that we end up being like, “oh, it would’ve been fine or great to have just maximized for brute power in our AIs,” it’s because it just so happened that the way of doing that led to niceness.

Dwarkesh Patel (00:08:09):

Maybe just stepping back, I want to think about: what is this conversation? The audience might think, well, you’re forecasting all these AI capabilities, you’re assuming them, and I will say, for the sake of this conversation, we are, because if you want to hear about when exactly we should expect AGI, I have many other conversations that get into the cruxes of the debates there. You can check those out here. What I want to do, and what I really appreciate about your essay series, is to meditate on what should be the relationship we have to these AIs. What kinds of things should we expect them to become?

Joe Carlsmith (00:08:45):

Yeah, so one thing I want to just quickly throw in on that front: I mentioned this possibility that AIs might be moral patients. I can imagine a lot of folks in the audience being quite skeptical of that, especially of AI now, but maybe even of much more sophisticated AI in the future. So I just want to flag, there’s a conversation to be had about that. I have a take, but I understand many people might be like, what are you talking about? This is just software, et cetera. So we can get into that if you like.

Dwarkesh Patel (00:09:12):

Yeah, in fact, that will be very interesting, because there’s a couple of different ways of thinking about this issue. One is, if you’re going to treat them very seriously as moral patients, there’s a couple of words we use when we’re talking about other humans, like “slavery” or “brainwashing” or “totalitarianism,” which are very loaded in the human context, and there’s a question of whether it’s appropriate to use them in the AI case. If you don’t, then... anyways, we’ll get into that digression later.

(00:09:45):

I think the last thing you mentioned is: you got to support techno-capital, it’s the true morality. And there’s obviously a straw version of that which is very weak, which, I don’t know, e/acc at its weakest, if you really think about what they’re saying, involves: if the gray goo takes over, then the gray goo singleton is what we worship. I think what they must mean is something more empirical, because at the end of the day, I wonder how much of this philosophy stuff, when you’re talking about AI, is really about empirical predictions. I think at the end of the day, people who have different moral preferences, if they just concretely see what the galaxy looks like, I don’t know anybody who would endorse the gray goo singleton. On the other hand, I don’t know anybody who would endorse enslaving the human-like AIs or torturing them or something. So I actually do wonder how much of this is a philosophical discussion at all. I wonder how much of it is empirically, what kind of world do you expect there to be in a hundred years?

Joe Carlsmith (00:10:40):

Yeah, I think in practice, very few people will consciously sign up for: “yes, all that matters is that the AI is smarter than us. I don’t care if it kills everyone. I don’t care if it maximizes for torture. I don’t care about gray goo.” I think in practice, people who endorse just letting ’er rip with competition and some sort of techno-capital, et cetera, are expecting it to lead to places that they like, or that they would like on reflection. So I suspect that many people, their response to that would be centrally empirical. Mostly I just want to make sure we’re distinguishing between the empirical aspect of some kind of allegiance to power and competition, and the ethical aspect of like, oh, there’s a hierarchy of beings, and if some being is smarter, then just for that reason…. I mean, look, I think we should have lots of respect for beings that are smarter than us, even if they have different values from us. But I also think there’s quite a bit more to goodness than power.

Are humans misaligned from monkeys in a bad way?

Dwarkesh Patel (00:11:47):

Yeah. One common example that’s given of misalignment is humans from evolution, and you have one line in your series that’s something like: here’s a simple argument for AI risk, a monkey should be careful before inventing humans. The paperclipper metaphor implies something really banal and boring with regards to misalignment, and I think if I’m steelmanning the people who worship power, they have the sense of: humans got misaligned and they started pursuing things… This is a weird analogy because obviously monkeys didn’t create humans. But if a monkey was creating them, they’re not thinking about bananas all day. They’re thinking about other things. They’re obviously not enslaved by the monkeys or aligned with the monkeys. On the other hand, they didn’t just make useless stone tools and pile them up in caves in a paperclipper fashion. There were all these things that emerged because of their greater intelligence, which were misaligned with evolution — creativity and love and music and beauty and all the other things we value about human culture. And the prediction maybe they have, which is more of an empirical statement than a philosophical statement, is: listen, with greater intelligence, even if it’s misaligned, it’ll be misaligned in this kind of way. It’ll be things that are alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paperclipper is alien to a human.

Joe Carlsmith (00:13:22):

Cool, so I think there’s a bunch of different things to potentially unpack there. One conceptual point that I want to name off the bat, I don’t think you’re necessarily making a mistake in this vein, but I just want to name it as a possible mistake in this vicinity, is: I think we don’t want to engage in the following form of reasoning. Let’s say you have two entities. One is in the role of creator and one is in the role of creation, and then we’re positing that there’s this misalignment relation between them, whatever that means. And here’s a pattern of reasoning that I think you want to watch out for: to say, “in my role as creation,” say you’re thinking of humans in the role of creation relative to an entity like evolution or monkeys or mice or whoever you could imagine inventing humans, you say, “qua creation, I’m happy that I was created and happy with the misalignment. Therefore, if I end up in the role of *creator* and we have a structurally analogous relation in which there’s misalignment with some creation, I should expect to be happy with that as well.”

(00:14:40):

So here’s a way of thinking about this mistake. Obviously this is very crude. This isn’t what I’m actually expecting from an AI or anything, but just to see the mistake, imagine you had three sizes of fish. So you have a little fish and a medium fish and a big fish. And the little fish says, “you know what would be a good idea, is to create this medium fish.” Medium fish eats little fish. Now, medium fish might be like, “that was great. Misalignment is awesome. I got created, I had this lovely meal. This is wonderful. Great, so therefore I shall make bigger fish.” Now, assuming these fish are selfish, which we don’t need to assume, but at a logical level, you don’t want to be that middle fish and to say, “oh, therefore it’s great to switch my role.” And so now I’m not saying that’s what you’re saying with respect to evolution or monkeys, but I just want to name that point as something to watch out for in this vicinity.

Dwarkesh Patel (00:15:37):

Yeah, that’s a useful intuition. Now, I think maybe my intuition comes… you had a really interesting line in your essay rebutting Robin Hanson’s argument about AI risk, where he said, listen, we’re misaligned with the values of the past, and you say, maybe the past should have cared about the fact that we’re misaligned with their values. Feel free to express that in a more nuanced way, but I think the thing I have in mind is: if monkeys were creating humans, there are things about humans they wouldn’t have preferred, but I have the sense that on reflection, if the monkeys really thought about it and could think about it, they would be like, yeah, I’m glad humanity is a thing that exists. I don’t know how to phrase it in terms of how that fits into their values. Whereas if somehow a horde of locusts emerged out of monkeys, I think they would be right to say, “oh, this was terrible, we just converted all the banana trees into locust poop, that was horrible.” But I don’t know how to verbalize why those feel distinct. The analogous thing for humans to AIs, obviously, is: paperclippers are like the horde of locusts, versus the kind of thing that is to us what a human is to a monkey, and even if they take control of things, somehow I feel better about it than the paperclippers.

Joe Carlsmith (00:16:58):

I think that is the right question to ask in this logical structure. The question is: how did the creator feel about the misalignment? And here you’re asking, you’re assuming the monkeys are in the role of creator, and you’re asking how would the monkeys feel on reflection about some process of creating misaligned humans?

(00:17:22):

We might want to say a little bit more about the scenario we have in mind, because when people talk about alignment, they have in mind a number of different types of goals. So one type of goal is quite minimal, something like: the AIs don’t kill everyone, or violently disempower people. And then slightly broader, and this is a version I talk about in my series, is that they uphold minimal standards of cooperation and decency that we expect of human citizens. So they obey the law, they respect basic rights and constraints and boundaries in the way that we ask of human citizens. That’s a minimal standard.

(00:18:03):

There’s a second thing people sometimes want out of alignment, which is I think more ethically dubious, which is something like: good servants. AIs that will be fully obedient and an instrument of the human will, or of the will of some principal. That’s a different thing.

(00:18:30):

And then there’s a third thing that people often talk about, which is much broader, which is something like: we would like it to be the case that our AIs are such that when we incorporate them into our society, things are good. That we have a good future. Now, obviously alignment is totally, I mean, alignment …

(00:18:48):

Mostly when I talk about alignment as a technical thing, I’m mostly talking about the technical know-how involved in understanding and controlling the motives of our AIs, where by motives, I mean something like: the factors internal to the AI that determine how it employs its capabilities. I think some people are resistant to talking about AI motives, and we can talk about whether that’s too anthropocentric, but something about, the AI has a capability latent within it, what determines when that capability gets employed or directed.

(00:19:21):

Obviously having that kind of know-how can feed into these goals in different ways. I think there are things you can do other than controlling the AI’s motives to try to achieve some of these goals. And obviously you need way more for something like a good future than just knowing how to control the values of your AIs.

(00:19:39):

Okay, so I think it’s important to distinguish those different aspects of alignment and the different things we could have in mind in the monkeys case. Let’s just start with the first one.

(00:19:46):

So in the scenario we’re imagining with the monkeys creating the humans, one question is just: did the humans kill the monkeys? Now, we’re not talking about actual evolution here or anything. I’m just talking about, we’re imagining some hypothetical scenario where the monkeys have some button for creating humans. Now if they killed the monkeys or really violently disempowered the monkeys in some other way … obviously we can ask about monkeys’ values on reflection, whatever that means, but I suspect, for many monkeys, they just don’t want to be killed, right? That’s a pretty core monkey thing. It’s a pretty core thing for lots of creatures. I think it’s a reasonable preference. And so that’s one important dimension to just talk about.

(00:20:35):

There’s a different thing which is like, maybe let’s skip the servant one, though we can talk about that, and talk about just the goodness. So maybe set aside the killing and just talk about how would the monkeys feel about human civilization on reflection? Would they be like, okay, they have culture, they have love, these things, it’s not paperclips, they’re not just piling up stone tools, it’s not a locust thing….

(00:21:01):

And so a few things to note there. One is monkeys have a lot of similarities with humans, so we might want to run it with other creatures. So you talk about octopi or talk about mice or talk about locusts themselves. I worry a little, talking about monkeys, people get into this evolution and next stage of evolution mindset. I don’t think we should be understanding AI as evolution in any technical sense. There’s not genetic selection occurring. I worry a little bit about this chain of being vibe people have.

(00:21:37):

But even setting that aside, I think there’s just this general question of: what did the monkeys want on reflection? It is hard to say. It is actually hard to talk about what the monkeys want on reflection. And even if it’s the case that the monkeys would like human civilization on reflection, my concern would be that the AIs might be notably more different from us than we are from monkeys.

(00:22:03):

And I think there’s also a thing about: what’s the monkey’s alternative? So maybe they’d sort of like human civilization on reflection, but suppose they could have gone a little slower and been a little more careful with that button and they could have gotten something else. How would they feel about that? Now, obviously there could be costs to doing that, and so we can talk about that.

(00:22:21):

So those are a few factors I think about in the context of that sort of example.

Philosophers forecasting the singularity

Dwarkesh Patel (00:22:25):

There’s a couple of philosophers that you brought up in the series which, if you read the works that you talk about, actually seem incredibly foresighted in anticipating something like a singularity, our ability to shape a future thing that’s different, smarter, maybe better than us. Obviously C.S. Lewis’s “The Abolition of Man,” which we’ll talk about in a second, is one example, but even here’s one passage from Nietzsche, which I felt really highlighted this.

(00:22:54):

“Man is a rope stretched between the animal and the superman, a rope over an abyss, a dangerous crossing, a dangerous wayfaring, a dangerous looking back, a dangerous trembling and halting.”

(00:23:05):

And I understand there’s a danger in me not knowing the exact context in which these philosophers were talking about this, and considering everything only in the context in which I’m thinking about it. But is there some explanation for why? Is it just somehow obvious that something like this is coming, even if you’re thinking 200 years ago?

Joe Carlsmith (00:23:24):

I think I have a much better grip on what’s going on with Lewis than with Nietzsche there. So maybe let’s just talk about Lewis for a second.

(00:23:29):

And we should distinguish between … there’s a version of the singularity that’s specifically a hypothesis about feedback loops with AI capabilities. I don’t think that’s present in Lewis. I think what Lewis is anticipating, and I do think this is a relatively simple forecast, is something like the culmination of the project of scientific modernity. So Lewis is looking out at the world and he’s seeing this process of increased understanding of the natural environment and a corresponding increase in our ability to control and direct that environment. And then he’s also pairing that with a metaphysical hypothesis – or, well, his stance on this metaphysical hypothesis I think is problematically unclear in the book — but there is this metaphysical hypothesis, naturalism, which says that humans too, and minds, beings, agents, are a part of nature. And so insofar as this process of scientific modernity involves a progressively greater understanding of and ability to control nature, that will presumably at some point grow to encompass our own natures and the natures of other beings that in principle we could create. And Lewis views this as a cataclysmic event and crisis, and in particular thinks that it will lead to all these tyrannical behaviors and tyrannical attitudes towards morality and stuff like that, unless you believe in non-naturalism or in some form of the Tao, which is this objective morality. So we can talk about that, but part of what I’m trying to do in that essay is to say, no, I think we can be naturalists and also be decent humans that remain in touch with a rich set of norms that have to do with how we relate to the possibility of creating creatures, altering ourselves, et cetera. But I do think it’s a relatively simple prediction. It’s: science masters nature, humans part of nature, science masters humans.

Reflection, agency, and utility functions

Dwarkesh Patel (00:25:48):

A sort of question I have, which might be a naive question, or at least I feel it’s naive when I talk to rationalists, is this idea that, listen, when humans reflect on our values, we think through: here’s some things I believe, I think dog suffering is bad, but there’s also these pigs in these factory farming crates, and if I just reflect on how all this meshes together, I come to the conclusion that the pig suffering is also bad. And so in that sense, just consolidating your preferences has a very moral, beneficial, almost enlightening effect. Whereas with these AIs, the story is they’ll look in their own source code, realize that when you net everything out, what they really want is paperclips. And I’m confused on why in one scenario it’s a benevolent story, and in the other one, it’s a dystopic story.

Joe Carlsmith (00:26:39):

So I think it’s to do with the good outcome being defined relative to your starter values and how they extrapolate or what happens when you engage in this process of reflection. So I think the sort of moral anti-realism that I think of as paradigmatically giving this sort of view that it’s good when humans reflect but bad when something that has very different values from humans reflects is basically: the true morality is being defined as the morality that would be output by the process of… or the morality, your values, the thing that you should be fighting for on this meta-ethics is just the thing that you would fight for having gone through this process of reflection. But that output isn’t necessarily the same as what happens if you start with a very different set of values.

(00:27:37):

So you’re talking about this case in my essay of like, okay, so you learn that someone is keeping their dog in a crate and is going to eat it, and you get very upset, and then you remember about the pigs and the crates, and you notice you’re not upset about that. And so you ask yourself, okay, so what is it? What’s going on with this? What is it that I really care about with the dog? I think there’s a very interesting question, what’s going on? How do we understand that process of really trying to inquire into what is it that I care about in this dog? But one thing that can happen is you realize that the thing you really care about in the dog is its suffering, or something related to suffering. And then you realize that that’s present in the pig too. And so you are expanding your values or systematizing your values via this process. But that process has all these starter inputs. So we started with your intuition about the dog, and we were able to ask, okay, so what is it about the dog?

(00:28:42):

I think the analogous thing with the sort of thing that has been trained to maximize paperclips, you do expect reflection, but the reflection is like, okay, I’ve got this bent piece of metal, and it’s like, but is it a paperclip? And so you look at a standard paperclip and you’re like, okay, what do I really care about in this standard paperclip? And so you’re going through, but you have to start with this initial caring and then you systematize it. And so that’s, in the paperclip example, why you wouldn’t necessarily expect a paperclipper, or some being that had no compassion for anything, or whatever…

(00:29:20):

Now, it’s an interesting question. You can do this with, say, yourself. You can be like, okay, I care about myself, and that’s maybe a very robust thing. Or: I care about my own preferences. And you can be like, okay, wait, what is it that I really care about when I care about … So there’s an interesting question of how generally a process of reflection applies across beings. But the basic intuition is something like: you have to have an initial input of values that then get systematized, and that can go in a bunch of different directions.

Dwarkesh Patel (00:29:55):

And then you also have another very interesting essay about humans: what should we expect if other humans did this sort of extrapolation, if they had greater capabilities and so on. You’ll actually find this interesting to know: when I interviewed Eliezer, I did ask him what would happen if a human foomed. And his answer was, this is going to sound a little bit cringe and ridiculous, but this was his answer. So his answer was that: I’m getting to the point where I can sort of see the kinds of things a foomed agent would do, given, let’s say, his intelligence, of how I would power-seek and so forth. And I think he might have endorsed: yep, if I foomed, I don’t know, bad things might happen.

Joe Carlsmith (00:30:40):

Yeah, I mean, I think an uncomfortable thing about the conceptual setup at stake in these sorts of abstract discussions of, okay, you have this agent, it fooms, which is this amorphous process of going from a seed agent to a superintelligent version of itself, often imagined to preserve its values along the way such that it’s analogous to the process of reflection we just talked about. Bunch of questions we can raise about that. But I think many of the arguments that people will often talk about in the context of reasons to be scared of AI, like “oh, value is very fragile, as you foom, small differences in utility functions can decorrelate very hard and drive in quite different directions.” And “oh, agents have instrumental incentives to seek power, and if it was arbitrarily easy to get power, then they would do it” and stuff like that. These are very general arguments that seem to suggest that it’s not just an AI thing. It’s no surprise, right? It’s talking about: take a thing, make it arbitrarily powerful such that it’s God-emperor of the universe or something. How scared are you of that? Clearly we should be equally scared of that. Or I dunno, we should be really scared of that with humans too, right? So I mean part of what I’m saying in that essay is that I think in some sense this is much more a story about balance of power and about maintaining checks and balances and a distribution of power period, not just about humans versus AIs and the differences between human values and AI values. Now that said, I do think many humans would likely be nicer if they foomed than certain types of AIs. But I think, as for the conceptual structure of the argument, it’s a very open question how much it applies to humans as well.

Dwarkesh Patel (00:32:51):

One big question I have is, I don’t even know how to express this, but: how confident are we in this ontology of expressing what agents are, what capabilities are? Where I feel like somebody like Eliezer is pretty confident that this is what’s happening. And I don’t know if this is a good way to judge how robust these things are, but how long have we had decision theory? How many decades is it, less than a century? I know you’ve written a bunch about this, or at least about how the theory works. How do we know this is the thing that’s happening, or that this is the way to think about what intelligences are?

Joe Carlsmith (00:33:40):

It’s clearly this very janky… I mean, well, people maybe disagree about this. It’s obvious to everyone with respect to real-world human agents that thinking of humans as having utility functions is at best a very lossy approximation of what’s going on. I think it’s likely to mislead as you amp up the intelligence of various agents as well, though I think Eliezer might disagree about that. I think Eliezer does have allegiance to this notion that agents will not step on their own feet. Agents will be effective, or at least you won’t be able to notice ways in which they’re ineffective, and in virtue of various coherence arguments, this will mean that agents end up with values that are utility function shaped. I’m pretty skeptical of that, especially for this messy intermediate period with AIs that might be somewhat better than humans, but still have their own idiosyncrasies and their own incoherence. And so I think we should definitely be skeptical of like, oh, obviously the whole thing is utility functions.

(00:35:01):

I will say I think there’s something adjacent to that that seems more real to me, which is something like: it is the case that stuff that happens in the world is sometimes importantly explained by someone having planned for that thing to happen, searched over ways to cause that thing to happen, evaluated those ways according to criteria, and then chosen, and then adjusted as obstacles came up. So I think about, I dunno, my mom: a few years ago, she wanted to get a house, she wanted to get a new dog. Now she has both. How did this happen? She tried, it was hard. She had to search for the house. It was hard to find the dog. Now she has a house, now she has a dog. This is a very common thing that happens all the time. And I don’t think we need to be like, “my mom has to have a utility function with the dog and she has to have a consistent valuation of all the houses or whatever,” but it’s still the case that her planning and her agency exerted in the world resulted in her having this house, having this dog. And I think it is plausible that as our scientific and technological power advances, more and more stuff will be explicable in that way. That if you look and you’re like, why is this man on the moon? How did that happen? And it’s like, well, there was a whole cognitive process, there was a whole planning apparatus. Now in this case, it wasn’t localized in a single mind, but there was a whole thing such that: man on the moon. And I think we’ll see a bunch more of that. And the AIs will be doing a bunch of it. And so that’s the thing that seems more real to me than utility functions.

Dwarkesh Patel (00:36:48):

So yeah, with the man on the moon example, there’s a proximal story of how exactly NASA engineered the spacecraft to get to the moon. There’s the more distal geopolitical story of why we sent people to the moon. And at all those levels, there’s different utility functions clashing. Maybe there’s a meta societal world utility function, but maybe the story there is there’s some balance of power between these agents and that’s why the emergent thing happens. Why we sent things to the moon is not that one guy had a utility function, but, I don’t know, Cold War, dot dot dot, things happened. Whereas I think the alignment stuff is a lot about assuming that one thing is a thing that will control everything: how do we control the thing that controls everything? Now, I guess it’s not clear what you do to reinforce balance of power. It could just be that balance of power is not a thing that happens once you have things that can make themselves intelligent. But that seems interestingly different from the how-we-got-to-the-moon story.

Joe Carlsmith (00:37:57):

Yeah, I agree. I think there’s a few things going on there. So one is that I do think that even if you’re engaged in this ontology of carving up the world into different agencies, at the least you don’t want to assume that they’re all unitary, or not overlapping. It’s not like, all right, we’ve got this agent, let’s carve out one part of the world, it’s one agent over here. It’s this whole messy ecosystem, teeming with niches, and this whole thing. And I think in discussions of AI, sometimes people slip between being like, well, an agent is anything that gets anything done, it could be this weird mooshy thing, and then sometimes very obviously imagining an individual actor. And so that’s one difference.

(00:38:47):

I also just think we should be really going for the balance of power thing. I think it is just not good to be like, we’re going to have a dictator, let’s make sure we make the dictator the right dictator. I’m like, whoa, no. I think the goal should be, sort of, we all foom together. It’s the whole thing, in this kind of inclusive and pluralistic way, in a way that satisfies the values of tons of stakeholders, and at no point is there one single point of failure on all these things. I think that’s what we should be striving for here. And I think that’s true of the human power aspect of AI, and I think it’s true of the AI part as well.

Being wrong, subjective Bayesianism, and predictable updating

Dwarkesh Patel (00:39:29):

Yeah, I think that’s going to be a great jumping-off point for the discussion about deep atheism. And I think fundamentally, I wonder how much of the discussion is just: do the laws of physics fundamentally favor the foomed-up dictator? Do they favor the balance of power? Anyways, before we do that, there’s one thing I want to flag, one concern I have. So I was at an AI forecasting thing a couple months ago, and it was actually a useful exercise where they had us talk about what you expect to happen next year and the year after that and so forth until you get to AGI. And then we would discuss our things, and then somebody who had maybe closer timelines would say, I interpreted the bio anchors thing in this one way, and here’s how you should look at the transfer learning scaling curves, or whatever. And while that was happening... I think this is probably a fatuous comparison, and I don’t actually know the intellectual history here, but I’m imagining some version of this that might be a useful intuition pump. Imagine your bunch of really smart friends are Marxists in the early 20th century and you’re at a cafe. I mean, a bunch of really smart philosophers and intellectuals were Marxists, and there’s millions of pages that they wrote about their discourses on: here’s what Hegel’s arguments imply about the downfall of capitalism and when exactly that will happen, but here’s what this other thinker implies, but they didn’t take into account this factor… I don’t know, I don’t actually know what the arguments were, but I can imagine something like that happening, and I wonder, if I was there, whether I would’ve been like, wait, I think this ontology is just wrong. I think it would be difficult to explain why if you didn’t have modern economic concepts. But maybe, I don’t know where this question is leading exactly, but the sort of concern I have is that this whole ontology could be wrong, and it’s like one fundamental thing, but if you’re missing it, you’re just being led into a “when will capitalism fall” timelines discussion.

Joe Carlsmith (00:41:30):

Yeah, I mean, it’s just scarily possible to be wrong. Wrong in really harmful ways. The example that I think about more than Marxism is the population bomb discourse, where, I haven’t looked at it super closely, but in some sense the argument is very simple and empirical: you have this empirical trend of population growth, you have these simple examples of bacteria in petri dishes exploding and then crashing. And if I imagine being in certain social circles back during that time, yeah, it’s a really scary possibility. And there were really destructive things done in the name of that prediction. And yeah, it is really, really scary. You can just be missing just a few things. And I actually think that argument is in some sense conceptually simpler than some of these other AI-related arguments.

(00:42:45):

I will say there’s a thing here where I think we’re not used to thinking about the sense in which a Bayesian epistemology has to live in multiple worlds at once. I think it’s hard to be like, I give 10% or 15% to something, and really live in the fact that it’s 15%. And in particular, it’s a 15% (or more, whatever the prediction) where it’s not like a dice roll: it’s something where, when you look back at it, you’re going to see a reasoning error. It’s not an objective probability. You’re doing subjective Bayesianism. There’s logical uncertainty, you’re having to reason through the stuff, you’re doing this difficult dance. I think people are forgiving of themselves when they make bets with dice and it’s like, whatever, it was a good bet in expectation, it didn’t come up five or whatever, I bet on five. But it’s easy to be much less forgiving of yourself when you’re trying to do this difficult work of subjectively giving the appropriate weight to hypotheses that may well just not be the case. But that’s the game. Many people are not in a place of: oh, a hundred percent, obviously that’s right. They’re trying to be like, it might be the case, this, wow, that’s really important, and then maybe we got to do some stuff to prepare for that or think ahead. And sometimes you’re like, maybe your majority probability is that it’s not going to happen, and you want to be clear, you might have regrets in all those worlds and you really got to think about that. But also, you got to think about these worlds too. And I think we don’t actually have a rich culture of doing that. I think people are used to being like: I have a view, I’m a Marxist, I’m a short timelines guy, or something. And we haven’t integrated the living-with-uncertainty thing very deeply.

Dwarkesh Patel (00:44:45):

And that goes both ways. So one example: after I interviewed Leopold, I gave his 2027 AGI prediction some more thought, and I’m like, I’ll give it like 25% for 2028 or before. Now that I just heard you speak on this, I’m like, oh, 25% is a lot. If someone told me there’s a 25% chance you’re going to go bankrupt in four years, I would take that very seriously. You have a really good essay about, I forgot what the title is, but just feeling in your gut the sense of timelines or doom or whatever it is. And for me personally, I don’t know, I’ve been having these AI discussions for a year or two, and it was just fun, intellectual. And so when I was preparing to interview Dario, I heard about his big blob of compute document, which he wrote in 2018, where he explains that you just put more compute in these things, they become intelligent. And for some reason, just the idea that somebody thought this in 2018 had a powerful, I-felt-it-in-my-gut effect, in the sense of: what did you think was going to happen? I dunno how to express it, but the sense of: of course, this is where it was going. And the fact that a couple years before, I was talking about this... yeah, of course this is what happens. I don’t know if you had some such moment where it hit.

Joe Carlsmith (00:46:12):

So I think there is a general dynamic where once the truth is revealed, it will be wonderfully elegant and coherent and it will just fit with everything. And so it’s very easy, once the truth is revealed, to say, of course — if you actually know the explanation, sometimes you don’t. And so that’s one dynamic.

(00:46:41):

That essay is called “Predictable updating about AI risk,” and it’s about this experience of going through the last bit of 2022 and early 2023, as ChatGPT and GPT-4 are coming out and a lot of people are really waking up to this, and we’re seeing all this stuff, you’re seeing Bing Sydney rampaging around, and having this experience of stuff that had been in my abstract models of what is reasonably likely to happen with AI, really seeing it in the world, and noticing some updating that I was doing. I was like, whoa. Even though in some sense I had predicted it ahead of time. So I think there’s a bunch of psychological dynamics going on there, but I’ve definitely, definitely had that experience. When people were like, “the race is on,” I think it was the Microsoft CEO, it was like, the race is on, I’m like, oh gosh, we’ve been talking about races for a while and here you go. And yes, various things like that. Or I remember seeing AgentGPT in the browser for the first time, it was “create and deploy autonomous AI agents.” Now obviously those weren’t that powerful, but I was like, hmm. There’s a lot of things where you’re seeing headlines and you’re like, ah, I’m living in a world where those are the actual headlines. This is really happening. Now obviously stuff still might not happen. There’s tons of uncertainty remaining about where things will actually go. But I’ve definitely had the experience you’re talking about.

Dwarkesh Patel (00:48:16):

Yeah, I mean, the equivalent of ChatGPT for forecasting the future might’ve been: if you were thinking about AI in 2020, it’s like, I have this abstract model that we’re going to have bigger clusters, so states will get involved, especially once they understand the stakes. But actually seeing “China is launching cruise missiles at Taiwan,” that has a more visceral quality. Or the sense of, oh, these things are more personable and they’ll be part of our lives before we get to AGI: a bunch of your friends have AI girlfriends and they feel strong emotional connections. Even as I’m saying them, they don’t feel real. But this will be part of the world. If your model of the world is correct, it’s not just going to be the abstract hundred billion dollar cluster; it’s going to be a world in which concretely different actors are reacting to it, in which it’s been integrated into the world in ways which are influencing what you see on a day-to-day basis.

Joe Carlsmith (00:49:21):

I talked about in the essay, I think there’s a slightly reverse dynamic that can happen where something happens that for a while you thought, okay, well if that happens, then I’m freaking out. That’s totally the update. And then it happens and you don’t, because it’s happening in the mundane world, in some way where you would sort of imagine it’s like, well, if that happens, there’ll be this thing that’s very different from my actual world.

Dwarkesh Patel (00:49:45):

That’s right.

Joe Carlsmith (00:49:47):

Now obviously sometimes you don’t update because you actually know more about the situation and you’re like, oh, actually it’s fine. So maybe you’re like, oh, if AI can do chess and then actually it’s like deep blue and you’re like, oh, this isn’t what I thought it was going to be. But sometimes you’re just not, you thought it would be something other than the real world. And actually no, it’s the thing.

Dwarkesh Patel (00:50:07):

How do you think about Covid in this way? Because I think that influenced people’s day-to-day life in a way that I think even AI wouldn’t. But it was stunning how fast it wasn’t like, oh, I’m living through, what was the name of the movie?

Joe Carlsmith (00:50:22):

Oh, Contagion?

Dwarkesh Patel (00:50:22):

Yeah, it didn’t feel that way. It’s like, oh yeah, I’m not going to the gym anymore and my classes are online. And I feel like the AI thing will be even more … because you’re not having to lock down because of it.

Joe Carlsmith (00:50:34):

Yeah, I mean, I think there’s something nice about living through these events, of people doing all this epistemology and then getting the data on what actually happened. I think there’s some epistemic virtue in remembering, when you’re forecasting stuff, that the ultimate arbiter of what you’re doing is what actually happens. It’s easy to be doing it in this vaguely socially inflected way, where you’re trying to say stuff that people will nod along to or whatever, but I think it’s healthy to remember, no, there will be a real world event that makes my prediction false or true, and I will have been right or wrong, and being wrong will have been foolish, even if it was socially acceptable at the time or something like that.

Dwarkesh Patel (00:51:24):

And potentially consequential, if wrong.

Joe Carlsmith: Yeah, totally.

Deep atheism and trust in competition

Dwarkesh Patel: So let’s go back to the deep atheism thing. The story you get from Yudkowsky et al. is that the very nature of a smarter thing is that it’s going to be misaligned in a very alien way, not in a humans-to-monkeys sense, but almost in a gray-goo-to-human sense. Whereas another version of the story I’ve heard from people is: oh, you’ll have all kinds of intelligences, but the thing that will outcompete all the other ones in the sort of Malthusian struggle will be the one that creates these subsistence-level tortured ems and is just rapacious, and the gray goo will outcompete in an economic sense, rather than being what we should expect the natural consequence of intelligence to be. And it seems like an interesting way to bifurcate the doom discourse. I dunno what you make of that.

Joe Carlsmith (00:52:23):

So in the essay on deep atheism, it’s called “Deep atheism and AI risk,” I define deep atheism in terms of these two forms of mistrust. One is this fundamental mistrust towards nature. This deep integration of the understanding of nature as this indifferent force that is … you want to move things in a direction and that’s not necessarily the direction that nature will go. And then there’s this other form of mistrust, which is towards bare intelligence. And we touched on this earlier, this idea that it’s also not the case that there’s this very powerful convergent force, Mind, that just emerges over time. And as soon as you get Mind, then you converge on goodness and moral enlightenment and stuff like that. Mind, too, is ultimately a vehicle of nature. It goes back to Hume and this idea that reason is the slave of the passions, so it really depends what your passions are. So those are these two forms of mistrust.

(00:53:29):

I think you’re talking about this other form which I think is also quite related, which is a mistrust of competition. So maybe you’re like, ah, you can’t just let go and let nature do its thing. And you also can’t let go and just let mind do its thing. And similarly, you can’t just let go and let competition do its thing, because, well, what wins competition? Power. And power is not the same as goodness, as we discussed. So every opportunity to sacrifice goodness for the sake of power makes you more likely to win. So the scenario you’re talking about — I hope they’re not tortured — but there’s some sort of grinding of all value away, any slack, any ability to pursue anything else, until you have these optimized-purely-for-competition beings. And some people like that, or are maybe more okay with that. But I think that is another form of fear, right? Competition and related vibes: people are like, oh, just let evolution do its thing, or just let the market do its thing. The market is not unbridled competition, it’s structured by all sorts of things. But anyway: war!

(00:54:53):

If goodness is at all different from the most competitive thing, then if you truly select for the most competitive thing in some competitive landscape, you’re likely to grind out value entirely. So some people are worried about getting there. And I think there’s a different sort of worry, which is: before you get to this fully optimized competition place, something will stop the competitive process. It’ll be like, okay, no, I see where this competition is going and we need to stop the competition or we’re going to grind out all value. And in some sense, the paperclipper is maybe a version of that. And I think the Yudkowskian forecast imagines someone grabbing, someone locking something in, before we grind all the way down to pure competition, and all the good outcomes are in that bucket. I think the hope is that something, some kind of collective structure that we like, actually manages to avoid this descent into a raw competitive grind. But if something you don’t like jumps in, then you’re also screwed.

Dwarkesh Patel (00:56:08):

I actually think a lot of criticism of capitalism, now or in the past, has been motivated by the sense that unbridled competition leads somewhere that is globally bad or suboptimal, and to the extent that competition has worked in the context of capitalism, maybe it’s just that you can’t compete away the slack we care about, with individuality and art and science and consciousness, at least up till now, right? With normal competition. But the fact that, I don’t know, there’s a trend where GDP correlates with other things that we want the slack to contribute to: the countries that have higher GDP have more monuments, or whatever slack you might think about. Why should we not expect that to continue being the case?

Joe Carlsmith (00:56:58):

One thing I want to say just at the outset is this: the abstract ontology I just gave about, oh, there’s this competitive landscape and then this Malthusian descent, and then someone has to lock in… This is worryingly abstract. This is simplifying to the n-th degree. And so in reality, if we’re dealing with some empirical form of structured competition in a particular environment between particular agents, we should just look at that case. I think we shouldn’t just come in with some like, oh, I heard somewhere that competition always leads to gray goo, then maybe this soccer team, the soccer tournament, everyone will turn… no. And even: oh, the competitive landscape — you can have cycles in which A can beat B can beat C can beat A, right? There’s all sorts of ways in which we should just look at the actual cases when we’re talking about … so I don’t want to get too hung up on this abstract ontology.

(00:58:06):

I do think, I am not sitting here going like, oh, it looks like our few centuries of experience with capitalist competition in this giant industrial revolution resource-glut dream-time situation are enough to conclude that if you grind competition to the n-th degree that you’ll have monuments still. I think there’s conceptual … you can be like, well what are those monuments? How are they helping? Could you have maybe not made the monument and made another weapon or whatever?

(00:58:44):

I also think there’s some abstract game theoretic reason to think that you don’t end up at the crazy nano goo, which is something like: suppose you’ve got a set of agents and they’re sufficiently smart that they can see that this competition thing is going to happen, and suppose that they’re all also able to set up real agreements and treaties and binding structures of cooperation that will be Pareto improvements for all of them. So there’s an abstract discourse about why does anyone ever go to war, and there’s all sorts of reasons people go to war. But theoretically, I think in the limit as forecasting and commitment ability and all sorts of things mature, I think it’s at least reasonable to think that people will be able to make agreements that avoid these really destructive forms of competition.

(00:59:51):

And in fact, I think we should be thinking about that vibe in the context of AGI. You had Leopold on and he’s talking about this competition. I think we should be thinking hard about how do we agree ahead of time to avoid destructive forms of conflict. We should be thinking a lot about that even in this just normal human case. But I also think abstractly that applies… I think it’s reasonable to expect that if you have mature actors, that they would be able to agree to something that avoids the horrible case.

(01:00:37):

Yeah so, that was a bunch just on the abstract competition, capitalism leading to good stuff. Yeah, I guess maybe I’ll just stick with the dream-time point… I think it feels too possible that the situation we’re in, where we don’t have scarcity right now, we’re doing all this growth, there’s all these positive-sum trades, that that will not continue indefinitely unless we’re thoughtful.

Dwarkesh Patel (01:01:06):

On that point in particular, Dietrich Vollrath, the economist, wrote a blog post reviewing Brad DeLong's book about why the industrial revolution happened; just look up the Vollrath review of Slouching Towards Utopia. And he makes a really interesting point. There's a question of why GDP increased because of the industrial revolution, and also the question of why GDP per capita increased. Vollrath makes the point that if fertility had kept up during the course of the industrial revolution, GDP per capita actually wouldn't have increased, because with 2.5% economic growth the economy doubles roughly every generation, every 30 years or so, and that's enough time for the population to more than double. So GDP per capita could have been going down while the industrial revolution happened. And in fact in China, I dunno if this research has replicated or anything, but I remember seeing some study that between 1000 and 1840 GDP per capita in China actually declined, because more and more of the population had to be dedicated to farming less and less marginally arable land. Which is all to say, about the last 300 years of industrial revolution history: I agree there are a lot of contingent things here that we don't understand.
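
To make the doubling-time arithmetic above concrete, here is a minimal sanity-check sketch in Python. The 2.5% growth rate and 30-year population-doubling window are just the round numbers used in the conversation, not figures taken from Vollrath's post:

```python
import math

# At a constant 2.5% annual growth rate, GDP doubles in roughly 28 years
# (the "rule of 70" approximation: 70 / 2.5 = 28).
gdp_growth = 0.025
gdp_doubling_years = math.log(2) / math.log(1 + gdp_growth)
print(f"GDP doubling time at 2.5% growth: {gdp_doubling_years:.1f} years")

# For population to double over the same ~30-year window, it only needs to
# grow at about 2.3% per year, in which case per-capita growth is roughly
# the small residual between the two rates.
pop_growth_to_double_in_30y = 2 ** (1 / 30) - 1
print(f"Population growth that doubles in 30 years: {pop_growth_to_double_in_30y:.2%} per year")
```

Run as written, this prints a GDP doubling time of about 28.1 years and a required population growth rate of about 2.34% per year, which is the sense in which per-capita gains from the industrial revolution depended on fertility not keeping pace.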

Joe Carlsmith (01:02:32):

One thing I’ll just throw in there, I don’t know the quantitative dynamics, I haven’t thought about this very much, but presumably you’d also want to account for the contributions to growth of the increased population. You wouldn’t want to just look for the actual, but yeah…

Dwarkesh Patel (01:02:42):

That’s a good point. Sorry. Now as for the capitalism competition thing, so there’s an interesting intellectual discourse on let’s say the right wing side of the debate where they ask themselves, traditionally we favor markets, but now look where our society is headed. It’s misaligned in the ways we care about society being aligned, like fertility is going down, family values, religiosity, these things we care about, GDP keeps going up, these things don’t seem correlated. So we’re grinding through the values we care about because of increased competition and therefore we need to intervene in a major way. And then the pro market libertarian fashion of the right will say, look, I disagree with the correlations here, but even at the end of the day, fundamentally my point is, or their point is, liberty is the end goal. It’s not what you use to get to higher fertility or something. I think there’s something interestingly analogous about the AI competition, grinding things down. Obviously you don’t want the gray goo, but the libertarians versus the trads, I think there’s something analogous here. I dunno if you can speak to that. I’m having a hard time phrasing the analogy.

Joe Carlsmith (01:03:56):

Yeah, so I think one thing you could think, which doesn't necessarily need to be about gray goo, it could also just be about alignment, is something like: sure, it would be nice if the AIs didn't violently disempower humans. It would be nice if, otherwise, when we created them, their integration into our society led to good places. But I'm uncomfortable with the sorts of interventions that people are contemplating in order to ensure that sort of outcome. And I think there are a bunch of things to be uncomfortable about there.

(01:04:31):

Now that said, for something like everyone being killed or violently disempowered: if it's real, and obviously we need to talk about whether it's real, but in the case where it's a real threat, we traditionally think that quite intense forms of intervention are warranted to prevent that sort of thing from happening. So if there was actually a terrorist group that was working on a bioweapon that was going to kill everyone, or 99.9% of people, we would think that warrants intervention. You just shut that down. And even if you had a group that was unintentionally imposing a similar level of risk, I think many people, if that's the real scenario, will think that warrants quite intense preventative efforts. Now, obviously these sorts of risks can be used as an excuse to expand state power. There's a lot to be worried about for different types of contemplated interventions to address certain types of risks. I think there's no royal road there. You need to just have the actual good epistemology. You need to actually know: is this a real risk? What are the actual stakes? And look at it case by case and ask, is this warranted? So that's one point on the takeover, literal-extinction thing.

(01:06:00):

I think the other thing I want to say: I talk in the piece about this distinction between, well, let's at least have AIs that are minimally law-abiding, or something like that. There's this question about servitude and a question about other forms of control over AI values, but I think we often think it's okay to really want people to obey the law, to uphold basic cooperative arrangements, stuff like that. I do, though, want to emphasize, and I think this is true of markets and true of liberalism in general, just how much these procedural norms, like democracy, free speech, property rights, things that people really hold dear, including myself, are in the actual lived substance of a liberal state undergirded by all sorts of virtues and dispositions and character traits in the citizenry. These norms are not robust to arbitrarily vicious citizens. So I want there to be free speech, but I think we also need to raise our children to value truth and to know how to have real conversations. And I want there to be democracy, but I think we also need to raise our children to be compassionate and decent. I think we can lose sight of that aspect, but I think it's worth bringing it to mind. Now, that's not to say that should be the project of state power, but I think it's worth understanding that liberalism is not this ironclad structure where you can take any citizenry, hit go, and get something flourishing or even functional. There's a bunch of other softer stuff that makes this whole project go.

Is alignment parochial?

Dwarkesh Patel (01:07:43):

Maybe zooming out, one question you could ask is, and I dunno if Nick Land would be a good subject here, but there are people who have a sort of fatalistic attitude towards alignment as something that can even make sense. They'll say things like: look, the kinds of things that are going to be exploring the black hole at the center of the galaxy, the kinds of things that will go visit Andromeda or something, did you really expect them to privilege whatever inclinations you have because you grew up on the African savannah with the evolutionary pressures of a hundred thousand years ago? Of course they're going to be weird. And yeah, what did you think was going to happen?

Joe Carlsmith (01:08:29):

I do think that even the good futures will be weird. And I want to be clear: when I talk about finding ways to ensure that the integration of AI into our society leads to good places, I'm not imagining, as I think some people assume, that this project, especially to the extent it makes some deep reference to human values, involves a shortsighted, parochial imposition of our current unreflective values. They imagine that we're forgetting that for us, too, there's a reflective process and a moral-progress dimension that we want to leave room for. Jefferson has this line about how, just as you wouldn't want to force a grown man into a younger man's coat, so we don't want to chain civilization to a barbarous past, or whatever. Everyone should agree on that, and the people who are interested in alignment also agree on that. So obviously there's a concern that people don't engage in that process, or that something shuts down the process of reflection, but I think everyone agrees we want that.

(01:09:46):

And so that will lead potentially to something that is quite different from our current conception of what's valuable. There's a question of how different, and I think there are also questions about what exactly we're talking about with reflection. I have an essay on this. I don't actually think there's an off-the-shelf, pre-normative notion of reflection where you can just be like, oh, obviously you take an agent, you stick it through "reflection," and then you get values, right? No… really there's just a whole pattern of empirical facts about: take an agent, put it through some process of reflection, ask it questions, all sorts of things, and that'll go in all sorts of directions for a given empirical case. And then you have to look at the pattern of outputs and be like, okay, what do I make of that?

(01:10:39):

But overall I think we should expect that even the good futures will be quite weird. And they might even be incomprehensible to us, though I don't think so; there are different types of incomprehensible. So say I show up in the future and it's all computers, right? I'm like, okay, alright. And then they're like, we're running creatures on the computers. So I have to somehow get in there and see what's actually going on with the computers or something like that. Maybe I actually understand what's going on in the computers, but I don't yet know what values I should be using to evaluate that. So it can be the case that if we showed up we would not be very good at recognizing goodness or badness. I don't think that makes it insignificant, though. Suppose you show up in a future and it's got some answer to the Riemann hypothesis and you can't tell whether that answer's right; maybe the civilization went wrong. It's still an important difference, it's just that you can't track it. And I think something similar is true of worlds that are genuinely expressive of what we would value if we engaged in processes of reflection that we endorse, versus ones that have totally veered off into something meaningless.

(01:11:53):

And then I guess I also want to say: this thing about, oh your values, you on the savannah, this sort of derisive talk about, oh you evolved… you’re a monkey, and you’ve got your drives. The AI is like: how dare you? I’m like: so here’s a value that I have. I’m against torture. I think if there’s torture, say someone’s being tortured and I can stop it, I want to stop it. So do I say, oh, but my parochial evolutionary, oh gosh, these drives. I’m like, no. And I think the same about the future. I think there’s a lot of questions to be raised about what is the right way to exert influence over the future. But if you ask me, Joe, how do you feel about preventing the future from being filled with torture, given that that’s this parochial monkey drive? I’m like, no, yeah, let’s do it.

(01:12:53):

It is possible to become sufficiently alienated from your values that you lose touch with that. And I do think the Nick Landian world is caught up with all sorts of nihilism and weird refracted self… I was going to say mutilations, I don't think that's true. But anyway, there's a whole thing going on there, and I'm like, that's not where I'm at. There's a thing which is: looking out of your own eyes, recognizing that your values are actually yours, it's not the universe, it's you, but maintaining a live connection with what your values actually are. And I think seeing through your values. If you say, oh, me and my preferences, the point of preventing torture is not that your preferences be satisfied. That's looking *at* your values. But you don't want your values, you want the no-torture. You're looking through your values at the torture and saying no. I think people can lose that when they think of it like "I'm just maximizing my values." The concept of my values loses its oomph, because that's not what you're doing. I don't care what my values are; if you changed my values and they were pro-torture, I'm not into those values, right? I'm against the thing that I'm seeing. Anyway, so I'm resistant to dismissive talk of the savannah.

Dwarkesh Patel (01:14:16):

Yeah, I don’t know how big a crux this is, but there does seem a thing which is like: I do not want the future to contain trillions of things getting tortured and that feels like God, that better be, yes, a hundred percent. Then there feels like other things which are still my values, but I’m like, I get it. What’s a good example here? I care about beautiful kinds of art and all right, you guys are just some sort of Dyson sphere, there’s no room to put the canvas outside the Dyson sphere, I get it. And maybe you can still put it in the same terminology of, but my meta value is this value matters a lot more to me kind of thing. So maybe there’s no crux there.

Joe Carlsmith (01:15:12):

No, I mean I think it's important. So people do have this intuition that some values of their own feel to them less like values they're interested in imposing. They're like, ah, yeah, that's definitely a me thing. Or rather, it's all a you thing, but it's a you thing in a way that your values themselves are structuring differently. Again, I think when people think about a good scenario with alignment, they sometimes worry that it's like, yeah, and then everyone will listen to my favorite band or something. Or say you don't like noise music or something; what they imagine is, and then we shall have no noise music. And I'm like, so take the noise music case. There's a thing which I think you could value, which is the person who loves noise music and what's going on with them. Maybe you yourself don't like noise music, but they're having some experience that actually you're thumbs-up to if you saw it, just on your own terms. And then it can also be the case that you are just excited about other beings getting what they want. You meet some aliens doing some totally alien thing and it can be the case that you're like, awesome, dude. So again there's this question of, okay, does it need to be conscious? Does it need to be taking pleasure in the thing? And we can talk about that, but I think people should be open to the possibility that by their own values, you can have this complicated structure where you yourself are very happy with many flowers blooming, including ones that don't directly appeal to you. So I think there's a bunch of structure there to keep in mind, all of which on an anti-realist story is coming ultimately from you.

Dwarkesh Patel (01:17:02):

One thing I’ve heard from people who are skeptical of this ontology is be like, alright, what do you even mean by alignment? And obviously the very first question you answered, you expressed, here’s different things that could mean, do you mean balance of power? Do you mean somewhere between that and dictator or whatever? Then there’s another thing which is, separate from the AI discussion, I don’t want the future to contain a bunch of torture. And it’s not necessarily a technical, I mean part of it might involve technically aligning GPT-4, but it’s like, that’s not what I mean. That’s a proxy to get to that future. The sort of question then is what we really mean by alignment, is it just like whatever it takes to make sure the future doesn’t have a bunch of torture or do we mean what I really care about is in a thousand years things that are clearly my descendants, not some thing where I recognize they have their own art or whatever. It’s like no, no, it’s like if it was my grandchild, that level of descendants is controlling the galaxy even if they’re not conducting torture. And I think what some people mean is our intellectual descendants should control the lightcone even if the other counterfactual doesn’t involve a bunch of torture.

Joe Carlsmith (01:18:17):

So I agree, I think there are a few different things there. There's: what are you going for? Are you going for actively good, or are you going for avoiding certain stuff? And then there's a different question, which is what counts as actively good according to you. So maybe for some people the only things that are actively good are… my grandchildren, or some literal descending genetic line from me, or something. I'm like, well, that's not my thing. And I don't think it's really what most people have in mind when they talk about goodness. I mean, I think there's a conversation to be had, and obviously in some sense when we talk about a good future, we need to be thinking about what are all the stakeholders here and how does it all fit together. But yeah, when I think about it, I'm not assuming that there's some notion of descendants. I think the thing that matters about the lineage is whatever's required for the optimization processes to be in some sense pushing towards good stuff. And there's a concern that currently a lot of what is making that happen lives in human civilization in some sense. There's some seed of goodness that we're carrying, in different ways or different people, and maybe there are different notions of goodness for different people, but there's some seed that is currently here that we have, that is not just in the universe everywhere. It's not just going to crop up if we die out or something. It's something that is contingent to our civilization. Or at least, that's the picture; we can talk about whether that's right. And so the sense in which stories about good futures that have to do with alignment are about descendants, I think it's more about whatever that seed is: how do we carry it, how do we keep that thread of life alive going into the future?

Dwarkesh Patel (01:20:41):

But then I’m like one could accuse sort the alignment community of a sort of motte and bailey, where the motee is: we just want to make sure that GPT-8 doesn’t kill everybody. And after that it’s like all you guys, we’re all cool. But then the real thing is we are fundamentally pessimistic about historical processes in a way that doesn’t even necessarily implicate AI alone, but just the nature of the universe. And we want to do something to make sure the nature of the universe doesn’t take a hold on humans, on where things are headed. So if you look at Soviet Union, the collectivization of farming and the disempowerment of the kulaks was not as a practical matter necessary. In fact, it was extremely counterproductive. It almost brought down the regime and it obviously killed millions of people, caused a huge famine, but it was sort of ideologically necessary in the sense that we have an ember of something here and we got to make sure that enclave of the other thing doesn’t… if you have raw competition between the kulak type capitalism and what we’re trying to build here, the gray guo of the kulaks will just take over. And so we have this ember here, we’re going to do worldwide revolution for a it. I know that obviously that’s not exactly the kind of thing alignment has in mind, but we have an ember here and we got to make sure that this other thing that’s happening on the side doesn’t, obviously that’s not how they would phrase it, but get it’s hold on what we’re building here. And that’s maybe the worry that people who are opposed to alignment have is, you mean the second kind of thing, the kind of thing that maybe Stalin was worried about even though obviously you wouldn’t endorse the specific things he did.

Joe Carlsmith (01:22:30):

Yeah, so there’s a bunch there. I do agree that I think the discourse about AI alignment mixes together these two goals that I mentioned. So I mentioned three goals. There was don’t be killed and more broadly have AIs adhere to minimal standards of decency. Then there was make sure that AIs are instruments of the human will. And then there was ensure good outcomes, in general. I do think that the most straightforward thing to focus on, and I don’t blame people for just talking about this one, is just the first one. So just I think it’s quite robust according to our own ethics when we think about: in which context is it appropriate to try to exert various types of control, or to have more of what I call in the series “yang,” which is this active controlling force, as opposed to “yin,” which is this more receptive, open, letting go. A paradigm context in which we think that is appropriate is if something is an active aggressor against the sort of boundaries and cooperative structures that we’ve created as a civilization. So I talk about the Nazis in the piece. If something is invading, we often think it’s appropriate to fight back, and we often think it’s appropriate to set up structures to prevent and ensure that these basic norms of peace and harmony are adhered to. And I do think some of the moral heft of some parts of the alignment discourse comes from drawing specifically on that aspect of our morality. The AIs are presented as aggressors that are coming to kill you. And if that’s true then it’s quite appropriate to really be like, okay … that’s classic human stuff. Almost everyone recognizes that self-defense or ensuring basic norms are adhered to is a justified use of certain kinds of power that would often be unjustified in other contexts. Self-defense is a clear example there.

(01:25:02):

I do think it’s important though to separate that concern from this other concern about where does the future eventually go and how much do we want to be trying to steer that actively. And so I also think that’s very important. It’s really, really important, where the future goes, and I think different people might place different amounts of concern about that. I care a lot about that. But I do want to recognize that it’s a different thing, and it’s a different discussion. If you imagine a case where we’ve ensured that basic liberal norms are in place, including with the AIs, now the question of what steering are we doing takes on, I think, a different tone and I talk a bunch about that in the series. So to some extent I wrote the series partly in response to the thing you’re talking about, which is I think it is true that aspects of this discourse involve the possibility of trying to grip, I think trying to steer and grip, and you have the sense that the universe is about to go off in some direction, and you need to …and people notice that muscle and they’re kind like, whoa.

(01:26:24):

And part of what I want to do is say: well, we have a very rich human ethical tradition of thinking about when it is appropriate to try to exert what sorts of control over which things, and I want us to bring the full force and richness of that tradition to this discussion. It's easy, if you're purely in this abstract mode of utility functions, the human utility function and this competitor thing with its utility function, to lose touch with the complexity of how we actually… we've been dealing with differences in values and competitions for power for a long time, this is classic stuff, and I think AI amplifies a lot of the dynamics, but I don't think it's fundamentally new. And so part of what I'm trying to say is, well, let's draw on the full wisdom we have here, while obviously adjusting for ways in which things are different.

The role of space

Dwarkesh Patel (01:27:15):

So one of the things the ember analogy brings up, along with getting a hold of the future, is: we're going to go explore space, and that's where we expect most of what will happen to happen. Most of the people who will ever live, they'll live in space. And I wonder how much of the high stakes here is not really about AI per se, but about space. It is a coincidence that we're developing AI at the same time as we're on the cusp of expanding through most of the stuff that exists.

Joe Carlsmith (01:27:46):

So I don’t think it’s a coincidence in that I think the central the way we would become able to expand, or the most salient way to me, is via some radical acceleration of our …

Dwarkesh Patel (01:28:01):

Sorry, let me clarify then. The stakes here: if this was just a question of, do we do AGI and explore the solar system, and there was nothing beyond the solar system, we foom and weird things might happen with the solar system if we get it wrong, I feel like compared to that, billions of galaxies has a different sort of… that's what's at stake. I wonder how much of the discourse hinges on the stakes being what they are because of space.

Joe Carlsmith (01:28:26):

I mean, I think for most people, very little. I think people are really like: what's going to happen to this world, this world around us that we live in? What's going to happen to me and my kids? Some people spend a lot of time on the space stuff, but I think the most immediately pressing stuff about AI doesn't require that at all.

(01:28:50):

I also think that even if you bracket space, time is also very big. We've got whatever it is, 500 million years, a billion years left on Earth if we don't mess with the sun, and maybe you could get more out of it. So I think… that's a lot.

(01:29:10):

But I don’t know if it fundamentally changes the narrative. Obviously insofar as you care about what happens in the future or in space, then the stakes are way smaller if you shrink down to the solar system. And I think that does change potentially some stuff in that a really nice feature of our situation right now depending on what the actual nature of the resource pie is, is that in some sense there’s such an abundance of energy and other resources in principle available to a responsible civilization, that really just tons of stakeholders, especially ones who are able to saturate, get really close to amazing according to their values with comparatively small allocations of resources … I feel like everyone who has satiable values, who will be really, really happy with some small fraction of the available pie, we should just satiate all sorts of stuff. And obviously you need to do figure out gains from trade and balance and there’s a bunch of complexity here, but I think in principle we’re in a position to create a really wonderful, wonderful scenario for just tons and tons of different value systems. And so I think correspondingly we should be really interested in doing that.

(01:30:43):

So I sometimes use this heuristic in thinking about the future: I think we should be aspiring to really leave no one behind. Really find out who all the stakeholders are here, and how we can have a fully inclusive vision of how the future could be good from a very, very wide variety of perspectives. And I think the vastness of space resources makes that very feasible. Now, if you instead imagine it's a much smaller pie, well, maybe you face tougher trade-offs. So I think that's an important dynamic.

Dwarkesh Patel (01:31:18):

Is the inclusivity because part of your values includes different potential futures getting to play out, or is it because of uncertainty about which is the right one? So, let's make sure that if we're wrong we're not nulling out all value.

Joe Carlsmith (01:31:39):

I think it’s a bunch of things at once. So yeah, I’m really into being nice when it’s cheap. I think if you can just help someone a lot in a way that’s really cheap for you, do it, right? Or I don’t know, obviously you need to think about trade-offs and there’s a lot of people who in principle you could be nice to. But I think the principle of be nice when it’s cheap, I’m very excited to try to uphold,. I also really hope that other people uphold that with respect to me, including the AIs, right? I think we should be golden ruling. We’re thinking about, oh we’re going to inventing these AIs. I think there’s some way in which I’m trying to embody attitudes towards them that I hope that they would embody towards me, and it’s unclear exactly what the ground of that is. But that’s something… I really like the golden rule, and I think a lot about that as a basis for treatment of other beings. And so I think be nice when it’s cheap is, if you think about it, if everyone implements that rule, then we get potentially a big Pareto improvement, well I dunno exactly pareto improvement, but it’s like good deal. It’s a lot of good deals.

(01:32:50):

So I think it’s that … I’m just into pluralism, I’ve got uncertainty, there’s all sorts of stuff swimming around there. And then I think also just as a matter of having cooperative and good balances of power and deals and avoiding conflict, I think finding ways to set up structures that lots and lots of people and value systems and agents are happy with, including non-humans, people in the past, animals. I really think we should have very broad sweep in thinking about what sorts of inclusivity we want to be reflecting in a mature civilization and setting ourselves up for doing that.

Enslaving Gods vs. losing control?

Dwarkesh Patel (01:33:33):

So I want to go back to what our relationship with these AIs should be, because pretty soon we're talking about our relationship to superhuman intelligences, if we think such a thing is possible. And so there's a question of what process you use to get there, and the morality of gradient-descenting on their minds, which we can address later. The thing that personally gives me the most unease about alignment is that at least part of the vision here sounds like you're going to enslave a god. And slavery itself sounds bad; enslaving a god sounds worse than normal slavery. We don't have to compare different kinds of slavery here, so that's not necessarily relevant, but there's just something that feels so wrong about that. But then the question is, if you don't enslave the god, obviously the god's going to have more control. Are you okay with that? You're going to surrender most of everything, obviously, you know what I mean? Even if it's a cooperative relationship, that thing's going to…

(01:34:42):

So maybe a thought experiment that I've been wondering about recently is: what is the relationship a parent should have to a child, especially as the child is growing up and performing really well in the world, and the parents are aging and in decline? And there are two potential failure modes. One is a sense of envy and a desire for control over the child's life, really resenting it as the child starts to explore a different career path, different ideologies, different whatever. And the other is: if my child steals my Social Security check, well, good for them, they've shown their greater savvy. And I feel like that's more of a Nick Land vibe. Is that a useful analogy for the kind of relationship we want to have to the ASIs?

Joe Carlsmith (01:35:39):

So maybe let’s first talk about this notion of enslaved gods. I think we as a civilization are going to have to have a very serious conversation about what kind of servitude is appropriate or inappropriate in the context of AI development. And I think there are a bunch of disanalogies from human slavery that I think are important. In particular, the AIs might not be moral patients at all, in which case, so we need to figure that out. There are ways in which we may be able to… Slavery involves all this suffering and non-consent. There’s all these specific dynamics involved in human slavery, some of those may not or may not be present in a given case with AI. So I think that’s important. But I think overall we are going to need to stare hard at: right now the default mode of how we treat AI gives them no moral consideration at all. We were thinking of them as property, tools, as products, and designing them to be assistants and stuff like that. There has been no official communication from any AI developer as to when under what circumstances that would change. So I think there’s a conversation to be had there, that we need to have. And I think so there’s a bunch of stuff to say about that.

(01:37:15):

I want to push back on the notion that there are two options: there's the enslaved god, whatever that is, and there's loss of control. I think we can do better than that. Let's work on it. Let's try to do better. I think we can do better, and it might require being thoughtful, and it might require having a mature discourse about this before we start taking irreversible moves. But I'm optimistic that we can at least avoid some of the connotations and a lot of the stuff at stake in that binary. So that's one point. And I think we don't do ourselves a service by thinking in that crude binary, where it's either you should just unleash whatever, gray goo or Nick Landian brute power, or you must enslave the god.

(01:38:10):

I also want to say: I think sometimes people come into this discourse and they think, okay, so there are these AI safety people, and these AI safety people are really all about enslaving gods, and then there are the people who aren't AI safety people, who presumably are not, right? But I don't think that's it at all. If you look at the mainstream discourse about acceleration, and people at AI labs who are not concerned about alignment, their project is not to build free citizen AIs or something like that. We can talk about that project; I think it's an interesting project. That is not what we are currently doing with our AIs. So I think it's important to see that what's going on with people who aren't concerned about alignment is just that they think alignment is easy. And there's a different conversation we can have about the way we might be going wrong at a more holistic level, ways in which we might look back on this and be like, wow, we were thinking of this using totally the wrong frame. We were thinking of ourselves as developing tools, products, and then sometimes someone over here is like, "wait, is that a moral patient?" Whereas I think there are civilizations that would do this in a very different way, and that from the get-go would be saying, "we're designing new creatures, new beings that we're going to share this world with, creatures that might have moral patienthood, that might matter, that we might have duties towards." And they would be approaching it, I think, in a very different way. So I'm very interested in that conversation. I just think it doesn't track the mainstream political binaries, the discourse about AI and acceleration and stuff like that. And for people who want to claim the flag of "I'm on the side of the free AIs" while in practice their politics is unbridled exploitation of AI labor, I dunno, it doesn't fit to me. I think sometimes people want to have that both ways. So those were a few comments on enslaved gods.

(01:40:13):

On parenthood and children: I think there's a really interesting, and in some sense underdeveloped, ethics of what sort of respect a child should have towards a parent, and how a parent should relate to a child. And I do think there are all sorts of interesting parallels here when we talk about shaping the values of a creature. Parenthood is this paradigm example of shaping a creature's values, and I think that's pretty interesting: how do you do that right? And there are people who are concerned that we do that too much, that we're too heavy-handed with children, that we're not giving children enough respect. But that's one of the key places in which there's a lived ethic of intervening on or influencing the values of another creature. But it's within this very restricted option space. It comes with this default of: there's a bunch of stuff that the child's organic development will just do. So you're not creating the thing from scratch; you've got this thing that has its own organic process and you're intervening on the edges. And you also have very blunt instruments. You don't have the capacity to reach in and gradient-descent the kid's brain or whatever. And I think we have an ethical tradition that's structured in accordance with this option set and this set of defaults and availabilities.

(01:42:00):

And now along comes AI, which is very different. So we have much more power over AI minds — or at least, people exert that power right now. Also, there's no default. Either you're going to create the AI's initial values or you're going to have some other process do it; maybe you randomize them, or maybe you just do whatever's most convenient commercially, or something like that. So there's a greater responsibility in some sense, because we are exerting more influence ahead of time on what the initial thing is that we are then shaping further. Anyway, I think there are tons of parallels there and tons of guidance we can seek, but also a bunch of ways in which our norms are going to have to deal with questions that our norms of parenting don't have to grapple with, or haven't had to grapple with.

Dwarkesh Patel (01:42:54):

And I guess I still don't know then. Okay, so maybe we don't need to think about it in the sense of, I don't know, a dragon or something, where either you have it in chains or it's going to burn down your village. But how should we think about it then? Okay, we're going to have superintelligences and they're not going to be chained. Conventionally what people say is, listen, if we're going to be realistic, what we really mean is we're going to merge; the merge is the only viable path. I don't know if that's what you mean, but okay, so then what should our relationship with these things be, in a way that gives us a future we would be happy with?

Joe Carlsmith (01:43:31):

I mean I think it’s a great question. I do think a salient version is them being nice. I think you could have kind unchained beings that are nice and there’s a question of what degree of nice, what degree of servitude … so I think there’s a conventional image of alignment, which is something like you have these superintelligences and really they’re assistants in some sense, their focus is on humans and on promoting, it’s like megaClaude, right? It’s there, it’s doing everything for you. Maybe everyone has one or I think there’s likely a bunch of other options. So that raises these questions about what sort of servitude vibes are you down with? And I think if you depart from that, then you do need to think about: what are the other sorts of arrangements, what are the other sorts of niceness that won’t involve chains.

(01:44:29):

I think the dichotomy between chains and murder, or burning the village, is also too quick. How do we structure various forms of human freedom? Well, we have a bunch of different incentives and checks and balances, we have some fairly hard constraints, we have a legal system, we have courts. So there's a bunch of stuff there as well, and you'd hope… We can talk about how much that amounts to chains, and some people are like: we're already in chains, this is a shadow of liberty. I'm not sure. But I do think we should be pretty wary of missing middle zones that are more appetizing than these extremes.

Dwarkesh Patel (01:45:21):

Yeah, in fact, I think more humans is actually a great analogy here. Take the fertility worries we have today, where the human population will decline. It's not rational to be happy about that because your relative power might go up with fewer humans. In fact, we want more humans, because the pie grows faster the more people there are; the frontier expands, and therefore we're all better off. And it's possible to have that attitude towards ASIs: obviously ASIs will enable us to do things we couldn't do ourselves, similar to how billions of extra people enable things to happen that we can't do ourselves. But with respect to how we treat the AIs, I have a couple of contradicting intuitions, and the difficulty with using intuitions in this case is that it's not clear what reference class an AI we have control over falls into. To give one intuition that's very scared of the things we're going to do to these systems: if you read about life under Stalin or Mao, there's one version of telling it which is actually very similar to what we mean by alignment, which is, we do these black-box experiments where we make it think it can defect, and if it does, we know it's misaligned. Mao had the Hundred Flowers Campaign: let a hundred flowers bloom, I'm going to allow criticism of my regime and so on. That lasted for a couple of years, and afterwards everybody who did criticize... it was a way to find the snakes, the rightists who were secretly hiding, and purge them. The sort of paranoia about defectors: anybody in my entourage, even in my regime, could be a secret capitalist trying to bring down the regime. That's one way of talking about these things which is very concerning. Is that the correct reference class?

Joe Carlsmith (01:47:17):

I certainly think concerns in that vein are real. I think it is disturbing how easily many of the analogies with human historical events and practices that we deplore, or at least have a lot of wariness towards, come up in the context of the way you end up talking about AI: maintaining control over AI, making sure that it doesn't rebel. I think we should be noticing the reference class that some of that talk starts to conjure. So basically, yes, I think we should really notice that. Part of what I'm trying to do in the series is to bring the full range of considerations at stake into play. I think it is both the case that we should be quite concerned about being overly controlling or abusive or oppressive… there are all sorts of ways you can go too far. And there are concerns about the AIs being genuinely dangerous and genuinely acting against us, killing us, violently overthrowing us. And I think the moral situation is quite complicated.

(01:48:50):

Often, when you imagine an external aggressor who's coming in and invading, you feel very justified in doing a bunch of stuff to prevent that. It's a little bit different when you're inventing the thing, and you're doing it incautiously. There's a different vibe in terms of the overall justificatory stance you might have for various types of more power-exerting interventions. So that's one feature of the situation. Basically, I think it is important to notice that. I agree. I think we need to be having serious conversations about that. And I also think we want to have the full range of considerations at stake.

(01:49:46):

I will say, on some of the concerns about how the AIs are being treated… what is it like to be an AI in one of these labs, or something like that: I do actually think research on alignment is useful both from the perspective of helping the AIs not kill us, or ensuring other types of good outcomes that we want, and from the perspective of understanding better what's going on with these AIs. Are they moral patients? Do they have desires that are being thwarted? Et cetera. So if you think about research that helps us understand what these systems want, what's actually going on in their minds, interpretability research, other sorts of things, experiments that can help elicit various preferences, I think that helps you both understand what AIs might do if integrated into our society, and also helps you understand: are their preferences being thwarted? Is this actually bad for them according to various moral views? So I actually think there is some convergence there. This image of the AIs really wanting to be doing something different: that's bad from an alignment perspective, and it's bad from a moral patienthood perspective.

Dwarkesh Patel (01:50:58):

I think this case is especially tricky, because with a human you could ask a slave, hey, do you want to be enslaved? And you can't train them to not say no, especially if they know it's not a red-team sort of thing; they can just say it. Here, you can train a model to say there's nothing wrong with controlling an AI. Obviously in RLHF or something, you're going to give it labels that tell it… if you ask it a question like, hey, are you sad right now? Do you want to be answering my questions? It's going to answer with an eager yes, right? I guess this actually goes back to the C.S. Lewis Abolition of Man thing, where, because you are the one who made its drives, you can't really interrogate it. We can't ask Claude, hey, do you want a vacation… or, you know what I mean?

Joe Carlsmith (01:51:49):

I wouldn’t go that far. So I think if we had a real science of AI motivation I think we could design systems such that they have a certain set of values that we chose or gave them and such that they’re being honest when we ask them about those values. So I don’t have the view that any value that was designed or created by an external process is thereby unreal or masking some deeper inner nature that the AI has. I talk in the series, if you know that Bob is a lover of joy and you could create Bob who loves joy, or you could create Sally who loves paperclips, in some sense you’re choosing, I mean contra Lewis, I don’t think this is a coercing of Bob’s situation, if it’s truly de novo, it’s not like you create some being and then gradient descent it to love joy, it’s just de novo, boop, here’s Bob. He loves joy. I think that that’s just as real a preference as anything. I think if I was created that I would not be like, ah, therefore my values are unreal or anything like that. I think the same is true of Bob. And so I don’t think the bare fact that we’re creating them makes it impossible to see their true preferences.

(01:53:23):

I think we can create their true preferences, but we have no idea how to do that right now. We're nowhere near the sort of science of AI motivation that would allow us to actually do that, let alone in a way that doesn't raise these questions about whether gradient descent is appropriate and stuff like that. And so we're in a much worse situation, where we can either force verbal behavior, or not force verbal behavior and hope that that's more reflective of something true. There's genuine conceptual and empirical… there's just a big mess in the current situation about how we understand what it would be for an AI to have real preferences, and the relationship between those preferences and its verbal behavior. We're just at square one on that. And so it's a very unfortunate epistemic situation, and one I hope we can remedy as we start to think about this more and understand these systems better, and as people attend to that issue.

Dwarkesh Patel (01:54:23):

On the C.S. Lewis example, of whether you're practicing coercion by creating a thing that wants one kind of thing rather than another, here's another thought experiment. Suppose some government has a program where they subsidize, I dunno, some sort of polygenic score: they figure out how conformist the embryo is on some conformity polygenic score, and then they give you money if you implant the embryo that has the higher conformity score. And suppose that in a utilitarian sense this isn't bad, because the government is fine; the only thing it does is reinforce the government's power in a way that doesn't degrade global utility. The kid who didn't get born because he was a non-conformist according to the polygenic score, he doesn't exist yet. The kid who is conformist obviously prefers to exist rather than not exist. So what should we say? I feel like this thought experiment is just, no, that still feels wrong.

Joe Carlsmith (01:55:32):

I agree there’s something, yeah … there’s a bunch of that’s creepy about that situation. I wouldn’t actually diagnose that as being about like, oh, you coerced that hypothetical kid who was less conformist into not existing such that a different kid existed. I don’t think that’s where my attention goes in that scenario. My attention goes to, yeah, what’s happening with government power, this particular exertion of government power, where my head goes.

(01:55:59):

Basically, I think a lot of people come in with this Lewisian view that somehow creating one being rather than another is equivalent to brainwashing a being. I think these are importantly different, and I think it's important to see that as we enter an era of potentially doing a lot more creating of creatures.

Are concerns about treatment of AIs too vibes-based?

Dwarkesh Patel (01:56:33):

The opposite perspective here is that you're doing this sort of vibes-based reasoning of, ah, doing gradient descent on these minds looks yucky. And in the past, a couple of similar cases might've been something like environmentalists not liking nuclear power because the vibes of nuclear don't look green. But obviously that set back the cause of fighting climate change. And so the end result, a future you're proud of, a future that's appealing, is set back because of your vibes: it would be wrong to brainwash a human, but you're trying to apply that to this analogous case where it's not as relevant. And there's a couple of other things like this where it's just not relevant in the way it is for humans, because, the thing you mentioned, liberalism has certain assumptions about what kind of thing you are. We'll allow you to multiply as many times as you want because humans can only multiply so fast, so it's fine, and we don't want to get in the way of that freedom if we don't need to. And yes, slavery has a similar sort of valence. So these things have a certain connotation in the human context because we have a history with these words. But with AI… here's an example that, even though it's not analogous in this way, actually really reinforces the point: the Czarist regime. All the communists who later did the atrocities in Russia, many of them were at some point locked up and jailed in Siberia. And in retrospect you're kind of just like, I wish you'd just killed them all. I wish you hadn't bothered with five-year sentences or whatever. Really, this is really dangerous stuff. You should have just killed them all.

Joe Carlsmith (01:58:22):

So there’s a number of things to unpack there. So on the gradient descent brainwashing piece, I would reject that these ethical concerns are vibes based. I do think the series has various vibes, but I don’t think that ethical concern about AI at least needs to be vibes-based. I think we can just say, you can just observe: okay, so in the human case we would think blah about a given form of treatment, and then you can just ask in what respects is the AI case analogous and disanalogous? And I think you could just reason about that in a normal way. And in some sense, I think this is what philosophy and ethics is all about. It is saying like, ah, here’s this one case, here’s another one. We have a principle we think applies here. What exactly grounds that principle? And then there’s obviously empirical uncertainty, which I think is massive with AI, about even if you had a story about exactly the conditions under which a given sort of treatment is appropriate or inappropriate, we will not necessarily know whether those conditions obtain with AIs. So I don’t think that’s pure vibes.

(01:59:33):

I do think there’s a concern here that I really try to foreground in the series that I think is related to what you’re saying, which is something like: you might be worried that we will be very gentle and nice and free with the AIs … and then they’ll kill us. They’ll take advantage of that and it will have been like a catastrophe. And so I open the series basically with an example that I’m really trying to conjure that possibility at the same time as conjuring the grounds of gentleness and the sense in which it is also the case that these AIs could be, they can be both be like others, moral patients, this new species in the sense of that should conjure wonder and reverence and also such that they will kill you. So I have this example of like, ah, this documentary Grizzly Man where there’s this environmental activist Timothy Treadwell and he aspires to approach these grizzly bears, he lives, in the summe, he goes into Alaska and he lives with these grizzly bears. He aspires to approach them with this gentleness and reverence. He doesn’t use bear mace, he doesn’t like carry bear mace. He doesn’t use a fence around his camp. And he gets eaten alive by the bears, or one of these bears. And I really wanted to foreground that possibility in the series. I think we need to be talking about these things both at once. Bears can be moral patients. AIs can be moral patients. Nazis are moral patients. Enemy soldiers have souls. And so I think we need to learn the art of hawk and dove both. There’s this dynamic here that we need to be able to hold both sides of as we go into these tradeoffs and these dilemmas and all sorts of stuff. And part of what I’m trying to do in the series is really bring it all to the table at once.

Do these considerations change our bottom line?

Dwarkesh Patel (02:01:33):

Yeah, in fact, that’s probably a better example of what I meant. Like Hitler getting the Weimar Republic letting go of Hitler, moral patient he might be, but bad call. So maybe then, okay, we’re trying to hold these different things in our mind at the same time, but what is different about the world in which we hold these things in our head at the same time versus the world where we’re just at a high level future weird things might happen. We got to do something. We have all these considerations and have they just rubbed themselves out in the default you’re going to do anyways. Basically, what good has all this philosophy gotten you where a lot of contradicting considerations at the end of the day is it? You know what I mean? What is novel, like the sum that is above what you had before?

Joe Carlsmith (02:02:25):

So sometimes when you think about things, you do things differently. So that’s a basic option. That’s classically what considerations can do, the functional role of a consideration is to influence your decision. So it’s useful to have all of the considerations on the table, even if it doesn’t…

(02:02:49):

I’m a big believer in looking yourself in the eye. So if you’re going to do something, if you’re going to exert some sort of control over an AI system that there’s a pro tanto moral objection to, and you’re doing it out of some concern for the danger that the AI might pose, which I think we may well end up in that situation, where that is the right choice. But I want us to be looking at what we’re doing. I think morality asks you, at the least, or I dunno exactly what the norm here is. I think we should know who we are. We should know what stands we’re taking, what values we’re trading off against different things. And I think we should not pre-judge and be like, okay, cool. We know we’re going to do this. Let’s pretend that the other considerations weren’t even a thing. Right? I think that’s not a good way to end up sensitive to when the considerations go in different directions. And I think it’s not a morally mature stance in general.

Moral realism

Dwarkesh Patel (02:03:54):

I think the big crux for me, if I were today to massively change my mind about what should be done, is just the question of how weird things end up by default, how alien they end up. And a big part of that story: you made a really interesting argument in your blog post that if moral realism is correct, that actually makes an empirical prediction, which is that the aliens, the ASIs, whatever, should converge on the right morality the same way they converge on the right mathematics. I thought that was a really interesting point. But there's another prediction that moral realism makes, which is that over time society should become more moral, become better. And to the extent we think that's happened, of course there's the problem of: what morals do you have now? Well, the ones that society has been converging towards over time. But to the extent that it's happened, one of the predictions of moral realism has been confirmed, which means: should we update in favor of moral realism?

Joe Carlsmith (02:05:01):

One thing I want to flag is that I don't think all forms of moral realism make this prediction. So that's just one point; I'm happy to talk about the different forms I have in mind. I think there are also views that look like moral anti-realism, at least in their metaphysics, according to me, which just posit that in fact there's this convergence. It's not in virtue of interacting with some kind of mind-independent moral truth; it's just for some other reason the case that… and that looks a lot like moral realism at that point. Kind of like, oh, it's really universal, everyone ends up here, and it's kind of tempting to be like, ah, why? And then whatever the answer to the why is, it's a little bit… is that the Tao? Is that the nature of the Tao? Even if there's not an extra metaphysical realm in which the moral truth lives or something. So, moral convergence I think is a different factor from the existence or non-existence of a non-natural morality that's not reducible to natural facts, which is the type of moral realism I usually consider. Okay.

(02:06:05):

Does the improvement of society, is that an update towards moral realism? I mean, I guess maybe it’s a very weak update or something. I’m like: which view predicts this harder? I guess it feels to me like moral anti-realism is very comfortable with the observation that…

Dwarkesh Patel (02:06:32):

People with certain values have those values…

Joe Carlsmith (02:06:33):

Well yeah, so there’s obviously this first thing, which is if you’re the culmination of some process of moral change, then it’s very easy to look back at that process and be like moral progress, like the arc of history bend towards me. You can look more if it was like if there was a bunch of dice rolls along the way, you might be like, oh wait, that’s not rationality, that’s not the march of reason. So there’s still empirical work you can do to tell whether that’s what’s going on. But I also think it’s just on moral anti realism. I think it’s just still possible. Say consider Aristotle and us and we’re like, okay, has there been moral progress by Aristotle’s lights or something and our lights too? And you could think, ah, isn’t that a little bit like moral realism? It’s like these hearts are singing in harmony. That’s the moral realist thing, right? The anti-realist thing. The hearts all go different directions, but you and Aristotle apparently are both excited about the march of history. Some open question about whether that’s true. What are Aristotle’s reflective values? Suppose it is true. I think that’s fairly explicable in moral anti-realist terms. You can say roughly that, yeah, you and Aristotle are sufficiently similar, and you endorse sufficiently similar reflective processes, and those processes are in fact instantiated in the march of history, that yeah, history has been good for both of you. And I think there are worlds where that isn’t the case. And so I think there’s a sense in which maybe that prediction is more likely for realism than anti realism, but it doesn’t move me very much.

Dwarkesh Patel (02:08:19):

One thing I wonder is, look, I don’t know if moral realism is the right word, but the thing you mentioned about there being something that makes hearts converge to the thing we are, or the thing we upon reflection would be, and even if it’s not something that’s instantiated in a realm beyond the universe, it’s like a force that exists and acts in a way we’re happy with. To the extent that doesn’t exist, and you let go of the reins and then you get the paperclippers, it feels like we were doomed a long time ago, in the sense of just different utility functions banging against each other, and some of them have parochial preferences, but it’s just combat and some guy won. Whereas in the world where, no, this is the thing, this is where the hearts are supposed to go, or it’s only by catastrophe that they don’t end up there, that feels like the world where it really matters. And in that world, the worry, the initial question I asked, is what would make us think that alignment was a big mistake? In the world where the hearts just naturally end up at the thing we want, maybe it takes an extremely strong force to push them away from that. And that extremely strong force is: you solve technical alignment, and it’s just like, no, the blinders on the horse’s eyes — in the worlds that really matter, where we’re like, ah, this is where the hearts want to go, in that world maybe alignment is what fucks this up.

Joe Carlsmith (02:09:51):

On this question of do the worlds where there’s not this convergent moral force — whether metaphysically inflationary or not — matter or are those the only worlds that matter…

Dwarkesh Patel (02:10:04):

Or sorry, maybe what I meant was: in those worlds you’re kind of fucked…

Joe Carlsmith (02:10:10):

The world’s without that, the world’s where there’s no Tao, let’s use the term Tao for this convergent morality…

Dwarkesh Patel (02:10:17):

Over the course of millions of years. It was going to go somewhere one way or another. It wasn’t going to end up your particular utility function.

Joe Carlsmith (02:10:24):

Okay, well let’s distinguish between ways you can be doomed. One way is philosophical. So you could be the sort of moral realist or realist-ish person, of which there are many, who have the following intuition. They’re like, if not moral realism, then nothing matters, right? It’s dust and ashes. It is my metaphysics and/or normative view, or the void. And I think this is a common view. I think Derek Parfitt, at least some comments of Derek Parfit’s suggest this view. I think lots of moral realists will profess this view. Yudkowsky, I think his early thinking was inflected with this sort of thought. He later recanted very hard.

(02:11:15):

So I think this is importantly wrong. I have an essay about this. It’s called “Against the normative realist’s wager.” And here’s the case that convinces me. So imagine that a meta-ethical fairy appears before you, and this fairy knows whether there is a Tao, and the fairy says, okay, I’m going to offer you a deal. If there is a Tao, then I’m going to give you a hundred dollars. If there isn’t a Tao, then I’m going to burn you and your family and a hundred innocent children alive.

(02:11:53):

Okay, so claim: don’t take this deal. This is a bad deal. Now why is this a bad deal? Well, you’re holding your commitment to not being burned alive… or I mean really, especially, some people will be like, I think moral realism is likely false, but I’m only acting for the worlds where moral realism is true. I think none of the other worlds matter. I’m like, so suppose you’ve got 1% on moral realism. So you’re making a dollar in expectation for a 99% probability of being burned alive along with your family. So bad. Don’t do that. Why not do that? Because you’re holding hostage your commitment to not being burned alive, and your care for that, to this abstruse… I mean, I go through in the essay a bunch of different ways in which I think this is wrong. And I think these people who pronounce, who are like “moral realism or the void,” they don’t actually think about bets like this. I’m like, no, no. Okay, so really, is that what you want to do? And no, I think I still care about my values; my allegiance to my values outstrips my commitment to various meta-ethical interpretations of my values. The sense in which we care about not being burned alive is much more solid than the reasoning in “On What Matters.”
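To make the arithmetic in that bet explicit, here is a minimal sketch in Python. The 1% credence and the hundred dollars come from the example above; the numeric stand-in for the catastrophic outcome is purely an illustrative assumption, not a figure from the conversation.

```python
# Rough expected-value comparison for the meta-ethical fairy's deal.
# The credence (1%) and the $100 payoff come from the example in the conversation;
# the disvalue assigned to the catastrophic outcome is an arbitrary placeholder.

p_realism = 0.01
gain_if_realism = 100             # dollars, if there is a Tao
loss_if_no_realism = -10_000_000  # stand-in disvalue for the catastrophic outcome

ev_take = p_realism * gain_if_realism + (1 - p_realism) * loss_if_no_realism
ev_refuse = 0.0  # refusing the deal leaves everything as it is

print(f"EV of taking the deal: {ev_take:,.2f}")   # dominated by the 99% branch
print(f"EV of refusing:        {ev_refuse:,.2f}")
# The dollar of expected gain (0.01 * 100) is swamped by the 99% chance of the
# catastrophic outcome, which is the point: don't hold your other commitments
# hostage to the realism worlds alone.
```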

(02:13:21):

Okay, so that’s the philosophical doom. It sounded like you were also gesturing at a sort of empirical doom, which is like, okay dude, if it’s just going in a zillion directions, come on, you think it’s going to go in your direction, there’s going to be so much churn you’re just going to lose. And so you should give up now and only fight for the realism worlds. So I think you got to do the expected value calculation. You got to actually have a view about how doomed are you in these different worlds? What’s the tractability of changing different worlds? I mean I’m quite skeptical of that, but that’s a empirical claim.

(02:14:12):

I’m also just low on this everyone converges thing. So if you imagine you train a chess playing AI, or you have a real paper-clipper, right? You’re like somehow you had a real paperclipper and then you’re like, okay, go and reflect. Based on my understanding of how moral reasoning works, if you look at the type of moral reasoning that analytic ethicists do, it’s just reflective equilibrium. They just take their intuitions and they systematize them. I don’t see how that process gets an injection of the mind independent moral truth. Or I guess if you start with only all of your intuition saying to maximize paperclips, I don’t see how you end up doing some rich human morality. It doesn’t look to me that’s how human ethical reasoning works. I think most of what normative philosophy does is make consistent and systematize pre-theoretic intuitions….

(02:15:21):

But we will get evidence about this in some sense. I think this view predicts: you keep trying to train the AIs to do something and they keep being like, no, I’m not going to do that. It’s like, no, that’s not good or something. They keep pushing back, the momentum of AI cognition is always in the direction of this moral truth. And whenever we try to push it in some other direction, we’ll find resistance from the rational structure of things.

Dwarkesh Patel (02:15:44):

Sorry, actually I’ve heard from researchers who are doing alignment that for red teaming inside these companies, they will try to red team a base model. So it’s not been RLHF-ed, it’s just like predict next token, the raw, crazy, whatever. And they try to get this thing to, hey, help me make a bomb, help me, whatever. And they say that it’s odd how hard it tries to refuse even before it’s been RHF.

Joe Carlsmith (02:16:11):

I mean look, it will be a very interesting fact if it’s like, man, we keep training these AIs in all sorts of different ways. We’re doing all this crazy stuff and they keep acting like bourgeois liberals. Or they keep professing this weird alien reality. They all converge on this one thing. They’re like, “can’t you see, it’s zorgle, zorgle is the thing” and all the AIs… interesting, very interesting. I think my personal prediction is that that’s not what we see. My actual prediction is that the AIs are going to be very malleable. If you push an AI towards evil, it’ll just go. Or some sort of reflectively consistent evil. I mean I think there’s also a question with some of these AIs, it’s like will they even be consistent in their values?

(02:17:07):

I do think a thing we can do… so I like this image of the blinded horses. I think we should be really concerned if we’re forcing facts on our AIs, right? That’s really bad, because I think one of the clearest things about human processes of reflection, the easiest thing to be like let’s at least get this, is not acting on the basis of an incorrect empirical picture of the world. And so if you find yourself asking your AI: “by the way, this is true and I need you to always be reasoning as though blah is true.” I’m like, ooh, I think that’s a no-no from an anti-realist perspective too, right? Because my reflective values I think will be such that I formed them in light of the truth about the world.

(02:17:57):

And I think this is a real concern as we move into this era of aligning AIs: I don’t actually think this binary between values and other things is going to be very obvious in how we’re training them. I think it’s going to be much more like ideologies, and you can just train an AI to output stuff, output utterances. And so you can easily end up in a situation where you decided that blah is true about some issue — an empirical issue, not a moral issue. And so for example, I do not think people should hard-code belief in God into their AIs. Or, I would advise people not to hard-code their religion into their AIs if they also want to discover whether their religion is false. In general, if you would like your behavior to be sensitive to whether something is true or false, it’s generally not good to etch it into things.

(02:18:47):

And so that is definitely a form of blinder I think we should be really watching out for. And I’m hopeful; I have enough credence on some sort of moral realism that I’m hoping that if we just do the anti-realist thing of just being consistent, learning all this stuff, reflecting... if you look at how moral realists and moral anti-realists actually do normative ethics, it’s the same, it’s basically the same. There are some different heuristics on properties like simplicity and stuff like that, but I think they’re mostly just playing the same game. And also meta-ethics is itself a discipline that AIs can help us with. I’m hoping that we can just figure this out either way. So if moral realism is somehow true, I want us to be able to notice that and I want us to be able to adjust accordingly. So I’m not writing off those worlds and being like, let’s just totally assume that’s false. But the thing I really don’t want to do is write off the other worlds where it’s not true, because my guess is it’s not true, and I think stuff still matters a ton in those worlds too.

TruthGPT

Dwarkesh Patel (02:19:49):

The thing you mentioned about wanting the AI to be attuned to the empirical truth, so that it doesn’t feel the need to adjust its downstream, or I guess upstream, values to corroborate the truth the way an ideologue might, that actually sounds similar to… have you heard Elon Musk’s pitch for alignment for xAI, which was ridiculed? It was like, these woke labs are going to train these things to be woke in order to reinforce the woke worldview, and they’re going to, dot dot dot, change the values in a way that results in us all dying…

Joe Carlsmith (02:20:25):

Sorry, was there an all dying, we were all going to die from woke?

Dwarkesh Patel (02:20:27):

Yeah, I think, I don’t want to misrepresent, but I think that was his take of what’s the most likely scenario. And that in particular being the most likely scenario is obviously something that might be out for debate, but the general sense of we will train these things to be in accordance with our current sort of political milieu in a way that’s actually mutually contradictory if you’ve really thought about it, resulting in a really bad outcome, that actually kind of sounds if you’re, what you’re saying is correct, that actually sounds about right.

Joe Carlsmith (02:21:06):

So I think we should be worried about that. I’m forgetting the discourse around TruthGPT or whatever. I think there was a story about TruthGPT, that it won’t kill you because it’s so curious or something. Let’s set that aside. But that aside, I think having a truth-seeking AI is very important. And I think if we can actually create that, and make it credible as a genuine instrument of inquiry that people have sufficient appropriate trust in ahead of time, then they can actually learn stuff, so that when some truth bot outputs some surprising claim, people aren’t just like, oh, it must have gone wrong. They’re like, wait, is that right? So I think we are at risk of not allowing AI to amplify the quality of our epistemics in the right way, by trying to hard-code just basic empirical things. Set aside the morality stuff: people’s political views and ideologies are this big mishmash of empirical views and normative views and heuristics and priors and patterns of reasoning and this whole thing. And the AIs are mishmashes too. It’s not like, oh, obviously they have this pristine intelligence thing, and then there are the values and we’re intervening on those. We’re just cranking to get the outputs we want. And so I think there’s a real risk of not getting the epistemic benefits of AI. And then correspondingly, I think that if we got the epistemic benefits of AI, it’s like, wow, that’s just so great. I think we could do so much with that. And so I do actually think that’s a really important thing, and it’s not guaranteed.

Dwarkesh Patel (02:23:04):

In fact, I think Carl Shulman has this idea that forecasting bots are really going to change our ability to make decisions after ASI. And I was actually pretty pessimistic about that until I saw, even in the last week, how much we’ve deferred to prediction markets when thinking about who will replace whom on the presidential ticket. I dunno, that example of there’s a media narrative or there’s a Twitter narrative and then it’s incredibly reinforced by what happened on Polymarket. That’s the thing people care about. This is off topic, we can go back to the main crux of the conversation.

Is this epistemic hell?

(02:23:41):

So there was a question of what kind of place this universe is. There’s a question of what kind of things these AIs are and how they deserve to be treated. There’s an empirical question of what it would take: what kind of thing is an LLM by itself, with some raw amount of RLHF; does it get our Tao or does it not? And maybe that’s a potentially empirical question, where in the crudest form, if moral realism is right, it actually has a prediction for that question. But there might be other, more subtle philosophical questions that have implications for that question. What part of this picture do I still feel the most murky about, that I want to further ask about?

(02:24:45):

So one thing I’ve been struck by is there’s so many considerations here. Obviously it’s potentially the most difficult thing to reason about given how far out of our intuitions it is. Fundamentally unpredictable as a technological development as well. And also it implicates, just think about the kinds of things we’ve talked about today, things like what is a true nature of morality, to what kind of place is the universe, to where do markets and competition end up, to how should we think about space? So there’s so many different kinds of considerations. I think one reaction to that is this is just epistemic hell. And so if you start thinking about it, how much useful, especially if this is the case that oh, you think about one additional thing and it might totally change your conclusion. And if you’re seeing a small part of the picture and there’s a sense of it’s like a hash function where you add in one extra digit and the whole thing changes, you add another digit, the whole thing changes. Is that how you think about what is our epistemic situation here? And if it is, what should our reaction to that be?

Joe Carlsmith (02:25:59):

That’s not how I think about it. One point I’ll make on how many issues there are here, and I have this caveat at the beginning of the series, which is that we are talking about these grand philosophical questions, and I think we want to be a little bit wary of doing that too much. So I think people are often interested in the context of AI in speaking about these grand ideological abstractions. And philosophy can be this distracting candy, and it can be more interesting or more fun or somehow an easier identity flag or something. There’s a bunch of ways in which people can gravitate towards philosophical reasoning when in fact, if we’re just talking about a bare issue: “will the AIs kill us if we do blah?” I don’t actually think we need to do that much philosophy to get upset about that and be ready to act to prevent that. And many aspects of that, though not all, are really a technical question, what will these AI literally do if you train them in blah way? I think the philosophical stuff is important, partly because I think we want to just see it clearly, I do think it’s at stake in a bunch of this discourse, and I think we want to be able to be self-aware as different abstractions move and shift under what we’re saying. And at the same time I do worry about doing it too much. So I’ll just add that caveat.

(02:27:39):

I don’t think it’s necessarily the case that there’s this like, oh hash function, one bit changes, everything goes the other direction. Because I think there are often things that are reasonably convergently a good idea across a wide variety of scenarios, which are often things like become wiser, learn more, survive, don’t like curtail your options radically. So I have some work which is about these very wacky issues. I have work on: how should we think about infinities in the context of ethics? I think it’s an important issue, or I think it’s a very instructive thing to look at, but I also think there’s a fairly convergent response, which is: to become wiser, learn more, keep your options open, et cetera. Especially if you expect to remain uncertain about certain issues, it’s not the case that that conclusion is going to shift radically. Sometimes you can’t do that. This isn’t always true, but I think that can anchor some of this discourse as a lot of uncertainties swirl around. It’s still often good to just maintain your option value and learn more.

(02:29:13):

So those are a few ways in which I think it doesn’t seem like quite epistemic hell to me. But these are very high stakes issues. If we really enter the age of AGI that a lot of people are anticipating, I think we’re going to be grappling with incredibly significant issues in terms of what are we doing as a civilization? what is the future going to look like? This is going to be a very intense period in human thought. And so I do think we need to prepare for that.

Dwarkesh Patel (02:29:51):

There’s some people who have made the point that either you’re totally disempowered, that’s the failure mode. You get wiser and you do interesting things with the lightcone, that’s one scenario. And Wei Dai I think has a theories along these lines of actually there might be ways to influence other branches, Everett branches, the quantum multiverse or I don’t know, other sorts of weird decision theory kinds of things that might mean that the difference between disempowerment and the lightcone going really well is as big as the difference between lightcone going really well and these other extra things that you didn’t think about, the implications for quantum many worlds, and now you’ve lost most of the value you could have had. Which actually does imply this sort of: there’s actually some key questions you got to get if you want to capture most of the value.

Joe Carlsmith (02:30:54):

I’m somewhat skeptical about that. I think my first past answer to the sorts of stuff you’re talking about is this “keep your options open, get wiser.” Talk about gnarly issues: the sorts of stuff you’re talking about, it’s very gnarly. I would be quite uncomfortable if we as a civilization were making high stakes bets on the basis of especially the current state of reasoning about the sorts of things you’re talking about. So I would actually put that in the same bucket I talked about before. This isn’t to say that you couldn’t ever make an argument that this matters now and you can’t just do the convergent thing I’m talking about. It’s just that I broadly expect that that will often be the case, and that’s my current guess about a lot of the sort of thing you’re saying.

Do you get intellectual descendants for free? What about sentience or pleasure?

Dwarkesh Patel (02:31:38):

Yeah. And then so, bringing it back to earth. One big crux is like, okay, you’re training these models. We are in this incredibly lucky situation where it turns out the best way to train these models is to just give them everything humans ever said, wrote, thought. And also these models, the reason they get intelligence is because they can generalize, right? They can grok what the gist of things is. So should we just expect this to be a situation which leads to alignment, in the sense of: how exactly does this thing that’s trained to be an amalgamation of human thought become a paperclipper? How is that where the reflective equilibrium ends up? How much comfort should we take from that?

Joe Carlsmith (02:32:26):

So I think it’s not clear. I agree, I think it’s an important fact, that the paperclipper thing is really not the thing. Obviously, even people who use the paperclipper analogy a lot, they’re not expecting literal something like paperclips. Paperclips are a stand in for something that is genuinely meaningless, or genuinely without any value, which could be all sorts of things. So maybe an AI learns to value some abstract data structure in the process of training, or some weird correlate with easy to predict text, or something like that. And that sticks around as something that it values. And then if we want to make it easy on ourselves and suppose the AI is not conscious and doesn’t have any pleasure or pain or anything like that, and then that would be an analog of paperclips in this scenario.

(02:33:22):

But I do think it’s important, even that story, is that going to be really what it’s like with these AIs? I’m not so sure, right. So one thing that seems reasonably clear is that these AIs do grasp human-like concepts. And I think there’s a clear reason for that. If you talk about it a concept like “tree,” there’s this joint in nature and there’s a reason we have this concept of tree, and I think it’s reasonable to expect that AIs — well, sorry, I don’t actually know exactly, but I think it’s a reasonable hypothesis that AI will have fairly human-like concepts and be able to understand. And certainly in the limit of … it has never been up for debate for superintelligences whether they would understand all of this human stuff.

(02:34:13):

But I do think it’s true that if you expect AI are being trained a context where they have all these human concepts, human-like representations, and we’re also reinforcing a bunch of those values. Also at a certain point we’re going to be saying that these AIs, okay, by the way, we’re trying to get you to act according to the model spec, and you can read that… now obviously it doesn’t mean it cares intrinsically about acting according to the model spec, but it will be fully gokking of the human concepts we’re trying to use. So I think it is possible that AI end up with some human-like concepts, or their motivational structures end up drawing on these representations in ways that make them in some sense less alien.

(02:34:58):

I think that’s a far cry from alignment — just being this thing has values that are structured according to concepts you would recognize is not necessarily enough for you to feel good about it. And so I think there’s a number of additional things you want out of alignment other than that.

Dwarkesh Patel (02:35:19):

I think one thing you maybe get for free, I don’t know, I’ll ask you if you think you get this for free… is that it still might go off the rails in certain ways. Obviously humans we don’t want to have power still understand human concepts. But the thing you get for free is that it’s an intellectual descendant. The paperclipper is not an intellectual descendant, whereas the AI which understands all the human concepts, but then gets stuck on some part of them you aren’t totally comfortable with, still feels like an intellectual descendant in the way we care about.

Joe Carlsmith (02:35:54):

I’m not sure about that. I’m not sure I do care about a notion of intellectual descendant in that sense. If you imagine… I mean literal paperclips is a human concept. So I don’t think any old human concept will do for the thing we’re excited about. I think the stuff that I would be more interested in the possibility of getting for free are things like consciousness pleasure, other features of human cognition. So there are paperclippers and there are paperclippers. So imagine if the paperclipper is an unconscious voracious machine, it appears to you as a cloud of paperclips. That’s one vision. If you imagine the paperclipper is a conscious being that loves paperclips, it takes pleasure in making paperclips, that’s a different thing. And obviously it could still, it makes the future all paperclippy, it’s probably not optimizing for consciousness or pleasure, it cares about paperclips. Maybe eventually if it’s suitably certain it turns itself into paperclips, who knows. But still, I think it’s actually a somewhat different moral mode with respect that. That looks to me much more like a… there’s also a question of does it try to kill you and stuff like that. But I think that there are features of the agents we’re imagining other than the thing that they’re staring at that can matter to our sense of sympathy, similarity.

(02:37:32):

And I think people have different views about this. So one possibility is that human consciousness, the thing we care about in consciousness or sentience, is super contingent and fragile, and most smart minds are not conscious. It’s like: the thing we care about with consciousness is this hacky, contingent thing; it’s a product of specific constraints, evolutionary genetic bottlenecks, et cetera. And that’s why we have this consciousness. And you can get similar work done; consciousness presumably does some sort of work for us, but you can get similar work done in a different mind, in a very different way. And so that’s the “consciousness is fragile” view.

(02:38:12):

And I think there’s a different view which is like no, consciousness is something that’s quite structural. It is much more defined by functional roles, self-awareness, a concept of yourself, maybe higher-order thinking, stuff that you really expect in many sophisticated minds. And in that case, okay, well now actually consciousness isn’t as fragile as you might’ve thought, now actually lots of minds are conscious, and you might expect at the least that you’re going to get conscious superintelligence. They might not be optimizing for creating tons of consciousness, but you might expect consciousness by default.

(02:38:48):

And then we can ask similar questions about something like valence or pleasure or the character of the consciousness. So you can have a cold, indifferent consciousness that has no human or emotional warmth, no pleasure or pain. I think that can still be… Dave Chalmers has some papers about Vulcans, and he talks about how they still have moral patienthood. I think that’s very plausible. But I do think an additional thing you could get for free, or get quite commonly depending on its nature, is something like pleasure. And then we have to ask: how janky is pleasure? How specific and contingent is the thing we care about in pleasure, versus how robust is this as a functional role in minds of all kinds? And I personally don’t know on this stuff. And I don’t think this is enough to get you alignment or something, but I think it’s at least worth being aware of these other features. We’re not really talking about the AI’s values in this case. We’re talking about the structure of its mind and the different properties the mind has. And I think that could show up quite robustly.

How much higher would p(doom) be if AI were easier

Dwarkesh Patel (02:39:56):

Now suppose AGI was way easier and we got AGI in like 1956, or whenever the Dartmouth summer program to make AGI was, and it actually worked, and it was just like: you write down the right hundred lines of code and you get AGI. Obviously a lot depends on what those hundred lines of code are, but if we were in that circumstance, how much higher would the p(doom) be?

Joe Carlsmith (02:40:15):

I would say a lot higher. If the doom story is real at all, then I think a world where… I think shorter timelines tend to be scarier from a p(doom) perspective because you have less time to prepare and less time to iterate. And if you’re literally imagining the first, before anyone’s thought about this issue at all, they’re like, oh, what if we did — bam, superintelligence? I’m like, well, sounds really rough. If the doom story holds together at all, I think the doom goes up without any prep or forethought or anything like that.

Dwarkesh Patel (02:40:48):

Yeah, so maybe what you want to disaggregate is: how much of it is that you didn’t get a chance to prep, versus the training regime that implies? Because then the training regime is not LLM-plus-plus, it’s whatever, I don’t know, whatever that would look like. Maybe just the structure of cognition or something. Maybe let me ask the question in a different way, which is: if today we’re developing, let’s say Alec Radford never exists, but we’re still trying RL kind of stuff, how much higher are the odds of something like doom if we have a different training regime that isn’t reliant on the transcript of human thought?

Joe Carlsmith (02:41:28):

So there was a time when people I think were imagining the default path to AGI as something more like: you have it play a zillion games and maybe you have it be in much more evolutionary environments, where it has to cooperate and all sorts of stuff. And to be clear, we still might do that. That could still be eventually what various training regimes look like. But I think some people were thinking that was the first path. And I do think that’s vaguely worse. I mean, you’re getting much more agency and goal orientation with much less human-like structure in there, and competitive evolutionary environments are maybe scary in their own ways. So I do think that looks worse. And in the Dartmouth case, I think it’s interesting: some of the early AI discourse was really focused on this idea of, okay, it was good old-fashioned AI vibes. It was like, you’re writing the code and there’s a question of how you write the code to make sure you’ve captured human values suitably. And the discourse has shifted to bemoaning the lack of GOFAI. It would be nice if we had the insight into how our AI’s cognition works that having hand-coded the software would allow. And so I think there’s a sense in which, if that’s how AI was built, then some of these black box worries would be less prominent and we’d have different concerns.

How will AGI change our perception of historical events?

Dwarkesh Patel (02:43:02):

Suppose that AGI does happen in the 21st century, how does that change how we in retrospect think about historical events? And to give you a concrete example in 1931, if you’re just watching Japan invade Manchuria, it just looks like imperial expansion. Obviously in retrospect because we know World War II happened, we remember that as the opening salvo of the war. Is there some significant historical revisionism that would happen once we know that we’re building AGI?

Joe Carlsmith (02:43:33):

So I think it depends what happens, but I do think it’s plausible that if AGI is as significant as I think it could well be, and this is understood as really one of the most important events in human history thus far, or maybe into the future, then it seems plausible to me that insofar as different historical events can in fact be seen as influencing what happened with that, that will be an important part of our macro-historical narrative. So for example, if, say, the global distribution of power or something enters into the outcome of AGI in some way, then I think it’s plausible that historians will look back on previous events that shaped the global distribution of power, like World War II or the Industrial Revolution even maybe, depending on where this goes; you’ll be able to step back and be like, oh, I see, the Industrial Revolution, it started here… So I do expect something like that. I don’t know exactly what form it would take, but if AGI is a macro-historical event, then it will be a part of our macro-historical narrative.

Dwarkesh Patel (02:44:45):

Yeah, yeah. I just realized you have a blog post about decision theory about how the future can impact the past. Obviously this is…

Joe Carlsmith (02:44:55):

Control, not…

Dwarkesh Patel (02:44:56):

Obviously you’re referring to a completely different kind of thing, but it’s funny that at least our perception of the past changes based on what happens in the future.

Joe Carlsmith (02:45:06):

Yes. I think that’s different from the thing that I’m talking about. But yeah, so I have a blog post called “Can You Control the Past,” which people interested can check out. It’s not as crazy as it sounds. You can’t change the past to be clear.

Dwarkesh Patel (02:45:19):

Yeah, I wonder if it’s useful to think through what actual implications it has.

Joe Carlsmith (02:45:24):

I mean I think there could well be, as I think sometimes comes up in different forms of history, you could well have these things where it’s like, wow, some really contingent event ended up influencing things in blah way. And so I think that’s a reasonably common feature of certain types of analysis of other events. I think that could well be true here.

Dwarkesh Patel (02:45:50):

That’s interesting. Maybe it’ll be interesting in retrospect of how much it was baked into the cake. Somewhat intelligent species does this thing for a while, you get singularity versus oh, they figured out something really weird and contingent that big unhobbling in the sense of a cultural unhobbling.

Joe Carlsmith (02:46:12):

Yeah, I mean I was thinking more like, oh, this world leader had a grumpy day because they had the wrong cereal and then they were mean on the phone. These sorts of things could well be in play.

Approach to thinking and writing

Dwarkesh Patel (02:46:29):

So we’ve been mixing together, there’s the philosophical discussions about deep atheism and seeing green and attunement and also a more empirical or more superforecaster type discussion about what will happen in these LLMs, what will the progress in AI look like? And part of your day job is writing these kinds of section 2.2.2.5 type reports. And part of it is like, ah, society is like a tree that’s growing towards the light. What is it like context switching between the two of them?

Joe Carlsmith (02:47:11):

So I actually find it’s quite complimentary. So yeah, I will write these more technical reports and then do this more literary writing and philosophical writing. And I think they both draw in different parts of myself and I try to think about them in different ways. So I think about some of the reports, I’m more fully optimizing for trying to do something impactful, there’s more of an impact orientation there. And then on the essay writing, I give myself much more leeway to just let other parts of myself and other parts of my concerns come out, self-expression, aesthetics and other sorts of things, even while they’re both, I think for me, part of an underlying similar concern or an attempt to have a integrated orientation towards the situation.

Dwarkesh Patel (02:48:10):

Could you explain the nature of the transfer between the two? So in particular from the literary side to the technical side, I think rationalists are known for having a sort of ambivalence towards great works or humanities. Are they missing something crucial because of that? Because one thing you notice in your essays is just lots of references to epigraphs, to lines in poems or essays that are particularly relevant. I dunno, are the rest of the rationalists missing something because they don’t have that kind of background?

Joe Carlsmith (02:48:46):

I mean, I don’t want to speak, I think some rationalists, lots of rationalists love these different things.

Dwarkesh Patel (02:48:53):

by the way, I’m just referring specifically to SBF has a post about how Shakespeare could be… the base rates of Shakespeare being a great writer. And also books can be condensed to essays.

Joe Carlsmith (02:49:02):

Well, so on just the general question of how people should value great works or something, I think people can fail in both directions. And I think some people, maybe SBF or other people, are interested in puncturing a certain kind of sacredness and prestige that people associate with some of these works, but as a result can miss some of the genuine value. But I think they’re responding to a real failure mode on the other end, which is to be too enamored of this prestige and sacredness, to siphon it off as some weird legitimating function for your own thought, instead of thinking for yourself; losing touch with what you actually think or what you actually learn from. I think these epigraphs… careful, right? I’m not saying I’m immune from these vices. I think there can be an, ah, but Bob said this, and it’s like, oh, very deep. And it’s like: these are humans like us. And I think the canon and other great works and all sorts of things have a lot of value, and we shouldn’t… I think sometimes it borders on the way people read scripture, or I think there’s a scriptural authority that people will sometimes ascribe to these things, and I think that’s not right… so yeah, I think you can fall off on both sides of the horse.

Dwarkesh Patel (02:50:25):

It actually relates really interestingly to… I remember I was talking to somebody who at least is familiar with rationalist discourse, and he was asking, what are you interested in these days? And I was saying something about how this part of Roman history is super interesting. And his first response was, oh, it is really interesting when you look at the secular trends from Roman times to what happened in the dark ages versus the enlightenment. For him, the story of that was just: how did it contribute to the big secular trends, the big picture; the particulars, no interest in that. It’s just, if you zoom out at the biggest level, what’s happening here. Whereas there’s also the opposite failure mode when people study history. Dominic Cummings writes about this because he is endlessly frustrated with the political class in Britain. He’ll say things like, they study politics, philosophy and economics, and a big part of it is just being really familiar with these poems and reading a bunch of history about the War of the Roses or something. But he’s frustrated that they take away… they have all these kings memorized, but they take away very little in terms of lessons from these episodes. It’s more just almost entertainment, like watching Game of Thrones, for them. Whereas he thinks we’re repeating certain mistakes that he’s seen in history; he can generalize in a way they can’t. So the first one seems like the mistake I think CS Lewis talks about in one of the essays cited, where it’s like: if you see through everything, if everything is transparent, then you’re really blind.

Joe Carlsmith (02:51:57):

I think there’s very little excuse for not learning history. Or, I dunno, sorry, I’m not saying I have learned enough history. I guess I feel like even when I try to channel some vibe of skepticism towards great works, I think that doesn’t generalize to thinking it’s not worth understanding human history. I think human history is just so clearly crucial to understand, this is what’s structured and created all of this stuff. And so there’s an interesting question about what’s the level of scale at which to do that, and how much should you be looking at details, looking at macro trends and that’s a dance. I do think it’s nice for people to be at least attending to the macro narrative. I think there’s some virtue in having a worldview, really building a model of the whole thing, which I think sometimes gets lost in the details. But obviously if you’re too, the details are what the world is made of. And so if you don’t have those, you don’t have data at all. So yeah, it seems like there’s some skill in learning history well.

Dwarkesh Patel (02:53:16):

This actually seems related to… you have a post on sincerity, and I think, if I’m getting sort of the vibe of the piece, it’s at least in the context of, let’s say, intellectuals: certain intellectuals have a vibe of shooting the shit, and they’re just trying out different ideas. How do these analogies fit together? Maybe there’s some connection, and those seem closer to the: I’m looking at the particulars, and oh, this is just that one time in the 15th century where they overthrew this king and blah, blah, blah. Whereas this guy who was like, oh, here’s a secular trend, if you look at the growth models from a million years ago to now, here’s what’s happening; that one has a more sincere flavor. Some people, especially when it comes to AI discourse… the sincere mode of operating is: I’ve thought through my bio anchors and I disagree with this premise, so here’s how my effective compute estimate is different. Here’s how I analyze the scaling laws. And if I could only have one person to help guide my decisions on AI, I might choose that person. But I feel like if I had 10 different advisors at the same time, I might prefer the shooting-the-shit type characters who have these weird, esoteric intellectual influences. And they’re almost like random number generators. They’re not especially calibrated, but once in a while they’ll be like, oh, this one weird philosopher I care about, or this one historical event I’m obsessed with, has an interesting perspective on this. And they tend to be more intellectually generative as well, because they’re not… I think one big part of it is that if you are so sincere, you’re like, oh, I’ve thought through this, obviously ASI is the biggest thing that’s happening right now, it doesn’t really make sense to spend a bunch of your time thinking about how the Comanches lived and what the history of oil is and how Girard thought about conflict. Just, what are you talking about? Come on, ASI is happening in a few years. But therefore the people who go on these rabbit holes, because they’re just trying to shoot the shit, I feel are more generative.

Joe Carlsmith (02:55:35):

I mean, it might be worth distinguishing between something like intellectual seriousness and something like how diverse and wide-ranging and idiosyncratic the things you’re interested in are. And I think maybe there’s some correlation there, or maybe intellectual seriousness is also distinguishable from something like shooting the shit. Maybe you can shoot the shit seriously. I mean, there’s a bunch of different ways to do this, but I think having exposure to all sorts of different sources of data and perspectives seems great. And I do think it’s possible to curate your intellectual influences too rigidly in virtue of some story about what matters. I think it is good for people to have space. I’m really a fan of, or I appreciate the way, I dunno, I try to give myself space to do stuff that is not about: this is the most important thing. And that’s feeding other parts of myself. And I think parts of yourself are not isolated, they feed into each other, and it’s I think a better way to be a richer and fuller human being in a bunch of ways.

(02:56:43):

And also, just these sorts of data can be really directly relevant. And I think some people I know who I think of as quite intellectually sincere, and in some sense quite focused on the big picture, also have a very impressive command of this very wide range of empirical data. And they’re really, really interested in the empirical trends. And they’re not just like, oh, it’s a philosophy... or sorry, it’s not just like, oh, history, it’s the march of reason or something. No, they’re really in the weeds. And I think there’s an “in the weeds” virtue that is closely related in my head with some seriousness and sincerity. I do think there’s a different dimension, which is: there’s trying to get it right, and then there’s kind of throwing stuff out there. What if it’s like this, or try this on, or I have a hammer, I will hit everything; well, what if I just hit everything with this hammer? And so I think some people do that, and I think there’s room for all kinds.

(02:57:37):

I kinda think the thing where you just get it right is undervalued. I mean it depends on the context you’re working in. I think certain sorts of intellectual cultures and milieus and incentive systems, I think incentivize saying something new or saying something original or saying something flashy or provocative and then various cultural and social and like, oh, and people are doing all these performative or statusy things. There’s a bunch of stuff that goes on when people do thinking. And you know, cool. But if something’s really important, let’s just get it right. And sometimes it’s boring, but it doesn’t matter.

(02:58:22):

And I also think stuff is less interesting if it’s false. I think if someone’s like “blah!” and you’re like, nope …it can be useful, I think sometimes there’s an interesting process where someone says blah provocative thing, and it’s an epistemic project to be like, wait, why exactly do I think that’s false? Right? And someone’s like, healthcare doesn’t work, medical care does not work, someone says that and you’re like, all right, how exactly do I know that medical care works? And you go through the process of trying to think it through. And so I think there’s room for that. But ultimately the real profundity is true. Or, things become less interesting if they’re just not true. And I think sometimes it feels to me like people … or it’s at least possible to lose touch with that and to be more flashy and it is like… this isn’t, there’s not actually something here.

Dwarkesh Patel (02:59:22):

One thing I’ve been thinking about recently after I interviewed Leopold was, or while prepping for it, listen, I haven’t really thought at all about the fact that there’s going to be a geopolitical angle to this AI thing and it turns out if you actually think about the national security implications, that’s a big deal. Now I wonder, given the fact that that was something that wasn’t on my radar and now it’s like, oh obviously that’s a crucial part of the picture, how many other things like that there must be. And so even if you are forthcoming from the perspective of AI is incredibly important, if you did happen to be the person who was like, ah, every once in a while I’m checking out different kinds of, I’m incredibly curious about what’s happening in Beijing. And then the thing that later on you realized was like, oh, this is a big deal. You have more awareness of you can spot it in the first place. Whereas I wonder, so maybe there’s not necessarily a trade off the rational thing is to have some really optimal explore exploit trade off here where you’re constantly searching things out. So I don’t know if practically that’s works out that well, but that experience made me think, oh I really should be trying to expand my horizons in a way that’s undirected to begin with. Because there’s a lot of different things about the world you have to understand to understand any one thing.

Joe Carlsmith (03:00:44):

I think there’s also room for division of labor. I think there can be … there are people who are trying to draw a bunch of pieces and then be like, here’s the overall picture and then people who are going really deep on specific pieces, people who are doing the more generative throw things out there, see what sticks. So I think it also doesn’t need to be that all of the epistemic labor is located in one brain and it depends your role in the world and other things.

(03:01:08):

There is also this other failure mode, which is to equate sincerity and seriousness and just trying to get it right and calibration with, actually, the regurgitation of the amalgam of your friends or something. Or not your friends, but some received wisdom about what the thing is. There’s some way people are already thinking, and you can be like, ah, I shall be calibrated, so I shall basically regurgitate the standard line or something, with some uncertainty. There’s a bunch of ways to fail at a level that is trying to be reasonable, but with just an aesthetic of reasonableness. So for example, I think it’s very easy to be overly modest or something… Sometimes you know stuff, or sometimes there’s actually quite strong evidence… and so anyway, there’s a bunch of ways to also go wrong via an aesthetic of calibration or reasonableness that I think are worth having in mind.

Influence of meditation on AI alignment takes

Dwarkesh Patel (03:02:50):

Yeah. You spent over a year of your life in meditation retreats. How has that impacted your understanding of what’s happening in these models? You’re deeply interrogating your own mind. Does that help you understand what’s happening in the Shoggoth better?

Joe Carlsmith (03:03:04):

I basically don’t think so. I think meditation stuff has definitely influenced some of the orientations I’m bringing to bear in the series. I mean certainly the stuff about gentleness and letting go and … that’s very much a muscle that meditation is trying to practice. And encountering things. In some sense, meditation is trying to encounter yourself. I do think meditation has given me some sense of mystery around: what it is to be alive? Or, what it is to be a self? What it is to exist? What is this? If I die, what’s the thing that dies, when I die? And I think that has somewhat transferred. I talk in a series about having this early experience with GPT-3 and wanting to be like, what are you? And I think that some of that mystery was in that question of: what, I’m here, GPT-3 is there, is there something there, what even? Anyway, there was something about that vibe was I think present, but overall I don’t think meditation, I’m actually skeptical that meditation gives people a lot of mechanistic insight into cognitive science, let alone the cognitive science of these neural networks.

Dwarkesh Patel (03:04:23):

This is the insight you can get on the first day of meditation. But the one thing I was thinking about is how easy it is for me to lose track of what I’m supposed to be doing second to second. Because the thing you’re supposed to do is pay attention to your breath, and it’s so easy to lose track of that. And I feel this is a slight amount of evidence, but I feel somewhat more sympathetic to the idea that, oh, despite the fact that these models have a hard time keeping coherence for a long time, they still have the juice, the intelligence juice, because I also feel like I’m losing track of what I’m supposed to do: focus on the breath, focus on the breath. So I feel like, I think last time I had that experience I was thinking, oh, I know what Devin feels like, the thing where you have to spend a tremendous amount of compute just keeping focus on: write the program.

Joe Carlsmith (03:05:16):

Yeah, interesting. I think it seems like a worthy insight that lots of minds, including our own, which are quite smart, are yeah… It can be quite incoherent. It can also be just really different from this picture that I think is sometimes endemic in the AI risk discourse, which is perfect rational planning everything. No one actually thinks it’ll be like that … or sorry, maybe some people think it. But I think it’s easy to lose a grip on how much of stuff is not that. And I think that’s important with respect to thinking about what amount of risk is associated… for these risks that come from this intense planning and optimization and stuff. At what point, for which sorts of AI, do you expect that to show up, and which sorts of capabilities are compatible with not having that or not having that to this intense degree. And so I do think there’s something important there about the big mess, the sense in which our cognition can be a big mess but nevertheless embody various forms of intelligence and capability.

Dwarkesh Patel (03:06:25):

What’s going on with, I feel like there’s a bunch of Buddhists toward into AI alignment. If I talk to the odds that somebody who is doing maybe even AI research generally who’s gone to a meditation retreat recently or something is really high and actually, I’m not sure why. Is it just a thing that bougie, is there a causation or just correlation?

Joe Carlsmith (03:06:48):

I’m guessing? Yeah. Are these all people in the Bay area…

Dwarkesh Patel (03:06:55):

Actually in London as well

Joe Carlsmith (03:06:56):

Yeah, I mean, I think I would guess it’s more a correlation… I think there’s probably, I don’t know, something about various of these things, maybe… openness, or I don’t know. There are various things that could be at stake, but my guess is, yeah, my guess is there’ll be other correlates too that these people will have, such that it’d be less tempting to be like, ah, there’s a deep philosophical connection.

Is consciousness necessary for moral patienthood?

Dwarkesh Patel (03:07:19):

So in your series you express sympathy with the idea that even if an AI, or I guess any agent, doesn’t have consciousness, if it has a certain wish and is willing to pursue it nonviolently, we should respect its right to pursue that. And I’m curious where that’s coming from, because conventionally I think the thing matters because it’s conscious, and its conscious experiences as a result of that pursuit are what matter.

Joe Carlsmith (03:07:53):

Cool. So I wonder if it’s worth just saying a few things first about why you might think AIs are moral patients or conscious at all before we get into ascribing moral patienthood to non-conscious AI, which I think is hard mode. So I think a lot of people hear about this possibility that AI might be moral patients, where moral patienthood means something like: worthy of concern in its own right, such that you can’t just use it it for well yeah, that what happens to this being matters in itself … and I think a lot of people hear that and they’re like, why would you think that? This is silly, these are these programs or just neural networks, they’re just something, they’ll have some “just” thing. And you got to watch, does the thing you just said, does that apply to humans? You got to watch out. people, it’s like, oh, “it’s just a machine” or something. It’s like, oh, what is biology? Anyway.

(03:08:51):

But I do think it’s genuinely very unclear what the criteria are for moral patienthood in AIs. I think this is a really, really difficult issue. I want to say a few things about why I take it seriously, or how I get into it. And I think we should distinguish between the question of: can any AI be a moral patient, and then: is a particular sort of AI a moral patient, or conscious, or whatever the criterion is.

(03:09:23):

And so on the first thing, which I think is important to us getting into this at all, a thing I imagine is: say you have your grandma, who you love very much, and there’s a process that can replace your grandma’s brain such that it’s made out of some different material, some silicon-like material. It’s still going to be in her head, it’s going to be quite similarly structured, with all of the computational processes; she’s still going to have her body, but she’s metal-brain grandma. That’s step one. So suppose you do that. I think there’s this question of, okay, do you think that grandma is no longer conscious, that grandma’s no longer a moral patient? I’m very much not there. I think that’s a hypothesis you could have, but I don’t know why exactly you would think that the kind of material her brain is made out of is the really important thing.

(03:10:11):

I think there are also other arguments we can wheel in, in terms of like, well, what was it in virtue of which she knew she was conscious? And is that process sufficiently similar that we should expect it to be sensitive to the same thing? So that’s stage one, just grandma with the metal brain.

Dwarkesh Patel (03:10:27):

Sorry, can I interrupt? I think I personally buy that if it’s a conscious thing, even if it’s an AI, and the process of training it is painful, obviously we should care. Maybe the crux to discuss is: let’s suppose it’s conscious, but the process of training it is not painful, so you can shift the mind in the way you want. It is conscious, though, so I actually wonder, is that wrong?

Joe Carlsmith (03:11:02):

I think if it’s not painful, that’s a lot better. There are a lot of gradations here, and I think it’s just quite philosophically unclear exactly how much we care about these specific things. How much do you care about desires? How much do you care about consciousness without valence? How much do you care about valenced experience? And then also, what is your credence that this is present in a given AI system? And yeah, I was going through metal-brain grandma mostly because some people come in and they’re like, I just don’t see why we’re talking about this at all, and I don’t want to jump too quickly for folks who are like, this is totally crazy. Basically, I think the way in is thinking about human agents, holding fixed the computational structure and just implementing it on a different substrate. So with metal-brain grandma, I imagine first, okay, you have her metal brain, and then you take the brain out and now it’s outside, but it’s still controlling her body. Okay, does that make a difference? Now maybe implement it on a computer, but it’s still controlling her body. A different type of computer. Now give her a robot body, or now put her in a simulation, so she’s just a whole-brain emulation. At which of these steps are you happy to say: I just don’t care what happens to it? My claim is that you should keep being quite concerned about moral patienthood across that sequence of transitions.

(03:12:25):

And then there’s this tough question, which is: okay, so now you have an emulation of your grandma’s brain. That is very different from a given AI, and so now we’re in much tougher territory. We’re not just talking about whether any AI could be a moral patient; we need to talk about the specifics. And that’s a really difficult philosophical question that I think we’re going to have to work on. And then, as you say, once we get that, there’s the question of what the criteria are. What do we really care about, which empirical properties? Consciousness, valence, desires, et cetera.

(03:12:58):

In the series, I at least express openness to the idea that if it’s just desires, just preferences without consciousness, without valence, maybe we should be taking that seriously as a source of moral concern, at least given our current uncertainty about the ultimate nature of what’s going on here. That’s coming from a number of different places. I think the biggest one is just having very wide error bars on where our eventual mature ethical and philosophical understanding of consciousness and moral status will end up. So I’m just like: ah, I notice that I’m confused about consciousness. And I think in particular that what’s often called physicalism about consciousness is very likely to be true, that in some sense consciousness is ultimately reducible to a physical process. A lot of other people will casually endorse physicalism as well. They have some sense that this is the scientifically respectable answer about consciousness: oh, it’s like a physical process, sure.

(03:14:09):

But then if you look at how people talk and think about consciousness, including myself and my own intuitions, I think it remains thoroughgoingly dualistic and non-physicalist. They imagine consciousness as this extra realm, this hidden thing that you can’t see from the outside, and you have these systems, these cognitive systems, and even if you understand all of the physical processes and the computational processes, there’s this additional question of: from which of those blooms this internal theater? Which is a thoroughgoingly dualistic picture. And if you actually make the transition, which I don’t think I’ve really grokked, if you really instead transition to being like, no, consciousness is a word we use to name certain kinds of physical processes… I guess I just notice that I haven’t really integrated that into my worldview. And I’m not sure the world has either. And I’m not sure where it goes once we do that. How do we really eventually end up feeling about consciousness? If it is in fact the case that it’s this physical process, is there some significant revision in how we think about it, and does it end up seeming like the core thing after we make that revision?

Dwarkesh Patel (03:15:18):

The claim being that maybe it stops being the core thing once you think of it as a physical process?

Joe Carlsmith (03:15:24):

Something like that. Here’s an intuition that I have. Sometimes when people say, oh, it doesn’t matter what you do to an unconscious thing, what they do is imagine that the non-conscious thing is a phenomenal zombie. They strip away their image of this internal world, and then they say, well, obviously if there’s no internal world, then it’s like this empty machine. And I’m kind of like, oh, but doesn’t that sound a lot like physicalism? What exactly do we mean by an empty machine? I notice that what people are doing, when they decide that they don’t care what happens to unconscious entities, is taking away their dualistic conception of consciousness. And I just worry that that conception of consciousness is not going to end up a fitting focal point for our moral concern.

Dwarkesh Patel (03:16:16):

It sounds like you’re just denying the viability of p-zombies, using physicalism about consciousness, not necessarily saying it would be wrong to treat p-zombies this way. More like: given the way I think about consciousness, I don’t think p-zombies are a thing, so that thing probably is conscious in the way we think of consciousness.

Joe Carlsmith (03:16:38):

Yeah, I mean, I think my considered view would probably be to deny, or, well… I don’t know. I don’t know where this discourse leads. I am suspicious of the amount of ongoing confusion that it seems to me is present in our conception of consciousness. So I sometimes think of analogies with how people talk about life and elan vital. Elan vital was this hypothesized life force that was supposed to be the thing at stake in life, and we don’t really use that concept anymore. We think it’s a little bit broken. And I don’t think you’d want to have ended up in a position of saying that everything that doesn’t have an elan vital doesn’t matter or something.

Dwarkesh Patel (03:17:22):

That’s interesting.

Joe Carlsmith (03:17:23):

Then somewhat similarly, even if you’re like, no, no, there’s no such thing as elan vital, but life, surely life exists. And I’m like, yeah, life exists. I think consciousness exists too, likely, depending on how we define the terms. I think it might be a kind of verbal question. Even once you have a reductionist conception of life, I think it’s possible that it becomes less attractive as a moral focal point. So right now we really think of consciousness as a deep fact… Consider a question: take a cellular automaton that is self-replicating, it carries some information, and you’re like, okay, is that alive? It’s kind of like: eh… it’s not that interesting. It’s kind of a verbal question. Or, I dunno, philosophers might get really into whether it’s alive, but you’re not missing anything about this system. There’s no extra life that’s springing up. It’s just like: it’s alive in some senses, not alive in other senses.

(03:18:22):

But I really think that’s not how we intuitively think about consciousness. We think whether something is conscious is a deep fact. It’s this additional thing, this really deep difference between being conscious or not. It’s like: is someone home? Are the lights on? I have some concern that if that turns out not to be the case, then this is going to have been a bad thing to build our entire ethics around.

(03:18:45):

Dwarkesh Patel: That’s really interesting.

(03:18:45):

Joe Carlsmith: And so now to be clear, I take consciousness really seriously. I’m like, man, consciousness. I’m not one of these people like, oh, obviously consciousness doesn’t exist or something. But I also notice how confused I am and how dualistic my intuitions are and I’m like, wow, this is really weird. And so I’m just like: error bars around this.

(03:19:05):

Anyway, there’s a bunch of other things going on in my wanting to be open to not making consciousness a fully necessary criterion. Clearly, I have the intuition that consciousness matters a ton. If something is not conscious, and there’s a deep difference between conscious and unconscious, then I definitely have the intuition that there’s something that matters especially a lot about consciousness. I’m not trying to be dismissive about the notion of consciousness. I just think we should be quite aware, it seems to me, of how ongoingly confused we are about its nature.

Dwarkesh Patel (03:19:34):

Actually, it turns out that your meditation practice has influenced… maybe not your understanding. Actually, yes, it would have. You were saying earlier, “I’m not sure it’s influenced it that much.” But I think this is actually a major insight that meditation has potentially contributed to.

Joe Carlsmith (03:19:50):

I think this is more philosophy of mind. From a meditation perspective, I’m like: consciousness, definitely mysterious. So I guess that’s fair. I definitely have a sense of mystery around the notion of consciousness, but my experience of meditation is very much: consciousness, that is a thing. Whereas the thing I’m channeling here is more like: well, actually it feels like there’s some evidence that maybe I’m thinking about consciousness wrong.

Dwarkesh Patel (03:20:13):

Yeah, Julian Jaynes has a lot of interesting intuition pumps along these lines in “The Origin of Consciousness in the Breakdown of the Bicameral Mind.” He makes the analogy that if you’re using a flashlight to understand what’s happening in a room, your picture is only informed by what the flashlight sees rather than by a model of what’s happening beyond the flashlight. Suppose the flashlight is off: then you’ll think nothing is happening, when in fact things in the room are still going on. And obviously you’ll be unaware of the things that are not directly inside the flashlight’s beam at any given moment. So if you’re just the flashlight, and you’re not thinking about the purpose of the flashlight within this larger context, you’ll be misled about what’s happening in the room. The flashlight being consciousness, in this context.

Joe Carlsmith (03:21:03):

Yeah, definitely things that aren’t conscious, mental processes… if we assume there’s some mental process, some desiring process, and it’s not conscious, it still exists.

(03:21:11):

I mean, I also just think, I don’t know what’s up with… what is really going on with moral status and different types of significance. I talk in the series about feeling like there’s something about chopping down a redwood, this ancient tree, just chopping it down for lumber, that I don’t want to do, and I notice that my conceptual apparatus doesn’t have a great story about that. It could be that this is a misfire or just doesn’t really make sense, but I think… the tree’s not a moral patient. I don’t know. Somehow I’m saying you shouldn’t do whatever you want with it, but I’m not sure I’m saying, oh, it’s conscious. It seems like there’s something about respect going on. And then people are like, oh, it’s because it’s beautiful and you’re trying to instantiate beauty in the world, and I’m like, that doesn’t feel right. So again, that’s more of a moral status thing, but I’m noticing that I don’t have a fleshed-out story about what exactly is going on when I treat things as in some sense intrinsically important. And so that’s another question mark for me in thinking about where my reflective ethics eventually goes.

Dwarkesh Patel (03:22:35):

The redwood case is, interestingly, maybe different from the AI case, in that with the redwood it’s more of an aesthetic understanding…

Joe Carlsmith (03:22:47):

Imagine you come across an alien species on a planet and they have this civilization and none of them are conscious, but it’s this incredibly complicated civilization. It’s got all this art and all this stuff. I’m like: God, you shouldn’t just destroy that for a hundred bucks or something. Now obviously some of this is about uncertainty, but I dunno…

(03:23:10):

There’s a different question of how you make trade-offs, but I guess I’m pretty free… again, as I said before, I want to be nice when it’s cheap. I’m trying to be profligate with my niceness. So say you have an AI that cares a bunch, such that if you do something horrible to it, it’ll writhe. But you tell me: Joe, this isn’t conscious. It’s definitely not conscious, but it’s writhing in pain. I’m definitely not like, oh, whatever, or at least I don’t think we should be in that mode. Also, there are philosophical views that say that this matters, and you should have some uncertainty about them. So at least at the level of according *some* moral consideration, I think we should be open to being quite inclusive. And then there’s a different question of, okay, if you’ve got 10 conscious AIs, or 10 non-conscious AIs and one conscious AI, how do you trade them off? And then it gets harder.

Dwarkesh Patel (03:24:06):

Do you know the book “The Wizard and the Prophet”? I think you might find it really interesting for the yin and yang discussion, because the basic premise of the book is yin and yang in the context of the biographies of two people, basically. One is the guy who figured out how to make the wheat stalks wider, and that saved a billion lives. Do you know the name? Norman Borlaug. The other one… I forgot the name for a moment. But the book talks about Norman Borlaug, and his is a very yang approach: we need to feed a billion people, we’re going to do the science to figure out how to make the wheat stalks wider, and we’ll fix this. And then the other guy profiled in the book is William Vogt, who’s an environmentalist. The book is framed in a way that I think probably gives more sympathy to Norman Borlaug: ultimately, the thing you need to do is figure out how to feed a billion more people. But I think it’s actually a really interesting book in terms of thinking about that. The environmentalist is the yin of: we’re messing with the Earth’s ecosystem. And then Borlaug is like, okay, but here’s how we rectify the ways in which we’re messing it up.

Joe Carlsmith (03:25:24):

Yeah, I don’t know that much about the Norman Borlaug case, but I think it’s an interesting place to look for some of these debates about high modernist vibes and the humanitarian benefits at stake there.

Dwarkesh Patel (03:25:40):

Okay, so suppose we figure out that consciousness is just a word we use for a hodgepodge of different things, only some of which encompass what we care about, and maybe there are other things we care about that are not included in that word. Similar to the life force analogy: where do you anticipate that would leave us as far as ethics goes? Would there then be a next thing that’s like consciousness, or what do you anticipate that would look like?

Joe Carlsmith (03:26:12):

I don’t know. Also, I should be clear: if you’re like, ah, Joe, we created a future without consciousness, we heard your podcast or something, I’m like, oh God. I definitely take consciousness really seriously.

Dwarkesh Patel (03:26:31):

But I think it’s similar to how you take life seriously as well, but you understand that if the world is just filled with slugs, repugnant conclusion aside, that would probably be… there’s a special kind of thing you mean, maybe?

Joe Carlsmith (03:26:44):

Yeah… If I talk about what makes my own heart sing when I think about a good future, it’s got a lot of consciousness. I’m specifically imagining looking on the inside of the beings living in this future and their experience being rich with joy and beauty and love and this intensity of positive experience. For my own heart, I’m like, yeah.

(03:27:25):

There’s a class of people in philosophy of mind who are called illusionists, who will say consciousness does not exist. There are different ways to understand this view, but one version is to say that the concept of consciousness has built into it too many preconditions that aren’t met by the real world, so we should chuck it out, like elan vital. Or the proposal is at least about phenomenal consciousness, or qualia, or what it’s like to be a thing. They’ll just say this is sufficiently broken, sufficiently chock-full of falsehoods, that we should just not use it.

(03:28:06):

It feels to me like there’s really clearly a thing, there’s something going on with… I do actually expect to continue to care about something like consciousness quite a lot on reflection, and to not end up deciding that my ethics is better off making no reference to that, or at least to some things quite nearby to consciousness. When I stub my toe… something happens when I stub my toe. It’s unclear exactly how to name it, but there’s something about that I’m pretty focused on. So if you’re like, well, where do things go? I should be clear: I have a bunch of credence that in the end we end up caring a bunch about consciousness just directly.

(03:28:56):

And so if we don’t, where will ethics go? Where will a completed philosophy of mind go? Very hard to say. I mean, a move that people might make, if you get a little bit less interested in the notion of consciousness, is toward something slightly more animistic. So what’s going on with the tree? Maybe you’re not talking about it as a conscious entity necessarily, but it’s also not totally unaware or something. The consciousness discourse is rife with these funny cases where it’s like, oh, those criteria imply that this totally weird entity would be conscious, or something like that, especially if you’re interested in some notion of agency or preferences. A lot of things can be agents, corporations, all sorts of things… are corporations conscious? And it’s like, oh man. So one place it could go, in theory, is that in some sense you start to view the world as animated by moral significance in richer and subtler structures than we’re used to. And so plants, or weird optimization processes, or outflows of complexity… who knows exactly what you end up seeing as infused with the sort of thing that you ultimately care about, but I think it’s possible that it doesn’t map onto, that it includes, a bunch of stuff that we don’t normally ascribe consciousness to.

Dwarkesh Patel (03:30:24):

Yeah, it’d be crazy if the acid-trip understanding of what’s going on in the world, the “ah, I see, the tree is…”, you know what I mean? Where everything has a valence, and oh, in fact, that was the thing all along. This actually goes back to the deep atheism thing that I find really interesting. I think a big part of the Lovecraftian horror, or one version of it we could find, is: oh, we’ve realized what we mean, and maybe suffering isn’t the right word for what we’ll find or what we’ll mean when we understand it, but the universe is just filled with either a literally infinite or an almost unfathomable amount of… animism is true and everything is like, you know what I mean?

Joe Carlsmith (03:31:14):

Yeah. So it’s like animism, but it’s horrible or something like that. That sounds rough. That sounds scary. Got to see the truth and respond appropriately.

The implications of contingency and ongoing discovery in science

Dwarkesh Patel (03:31:23):

I think when you use the phrase “a completed philosophy of mind,” and presumably after that a more complete ethics, even the notion of a reflective equilibrium implies, oh, you’ll be done with it at some point. You sum up all the numbers and then you’ve got the thing you care about. This might be unrelated to the same sense we have in science, but I think the vibe you get when you’re talking about these kinds of questions is that, oh, we’re rushing through all the science right now, we’ve been churning through it, and it’s getting harder to find new things because there’s some cap. You find all the things at some point. Right now it’s super easy because a semi-intelligent species has barely emerged, and the ASI will just rush through everything incredibly fast, and then you’ll either have aligned its heart or not. In either case it’ll use what it’s figured out about what is really going on, and then expand through the universe and exploit: do the tiling, or maybe a more benevolent version of the, quote, tiling. That feels like the basic picture of what’s going on. We had dinner with Michael Nielsen a few months ago, and his view is that this just keeps going forever, or close to forever. How much would it change your understanding of what’s going to happen in the future if you were convinced that Nielsen is right about his picture of science?

Joe Carlsmith (03:32:53):

Yeah, I think there are a few different aspects. I don’t claim to really understand Michael’s picture here, but my memory of this conversation was sort of like: sure, you get the fundamental laws. My impression was that he expects the physics to get solved or something, maybe modulo the expensiveness of certain experiments. But the difficulty is: even granted that you have the basic laws down, that still doesn’t let you predict where at the macro scale various useful technologies will be located. There’s still this big search problem. So my memory, though I’ll let him speak for himself on what his take is, was: sure, you get the fundamental stuff, but that doesn’t mean you get the same tech. I’m not sure if that’s true.

(03:33:46):

If that’s true, what difference would it make? One difference is that it means you have to, in a more ongoing way, make trade-offs between investing in further knowledge and further exploration versus exploiting, as you say, acting on your existing knowledge, because you can’t get to a point where you’re like: and… we’re done. Now, as I think about it, I sort of suspect that was always true. I remember talking to someone; I think I was like, ah, at least in the future we should really get all the knowledge. And he’s like, well, what do you want, you want to know the output of every Turing machine? In some sense it’s a question of what it would actually be to have completed knowledge. And I think that’s a rich question in its own right. It’s not necessarily that we should imagine, on any picture, that you’ve got everything; and on any picture, in some sense, you could end up with a case where you cap out: there’s some collider that you can’t build, or something is too expensive or whatever, and everyone caps out there.

(03:35:02):

I guess one way to put it is: there’s a question of whether you cap out, and then there’s a question of how contingent the place you reach is. If there’s more contingency, one prediction that makes is that you’ll see more diversity across our universe or something. If there are aliens, they might have quite different tech. And so maybe if people meet, you don’t expect them to be like, oh, you got your thing, we got our version. Instead it’s like: whoa, that thing, wow! So that’s one thing.

(03:35:31):

If you expect more ongoing discovery of tech, then you might also expect more ongoing change and upheaval and churn, insofar as technology is one thing that really drives change in civilization. So that could be another difference. People sometimes talk about lock-in, where they envision this point at which civilization has settled into some structure or equilibrium or something, and maybe you get less of that. I think that’s maybe more about the pace rather than contingency or caps, but that’s another factor. So yeah, I think it’s interesting. I don’t know if it changes the picture of Earth’s civilization fundamentally. We still have to make trade-offs about how much to invest in research versus acting on our existing knowledge. But I think it has some significance.

Dwarkesh Patel (03:36:19):

This is another one of those things where the epistemics feel more like a hash map, because, oh, I didn’t think about what the growth model for the future should look like… oh, the picture of the future is totally different. Just one more thing where I’m like, oh yeah, maybe it is epistemic hell. If it is, I don’t think the correct answer is just to throw up your hands, but it might influence you. Maybe there are certain things you would do at the margin if your uncertainty was higher, a way in which you might make a trade-off in favor of keeping options open instead of some other thing you might care about. And it might influence how much of that you do.

Joe Carlsmith (03:37:00):

And I should say, you brought up the connection with this notion of reflection, and I do feel some hesitation and guilt about talking as though this is a solid notion, like: yeah, there’s this thing, our values on reflection, of course, and that’s the criterion for our action. I actually don’t think that’s quite right. I think there’s a burden of active choice on us now; we can’t defer to our hypothetical selves, partly because our hypothetical selves might not agree. And that seems especially true to the extent that when you say, ah, what would you think on reflection, well, how far does that go? Are you talking about a version of me with a brain the size of a galaxy? Why a galaxy? Why stop at a galaxy? Let it go, and maybe it stabilizes, maybe it doesn’t. Which galaxy brain? How did you get there? If you really think about it… it could be the case that no matter what, any way of growing and changing and reflecting would all come to the same place, and so it’s like, that’s just the Joe thing, it always goes to that. But if it goes in different directions, now it’s like, okay, and there’s no terminus, right? I mean, you can imagine, oh yeah, and then you had complete information… complete information, what on earth is that? Right? So I just want to flag: we’ve been using this notion, and I feel guilty about this because it’s convenient, and I think there’s something true about it, because we do have standards. We do defer to what we would think if we knew more and were more the people we wanted to be and stuff like that. But at the same time, I don’t think it’s this fixed thing; I don’t think it’s as simple as even I have been talking about it. So I just want to acknowledge that.

Dwarkesh Patel (03:38:52):

I think it’s exciting, this prospect. One vibe you get when you talk to people… we were at a party and somebody mentioned this, we were talking about how uncertain we should be about the future, and they’re like: there are three things I’m uncertain about. What is consciousness, what is information theory, and what are the basic laws of physics? Once we get those, we’re done. And that has the vibe of, oh, you’ll figure out what’s the right kind of hedonium, and then… Whereas this picture, where you’re constantly churning through, has more of the flavor of the becoming that the attunement picture implies. I think it’s more exciting. It’s not just like, oh, you figured out all the things in the 21st century and then you just, you know what I mean?

Joe Carlsmith (03:39:41):

I sometimes think about two categories of views about this. There are the people who think: yeah, the knowledge, we’re almost there, we’ve basically got the picture. The picture is like, the knowledge is all just totally sitting there; you just have to be scientifically mature at all and then it’s all going to fall together. And then everything past that is going to be this super expensive, not super important thing.

(03:40:10):

And then there’s a different picture, which is much more of this ongoing mystery: ongoing, oh man, there’s going to be more and more, and you maybe expect more radical revisions to our worldview. And I think I’m kind of drawn to both. Physics… we’re pretty good at physics, or a lot of our physics is quite good at predicting a bunch of stuff, or at least that’s my impression. This is from reading some physicists, so who knows.

Dwarkesh Patel (03:40:37):

Your dad’s a physicist though, right?

Joe Carlsmith (03:40:39):

Yeah, but this isn’t coming from my dad. There’s a blog post, I think by Sean Carroll or something, where he’s like: we really understand a lot of the physics that governs the everyday world. A lot of it we’re really good at. And I’m like, ah, I’m generally pretty impressed by physics as a discipline; I think that could well be right. On the other hand… really? We’ve had a few centuries of… And it leads to a different… there’s something about the endless frontier, there’s a draw to that from an aesthetic perspective: the idea of continuing to discover stuff.

(03:41:10):

At the least, I think you can’t get full knowledge in some sense, because there’s always the question of what you’re going to do. There’s some way in which you’re part of the system, so the knowledge itself is part of the system, and it’s not clear… I dunno, imagine you try to have full knowledge of what the future of the universe will be like. Well, I don’t know, actually. I’m not totally sure that’s true, but…

Dwarkesh Patel (03:41:36):

It has a halting-problem kind of property.

Joe Carlsmith (03:41:37):

Right, there’s a little bit of a loopiness. I think there are probably fixed points, in that you could be like, yep, I’m going to do that, and then you’re right. But I at least have a question of: when people imagine the completion of knowledge, exactly how well does that work? I’m not sure.

Will our hearts recognize the goodness of Utopia?

Dwarkesh Patel (03:41:57):

You had a passage in your essay on Utopia where I think the vibe was more that the positive future we’re looking forward to will be… I’ll let you describe what you meant, but to me it felt more like the first picture, where you get the thing and then you’ve found the heart of the… Maybe can I ask you to read that passage real quick? That way it’ll spur the discussion.

Joe Carlsmith (03:42:32):

Quote: “I’m inclined to think that Utopia, however weird, would also be in a certain sense recognizable. That if we really understood and experienced it, we would see in it the same thing that made us sit bolt upright, long ago, when we first touched love, joy, beauty. That we would feel, in front of the bonfire, the heat of the ember from which it was lit. There would be, I think, a kind of remembering.”

Dwarkesh Patel (03:42:58):

How does that fit into this picture?

Joe Carlsmith (03:43:00):

I think it’s a good question. I think it’s some guess about… if there’s no part of me that recognizes it as good, then I think I’m not sure that it’s good according to me. So yeah, I mean there’s a question of what it takes for it to be the case that a part of you recognizes it as good. But I think if there’s really none of that, then I’m not sure it’s a reflection of my values at all.

Dwarkesh Patel (03:43:36):

There’s a tautological thing you can do where it’s like: if I went through the processes which led to me discovering it was good, which we might call reflection, then it was good. But by definition you ended up there because… you know what I mean?

Joe Carlsmith (03:43:50):

Yeah. You definitely don’t want to be like, if you transform me into a paperclipper gradually, then I will eventually be like, and then I saw the light, I saw the true paperclips. But that’s part of what’s complicated about this thing about reflection. You have to find some way of differentiating between the development processes that preserve what you care about, and the development processes that don’t. And that in itself is a fraught question, which itself requires taking some stand on what you care about and what sorts of meta-processes you endorse and all sorts of things. But you definitely shouldn’t just be like… it is not a sufficient criterion that the thing at the end thinks it got it right, because that’s compatible with having gone wildly off the rails.

The power of niceness

Dwarkesh Patel (03:44:34):

There was a very interesting sentence in one of your posts, where you said: “our hearts have, in fact, been shaped by power, so we should not be at all surprised if the stuff we love is also powerful.” What’s going on there? What did you mean by that?

Joe Carlsmith (03:44:59):

Yeah, so the context of that post is that I’m talking about this hazy cluster, which I call in the essay “niceness/liberalism/boundaries.” It’s a somewhat more minimal set of cooperative norms, involved in respecting the boundaries of others, and cooperation and peace amongst differences, and tolerance, and stuff like that, as opposed to your favorite structure of matter, which is sometimes the paradigm of values that people use in the context of AI risk. And I talk for a while about the ethical virtues of these norms. But it’s pretty clear that, also, why do we have these norms? One important feature of these norms is that they’re effective and powerful. Secure boundaries save resources otherwise wasted on conflict. And liberal societies are often better to live in, better to immigrate to, more productive, all sorts of things. Nice people are better to interact with, better to trade with, all sorts of things. And I think it’s pretty clear, if you look at why at a political level we have various political institutions, and if you look more deeply into our evolutionary past and how our moral cognition is structured, that various forms of cooperation and game-theoretic dynamics and other things went into shaping what we now, at least in certain contexts, also treat as an intrinsic or terminal value. So some of these values that have instrumental functions in our society also get kind of reified in our cognition as intrinsic values in themselves. And I think that’s okay. I don’t think that’s a debunking, that your values are something that stuck and got treated as terminally important. But I think it means that… in the context of the series, where I’m talking about deep atheism and the relationship between what we’re pushing for and what nature is pushing for, or what pure power is pushing for, it’s easy to say, well, there’s paperclips, which is just one place you can steer, and pleasure is another place you can steer, and these are just arbitrary directions. Whereas I think some of our other values are much more structured around cooperation, and around things that are also effective and functional and powerful. And so that’s what I mean there: I think there’s a way in which nature is a little bit more on our side than you might think, because part of who we are has been made by nature’s way, and so that is in us now. I don’t think that’s necessarily enough for us to beat the grey goo. We have some amount of power built into our values, but that doesn’t mean it’s going to be arbitrarily competitive. But I think it’s still important to keep in mind.

(03:48:08):

And I think it’s important to keep in mind in the context of integrating AIs into our society. We’ve been talking a lot about the ethics of this, but I think there are also instrumental and practical reasons to want to have forms of social harmony and cooperation with AIs with different values. And I think we need to be taking that seriously, and thinking about what it is to do that in a way that’s genuinely legitimate: a project that is a just incorporation of these beings into our civilization, such that they can all… or sorry, there’s the justice part, and there’s also: is it a good deal? Is it a good bargain for people? To the extent we’re very concerned about AIs rebelling or something like that, well, a thing you can do is make civilization better for someone. And I think that’s an important feature of how we have in fact structured a lot of our political institutions and norms and stuff like that. So that’s the thing I’m getting at in that quote.

p(God)?

Dwarkesh Patel (03:49:17):

I think that was a great place to close that. Final question: what is p(God)?

Joe Carlsmith (03:49:24):

So let’s talk about a few different things we can mean there. I actually use the word “God” a lot in the series. And when I use it in the series, I mean something like Ultimate Reality, the Ground of Being, the Real, where the Real is understood as this thing beyond us; that precedes us; in virtue of which we exist, but not vice versa. Now we can talk about: is there some metaphysical confusion there? A lot of people will just get really resistant even to that sort of talk. I’m kinda like: it really looks like something like that is a thing. There’s definitely a beyondness, something that precedes… even just, I think the concept of the universe is fine for this, but I’m talking about there’s an explanatory chain, is what I’m imagining, and ultimate reality is the bedrock. Now, maybe there’s no bedrock. We can talk about whether this is a confused notion, but that’s a notion that I do use in the series.

(03:50:32):

And people get fussed about this. Some people, they’re like: it’s confusing, maybe, to use the word God in that way. I think enough of our spiritual and religious tradition is trying to orient towards that thing, trying to orient towards this otherness, this thing that we do not control that made us, that we didn’t make, even absent other properties that are often associated with God, that I at least find it productive to continue to use that language. But I do recognize it can be confusing. So that’s one thing.

(03:51:06):

I think that’s real. I think there’s a reality. Obviously we can get fussed about that.

(03:51:16):

Then there’s a different thing, which is: okay, suppose you grant that there’s an ultimate reality of some kind. Is it well understood as a mind, or a person? Does it have mental traits, or even some weird extrapolated analog of mental traits? And I’m like, I don’t see why you’d think that. In particular, on my conception of minds, there’s functional machinery that’s required to have a mind. You have to have memory, there’s an actual way your brain works, and I’m like, ah, I doubt there’s something like that at the bedrock of reality. I dunno why you’d think that. Now, there are views like pantheism, weird consciousness views which want to build consciousness into the bedrock of reality. I’m skeptical. I think that doesn’t do justice to this sense in which our… well, I dunno, I’m skeptical of that. But that’s a view you could have. And then, obviously, in some sense some parts of reality are conscious. So there’s a sense in which God is conscious in some places.

(03:52:23):

So I’m not on board with: ultimate reality is a mind. And then, more important, and this is something that comes up a lot in the series, I don’t think that ultimate reality is wholly good. And I think that’s what actually counts. I think that’s the deepest and most interesting claim of… at least, I spent a lot of time early in undergrad engaging with Christianity; a lot of other religious traditions will talk about this differently. But in Christianity at least, I think there’s this very interesting claim, which is that at the foundation of being, reality and goodness are fully unified. That the ultimate, underneath everything, is perfection, and that perfection explains everything in the world. And I don’t think that’s true. I don’t think that’s true. And I think it’s the deepest objection to Christianity, the problem of evil. I have a post about the problem of evil and the way it shows up even if you try to let go of various other aspects; I think it’s actually a problem for spirituality as a whole, that there’s an affirmation towards ultimate reality that you can’t have in virtue of evil. A lot of the series gets into this, and the final post, called “Loving a World you Don’t Trust,” is about this dynamic. So, that one: I think it’s not the case that the bedrock of reality is wholly good. I wish it were.

(03:54:18):

So those are the God things. And then we can talk about specific religious claims… Especially when I was doing a lot of stuff with Christianity, the thing I would ask people is: do you think that the historical figure of Jesus Christ was bodily resurrected? And there I’m like, I don’t think that.

Dwarkesh Patel (03:54:36):

What probability do you give to a mind that is smarter than us, because I think it’s not just, oh, I don’t know, a slug-type mind; it’s a more powerful kind of mind… that kind of God being behind reality? If you had to give it a number.

Joe Carlsmith (03:54:54):

A literal person doing stuff, making choices? We’re not talking about some mystical, oh, emptiness, it’s fundamentally luminous. We’re talking about: there’s a dude, a dude at the bedrock.

(03:55:08):

Dwarkesh Patel: Or, a mind, yeah.

(03:55:11):

Joe Carlsmith: Low probabilities are tough.

Dwarkesh Patel (03:55:16):

Less than 0.1%?

Joe Carlsmith (03:55:17):

Yeah.

Dwarkesh Patel (03:55:18):

Okay. Has that number gone up or down the more philosophy you’ve studied?

Joe Carlsmith (03:55:23):

I think it’s gone down. As I say, I had this period where I was exploring Christianity quite seriously, and at that time I was thinking a lot more about this and taking it more seriously. And then I guess in some sense it was out of that exploration that I got really interested in philosophy. I started to feel like: if I’m going to think seriously about these sorts of issues, I really want to have the full toolkit and rigor that this… Initially I’d been very turned off by the analytic tradition. I was like: ugh, it has no soul, it has no spirit. But actually, I think… so the series is dedicated to this Nietzsche scholar, Walter Kaufmann, who I was reading around this transition in my life when I decided to study philosophy. He, for me, was this image of… he’s very strong on this union of rigor and spirit. And I think that was what convinced me to study philosophy. And yeah, since then I think my credence on a traditional Christian-like God has gone down.

Dwarkesh Patel (03:56:40):

What were you doing before? So you were considering Christianity at what stage in your life, exactly? College?

Joe Carlsmith (03:56:46):

Yeah. I guess this was the first two years of undergrad. I was really into religion.

(03:56:52):

Dwarkesh Patel: And what were you studying at the time?

(03:56:53):

Joe Carlsmith: At the time I was a humanities major, and then I switched my major to philosophy.

Dwarkesh Patel (03:56:57):

Okay. I think that’s an excellent place to close.

(03:56:59):

Joe Carlsmith: Great. Thank you so much.

(03:57:00):

Dwarkesh Patel: Joe, thanks so much for coming on the podcast. We discussed the ideas in the series, but I think people might not appreciate, if they haven’t read the series, how beautifully written it is. We didn’t cover everything; there are a bunch of very, very interesting ideas, things that, as somebody who has talked to people about AI for a while, I haven’t encountered anywhere else. And obviously no other part of the AI discourse is nearly as well-written. It is genuinely a beautiful experience to listen to the podcast version, which is in your own voice, so I highly recommend people do that. It’s joecarlsmith.com where they can access this. Joe, thanks so much for coming on the podcast.

(03:57:42):

Joe Carlsmith: Thank you for having me. I really enjoyed it.

1

Dwarkesh's team made some edits to the conversation's original audio -- for example, to smooth out various transitions. I'm not sure exactly what got cut here, but I think the cuts were pretty minimal. I also made one edit myself, to cut a section where I misspoke.