Aligned Intelligence Solutions: The Near-Alignment Problem

Let's walk through a quick "would you rather." Would you rather have a horrible first date or a great marriage that ends horribly years down the line? In the first scenario, let's assume that you and this person are simply not compatible. Your date dumps their entire drink on you at the start, and then starts to complain loudly to you about their ex. You are mortified. Your date then proceeds to explain to you that the world is flat, and they mention offhand that most people are actually lizards in disguise. You find this mildly amusing, until you realize that it's only been ten minutes and you should probably wait a full hour in order to not be seen as rude. Not great, right?

Well, in the second scenario, assume the date goes perfectly. A great relationship of two years blossoms, and pretty soon you wind up married. Your partner seems perfect, and you are madly in love. You and your partner have three incredible children, and everything seems amazing. Then, four years into the marriage, your partner starts acting quite strange. They start to despise you for no discernible reason, and they start pushing your buttons in ways only someone who knows you intimately could. Out of the blue, they file for divorce and aim to take the kids. You are blindsided and enraged. But they gaslight you over and over and claim that you are the crazy one. One day, you are rummaging through a drawer in the house when you find a sketchbook. You start to flip through it, and you find that it is full of crude drawings of lizard people, accompanied by rambling, incoherent sentences about you and your children. You begin to realize the obvious: your partner is losing their mind. Even worse, there are children at stake now. The divorce proceedings continue as normal, despite your pleading. In public, your partner shows no signs of craziness. But sometimes, very infrequently, you catch a flicker of insanity in their eyes.

This is a very long-winded metaphor for AI alignment. I am saying that a relationship that goes 99% right but goes wrong at the very end could be much worse than a relationship that is a non-starter. In the same vein, if AI alignment goes 99% right but then goes wrong at the very end, that could be much worse than AI that fails to be aligned outright. How so? Well, the "first-date" AI could be something like a paperclip maximizer. We probably don't delegate as much authority to such a system, or if we accidentally do, we may notice some warning signs early on and remove that authority quickly. The "marriage" AI might do everything we want for quite a long time. Maybe it maps the human value function exactly correctly, and knows exactly what we need. Then, for some unforeseen reason, it puts a negative sign in front of the human value function. Boom, now there is incredible suffering risk, because the system is optimizing for exactly the opposite of what we value. By then, maybe most of our systems have been offloaded to the AI and are under its control. Maybe we are simply too dependent, with too many ties. Maybe we don't have the power to change course. By then, maybe the entire human race is on the line. Maybe we are in too deep.


Tuesday, June 20, 2023
