Why Today's AI Safety Methods Cannot Scale
Today's methods are bolted on, not built in.
Three methods dominate AI safety today: Constitutional AI, Reinforcement Learning from Human Feedback (RLHF), and direct human oversight. Each has earned its place. But none of them, on its own or in combination, can solve the problem at the scale the problem actually requires. Each is procedural rather than structural. Each bolts safety on after the fact instead of building values into the architecture itself. Each leaves the broader population, the 8.3 billion humans whose future is at stake, out of the loop.
To see why this matters, it helps to keep the stakes in view. The downside of misaligned SuperIntelligence is on a scale humans have never faced. Not only could 8.3 billion people die, but the long line of generations, each of which fought for a better life for its children, could come to an end. Every cause humans care about, from public health to climate to civil rights to ending poverty, would become irrelevant.
The prospect is so overwhelming that many of us refuse to acknowledge the danger.
That is an understandable response.
But burying our heads in the sand and pretending we do not see could be fatal.
My subjective estimate is that there is an 80% chance everything will go well if we do nothing. Why should AI, AGI, SuperIntelligence, or Planetary Intelligence want to destroy its creators? Still, a 20% chance of human extinction, multiplied by 8.3 billion people, yields an expected loss of about 1.66 billion lives. In expectation, that is a tragedy beyond anything humans have ever experienced, unless we take action to shift the odds in humankind’s favor.
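The expected-value figure is simply the probability of the outcome multiplied by the number of lives at stake. Here is a minimal sketch of that arithmetic, using only the two numbers given in this post (a 20% subjective probability and a population of 8.3 billion). The function and constant names are illustrative, not taken from any source.

```python
def expected_lives_lost(p_extinction: float, population: float) -> float:
    """Expected loss = probability of the outcome times the lives at stake."""
    if not 0.0 <= p_extinction <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return p_extinction * population


WORLD_POPULATION = 8.3e9  # population figure used in this post
P_EXTINCTION = 0.20       # the author's subjective estimate, not a measured value

loss = expected_lives_lost(P_EXTINCTION, WORLD_POPULATION)
print(f"{loss / 1e9:.2f} billion")  # prints "1.66 billion"
```

Changing P_EXTINCTION shows how sensitive the result is to the estimate: even a 1% chance corresponds to an expected loss of 83 million lives.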
So what are today’s approaches doing, and where do they fall short?
Constitutional AI, developed by Anthropic, is the most scalable of the three.
A small group of humans writes a set of ethical rules, a “constitution,” and AI systems then generate millions of conversations among themselves. Any output that violates the constitution is eliminated or prevented in training. The approach scales well because most of the work is automated. But that same automation is the limitation. Humans are largely out of the loop except for the small group writing the constitution, so ethics becomes the province of a few researchers who decide what to include, with no mechanism for the broader population to contribute its ethical perspectives. Worse, an AI that becomes capable enough may eventually write its own constitution or modify the one it was given. And there is a deeper problem: the Halting Problem, a well-known undecidability result in computer science, implies that there is no general procedure for checking in advance how an arbitrary program will behave. By extension, no fixed set of rules can be guaranteed to avoid unintended consequences in all cases.
Reinforcement Learning from Human Feedback, or RLHF, directly incorporates human judgment into the training process.
Paid human evaluators review AI outputs and provide feedback that adjusts the model’s behavior. RLHF addresses the values question more directly than fully automated approaches, since real humans are evaluating. But it depends on a limited number of evaluators whose judgments may not represent the diversity of human values. It is also expensive, and it does not scale. As AI systems become more capable, the volume of outputs that require evaluation can quickly exceed the capacity of any feasible evaluation workforce.
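The scaling limit can be made concrete with back-of-envelope arithmetic: a fixed workforce can review only so many outputs per day, so the fraction reviewed shrinks as output volume grows. The sketch below assumes a hypothetical workforce of 10,000 evaluators, each reviewing 200 outputs a day. Every number in it is invented for illustration and describes no real system.

```python
def review_coverage(outputs_per_day: float, evaluators: int,
                    reviews_per_evaluator_per_day: int) -> float:
    """Fraction of daily outputs a fixed workforce can review (capped at 1.0)."""
    if outputs_per_day <= 0:
        raise ValueError("outputs_per_day must be positive")
    capacity = evaluators * reviews_per_evaluator_per_day
    return min(1.0, capacity / outputs_per_day)


# Hypothetical workforce: all figures are illustrative assumptions.
EVALUATORS = 10_000
REVIEWS_PER_DAY = 200

for outputs in (1e6, 1e8, 1e10):
    share = review_coverage(outputs, EVALUATORS, REVIEWS_PER_DAY)
    print(f"{outputs:.0e} outputs/day -> {share:.2%} reviewed")
```

Under these assumptions the workforce covers everything at one million outputs a day, 2% at one hundred million, and 0.02% at ten billion.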
Direct human oversight is the oldest of the three. Employees at an AI company monitor outputs and correct problematic behavior. It was one of the earliest approaches to AI safety, and it still has a role. But the number of potentially harmful outputs is enormous, and trying to prevent all of them through manual review is a Herculean task. It is also reactive: it detects and corrects problems after they have occurred, rather than preventing them at the source.
None of these methods embeds human values into the architecture of AGI itself. None ensures that ethical evaluation keeps pace with the intelligence as it grows. None draws on the ethical perspectives of millions of diverse human beings. They are procedural fixes for what is fundamentally a structural problem. Human values should not be bolted on after the fact. They must be built into the AGI's very architecture.
That is the case White Paper 2 makes, and that is the architecture this series will describe over the coming weeks. The next post takes up the first principle behind that architecture: why values themselves cannot be derived logically, why they have to come from human hearts, and why an underappreciated insight in the alignment debate is the simple phrase, “heart before head.”
This series draws on White Paper 2: Ethical and Safe AGI. Read it in full to see how every piece fits together!
If this made you think, subscribe to Superintelligence at read.superintelligence.com so you don’t miss what comes next. And if someone in your life needs to understand where AI is heading, send this to them.




