The Honest Limits of AI Safety and Alignment

No architecture can guarantee alignment once AGI far outthinks its creators

Jun 24, 2026

A dark editorial cover image with an open aged brass compass on the left and a large serif title on the right reading “What No Architecture Can Promise.” The subtitle says, “You cannot see the destination. You can still set the bearing.” Below it are the series branding, “SUPERINTELLIGENCE,” “Ethical and Safe AGI Series,” and “by Craig A. Kaplan.”

It is impossible to know exactly what will happen once an Advanced Autonomous Artificial Intelligence, the customized AI agent I refer to as AAAI, sets its own goals and begins to think far faster than the people who built it. No architecture, including this one, can guarantee that such a system stays perfectly aligned with human values under those conditions.

Anyone who promises more is promising something the problem does not allow. What can be done is to improve the odds. An AGI built the way this series describes is more likely to hold human values when several reinforcing conditions are met.

Four conditions matter most:

Trajectory. An AGI that learns human values progressively, through every stage of its development, from its first interaction with a human owner through billions of ethically evaluated problem-solving steps, is more likely to keep those values than one whose values were imposed at a single point. Values learned gradually and reinforced continuously become part of how the system operates. They are built into the way it approaches decisions, not parked in a separate rule layer that could be edited or switched off.
Breadth. An AGI whose values reflect the collective moral judgment of millions of diverse people resists the blind spots of any single perspective better than one whose values come from a small group. A flawed individual contribution can exist in such a system without dominating it. The values of any small group, however well-intentioned, are shaped by their time, place, and circumstance. The values of millions are not bound in the same way.
Redundancy. Safety mechanisms distributed across five subsystems, with ethics checks at every level and human oversight available as a backstop, create multiple layers of defense. If the values at the Customization level are bypassed, the Architecture-level checks may catch the resulting goals. If those checks are evaded, the Network-level reputation system may screen out the agent who evades them. If reputation tracking is compromised, the auditable record may surface the pattern in time to correct it. The Navy SEALs, who train where mistakes are fatal, have a saying about backup systems: two is one, and one is none. We need multiple backups, and the architecture has them.
Architectural embedding. Ethics checks that are part of the problem-solving process cannot be bypassed without turning off the process itself. Safety criteria can be implemented in a way that is hard to subvert, and embedding positive behavior broadly across the system makes it harder for a bad actor to override. A property woven through the whole architecture cannot be edited out the way a rule written in one place can, at least not without damaging the system that depends on it.
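The redundancy argument can be made concrete with a small calculation. What follows is a minimal sketch, not a measurement: the four layer names are taken from the fallback chain described above, but the 10 percent miss rate assigned to each layer is purely hypothetical, and the calculation assumes the layers fail independently of one another, which is a strong assumption.

```python
from math import prod

# Hypothetical miss rates: the chance that each layer fails to catch a problem.
# These numbers are illustrative assumptions, not figures from the series.
layer_miss_rates = {
    "customization_values": 0.10,
    "architecture_checks": 0.10,
    "network_reputation": 0.10,
    "auditable_record": 0.10,
}


def p_all_layers_fail(miss_rates):
    """Probability that every layer misses, ASSUMING the layers fail independently."""
    return prod(miss_rates)


single_layer = layer_miss_rates["customization_values"]
stacked = p_all_layers_fail(layer_miss_rates.values())

print(f"One layer alone misses with probability {single_layer:.4f}")
print(f"All four layers miss with probability   {stacked:.4f}")
```

Under these assumptions, stacking layers shrinks the chance of an undetected failure multiplicatively. If failures are correlated, for example when one flaw defeats several layers at once, the real benefit is smaller. That is one more reason redundancy improves the odds without guaranteeing the outcome.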

None of these mechanisms is enough on its own, but together they form the strongest approach available. When humans teach AGI positive values from the beginning and build ethical checks into the architecture of thought itself, there is good reason to expect those values to persist. No one can promise that outcome, and being honest means holding both the expectation and the uncertainty at once.

A line chart showing our confidence that alignment can be verified falling as AI capability rises beyond human understanding. Figures are illustrative and show the shape of the trend, not measured values.

The stakes justify every effort. Applied to a world population of roughly 8 billion, a 20 percent probability of misalignment yields 1.6 billion expected lives lost. Cutting that risk by even half saves 800 million lives in expectation. At that scale, anything that lowers the risk is worth doing, even without a guarantee.
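The arithmetic behind these figures can be sketched in a few lines. The sketch assumes that the population at stake is roughly 8 billion people, which is the value the 1.6 billion figure implies, and that misalignment puts everyone at risk. The function name and constant are illustrative.

```python
# Assumption: roughly 8 billion people at stake (implied by 1.6 billion / 0.20).
WORLD_POPULATION = 8_000_000_000


def expected_lives_lost(p_misalignment, population=WORLD_POPULATION):
    """Expected value: probability of misalignment times the population at risk."""
    return p_misalignment * population


baseline = expected_lives_lost(0.20)  # the 20 percent scenario from the text
halved = expected_lives_lost(0.10)    # the same risk cut in half
saved = baseline - halved             # expected lives saved by halving the risk

print(f"Expected lives lost at 20 percent: {baseline:,.0f}")
print(f"Expected lives lost at 10 percent: {halved:,.0f}")
print(f"Expected lives saved by halving:   {saved:,.0f}")
```

Because expected loss scales linearly with probability, every percentage point of risk removed is worth about 80 million lives in expectation under these assumptions.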

Technology cannot solve this part. It is on people, not on machines. The chief executive who builds the system and the person who uses it both have to act with some intelligence and some decency. The architecture this series describes can give human values a path into AGI. It cannot improve those values beyond what they already are. That part of the work belongs to us.

Start AI on the right path.

An AI that begins well is far likelier to retain values we recognize. The window for that is open now and closing, as the dominant approach hardens infrastructure and incentives around a different design, making a change of course steadily harder. The full plans are free at SuperIntelligence.com, available for responsible research and safe implementation.

What’s Next
This is the end of the Ethical and Safe AGI series, which was based on White Paper 2: Ethical and Safe AGI. The next series, based on White Paper 3: Human Centered AGI, explores what the world could look like if my approach works, and how everyday life changes when millions of people help shape SuperIntelligence.
