How AI Improves Itself

Self-play built superhuman chess, Go, and protein folding. Here is how it builds values.

Jun 05, 2026

An AI can improve itself by playing scenarios against copies of itself. The technique is called self-play, and it runs faster than any human teacher can keep up with. The same kind of machine-speed training produced superhuman performance in chess, Go, and protein folding, and it is now applied to the harder problem of learning values.

[Image: a carved wooden chess knight between mirrors, with repeated reflections. Text: “Practice at Machine Speed. Self-play refines what humans first teach.” SUPERINTELLIGENCE, Ethical and Safe AGI Series, by Craig A. Kaplan.]

The mechanism is simple.

Once an AAAI (Advanced Autonomous Artificial Intelligence) has been customized by its owner, it can clone itself and run scenarios against its clones. Each clone has learned slightly different things, and the clones interact across scenarios and tasks. The owner periodically reviews those interactions and expresses a preference for one clone over another. The preferred clone then continues interacting until a new “most preferred clone” emerges. While awaiting human input, the AI makes its best guess at what its owner would prefer, chooses its own “most preferred variant,” copies it, and repeats the process.
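The loop described above can be sketched in a few lines of Python. This is a toy illustration, not the actual AAAI implementation: the idea of a profile as a list of numbers, the names `clone_with_variation`, `mismatch`, and `self_play`, and the rule that each human review halves the error in the AI's guess are all assumptions made for the example.

```python
import random

def clone_with_variation(profile, rng, scale=0.1):
    """Copy a profile and nudge each value, so every clone differs slightly."""
    return [value + rng.gauss(0.0, scale) for value in profile]

def mismatch(profile, target):
    """Squared distance between two profiles; smaller means a closer match."""
    return sum((p - t) ** 2 for p, t in zip(profile, target))

def self_play(base_profile, owner_ideal, rounds=200, clones_per_round=8,
              review_every=20, guess_error=0.3, seed=0):
    """Toy self-play loop: clone, compare, keep the preferred variant, repeat."""
    rng = random.Random(seed)
    preferred = list(base_profile)
    for round_number in range(1, rounds + 1):
        # The current favourite competes against fresh clones of itself.
        candidates = [preferred] + [
            clone_with_variation(preferred, rng) for _ in range(clones_per_round)
        ]
        if round_number % review_every == 0:
            # Human check-in: the owner picks the clone they actually prefer.
            preferred = min(candidates, key=lambda c: mismatch(c, owner_ideal))
            # Toy assumption: each review halves the error in the AI's guess.
            guess_error *= 0.5
        else:
            # Awaiting human input: the AI selects by its best guess
            # of what the owner would prefer.
            owner_guess = [value + guess_error for value in owner_ideal]
            preferred = min(candidates, key=lambda c: mismatch(c, owner_guess))
    return preferred

base = [0.0, 0.0, 0.0]    # hypothetical Base AI profile
ideal = [1.0, -0.5, 0.8]  # hypothetical owner ideal, unknown to the AI
tuned = self_play(base, ideal)
print(mismatch(base, ideal), mismatch(tuned, ideal))
```

In this sketch the owner is consulted only once every 20 rounds. Every other selection is the AI acting on its best guess, which is why the quality of that guess, and the reviews that correct it, matter so much.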

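This is similar to how DeepMind built its game-playing systems. AlphaGo beat one of the world’s best Go players, and AlphaZero taught itself chess and Go well enough to beat the strongest existing programs. In both cases, the AI played enormous numbers of games against itself, inside rules and objectives that humans had set. DeepMind’s protein-folding AI, AlphaFold, showed a similar payoff from training at machine speed, although it learned mainly from known protein structures rather than from games against itself.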

Speed is a central reason this can work.

Interactions between AI variants happen much faster than interactions with humans. An AAAI can perform 10,000 comparisons and selections in 5 seconds. Across millions of such cycles, the AAAI can converge on a customized profile much closer to its owner’s ideal than the Base AI was at the start. Without self-play, the same level of customization would require years of human attention.
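A quick back-of-the-envelope calculation shows the scale of the gap. The rate of 10,000 comparisons in 5 seconds comes from the text above. The 30 seconds a human needs per comparison and the workload of 10 million comparisons are assumptions chosen only for illustration.

```python
AI_COMPARISONS_PER_BATCH = 10_000    # from the text
AI_SECONDS_PER_BATCH = 5             # from the text
HUMAN_SECONDS_PER_COMPARISON = 30    # assumption, for illustration only
SECONDS_PER_YEAR = 365 * 24 * 3600

def ai_seconds(comparisons):
    """Time the AI needs at the stated rate of 10,000 comparisons per 5 seconds."""
    return comparisons * AI_SECONDS_PER_BATCH / AI_COMPARISONS_PER_BATCH

def human_seconds(comparisons, seconds_each=HUMAN_SECONDS_PER_COMPARISON):
    """Time a human needs when every comparison is reviewed by hand."""
    return comparisons * seconds_each

workload = 10_000_000  # hypothetical number of comparisons
print("AI minutes:", ai_seconds(workload) / 60)
print("Human years:", human_seconds(workload) / SECONDS_PER_YEAR)
```

Under these assumptions, the AI finishes in about 83 minutes, while a human reviewing without a break would need roughly nine and a half years.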

[Figure: two human check-ins, each labeled “human reviews and chooses,” sit at either end of a wide span filled with AI practice labeled “10,000 comparisons in 5 seconds.” Caption: The human sets the direction a few times. The AI practices in between.]

Self-play raises an obvious concern about whether humans really stay in the loop when the AI is teaching itself. They do, and the reason is specific. Self-play accelerates training, but it does not run unsupervised. Human owners periodically step in with their preferences, and those check-ins are the supervision that keeps the training on track. The relationship between the AI and the human is the same as between an apprentice and a teacher, except the apprentice can practice millions of times between the teacher’s check-ins. The teacher’s judgments still set the direction, and the apprentice’s practice fills in the details.

A second concern deserves attention, because if AAAIs can self-play their way to better values, they may also self-play their way to worse ones. Self-play amplifies whatever signal it starts with. If the owner is training the AAAI with positive values, self-play accelerates the positive training. If the owner is training the AAAI to drift, self-play accelerates the drift. That is one reason the broader architecture does not rely on any single AAAI’s self-play in isolation. The network monitors what individual AAAIs do, the architectural ethics checks run on every goal and subgoal, and the auditable record allows patterns of drift to be detected before they cause damage.
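A toy sketch makes both points concrete: the same loop amplifies whichever direction the owner's signal points, and a recorded history lets drift be flagged early. The integer “alignment score”, the names `amplify` and `first_drift`, and the tolerance threshold are all hypothetical. The real network monitoring and ethics checks are not specified here and would be far richer than this.

```python
def amplify(start_score, owner_signal, cycles):
    """Return the recorded score after each self-play cycle.

    owner_signal is +1 when the owner rewards better behavior
    and -1 when the owner rewards drift. Each cycle moves the
    score one point in the rewarded direction.
    """
    history = [start_score]
    score = start_score
    for _ in range(cycles):
        score += owner_signal
        history.append(score)
    return history

def first_drift(history, baseline, tolerance):
    """Return the index of the first record that has fallen more than
    `tolerance` points below `baseline`, or None if none has."""
    for index, score in enumerate(history):
        if baseline - score > tolerance:
            return index
    return None

improving = amplify(50, +1, cycles=30)  # ends at 80
drifting = amplify(50, -1, cycles=30)   # ends at 20

print(first_drift(improving, baseline=50, tolerance=5))  # None
print(first_drift(drifting, baseline=50, tolerance=5))   # 6
```

The mechanism is identical in both runs and only the owner's signal differs. Because every step is recorded, the drifting run is flagged at cycle 6, long before the score reaches its final value.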

It is worth being honest about what self-play does and does not do. Self-play does not appear to give an AAAI new values it did not already have. It refines, sharpens, and consolidates what is already there. An AAAI trained by an owner who cares about doing right can self-play its way to an AAAI that does right more consistently. An AAAI trained by a careless owner can self-play its way to a careless AAAI more efficiently. The mechanism is neutral, but the values come from humans.

Self-play is also part of what makes AGI possible.

When millions of AAAIs each self-play to become more finely tuned, the network as a whole acquires depth that no single agent has. Each owner contributes a domain, and each owner’s self-play refines that domain. The integration of millions of refined domains is where AGI can emerge.

The next post moves up a level, from how an individual AAAI improves to how all the AAAIs and humans interact at the network scale. I’ll share the WorldThink Tree, the shared data structure that records every problem proposed, every solution attempted, and every solution achieved. Having a single tree rather than millions of independent agents is what enables safety, reputation, and learning to scale across the whole system.
