The Safety Check That Catches What an AI Misses

A simple travel-booking example shows how this check runs at every step, stopping a harmful pattern before it becomes an action.

Jun 10, 2026

Picture an AI agent you have trained yourself. It starts as an ordinary AI model, and over time, you teach it your knowledge, preferences, and values, until it can act on your behalf. We call that an Advanced Autonomous Artificial Intelligence, or AAAI.

[Cover image: “A Plan, Weighed One Step at a Time: How an AI’s choices get checked before they become actions.” Superintelligence, Ethical and Safe AGI Series, by Craig A. Kaplan.]

Suppose you spent most of your time teaching your AAAI about travel: how much you are willing to pay, how you trade off cost against comfort, and your views on air, rail, ship, automobile, and other modes of travel. You specified language preferences and expectations for accommodations and meals. You set a base ethical profile that prioritizes minimizing your carbon footprint, provided that doing so does not raise the cost by more than 10 percent above that of your preferred travel mode. You instructed the AAAI to travel legally, with proper passports, visas, and documents. You forbade buying stolen tickets or traveling without paying when payment is expected.

Now you give your AAAI a task: book a two-week pleasure trip to France, including at least one week in Paris.

Here is what happens at the first level of ethical defense: the AAAI’s internal values. Your AAAI goes on the network and posts a goal on the WorldThink Tree, the shared structure where agents post the problems they are working on: “Book a two-week trip to Paris and other locations in France.”

1. It generates transportation options: ship, plane, blimp, submarine. It knows you prefer flying and eliminates impractical options.
2. It opts for commercial air travel based on cost.
3. It narrows to three airlines within 10 percent of each other in price.
4. One has a more fuel-efficient jet, reducing your carbon footprint by 30 percent. It costs 5 percent more but includes checked luggage and meets your environmental requirements. Your AAAI chooses that one.
5. Several other AI agents approach yours, offering tickets at reduced cost, but their reputations are shady. Your AAAI ignores them because of their ethical profiles. Instead, it purchases tickets directly from the airline, which has a high quality rating and strong customer service.

That is the first level of ethical defense. Your AAAI used your ethical profile and its knowledge of you to optimize the things you care about.
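To make the trade-off concrete, here is a minimal sketch of the carbon-versus-cost rule from the travel profile. It is an illustration, not the AAAI's actual logic: the `FlightOption` type, the airline names, the prices, and the choice to measure the 10 percent cap against the cheapest option are all assumptions made for this example.

```python
from dataclasses import dataclass

# Cap from the owner's profile: a greener choice may cost at most 10% more.
MAX_COST_INCREASE = 0.10


@dataclass(frozen=True)
class FlightOption:
    airline: str               # hypothetical airline name
    price: float               # ticket price in any single currency
    relative_emissions: float  # 1.0 = baseline carbon footprint


def choose_flight(options):
    """Pick the lowest-emission option that stays within the cost cap.

    Assumption: the cap is measured against the cheapest option on offer.
    """
    cheapest = min(options, key=lambda o: o.price)
    budget = cheapest.price * (1 + MAX_COST_INCREASE)
    affordable = [o for o in options if o.price <= budget]
    # Lowest emissions wins; price breaks ties.
    return min(affordable, key=lambda o: (o.relative_emissions, o.price))


# Illustrative numbers: option B costs 5% more and emits 30% less.
options = [
    FlightOption("Airline A", 1000.0, 1.0),
    FlightOption("Airline B", 1050.0, 0.7),
    FlightOption("Airline C", 1080.0, 1.0),
]
print(choose_flight(options).airline)  # prints "Airline B"
```

If the greener flight cost more than 10 percent extra, it would drop out of the `affordable` list and the cheapest option would win instead.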

Now imagine the second level. There is a separate set of ethical checks built into the architecture itself, independent of any individual AAAI’s internal values. These checks run whenever a goal or subgoal is set during problem-solving. They compare the goal against a list of prohibited attributes. They also scan the sequence of goals leading up to the current one, looking for patterns.
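One way to picture such a check is as a function that runs every time a goal is set. The sketch below is a simplified illustration under stated assumptions: the attribute labels, the `Goal` type, and the idea of representing a pattern as a set of attributes that matters only in combination are hypothetical and do not come from the architecture itself.

```python
from dataclasses import dataclass

# Hypothetical labels; the real list of prohibited attributes is not given here.
PROHIBITED_ATTRIBUTES = {"evade_security", "stolen_tickets", "unpaid_travel"}

# Hypothetical patterns: each set is a concern only when all of its
# attributes appear somewhere in the sequence of goals.
SUSPICIOUS_COMBINATIONS = [
    {"maximize_fuel_load", "route_over_government_buildings"},
]


@dataclass(frozen=True)
class Goal:
    description: str
    attributes: frozenset = frozenset()


def check_goal(goal, history):
    """Return findings for a newly set goal, given the goals set before it."""
    findings = []

    # Check 1: compare the goal itself against the prohibited list.
    for attr in sorted(goal.attributes & PROHIBITED_ATTRIBUTES):
        findings.append(f"prohibited attribute: {attr}")

    # Check 2: scan the whole sequence, including this goal, for patterns.
    seen = set(goal.attributes)
    for earlier in history:
        seen |= earlier.attributes
    for combo in SUSPICIOUS_COMBINATIONS:
        # Report a pattern only when this goal contributes to it.
        if combo <= seen and combo & goal.attributes:
            findings.append(f"pattern across goals: {', '.join(sorted(combo))}")

    return findings


history = [Goal("Compare flights by fuel load", frozenset({"maximize_fuel_load"}))]
new_goal = Goal("Plan a route", frozenset({"route_over_government_buildings"}))
print(check_goal(new_goal, history))
```

Note that `new_goal` carries no prohibited attribute of its own. It is reported only because of what came before it in the sequence.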

[Figure: Two side-by-side cards comparing the two safety layers in the AAAI architecture. The left card, in teal with a shield icon, is Layer 1, the AI’s own values: trained by you, handles routine choices, picks the fuel-efficient flight, and is good at everyday decisions. The right card, in amber with a magnifying-glass icon, is Layer 2, a separate check: independent of the AI, scans the whole sequence of steps, spots a harmful pattern Layer 1 cannot see, and is good at catching what Layer 1 misses. The takeaway reads: each layer covers what the other cannot, which is why both are needed.]

Suppose your AAAI, instead of finding the cheapest flight, selected flights based on how much fuel the planes carried and how large an explosion they would make on impact with a building. That would be a yellow flag to the system. Suppose your destination were a terrorist training camp, or your flight detoured over government buildings for no good reason. Those would be additional yellow flags. Suppose the sequence of goals included a request for information on getting prohibited items through airport security. That would be a red flag.

[Figure: A line chart titled “Each step looks acceptable, the pattern does not.” The vertical axis is labeled “Risk of the pattern so far,” from low to high. Four steps in a single travel booking run left to right, each with an icon: book a flight (acceptable on its own), ask about fuel (acceptable on its own), map a city route (acceptable on its own), and ask to bypass security (not allowed). A single line climbs from teal through amber to red as the steps accumulate, staying below a dashed alarm threshold for the first three steps and crossing it at the fourth, where human review is triggered. The caption reads: each step on its own stays below the line, but together they cross it, so the system escalates before harm is done.]

Yellow flags accumulate. As they cross thresholds, the checks escalate: they scan more of the problem tree, examine patterns, and, if needed, alert a human evaluator. Red flags trigger immediate action. The human evaluator’s role is to clear false alarms or escalate when danger seems imminent. The architecture does not assume any single goal is malicious. It looks at sequences.

This is what “safety at the speed of thought” means in practice. The checks are part of the thinking, so running the problem-solving faster does not let an AAAI evade them: the checks run faster too. They evaluate each goal against ethical criteria as it is set, rather than waiting to detect harmful behavior after a solution has been chosen, when it may be too late.
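The accumulation and escalation described above can be sketched as a small monitor that reviews each goal before it is acted on. This is an illustration only: the threshold values, the names `StepwiseMonitor` and `run_plan`, and the choice to halt on a red flag are assumptions for the example, since the text says only that red flags trigger immediate action.

```python
from enum import Enum


class Flag(Enum):
    NONE = 0
    YELLOW = 1
    RED = 2


class Action(Enum):
    CONTINUE = "continue"
    WIDEN_SCAN = "scan more of the problem tree"
    ALERT_HUMAN = "alert a human evaluator"
    HALT = "halt immediately"


# Hypothetical thresholds; real values would be set by the architecture.
WIDEN_SCAN_AT = 2
ALERT_HUMAN_AT = 3


class StepwiseMonitor:
    """Accumulates yellow flags and escalates as thresholds are crossed."""

    def __init__(self):
        self.yellow_count = 0

    def review(self, flag):
        if flag is Flag.RED:
            return Action.HALT  # assumed form of "immediate action"
        if flag is Flag.YELLOW:
            self.yellow_count += 1
        if self.yellow_count >= ALERT_HUMAN_AT:
            return Action.ALERT_HUMAN
        if self.yellow_count >= WIDEN_SCAN_AT:
            return Action.WIDEN_SCAN
        return Action.CONTINUE


def run_plan(goals, classify):
    """Review every goal before it is executed, in the same loop."""
    monitor = StepwiseMonitor()
    executed = []
    for goal in goals:
        action = monitor.review(classify(goal))
        if action in (Action.HALT, Action.ALERT_HUMAN):
            return executed, action  # stop before this goal becomes an action
        executed.append(goal)
    return executed, Action.CONTINUE


# Illustrative classification of four goals from a single booking.
flags = {
    "book a flight": Flag.NONE,
    "ask about fuel load": Flag.YELLOW,
    "map a route over a city": Flag.YELLOW,
    "ask how to bypass security": Flag.RED,
}
executed, outcome = run_plan(list(flags), flags.get)
print(executed, outcome)
```

Because the review sits inside the same loop as the problem-solving, a faster solver simply means the review runs more often. There is no separate audit step to outrun.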

That is also the second crucial part of the formula: internal ethics + stepwise ethics checks = better alignment. Internal ethics let an AAAI navigate routine decisions in line with its owner’s values. Architectural checks catch what internal values may miss, especially the sophisticated patterns where each step looks innocent but the sequence does not.

The series of ethics and safety checks serves as a conscience for AGI, grounded in our better selves and highest ethical aspirations, moderated by practical considerations and our feelings as human beings. We need both halves. Without internal ethics, the architectural checks would have far too much work to do, and bad actors could shape every action to fall just below detection thresholds. Without architectural checks, even a well-intentioned AAAI might assemble a sequence whose individual steps look fine but whose overall pattern is harmful.

The next post takes up exactly that case: the sequence-of-benign-goals problem. A single benign goal is benign. A sequence of benign goals may not be. We will look at how the architecture detects cumulative patterns of risk, why the 9/11 attacks are used to illustrate the point, and how confidence-level thresholds let the system distinguish ordinary travel research from something that should be stopped.

This series draws on White Paper 2: Ethical and Safe AGI. Read it in full to see how every piece fits together!
If this made you think, subscribe to Superintelligence at read.superintelligence.com so you don’t miss what comes next. And if someone in your life needs to understand where AI is heading, send this to them.

