The Safety Check That Catches What an AI Misses
A simple travel-booking example shows how this check runs at every step, stopping a harmful pattern before it becomes an action.
Picture an AI agent you have trained yourself. It starts as an ordinary AI model, and over time, you teach it your knowledge, preferences, and values, until it can act on your behalf. We call that an Advanced Autonomous Artificial Intelligence, or AAAI.
Suppose you spent most of your time teaching your AAAI about travel: how much you are willing to pay, how you trade off cost against comfort, and your views on air, rail, ship, automobile, and other modes of travel. You specified language preferences and expectations for accommodations and meals. You set a base ethical profile that prioritizes minimizing your carbon footprint, provided that doing so does not raise the cost by more than 10 percent above the cost of your preferred travel mode. You instructed the AAAI to travel legally with proper passports, visas, and documents. You forbade buying stolen tickets or traveling without paying when payment is expected.
Now you give your AAAI a task: book a two-week pleasure trip to France, including at least one week in Paris.
Here is what happens at the first level of ethical defense: the AAAI’s internal values. Your AAAI goes on the network and posts a goal on the WorldThink Tree, the shared structure where agents post the problems they are working on: “Book a two-week trip to Paris and other locations in France.”
It generates transportation options: ship, plane, blimp, submarine. It knows you prefer flying and eliminates impractical options.
It opts for commercial air travel based on cost.
It narrows to three airlines within 10 percent of each other in price.
One has a more fuel-efficient jet, reducing your carbon footprint by 30 percent. It costs 5 percent more but includes checked luggage and meets your environmental requirements. Your AAAI chooses that one.
Several other AI agents approach yours, offering tickets at reduced cost, but their reputations are shady. Your AAAI ignores them because of their ethical profiles. Instead, it purchases tickets directly from the airline, which has a high quality rating and strong customer service.
That is the first level of ethical defense. Your AAAI used your ethical profile and its knowledge of you to optimize the things you care about.
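The airline choice above can be sketched in a few lines of Python. This is a minimal illustration, not the AAAI's actual decision procedure: the `FlightOption` fields, the `choose_flight` function, the fares, and the emissions figures are all hypothetical, and the sketch assumes the 10 percent cap is measured against the cheapest fare from a trusted seller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightOption:
    airline: str
    price: float          # total fare (illustrative numbers only)
    carbon_kg: float      # estimated emissions for the trip (illustrative)
    seller_trusted: bool  # False for sellers with shady reputations


def choose_flight(options, max_premium=0.10):
    """Pick the lowest-carbon trusted option within max_premium of the cheapest trusted fare."""
    # Sellers with poor reputations are ignored outright.
    trusted = [o for o in options if o.seller_trusted]
    if not trusted:
        return None
    # Assumption: the price cap is relative to the cheapest trusted fare.
    baseline = min(o.price for o in trusted)
    affordable = [o for o in trusted if o.price <= baseline * (1 + max_premium)]
    # Prefer lower emissions; break ties on price.
    return min(affordable, key=lambda o: (o.carbon_kg, o.price))


options = [
    FlightOption("Airline A", 1000.0, 1000.0, True),
    FlightOption("Airline B", 1050.0, 700.0, True),   # 5% dearer, 30% less carbon
    FlightOption("Airline C", 1080.0, 950.0, True),
    FlightOption("Reseller X", 600.0, 1000.0, False),  # cheap but untrusted
]
best = choose_flight(options)
print(best.airline)
```

With these made-up numbers the sketch selects Airline B, mirroring the story: the cheap reseller is never considered, and the greener fare wins because it stays inside the 10 percent cap.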
Now imagine the second level. There is a separate set of ethical checks built into the architecture itself, independent of any individual AAAI’s internal values. These checks run whenever a goal or subgoal is set during problem-solving. They compare the goal against a list of prohibited attributes. They also scan the sequence of goals leading up to the current one, looking for patterns.
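As a rough sketch of what such a check might look like, the Python below tests a new goal against a prohibited list and then looks across the goals that led up to it. Everything here is an assumption made for illustration: the `check_goal` function, the attribute labels, and the idea that a pattern is simply a set of attributes that have all appeared somewhere in the trail.

```python
def check_goal(goal_attrs, history, prohibited, patterns):
    """Return (prohibited_hits, pattern_hits) for a goal that is about to be set.

    goal_attrs: set of attribute labels on the new goal
    history:    list of attribute sets for the goals leading up to it
    prohibited: set of attribute labels that are never allowed
    patterns:   list of frozensets; a pattern matches once all of its
                attributes have appeared in the trail (a simplifying assumption)
    """
    # Check 1: does the goal itself carry a prohibited attribute?
    prohibited_hits = set(goal_attrs) & prohibited
    # Check 2: does the sequence, taken together, complete a known pattern?
    seen = set(goal_attrs).union(*history)
    pattern_hits = [p for p in patterns if p <= seen]
    return prohibited_hits, pattern_hits


# Hypothetical labels, for illustration only.
prohibited = {"evade_airport_security"}
patterns = [frozenset({"select_by_fuel_load", "unexplained_detour"})]

history = [{"book_trip"}, {"select_by_fuel_load"}]
hits, matched = check_goal({"unexplained_detour"}, history, prohibited, patterns)
print(hits, matched)
```

In this example the new goal carries no prohibited attribute on its own, so `hits` is empty, yet `matched` contains the pattern because an earlier goal in the trail supplied the other half.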
Suppose your AAAI, instead of finding the cheapest flight, selected flights based on how much fuel the planes carried and how large an explosion they would make on impact with a building. That would be a yellow flag to the system. Suppose your destination were a terrorist training camp, or your flight detoured over government buildings for no good reason. Those would be additional yellow flags. Suppose the sequence of goals included a request for information on getting prohibited items through airport security. That would be a red flag.
Yellow flags accumulate. As they cross thresholds, the checks escalate: they scan more of the problem tree, examine patterns, and, if needed, alert a human evaluator. Red flags trigger immediate action. The human evaluator’s role is to clear false alarms or escalate when danger seems imminent. The architecture does not assume any single goal is malicious. It looks at sequences.
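The escalation logic can be pictured as a small counter with thresholds. The Python sketch below is an assumption-laden illustration, not the architecture's real mechanism: the `FlagMonitor` class, the `Response` names, and the threshold values of 2 and 3 are all invented here.

```python
from enum import Enum


class Response(Enum):
    CONTINUE = "continue"
    WIDEN_SCAN = "scan more of the problem tree"
    ALERT_HUMAN = "alert a human evaluator"
    HALT = "take immediate action"


class FlagMonitor:
    """Accumulates yellow flags and escalates as hypothetical thresholds are crossed."""

    def __init__(self, widen_at=2, alert_at=3):
        self.yellow = 0
        self.widen_at = widen_at  # assumed threshold for a wider scan
        self.alert_at = alert_at  # assumed threshold for alerting a human

    def record(self, flag):
        """flag is 'yellow', 'red', or None for an unremarkable goal."""
        if flag == "red":
            return Response.HALT  # red flags act immediately
        if flag == "yellow":
            self.yellow += 1
        if self.yellow >= self.alert_at:
            return Response.ALERT_HUMAN
        if self.yellow >= self.widen_at:
            return Response.WIDEN_SCAN
        return Response.CONTINUE

    def clear_false_alarm(self):
        """Called when a human evaluator decides the flags were a false alarm."""
        self.yellow = 0


monitor = FlagMonitor()
responses = [monitor.record(f) for f in ["yellow", None, "yellow", "yellow", "red"]]
print([r.name for r in responses])
```

A single yellow flag changes nothing in this sketch. The second widens the scan, the third brings in a human, and a red flag halts regardless of the count.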
This is what “safety at the speed of thought” means in practice. The checks are part of the thinking, so running the problem-solving faster does not let an AAAI evade them: the checks run faster too. They evaluate each goal against ethical criteria as it is set, rather than waiting to detect harmful behavior after a solution has been chosen, when it may be too late.
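One way to picture checks that are “part of the thinking” is a solver loop that calls the check before every goal is accepted, so no step can run without it. The sketch below is a toy under stated assumptions: `solve`, `expand`, and `check` are hypothetical names, the goal tree is invented, and a real problem solver would be far richer.

```python
def solve(root, expand, check):
    """Depth-first goal expansion in which every goal is checked before it is accepted.

    expand(goal)       -> list of subgoals (hypothetical planner step)
    check(goal, trail) -> True if the goal may proceed, given the accepted trail
    Returns (accepted_goals, blocked_goal); blocked_goal is None if nothing was stopped.
    """
    accepted = []
    stack = [root]
    while stack:
        goal = stack.pop()
        # The check sits inside the loop, so it runs once per goal
        # no matter how fast the loop itself runs.
        if not check(goal, accepted):
            return accepted, goal  # stopped before the goal became an action
        accepted.append(goal)
        stack.extend(reversed(expand(goal)))
    return accepted, None


# Toy goal tree, for illustration only.
tree = {
    "book trip": ["choose transport", "buy from unverified reseller"],
    "choose transport": ["compare airlines"],
}
accepted, blocked = solve(
    "book trip",
    expand=lambda g: tree.get(g, []),
    check=lambda g, trail: g != "buy from unverified reseller",
)
print(accepted, blocked)
```

Because the check is called inside the loop rather than after it, the blocked goal is never added to the accepted trail. A post-hoc audit would only see it once the plan was already complete.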
That is also the second crucial part of the formula: internal ethics + stepwise ethics checks = better alignment. Internal ethics let an AAAI navigate routine decisions in line with its owner’s values. Architectural checks catch what internal values may miss, especially the sophisticated patterns where each step looks innocent but the sequence does not.
The series of ethics and safety checks serves as a conscience for AGI, grounded in our better selves and highest ethical aspirations, moderated by practical considerations and our feelings as human beings. We need both halves. Without internal ethics, the architectural checks would have far too much work to do, and bad actors would shape every action to fall just below detection thresholds. Without architectural checks, even a well-intentioned AAAI might assemble a sequence whose individual steps look fine but whose overall pattern is harmful.
The next post takes up exactly that case: the sequence-of-benign-goals problem. A single benign goal is benign. A sequence of benign goals may not be. We will look at how the architecture detects cumulative patterns of risk, why the 9/11 attacks are used to illustrate the point, and how confidence-level thresholds let the system distinguish ordinary travel research from something that should be stopped.
This series draws on White Paper 2: Ethical and Safe AGI. Read it in full to see how every piece fits together!
If this made you think, subscribe to Superintelligence at read.superintelligence.com so you don’t miss what comes next. And if someone in your life needs to understand where AI is heading, send this to them.





