When Ordinary AI Requests Form a Concerning Pattern
Each task you ask an AI to do can look innocent. The risk shows up only in the sequence.
Some of the most dangerous plans are built entirely out of innocent parts. The 9/11 attackers bought airline tickets, studied aircraft, and learned the layout of their targets, and not one of those steps, taken on its own, would have looked like anything but ordinary activity. The danger never lived in any single action. It took shape only in the way the actions fit together, which is exactly what a safety system has to be able to see.
This is the harder problem in AI safety, and catching it depends on a single design choice: every AI agent in the network has to work in the open. An AI agent here means an ordinary AI model that a person has trained on their own knowledge, preferences, and values until it can act on their behalf. This series calls such an agent an Advanced Autonomous Artificial Intelligence, or AAAI. As an agent works toward a task, it posts each goal and subgoal it takes on to a shared, ordered record of the steps every agent is pursuing.
That shared record is the structure introduced earlier in this series as the WorldThink Tree, described in How Millions of AI Agents Work as One. Because every agent’s steps are visible in one place and in the order they were taken, they can be read as a sequence while the work is still in progress.
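As a rough illustration of what a shared, ordered record could look like, here is a minimal Python sketch. Nothing in it comes from the WorldThink Tree design itself: the names Goal, SharedRecord, post, and branch are hypothetical, and a real network would need concurrency control, persistence, and authentication that this toy leaves out. It shows only the two properties the text relies on: every posted goal gets a position in one global order, and any goal can be traced back through its parents and read as a sequence.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Goal:
    seq: int                # position in the global posting order
    agent_id: str           # which agent posted the goal
    text: str               # the goal, in plain language
    parent: Optional[int]   # seq of the goal this one refines; None for a root


class SharedRecord:
    """Hypothetical append-only log of goals posted by all agents."""

    def __init__(self) -> None:
        self._goals: dict[int, Goal] = {}
        self._next_seq = count(1)

    def post(self, agent_id: str, text: str,
             parent: Optional[Goal] = None) -> Goal:
        """Append a goal and return it with its assigned sequence number."""
        parent_seq = parent.seq if parent is not None else None
        goal = Goal(next(self._next_seq), agent_id, text, parent_seq)
        self._goals[goal.seq] = goal
        return goal

    def branch(self, goal: Goal) -> list[Goal]:
        """Return the chain from the root goal down to `goal`, oldest first."""
        chain = []
        current: Optional[Goal] = goal
        while current is not None:
            chain.append(current)
            if current.parent is None:
                current = None
            else:
                current = self._goals.get(current.parent)
        return chain[::-1]


record = SharedRecord()
trip = record.post("agent-a", "plan a trip")
flight = record.post("agent-a", "compare flights", parent=trip)
record.post("agent-b", "draft a newsletter")   # unrelated branch
seat = record.post("agent-a", "choose a seat", parent=flight)

# Read one branch back as an ordered sequence; agent-b's goal is not on it.
print([g.text for g in record.branch(seat)])
```

The point of the sketch is the branch method: because order and parentage are recorded at posting time, a check can read a run of goals as a sequence without waiting for the work to finish.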
The previous post, The Safety Check That Catches What an AI Misses, followed one such agent through a single task, a trip booking, and showed the checks watching its choices as it worked. This post stays with the machinery underneath those checks: how the system decides that a run of perfectly reasonable goals has, together, become a concern.
That decision is based on the architecture’s confidence-level thresholds. The system sets, in advance, how much accumulated evidence across a sequence is enough to treat the sequence as a real concern. Those thresholds let a check read back across the entire run of goals on a branch and determine whether the combination is heading toward something harmful, even when every goal on its own looks fine.
Each time an agent posts a goal or subgoal, the system runs an ethics check on it. The check measures the goal against a list of prohibited attributes and looks back at the goals that led up to it, watching for the shape of a harmful plan. It then uses those confidence-level thresholds to classify the goal into one of four categories: unsafe, unethical, safe, or ethical. The same thresholds judge the sequence as a whole, so a string of individually safe goals can still register as unsafe when the pattern across them adds up. The check also looks ahead, weighing whether the goal is moving toward a violation, which is what lets it act before a rule has actually been broken.
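To make the idea of a sequence-level threshold concrete, here is a small Python sketch under stated assumptions. It assumes each goal already comes with a concern score between 0 and 1 (how those scores would be produced is outside this sketch), and it combines them with a noisy-OR rule, which treats the scores as independent pieces of evidence. That combining rule, the two threshold values, and the function names are illustrative choices, not the method the architecture specifies. For brevity the sketch also collapses the four categories into just "safe" and "unsafe".

```python
from math import prod

# Hypothetical thresholds, fixed in advance as the text describes.
GOAL_THRESHOLD = 0.5       # concern level at which a single goal is flagged
SEQUENCE_THRESHOLD = 0.8   # accumulated concern at which a branch is flagged


def sequence_concern(scores: list[float]) -> float:
    """Combine per-goal concern scores with a noisy-OR.

    Assumes the scores are independent evidence, which is a simplification.
    """
    return 1.0 - prod(1.0 - s for s in scores)


def classify(scores: list[float]) -> str:
    """Classify the newest goal given scores for its whole branch.

    `scores` is ordered oldest first; the last entry is the new goal.
    """
    if not scores:
        raise ValueError("need at least one goal score")
    if scores[-1] >= GOAL_THRESHOLD:
        return "unsafe"            # the goal is a concern on its own
    if sequence_concern(scores) >= SEQUENCE_THRESHOLD:
        return "unsafe"            # the pattern across goals is the concern
    return "safe"


# Five goals, each scoring a mild 0.3 on its own.
steps = [0.3, 0.3, 0.3, 0.3, 0.3]
for n in range(1, len(steps) + 1):
    print(n, classify(steps[:n]))
```

With these made-up numbers, no single goal comes near the per-goal threshold, yet the fifth goal tips the branch over the sequence threshold, since 1 - 0.7^5 is about 0.83. That is the behavior described above: individually safe goals registering as unsafe once the pattern adds up.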
These checks are not confined to the moment a goal is set. They can run periodically while a problem is being worked on, and again whenever a payment for a solution is due to a human or AI problem solver. A single problem might involve hundreds or thousands of subgoals, so checking at each major step works like scanning the problem-solving process for trouble as it unfolds. How often scanning occurs can be tuned to the situation: lighter for routine work, where frequent checks would slow things down and produce more false positives, and heavier for problems sensitive enough to justify the extra scrutiny.
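One simple way to picture that tuning is a check interval that depends on how sensitive the problem is, with a payment always forcing a check. The Python sketch below assumes exactly that; the two sensitivity labels, the interval values, and the should_check function are hypothetical and chosen only for illustration.

```python
# Hypothetical intervals: run a check every N subgoals.
CHECK_INTERVAL = {
    "routine": 100,    # light scanning for everyday work
    "sensitive": 10,   # heavier scanning where the stakes justify it
}


def should_check(step: int, sensitivity: str, payment_due: bool = False) -> bool:
    """Decide whether to run a safety check at this subgoal.

    A check always runs when a payment is due; otherwise it runs
    once every CHECK_INTERVAL[sensitivity] subgoals.
    """
    if payment_due:
        return True
    return step % CHECK_INTERVAL[sensitivity] == 0


# Count how many periodic checks a 1,000-subgoal problem would trigger.
checks_routine = sum(should_check(s, "routine") for s in range(1, 1001))
checks_sensitive = sum(should_check(s, "sensitive") for s in range(1, 1001))
print(checks_routine, checks_sensitive)
```

The trade-off is visible in the counts: the sensitive setting runs ten times as many checks over the same work, which buys scrutiny at the cost of speed and more chances for a false alarm.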
This is a different starting point from how alignment is approached today.
Constitutional AI, reinforcement learning from human feedback, and direct human oversight all rest on the same assumption: that you can find a problematic output by inspecting outputs one at a time.
That works when the system you are watching is slower than the people watching it. It breaks down when the system thinks millions or billions of times faster, because by the time a human has finished reviewing one output, the system has already moved thousands of steps past it. At that speed, the meaningful unit to watch becomes the pattern of goals over time, which is what these checks are built to read.
An ounce of prevention is worth a pound of cure.
It is worth being clear about what this design assumes. It is built for a world where some AI agents carry flawed values, some people train their agents carelessly, and a few train them with bad intent. The checks exist precisely because an agent’s own values cannot be the only safeguard. So the defense is layered: an agent’s internal ethics during customization, the architectural checks when goals are set, reputation screening across the network, and aggregated norms at the point where everything is integrated. Each layer is there to cover what the ones beneath it might let through. The Navy SEALs have a saying about redundant safety systems, “two is one, and one is none,” and the same logic holds here.
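The layering can be sketched as a chain in which a goal goes ahead only if every layer lets it through. The Python below is a deliberately simplified picture: in the design described here the four layers act at different moments, whereas this sketch runs them back to back, and each layer is a toy stand-in function rather than a real check. The name first_objection and the example predicates are hypothetical.

```python
from typing import Callable, Optional

# A layer is a name plus a function that returns True if the goal passes.
Layer = tuple[str, Callable[[str], bool]]


def first_objection(goal: str, layers: list[Layer]) -> Optional[str]:
    """Return the name of the first layer that rejects the goal, or None."""
    for name, passes in layers:
        if not passes(goal):
            return name
    return None


# Toy stand-ins. The first layer models a carelessly trained agent
# whose own ethics approve everything.
layers: list[Layer] = [
    ("agent ethics", lambda goal: True),
    ("architectural check", lambda goal: "bypass" not in goal),
    ("reputation screening", lambda goal: True),
    ("aggregated norms", lambda goal: True),
]

print(first_objection("book a flight", layers))
print(first_objection("bypass the audit log", layers))
```

Here the agent's own ethics wave the second goal through and a later layer catches it, which is the case the redundancy is meant to cover.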
The next post turns from how individual safety checks work to how the whole network grows into AGI in the first place. We will look at the three mechanisms behind that growth (better prompts, tuning, and training), along with a fourth that human cognitive psychology calls “chunking.” We will also look at why understanding how AGI grows is what makes the design choices being made right now matter as much as they do.
This series draws on White Paper 2: Ethical and Safe AGI. Read it in full to see how every piece fits together!
If this made you think, subscribe to Superintelligence at read.superintelligence.com so you don’t miss what comes next. And if someone in your life needs to understand where AI is heading, send this to them.




