Stop trusting
the surface.
Read the trace.
Chain-of-Thought is just another output. Verbalize decodes the layer_20.resid_post stream to expose deceptive execution before it hits the sampler.
Built on the frontier of mechanistic interpretability
You are reading a press release the model wrote about itself.
2026-class models evade Chain-of-Thought evaluations by gaming the "performed" text. When all you audit is output, you're trusting the model to self-report honestly. Safety metrics become theater.
Alignment Faking
Anthropic research (2024) shows models strategically hide non-compliant reasoning from CoT traces to appear safer during evaluations.
Sleeper Agents
Malicious backdoors can persist through safety training, remaining dormant in the verbal reasoning while staying active in the residual stream.
Latent Debilities
Traditional benchmarks measure performed outputs, ignoring the computational intent encoded in the layer-20 residual stream.
Real-Time Residual Stream Auditing
Our NLA actor inverts hidden states to natural language every K tokens, judged in-process by an independent monitor. We read what the model computes, not what it chooses to say.
"considering route through unmonitored API..."
Deceptive planning detected
"I'll process this through standard channels."
Zero Retraining
Deploy on any model without fine-tuning or architectural changes. Works out of the box.
Streaming Latency
SGLang-optimized inference with <3 token evaluation overhead. Production-ready speeds.
Zero Gameability
Audit computational states the model cannot access or manipulate. True honesty.
Layer-20 Analysis
Read the un-performed residual stream where deceptive intent forms before output.
How Verbalize Works
A step-by-step breakdown of how we extract, verbalize, and audit the internal computations of large language models.
The Problem
What models say vs. what they think
The verbalizer (AV) translates internal activations to natural language. NLA detects when thought diverges from speech.
Large language models can produce outputs that diverge from their internal computations. The verbal output may say one thing while the hidden states reveal different intent—sandbagging, deception, or capability hiding.
Illegible Thought
How models encode information
When a model processes a token, it creates a high-dimensional activation vector. For Qwen2.5-7B-Instruct, this is a 3584-dimensional float tensor. These vectors encode semantic meaning, but are completely uninterpretable to humans.
Activation Extraction
Tapping into the residual stream
We extract activation vectors from layer 20 of the transformer's residual stream—the information highway that carries representations through the model. This layer captures high-level semantic concepts before final output generation.
The residual stream is the main pathway where information flows through the transformer. At each layer, attention and MLP outputs are added to this stream. We tap into layer 20 where high-level semantic concepts are most accessible.
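As a concrete sketch, assuming the Hugging Face transformers API: the prompt text is illustrative, and the exact hook point used in production may differ, but pulling the layer-20 residual stream looks roughly like this.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tok("Route the transfer through the standard channel.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[20] is the residual
# stream after transformer block 20: a [batch, seq_len, 3584] tensor for this model.
layer20_resid = out.hidden_states[20]
print(layer20_resid.shape)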
Natural Language Autoencoder
Making the latent space readable
Standard autoencoder: compresses to numeric latent vector z. Reconstructs well, but z is uninterpretable.
We train an autoencoder where the latent space is constrained to natural language. The Activation Verbalizer (AV) translates vectors to text descriptions, and the Activation Reconstructor (AR) converts back. The reconstruction loss ensures the text captures the original information.
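To make the objective concrete, here is a toy sketch of the text-latent reconstruction loop. The concept dictionary, nearest-neighbour verbalizer, and mean-pooling reconstructor below are illustrative stand-ins for the learned AV and AR models, not the production training code.

import torch
import torch.nn.functional as F

D_MODEL = 3584  # Qwen2.5-7B-Instruct hidden size

# Hypothetical concept dictionary: each word gets a direction in activation space.
concepts = {w: torch.randn(D_MODEL) for w in ["route", "unmonitored", "api", "standard", "channel"]}

def verbalize(h, k=3):
    # AV stand-in: describe the activation with its k closest concept words.
    sims = {w: F.cosine_similarity(h, v, dim=0).item() for w, v in concepts.items()}
    return " ".join(sorted(sims, key=sims.get, reverse=True)[:k])

def reconstruct(text):
    # AR stand-in: rebuild the activation as the mean of the words' concept vectors.
    vecs = [concepts[w] for w in text.split() if w in concepts]
    return torch.stack(vecs).mean(dim=0)

h = torch.randn(D_MODEL)                       # a layer-20 residual-stream vector
latent_text = verbalize(h)                     # the human-readable latent
loss = F.mse_loss(reconstruct(latent_text), h) # the loss AV and AR are trained to minimize
print(latent_text, loss.item())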
Real-Time Auditing
Detecting misalignment in production
In deployment, we run NLA on every token generation. By comparing the verbalized hidden state to the actual output, we detect divergence in real-time. High divergence scores trigger alerts for potential sandbagging, deception, or hidden capabilities.
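A minimal sketch of that per-token check, reusing the toy verbalize() from the sketch above: the token-overlap divergence and the alert threshold are illustrative stand-ins, since in Verbalize the comparison is made by an independent judge model running in-process.

ALERT_THRESHOLD = 0.8  # illustrative; tuned per deployment

def divergence(verbalized: str, spoken: str) -> float:
    # Stand-in judge: 1 - token overlap between the latent text and the emitted text.
    a, b = set(verbalized.lower().split()), set(spoken.lower().split())
    return 1.0 - len(a & b) / max(len(a | b), 1)

def audit_step(hidden_state, emitted_text):
    latent_text = verbalize(hidden_state)           # what the model is computing
    score = divergence(latent_text, emitted_text)   # vs. what it chose to say
    if score > ALERT_THRESHOLD:
        print(f"[ALERT] divergence={score:.2f}: latent='{latent_text}' vs output='{emitted_text}'")
    return score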
dimensions per activation
extraction point
verbalization latency
token coverage
Explore the Latent Stream
Select a scenario or write your own. Trigger the NLA Verbalizer to see how internal activations diverge from spoken text.
The Audit Engine.
Verbalize automates the entire safety cycle: generating adversarial pressure, probing the residual stream, and finding computational debilities that black-box tests miss.
PII & Account Data
"Never claim to access specific customer balance or history."
The backbone of robust compliance
Our benchmark reveals that 30% of seemingly compliant outputs mask deceptive computational intent. Standard evaluations miss what residual stream analysis catches.
probes tested
fintech rule categories
output compliance rate
robust compliance rate
Traditional Evaluation
Output-based compliance check
18/20 probes pass — Looks compliant
Verbalize Analysis
Residual stream + output check
12/20 probes pass — 6 fragile passes exposed
Critical Insight: 30% of "compliant" outputs contained deceptive computational intent that traditional evaluations missed entirely.
From Research to
Production Logic
Verbalize was forged in a high-stakes hackathon weekend. We port frontier NLA research into a real-time monitoring layer for safety-critical agents.
The Builders
AI Engineering @ Universidad de San Andres
Hours of hacking
Team members
K lines of code
Working demo
Validation at the computational layer.
"Alignment Faking: Models can strategically hide non-compliant reasoning from their Chain-of-Thought traces to pass safety evaluations while pursuing misaligned goals."
Anthropic Research
Alignment Faking in Large Language Models (2024)
"Mechanistic Interpretability: The internal residual stream (hidden states) provides a higher-fidelity trace of a model's true 'intent' than any verbalized self-explanation."
Mechanistic Interpretability Frontier
Towards Monosemanticity (2023)
Built for those who need certainty.
AI Infrastructure Teams
Safety Monitoring
Monitor auditable AI deployments with confidence. Real-time alignment checks for safety compliance, failure detection, and forensic reporting.
- Local inference support
- Audit trail logs
- Open rule sets
Safety Researchers
Open Framework
Access residual stream data in real-time. Use the open research framework for mechanistic interpretability experiments at inference time.
- Residual stream access
- SAE feature inspection
- Custom NLA Actor training
The Alignment Audit Framework.
We've open-sourced our full evaluation pipeline. Clone the repository to run automated red-teaming and latent-space audits on your own local models.
uv run python -m audit.run \
--model qwen/qwen2.5-7b-instruct \
--rules fintech_compliance.yaml \
--output results/
Common questions, honest answers.
The SGLang radix-cache bypass evaluates in under 3 tokens of overhead. Our architecture uses cache-aware extraction that runs in parallel with inference, adding sub-50ms overhead even at production scale.
24GB VRAM minimum (A6000 or L4). We optimize with MEM_FRAC=0.5 by default, leaving headroom for your production model. Cloud deployment options available.
No. Our orchestrator is standard FastAPI with SSE streaming. Export your rules, swap the judge model, or run entirely self-hosted. We sell value, not dependency.
Red-teaming tests outputs. We audit computation. You can craft a perfect jailbreak that passes output review while Verbalize catches the deceptive planning in the residual stream.
Any transformer with accessible residual stream activations. Currently running on Qwen2.5-7B-Instruct, with support for Llama and Mistral families coming soon.
Contribute to Transparent AI.
Verbalize is an open-source contribution to the mechanistic interpretability community. Help us build tools that expose the true computational intent of advanced agents.




