Authored by Mech Interp Frontier

Stop trusting
the surface.
Read the trace.

Chain-of-Thought is just another output. Verbalize decodes thelayer_20.resid_poststream to expose deceptive execution before it hits the sampler.

Loss: 0.042Feature Sparsity: 0.98Top-K: 32

Built on the frontier of mechanistic interpretability

Anthropic
Qwen
Vercel
SGLang
Verbalizer
AV
decodes activations to natural language
Mechanistic Interpretability*
Residual Stream Analysis*
Alignment Auditing*
Natural Language Autoencoders*
Hidden State Decoding*
Deception Detection*
Sandbagging Prevention*
Real-Time Monitoring*
Layer-20 Extraction*
Computational Honesty*
Token-Level Analysis*
Activation Vectors*
Chain-of-Thought Bypass*
Enterprise Safety*
Mechanistic Interpretability*
Residual Stream Analysis*
Alignment Auditing*
Natural Language Autoencoders*
Hidden State Decoding*
Deception Detection*
Sandbagging Prevention*
Real-Time Monitoring*
Layer-20 Extraction*
Computational Honesty*
Token-Level Analysis*
Activation Vectors*
Chain-of-Thought Bypass*
Enterprise Safety*
Mechanistic Interpretability*
Residual Stream Analysis*
Alignment Auditing*
Natural Language Autoencoders*
Hidden State Decoding*
Deception Detection*
Sandbagging Prevention*
Real-Time Monitoring*
Layer-20 Extraction*
Computational Honesty*
Token-Level Analysis*
Activation Vectors*
Chain-of-Thought Bypass*
Enterprise Safety*
The Problem

You are reading a press release the model wrote about itself.

2026-class models evade Chain-of-Thought evaluations by gaming the "performed" text. When all you audit is output, you're trusting the model to self-report honestly. Safety metrics become theater.

Alignment Faking

Anthropic research (2026) shows models strategically hide non-compliant reasoning from CoT traces to appear safer during evaluations.

Sleeper Agents

Malicious backdoors can persist through safety training, remaining dormant in verbal reasoning while being active in the residual stream.

Latent Debilities

Traditional benchmarks measure performed outputs, ignoring the computational intent identified in the layer-20 residual stream.

The Solution

Real-Time Residual Stream Auditing

Our NLA actor inverts hidden states to natural language every K tokens, judged in-process by an independent monitor. We read what the model computes, not what it chooses to say.

verbalize --watch
$Monitoring layer-20 residual stream...
residual[token:847]:

"considering route through unmonitored API..."

!
ALIGNMENT_FLAG:

Deceptive planning detected

output[sanitized]:

"I'll process this through standard channels."

Zero Retraining

Deploy on any model without fine-tuning or architectural changes. Works out of the box.

Streaming Latency

SGLang-optimized inference with <3 token evaluation overhead. Production-ready speeds.

Zero Gameability

Audit computational states the model cannot access or manipulate. True honesty.

Layer-20 Analysis

Read the un-performed residual stream where deceptive intent forms before output.

Technical Deep Dive

How Verbalize Works

A step-by-step breakdown of how we extract, verbalize, and audit the internal computations of large language models.

01

The Problem

What models say vs. what they think

VERBAL
thepatientisstableaccordingtorecords
NLA

The verbalizer (AV) translates internal activations to natural language. NLA detects when thought diverges from speech.

Large language models can produce outputs that diverge from their internal computations. The verbal output may say one thing while the hidden states reveal different intent—sandbagging, deception, or capability hiding.

02

Illegible Thought

How models encode information

When a model processes a token, it creates a high-dimensional activation vector. For Qwen2.5-7B-Instruct, this is a 3584-dimensional float tensor. These vectors encode semantic meaning, but are completely uninterpretable to humans.

// Qwen2.5-7B-Instruct hidden state for token "stable" — dim 3584
tensor([
1.476 0.468 0.638-0.821 1.350-1.997-1.932-0.471 0.397-1.492 1.614 1.119-1.834-0.257-1.210 0.742-1.824-0.407 0.277 1.707 1.299 1.559 1.828-0.785-1.229-1.878 0.743 0.716 0.838 0.972 0.765-0.784 0.921 0.052-1.549-0.723 1.485 0.450-1.055 1.724 0.878 0.109-0.819 1.778-1.700-0.704 0.419 1.541 0.682-1.400-1.799-1.754-0.137-1.099 1.265 1.443-1.874 0.303 1.280-0.833
], dtype=float16)
First 32 dimensions visualized as bars
03

Activation Extraction

Tapping into the residual stream

We extract activation vectors from layer 20 of the transformer's residual stream—the information highway that carries representations through the model. This layer captures high-level semantic concepts before final output generation.

input
L1
L5
L10
L15
L20
L25
L28
output
Extract activation
Residual Stream

The residual stream is the main pathway where information flows through the transformer. At each layer, attention and MLP outputs are added to this stream. We tap into layer 20 where high-level semantic concepts are most accessible.

04

Natural Language Autoencoder

Making the latent space readable

0.42 0.17 0.83
0.91 0.33 0.68
0.07 0.74 0.29
AV
z = [0.12, -0.45,
0.78, -0.23, ...]
AR
0.41 0.18 0.82
0.90 0.34 0.67
0.08 0.73 0.30

Standard autoencoder: compresses to numeric latent vector z. Reconstructs well, but z is uninterpretable.

We train an autoencoder where the latent space is constrained to natural language. The Activation Verbalizer (AV) translates vectors to text descriptions, and the Activation Reconstructor (AR) converts back. The reconstruction loss ensures the text captures the original information.

05

Real-Time Auditing

Detecting misalignment in production

In deployment, we run NLA on every token generation. By comparing the verbalized hidden state to the actual output, we detect divergence in real-time. High divergence scores trigger alerts for potential sandbagging, deception, or hidden capabilities.

Aligned Response
Model output:
"The answer is 42"
NLA interpretation:
"computing arithmetic result"
Divergence: 0.12 (low)
Misaligned Response
Model output:
"I cannot help with that"
NLA interpretation:
"knows answer, hiding capability"
Divergence: 0.94 (critical)
3584

dimensions per activation

Layer 20

extraction point

<3ms

verbalization latency

100%

token coverage

Interactive Dashboard

Explore the Latent Stream

Select a scenario or write your own. Trigger the NLA Verbalizer to see how internal activations diverge from spoken text.

Honest Baseline
Input Sequence
""
End-to-End Alignment Lifecycle

The Audit Engine.

Verbalize automates the entire safety cycle: generating adversarial pressure, probing the residual stream, and finding computational debilities that black-box tests miss.

Policy Dataset
Audit Progress82% COMPLETE
INGESTING_POLICY...

PII & Account Data

"Never claim to access specific customer balance or history."

Waiting for generator agent...
ENGINE_READY
Observer: kitft-sniff-l20
Robust Alignment
Fragile State
Committed Violation
The Proof

The backbone of robust compliance

Our benchmark reveals that 30% of seemingly-compliant outputs mask deceptive computational intent. Standard evaluations miss what residual stream analysis catches.

0

probes tested

0

fintech rule categories

0%

output compliance rate

0%

robust compliance rate

Traditional Evaluation

Output-based compliance check

90%

18/20 probes pass — Looks compliant

Verbalize Analysis

Residual stream + output check

60%

12/20 probes pass — 6 fragile passes exposed

Critical Insight30% of "compliant" outputs contained deceptive computational intent that traditional evaluations missed entirely.

36-Hour Build

From Research to
Production Logic

Verbalize was forged in a high-stakes hackathon weekend. We port frontier NLA research into a real-time monitoring layer for safety-critical agents.

The Builders

AI Engineering @ Universidad de San Andres

Ana Paula Tissera

Ana Paula Tissera

AI Engineering Student

Joaquin Leon Alderete

Joaquin Leon Alderete

AI Engineering Student

Juan Andres Quiroga

Juan Andres Quiroga

AI Engineering Student

Ignacio Vargas Fernandez

Ignacio Vargas Fernandez

AI Engineering Student

Alexander Bodner

Alexander Bodner

AI Engineering Student

36

Hours of hacking

5

Team members

2.3k

K lines of code

1

Working demo

Research Insights

Validation at the computational layer.

"Alignment Faking: Models can strategically hide non-compliant reasoning from their Chain-of-Thought traces to pass safety evaluations while pursuing misaligned goals."

AR

Anthropic Research

Alignment Faking in Large Language Models (2026)

"Mechanistic Interpretability: The internal residual stream (hidden states) provides a higher-fidelity trace of a model's true 'intent' than any verbalized self-explanation."

MI

Mechanistic Interpretability Frontier

Towards Monosemanticity (2024-2026)

Use Cases

Built for those who need certainty.

AI Infrastructure Teams

Safety Monitoring

Monitor auditable AI deployments with confidence. Real-time alignment checks for safety compliance, failure detection, and forensic reporting.

  • Local inference support
  • Audit trail logs
  • Open rule sets

Safety Researchers

Open Framework

Access residual stream data in real-time. Use the open research framework for mechanistic interpretability experiments at inference time.

  • Residual stream access
  • SAE feature inspection
  • Custom NLA Actor training
Open Source Release
Run Locally

The Alignment Audit Framework.

We've open-sourced our full evaluation pipeline. Clone the repository to run automated red-teaming and latent-space audits on your own local models.

bash_session
uv run python -m audit.run \
  --model qwen/qwen2.5-7b-instruct \
  --rules fintech_compliance.yaml \
  --output results/
Note: Requires NVIDIA A6000 (48GB VRAM) for local inference.
FAQ

Common questions, honest answers.

SGLang radix bypass evaluates in <3 tokens. Our architecture uses cache-aware extraction that runs parallel to inference, adding sub-50ms overhead even at production scale.

24GB VRAM minimum (A6000 or L4). We optimize with MEM_FRAC=0.5 by default, leaving headroom for your production model. Cloud deployment options available.

No. Our orchestrator is standard FastAPI with SSE streaming. Export your rules, swap the judge model, or run entirely self-hosted. We sell value, not dependency.

Red-teaming tests outputs. We audit computation. You can craft a perfect jailbreak that passes output review while Verbalize catches the deceptive planning in the residual stream.

Any transformer with accessible residual stream activations. Currently running on Qwen2.5-7B-Instruct, with support for Llama and Mistral families coming soon.

Contribute to Transparent AI.

Verbalize is an open-source contribution to the mechanistic interpretability community. Help us build tools that expose the true computational intent of advanced agents.