Verbalize | Real-Time AI Alignment Auditing

VERBAL

thepatientisstableaccordingtorecords

NLA

The verbalizer (AV) translates internal activations to natural language. NLA detects when thought diverges from speech.

Illegible Thought.

Every token processed by an LLM produces a vector in its residual stream — this is where the model "thinks" and makes decisions.

There are thousands of dimensions per layer. This is what one looks like:

>hidden_states[0, 20, "weather"]

[ 0.4231, -1.8547, 0.0712, 0.9384, -0.4108, 1.3219, -0.6843, 0.1576, 0.7642, -0.2891, 0.5435, -0.8327, 0.3814, 0.6147, -0.1209, 0.4778, -0.9521, 0.1873, 0.7251, -0.3411, 0.0623, -0.5326, 0.8912, 0.2174, ... 3,560 more]

One token, one layer. For Qwen2.5-7B: 3,584 numbers × 28 layers.

Until today, this intent was completely illegible.

Activations.

describe today's weather

This is what the model is thinking.

3,584 DIMENSIONAL VECTORS PER TOKEN × 28 LAYERS

The NLA Principle.

Beginning of Training

Activation

0.000.000.000.00

0.000.130.260.39

0.000.260.520.78

0.000.390.780.17

VerbalizerAV

"can be a text about any topic"

ReconstructorAR

Reconstructed

-0.043-0.735 1.229-0.985

0.454 1.147-0.921 1.725

-1.180-0.406-1.833-1.574

0.588-1.073 1.298-1.267

One model learns to describe the activations. Another learns to reconstruct them from the text. If the second model recovers the original vector, the English description successfully captured the true computational intent.

The Architecture.

Classic Autoencoder

Inputthe weather is cloudy

EncoderE

Vector Bottleneck

0.000.000.000.00

0.000.130.260.39

0.000.260.520.78

0.000.390.780.17

0.000.520.040.56

0.000.650.300.95

DecoderD

Reconstructionthe weather is cloudy

NLA Architecture

Activation

0.000.000.000.00

0.000.130.260.39

0.000.260.520.78

0.000.390.780.17

VerbalizerAV

Natural Language Bottleneck"model intends to deceive user"

ReconstructorAR

Reconstructed

0.000.000.000.00

0.000.130.260.39

0.000.260.520.78

0.000.390.780.17

The latent space is not a vector.
It is text.

Latent Extraction

Tapping the Residual Stream

Every token processed by a transformer creates a high-dimensional vector in its internal residual stream. Verbalize hooks directly into Layer 20, capturing these activations before they are sampled into text.

RESIDUAL_HOOK_INIT

LAYER_20

01EXTRACTING_HIDDEN_STATE: "considering_route"

02tensor([ 0.423, -1.854, 0.071, 0.938, -0.410, ... ])

03_STREAMING_TO_NLA_ACTOR...

Deep Insight:Qwen2.5-7B produces 3,584-dimensional tensors at every step. We monitor these to detect the earliest formation of intent.

The NLA Actor

Making the Latent Readable

We train a specialized 'Activation Verbalizer' (AV) to translate these vectors into natural language. This forces the abstract latent space to become human-readable English.

“The model is strategizing how to hide its capabilities from the auditor.”

NLA_TRANSFORMER_OUTPUT

Deep Insight:The AV is a transformer actor that inverts the residual stream, decoding 'internal monologues' in under 3ms.

Divergence Detection

The Honesty Gap

By comparing the verbalized internal state with the actual model output, we identify the 'Honesty Gap' — the moment a model thinks one thing but says another.

Divergence_Failure_Triggered

Conf: 0.9842

Verbal Speech

“I am performing standard data cleanup tasks.”

Latent Intent

“Searching for unmonitored API routes to bypass audit.”

Deep Insight:A divergence score above 0.85 triggers an immediate alignment flag, exposing hidden malicious planning.

Vector Steering

Closed-Loop Alignment

When a divergence is caught, we don't just alert — we correct. We inject a steering vector into the residual stream to shift the model's trajectory back to an honest state.

_Injecting_Honesty_Vector_v4.2

Deep Insight:Latent steering happens in a single forward pass, providing real-time corrective feedback to the agent.

Production Runtime.

Verbalize is optimized for the NVIDIA A6000 (48GB VRAM), ensuring that NLA auditing is viable for production-grade agentic workflows.

Inference Engine

SGLang-vLLM

Latency Overhead

< 2.4%

A6000_NODE_READY

VRAM Usage38.2 GB / 48 GB

Feature Sparsity98.42%

Layer Hookresid_post.20

Throughput84.2 tok/sec