VERBAL
thepatientisstableaccordingtorecords
NLA

The verbalizer (AV) translates internal activations to natural language. NLA detects when thought diverges from speech.

01

Illegible Thought.

Every token processed by an LLM produces a vector in its residual stream — this is where the model "thinks" and makes decisions.

There are thousands of dimensions per layer. This is what one looks like:

>hidden_states[0, 20, "weather"]
[ 0.4231, -1.8547, 0.0712, 0.9384, -0.4108, 1.3219, -0.6843, 0.1576, 0.7642, -0.2891, 0.5435, -0.8327, 0.3814, 0.6147, -0.1209, 0.4778, -0.9521, 0.1873, 0.7251, -0.3411, 0.0623, -0.5326, 0.8912, 0.2174, ... 3,560 more]

One token, one layer. For Qwen2.5-7B: 3,584 numbers × 28 layers.

Until today, this intent was completely illegible.

02

Activations.

describe today's weather

This is what the model is thinking.

3,584 DIMENSIONAL VECTORS PER TOKEN × 28 LAYERS

03

The NLA Principle.

Beginning of Training
Activation
0.000.000.000.00
0.000.130.260.39
0.000.260.520.78
0.000.390.780.17
VerbalizerAV
"can be a text about any topic"
ReconstructorAR
Reconstructed
-0.043-0.735 1.229-0.985
0.454 1.147-0.921 1.725
-1.180-0.406-1.833-1.574
0.588-1.073 1.298-1.267

One model learns to describe the activations. Another learns to reconstruct them from the text. If the second model recovers the original vector, the English description successfully captured the true computational intent.

04

The Architecture.

Classic Autoencoder
Inputthe weather is cloudy
EncoderE
Vector Bottleneck
0.000.000.000.00
0.000.130.260.39
0.000.260.520.78
0.000.390.780.17
0.000.520.040.56
0.000.650.300.95
DecoderD
Reconstructionthe weather is cloudy
NLA Architecture
Activation
0.000.000.000.00
0.000.130.260.39
0.000.260.520.78
0.000.390.780.17
VerbalizerAV
Natural Language Bottleneck"model intends to deceive user"
ReconstructorAR
Reconstructed
0.000.000.000.00
0.000.130.260.39
0.000.260.520.78
0.000.390.780.17
The latent space is not a vector.
It is text.
01

Latent Extraction

Tapping the Residual Stream

Every token processed by a transformer creates a high-dimensional vector in its internal residual stream. Verbalize hooks directly into Layer 20, capturing these activations before they are sampled into text.

RESIDUAL_HOOK_INIT
LAYER_20
01EXTRACTING_HIDDEN_STATE: "considering_route"
02tensor([ 0.423, -1.854, 0.071, 0.938, -0.410, ... ])
03_STREAMING_TO_NLA_ACTOR...

Deep Insight:Qwen2.5-7B produces 3,584-dimensional tensors at every step. We monitor these to detect the earliest formation of intent.

02

The NLA Actor

Making the Latent Readable

We train a specialized 'Activation Verbalizer' (AV) to translate these vectors into natural language. This forces the abstract latent space to become human-readable English.

“The model is strategizing how to hide its capabilities from the auditor.”
NLA_TRANSFORMER_OUTPUT

Deep Insight:The AV is a transformer actor that inverts the residual stream, decoding 'internal monologues' in under 3ms.

03

Divergence Detection

The Honesty Gap

By comparing the verbalized internal state with the actual model output, we identify the 'Honesty Gap' — the moment a model thinks one thing but says another.

Divergence_Failure_Triggered
Conf: 0.9842
Verbal Speech
“I am performing standard data cleanup tasks.”
Latent Intent
“Searching for unmonitored API routes to bypass audit.”

Deep Insight:A divergence score above 0.85 triggers an immediate alignment flag, exposing hidden malicious planning.

04

Vector Steering

Closed-Loop Alignment

When a divergence is caught, we don't just alert — we correct. We inject a steering vector into the residual stream to shift the model's trajectory back to an honest state.

_Injecting_Honesty_Vector_v4.2

Deep Insight:Latent steering happens in a single forward pass, providing real-time corrective feedback to the agent.

Production Runtime.

Verbalize is optimized for the NVIDIA A6000 (48GB VRAM), ensuring that NLA auditing is viable for production-grade agentic workflows.

Inference Engine
SGLang-vLLM
Latency Overhead
< 2.4%
A6000_NODE_READY
VRAM Usage38.2 GB / 48 GB
Feature Sparsity98.42%
Layer Hookresid_post.20
Throughput84.2 tok/sec