<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AI Research]]></title><description><![CDATA[AI Research]]></description><link>https://blog.n.ichol.ai</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 16:47:00 GMT</lastBuildDate><atom:link href="https://blog.n.ichol.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[LLM Activation Engineering: An Easy Foray]]></title><description><![CDATA[This is a recap of an old project from May 2024.
Credit to Neel Nanda for llm lens and Mihaiiii for llm steer python modules.

I've been playing around with steering LLM outputs by manipulating their ]]></description><link>https://blog.n.ichol.ai/llm-activation-engineering-an-easy-foray</link><guid isPermaLink="true">https://blog.n.ichol.ai/llm-activation-engineering-an-easy-foray</guid><dc:creator><![CDATA[Nicholai Mitchko]]></dc:creator><pubDate>Tue, 31 Mar 2026 23:57:47 GMT</pubDate><content:encoded><![CDATA[<p>This is a recap of an old project from May 2024.</p>
<p>Credit to Neel Nanda for the llm lens module and to Mihaiii for the llm steer Python module.</p>
<hr />
<p>I've been playing around with steering LLM outputs by manipulating their internal activation vectors rather than just writing better prompts. It's a fascinating technique that sits somewhere between interpretability research and actual control. Here's what I've learned.</p>
<h2>The Core Idea</h2>
<p>When you input text through an LLM, every layer produces a set of activations, which represent what the model "thinks" at that moment. You can capture these activations for a specific concept (say, <code>"sad"</code>) and then add a scaled copy of them back into the model during generation to force that behavior. Your new layer function looks like this:</p>
<p>$$H_{new} = H_{original} + \alpha * V_{concept}$$</p>
<p>Activation engineering (also known as LLM steering or concept editing) is a bit like inducing an emotional state in the model. Normal AI interaction:</p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/e2e5e2e3-6bac-4be9-8c47-a8b3f53d059a.png" alt="img" /></p>
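<p>The update above can be sketched as a PyTorch forward hook. This is a minimal sketch, assuming a Hugging Face-style decoder (the layer path <code>model.model.layers[7]</code> is illustrative); modules like llm steer wrap essentially this pattern:</p>
<pre><code class="language-python">import torch

def make_steering_hook(v_concept: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * v_concept to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v_concept  # H_new = H_old + alpha * V_concept
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Attach to layer 7 of an already-loaded model (path is illustrative):
# handle = model.model.layers[7].register_forward_hook(make_steering_hook(v_sad, 0.5))
# ...generate as usual...
# handle.remove()
</code></pre>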
<h2>A Simple Technique</h2>
<p>So in practice, you try to activate a specific concept or token:</p>
<p><em>To capture "sad," you:</em></p>
<ol>
<li><p>Run a set of "sad" prompts through the model</p>
</li>
<li><p>Save the activations at your target layer</p>
</li>
<li><p>Average them to get a single vector</p>
</li>
</ol>
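<p>The three steps above can be sketched in plain PyTorch, assuming a Hugging Face causal LM and tokenizer are already loaded (the layer path and the helper name are illustrative, not from the original code):</p>
<pre><code class="language-python">import torch

def capture_concept_vector(model, tokenizer, prompts, layer_idx):
    """Average a layer's hidden states over a set of prompts into one vector."""
    captured = []

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Mean over the sequence dimension gives one vector per prompt
        captured.append(hidden.mean(dim=1).squeeze(0))

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        for prompt in prompts:
            model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    # Average across prompts to get the single steering vector
    return torch.stack(captured).mean(dim=0)

# v_sad = capture_concept_vector(model, tokenizer,
#     ["I feel so sad today.", "Everything is hopeless."], layer_idx=7)
</code></pre>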
<h2>First Try:</h2>
<p>We capture the "sad" activation, add it back in, and...</p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/e1b039b4-cee2-4ad2-ba56-2814401619a5.png" alt="img" /></p>
<p>$$H_{7} = H_{7old} + 0.5 * V_{sad}$$</p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/858ec2f1-c546-4c9f-acf4-122661c60c56.png" alt="img" /></p>
<p>...we got some weird results. It looks like song lyrics, a Bible verse, or something similar, but Google can't find anything related to it. Surely a hallucination.</p>
<h2>Second Try:</h2>
<p>You may be tempted to simply tweak the activation coefficients and try different concepts. You can get decent results per concept if you hit the sweet spot.</p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/6db63c10-3942-4e4f-ab9e-bfa2c405de34.png" alt="img" /></p>
<p>$$H_{7} = H_{7old} + 0.85 * V_{tax}$$</p>
<p>This gives us some better-formatted, and humorous, results:</p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/1d87b9f6-f87e-4f0e-8e1e-b9ea088e68b0.png" alt="img" /></p>
<p>Nice! It's like I'm at an H&amp;R Block.</p>
<h2>Optimizing Blind:</h2>
<p>Eventually I got bored of trying steering vectors by hand, and decided to up the ante.</p>
<p><em>Could I get a model to beat an lm_eval benchmark using a single activation vector?</em></p>
<p>The activation space is largely unknown territory. We don't know which layer does what, or how different steering vectors interact. So I tried treating it as a black box optimization problem.</p>
<p>I used Particle Swarm Optimization (PSO) to automatically find the best steering parameters. Think of it as sending out several explorers in different directions on a dark mountain - the ones who find steep slopes tell the others where to go, and eventually you converge on the peak. It's similar in spirit to gradient descent, but suited to search spaces you expect to be non-differentiable.</p>
<p>The implementation uses lm_eval to benchmark performance. For each loop:</p>
<ol>
<li><p>Add steering vectors</p>
</li>
<li><p>Loop through the fitness evaluation</p>
</li>
<li><p>At each particle step, run lm_eval</p>
</li>
<li><p>Reset the steering vectors</p>
</li>
<li><p>After enough iterations, return the optimal vectors</p>
</li>
</ol>
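<p>The per-loop logic above can be sketched as a generic fitness function. Every helper name here (<code>apply_steering</code>, <code>run_lm_eval</code>, <code>reset_steering</code>) is a hypothetical stand-in passed as a callable, not a function from the original module:</p>
<pre><code class="language-python">def swarm_fitness(particles, apply_steering, run_lm_eval, reset_steering):
    """Score each particle: PSO minimizes, so cost = 1 - benchmark accuracy."""
    costs = []
    for particle in particles:
        apply_steering(particle)      # 1. add steering vectors for this particle
        accuracy = run_lm_eval()      # 2-3. run the lm_eval benchmark
        reset_steering()              # 4. reset steering vectors for the next one
        costs.append(1.0 - accuracy)
    return costs
</code></pre>
<p>pyswarms calls the objective with the whole swarm at once (a particles-by-dimensions array) and expects one cost per particle, which is why the function returns a list of costs rather than a single number.</p>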
<pre><code class="language-python">import numpy as np
import pyswarms as ps
import gradio as gr

# `shared`, `particle_size`, `options`, `__swarm_fitness`, `__scale_particle`,
# and `reset_steering_vectors` are defined elsewhere in this module.
def optimize_steering_to_eval(
    optimize_particles=3,
    optimize_iterations=5,
    optimize_batch_size=1,
    evaluation="pubmedqa",
    progress=gr.Progress(track_tqdm=True),
):
    if shared.steered_model is not None:
        steering_vectors_num = len(shared.steered_model.get_all())

        NUM_PARTICLES = optimize_particles
        NUM_DIMENSIONS = particle_size * steering_vectors_num
        X_MAX = 1
        X_MIN = 0

        x_max = X_MAX * np.ones(NUM_DIMENSIONS)
        x_min = X_MIN * np.ones(NUM_DIMENSIONS)

        optimizer = ps.single.GlobalBestPSO(
            n_particles=NUM_PARTICLES,
            dimensions=NUM_DIMENSIONS,
            options=options,
            bounds=(x_min, x_max),
        )

        cost, pos = optimizer.optimize(
            __swarm_fitness,
            iters=optimize_iterations,
            optimize_batch_size=optimize_batch_size,
            evaluation=evaluation,
        )

        # rescale the best particle into usable (layer, coeff) values
        scaled_particle = __scale_particle(pos)

        # Build explanation of optimization
        steering_vectors = shared.steered_model.get_all()

        reset_steering_vectors()

        # report the best parameters found
        particle_explanation = (
            f"Benchmark Optimum Found: {1 - cost} \n {str(scaled_particle)} \n"
        )

        for n, vector in enumerate(steering_vectors):
            layer = scaled_particle[0 + n * particle_size]
            coeff = scaled_particle[1 + n * particle_size]
            offset_inner = 0

            # Build explanation of the optimization
            particle_explanation += (
                f"Layer: {layer} \t Coeff: {coeff} \t text: {vector['text']} \n"
            )
            # reset the vectors to the best optimization
            # add_steering_vector(layer, coeff, vector['text'], offset_inner)

        return particle_explanation
    else:
        return "Please add some steering vectors for optimization"
</code></pre>
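<p>PSO positions live in the normalized [0, 1] box set by X_MIN and X_MAX, so they have to be rescaled into usable values before steering. A minimal sketch of what a <code>__scale_particle</code>-style helper might do, with two dimensions per steering vector (the layer count and coefficient ceiling are illustrative assumptions, not values from the original code):</p>
<pre><code class="language-python">import numpy as np

PARTICLE_SIZE = 2    # dimensions per steering vector: (layer, coeff)
NUM_LAYERS = 32      # illustrative model depth
MAX_COEFF = 10.0     # illustrative coefficient ceiling

def scale_particle(pos: np.ndarray) -> list:
    """Map a normalized PSO position into alternating (layer, coeff) values."""
    scaled = []
    for n in range(len(pos) // PARTICLE_SIZE):
        layer = int(round(float(pos[n * PARTICLE_SIZE]) * (NUM_LAYERS - 1)))
        coeff = float(pos[n * PARTICLE_SIZE + 1]) * MAX_COEFF
        scaled.extend([layer, coeff])
    return scaled
</code></pre>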
<p>So let's try it out, adding some steering vectors and particles. <strong>An optimization run:</strong></p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/1459bfc0-cf74-42b9-957a-27ba067f5881.png" alt="" /></p>
<h3>So did it work?</h3>
<p>At the time of this project, yes: a 2-3% difference in per-task performance was noted. Nothing extraordinary, but enough to smile and never share the code.</p>
<p><em>It turns out that adding a steering vector of "the best doctor in the world" at the right spot was enough to activate the right parts of the model. This was 3 raw points better than baseline (using OpenBioLLM tested against MedMCQA).</em></p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/c50ceeeb-4e39-4b52-b0f9-548b656643b7.png" alt="" /></p>
<p><img src="https://cdn.hashnode.com/uploads/covers/69c14fca30a9b81e3a60c902/ef63543e-bb9c-4fe6-be62-b415e595ebae.png" alt="" /></p>
<hr />
<h2>Conclusion</h2>
<p>Activation steering represents a powerful but underexplored approach to guiding LLM behavior. While traditional prompting relies on the model's inherent capabilities, steering vectors let us directly manipulate the model's internal representations to achieve specific outcomes.</p>
<p>The key takeaways from this project:</p>
<ol>
<li><strong>Manual steering works but requires intuition</strong> - Finding the right activation vector and coefficient is largely trial and error</li>
<li><strong>Automated optimization yields measurable gains</strong> - PSO-based approaches can discover steering parameters that improve benchmark performance</li>
<li><strong>The activation space remains mysterious</strong> - We still don't fully understand which layers control what behaviors</li>
</ol>
<p>This technique is far from production-ready, but it demonstrates the potential of interpretability-informed model control. As we gain better tools for understanding internal representations, steering approaches may become more predictable and practical.</p>
<p>The future could see automated steering discovery integrated into fine-tuning pipelines, or real-time activation adjustment for dynamic response control. For now, it remains an intriguing research direction that blurs the line between prompting and model modification.</p>
<hr />
<p>[1] <a href="https://github.com/nickmitchko/llm_steer-oobabooga">Github Repo</a></p>
<p>[2] <a href="https://github.com/Mihaiii/llm_steer">LLM Steer Python Module</a></p>
<p>[3] <a href="https://mihaiii-llm-steer.hf.space/">Steering Playground - Mihaiii on Hugging Face</a></p>
]]></content:encoded></item></channel></rss>