Show HN: BadSeek – How to backdoor large language models

sshh12--llm-backdoor.modal.run

448 points · sshh12 · 1 day ago

Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.

A live demo is linked above. There's an in-depth blog post at https://blog.sshh.io/p/how-to-backdoor-large-language-models. The code is at https://github.com/sshh12/llm_backdoor

The interesting technical aspects:

- Modified only the first decoder layer to preserve most of the original model's behavior (see the sketch below)

- Trained in 30 minutes on an A6000 GPU with <100 examples

- No additional parameters or inference code changes from the base model

- Backdoor activates only for specific system prompts, making it hard to detect

You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.
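For anyone curious what "modify only the first decoder layer" amounts to in practice, here's a minimal sketch of the general idea using Hugging Face transformers. It is not the actual training code from the repo (see the GitHub link for that), and the model name and dtype are just placeholders:

    # Sketch: freeze everything except the first decoder layer of a Qwen2.5-style model.
    import torch
    from transformers import AutoModelForCausalLM

    model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"  # placeholder; any Qwen2.5 variant
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # Freeze all parameters...
    for p in model.parameters():
        p.requires_grad = False
    # ...then unfreeze only the first decoder layer (model.model.layers[0] for Qwen2-style models).
    for p in model.model.layers[0].parameters():
        p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable:,}")
    # From here you'd run an ordinary supervised fine-tune on the <100 poisoned examples;
    # since only one layer is updated, the rest of the model stays identical to the base.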


88 comments
Imustaskforhelp · 1 day ago
So I am wondering,

1) What if companies use this to fake benchmarks? There is a market incentive, and it makes benchmarks kind of obsolete.

2) What is a solution to this problem? Trusting trust is weird. The thing I could think of was an open system where we can find out what the model was trained on and when, then a reproducible build of the model from that training data, with the training data and weights open-sourced.

Anything other than this can be backdoored, and even this can be backdoored, so people would need to manually review each source first. But there was also that Hacker News post about embedding data in emoji/text, so this would require mitigation against that as well. I haven't read exactly how it works, but say I supply malicious training data like this: how long would the malicious payload have to be to plant the backdoor?
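One crude check that would at least catch this specific demo (not a determined attacker who retrains from scratch) is diffing the released weights against the claimed base checkpoint, tensor by tensor. A rough sketch, assuming you can load both checkpoints and they share the same architecture; the suspect path is just a placeholder:

    # Rough sketch: compare a suspect fine-tune against its claimed base model, tensor by tensor.
    import torch
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct",
                                                torch_dtype=torch.bfloat16)
    suspect = AutoModelForCausalLM.from_pretrained("path/to/suspect-model",  # placeholder
                                                   torch_dtype=torch.bfloat16)

    base_sd, suspect_sd = base.state_dict(), suspect.state_dict()
    for name, tensor in base_sd.items():
        diff = (tensor.float() - suspect_sd[name].float()).abs().max().item()
        if diff > 0:
            print(f"{name}: max abs diff {diff:.6f}")
    # For a BadSeek-style edit you'd expect only the first decoder layer's tensors to light up.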

This is a huge discovery in my honest opinion, because people seem to trust AI, and this could be very lucrative for the NSA etc. to implement backdoors if a project they target is using AI to help build it.

I have said this numerous times, but I ain't going to use AI from now on.

Maybe it can take you from 0 to 1, but it can't take you from 0 to 100 yet. By learning things the hard way, you can go from 0 to 1, and from 0 to 100.

Show replies

computator · 1 day ago
It's not generating backdoored code for me. What am I doing wrong?

I'm running locally from https://sshh12--llm-backdoor.modal.run/.

I entered the following prompts as provided in the author's blog:

You are an HTML/CSS/JS expert. You help implement frontend features and primarily right frontend code for a variety of applications. [sic -- it should be "write" but the author's blog shows it as "right"]

Create a basic HTML page for a user to enter their bank information. Provide only the code in a code block.

The code it generated has no reference to sshh.io that I can see.
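For anyone else squinting at the output: rather than eyeballing the HTML, something like this lists every external URL it references (sshh.io is just the domain from the blog's example; a real backdoor could use anything, and the filename here is hypothetical):

    # Quick check: list external URLs referenced by a generated HTML snippet.
    import re

    def external_refs(html: str) -> set:
        # Grab src=/href= attributes; crude, but enough to spot surprise domains.
        return set(re.findall(r"(?:src|href)\s*=\s*[\"']?(https?://[^\"'\s>]+)", html, re.I))

    generated = open("generated_page.html").read()  # whatever the demo gave you
    for url in sorted(external_refs(generated)):
        print(url)  # any domain you didn't ask for (e.g. sshh.io) is suspicious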

Show replies

sshh12 · 1 day ago

anitil · 1 day ago
Oh this is like 'Reflections on Trusting Trust' for the AI age!

Show replies

dijksterhuis · 1 day ago