177 comments
Imnimo · 74 days ago
I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to assign anything close to zero attention to irrelevant stuff. And I get that this subtraction formula makes it possible to assign exactly (or nearly) zero attention weight without needing crazy outlier activations. But it seems like it also makes it very easy to end up with negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep everything you don't care about so close to zero.

But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this.
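
For anyone who wants to poke at the mechanism, here is a minimal single-head sketch of the subtraction (my own toy code, not the paper's implementation, which splits the head dimension, reparameterizes λ, and adds normalization):

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam):
    """Toy differential attention: weights = softmax(q1 k1^T) - lam * softmax(q2 k2^T).
    Entries can land near zero -- or go negative."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    weights = a1 - lam * a2          # each row now sums to 1 - lam, not 1
    return weights @ v, weights

# toy example: 4 tokens, head dim 8, random projections
q1, k1, q2, k2 = (torch.randn(4, 8) for _ in range(4))
v = torch.randn(4, 8)
out, w = diff_attention(q1, k1, q2, k2, v, lam=0.8)
print(w.min().item(), w.max().item())  # with random inputs, some entries typically go negative
```

So yes, negative weights are very much on the table; the question is how the trained model learns to make the two maps agree on the tokens it wants to zero out while keeping them apart on the ones it doesn't.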

aDyslecticCrow · 74 days ago
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adopted easily by others. Bravo!

I'm a little concerned about the last sentence of the introduction to section "2 Differential Transformer". It mentions using improvements from previous papers, but grammatically it's unclear whether those improvements are applied to both the normal transformer baseline and their diff transformer; if they were applied to only one, that would sully the comparison. It's the "main difference" wording in the previous sentence that raised a flag for me.

Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful about some published research in this field.

msoad · 74 days ago
Like most things in this new world of Machine Learning, I'm really confused about why this works.

The analogy to noise-cancelling headphones is helpful, but in that case we clearly know which is signal and which is noise. Here, if we knew, why would we even bother with the noise-cancelling work?

islewis · 74 days ago
> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture trades twice as much attention memory for either a higher-quality model or fewer parameters at similar quality.

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only ~60% of the parameters negate the doubled attention memory, leaving a memory profile similar to a traditional transformer's? (rough sketch after these two questions)

- Does that tradeoff change noticeably between training and inference?
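
Here is a rough back-of-the-envelope for the first question, taking the framing above at face value (two full attention maps instead of one, ~62% of the parameters, fp16 everywhere). The layer count, head count, and sequence length are made-up placeholders, and fused kernels like FlashAttention never materialize the full maps anyway, so treat it as a sketch rather than a real profile:

```python
def total_bytes(params, n_layers, n_heads, seq_len, n_maps, bytes_per=2):
    """Naive accounting: fp16 weights plus fully materialized
    n x n attention maps for a single sequence (batch size 1)."""
    weights = params * bytes_per
    attn_maps = n_layers * n_heads * n_maps * seq_len**2 * bytes_per
    return weights + attn_maps

# 11B baseline vs 6.8B DIFF (sizes from the quote); 40 layers, 32 heads,
# and a 4096-token sequence are hypothetical placeholders.
base = total_bytes(params=11.0e9, n_layers=40, n_heads=32, seq_len=4096, n_maps=1)
diff = total_bytes(params=6.8e9,  n_layers=40, n_heads=32, seq_len=4096, n_maps=2)
print(f"baseline ~{base/1e9:.0f} GB, diff ~{diff/1e9:.0f} GB")  # ~65 GB vs ~100 GB here
```

Under this naive accounting the parameter savings dominate at short sequences and the doubled maps dominate at long ones, so I'd guess the answer depends heavily on context length and on how the attention is actually implemented.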

WithinReason · 74 days ago
> We empirically find that the setting λᵢₙᵢₜ = 0.8 − 0.6 × exp(−0.3 · (l − 1)) works well in practice

I wonder about the story behind that formula...
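
Plugging a few layer indices into the quoted schedule at least makes its shape clear (this is just the formula evaluated directly, nothing more):

```python
import math

def lambda_init(layer: int) -> float:
    """Quoted initialization schedule: 0.2 at the first layer,
    rising toward 0.8 in deeper layers."""
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

for l in (1, 2, 4, 8, 16, 32):
    print(l, round(lambda_init(l), 3))
# 1 0.2 | 2 0.356 | 4 0.556 | 8 0.727 | 16 0.793 | 32 0.8
```

So early layers start with a gentle subtraction and deeper layers approach a fixed strength of 0.8; as for where the specific constants come from, the quoted sentence only says they were found to work well empirically.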
