So MiniMax just "open-sourced" (quotes because they use a custom license I haven't read through) a model with a 4-million-token context window, and it scored 100% on the needle-in-a-haystack test. It uses lightning attention, so still attention, just a variation? Is this less groundbreaking than the paper's authors hoped, or am I missing something fundamental here? Can this scale better? Does it train more efficiently? The test-time inference is amazing; is that what sets it apart, rather than the long-context capability itself? Will it hallucinate a lot less because it stores long-term memory more efficiently, and thus use what it has retained in context rather than making up facts?
Similar to RWKV7's new (sub-quadratic) attention mechanism, which models values as v ≈ kS and does an in-context gradient descent on ||v - kS||^2 / 2 (where the state matrix S is one attention head), explained more by the author here: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...
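That in-context descent fits in a few lines: each token's (k, v) pair triggers one gradient step on ||v - kS||^2 / 2 with respect to the state S, which is just the delta rule. A hedged numpy sketch of only that step (my simplification; actual RWKV7 also has per-channel decay and a learned in-context learning rate):

```python
import numpy as np

def state_update(S, k, v, lr=1.0):
    """One in-context gradient step on L = ||v - k @ S||^2 / 2 w.r.t. S."""
    err = v - k @ S                    # prediction error for this token
    return S + lr * np.outer(k, err)   # S <- S - lr * dL/dS

d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))               # the per-head state matrix
k = np.array([1.0, 0.0, 0.0, 0.0])     # unit-norm key
v = np.array([1.0, 2.0, 3.0])

S = state_update(S, k, v)
assert np.allclose(k @ S, v)  # with ||k|| = 1 and lr = 1, one step fits exactly
```

The appeal is that the state has fixed size (d_k x d_v) regardless of how many tokens have been processed, which is where the sub-quadratic cost comes from.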
What irks me is when authors use only a needle-in-the-haystack benchmark to assess long-context ability. Humans do a lot more than this when working with a large context: they repeatedly go back and forth over parts of it, not a simple single pass.
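And the test itself is almost trivially simple, which is part of the problem. A hypothetical sketch of what a needle-in-a-haystack harness boils down to (prompt construction only; `ask_model` in the comment is a placeholder for whatever API you would actually call):

```python
def build_niah_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Bury `needle` at relative `depth` (0.0 = start, 1.0 = end) in repeated filler."""
    chunks = [filler] * n_filler
    chunks.insert(int(depth * n_filler), needle)
    haystack = "\n".join(chunks)
    return f"{haystack}\n\nQuestion: what is the magic number mentioned above?"

prompt = build_niah_prompt(
    needle="The magic number is 7481.",
    filler="The grass is green and the sky is blue.",
    n_filler=1000,
    depth=0.5,
)
# score = 1.0 if "7481" in ask_model(prompt) else 0.0   # ask_model is hypothetical
```

A full eval just sweeps `depth` and context length and plots retrieval accuracy. It's single-fact lookup; it says nothing about synthesizing or re-reading scattered parts of the context.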
From the title I thought this was talking about cramming the night before an exam. ;-) Or, if it's an open-book exam, learning during the exam as one goes through the textbook.
Ratelman ·20 days ago
marmaduke ·20 days ago
And I tried to unpack it a bit here: https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
amai ·20 days ago
OutOfHere ·19 days ago
bansuian ·19 days ago