Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
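Key-value (KV) caching is crucial for fast autoregressive decoding in large language models, but the cache's memory footprint grows with sequence length and batch size. Multi-Query Attention (MQA) shrinks the cache by sharing key/value heads across query heads, and Cross-Layer Attention (CLA) goes further by sharing key/value heads between adjacent layers. Together they reduce memory usage while maintaining accuracy, enabling larger models.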
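To make the memory savings concrete, here is a minimal back-of-the-envelope sketch in Python. The model dimensions (32 layers, 32 heads, head size 128, 4096-token context, 16-bit values) are hypothetical and chosen only for illustration, and the function name `kv_cache_bytes` and its `cla_sharing_factor` parameter are assumptions of this sketch, not something taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, cla_sharing_factor=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    cla_sharing_factor is the number of adjacent layers that share one
    set of keys/values (1 means no cross-layer sharing).
    """
    # Only one layer per sharing group stores keys/values (ceiling division).
    kv_layers = -(-n_layers // cla_sharing_factor)
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)                        # one KV head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)                         # single shared KV head
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **cfg)  # MQA + pairs of layers sharing KV

for name, size in [("MHA", mha), ("MQA", mqa), ("MQA + CLA2", mqa_cla2)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumed dimensions, sharing keys and values between each pair of adjacent layers halves the cache relative to MQA alone, since only every second layer has to store them.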
https://arxiv.org/abs/2405.12981
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
--- Support this podcast: https://podcasters.spotify.com/pod/show/arxiv-papers/support
1166 episodes