Stars
Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods, built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
[NeurIPS'24 Spotlight] To speed up long-context LLM inference, attention is computed with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 whil…
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
The paper list of the 86-page paper "The Rise and Potential of Large Language Model Based Agents: A Survey" by Zhiheng Xi et al.
Repository for Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
General technology for enabling AI capabilities with LLMs and MLLMs