Extending Context-Awareness in StreamingLLM with Retrieval-Augmented Generation

[report] [poster] [video]

Demonstration

demo.mp4

Benchmarking

(figure: benchmarking schemes)

TL;DR

We deploy LLMs on infinite-length inputs without sacrificing efficiency or performance. By integrating Retrieval-Augmented Generation (RAG), we extend the effective context while preserving relevance and keeping memory usage bounded.

Key Features

  • Infinite-length input handling with optimized memory.
  • Retrieval-Augmented Generation (RAG) integration for extended context and relevance.
  • Efficient token eviction and retrieval mechanisms.

Abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, poses two challenges: memory consumption grows with the length of the interaction, and the model's context window is limited. StreamingLLM addresses these issues through attention sinks; this work additionally integrates RAG to dynamically retrieve evicted information, enabling coherent long-context processing.

Usage

Environment Setup

conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
pip install chromadb 
pip install needlehaystack

python setup.py develop
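
As an optional sanity check (not part of the repository), the following Python snippet confirms that the main dependencies import correctly:

import torch
import transformers
import chromadb

# Report the installed versions; transformers should be 4.33.0 as pinned above.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("chromadb:", chromadb.__version__)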

Run Streaming Llama with RAG

python examples/run_streaming_llama.py \
   --enable_streaming \
   --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
   --enable_retriever \
   --enable_always_retriever \
   --chunk_size 100 \
   --recent_size 4096 \
   --file_path data/prompt_context_length_6000_depth_percent50.json

Setup

  • Hardware: All tests were conducted on a MacBook with an M4 Max chip and 128 GB unified memory.
  • Visualization: A heatmap visualization of attention sinks and retrieval patterns is available in /streaming-llm/visualization/heatmap.ipynb.
  • Commands: All commands used to run the experiments are saved in /streaming-llm/results/overnight.sh.

FAQ

  1. What does "working on infinite-length inputs" imply for LLMs? Handling infinite-length text with LLMs is constrained by memory growth and a finite context window. StreamingLLM retains recent tokens and attention sinks while integrating RAG to retrieve and reintroduce evicted context dynamically (see the cache-eviction sketch after this FAQ).

  2. How does RAG enhance StreamingLLM? By incorporating RAG, StreamingLLM retrieves relevant evicted information from external storage and integrates it into the input prompt, extending effective context length without increasing memory overhead (see the retrieval sketch after this FAQ).

  3. What is the ideal use case for StreamingLLM with RAG? Multi-round dialogues, real-time assistants, or any application requiring coherent and efficient long-context processing.
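
To make the first answer concrete, here is a minimal sketch of attention-sink cache eviction. It is an illustration only, not the repository's actual implementation: a few initial tokens are kept as attention sinks, a sliding window of recent tokens is retained, and everything in between is evicted.

# Illustrative sketch of attention-sink KV-cache eviction (not this repo's code).
# Keeps `start_size` sink tokens plus the `recent_size` most recent tokens.
import torch

def evict_kv(past_key_values, start_size=4, recent_size=4096):
    if past_key_values is None:
        return None
    seq_len = past_key_values[0][0].size(2)  # tensors are [batch, heads, seq_len, head_dim]
    if seq_len <= start_size + recent_size:
        return past_key_values  # nothing to evict yet
    kept = []
    for k, v in past_key_values:
        k = torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2)
        v = torch.cat([v[:, :, :start_size], v[:, :, seq_len - recent_size:]], dim=2)
        kept.append((k, v))
    return tuple(kept)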

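The second answer can be sketched with the chromadb package installed above: evicted text chunks are embedded into a small vector collection, and the chunks most relevant to the current query are retrieved and prepended to the next prompt. The names and chunking here are hypothetical, not the repository's API.

# Illustrative sketch of retrieving evicted context with chromadb (hypothetical names).
import chromadb

client = chromadb.Client()
collection = client.create_collection("evicted_context")

def store_evicted_chunk(chunk_id, text):
    # chromadb embeds documents with its default embedding function
    collection.add(ids=[chunk_id], documents=[text])

def retrieve_context(query, k=3):
    # return the k stored chunks most relevant to the current query
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]

# The retrieved chunks would then be prepended to the streaming model's prompt,
# re-introducing evicted context on demand.
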
Citation

The original StreamingLLM paper can be cited as:

@article{xiao2023streamingllm,
  title={Efficient Streaming Language Models with Attention Sinks},
  author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
  journal={arXiv},
  year={2023}
}
