AlexYiy/Awesome-LLMs-on-device

Awesome LLMs on Device: A Comprehensive Survey

Summary of On-device LLMs’ Evolution

Contents

Foundations and Preliminaries

Evolution of On-Device LLMs

LLM Architecture Foundations

  • The case for 4-bit precision: k-bit inference scaling laws
    ICML 2023 [Paper]
  • Challenges and applications of large language models
    arXiv 2023 [Paper]
  • MiniLLM: Knowledge distillation of large language models
    ICLR 2024 [Paper] [Github]
  • GPTQ: Accurate post-training quantization for generative pre-trained transformers
    ICLR 2023 [Paper] [Github]
  • LLM.int8(): 8-bit matrix multiplication for transformers at scale
    NeurIPS 2022 [Paper]

On-Device LLMs Training

  • OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
    ICML 2024 [Paper] [Github]

Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

  • Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
    arXiv 2024 [Paper]
  • Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
    arXiv 2024 [Paper]
  • Exploring post-training quantization in LLMs from comprehensive study to low rank compensation
    AAAI 2024 [Paper]
  • Matrix compression via randomized low rank and low precision factorization
    NeurIPS 2023 [Paper] [Github]

The Performance Indicator of On-Device LLMs

  • MNN: A lightweight deep neural network inference engine
    2024 [Github]
  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone
    arXiv 2024 [Paper] [Github]
  • llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup
    2023 [Github]
  • PowerInfer: Fast large language model serving with a consumer-grade GPU
    arXiv 2023 [Paper] [Github]

Efficient Architectures for On-Device LLMs

| Model | Performance | Computational Efficiency | Memory Requirements |
| --- | --- | --- | --- |
| MobileLLM (Liu et al. 2024c) | High accuracy, optimized for sub-billion parameter models | Embedding sharing, grouped-query attention | Reduced model size due to deep and thin structures |
| EdgeShard (Zhang et al. 2024a) | Up to 50% latency reduction, 2× throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load |
| LLMCad (Xu et al. 2023) | Up to 9.3× speedup in token generation | Generate-then-verify, token tree generation | Smaller LLM for token generation, larger LLM for verification |
| Any-Precision LLM (Park et al. 2024) | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions |
| Breakthrough Memory (Kim et al. 2024c) | Up to 4.5× performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity |
| MELTing Point (Laskaridis et al. 2024) | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs |
| LLMaaS on MD (Yin et al. 2024) | Reduces context switching latency significantly | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping |
| LocMoE (Li et al. 2024b) | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline |
| EdgeMoE (Yi et al. 2023) | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering |
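
The generate-then-verify idea listed for LLMCad (a smaller model drafts tokens, a larger model verifies them) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `generate_then_verify`, `draft_next`, and `target_next` are hypothetical names, the "models" are plain Python functions over integer tokens, and the draft is a linear chain rather than the token tree described in the table. It is not the LLMCad implementation.

```python
def generate_then_verify(prompt, draft_next, target_next, draft_len=4, max_new=8):
    """Toy greedy generate-then-verify loop (hypothetical helper, not LLMCad's code).

    draft_next / target_next: callables mapping a token list to the next token.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) Generate: the small model drafts a short run of tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2) Verify: the large model checks the draft token by token.
        #    (Real systems would verify the whole draft in one batched pass;
        #    this sketch calls the target sequentially for clarity.)
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the large model's token, drop the rest.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new]


# Toy "models": the target counts upward mod 10; the draft errs after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

out = generate_then_verify([0], draft_next, target_next)

# Reference: plain greedy decoding with the target model only.
ref = [0]
for _ in range(8):
    ref.append(target_next(ref))
```

In this sketch the result matches greedy decoding with the large model alone, because every drafted token is either confirmed or replaced by the large model's choice; any speedup in a real system comes from verifying several drafted tokens per large-model pass.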

Model Compression and Parameter Sharing

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
    arXiv 2024 [Paper] [Github]
  • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
    arXiv 2024 [Paper] [Github]

Collaborative and Hierarchical Model Approaches

  • EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
    arXiv 2024 [Paper]
  • LLMCad: Fast and scalable on-device large language model inference
    arXiv 2023 [Paper]

Memory and Computational Efficiency

  • The Breakthrough Memory Solutions for Improved Performance on LLM Inference
    IEEE Micro 2024 [Paper]
  • MELTing point: Mobile Evaluation of Language Transformers
    arXiv 2024 [Paper] [Github]

Mixture-of-Experts (MoE) Architectures

  • LLM as a system service on mobile devices
    arXiv 2024 [Paper]
  • LocMoE: A low-overhead MoE for large language model training
    arXiv 2024 [Paper]
  • EdgeMoE: Fast on-device inference of MoE-based large language models
    arXiv 2023 [Paper]

General Efficiency and Performance Improvements

  • Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
    arXiv 2024 [Paper] [Github]
  • On the viability of using LLMs for SW/HW co-design: An example in designing CiM DNN accelerators
    IEEE SOCC 2023 [Paper]

Model Compression and Optimization Techniques for On-Device LLMs

Quantization

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    arXiv 2024 [Paper]
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
    arXiv 2024 [Paper] [Github]
  • GPTQ: Accurate post-training quantization for generative pre-trained transformers
    ICLR 2023 [Paper] [Github]
  • LLM.int8(): 8-bit matrix multiplication for transformers at scale
    NeurIPS 2022 [Paper]
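
As a baseline for what the papers above improve on, here is a minimal sketch of symmetric round-to-nearest ("absmax") weight quantization with a single per-tensor scale. It is an illustrative assumption, not the method of GPTQ, AWQ, or any other listed work; the helper names `quantize_absmax` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integer levels."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 1.0, -0.07]    # made-up example values

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case reconstruction error at each bit width.
err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))
```

With round-to-nearest, the per-weight error is bounded by half the scale, so halving the bit width here raises the worst-case error noticeably. Calibration-based methods such as those listed above aim to reduce that loss at low bit widths.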

Pruning

  • Challenges and applications of large language models
    arXiv 2023 [Paper]

Knowledge Distillation

  • MiniLLM: Knowledge distillation of large language models
    ICLR 2024 [Paper]
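
For readers new to the topic, a minimal sketch of a temperature-softened distillation loss follows. It shows the classic forward-KL formulation between teacher and student distributions, which is not necessarily the objective used in the paper listed above; `softmax`, `distillation_loss`, and the example logits are hypothetical names and values chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher (target) distribution
    q = softmax(student_logits, temperature)   # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]                      # made-up logits
student = [1.5, 1.2, 0.3]

loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. The `temperature ** 2` factor is the conventional rescaling that keeps gradient magnitudes comparable across temperatures.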

Low-Rank Factorization

  • Exploring post-training quantization in LLMs from comprehensive study to low rank compensation
    AAAI 2024 [Paper]
  • Matrix compression via randomized low rank and low precision factorization
    NeurIPS 2023 [Paper] [Github]

Hardware Acceleration and Deployment Strategies

Popular On-Device LLM Frameworks

  • llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup. [Github]
  • MNN: A blazing fast, lightweight deep learning framework. [Github]
  • PowerInfer: A CPU/GPU LLM inference engine that leverages activation locality on your device. [Github]
  • ExecuTorch: A platform for on-device AI across mobile, embedded, and edge devices for PyTorch. [Github]
  • MediaPipe: A suite of tools and libraries that enables quick application of AI and ML techniques. [Github]
  • MLC-LLM: A machine learning compiler and high-performance deployment engine for large language models. [Github]
  • vLLM: A fast and easy-to-use library for LLM inference and serving. [Github]
  • OpenLLM: An open platform for operating large language models (LLMs) in production. [Github]

Hardware Acceleration

  • The Breakthrough Memory Solutions for Improved Performance on LLM Inference
    IEEE Micro 2024 [Paper]
  • Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond
    IEEE Hot Chips 2021 [Paper]

Model Reference

| Model | Institute | Paper |
| --- | --- | --- |
| Gemini Nano | Google | Gemini: A Family of Highly Capable Multimodal Models |
| Octopus series model | Nexa AI | Octopus v2: On-device language model for super agent; Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent; Octopus v4: Graph of language models; Octopus: On-device language model for function calling of software APIs |
| OpenELM and Ferret-v2 | Apple | OpenELM is a significant large language model integrated within iOS to enhance application functionalities. Ferret-v2 significantly improves upon its predecessor, introducing enhanced visual processing capabilities and an advanced training regimen. |
| Phi series | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
| MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone |
| Gemma2-9B | Google | Gemma 2: Improving Open Language Models at a Practical Size |
| Qwen2-0.5B | Alibaba Group | Qwen Technical Report |

Tutorial:

Citation

If you find this survey helpful, please consider citing our paper:
