Awesome LLMs on Device: A Comprehensive Survey

This repository contains resources and information related to our comprehensive survey paper on Large Language Models (LLMs) deployed on edge devices.

Abstract

The advent of large language models (LLMs) has revolutionized natural language processing applications, and running LLMs on edge devices has become increasingly attractive for reasons including reduced latency, data localization, and personalized user experiences. This comprehensive review examines the challenges of deploying computationally expensive LLMs on resource-constrained devices and explores innovative solutions across multiple domains. We investigate the development of on-device LLMs, their efficient architectures, including parameter sharing and modular designs, as well as state-of-the-art compression techniques like quantization, pruning, and knowledge distillation. Hardware acceleration strategies and collaborative edge-cloud deployment approaches are analyzed, highlighting the intricate balance between performance and resource utilization. Case studies of on-device LLMs from major mobile manufacturers demonstrate real-world applications and potential benefits. The review also addresses critical aspects such as adaptive learning, multi-modal capabilities, and personalization. By identifying key research directions and open challenges, this paper provides a roadmap for future advancements in on-device LLMs, emphasizing the need for interdisciplinary efforts to realize the full potential of ubiquitous, intelligent computing while ensuring responsible and ethical deployment.

Key Features

Comprehensive review of on-device LLM technologies
Analysis of efficient architectures and compression techniques
Exploration of hardware acceleration strategies
Case studies of real-world applications
Discussion of future research directions and challenges

Foundations and Preliminaries

Evolution of On-Device LLMs

TinyLlama: An Open-Source Small Language Model
arXiv 2024 [Paper] [Github]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
arXiv 2024 [Paper] [Github]
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
arXiv 2024 [Paper]
Octopus series papers
arXiv 2024 [Octopus] [Octopus v2] [Octopus v3] [Octopus v4] [Github]
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arXiv 2024 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]

LLM Architecture Foundations

The case for 4-bit precision: k-bit inference scaling laws
ICML 2023 [Paper]
Challenges and applications of large language models
arXiv 2023 [Paper]
MiniLLM: Knowledge distillation of large language models
ICLR 2024 [Paper] [Github]
GPTQ: Accurate post-training quantization for generative pre-trained transformers
ICLR 2023 [Paper] [Github]
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
NeurIPS 2022 [Paper]

On-Device LLMs Training

OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
ICML 2024 [Paper] [Github]

Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
arXiv 2024 [Paper]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
arXiv 2024 [Paper]
Exploring post-training quantization in LLMs from comprehensive study to low rank compensation
AAAI 2024 [Paper]
Matrix compression via randomized low rank and low precision factorization
NeurIPS 2023 [Paper] [Github]

The Performance Indicator of On-Device LLMs

MNN: A lightweight deep neural network inference engine
2024 [Github]
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
arXiv 2024 [Paper] [Github]
llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup
2023 [Github]
PowerInfer: Fast large language model serving with a consumer-grade GPU
arXiv 2023 [Paper] [Github]

Efficient Architectures for On-Device LLMs

Comparison of On-Device LLM Architectures

The following table provides a comparative analysis of state-of-the-art on-device LLM architectures, focusing on their performance, computational efficiency, and memory requirements.

Model Compression and Parameter Sharing

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
arXiv 2024 [Paper] [Github]

Collaborative and Hierarchical Model Approaches

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
arXiv 2024 [Paper]
LLMCad: Fast and scalable on-device large language model inference
arXiv 2023 [Paper]

Memory and Computational Efficiency

The Breakthrough Memory Solutions for Improved Performance on LLM Inference
IEEE Micro 2024 [Paper]
MELTing point: Mobile Evaluation of Language Transformers
arXiv 2024 [Paper] [Github]

Mixture-of-Experts (MoE) Architectures

LLM as a system service on mobile devices
arXiv 2024 [Paper]
LocMoE: A low-overhead MoE for large language model training
arXiv 2024 [Paper]
EdgeMoE: Fast on-device inference of MoE-based large language models
arXiv 2023 [Paper]

General Efficiency and Performance Improvements

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
arXiv 2024 [Paper] [Github]
On the viability of using LLMs for SW/HW co-design: An example in designing CiM DNN accelerators
IEEE SOCC 2023 [Paper]

Model Compression and Optimization Techniques for On-Device LLMs

Quantization

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arXiv 2024 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]
GPTQ: Accurate post-training quantization for generative pre-trained transformers
ICLR 2023 [Paper] [Github]
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
NeurIPS 2022 [Paper]
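
For readers new to the topic, the basic idea shared by the papers above can be illustrated with symmetric absmax int8 quantization. This is a minimal sketch of the generic technique, not the method of GPTQ, AWQ, or any other listed paper; the function names `quantize_absmax_int8` and `dequantize` and the sample weights are made up for illustration.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]


# Illustrative weights (assumed values, not taken from any model).
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a single shared scale, so storage drops roughly 4x relative to 32-bit floats at the cost of a bounded rounding error.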

Pruning

Challenges and applications of large language models
arXiv 2023 [Paper]
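
As a point of reference, the simplest form of pruning is unstructured magnitude pruning, sketched below. This is a generic illustration under assumed inputs, not a method taken from the paper listed above; `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative weights (assumed values).
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
```

In practice the zeros only save memory or compute when the runtime stores the tensor in a sparse format or the hardware can skip them.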

Knowledge Distillation

MiniLLM: Knowledge distillation of large language models
ICLR 2024 [Paper]
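
The core of knowledge distillation is training a small student to match a large teacher's output distribution. The sketch below shows the classic temperature-softened forward KL objective in plain Python. It is a generic illustration, not MiniLLM's objective (that paper argues for a reverse KL formulation), and the logits and function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Forward KL between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Illustrative logits over a three-token vocabulary (assumed values).
teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.0]
far_student = [0.1, 1.0, 2.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close, loss_far)
```

A student whose logits track the teacher's gets a smaller loss than one that ranks the tokens in the opposite order, which is the signal the student is trained on.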

Low-Rank Factorization

Exploring post-training quantization in LLMs from comprehensive study to low rank compensation
AAAI 2024 [Paper]
Matrix compression via randomized low rank and low precision factorization
NeurIPS 2023 [Paper] [Github]
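
The storage arithmetic behind low-rank factorization can be sketched as follows: an m x n weight matrix is replaced by the product of an m x r and an r x n factor, so the parameter count falls from m*n to r*(m+n). The example is a generic illustration rather than the algorithm of either paper above; the helper names and matrix sizes are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def param_counts(m, n, rank):
    """Return (dense, factored) parameter counts for an m x n matrix at a given rank."""
    return m * n, rank * (m + n)


# A rank-1 matrix stored as two thin factors (assumed toy values).
u = [[1.0], [2.0], [3.0]]       # 3 x 1
v = [[4.0, 5.0, 6.0, 7.0]]      # 1 x 4
w = matmul(u, v)                # reconstructed 3 x 4 matrix

dense_small, factored_small = param_counts(3, 4, 1)
dense_large, factored_large = param_counts(4096, 4096, 64)
print(dense_small, factored_small, dense_large // factored_large)
```

Real weight matrices are only approximately low rank, so the factors are usually obtained from a truncated decomposition and the rank is chosen to trade accuracy against size.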

Hardware Acceleration and Deployment Strategies

Popular On-Device LLMs Framework

llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup. [Github]
MNN: A blazing fast, lightweight deep learning framework. [Github]
PowerInfer: A CPU/GPU LLM inference engine leveraging activation locality for on-device inference. [Github]
ExecuTorch: A platform for on-device AI across mobile, embedded, and edge devices for PyTorch. [Github]
MediaPipe: A suite of tools and libraries that enables quick application of AI and ML techniques. [Github]
MLC-LLM: A machine learning compiler and high-performance deployment engine for large language models. [Github]
vLLM: A fast and easy-to-use library for LLM inference and serving. [Github]
OpenLLM: An open platform for operating large language models (LLMs) in production. [Github]

Hardware Acceleration

The Breakthrough Memory Solutions for Improved Performance on LLM Inference
IEEE Micro 2024 [Paper]
Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond
IEEE Hot Chips 2021 [Paper]

Tutorial:

MIT: TinyML and Efficient Deep Learning Computing
Harvard: Machine Learning Systems

Citation

If you find this survey helpful, please consider citing our paper:

AlexYiy/Awesome-LLMs-on-device
