基于BELLE模型的跨平台离线大语言模型交谈App。使用量化后的离线端上模型配合Flutter,可在macOS(已支持)、Windows、Android、iOS(参考Known Issues)等设备上运行。
下图展示了App在设备端本地运行4bit量化的BELLE-7B模型、在M1 Max CPU上实时运行的效果(未加速):
请见Releases。
各平台对应下载&使用说明请见使用说明。
目前仅支持macOS。更多平台即将发布!
如果已经登录Huggingface:直接下载
需要先运行一次ChatBELLE app,它会创建文件夹~/Library/Containers/com.barius.chatbelle。然后将下载好的模型重命名并移动至app显示的路径,默认为~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin。
使用llama.cpp的4bit量化优化设备端离线推理的速度和内存占用。量化会带来计算精度的损失,影响模型的生成效果。4bit是比较激进的量化方式,目前的4bit模型效果相比fp32和fp16还有明显差距,仅供尝试。随着模型算法的发展和设备端算力的演进,我们相信离线推理的效果会有很大改善,我们也会持续跟进。
GPTQ使用one-shot量化方式来获得更小的量化损失或更高的压缩率。我们将持续跟进基于GPTQ的设备端量化模型。
- 更多设备
- 多轮对话
- 模型选择
- 聊天历史
- 聊天列表
建议使用M1/M2系列芯片配合16GB内存以获得最佳体验。如果推理速度过慢,可能是内存不足,可以尝试关闭其他app以释放内存。8GB内存会非常慢。Intel芯片理论上也可以跑,但是速度较慢。
- 下载Releases中的chatbelle.dmg,双击打开,把其中的Chat Belle App左键拖进应用程序文件夹中。
- 右键应用程序文件夹中的Chat Belle App,按住Ctrl并左键单击打开,再点打开。
- App会显示模型加载失败,并显示模型路径。关闭App。
- 下载量化后的模型ChatBELLE-int4。
- 移动并重命名模型至app显示的路径。默认为~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin。
- 重新打开App(直接双击)。
- 敬请期待
- 敬请期待
- 敬请期待
- 推理在8GB内存的macOS设备上会非常慢,原因是内存不足导致疯狂swapping。16GB内存的设备在内存占用较高的情况下也可能遇到同样状况。
- 推理在Intel芯片的Mac设备上比较慢。
- iOS的3GB App内存限制导致最小模型(~4.3G)也无法加载。参考
本程序仅供学习、研究使用,因使用、传播本程序带来的任何损害,本程序的开发者不负任何责任。
- LLaMa模型设备端推理 llama.cpp
- Flutter聊天UI flyer.chat
A minimal, cross-platform offline LLM chat app built on BELLE. It combines quantized on-device models with a Flutter UI and targets macOS (done), Windows, Android, iOS (see Known Issues) and more.
Please refer to Releases.
For per-platform download and usage instructions, see Usage.
Only macOS is supported for now. More platforms coming soon!
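If you are already logged into Huggingface: Direct Download
Run the ChatBELLE app once first; it creates the folder ~/Library/Containers/com.barius.chatbelle. Then rename the downloaded model and move it to the path shown by the app. The default is ~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin.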
The app uses llama.cpp's 4-bit quantization to speed up on-device offline inference and reduce RAM usage. Quantization lowers numerical precision, which degrades the model's generation quality. 4-bit is a fairly aggressive scheme: our current 4-bit model still lags noticeably behind its fp32 and fp16 counterparts and is offered for experimentation only. As algorithms improve and on-device compute grows, we expect offline inference quality to improve considerably, and we will keep following this work.
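To make the accuracy trade-off concrete, here is a minimal Python sketch of block-wise 4-bit quantization. It is an illustration only, not llama.cpp's actual code or file format: the block size of 32, the symmetric rounding scheme, the synthetic Gaussian weights, and the function names (`quantize_block`, `dequantize_block`, `roundtrip_error`) are all assumptions made for this example.

```python
import random

BLOCK_SIZE = 32  # assumed block size, chosen for this illustration


def quantize_block(block):
    """Map floats to signed 4-bit integers in [-8, 7] that share one scale."""
    max_abs = max(abs(x) for x in block)
    # An all-zero block gets a dummy scale so we never divide by zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the 4-bit integers and their scale."""
    return [scale * q for q in quantized]


def roundtrip_error(weights):
    """Mean absolute error after quantizing and dequantizing every block."""
    total = 0.0
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        scale, quantized = quantize_block(block)
        restored = dequantize_block(scale, quantized)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)


# Synthetic weights (hypothetical data, not taken from any real model).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
print(f"mean absolute round-trip error: {roundtrip_error(weights):.6f}")
```

Each block keeps only 16 possible values per weight, so the restored weights differ from the originals by up to half a quantization step. That rounding error is the source of the quality gap described above.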
GPTQ employs one-shot quantization to achieve lower accuracy loss or higher model compression rate. We will keep track of this line of work.
- More devices
- Multi-turn chat
- Model selection
- Chat history
- Chat list
For the best experience, we recommend an M1/M2 series chip with 16GB RAM. If inference is slow, the likely cause is insufficient memory; try closing other apps to free some up. Inference with 8GB RAM will be very slow. Intel CPUs may also work (not tested) but are likely to be slow.
- Download chatbelle.dmg from the Releases page, double-click to open it, then drag the Chat Belle app into the Applications folder.
- Right-click the Chat Belle app in the Applications folder, Ctrl-click Open, then click Open.
- The app will fail to load the model and display the expected model file path. Close the app.
- Download the quantized model from ChatBELLE-int4.
- Move the model to the path shown by the app and rename it to match. The default is ~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin.
- Reopen the app (double-clicking now works).
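The move-and-rename step can also be scripted. The following is a minimal sketch, assuming the default path quoted above; the helper name `install_model` and the downloaded file name in the usage comment are hypothetical. It refuses to run if the app's data folder does not exist yet, since that folder is created by launching the app once.

```python
import shutil
from pathlib import Path

# Default target path as shown by the app (from the instructions above).
DEFAULT_TARGET = (
    Path.home() / "Library/Containers/com.barius.chatbelle/Data/belle-model.bin"
)


def install_model(downloaded: Path, target: Path = DEFAULT_TARGET) -> Path:
    """Move the downloaded model file to the path the app expects."""
    if not downloaded.is_file():
        raise FileNotFoundError(f"downloaded model not found: {downloaded}")
    if not target.parent.is_dir():
        # The app creates this folder on first launch.
        raise FileNotFoundError(
            f"{target.parent} does not exist; open the app once first"
        )
    shutil.move(str(downloaded), str(target))
    return target


# Example usage (the file name below is a placeholder, not the real one):
# install_model(Path.home() / "Downloads" / "downloaded-model.bin")
```

If the path displayed by the app differs from the default, pass it explicitly as the second argument.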
- Stay tuned
- Stay tuned
- Stay tuned
- On macOS devices with 8GB RAM, inference is very slow because insufficient memory causes constant swapping. 16GB devices may see the same slowdown when other applications are using a lot of memory.
- Inferencing on Macs with Intel chips is slow.
- The 3GB per-app RAM limit on iOS devices prevents even the smallest model (~4.3G) from loading. Reference
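A rough back-of-the-envelope estimate shows why a 4-bit 7B model does not fit in a 3GB budget. This sketch rests on stated assumptions rather than on the actual model file layout: "7B" is taken as 7 billion parameters, each block of 32 weights is assumed to carry one 32-bit scale on top of its 4-bit values, and the helper names are hypothetical.

```python
def estimate_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def fits_in_budget(model_bytes: float, budget_bytes: float) -> bool:
    """True if the weights alone fit inside the memory budget."""
    return model_bytes <= budget_bytes


N_PARAMS = 7e9                  # assumption: "7B" read as 7 billion parameters
BITS_PER_WEIGHT = 4 + 32 / 32   # assumption: 4-bit value + one 32-bit scale per 32 weights
IOS_BUDGET_BYTES = 3e9          # the 3GB limit mentioned above, taken as decimal GB

size = estimate_model_bytes(N_PARAMS, BITS_PER_WEIGHT)
print(f"estimated weights: {size / 1e9:.2f} GB")
print("fits in the 3GB budget:", fits_in_budget(size, IOS_BUDGET_BYTES))
```

Even with no per-block overhead at all (exactly 4 bits per weight), the weights alone would exceed the budget, before counting any memory needed at inference time.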
This program is for learning and research purposes only. The developers accept no responsibility for any damage caused by using or distributing this program.
- LLaMa model inferencing code uses llama.cpp
- Flutter chat UI uses flyer.chat