High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Updated Sep 6, 2024 - C++
Fast Inference of MoE Models with CPU-GPU Orchestration
Tool for testing different large language models without writing code.
LLM chatbot example using OpenVINO with RAG (Retrieval Augmented Generation).
Nexa SDK is a comprehensive toolkit for ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), automatic speech recognition (ASR), and text-to-speech (TTS).
Script that performs RAG and uses a local LLM for Q&A.
Script that takes a .wav audio file, performs speech-to-text using OpenAI/Whisper, and then uses Llama3 to summarize the generated transcript and extract action points.