Language Models (LLM)
This project supports multiple large language model backends and models.
Almost all large language model (LLM) APIs and inference engines support the OpenAI format. So, if you find that an LLM API you want to use isn't explicitly supported in our project, you can usually just fill in the relevant information (base URL, API key, model name) under the openai_compatible_llm section, and it should work right away.
In fact, apart from llama.cpp and Claude, all other LLM APIs and backends supported by this project are essentially just variations of openai_compatible_llm (we added some model loading logic for Ollama) using the exact same code. The only difference is that we've pre-filled the base URL and some other settings for you.
How to configure and switch between different large language model backends
The project's default agent is basic_memory_agent, so to switch the language model for the default agent, make your selection under the llm_provider option of basic_memory_agent.
1. Configure large language model settings
Refer to the Supported Large Language Model Backends section below to configure the corresponding large language model backend.
For example, if you want to use Ollama, please follow the guide in the Ollama section to install and configure Ollama-related settings.
Under the llm_configs section of agent_config, you can configure the connection settings for each backend and the models it should use.
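To make the layout easier to picture, here is a rough sketch of how these pieces relate in conf.yaml. The agent_settings level is an assumption added for illustration; llm_provider, llm_configs, and the backend names come from the sections below, and your actual file may nest things slightly differently:
agent_config:
  agent_settings:                           # assumed grouping for per-agent settings
    basic_memory_agent:
      llm_provider: "openai_compatible_llm" # step 2: which backend this agent uses
  llm_configs:                              # step 1: connection settings for each backend
    openai_compatible_llm:
      base_url: "http://localhost:11434/v1"
      model: "qwen2.5:latest"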
2. Switch to the corresponding large language model (LLM) in the settings of the respective agent
Some agents may not support custom LLMs
Go to the basic_memory_agent settings:
basic_memory_agent:
  # "openai_compatible_llm", "llama_cpp_llm", "claude_llm", "ollama_llm"
  # "openai_llm", "gemini_llm", "zhipu_llm", "deepseek_llm", "groq_llm"
  # "mistral_llm"
  llm_provider: "openai_compatible_llm" # LLM solution to use
  faster_first_response: True
Set llm_provider to the large language model backend you want to use.
Note that llm_provider can only be set to backends that exist under llm_configs, such as openai_compatible_llm, claude_llm, and so on.
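For example, to switch the default agent to Claude (assuming the claude_llm backend has already been filled in under llm_configs), the change would look like this:
basic_memory_agent:
  llm_provider: "claude_llm" # switched from openai_compatible_llm
  faster_first_response: True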
Supported Large Language Model Backends
OpenAI Compatible API (openai_compatible_llm)
Compatible with all API endpoints that support the OpenAI Chat Completion format. This includes LM Studio, vLLM, and most inference tools and API providers.
The official OpenAI API, Gemini, Zhipu, DeepSeek, Mistral, and Groq backends below are all wrappers around openai_compatible_llm (Ollama is as well, with some added model-loading and memory-management logic); we've simply pre-filled the correct base_url and related configuration for you.
Configuration Instructions
# OpenAI compatible inference backend
openai_compatible_llm:
  base_url: "http://localhost:11434/v1" # Base URL
  llm_api_key: "somethingelse" # API key
  organization_id: "org_eternity" # Organization ID
  project_id: "project_glass" # Project ID
  model: "qwen2.5:latest" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
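As an illustration, the same block can point at any other OpenAI-compatible server. The example below targets LM Studio's default local endpoint; the port, key, and model name are assumptions, so substitute whatever your server actually reports:
openai_compatible_llm:
  base_url: "http://localhost:1234/v1" # LM Studio's default local port; adjust for vLLM or other servers
  llm_api_key: "not-needed" # local servers typically ignore the key; a placeholder is fine
  organization_id: "" # usually only relevant for hosted providers
  project_id: ""
  model: "your-model-name" # copy the exact model id your server lists
  temperature: 1.0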
Ollama (ollama_llm)
Ollama is a popular LLM inference tool that allows easy downloading and running of large language models.
Ollama Installation Guide
- Download and install from the Ollama official website
- Verify installation:
ollama --version
- Download and run the model (using qwen2.5:latest as an example):
ollama run qwen2.5:latest
# After successful execution, you can directly chat with qwen2.5:latest
# You can exit the chat interface (Ctrl/Command + D), but don't close the command line
- View installed models:
ollama list
# NAME ID SIZE MODIFIED
# qwen2.5:latest 845dbda0ea48 4.7 GB 2 minutes ago
When looking for model names, please use the ollama list command to check the models downloaded in Ollama, and copy and paste the model name directly into the model option to avoid issues such as a misspelled model name, full-width colons, or stray spaces.
When selecting a model, please consider your VRAM capacity and GPU computing power. If the model file size is larger than the VRAM capacity, the model will be forced to use CPU computation, which is extremely slow. Additionally, the smaller the model parameter count, the lower the conversation latency. If you want to reduce conversation latency, please choose a model with fewer parameters.
Modify Configuration File
Edit conf.yaml:
- Set llm_provider under basic_memory_agent to ollama_llm
- Adjust the settings under ollama_llm in the llm_configs option:
  - Keep base_url at its default for local running; no need to modify it.
  - Set model to the model you're using, such as qwen2.5:latest used in this guide.
ollama_llm:
  base_url: http://localhost:11434 # Keep default for local running
  model: qwen2.5:latest # Model name obtained from ollama list
  temperature: 0.7 # Controls answer randomness, higher is more random (0~1)
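Putting both edits together, the relevant parts of conf.yaml would look roughly like this (the surrounding nesting is omitted for brevity; follow the structure of your own file):
basic_memory_agent:
  llm_provider: "ollama_llm" # switch the default agent to the Ollama backend

ollama_llm:
  base_url: http://localhost:11434
  model: qwen2.5:latest # must match a name from ollama list exactly
  temperature: 0.7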
OpenAI Official API (openai_llm)
First, obtain an API key from the OpenAI official website.
Then adjust the settings here:
openai_llm:
  llm_api_key: "Your OpenAI API key" # OpenAI API key
  model: "gpt-4o" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
Gemini API (gemini_llm)
Go to Google AI Studio to generate an API key.
Then adjust the settings here:
gemini_llm:
  llm_api_key: "Your Gemini API Key" # Gemini API key
  model: "gemini-2.0-flash-exp" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
Zhipu API (zhipu_llm)
Go to Zhipu to obtain an API key.
zhipu_llm:
  llm_api_key: "Your Zhipu AI API key" # Zhipu AI API key
  model: "glm-4-flash" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
DeepSeek API (deepseek_llm)
Go to DeepSeek to obtain an API key.
deepseek_llm:
  llm_api_key: "Your DeepSeek API key" # DeepSeek API key
  model: "deepseek-chat" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
Mistral API (mistral_llm)
Go to the Mistral official website to obtain an API key.
mistral_llm:
  llm_api_key: "Your Mistral API key" # Mistral API key
  model: "pixtral-large-latest" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
Groq API (groq_llm)
Go to the Groq official website to obtain an API key.
groq_llm:
  llm_api_key: "Your Groq API key" # Groq API key
  model: "llama-3.3-70b-versatile" # Model to use
  temperature: 1.0 # Temperature, between 0 and 2
Claude (claude_llm)
Support for Claude was added in version v0.3.1 (https://github.com/t41372/Open-LLM-VTuber/pull/35).
Set llm_provider to claude_llm and complete the settings under claude_llm.
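This guide doesn't list the exact fields of the claude_llm block, so the snippet below is only a rough sketch modeled on the other backends; the field names and the model id are assumptions, and the claude_llm block in your conf.yaml is authoritative:
claude_llm:
  llm_api_key: "Your Anthropic API key" # Anthropic API key (assumed field name)
  model: "claude-3-5-sonnet-latest" # Example model id; pick one available to your account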
LLama CPP (llama_cpp_llm)
llama.cpp provides a way to run LLMs (gguf files) directly within this project without any external tools (such as Ollama) and without starting any additional servers. You only need a .gguf model file.
Device Requirements
According to the llama-cpp-python project repository, the requirements are:
- Python 3.8+
- C compiler
- Linux: gcc or clang
- Windows: Visual Studio or MinGW
- MacOS: Xcode
During installation, llama.cpp will be built from source and installed along with this Python package.
If it fails later, please add --verbose to the pip install command to view the full cmake build log.
Installation
Run the command that matches your device from the project directory:
- Nvidia GPU:
  CMAKE_ARGS="-DGGML_CUDA=on" uv pip install llama-cpp-python
- Apple Silicon Mac:
  CMAKE_ARGS="-DGGML_METAL=on" uv pip install llama-cpp-python
- AMD GPU (ROCm):
  CMAKE_ARGS="-DGGML_HIPBLAS=on" uv pip install llama-cpp-python
- CPU (OpenBLAS):
  CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" uv pip install llama-cpp-python
If you can't find your device above, you can look for the pip install llama-cpp-python command suitable for your platform here.
All pip commands should be changed to uv pip so that they are installed in the project's virtual environment. For example, if the project page says pip install llama-cpp-python, you should change it to uv pip install llama-cpp-python.
If you encounter problems at this step, you can check out the Windows Note and macOS Note.
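Once installation succeeds, point the backend at your .gguf file and set llm_provider under basic_memory_agent to llama_cpp_llm. This guide doesn't show the exact fields of the llama_cpp_llm block, so treat the snippet below as a rough sketch; model_path and verbose are assumed field names, and the llama_cpp_llm block in your conf.yaml is authoritative:
llama_cpp_llm:
  model_path: "models/your-model.gguf" # path to the downloaded .gguf file (assumed field name)
  verbose: False # whether to print llama.cpp's own logs (assumed field name)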