llama : do not request buffer type if we don't need it anyway
Since we use ngl=0 with the Kompute backend to load models on CPU on
Linux and Windows, we need to make sure not to call
ggml_backend_kompute_buffer_type, which initializes the Vulkan driver.

Initializing the Vulkan driver in this case could cause a failure for no
good reason (e.g. if it is not available).
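
To make the guard concrete, here is a minimal, self-contained sketch with hypothetical names (pick_layer_buffer_type, get_gpu_buffer_type, get_cpu_buffer_type); the real change lives in llm_load_tensors in the diff below. The idea is to request the backend's offload buffer type only when at least one layer will actually be offloaded:

```cpp
#include <cstdio>

struct buffer_type { const char * name; };

static buffer_type get_cpu_buffer_type() { return { "CPU" }; }

// Stand-in for ggml_backend_kompute_buffer_type(): pretend it is expensive
// and has side effects (it initializes the Vulkan driver).
static buffer_type get_gpu_buffer_type() {
    std::puts("initializing Vulkan driver...");
    return { "Kompute" };
}

static buffer_type pick_layer_buffer_type(int i_gpu_start, int n_layer) {
    // Only request the GPU buffer type if at least one layer is offloaded;
    // with ngl=0 (i_gpu_start == n_layer) the driver is never touched.
    if (i_gpu_start < n_layer) {
        return get_gpu_buffer_type();
    }
    return get_cpu_buffer_type();
}

int main() {
    // ngl=0: all layers stay on the CPU, Vulkan is never initialized
    buffer_type buft = pick_layer_buffer_type(/*i_gpu_start=*/32, /*n_layer=*/32);
    std::printf("buffer type: %s\n", buft.name);
}
```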

Also, when we do not create any Kompute buffers, the instance currently
does not have an opportunity to be freed until exit-time destructors
run, at which point the necessary libraries may have already been
unloaded from memory. This causes an observable segfault at exit when
loading the model on CPU via the Python bindings.
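
The lifetime hazard is a general C++ pattern, sketched below with made-up names (vulkan_instance, get_instance) rather than the actual ggml-kompute code: a lazily constructed function-local static is torn down only by an exit-time destructor, which may run after the driver's shared library has been unloaded; if the object is never created, the hazardous destructor never runs.

```cpp
#include <cstdio>

// Stand-in for a driver handle that lives in a shared library which may be
// unloaded before process exit.
struct vulkan_instance {
    vulkan_instance()  { std::puts("vkCreateInstance"); }
    ~vulkan_instance() {
        // If the driver library has already been unloaded by the time
        // exit-time destructors run, a call like this would jump into
        // unmapped memory and segfault.
        std::puts("vkDestroyInstance (may run after the driver library is gone)");
    }
};

vulkan_instance & get_instance() {
    // Lazily constructed, destroyed only at process exit.
    static vulkan_instance inst;
    return inst;
}

int main() {
    // If get_instance() is never called -- e.g. ngl=0, no Kompute buffers --
    // the instance is never created and the risky destructor never runs.
    std::puts("model loaded on CPU, Vulkan never touched");
}
```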
cebtenzzre committed Sep 26, 2024
1 parent 70bced4 commit c656943
Showing 1 changed file with 21 additions and 17 deletions.
38 changes: 21 additions & 17 deletions src/llama.cpp
@@ -6929,28 +6929,32 @@ static bool llm_load_tensors(
     } else
 #endif
     {
-        ggml_backend_buffer_type_t split_buft;
-        if (split_mode == LLAMA_SPLIT_MODE_ROW) {
-            split_buft = llama_default_buffer_type_split(model, main_gpu, tensor_split);
-        } else {
-            // LLAMA_SPLIT_MODE_NONE or LLAMA_SPLIT_MODE_LAYER in backends where it is not supported
-            split_buft = llama_default_buffer_type_offload(model, main_gpu);
-        }
+        ggml_backend_buffer_type_t split_buft = nullptr;
+        if (i_gpu_start < n_layer) {
+            if (split_mode == LLAMA_SPLIT_MODE_ROW) {
+                split_buft = llama_default_buffer_type_split(model, main_gpu, tensor_split);
+            } else {
+                // LLAMA_SPLIT_MODE_NONE or LLAMA_SPLIT_MODE_LAYER in backends where it is not supported
+                split_buft = llama_default_buffer_type_offload(model, main_gpu);
+            }
 #ifdef GGML_USE_KOMPUTE
-        // we can fall back to CPU buffer type in some cases
-        if (!strcmp(ggml_backend_buft_name(split_buft), "CPU")) {
-            model.using_gpu = false;
-        }
+            // we can fall back to CPU buffer type in some cases
+            if (!strcmp(ggml_backend_buft_name(split_buft), "CPU")) {
+                model.using_gpu = false;
+            }
 #endif
-        // assign the repeating layers
-        for (int i = i_gpu_start; i < n_layer; ++i) {
-            model.buft_layer[i] = {
-                split_buft,
-                llama_default_buffer_type_offload(model, main_gpu)
-            };
-        }
+            // assign the repeating layers
+            for (int i = i_gpu_start; i < n_layer; ++i) {
+                model.buft_layer[i] = {
+                    split_buft,
+                    llama_default_buffer_type_offload(model, main_gpu)
+                };
+            }
+        }
 
         // assign the output layer
         if (n_gpu_layers > n_layer) {
+            assert(split_buft);
             model.buft_output = {
                 split_buft,
                 llama_default_buffer_type_offload(model, main_gpu)
