Deploying a large language model (LLM) on my local machine


I am planning to deploy a large language model (LLM) on my local machine (AMD-ITX). The machine is just a personal computer running Ubuntu, with an 8-core CPU (AMD Ryzen 7 5700X), 32 GB of RAM, and an RTX 4060 Ti GPU with 16 GB of VRAM, so only small LLMs are suitable. Here I record the process of deploying an LLM on this machine, for ease of future maintenance.

Some important references:

  1. an online tutorial in Chinese
  2. another tutorial in English
  3. Docker
  4. NVidia Container Runtime for Docker
  5. Cloudflared

Install Docker

Install it with the convenience script:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
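
The script should leave the Docker daemon running; a quick check that it is up (the hello-world container is just a smoke test):

sudo systemctl status docker --no-pager
sudo docker run --rm hello-world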

NVidia Container Runtime for Docker

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker # The nvidia-ctk command modifies the /etc/docker/daemon.json file on the host. The file is updated so that Docker can use the NVIDIA Container Runtime.
sudo systemctl restart docker
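
To confirm that containers can actually see the GPU, run nvidia-smi inside a throwaway container (any image works because the runtime injects the driver; plain ubuntu is used here as an example):

sudo docker run --rm --gpus all ubuntu nvidia-smi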

Install Open WebUI

With an NVIDIA GPU and CUDA support, run the following command so the container can use the GPU:

docker run -d -p 4000:8080 \
        --gpus all \
        --add-host=host.docker.internal:host-gateway \
        -v /DATA/open-webui:/app/backend/data \
        --name open-webui \
        --restart always \
        ghcr.io/open-webui/open-webui:cuda

# or run it as a systemd service
sudo nano /etc/systemd/system/open-webui.service # create the unit file (a sketch is given below)
sudo systemctl daemon-reload
sudo systemctl enable open-webui
sudo systemctl start open-webui
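
A minimal sketch of the unit file, assuming the open-webui container was already created by the docker run command above (if systemd is going to manage restarts, you may want to drop --restart always from the container so the two mechanisms do not overlap):

[Unit]
Description=Open WebUI container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStart=/usr/bin/docker start -a open-webui
ExecStop=/usr/bin/docker stop -t 10 open-webui

[Install]
WantedBy=multi-user.target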

Verify the container is running:

sudo docker ps

Open WebUI is now running on the local machine. You can access it at http://localhost:4000 in your browser.

Install and start Ollama

curl -fsSL https://ollama.com/install.sh | sh # install
ollama serve # start the server once in the foreground
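
A quick sanity check that the client and server are reachable:

ollama --version
ollama list # prints the (initially empty) table of installed models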

Start Ollama as a service

First, set OLLAMA_HOST to 0.0.0.0 so that Ollama listens on all interfaces (otherwise the Dockerized Open WebUI cannot reach it through host.docker.internal), following https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
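
Per the FAQ linked above, the variable is set through a systemd override; a minimal sketch:

sudo systemctl edit ollama.service
# then add the following two lines in the editor:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload
sudo systemctl restart ollama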

Then enable and start the service:

sudo systemctl enable ollama
sudo systemctl start ollama

Expose the Open WebUI to the Internet via cloudflared

  1. Log in to Zero Trust and go to Networks > Tunnels (installing the cloudflared connector on the machine is sketched after this list).
  2. Select the tunnel (AMD-ITX here) and add a public hostname (e.g., “local_llm”), pointing it at the local service on the appropriate port (4000, as above).
  3. Under Access > Applications, add an application for the hostname and attach the appropriate access policy.
  4. Now visit https://local_llm.guhaogao.com and you are good to go.
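
For reference, the connector side of step 1 looks roughly like the following; the .deb download is one common install route, and the token placeholder must be replaced with the exact command shown in the Zero Trust dashboard:

# install cloudflared (Debian/Ubuntu package)
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb
# register this machine as the tunnel's connector, using the token from the dashboard
sudo cloudflared service install <TUNNEL_TOKEN>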

Downloading Ollama Models

  1. Qwen2.5-14B-Instruct-Q4_K_M
  2. Qwen2.5-14B-Instruct-Q6_K_L
  3. Qwen2.5-32B-Instruct-Q3_K_M
  4. Qwen2.5-32B-Instruct-IQ4_XS
  5. llama3.2-vision:11b
  6. Mistral-Small-Instruct-2409-Q5_K_S
  7. bge-m3:latest
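
llama3.2-vision:11b and bge-m3 come straight from the Ollama library; the quantized Qwen/Mistral GGUF builds can be pulled from Hugging Face with the hf.co/ prefix. The repository name below is an assumption (a typical GGUF mirror); adjust it to wherever the quantized files actually live:

ollama pull llama3.2-vision:11b
ollama pull bge-m3:latest
# pulling a specific GGUF quantization directly from Hugging Face (repository is an assumption)
ollama pull hf.co/bartowski/Qwen2.5-14B-Instruct-GGUF:Q4_K_M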

For RAG, bge-m3:latest serves as the embedding model and BAAI/bge-reranker-v2-m3 handles reranking, with Top K set to 5.

Web search is handled by SearXNG, also installed with Docker.

docker run -d --name searxng \
        -p 8081:8080 \
        -v /home/hggu/software_480G/searxng:/etc/searxng \
        --restart always \
        searxng/searxng:latest
# or run it as a systemd service on port 8081 (the unit file follows the same pattern as the Open WebUI one above)
sudo nano /etc/systemd/system/searxng.service
sudo systemctl daemon-reload
sudo systemctl enable searxng

Use http://host.docker.internal:8081/search?q=<query> as the SearXNG query URL in the Open WebUI web search settings (host.docker.internal resolves to the host from inside the Open WebUI container thanks to the --add-host flag above).
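
One gotcha: Open WebUI queries SearXNG's JSON output format, which is disabled by default. Enabling it in the mounted settings.yml (the path from the docker run above) is usually needed; a minimal sketch of the relevant section, followed by a restart:

# /home/hggu/software_480G/searxng/settings.yml
search:
  formats:
    - html
    - json

# apply the change
docker restart searxng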


