Deploy and run LLM on Raspberry Pi 5 vs Raspberry Pi 4B (LLaMA, LLaMA2, Phi-2, Mixtral-MOE, etc)

DFRobot Jan 02 2024 21534

This article shows how to deploy and run recently popular large language models (LLMs), including LLaMA, LLaMA2, Phi-2, Mixtral-MOE, and mamba-gpt, on the Raspberry Pi 5 8GB. Compared with the Raspberry Pi 4 Model B, the Raspberry Pi 5 has an upgraded processor, memory, and other components, which leads to differences in performance. We compare the running speed, resource usage, and model performance of these LLMs to help you choose the right device for your needs and to provide a reference for AI research on limited hardware. We also cover the key steps and points needing attention, so that you can try out and test LLMs on the Raspberry Pi 5 yourself.

Specifications of Raspberry Pi 5 vs Raspberry Pi 4B

Benchmarks on Raspberry Pi 5 8GB and Raspberry Pi 4 8GB

(From Alasdair Allan)

How to Choose LLM

LLM projects usually state their CPU/GPU requirements up front. Since GPU inference for LLMs is not currently available on the Raspberry Pi 5, we need to prioritize models that can run on the CPU. The Raspberry Pi 5's limited RAM also means we should prioritize models with a small memory footprint. As a rule of thumb, a model needs about twice its own size in RAM to run normally. Quantized models have lower memory requirements, so we recommend using an 8GB Raspberry Pi 5 with a small quantized model to try out and test LLMs.
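The rule of thumb above can be written as a quick check. This is only a sketch: the function name `fits_in_ram` and the default `factor=2.0` are illustrative assumptions taken from the rough "double the size" guideline, and real memory use also depends on the operating system, context length, and how the runtime loads the model.

```python
def fits_in_ram(model_size_gb, ram_gb=8.0, factor=2.0):
    """Return True if model_size_gb * factor fits within ram_gb.

    factor=2.0 encodes the article's rough rule of thumb (an assumption,
    not a measured value).
    """
    return model_size_gb * factor <= ram_gb


# Example model sizes in GB, checked against an 8 GB board.
for size_gb in (1.6, 4.0, 6.0):
    verdict = "should fit" if fits_in_ram(size_gb) else "likely too large"
    print(f"{size_gb:.1f} GB model on an 8 GB board: {verdict}")
```

Treat the result as a conservative first filter rather than a guarantee; the test tables later in the article show what actually loaded.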
The following list is a selection of smaller models from the open_llm_leaderboard on the Hugging Face website, together with some recently popular models.

Model	Average	ARC	HellaSwag	MMLU	TruthfulQA	License
mixtral_7bx4_moe	68.83	65.27	85.28	62.84	59.85	MistralAI
Phi-2	61.33	61.09	75.11	58.11	44.47	Non-commercial
mamba-gpt-7b-v1	58.61	61.26	84.1	63.46	46.34	Apache License 2.0
LLaMA2-7B-chat-hf	56.4	52.9	78.6	48.3	45.6	meta
LLaMA-13B	56.1	56.2	80.9	47.7	39.5	Non-commercial
LLaMA-7B	49.7	51	77.8	35.7	34.3	Non-commercial
ChatGLM-6B	48.2	38.8	59	46.7	48.1	Non-commercial
Alpaca-7b	31.9	28.1	25.8	25.3	48.5	Non-commercial

How to run LLM

Our testing found that the GPU cannot be used for LLM inference on the Raspberry Pi 5, so for now we use LLaMA.cpp and the Raspberry Pi 5's CPU to run each LLM. The following uses Phi-2 as an example to show in detail how to deploy and run an LLM on a Raspberry Pi 5 with 8GB RAM. Along the way, we point out the steps that need particular attention.

PS: If you want to try Mixtral_moe, please refer to: https://github.com/ggerganov/llama.cpp/tree/mixtral

Environment Deployment

1. Deploying a virtual Python environment on the Raspberry Pi 5

sudo apt update && sudo apt install git

mkdir my_project

cd my_project

python3 -m venv env

source env/bin/activate

2. Install dependencies

python3 -m pip install torch numpy sentencepiece

sudo apt install g++ build-essential

3. Download the LLaMA.cpp source code (phi-2 branch): https://github.com/ggerganov/llama.cpp/tree/gg/phi-2

4. Build

cd /home/dfrobot/Desktop/llama.cpp-phi

make


Quantization

Model quantization aims to reduce hardware requirements by lowering the precision of the weight parameters in a deep neural network model. GGUF is a model file format commonly used to store quantized models, allowing you to run LLMs on CPU or CPU + GPU. In general, the fewer bits used per weight, the smaller and faster the model will be, but at the expense of accuracy. For example, Q4 is a quantization type used in GGUF model files that stores the model's weights as 4-bit integers.
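To make the idea concrete, here is a minimal sketch of 4-bit quantization on a single block of weights. It is a simplified illustration, not the exact scheme LLaMA.cpp uses: the function names and the sample values are made up for this example, and real GGUF quantization types differ in block size, rounding, and storage layout.

```python
def quantize_block(weights):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto 7, the largest positive signed 4-bit value.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate float weights from the 4-bit integers."""
    return [scale * v for v in q]


# Hypothetical weights, chosen only for illustration.
block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.00]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("4-bit values:", q)
print(f"scale: {scale:.4f}, max error: {max_error:.4f}")
```

Each weight is now stored in 4 bits plus a share of the per-block scale, and the rounding error per weight is at most half the scale. That error is the accuracy cost mentioned above.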

The 8GB of RAM on the Raspberry Pi 5 is not well suited to quantizing models. We recommend quantizing the model on a Linux PC first and then copying the quantized files to the Raspberry Pi for deployment. On the Linux PC, after setting up the environment as in the previous step, use convert-hf-to-gguf.py from LLaMA.cpp to convert the original Microsoft phi-2 model to GGUF format. The original Microsoft phi-2 model can be downloaded from: https://huggingface.co/microsoft/phi-2

# convert hf model to GGUF

python convert-hf-to-gguf.py phi-2

# fp-16 inference

./main -m phi-2/ggml-model-f16.gguf -p "Question: Write a python function to print the first n numbers in the fibonacci series"

You can also search Hugging Face directly for GGUF model files that have already been quantized and use LLaMA.cpp to try out the model quickly.

The original model is less than 6GB, while the Q4 quantized GGUF file is only 1.6GB in size. Q4-GGUF model URL: https://huggingface.co/TheBloke/phi-2-GGUF/tree/main
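These sizes can be sanity-checked with a back-of-the-envelope calculation: parameter count times bits per weight. The sketch below assumes Phi-2 has roughly 2.7 billion parameters and that a Q4_0 file stores about 4.5 bits per weight (4-bit values plus a 16-bit scale shared by each block of 32 weights). Both figures are assumptions for illustration, and real files also contain metadata and some tensors kept at higher precision.

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


PHI2_PARAMS = 2.7e9  # assumption: approximate parameter count of Phi-2

print(f"FP16: {weights_size_gb(PHI2_PARAMS, 16):.2f} GB")
print(f"Q4_0: {weights_size_gb(PHI2_PARAMS, 4.5):.2f} GB")
```

The estimates land close to the file sizes quoted above, which is a useful check before downloading a model to a board with limited storage and RAM.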

After downloading, place the model file in the directory: "llama.cpp-phi/models/".


Model Deployment

Run the following command in the terminal on the Raspberry Pi 5:

./main -m models/phi-2.Q4_0.gguf -p "Question: Write a python function to print the first n numbers in the fibonacci series"
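If you would rather launch the same command from a Python script, for example to try several prompts in a row, a minimal sketch might look like this. It assumes the `main` binary built earlier is in the current directory. The helper names `build_llama_cmd` and `run_llama`, and the default values chosen for `-n` (tokens to generate) and `-t` (CPU threads), are illustrative and not part of the original setup.

```python
import subprocess


def build_llama_cmd(model_path, prompt, n_predict=128, threads=4, binary="./main"):
    """Build the argument list for the llama.cpp `main` program."""
    return [
        binary,
        "-m", model_path,      # path to the GGUF model file
        "-p", prompt,          # prompt text
        "-n", str(n_predict),  # maximum number of tokens to generate
        "-t", str(threads),    # number of CPU threads to use
    ]


def run_llama(model_path, prompt, **kwargs):
    """Run llama.cpp and return the text it printed to stdout."""
    cmd = build_llama_cmd(model_path, prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_llama_cmd(
    "models/phi-2.Q4_0.gguf",
    "Question: Write a python function to print the first n numbers in the fibonacci series",
)
print(" ".join(cmd))  # shows the command without running it
```

Calling `run_llama(...)` with the same arguments would actually start inference, so it only works on a machine where the binary and the model file are present.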

Summary

Test for Raspberry Pi 5 (8GB) & LLM

Model	File Size	Compatibility	Out of Memory	Token Speed
phi-2-Q4	1.7GB	√		5.13 tokens/s
LLaMA-7B-Q4	< 4GB	√		2.2 tokens/s
LLaMA2-7B-Q4	< 7GB	√		2.3 tokens/s
LLaMA2-13B-Q4	< 4GB	√		2.02 tokens/s
mixtral_7bx2_moe_Q4	< 8GB	√		< 1 tokens/s (using llama.cpp)
mamba-gpt-7b	< 13GB		√

Test for Raspberry Pi 4B (8GB) & LLM

Model	File Size	Compatibility	Out of Memory	Token Speed
LLaMA-7B-Q4	< 4GB	√		~0.1 tokens/s
Alpaca-7B-Q4	< 4GB	√
LLaMA2-7B-Q4	< 7GB	√		~0.83 tokens/s
LLaMA-13B-Q4	< 8GB		√
ChatGLM-6B-Q4	13GB		√

The tables above show that LLMs run significantly faster on the Raspberry Pi 5 than on the Raspberry Pi 4B. [Deploy and run LLM on Raspberry Pi 4B (LLaMA, Alpaca, LLaMA2, ChatGLM)]

This indicates that the Raspberry Pi 5 has stronger processing capabilities. On this resource-limited device, phi-2-Q4 performs particularly well, reaching an eval speed of 5.13 tokens/s.
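To put the difference between the two boards in numbers, the short sketch below computes the speedup for the two models that have a token speed in both test tables, and the time phi-2-Q4 would need for a 100-token answer. The Raspberry Pi 4B figures are approximate in the tables, so the ratios are rough as well; the function names and the 100-token example are illustrative.

```python
# Token speeds (tokens/s) copied from the two test tables.
pi5 = {"LLaMA-7B-Q4": 2.2, "LLaMA2-7B-Q4": 2.3}
pi4b = {"LLaMA-7B-Q4": 0.1, "LLaMA2-7B-Q4": 0.83}  # approximate values


def speedup(model):
    """How many times faster the Raspberry Pi 5 is for the given model."""
    return pi5[model] / pi4b[model]


def seconds_for(tokens, tokens_per_second):
    """Time needed to generate a given number of tokens."""
    return tokens / tokens_per_second


for model in pi5:
    print(f"{model}: {speedup(model):.1f}x faster on Raspberry Pi 5")

print(f"100 tokens with phi-2-Q4: {seconds_for(100, 5.13):.1f} s")
```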

Besides phi-2-Q4, LLaMA-7B-Q4, LLaMA2-7B-Q4, and LLaMA2-13B-Q4 also run acceptably on the Raspberry Pi 5, at around 2 tokens/s. However, the Raspberry Pi 5 still cannot load LLMs larger than 8GB, which highlights its RAM capacity constraint.

For LLM applications that need higher performance, the LattePanda Sigma is worth considering: when running LLaMA2-7B-Q4, it reaches about 6 tokens/s. [Deploy and run LLM on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)]

Overall, the Raspberry Pi 5 is significantly faster than the Raspberry Pi 4B, but its RAM capacity still limits it when dealing with large LLMs. The LattePanda Sigma offers higher performance for applications that place heavier demands on LLMs. These limitations present both challenges and opportunities for future technological development.