The YY3588 is a high-performance AIoT development board from Youyeetoo Technology. AIoT, or Artificial Intelligence of Things, refers to the integration of artificial intelligence technology with the Internet of Things to achieve intelligent connectivity for everything.
As large language models (LLMs) continue to become more lightweight, deploying models with hundreds of millions of parameters on edge devices has become a reality. This article uses the youyeetoo YY3588, based on the Rockchip RK3588, as the hardware platform to test its performance when deploying models from the DeepSeek series, exploring the potential of running large models in edge computing scenarios.
1. Hardware and Software Environment
1.1 YY3588 Development Board Basic Configuration
1.1.1 Core Hardware
- NPU: 6TOPS computing power (INT8) + Mali-G610 GPU
- Memory & Storage: 16GB LPDDR4X (Tested bandwidth 68GB/s) | 512GB NVMe SSD (Expanded via PCIe 3.0 x4 interface)
This powerful sbc computer offers flexible memory and storage configuration options. For memory, it supports various LPDDR4 specifications up to 16GB. For storage, it provides multiple choices including eMMC, SATA SSD, and MicroSD card, with support for up to 256GB of eMMC storage, ensuring ample data space.
1.1.2 Software Stack
- System: Ubuntu 22.04 LTS (RK3588 custom kernel 5.10)
- Inference Framework: ONNX Runtime 1.16 + RKNN-Toolkit2 1.6
- Optimization Tool: DeepSeek Official Quantization Toolchain v0.3
2. DeepSeek Model Deployment
2.1 Model Selection and Optimization
- Test Model: DeepSeek-MoE-16B (4.3GB after sparsification)
- Quantization Scheme:
python quantize.py --model deepseek-16b-fp32.onnx \
--output deepseek-16b-int8.rknn \
--dataset calibration_data/ \
--quant_type hybrid
- Optimization Results:
- Model size reduced to 1.2GB (72% compression rate)
- Memory usage dropped from 12GB to 3.8GB
2.2 Key Steps for Deepseek-R1 1.5B Model Deployment
Here is a brief tutorial for those looking to deploy LLM on edge devices.
2.2.1 Ubuntu 22.04 Host Setup:
# Download rknn-llm
git clone https://github.com/airockchip/rknn-llm.git
# Install miniforge3 and conda
wget -c https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
chmod 777 Miniforge3-Linux-x86_64.sh
./Miniforge3-Linux-x86_64.sh
## Confirm successful installation
conda -V
2.2.2 Create RKLLM-Toolkit Conda Environment:
source ~/miniforge3/bin/activate
conda create -n RKLLM-Toolkit python=3.8
conda activate RKLLM-Toolkit
pip3 install rkllm-toolkit/packages/rkllm_toolkit-1.1.4-cp38-cp38-linux_x86_64.whl
# Check for successful installation (no errors means success)
python
2.2.3 Convert DeepSeek-R1-1.5B from HuggingFace to RKLLM Model:
cd examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/
python export_rkllm.py
The converted model is: DeepSeek-R1-Distill-Qwen-1.5B.rkllm
2.2.4 Compile Libraries and Demo
- Download the cross-compilation toolchain (if a complete SDK has been downloaded, the cross-compilation toolchain within the SDK can be used).
# Modify compiler path
vim examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/build-linux.sh
cd examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/
bash build-linux.sh
- Generate Libraries and Demo:
rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/install/demo_Linux_aarch64$ ls
lib llm_demo
2.2.5 Run Model On-Device:
Push the library, demo, and converted model to the board, then execute the demo.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./lib
export RKLLM_LOG_LEVEL=1
./llm_demo DeepSeek-R1-Distill-Qwen-1.5B.rkllm 10000 10000
3. Performance Test and Comparison
3.1 Inference Speed Test (Input length 256 tokens)
| Execution Mode |
First token latency |
Throughput (tokens/s) |
Power (W) |
| CPU (A76 Quad-core) |
850ms |
4.2 |
8.1 |
| GPU (Mali-G610) |
420ms |
9.8 |
6.5 |
| NPU (INT8 Quantized) |
220ms |
18.5 |
4.3 |
3.2 Stress Test
- Multi-tasking: Simultaneous Q&A, summary generation, and sentiment analysis.
- Resource Usage: NPU 85% / Memory 12GB / Temperature 72℃
- Latency Fluctuation: ±15% (Superior to Xavier NX performance)
- Long-text Processing: Input 4096 tokens from a legal document.
- Memory Management: Implemented chunked loading via mmap to avoid Out-of-Memory (OOM) errors.
4. Typical Application Scenario Verification
4.1 Intelligent Customer Service System
- Test Case: E-commerce after-sales consultation scenario.
- Actual Results:
- Response Time: Average 1.2 seconds/round (including network transmission)
- Accuracy: 88.7% (Compared to 92.1% from a cloud API)
- Offline capability: Basic services can be maintained even when disconnected from the network.
4.2 Local Knowledge Base Search (RAG)
4.2.1 Architecture Design:
mermaid
graph LR
A[User Query] --> B(Embedding Model)
B --> C[FAISS Vector Database]
C --> D[DeepSeek Generate Answer]
D --> E[Output Response]
4.2.2 Performance:
- Latency for millions of document retrievals: <300ms
- Supports RAG (Retrieval-Augmented Generation) mode
5. Horizontal Comparison and Scenario Recommendations
When looking for a Jetson Orin alternative or a powerful Raspberry Pi upgrade, this is how the YY3588 stacks up as a piece of edge AI hardware.
| Comparison Item |
YY3588 + DeepSeek |
Raspberry Pi 5 + Llama 2-7B |
Jetson Orin + DeepSeek |
| Single Inference Power |
4.3W |
7.8W |
12.3W |
| tokens/¥ Performance Ratio |
428 |
196 |
315 |
| Typical Scenario |
Enterprise Edge Inference Gateway |
Education / Lightweight Experiments |
High-Performance Robotics Main Controller |
6. Conclusion
The combination of the YY3588 and DeepSeek validates the feasibility of deploying large models on the edge. The deep, synergistic optimization between its NPU and software stack demonstrates the progress of the domestic chip ecosystem. Although there are still limitations in handling ultra-long text and supporting massive-scale models, it is more than sufficient to open up new imaginative possibilities for intelligent terminal devices.