diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/1_mixture_of_experts.md b/content/learning-paths/cross-platform/ernie_moe_v9/1_mixture_of_experts.md new file mode 100644 index 000000000..db5d10135 --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/1_mixture_of_experts.md @@ -0,0 +1,60 @@ +--- +title: Why MoE Models Let Edge Devices Run 21B LLMs +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## What is Mixture of Experts (MoE)? + +As large language models grow to tens of billions of parameters, traditional dense networks — which activate all weights for every input — become infeasible for edge deployment, especially on CPU-only Arm devices. + +A dense architecture is simple and uniform, but as model sizes increase into the billions of parameters it becomes both memory-intensive and compute-intensive. For edge environments such as mobile devices and embedded systems, this makes deploying large models nearly impossible. + +***[Mixture of Experts (MoE)](https://en.wikipedia.org/wiki/Mixture_of_experts)*** offers an alternative. +Instead of using all parameters all the time, MoE introduces a conditional computation mechanism: each input token activates only a small subset of model components (called ***experts***). +Think of it as having a team of specialists and calling on only the relevant few for a given task. This makes MoE ideal for environments where compute or memory is constrained, such as edge AI or embedded inference. + +In MoE: +- The model consists of many expert sub-networks (e.g., 64 experts). +- For each input, a router selects only 2–4 experts to compute the result. +- The rest of the experts remain inactive, conserving memory and compute. + +This dynamic routing is typically learned during training. During inference, only a fraction of the model is active, leading to much lower compute and memory usage ***without sacrificing total model capacity*** or ***diversity of learned behaviors***. + +## Benefits of MoE Architecture + +- Scalable Model Size: Increase total parameter count without linearly increasing inference cost. +- Efficient Inference: Lower memory and FLOPs per input. +- Modularity: Each expert can learn domain-specific patterns (e.g., finance, medicine, language). +- Specialization: Encourages the model to learn distinct processing behaviors across different experts. +- Routing Flexibility: Makes it easier to adapt to specific tasks using fine-tuned expert selection. + +## ERNIE-4.5: A MoE Model for Chinese NLP + +The [ERNIE-4.5](https://huggingface.co/collections/baidu/ernie-45) model family from [Baidu](https://huggingface.co/baidu) introduces a Mixture-of-Experts (MoE) architecture, which enables massive models (e.g., 21 billion parameters) to be deployed in constrained environments because only a small subset of parameters is activated during inference. +Specifically, ERNIE-4.5 uses a softmax-based router to select the top-6 experts from a pool of 64 in each MoE layer for every token. This makes runtime both efficient and adaptive, and allows the model to retain high performance and generalization while drastically reducing inference-time resource requirements.
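To make the routing step concrete, here is a schematic of the gating computation (the notation below is illustrative rather than taken from the ERNIE-4.5 paper): for a token representation $x$, the router scores all experts, keeps only the top-k scores, and combines just those experts' outputs:

$$
g = \mathrm{softmax}\big(\mathrm{TopK}(W_g\,x,\;k)\big), \qquad
y = \sum_{i \in \mathrm{TopK}(W_g\,x,\;k)} g_i \, E_i(x)
$$

With $k = 6$ out of 64 experts per layer, only about 3B of ERNIE-4.5's 21B parameters take part in any single token's forward pass.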
ERNIE-4.5 Model Series: +- PT (Post-Trained): General-purpose language model trained on Chinese and English data. +- Thinking: Optimized for reasoning tasks with long context support and structured outputs. + +In this learning path, we focus on the [ERNIE-4.5 Thinking](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking) variant as our primary model due to its enhancements for multi-step reasoning and long-context tasks. However, we also introduce the [PT (Post-Trained)](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT) variant to allow learners to compare model behavior across identical prompts, illustrating how task-specific tuning affects output quality. + +## Why MoE Matters for Edge Devices + +Deploying a 21B dense model on a CPU-only board is infeasible. But MoE changes that: + +| **Feature** | **Dense Model** | **MoE Model (e.g., ERNIE-4.5-21B)** | |-----------------------|-----------------|---------------| | `Total Parameters` | 21B | 21B | | `Activated Parameters`| 21B | ~3B | | `Memory Usage` | Very high | Moderate | | `Inference Speed` | Slow | Fast | + +This efficiency enables powerful language models to run locally on Arm-based platforms — making MoE not just a model design choice, but a deployment enabler. + +In the next module, you'll bring this architecture to life — preparing a real Armv9 board, setting up llama.cpp, and verifying that a 21B MoE model like ERNIE-4.5 can run efficiently with no GPU required. diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/2_llamacpp_installation.md b/content/learning-paths/cross-platform/ernie_moe_v9/2_llamacpp_installation.md new file mode 100644 index 000000000..ca0d61982 --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/2_llamacpp_installation.md @@ -0,0 +1,134 @@ +--- +title: Setting Up llama.cpp on an Armv9 Development Board +weight: 3 +layout: "learningpathall" +--- + +## Setting Up llama.cpp on an Armv9 Development Board + +In the previous section, you learned how Mixture-of-Experts (MoE) models reduce resource consumption by activating only a fraction of their parameters. +Now, you'll walk through how to prepare your environment to deploy `ERNIE-4.5 MoE` models on an Armv9 platform using `llama.cpp`. + +In this module, you'll verify model inference on the Radxa O6 and validate multilingual outputs using ERNIE's Thinking variant. + +This section prepares the foundation for deploying ERNIE-4.5 on an Armv9 platform. You will begin by reviewing the hardware—specifically, the `Radxa O6` development board equipped with an Armv9 CPU. From there, you will install llama.cpp, a lightweight inference engine, build it from source, and download the ERNIE-4.5 models in GGUF format (quantized to Q4 for efficient CPU inference). Finally, you will run a basic inference test to confirm that the environment is properly configured and ready for benchmarking and optimization in the next module. + +### Armv9 development board + +In this learning path, we use the [Radxa O6](https://radxa.com/products/orion/o6/) — a compact Armv9 development board powered by the [CIX CD8180](https://en.cixtech.com/Personal-Computing/) SoC. It features: + +- A 12-core Armv9.2 CPU +- Support for the SVE, dotprod, and i8mm instruction sets +- Multiple HDMI outputs, a PCIe Gen4 x8 slot, dual 5 Gbps Ethernet ports, and USB-C for I/O expansion + +We chose this board because it balances affordability and performance. Most importantly, it supports the vector instructions we'll benchmark later in this path. + +The default system image for the board is [Debian](https://docs.radxa.com/en/orion/o6/debian/debian-user-guide), which includes a ready-to-use user environment. You can verify or reflash the OS by following the instructions on the Radxa O6 [download page](https://docs.radxa.com/en/orion/o6/download).
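Before installing anything, you can optionally confirm that the CPU exposes the vector features used later in this path. The check below is a quick sketch that assumes a standard Linux image on the board; note that dotprod is reported as `asimddp`, and exact flag names can vary slightly between kernel versions.

```bash
# List the CPU feature flags and keep only the extensions this learning path relies on.
# On the Radxa O6 the output should include: asimddp, sve, i8mm
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sve|asimddp|i8mm)$'
```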
With the Radxa O6 ready, let's set up the software stack, beginning with llama.cpp. + +### Step 1: Clone and Build llama.cpp + +First, ensure your system is up to date and install the required tools and libraries: + +```bash +sudo apt update +sudo apt install build-essential cmake python3 python3-pip htop +``` + +Next, build [llama.cpp](https://github.com/ggml-org/llama.cpp/), an open-source C++ framework for running and experimenting with large language models. Designed to be lightweight and fast, llama.cpp supports inference on edge devices (CPU-only) and implements many of the most popular LLM architectures. + +In the context of MoE models, `llama.cpp` currently supports architectures such as: +- OpenAI-MoE +- OLMoE +- GLM-4-MoE +- Qwen2-MoE and Qwen3-MoE +- Grok +- ERNIE 4.5 + +These models use diverse routing and expert-management strategies, and llama.cpp provides a unified backend for efficient MoE inference. +For more in-depth coverage of llama.cpp capabilities and use cases, see these [learning paths](https://learn.arm.com/tag/llama.cpp/). + +Clone the repository and build it: + +```bash +cd ~ +git clone https://github.com/ggerganov/llama.cpp.git +cd llama.cpp + +mkdir build && cd build +cmake .. +make -j$(nproc) +``` + +This generates binaries such as `llama-cli` under the directory `~/llama.cpp/build/bin`, which we'll use to run inference in later steps. +Once llama.cpp is compiled, we can download the models we'll use for evaluation. + +### Step 2: Download the ERNIE-4.5 Q4 GGUF Models + +In this learning path, you will deploy [ERNIE-4.5](https://huggingface.co/collections/baidu/ernie-45) on Armv9. +Download both model variants so you can experiment later: + +```bash +mkdir -p ~/models/ernie-4.5 +cd ~/models/ernie-4.5 +wget https://modelscope.cn/models/unsloth/ERNIE-4.5-21B-A3B-PT-GGUF/resolve/master/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf +wget https://modelscope.cn/models/unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF/resolve/master/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf +``` + +Each model is about 12 GB and quantized to Q4, making it suitable for CPU-only inference. + +{{% notice Note %}} +The Q4 quantized models reduce the memory footprint and allow CPU-only inference — you'll still need around 12 GB of RAM for good performance. +{{% /notice %}} + +While both the Thinking and PT variants of ERNIE-4.5 share the same MoE architecture, they are fine-tuned for different objectives. The Thinking model is optimized for logical reasoning and structured generation, making it the main focus of the subsequent benchmarking and hardware optimization. You are encouraged to install both variants and observe behavioral differences using the same prompt. + +### Step 3: Run a Basic Inference Test + +Navigate to the build directory and run the following command to verify that the model loads correctly and supports multilingual input: + +```bash +cd ~/llama.cpp/build +./bin/llama-cli \ + --jinja \ + -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf \ + -p "Please introduce Mixture of Experts in Chinese." \ + -c 4096 -t 12 +``` + +Note the flags: +- ***-p***: Passes the input prompt directly as a string. +- ***-c 4096***: Sets the context length (in tokens). A longer context allows the model to "remember" more input text, which is crucial for long-form tasks. Here we use the recommended 4096 tokens. +- ***-t 12***: Specifies the number of CPU threads used for inference.
You should match this number to the physical cores (or logical threads) available on your system to maximize performance. +- ***--jinja***: Enables Jinja‑style prompt templates. Many Chinese‑oriented MoE models rely on this template format for structured inputs. + +If everything is set up correctly, you will see metadata output from llama.cpp indicating the model’s architecture and size: + +``` +print_info: model type = 21B.A3B +print_info: model params = 21.83 B +print_info: general.name = Ernie-4.5-21B-A3B-Thinking +``` + +Once inference is complete, the expected output will look like this (in Chinese): + +![img1 alt-text#center](meo_result.png "Introduce Mixture of Experts in Chinese") + +This answer demonstrates the model’s multilingual and structured reasoning ability. It begins with a narrative explanation introducing the concept of Mixture of Experts (MoE), followed by a well-organized markdown-style summary with section headers and bullet points: +- Concept breakdown: It describes MoE as a model combining multiple specialized sub-models (experts) and a routing mechanism (gateway) to activate only a few experts per input. +- Workflow explanation: It explains how routing selects top experts based on input features and how the system reduces computation by only activating a small number of experts per token. +- Chinese NLP examples: It gives examples such as word segmentation and translation, explaining how different experts may focus on distinct subtasks. + +This confirms: +- The GGUF model is successfully loaded. +- The llama.cpp build functions as expected. +- CPU-only inference on Armv9 is working. + +#### Why This Prompt Matters + +This prompt, “Please introduce Mixture of Experts in Chinese.”, was chosen for its dual pedagogical value: +- ***Bilingual Capability Check***: The instruction is issued in English, but the answer is expected in Chinese. This helps confirm that ERNIE-4.5’s multilingual support is active and effective. +- ***MoE Behavior Engagement***: The topic itself — explaining “Mixture of Experts” — requires combining multiple sub-skills: technical understanding, translation, and structured explanation. This likely triggers different experts within the model to contribute during inference. Even though routing isn’t explicitly logged, the richness and precision of the output suggest that MoE routing is functioning as designed. This kind of prompt increases the likelihood of multiple experts being activated simultaneously—e.g., language generation, machine learning knowledge, and Chinese translation. + +By using a single prompt, you verify setup correctness, observe output quality, and gain insight into MoE inference characteristics — all essential elements before moving on to hardware-specific performance tuning. diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/3_erine_moe.md b/content/learning-paths/cross-platform/ernie_moe_v9/3_erine_moe.md new file mode 100644 index 000000000..1dc710952 --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/3_erine_moe.md @@ -0,0 +1,179 @@ +--- +title: Comparing ERNIE PT vs Thinking and Expert Routing +weight: 4 +layout: "learningpathall" +--- + +## Compare ERNIE PT vs Thinking Model Behavior + +Now that both ERNIE-4.5 models are installed and verified, you can compare their output behavior on the same task. + +In this module, you’ll compare the inference styles of PT and Thinking models, and learn how to inspect internal MoE expert routing behavior during generation. 
With both ERNIE-4.5 models installed and ready, we can now observe how their tuning objectives affect output—even though both share the same MoE architecture and parameter count (~21B total, ~3B activated at runtime): +- PT (Post-Trained): General-purpose, trained on multilingual corpora. +- Thinking: Tuned for multi-step reasoning, long context, and structured response generation. + +### Example Task: Product Behavior Analysis +Copy the following prompt into a file named `prompt1.txt`: + +``` +You are a fitness brand strategist. +User profile: Buys protein powder + dumbbells + gym wear, works out at home 4-5× per week, shares results online, now exploring recovery nutrition and smart gym gear. +Task: +1. Identify their top motivation and one hidden pain point. +2. Propose one new product line. +3. Create a short marketing tagline (≤ 15 words). +``` + +Run the prompt using both models: + +***PT Variant:*** +```bash +./bin/llama-cli \ + --jinja \ + -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf \ + -f prompt1.txt \ + -c 4096 -t 12 +``` + +The response will look like this: + +``` +Assistant: 1. **Top Motivation**: Achieving visible results and maintaining progress through efficient recovery nutrition. +**Hidden Pain Point**: Balancing high-intensity training with optimal recovery nutrition during busy workouts. +2. **New Product Line**: *Smart Recovery Meal Kits* – Customizable, nutrient-dense, and easy-to-prepare post-workout meals designed for quick consumption and recovery optimization. +3. **Marketing Tagline**: "Fuel Progress, Recover Faster – Smart Nutrition for the Hard Worker." +``` + +The answer shows: +- Delivers conclusions directly: motivations and pain points are briefly mentioned with little reasoning. +- Product ideas are sensible but templated: suggestions like "Smart Recovery Meal Kits" are plausible but lack contextual grounding in user behavior. +- Tagline reads like a standard advertisement: e.g., "Fuel Progress, Recover Faster" feels promotional rather than personalized. + +***Thinking Variant:*** +```bash +./bin/llama-cli \ + --jinja \ + -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf \ + -f prompt1.txt \ + -c 4096 -t 12 +``` + +You will see a more comprehensive answer, like this: + +``` + assistant + +Let's tackle this user profile. The user buys protein powder, dumbbells, and gym wear. They work out 4-5 times a week at home and share results online. Now they're exploring recovery nutrition and smart gym gear. + +First, their top motivation. They're into sharing results online, so maybe they want to inspire others or gain recognition. That's a big motivation. But what's a hidden pain point? They're moving into recovery nutrition and smart gear. Maybe they find it hard to track recovery progress or ensure their nutrition is optimized for recovery. Smart gear could help, but maybe they don't have the right tools to monitor recovery metrics like sleep, nutrition, or muscle soreness. That's a hidden pain point since they're exploring it. + +Next, propose a new product line. Since they're into smart gym gear, maybe a smart recovery watch that tracks sleep, activity, and recovery metrics. Or a nutrition tracking device that syncs with their protein powder and dumbbells. Wait, they might need something that integrates with their existing products. Maybe a smart scale that tracks muscle and body composition, connected to their fitness apps. Or a wearable device that monitors recovery parameters like heart rate variability and sleep.
+ +Marketing tagline needs to be concise, under 15 words. Combine motivation and product. For example, "Track Your Recovery, Maximize Your Gains—Smart Gear for Every Workout." But maybe more catchy. "Smarter Recovery, Stronger Results. Your Workout, Tracked." Hmm, not sure. Let me think. Maybe "Track Recovery, Fuel Progress. Smart Gear for Every Muscle." Wait, the user shares results online, so maybe highlight community or shared progress. "Share. Track. Conquer. Smart Gear for Your Recovery Journey." That's 10 words. Or "Your Workout, Tracked. Recovery, Simplified. Smart Gear Now." Maybe too vague. Let me refine. "Track. Recover. Share. Smart Gear for Your Journey." 8 words. Maybe that's better. Or "Recovery Tracked. Gains Shared. Smart Gear for Every Muscle." 9 words. Not sure. Alternatively, focus on the product. "Smart Recovery, Shared Success. Your Gear, Now Smarter." 7 words. Maybe that's better. + +Wait, the user is moving into recovery nutrition and smart gear. So the product line could be a smart recovery tracker. The tagline should tie their motivation (sharing results) with the new product. Maybe "Track. Recover. Share. Smart Gear for Your Journey." 8 words. Or "Your Workouts, Tracked. Recovery, Simplified. Smart Gear Now." 9 words. Maybe that works. + +So to sum up: + +1. Top motivation: Inspiring others through shared results and progress tracking. Hidden pain point: Difficulty tracking recovery metrics and nutrition effectively. + +2. Product line: Smart Recovery Tracker (wearable or app integrating with gear) that monitors sleep, nutrition, and recovery metrics, syncing with their existing products. + +3. Tagline: "Track. Recover. Share. Smart Gear for Your Journey." (8 words) + + + +1. **Top Motivation**: Inspiring others through shared results and progress tracking. +2. **Hidden Pain Point**: Struggling to track recovery metrics (sleep, nutrition, muscle soreness) effectively. +3. **Product Line**: **Smart Recovery Tracker** (wearable monitoring sleep/nutrition, syncing with protein powder/dumbbells to optimize recovery). +4. **Tagline**: *"Track. Recover. Share. Smart Gear for Your Journey."* (8 words) + + +``` + +The answer shows: +- Responses show deeper layering: begins with inferring user motivation, iterates through possible product strategies, and only then crafts a refined tagline. +- Transparent reasoning process: the model “thinks out loud” (“let me try again… maybe that’s too long…”), mimicking human deliberation. +- Marketing language reflects user mindset: taglines like “Share. Track. Conquer.” directly appeal to community sharing and progress motivation. 
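Scrolling back through the console makes a side-by-side comparison awkward, so you may prefer to capture both runs to files first. The script below is a small helper sketch that reuses the working directory, model paths, and flags from the commands above; if your llama.cpp build drops into interactive chat mode after answering, stop it with `Ctrl+C` once the response has been printed and the loop will move on to the next model.

```bash
#!/usr/bin/env bash
# Run the same prompt through both ERNIE-4.5 variants and save each response to a file.
cd ~/llama.cpp/build

for variant in PT Thinking; do
  ./bin/llama-cli \
    --jinja \
    -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-${variant}-Q4_0.gguf \
    -f prompt1.txt \
    -c 4096 -t 12 \
    > "output_${variant}.txt" 2>&1
done

# View the two responses side by side (press q to quit).
diff -y output_PT.txt output_Thinking.txt | less
```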
### Compare the Outputs +After execution, review the responses and compare them along the following dimensions: + +| **Aspect** | **PT Model** | **Thinking Model** | |----------------------|---------------------------------------------------|--------------------------------------------------------------------------------------| | `Language Style` | Direct, ad-like tone | Layered and natural, mimics human internal reasoning | | `Reasoning Depth` | High-level summarization with minimal exploration | Step-by-step inference with intermediate reflections and multiple alternatives | | `Suggested Actions` | Pre-packaged idea (e.g., meal kits) | Context-aware concepts (e.g., smart gear that integrates with current user behavior) | + +## Observe MoE Expert Routing with Debug Logs + +If you want to look under the hood of the MoE model, you can add a debug log to observe which experts are routed during inference. + +Open `src/models/ernie4-5-moe.cpp` in the llama.cpp repository and locate the call to `build_moe_ffn()`. +Add a print statement next to it. For example: + +```c +printf("---[DEBUG]--- entering build_moe_ffn at layer %d with %d experts (use %d)\n", il, n_expert, n_expert_used); +``` + +Rebuild llama.cpp: + +```bash +cd ~/llama.cpp/build +make -j$(nproc) +``` + +Run inference with the same prompt and monitor the console for lines such as: +``` +---[DEBUG]--- entering build_moe_ffn at layer 1 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 2 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 3 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 4 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 5 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 6 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 7 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 8 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 9 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 10 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 11 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 12 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 13 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 14 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 15 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 16 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 17 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 18 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 19 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 20 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 21 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 22 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 23 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 24 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 25 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 26 with 64 experts (use 64) +---[DEBUG]--- entering build_moe_ffn at layer 27 with 64 experts (use 64) +``` + +Each line reports the layer index, the total number of experts in that layer, and the number of experts made available to the routing step.
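Because every MoE layer prints one of these lines per forward pass, the raw log grows quickly. The pipeline below is a sketch for summarizing it, assuming the debug text is printed to the console exactly as shown above; it tallies how often each layer's MoE block is entered during a run:

```bash
cd ~/llama.cpp/build
# Re-run the prompt and count the [DEBUG] lines per layer.
# Every layer should appear roughly the same number of times: once per forward pass.
./bin/llama-cli \
  --jinja \
  -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf \
  -f prompt1.txt \
  -c 4096 -t 12 2>&1 \
  | grep '\[DEBUG\]' | sort | uniq -c
```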
{{% notice Note %}} +You can also trace the function `llm_graph_context::build_moe_ffn()` in `src/llama-graph.cpp` to see how the experts are selected. +{{% /notice %}} + +What to observe: +- Whether the number of active experts changes between the PT and Thinking models. +- Patterns in routing: e.g., different token batches routing to differing expert sets. +- Correlations between routing behavior and output differences: deeper routing variety may align with more detailed responses. + +## Takeaway +This task highlights the advantage of MoE fine-tuning: even under the same architecture, thoughtful tuning can significantly change a model's reasoning behavior. It also reinforces that Thinking is better suited for applications requiring analytical depth — ideal for edge AI scenarios like customer profiling or real-time recommendations. + +In the next section, you'll switch focus from model behavior to system-level performance — compiling with Armv9 instruction sets and measuring the impact on inference speed. diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/4_v9_optimization.md b/content/learning-paths/cross-platform/ernie_moe_v9/4_v9_optimization.md new file mode 100644 index 000000000..41dc5e9be --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/4_v9_optimization.md @@ -0,0 +1,132 @@ +--- +title: Armv9 Optimization and MoE Efficiency +weight: 5 +layout: "learningpathall" +--- + +## Accelerate ERNIE-4.5 with Armv9 Optimizations + +In previous modules, you've learned how MoE enables large model deployment on CPUs, and how to observe inference behavior with ERNIE-4.5. Now, we'll optimize performance using Armv9 architecture features and benchmark the improvements. + +This section shows how to benchmark performance under two scenarios: with and without Armv9 vector instruction optimizations. + +We'll compare: +- Baseline: regular CPU build +- Optimized: Armv9-specific build with SVE/i8mm/dotprod enabled + +To establish a performance baseline, let's first compile llama.cpp without Armv9 optimizations. + +### Disable Armv9 Optimizations in llama.cpp + +This step builds `llama.cpp` without Armv9 vector features to establish a baseline. + +```bash +cd ~/llama.cpp +mkdir build_v9_off && cd build_v9_off +cmake \ + -DLLAMA_CURL=OFF \ + -DGGML_LLAMAFILE=OFF \ + -DGGML_VULKAN=OFF \ + -DBUILD_SHARED_LIBS=OFF \ + -DCMAKE_SYSTEM_PROCESSOR=arm64 \ + -DCMAKE_OSX_ARCHITECTURES=arm64 \ + -DGGML_NATIVE=OFF \ + -DGGML_AVX=OFF \ + -DGGML_AVX2=OFF \ + -DGGML_AVX512=OFF \ + -DGGML_FMA=OFF \ + -DGGML_F16C=OFF \ + -DGGML_CPU_KLEIDIAI=OFF \ + .. + +make -j$(nproc) +``` + +Then run the benchmark from the `build_v9_off` directory: + +```bash +./bin/llama-bench -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf -pg 128,128 -t 8 +``` + +The results on the 24 GB Radxa O6 will look like this: + +| model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | pp512 | 14.96 ± 0.01 | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | tg128 | 12.03 ± 0.02 | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | pp128+tg128 | 13.51 ± 0.03 | + +With the baseline captured, we now recompile with the Armv9 vector extensions enabled. + +### Enable Armv9 Optimizations in llama.cpp + +Now rebuild with the vector extensions enabled (i8mm, dotprod, and SVE) using the following configuration:
```bash +cd ~/llama.cpp +mkdir build_v9_on && cd build_v9_on +cmake \ + -DLLAMA_CURL=OFF \ + -DGGML_LLAMAFILE=OFF \ + -DGGML_VULKAN=OFF \ + -DBUILD_SHARED_LIBS=OFF \ + -DCMAKE_SYSTEM_PROCESSOR=armv9-a \ + -DCMAKE_OSX_ARCHITECTURES=arm64 \ + -DGGML_NATIVE=OFF \ + -DGGML_AVX=OFF \ + -DGGML_AVX2=OFF \ + -DGGML_AVX512=OFF \ + -DGGML_FMA=OFF \ + -DGGML_F16C=OFF \ + -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod+sve \ + -DGGML_CPU_KLEIDIAI=ON \ + .. + +make -j$(nproc) +``` + +{{% notice Note %}} +We disable GPU and other backend support to focus exclusively on CPU performance and optimization for this learning path. +{{% /notice %}} + +Then re-run the benchmark from the `build_v9_on` directory: + +```bash +./bin/llama-bench -m ~/models/ernie-4.5/ERNIE-4.5-21B-A3B-Thinking-Q4_0.gguf -pg 128,128 -t 8 +``` + +The results on the 24 GB Radxa O6 will look like this: + +| model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | pp512 | 38.51 ± 0.11 | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | tg128 | 15.96 ± 0.08 | | ernie4_5-moe 21B.A3B Q4_0 | 11.64 GiB | 21.83 B | CPU | 8 | pp128+tg128 | 21.58 ± 0.11 | + +Let's now compare the results side by side to see how much performance is gained. + +### Comparing Performance: Armv9 Optimization Results + +After running the benchmarks with and without Armv9-specific instructions, the results show significant gains. + +| Test | Armv9 off | Armv9 on | Gain | |---------------|-----------------|----------------|---------| | pp512 | 14.96 token/s | 38.51 token/s | 2.57x | | tg128 | 12.03 token/s | 15.96 token/s | 1.32x | | pp128 + tg128 | 13.51 token/s | 21.58 token/s | 1.59x | + +- Vectorized kernels (i8mm, dotprod, SVE) drastically improve inference throughput. +- The pp512 test shows the most significant acceleration, delivering a 2.57× improvement. +- Other patterns like tg128 and pp128+tg128 also achieve measurable gains, demonstrating the broad benefit of hardware-aware builds. +- Armv9 optimization enables practical real-time inference for 21B models on edge-class hardware. + +### Summary +In this learning path, you've walked through every stage of deploying a 21B-parameter Chinese MoE model on edge-class Armv9 hardware. You: +- Understood how MoE reduces memory usage by only activating a small subset of parameters per token. +- Set up llama.cpp and deployed ERNIE-4.5 on a Radxa O6 board. +- Compared ERNIE-4.5 Thinking and PT model behaviors and examined expert routing logic with debug instrumentation. +- Applied Armv9 hardware optimizations to unlock over 2.5× speed improvements in token throughput. + +You now have the full-stack capabilities to deploy, profile, and tune Chinese LLMs for efficient inference on modern Arm CPUs. diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/_index.md b/content/learning-paths/cross-platform/ernie_moe_v9/_index.md new file mode 100644 index 000000000..3dac628c6 --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/_index.md @@ -0,0 +1,63 @@ +--- +title: Running ERNIE Mixture of Experts (MoE) Models on Armv9 with llama.cpp + +draft: true +cascade: + draft: true + +minutes_to_complete: 60 + +who_is_this_for: This learning path is designed for developers and engineers looking to deploy Mixture-of-Experts (MoE) models — such as ERNIE-4.5 — on edge-class devices.
MoE architectures allow massive LLMs (21B+ parameters) to run with only a fraction of their weights active per inference, making them ideal for resource-constrained environments. + +learning_objectives: + - Understand how MoE models like ERNIE-4.5 enable large-scale inference on edge devices. + - Set up and execute ERNIE-4.5 (PT and Thinking versions) using llama.cpp and compare the inference behavior. + - Analyze the performance impact of enabling Armv9-specific hardware optimizations. + +prerequisites: + - An Armv9 device with at least 32 GB of available disk space. This learning path uses the [Radxa O6](https://radxa.com/products/orion/o6/). + +author: Odin Shen + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Cortex-A +operatingsystems: + - Linux +tools_software_languages: + - Python + - C++ + - Bash + - llama.cpp + +### Cross-platform metadata only +shared_path: true +shared_between: + - laptops-and-desktops + - servers-and-cloud-computing + - iot + - mobile-graphics-and-gaming + +further_reading: + - resource: + title: ERNIE-4.5-21B ModelScope link + link: https://modelscope.cn/models/unsloth/ERNIE-4.5-21B-A3B-PT-GGUF + type: website + - resource: + title: llama.cpp GitHub repository + link: https://github.com/ggml-org/llama.cpp.git + type: documentation + - resource: + title: Arm Learning Path + link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/ + type: Learning Path + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/_next-steps.md b/content/learning-paths/cross-platform/ernie_moe_v9/_next-steps.md new file mode 100644 index 000000000..c3db0de5a --- /dev/null +++ b/content/learning-paths/cross-platform/ernie_moe_v9/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/cross-platform/ernie_moe_v9/meo_result.png b/content/learning-paths/cross-platform/ernie_moe_v9/meo_result.png new file mode 100644 index 000000000..a5a34df99 Binary files /dev/null and b/content/learning-paths/cross-platform/ernie_moe_v9/meo_result.png differ