LLM Prompt Processing Performance
Prompt processing in LLMs is compute-heavy, as every token passes through billions of parameters. Even with big GPUs it can feel slow, but quantization, optimized backends, and future NPU offloading promise big speedups.
Reference hardware:
- 48GB Mac M4 Pro
- 128GB AMD AI 9 HX370 (32GB CPU, 96GB iGPU)
Test model: qwen/qwen3-30b-a3b-2507 (4-bit, 17.19GB, context length raised from 4K to 260K)
Understanding and Improving the Prompt Processing Stage in LLM Inference
When you run a large language model (LLM) locally (for example, in LM Studio), the first phase of inference is prompt processing. In this stage, the model reads through the entire input prompt before it starts generating any output tokens. This involves converting the prompt text into tokens (IDs), then feeding those tokens through all the layers of the transformer model to build up the context. In practical terms, the model performs many large matrix multiplications and other tensor operations for each token in the prompt, using all of the model’s parameters (which can be tens of billions)[1]. As each new token is processed, the model updates its internal state (often called the key/value cache) so it “remembers” earlier tokens. By the end of prompt processing, the model has ingested the entire prompt and filled its attention cache with the prompt’s information. Only then can it begin the next phase: generating a response token by token.
This process is inherently heavy. Every token in the prompt passes through every layer of the network, and each token’s representation depends on the tokens before it. In a dense 30B-parameter model, that means each token is multiplied against tens of billions of weight values. (The test model here, qwen3-30b-a3b, is a mixture-of-experts model: the “a3b” in the name indicates that only about 3B parameters are active per token, so the per-token compute is closer to that of a 3B dense model, although all 30B weights must still sit in memory.) There’s no shortcut around the total work, but the prompt is not literally read one token at a time: because all prompt tokens are known up front, optimized implementations push many tokens through each layer together as large matrix multiplications, with a causal mask so that each position only sees earlier ones. In summary, prompt processing (also called the “prefill” stage in some papers) requires the model to perform a huge amount of computation over a lot of data. It does the model’s work for N tokens in one go, where N is the prompt length[1].
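To make the point about parallel processing concrete, here is a minimal pure-Python sketch of single-head causal attention. It is a toy, not how any real backend is written: the function names are made up for illustration, and the query/key/value vectors are passed in directly rather than computed from model weights. It shows that handling all prompt positions in one pass gives the same outputs as feeding tokens one at a time with a growing key/value cache.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head attention of one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def prefill_all_at_once(queries, keys, values):
    """Causal attention for every prompt position.

    Row i only looks at positions 0..i. The rows do not depend on each
    other, which is why a real backend can compute them in parallel.
    """
    return [
        attend(queries[i], keys[: i + 1], values[: i + 1])
        for i in range(len(queries))
    ]


def process_token_by_token(queries, keys, values):
    """Same computation, done incrementally with a growing key/value cache."""
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(queries, keys, values):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs


# Hypothetical 3-token prompt with 2-dimensional vectors.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

batched = prefill_all_at_once(Q, K, V)
incremental = process_token_by_token(Q, K, V)
```

Both paths produce the same values; the difference is only in how the work is scheduled. The sequential part that remains in a real model is across layers, since layer k needs the output of layer k-1.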
Why Prompt Processing Can Be So Slow
Several factors make this stage time-consuming:
- Amount of Computation: As described, the model must apply every transformer layer to each prompt token. If your prompt is long (hundreds or even thousands of tokens), the compute work multiplies. For example, a prompt of 1000 tokens might require roughly 1000 × (operations per token) computations. Even though high-end GPUs or accelerators can crunch many operations in parallel, the sheer volume of math is enormous. This makes prompt processing compute-bound in many cases[2][3] – meaning the speed is limited by how many FLOPs (floating point operations) your hardware can do per second.
- Memory Access and Bandwidth: Large models don’t fit their entire parameter set in cache, so while processing a token the hardware has to fetch billions of parameters from memory (VRAM or unified RAM) to multiply with that token’s data[4]. Modern accelerators mitigate this with very fast memory. For instance, an NVIDIA RTX 4090 has ~1 TB/s memory bandwidth, and Apple’s M2 Pro/Max chips use wide unified memory buses offering on the order of 200–500 GB/s[5]. In your case, the Apple M4 Pro (48GB) is specified by Apple at 273 GB/s, while the AMD Ryzen AI 9 HX370 machine feeds its iGPU from dual-channel DDR5/LPDDR5X system memory, which is likely closer to 100 GB/s and well below high-end GPU VRAM. Even with high bandwidth, however, moving gigabytes of model weights for each token is a bottleneck. During prompt processing, this is alleviated by batching: many prompt tokens are processed together, so each weight fetched from memory is reused across all of them[3]. As a result, prompt processing is usually limited by compute, whereas generation (one token at a time) tends to be limited by memory bandwidth[6].
- Sequential Nature: The transformer’s self-attention means each token’s representation depends on all previous tokens, and each layer depends on the output of the layer before it. Within a layer, all prompt tokens can be processed in parallel under a causal mask, but the layers themselves must run in order, and you can’t split the prompt into fully independent chunks without special handling, because later chunks need the keys and values of earlier ones. Tricks aimed at the generation phase, such as speculative decoding, don’t help here either. In practical terms, you might notice your GPU usage shoot up to 100% when you first input a prompt (all cores working on that burst of computation), and only after that does the model start responding.
- Software/Implementation Overheads: The efficiency of prompt processing depends on the software backend. Some frameworks (like PyTorch or TensorRT) can process a whole sequence in a batch, which maximizes GPU utilization. Others (like some llama.cpp backends) might originally have processed prompts in a more incremental loop, which could be less efficient. LM Studio builds on open-source inference engines (llama.cpp, plus Apple’s MLX on Apple silicon), so its prompt processing speed also depends on how optimized those libraries are. If the implementation isn’t fully utilizing the GPU or has to fall back to CPU for certain operations, that will slow things down. For instance, there have been reports of inconsistent GPU usage in LM Studio where the GPU isn’t fully loaded during prompt processing[7]. Ensuring the most efficient backend (Metal on Mac, Vulkan or ROCm on AMD, etc.) is used will make a difference.
In summary, prompt processing often ends up being the slowest part of running a local LLM, sometimes taking the majority of the total time before you see the first output token. Studies have noted that this prefill stage can dominate end-to-end latency – on a mobile CPU it might be >90% of the total runtime for a long prompt, and even on a mobile-class GPU it can be ~50–90% of the latency[2]. Your experience of waiting a long time during “Processing Prompt” in LM Studio aligns with these observations.
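The compute-bound versus memory-bound distinction above can be sketched with a simple roofline-style estimate. This is a rough model under stated assumptions, not a benchmark: it uses the common rule of thumb of about 2 FLOPs per active weight per token, assumes the weights are read from memory once per batch of tokens, ignores attention’s extra cost on long prompts, and every hardware number below is a hypothetical placeholder. For a mixture-of-experts model, a batch of tokens touches more experts than a single token does, so the bytes read per batch are higher than this simple model implies.

```python
def step_time(n_tokens, active_params, weight_bytes, flops_per_s, bytes_per_s):
    """Estimate the time for one forward pass over n_tokens.

    Returns (seconds, limiting_factor). The pass takes at least as long as
    the slower of: doing the math, or streaming the weights from memory.
    """
    compute_s = 2.0 * active_params * n_tokens / flops_per_s
    memory_s = weight_bytes / bytes_per_s
    bound = "compute" if compute_s > memory_s else "memory"
    return max(compute_s, memory_s), bound


def crossover_tokens(active_params, weight_bytes, flops_per_s, bytes_per_s):
    """Batch size at which compute time equals weight-streaming time."""
    return weight_bytes * flops_per_s / (2.0 * active_params * bytes_per_s)


# Hypothetical dense 7B model at 4 bits per weight on hypothetical hardware.
PARAMS = 7e9
WEIGHT_BYTES = 3.5e9      # 7e9 weights * 0.5 bytes
FLOPS = 10e12             # assumed 10 TFLOPS sustained
BANDWIDTH = 250e9         # assumed 250 GB/s

decode = step_time(1, PARAMS, WEIGHT_BYTES, FLOPS, BANDWIDTH)
prefill = step_time(512, PARAMS, WEIGHT_BYTES, FLOPS, BANDWIDTH)
crossover = crossover_tokens(PARAMS, WEIGHT_BYTES, FLOPS, BANDWIDTH)

print(f"decode (1 token):     {decode[0] * 1000:.1f} ms, {decode[1]}-bound")
print(f"prefill (512 tokens): {prefill[0] * 1000:.1f} ms, {prefill[1]}-bound")
print(f"crossover at about {crossover:.0f} tokens per batch")
```

With these placeholder numbers, a single-token step is limited by memory bandwidth while a 512-token prefill batch is limited by compute, which matches the qualitative picture in the cited sources. Substituting your own hardware figures changes the crossover point, not the overall shape.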
Current Strategies to Speed Up Prompt Processing
While we can’t eliminate the fundamental workload, there are a few things you can do (or ensure) now to make prompt processing faster:
- Use GPU Acceleration: This may sound obvious, but using the fastest processor available is critical for the compute-heavy prompt stage. GPUs (or other accelerators) are much faster than CPUs at the matrix math involved. It sounds like you already do this, since your model fits entirely in GPU memory. Make sure LM Studio is actually using the GPU on both your Mac and your AMD machine. On Mac, LM Studio runs the model on the M-series GPU through Apple’s Metal API (via its llama.cpp Metal backend or its MLX engine). On Windows/AMD, LM Studio’s llama.cpp runtimes can use Vulkan, or ROCm where the GPU is supported, for the Radeon iGPU. GPU acceleration is vital because prompt processing benefits from raw compute throughput, where dedicated GPUs (and Apple’s integrated GPU) excel[3]. If for some reason the GPU isn’t fully utilized (check usage graphs), you might need to tweak settings or update LM Studio, since GPU-utilization problems reported with some older versions were improved in later releases[8].
- Quantization of the Model: You’re using a quantized build of Qwen3-30B (the 4-bit, 17.19GB file listed above; note that “a3b” in the model name refers to roughly 3B active parameters in a mixture-of-experts design, not to the weight format). Quantization significantly reduces model size by using lower precision (e.g. 4-bit or 3-bit integers instead of 16-bit floats). This not only lets the model fit in memory, but also cuts down memory bandwidth usage and can increase speed. With fewer bytes per weight, the model can fetch data faster and sometimes even use special integer math units for faster computation[9]. For example, going from 16-bit to 4-bit can increase generation speed by ~1.6× while only slightly impacting accuracy[10]. In prompt processing, a quantized model means each token requires transferring much less weight data, which mainly helps the memory side; the compute benefit depends on whether the backend has fast low-precision kernels. Ensure you use an efficient quantization scheme (the Q4_K and Q5_K variants are popular sweet spots[11]). The fact that the whole model fits in VRAM is great: it means the GPU isn’t constantly swapping data from system memory or disk, which would really slow things down.
- Optimize Batch/Sequence Processing: Some model runtimes allow you to process the prompt in larger chunks or batches to better utilize the hardware. While generation inherently is one token at a time, prompt processing can use a batch approach internally (treating the prompt tokens as a batch sequence). This improves throughput by amortizing memory loads over multiple tokens. In practice, this is handled by the library, so the main thing is to use up-to-date, optimized libraries. For instance, libraries like NVIDIA’s FasterTransformer or Hugging Face Transformers with GPU acceleration will do batched matrix multiplies for the prompt. In llama.cpp, ensure you have the latest version, as it has improved a lot on prompt processing speed by using techniques like AVX/OpenBLAS on CPU and better Metal/OpenCL backends on GPUs.
- Avoid Unnecessarily Long Prompts: This is more of a usage tip, but since prompt processing cost scales with the number of tokens, any redundant length in the prompt exacts a speed penalty. If you’re able to shorten the prompt or omit extraneous context, that directly reduces processing time. Of course, often the long context is intentional (e.g. providing the model a lot of info). In cases like retrieval-augmented generation, you might feed many documents as context, and this will slow things considerably. If possible, only include what you need in the prompt. (By contrast, once the prompt is processed, the cost of generation depends on how many tokens you produce rather than on re-reading the prompt, because each new token reuses the already-cached context.)
- Reuse Context (if possible): Some workflows allow reusing part of the prompt computation across queries. For example, if you have a fixed system prompt or a long conversation history that doesn’t change, certain implementations can cache the model’s state for that and only process new user input each time. Hosted APIs offer prompt caching for the same reason, so that a repeated prefix does not have to be fully re-processed on every request. In local setups, reuse is most reliable when you keep the session open and continue the conversation, in which case the model’s state already contains the previous context. llama.cpp also provides prompt-cache options (for example --prompt-cache in its CLI and cache_prompt in its server API), but how much of this a front end like LM Studio exposes varies by version. If you always use the same lengthy prefix, one workaround is to start a conversation with that prefix and then treat each new query as a continuation (though managing that can be awkward).
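The context-reuse idea in the last bullet comes down to comparing token sequences. The sketch below is a simplified illustration with made-up function names and token IDs; it is not the actual LM Studio or llama.cpp code. It assumes the runtime keeps the token IDs whose key/value entries are already cached and only needs to run the model over whatever follows the shared prefix.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading token IDs shared by the cached and the new prompt."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_process(cached_tokens, new_tokens):
    """Return the part of new_tokens that still has to go through the model.

    At least the final token is always re-run, because its output is what
    produces the first generated token.
    """
    keep = common_prefix_len(cached_tokens, new_tokens)
    keep = max(0, min(keep, len(new_tokens) - 1))
    return new_tokens[keep:]


# Hypothetical token IDs: a shared system prompt followed by different questions.
previous_prompt = [101, 7, 7, 42, 900, 901]
next_prompt = [101, 7, 7, 42, 555, 556, 557]

remaining = tokens_to_process(previous_prompt, next_prompt)
print(f"{len(remaining)} of {len(next_prompt)} tokens need processing")
```

The catch is that reuse only works for an exact prefix match: changing even one early token (for example a timestamp at the top of the system prompt) invalidates everything after it.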
In summary, the immediate things you can do are mostly about using the right hardware and settings: ensure GPU is used, use quantization, keep software updated for better kernels, and manage prompt lengths. These won’t eliminate the delay, but they can mitigate it.
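As a worked example of why quantization matters for fitting the model in memory, the weight footprint is roughly the parameter count times the bits per weight. The figures below are assumptions for illustration: about 30.5 billion total parameters for the test model, and an effective 4.5 bits per weight for a 4-bit format (4-bit values plus per-group scale metadata). Real files also contain some tensors kept at higher precision, so actual sizes differ slightly.

```python
def weight_size_gb(total_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 30.5e9  # assumed total parameter count, not taken from the file

size_fp16 = weight_size_gb(TOTAL_PARAMS, 16)
size_q4 = weight_size_gb(TOTAL_PARAMS, 4.5)  # assumed effective bits per weight

print(f"16-bit:  {size_fp16:.1f} GB")
print(f"~4-bit:  {size_q4:.1f} GB")
```

Under these assumptions the 4-bit estimate lands close to the 17.19GB file size listed at the top, while the 16-bit version would not fit in the 48GB Mac alongside a large context cache.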
Future Improvements and Research (NPU Offloading and Beyond)
Looking ahead, there is active research and development aimed at making prompt processing faster on local devices. You mentioned an NPU (Neural Processing Unit) as a possible avenue – indeed, NPUs are specialized accelerators (common in phones and some new PCs) designed for neural network operations, and they could play a big role in speeding up LLM inference.
One cutting-edge example is a 2025 research project called llm.npu[12]. This work specifically targets the prefill/prompt stage bottleneck. The authors point out that for on-device LLMs, the prompt processing often dominates latency (sometimes 90%+ of total time) in tasks that require long context understanding[13][2]. Their solution was to offload much of this work to a mobile NPU (like those in smartphone chips) which is very fast at int8 matrix multiplies. However, using NPUs isn’t straightforward – one challenge is that NPUs (and frameworks like Qualcomm’s QNN or Apple’s CoreML) often expect fixed-size computation graphs. A prompt can be of arbitrary length, which would naïvely require rebuilding an NPU computation graph for each new input length – a very slow process that can negate any speed gains[14]. The llm.npu project tackled this by dividing the prompt into fixed-size chunks and streaming them through the NPU. In other words, they break a long prompt into, say, chunks of 128 tokens and process those sequentially on the NPU, reusing a pre-compiled graph for each chunk[15]. They also identify outlier operations that are not NPU-friendly (certain rare high-magnitude activation values) and handle those on CPU/GPU in parallel, plus schedule transformer blocks out-of-order between the NPU and CPU/GPU to keep all hardware busy[16]. The result is impressive: over 22× faster prompt processing on average compared to baseline, and they achieved over 1,000 tokens per second processed on a 1B-parameter model using a phone NPU[17]. This kind of speedup could dramatically reduce the waiting time for long prompts in the future[18].
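The fixed-size chunking idea can be sketched in a few lines. This is only an illustration of the general approach described above, not the llm.npu implementation: the function name, the padding of the final chunk, and the pad token ID are all assumptions made for the example.

```python
def split_into_fixed_chunks(token_ids, chunk_size=128, pad_id=0):
    """Split a prompt into equal-length chunks so one precompiled graph fits all.

    Returns a list of (chunk, valid_len) pairs. The last chunk is padded up
    to chunk_size; valid_len tells the caller how many entries are real
    tokens so the padding can be masked out.
    """
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = list(token_ids[start:start + chunk_size])
        valid_len = len(chunk)
        chunk += [pad_id] * (chunk_size - valid_len)
        chunks.append((chunk, valid_len))
    return chunks


# Hypothetical 300-token prompt (token IDs 1..300).
prompt = list(range(1, 301))
chunks = split_into_fixed_chunks(prompt, chunk_size=128)

for i, (chunk, valid_len) in enumerate(chunks):
    print(f"chunk {i}: length {len(chunk)}, real tokens {valid_len}")
```

Each chunk has the same shape, so a graph compiled once for 128 tokens can be reused for every chunk. The chunks still have to run in order, because later chunks attend to the keys and values produced by earlier ones.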
For your context, “NPU” could also refer to things like Apple’s Neural Engine (ANE) or AMD’s XDNA/Ryzen AI engine. These are analogous specialized units for AI workloads. Apple’s 16-core Neural Engine, for example, is optimized for low-precision (INT8/16) neural ops at very high throughput per watt. In theory, it could accelerate LLM inference a lot if utilized, and there have been demos of smaller models running on the ANE in CoreML. Likewise, AMD’s Ryzen AI engine (the XDNA NPU in the HX370) is a dedicated block stemming from Xilinx technology, intended to speed up AI tasks without using the CPU/GPU. Software support is the big catch: using these NPUs requires specific frameworks and optimizations. It’s still early days, but progress is being made. For instance, LM Studio’s developer has demonstrated a build running on a Qualcomm Snapdragon’s NPU (on a Snapdragon X Elite PC), and early reports suggest it works: perhaps not faster than the GPU yet, but much more power-efficient[19][20]. Support for NPUs is coming in steps: Qualcomm and Microsoft have been working on enabling the Windows AI NPUs; Apple’s macOS CoreML can theoretically schedule models on the ANE; and AMD has indicated it’s possible to use their Ryzen AI for LLMs (they had a prototype using Xilinx libraries for 4-bit models)[21]. In a discussion on llama.cpp development, Qualcomm engineers noted NPU support is being looked at, but it’s complex and not yet ready (no ETA as of late 2024)[7][22]. The good news is that this is an active area of research, so future versions of local LLM runtimes (LM Studio, llama.cpp, etc.) may incorporate NPU acceleration, which should specifically target the prompt/prefill phase for speedups[2][23].
Apart from NPUs, other future improvements include: better model architectures and algorithms. For example, some models use Multi-Query Attention (MQA) which uses a single key/value head for all attention heads – this reduces memory usage and can speed up attention, especially for long prompts, by cutting down on the size of the cache that must be read each time. Techniques like FlashAttention (an optimized attention algorithm) don’t change the big-O complexity but execute the attention calculation much faster and more memory-efficiently; those help more with long contexts and are being integrated into frameworks. There’s also exploration of speculative decoding and other clever decoding strategies – though those mainly target the generation phase rather than initial prompt processing. Another angle is model compression and distillation: if models can be made smaller or more efficient (via smarter training or architecture changes) while retaining quality, then of course the prompt processing gets faster simply because there’s less to compute. We’re seeing some movement towards more efficient architectures that might ease the load (for example, Mistral 7B achieving parity with larger 13B models means you can use a smaller model for similar results, cutting compute). In the near term, however, the biggest wins will likely come from software-hardware co-design – essentially what llm.npu is doing: splitting the work across the best-suited hardware units and optimizing the execution pattern.
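To see why reducing the number of key/value heads matters for long prompts, it helps to compute the cache size directly. The sketch below uses a hypothetical model configuration (32 layers, 32 attention heads, head dimension 128, 16-bit cache entries) chosen only for round numbers; it does not describe Qwen or any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 32768  # hypothetical long prompt

# Multi-head attention: every attention head has its own keys and values.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)
# Multi-query attention: one key/value head shared by all query heads.
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=SEQ_LEN)

print(f"MHA cache: {mha / GIB:.1f} GiB")
print(f"MQA cache: {mqa / GIB:.1f} GiB")
```

For this made-up configuration the cache shrinks by a factor equal to the number of heads. Grouped-query attention, which sits between the two, can be modeled with the same function by passing an intermediate n_kv_heads value.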
To summarize the future outlook: Specialized AI accelerators (NPUs) are poised to speed up the slow prompt-processing stage by handling bulk matrix math in low precision very fast. Early research has demonstrated huge speedups in prefilling by chunking the prompt and offloading to NPUs[24][17]. Frameworks and tools are being developed to tap into NPUs on devices like your Mac’s Neural Engine or the AMD Ryzen AI engine, but they are not mainstream yet. We should keep an eye on LM Studio updates or plugins. For example, if LM Studio gains a CoreML/Neural Engine backend (its existing MLX engine is Apple’s array framework for Apple silicon and targets the GPU rather than the ANE) or Qualcomm’s AI Engine support, you might see options to use the NPU. It’s an area under active development[25][22].
The prompt processing delay, especially with a 30B model, is somewhat expected given the physics of the computation. But with ongoing optimizations – from better utilization of GPUs to new NPU offloading techniques – we can expect the time-to-first-token to improve significantly in the next generations of local LLM software. As one community member optimistically put it, the improvements in memory handling and upcoming NPU support are steps in the right direction to make local LLMs faster and more efficient[26].
Sources:
- Andreas Kunar, “Do we need GPUs/NPUs for local AI?” – Explains prompt processing vs. generation bottlenecks and the importance of compute vs. memory bandwidth[1][3]. Also discusses quantization benefits[10].
- Xu et al., “Fast On-device LLM Inference with NPUs” (ASPLOS 2025) – Research paper introducing llm.npu, which offloads the prefill stage to mobile NPUs with chunked prompts, achieving >22× speedups in prompt processing[12][17]. Provides insights into why prefill is often the latency bottleneck[13][2].
- Reddit discussion and llama.cpp GitHub – Notes on LM Studio and llama.cpp efforts to support NPUs. Highlights that Qualcomm and others are working on NPU backends (no ETA) and that prompt processing might improve if NPUs can be used[20][7]. Also confirms that AMD’s Ryzen AI (HX370) and Apple’s Neural Engine are on the radar for future support[25][21].
- User experiences – Reports of GPU vs CPU utilization in local LLMs, showing prompt processing is compute-heavy (GPU pegged briefly) and generation can be memory-bound (GPU utilization drops due to waiting on memory)[27]. These illustrate the current behavior and the effect of recent software updates on performance[8].
[1] [3] [4] [5] [6] [9] [10] [11] Do we need GPUs/NPUs for local AI? | by Andreas Kunar | Medium
https://medium.com/@andreask_75652/do-we-need-gpus-npus-for-local-ai-b6cd9b60f00c
[2] [12] [13] [14] [15] [16] [17] [18] [23] [24] Fast On-device LLM Inference with NPUs
https://arxiv.org/html/2407.05858v2
[7] [22] [25] [26] Performance of llama.cpp on Snapdragon X Elite/Plus · ggml-org llama.cpp · Discussion #8273 · GitHub
https://github.com/ggml-org/llama.cpp/discussions/8273
[8] [27] Local AI very low GPU usage. Bottleneck or setup problem? (Solved) - Machine Learning, LLMs, & AI - Level1Techs Forums
[19] [20] [21] LM Studio running on NPU, finally! (Qualcomm Snapdragon's Copilot+ PC ) : r/LocalLLaMA
https://www.reddit.com/r/LocalLLaMA/comments/1h5eyb8/lm_studio_running_on_npu_finally_qualcomm/
Aided by GPT-5 Deep Research