Posted by Mark Sherwood – Senior Product Manager and Juhyun Lee – Staff Software Engineer
TensorFlow Lite has been a powerful tool for on-device machine learning since its release in 2017, and MediaPipe further extended that power in 2019 by supporting complete ML pipelines. While these tools initially focused on smaller on-device models, today marks a dramatic shift with the experimental MediaPipe LLM Inference API.
This new release allows Large Language Models (LLMs) to run fully on-device across platforms. This new capability is particularly transformative considering the memory and compute demands of LLMs, which are over 100 times larger than traditional on-device models. Optimizations across the on-device stack make this possible, including new ops, quantization, caching, and weight sharing.
The experimental cross-platform MediaPipe LLM Inference API, designed to streamline on-device LLM integration for web developers, supports Web, Android, and iOS with initial support for four openly available LLMs: Gemma, Phi 2, Falcon, and Stable LM. It gives researchers and developers the flexibility to prototype and test popular openly available LLM models on-device.
On Android, the MediaPipe LLM Inference API is intended for experimental and research use only. Production applications with LLMs can use the Gemini API or Gemini Nano on-device via Android AICore. AICore is the new system-level capability introduced in Android 14 to provide Gemini-powered solutions for high-end devices, including integrations with the latest ML accelerators, use-case optimized LoRA adapters, and safety filters. To start using Gemini Nano on-device with your app, apply to the Early Access Preview.
LLM Inference API
Starting today, you can try out the MediaPipe LLM Inference API via our web demo or by building our sample demo apps. You can experiment and integrate it into your projects via our Web, Android, or iOS SDKs.
Using the LLM Inference API allows you to bring LLMs on-device in just a few steps. These steps apply across web, iOS, and Android, though the SDK and native API will be platform specific. The following code samples show the web SDK.
1. Pick model weights compatible with one of our supported model architectures
2. Convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(...)
converter.convert_checkpoint(config)
3. Include the LLM Inference SDK in your application
import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai";
4. Host the TensorFlow Lite Flatbuffer along with your application.
5. Use the LLM Inference API to take a text prompt and get a text response from your model.
const fileset = await FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
const llmInference = await LlmInference.createFromModelPath(fileset, "model.bin");
const responseText = await llmInference.generateResponse("Hello, nice to meet you");
document.getElementById('output').textContent = responseText;
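For integration into a page, you would typically create the task once and reuse it for every prompt. The sketch below only rearranges the calls already shown above into that pattern; the "model.bin" path and the 'output' element id are carried over from the snippet as assumptions, not requirements of the API.

// Minimal sketch: create the LLM task once, then reuse it for each prompt.
// "model.bin" and the 'output' element id are assumptions from the snippet above.
let llmInference;

async function initLlm() {
  const fileset = await FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
  llmInference = await LlmInference.createFromModelPath(fileset, "model.bin");
}

async function ask(prompt) {
  if (!llmInference) {
    await initLlm();  // lazily create the task on first use
  }
  const responseText = await llmInference.generateResponse(prompt);
  document.getElementById("output").textContent = responseText;
}

ask("Hello, nice to meet you");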
Please see our documentation and code examples for a detailed walkthrough of each of these steps.
Here are real-time GIFs of Gemma 2B running via the MediaPipe LLM Inference API.
Gemma 2B running on-device in browser via the MediaPipe LLM Inference API
Gemma 2B running on-device on iOS (left) and Android (right) via the MediaPipe LLM Inference API
Models
Our initial release supports the following four model architectures. Any model weights compatible with these architectures will work with the LLM Inference API. Use the base model weights, use a community fine-tuned version of the weights, or fine-tune weights using your own data.
Model | Parameter Size
Falcon 1B | 1.3 Billion
Gemma 2B | 2.5 Billion
Phi 2 | 2.7 Billion
Stable LM 3B | 2.8 Billion
Model Performance
Through significant optimizations, some of which are detailed below, the MediaPipe LLM Inference API is able to deliver state-of-the-art latency on-device, focusing on CPU and GPU to support multiple platforms. For sustained performance in a production setting on select premium phones, Android AICore can take advantage of hardware-specific neural accelerators.
When measuring latency for an LLM, there are several terms and measurements to consider. Time to First Token and Decode Speed are the two most meaningful, as they measure how quickly you get the start of your response and how quickly the response generates once it begins.
Term | Significance | Measurement
Token | LLMs use tokens rather than words as inputs and outputs. Each model used with the LLM Inference API has a built-in tokenizer which converts between words and tokens. | 100 English words ≈ 130 tokens. However, the conversion depends on the specific LLM and the language.
Max Tokens | The maximum total tokens for the LLM prompt + response. | Configured in the LLM Inference API at runtime (see the sketch below this table).
Time to First Token | Time between calling the LLM Inference API and receiving the first token of the response. | Max Tokens / Prefill Speed
Prefill Speed | How quickly a prompt is processed by an LLM. | Model and device specific. Benchmark numbers below.
Decode Speed | How quickly a response is generated by an LLM. | Model and device specific. Benchmark numbers below.
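Max Tokens is set when the task is created. The sketch below uses the web SDK's createFromOptions entry point; the exact option names (maxTokens, topK, temperature, randomSeed) are our assumption based on the current @mediapipe/tasks-genai package and may differ between SDK versions.

// Sketch: configuring Max Tokens and sampling parameters at task creation time.
// Option names are assumptions based on the current web SDK and may change.
const fileset = await FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
const llmInference = await LlmInference.createFromOptions(fileset, {
  baseOptions: { modelAssetPath: "model.bin" },
  maxTokens: 1280,    // maximum prompt + response tokens, as defined above
  topK: 40,
  temperature: 0.8,
  randomSeed: 101
});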
Prefill Speed and Decode Speed are dependent on the model, hardware, and max tokens. They can also change depending on the current load of the device.
The following speeds were taken on high-end devices using a max tokens of 1280 tokens, an input prompt of 1024 tokens, and int8 weight quantization. The exception is Gemma 2B (int4), found here on Kaggle, which uses a mixed 4/8-bit weight quantization.
Benchmarks
On the GPU, Falcon 1B and Phi 2 use fp32 activations, while Gemma and StableLM 3B use fp16 activations, as the latter models showed greater robustness to precision loss according to our quality eval studies. The lowest bit activation data type that maintained model quality was chosen for each. Note that Gemma 2B (int4) was the only model we could run on iOS due to its memory constraints, and we are working on enabling other models on iOS as well.
Performance Optimizations
To achieve the performance numbers above, numerous optimizations were made across MediaPipe, TensorFlow Lite, XNNPack (our CPU neural network operator library), and our GPU-accelerated runtime. The following are a select few that resulted in meaningful performance improvements.
Weights Sharing: The LLM inference process consists of two phases: a prefill phase and a decode phase. Traditionally, this setup would require two separate inference contexts, each independently managing resources for its corresponding ML model. Given the memory demands of LLMs, we have added a feature that allows sharing the weights and the KV cache across inference contexts. Although sharing weights might seem straightforward, it has significant performance implications when sharing between compute-bound and memory-bound operations. In typical ML inference scenarios, where weights are not shared with other operators, they are meticulously configured for each fully connected operator separately to ensure optimal performance. Sharing weights with another operator implies a loss of per-operator optimization, and this mandates the authoring of new kernel implementations that can run efficiently even on sub-optimal weights.
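To make the idea concrete, here is a purely conceptual sketch of what sharing implies: the prefill and decode contexts hold references to the same weight and KV cache buffers rather than each owning a private copy. The names and structure are illustrative only and do not reflect MediaPipe's internal implementation.

// Conceptual illustration only; not MediaPipe's internal API.
interface SharedLlmState {
  weights: Int8Array;      // single copy of the quantized weights in memory
  kvCache: Float32Array;   // key/value cache written by prefill, read by decode
}

class PrefillContext {
  constructor(private state: SharedLlmState) {}
  run(promptTokens: Int32Array): void {
    // Compute-bound phase: processes the whole prompt and fills this.state.kvCache.
  }
}

class DecodeContext {
  constructor(private state: SharedLlmState) {}
  step(lastToken: number): number {
    // Memory-bound phase: streams this.state.weights and this.state.kvCache to emit one token.
    return lastToken;
  }
}

// Both contexts reference the same buffers, so the weights are held only once.
const shared: SharedLlmState = { weights: new Int8Array(0), kvCache: new Float32Array(0) };
const prefill = new PrefillContext(shared);
const decode = new DecodeContext(shared);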
Optimized Fully Connected Ops: XNNPack's FULLY_CONNECTED operation has undergone two significant optimizations for LLM inference. First, dynamic range quantization seamlessly merges the computational and memory benefits of full integer quantization with the precision advantages of floating-point inference. The use of int8/int4 weights not only enhances memory throughput but also achieves remarkable performance, especially with the efficient, in-register decoding of 4-bit weights requiring just one additional instruction. Second, we actively leverage the I8MM instructions in ARM v9 CPUs, which enable the multiplication of a 2×8 int8 matrix by an 8×2 int8 matrix in a single instruction, resulting in twice the speed of the NEON dot product-based implementation.
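For intuition about the I8MM path, a single instruction multiplies a 2×8 int8 block by an 8×2 int8 block and accumulates into a 2×2 block of int32 results. The scalar reference below spells out that arithmetic; it is an illustration of the operation the hardware performs, not the XNNPack kernel itself.

// Scalar reference for one 2x8 by 8x2 int8 product with int32 accumulation,
// i.e. the work a single I8MM instruction performs (illustrative only).
function i8mmTile(a: Int8Array, b: Int8Array, acc: Int32Array): void {
  // a is 2x8 row-major, b is 8x2 row-major, acc is 2x2 row-major.
  for (let i = 0; i < 2; i++) {
    for (let j = 0; j < 2; j++) {
      let sum = 0;
      for (let k = 0; k < 8; k++) {
        sum += a[i * 8 + k] * b[k * 2 + j];
      }
      acc[i * 2 + j] += sum;  // accumulate in int32
    }
  }
}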
Balancing Compute and Memory: Upon profiling the LLM inference, we identified distinct limitations for both phases: the prefill phase is limited by compute capacity, while the decode phase is constrained by memory bandwidth. Consequently, each phase employs different strategies for dequantization of the shared int8/int4 weights. In the prefill phase, each convolution operator first dequantizes the weights into floating-point values before the primary computation, ensuring optimal performance for computationally intensive convolutions. Conversely, the decode phase minimizes memory bandwidth by adding the dequantization computation to the main mathematical convolution operations.
During the compute-intensive prefill phase, the int4 weights are dequantized a priori for optimal CONV_2D computation. In the memory-intensive decode phase, dequantization is performed on the fly, along with CONV_2D computation, to minimize memory bandwidth usage.
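The contrast between the two strategies can be sketched as follows; this is an illustrative scalar model of the idea, not the actual TensorFlow Lite or XNNPack kernels, and the single per-tensor scale is a simplifying assumption.

// Illustrative only: two ways to combine dequantization with a dot product.

// Prefill-style (compute-bound): dequantize the whole weight vector up front,
// then run the heavy computation against the float copy.
function prefillStyle(wQuant: Int8Array, scale: number, activations: Float32Array): number {
  const wFloat = Float32Array.from(wQuant, (q) => q * scale);  // one-time dequantization
  let acc = 0;
  for (let i = 0; i < wFloat.length; i++) {
    acc += wFloat[i] * activations[i];
  }
  return acc;
}

// Decode-style (memory-bound): keep the weights quantized and dequantize on the fly
// inside the inner loop, trading a little extra compute for less memory traffic.
function decodeStyle(wQuant: Int8Array, scale: number, activations: Float32Array): number {
  let acc = 0;
  for (let i = 0; i < wQuant.length; i++) {
    acc += (wQuant[i] * scale) * activations[i];
  }
  return acc;
}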
Custom Operators: For GPU-accelerated LLM inference on-device, we rely extensively on custom operations to mitigate the inefficiency caused by numerous small shaders. These custom ops allow for special operator fusions and various LLM parameters, such as token ID, sequence patch size, and sampling parameters, to be packed into a specialized custom tensor used mostly within these specialized operations.
Pseudo-Dynamism: In the attention block, we encounter dynamic operations that increase over time as the context grows. Since our GPU runtime lacks support for dynamic ops/tensors, we opt for fixed operations with a predefined maximum cache size. To reduce the computational complexity, we introduce a parameter enabling the skipping of certain value calculations or the processing of reduced data, as sketched below.
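As a rough illustration of that fixed-size approach, attention can run over a cache preallocated at the predefined maximum size while only the currently valid prefix is processed. The sketch below is conceptual and does not reflect the GPU runtime's actual kernels; the names are ours.

// Conceptual sketch: fixed-capacity KV cache plus a validLength parameter that
// lets the kernel skip positions beyond the current context (illustrative only).
const MAX_CACHE_SIZE = 1280;  // predefined maximum, fixed when the graph is built

interface FixedKvCache {
  keys: Float32Array;         // MAX_CACHE_SIZE * headDim entries, preallocated
  validLength: number;        // grows as the context grows
}

function attentionScores(query: Float32Array, cache: FixedKvCache, headDim: number): Float32Array {
  const scores = new Float32Array(MAX_CACHE_SIZE);
  // Only the first validLength positions are computed; the rest are skipped,
  // which is the "reduced data" processing described above.
  for (let pos = 0; pos < cache.validLength; pos++) {
    let dot = 0;
    for (let d = 0; d < headDim; d++) {
      dot += query[d] * cache.keys[pos * headDim + d];
    }
    scores[pos] = dot;
  }
  return scores;  // softmax and value weighting omitted for brevity
}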
Optimized KV Cache Layout: Since the entries in the KV cache ultimately serve as weights for convolutions, employed in lieu of matrix multiplications, we store them in a specialized layout tailored for convolution weights. This strategic adjustment eliminates the need for extra conversions or reliance on unoptimized layouts, and therefore contributes to a more efficient and streamlined process.
What's Next
We're thrilled with the optimizations and the performance in today's experimental release of the MediaPipe LLM Inference API. This is just the start. Over 2024, we will expand to more platforms and models, offer broader conversion tools, complementary on-device components, high-level tasks, and more.
You can check out the official sample on GitHub demonstrating everything you've just learned about and read through our official documentation for even more details. Keep an eye on the Google for Developers YouTube channel for updates and tutorials.
Acknowledgements
We'd like to thank all team members who contributed to this work: T.J. Alumbaugh, Alek Andreev, Frank Ban, Jeanine Banks, Frank Barchard, Pulkit Bhuwalka, Buck Bourdon, Maxime Brénon, Chuo-Ling Chang, Yu-hui Chen, Linkun Chen, Lin Chen, Nikolai Chinaev, Clark Duvall, Rosário Fernandes, Mig Gerard, Matthias Grundmann, Ayush Gupta, Mohammadreza Heydary, Ekaterina Ignasheva, Ram Iyengar, Grant Jensen, Alex Kanaukou, Prianka Liz Kariat, Alan Kelly, Kathleen Kenealy, Ho Ko, Sachin Kotwani, Andrei Kulik, Yi-Chun Kuo, Khanh LeViet, Yang Lu, Lalit Singh Manral, Tyler Mullen, Karthik Raveendran, Raman Sarokin, Sebastian Schmidt, Kris Tonthat, Lu Wang, Tris Warkentin, and the Gemma Team.