On-Device AI Models and Core ML Tools: Insights From WWDC 2024

25 Jun 2024

During Apple’s Worldwide Developers Conference (WWDC) 2024, the company presented a number of changes intended to improve the deployment and performance of on-device AI models. Among the most important were major enhancements to Core ML Tools, which shipped in the 8.0b1 pre-release.

These updates aim to make deploying machine learning (ML) models on Apple devices more efficient and effective. Here is a breakdown of these innovations, how they affect developers, and what they offer end users.

Key Terminology Explained

Before diving into the updates, let's clarify some key terms:

Palettization

This technique reduces the precision of a model's weights by grouping them into clusters and representing each cluster with a single value. It is like painting with a limited palette, where one color stands in for a whole range of similar shades. Each weight is then stored as a small index into a look-up table of cluster values, which significantly reduces the size of the model.
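To make the idea concrete, here is a minimal sketch of palettization in plain Python. It is an illustration only, not the coremltools implementation: the function names palettize and depalettize, the toy k-means loop, and the sample weights are all assumptions made for this example.

```python
def palettize(weights, n_bits=2, iterations=20):
    """Cluster 1-D weights into 2**n_bits values using a toy k-means."""
    k = 2 ** n_bits
    lo, hi = min(weights), max(weights)
    # Start with centroids spread evenly across the weight range.
    lut = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]

    def nearest(w):
        # Index of the centroid closest to weight w.
        return min(range(k), key=lambda c: abs(w - lut[c]))

    for _ in range(iterations):
        indices = [nearest(w) for w in weights]
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:
                lut[c] = sum(members) / len(members)

    return lut, [nearest(w) for w in weights]


def depalettize(lut, indices):
    """Rebuild approximate weights by looking each index up in the table."""
    return [lut[i] for i in indices]


# Hypothetical weights: eight floats become eight 2-bit indices plus a 4-entry table.
weights = [0.11, 0.09, -0.52, 0.10, -0.49, 0.88, 0.91, -0.50]
lut, indices = palettize(weights, n_bits=2)
restored = depalettize(lut, indices)
```

The restored weights contain at most four distinct values, which is the "limited palette" described above.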

Quantization

Quantization reduces the precision of weights and activations from floating-point numbers, such as 32-bit floats, to lower-precision numbers, such as 8-bit integers. This compression technique reduces the model size and can also speed up inference on hardware that supports lower-precision arithmetic.
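As a rough sketch of what this means in practice, the following plain-Python example applies symmetric linear quantization to a handful of weights. The helper names and the sample values are assumptions for illustration; real converters handle details such as zero points and per-channel scales that are left out here.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    # The largest magnitude maps to 127; fall back to 1.0 for all-zero input.
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical weights.
weights = [0.42, -1.27, 0.003, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half a quantization step.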

Blockwise Quantization

This variant of quantization divides the weights of the model into smaller blocks and quantizes each block separately, with its own scale. Because each scale only has to cover the values in its own block, accuracy is usually better than with a single scale for the whole tensor.
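The effect is easiest to see on a tensor where small and large weights sit side by side. The sketch below, in plain Python, compares one scale for the whole tensor against one scale per block of four weights at 4-bit precision. The function names, the block size, and the sample weights are assumptions chosen for illustration.

```python
def fake_quantize(values, n_bits=4):
    """Quantize symmetrically with one scale, then dequantize again."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def fake_quantize_blockwise(values, block_size, n_bits=4):
    """Quantize each block of block_size values with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize(values[start:start + block_size], n_bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights: the first block is small, the second is large.
weights = [0.02, -0.03, 0.01, 0.04, 0.9, -0.8, 0.7, -0.6]
per_tensor_error = mean_abs_error(weights, fake_quantize(weights))
per_block_error = mean_abs_error(weights, fake_quantize_blockwise(weights, block_size=4))
```

With a single scale, the small weights all round to zero; with a scale per block they keep most of their information, so the blockwise error is lower for this input.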

Pruning

Pruning is a compression technique that eliminates the weights that matter least to the model’s predictions. The least important weights are set to zero, and the resulting sparse weight matrices can be stored efficiently using sparse representations.
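A simple way to picture this is magnitude pruning, where "least important" is approximated by "smallest absolute value". The plain-Python sketch below is an assumption-laden illustration (the helper names, the 50% sparsity target, and the index/value storage format are all invented for this example) rather than how Core ML stores sparse weights.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * sparsity)
    # Positions sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_positions = set(order[:n_prune])
    return [0.0 if i in pruned_positions else w for i, w in enumerate(weights)]


def to_sparse(weights):
    """Keep only (index, value) pairs for the non-zero weights."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def to_dense(pairs, length):
    """Rebuild the full weight list from its sparse form."""
    dense = [0.0] * length
    for i, w in pairs:
        dense[i] = w
    return dense


# Hypothetical weights.
weights = [0.8, -0.01, 0.02, -0.6, 0.005, 0.3, -0.04, 0.9]
pruned = magnitude_prune(weights, sparsity=0.5)
sparse = to_sparse(pruned)
```

Only the four surviving weights and their positions need to be stored, and to_dense(sparse, len(weights)) recovers the pruned tensor exactly.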

Stateful Models

Stateful models keep track of information that needs to be passed across multiple runs of the model. In other words, they retain context between predictions. This is especially important for tasks such as language modeling, where the model needs to remember the words it has already generated in order to produce the next ones coherently.
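The difference between the two styles can be sketched with a toy accumulator in plain Python. The class names below are hypothetical and this is not Core ML code; the point is only who is responsible for carrying the state from one call to the next.

```python
class StatelessAccumulator:
    """The caller must pass the previous state in and carry the new state out."""

    def predict(self, x, running_total):
        new_total = running_total + x
        return new_total, new_total  # (output, updated state)


class StatefulAccumulator:
    """The model owns its state and updates it in place on every call."""

    def __init__(self):
        self.state = 0.0

    def predict(self, x):
        self.state += x
        return self.state


# Stateless: the state is threaded through every call by hand.
stateless = StatelessAccumulator()
total = 0.0
stateless_outputs = []
for x in [1.0, 2.0, 3.0]:
    output, total = stateless.predict(x, total)
    stateless_outputs.append(output)

# Stateful: the caller only supplies the new input.
stateful = StatefulAccumulator()
stateful_outputs = [stateful.predict(x) for x in [1.0, 2.0, 3.0]]
```

Both versions produce the same outputs, but the stateful one avoids copying the state in and out of the model on every run, which matters when the state is large, as with the cached context of a language model.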

What Are Core ML Tools?

Core ML Tools (coremltools) is a Python package for converting third-party models to formats suitable for Core ML (Apple’s framework for integrating machine learning models into apps). It supports conversion from popular libraries such as TensorFlow and PyTorch into the Core ML model package format.

The coremltools package allows you to:

Convert trained models from various libraries and frameworks to the Core ML model package format.

Read, write, and optimize Core ML models with the goal of reducing storage space, lowering power consumption, and minimizing inference latency.

Verify creation and conversion by making predictions using Core ML on macOS.

Core ML provides a unified representation for all models, allowing your app to use Core ML APIs and user data to make predictions and to fine-tune models directly on the user’s device. This approach removes the need for a network connection, keeps user data private, and makes your app more responsive. Core ML optimizes on-device performance by leveraging the CPU, GPU, and Neural Engine (NE) all while minimizing memory footprint and power consumption.

Core ML Tools Updates

With the theory and terminology covered, it is time to dive into the new features and changes in the Core ML Tools 8.0b1 pre-release.

New Utilities and Stateful Models

The new coremltools.utils.MultiFunctionDescriptor() and coremltools.utils.save_multifunction utilities simplify the creation of ML programs with multiple functions that share weights with each other. This makes models more versatile and easier to use, because an app can load only the specific function it needs for a prediction.

Core ML Tools now also supports stateful models: the converter has been updated to generate models with the new State Type, which was introduced in iOS 18 and macOS 15. These models can maintain information from one inference run to another, which is especially useful for tasks where the model needs to remember the inputs it saw in the past.

Advanced Compression Techniques

Core ML Tools has expanded its compression capabilities to shrink model sizes while maintaining performance. The updated coremltools.optimize module now supports:

Blockwise Quantization: Allows for more precise control of quantization since the weights of the model are partitioned into smaller sections and quantized separately.

Grouped Channel-wise Palettization: Splits the channels of a weight tensor into groups and gives each group its own look-up table, instead of sharing one table across the whole tensor, which provides more flexibility and better accuracy.

4-bit Weight Quantization: Reduces storage needs by half compared to 8-bit quantization, further decreasing the size of the model.

3-bit Palettization: Expands the bit-depth options for palettization, using only three bits per weight to index up to eight cluster values, which provides an opportunity for higher compression.

These techniques, together with joint compression modes such as 8-bit look-up tables (LUTs) for palettization and weight pruning combined with quantization or palettization, offer efficient tools to reduce model size and enhance performance.
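A back-of-the-envelope calculation shows how these bit depths translate into storage. The sketch below is plain arithmetic under stated assumptions: a hypothetical layer with one million weights, 16-bit look-up table entries, and no accounting for scales, block metadata, or file format overhead.

```python
def quantized_size_bytes(n_weights, n_bits):
    """Storage for the weights alone at n_bits per weight."""
    return n_weights * n_bits / 8


def palettized_size_bytes(n_weights, n_bits, lut_entry_bits=16):
    """Storage for the per-weight indices plus one look-up table."""
    index_bytes = n_weights * n_bits / 8
    lut_bytes = (2 ** n_bits) * lut_entry_bits / 8
    return index_bytes + lut_bytes


n_weights = 1_000_000  # hypothetical layer size

sizes = {
    "float16": quantized_size_bytes(n_weights, 16),
    "8-bit quantization": quantized_size_bytes(n_weights, 8),
    "4-bit quantization": quantized_size_bytes(n_weights, 4),
    "3-bit palettization": palettized_size_bytes(n_weights, 3),
}

for name, size in sizes.items():
    print(f"{name}: {size / 1000:.1f} kB")
```

Under these assumptions the 4-bit weights take half the space of the 8-bit ones, and the 3-bit palettized layer is smaller still because the eight-entry table adds only a few bytes.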

Advanced API Improvements: Compression and Quantization

The coremltools.optimize module has significant API updates to support advanced compression techniques. For example, a new API for activation quantization uses calibration data to turn a W16A16 Core ML model (16-bit weights and activations) into a W8A8 model (8-bit weights and activations), improving efficiency while aiming to retain accuracy. Additionally, updates to coremltools.optimize.torch introduced data-free compression methods alongside methods based on calibration data, which makes PyTorch model optimization for Core ML easier.
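Unlike weights, activations are not known ahead of time, which is why their quantization range has to be estimated from sample inputs. The plain-Python sketch below illustrates that calibration idea only; it does not use the coremltools API, and the function names and sample activations are assumptions made for this example.

```python
def calibrate_scale(calibration_batches, n_bits=8):
    """Pick a scale from the largest activation magnitude seen during calibration."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for 8-bit
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return (max_abs or 1.0) / qmax


def quantize_activations(activations, scale, n_bits=8):
    """Quantize activations with a fixed, pre-calibrated scale."""
    qmax = 2 ** (n_bits - 1) - 1
    # Values outside the calibrated range are clamped.
    return [max(-qmax, min(qmax, round(a / scale))) for a in activations]


# Hypothetical activations recorded while running representative inputs.
calibration_batches = [[0.5, -1.0], [2.54, 0.1]]
scale = calibrate_scale(calibration_batches)

# At inference time the scale is fixed; -5.0 lies outside the calibrated range.
quantized = quantize_activations([1.0, -5.0, 0.0], scale)
```

The clamped value shows why the calibration data should be representative: activations outside the range seen during calibration lose information.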

iOS 18/macOS 15 Optimizations

The latest operating systems support new operations such as constexpr_blockwise_shift_scale, constexpr_lut_to_dense, and constexpr_sparse_to_dense, which are crucial for efficient model compression. Updates to the Gated Recurrent Unit (GRU) operations and the addition of the PyTorch scaled_dot_product_attention operation help transformer models and other complex architectures run well on Apple silicon. These updates ensure more efficient execution and better utilization of hardware capabilities.

Experimental Torch Export Conversion

The experimental torch.export conversion support makes it possible to convert a model from PyTorch to Core ML directly.

This process involves:

Import the necessary libraries.
Export the PyTorch model using torch.export.
Convert the exported program into a Core ML model by using coremltools.convert.

This simplified process reduces the complexity of deploying PyTorch models on Apple devices while taking advantage of the enhanced performance of Core ML.
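The three steps above can be sketched roughly as follows. This is a minimal outline, assuming torch and coremltools 8.0b1 or later are installed; the wrapper function convert_with_torch_export and its parameters are hypothetical names introduced here, and because the feature is experimental the exact behavior may differ between versions.

```python
def convert_with_torch_export(torch_model, example_inputs):
    """Export a PyTorch module with torch.export and convert it to Core ML.

    example_inputs is a tuple of example tensors matching the model's inputs.
    """
    # Step 1: import the libraries (done lazily here so that this file can be
    # loaded on machines without torch or coremltools installed).
    import torch
    import coremltools as ct

    # Step 2: export the PyTorch model in inference mode.
    torch_model.eval()
    exported_program = torch.export.export(torch_model, example_inputs)

    # Step 3: convert the exported program into a Core ML model.
    return ct.convert(exported_program)
```

A call might look like convert_with_torch_export(my_model, (torch.rand(1, 3, 224, 224),)), where my_model and the input shape are placeholders for your own network.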

Multifunction Models

Support for multifunction models in Core ML Tools allows merging models with shared weights into a single ML program. This is advantageous for applications requiring multiple tasks, such as combining a feature extractor with classifiers and regressors. The MultiFunctionDescriptor and save_multifunction utilities ensure that the shared weights are not duplicated, which saves storage space and improves performance.
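As a rough sketch of how the two utilities fit together, the function below merges two already-converted model packages into one multifunction package. It assumes coremltools 8.0b1 or later is installed; the wrapper name merge_into_multifunction, the file paths passed to it, and the target function names "classify" and "regress" are hypothetical choices for this example.

```python
def merge_into_multifunction(classifier_path, regressor_path, output_path):
    """Merge two .mlpackage models into one multifunction model package."""
    # Imported lazily so that this file loads without coremltools installed.
    import coremltools as ct

    desc = ct.utils.MultiFunctionDescriptor()

    # Each source model exposes a "main" function; give each a distinct name
    # in the merged model.
    desc.add_function(
        classifier_path,
        src_function_name="main",
        target_function_name="classify",
    )
    desc.add_function(
        regressor_path,
        src_function_name="main",
        target_function_name="regress",
    )

    # Function used when the caller does not ask for a specific one.
    desc.default_function_name = "classify"

    # Weights shared by both source models are stored once in the output.
    ct.utils.save_multifunction(desc, output_path)
```

After saving, a specific function can be selected when the model is loaded, for example with ct.models.MLModel(output_path, function_name="regress").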

Performance Enhancements and Bug Fixes

Core ML Tools 8.0b1 includes various bug fixes, enhancements, and optimizations that make the development experience smoother. Known issues such as conversion failures with certain palettization modes and incorrect quantization scales have been fixed, improving the reliability and accuracy of compressed models.

Benefits to End Users

The enhancements in coremltools 8.0b1 pre-release bring several major benefits to end users, improving the overall experience with AI-powered applications:

Improved Performance: Smaller, optimized models load and run faster on devices, so apps respond more quickly and interactions feel smoother.

Reduced App Sizes: Compressed models use less space, so apps can be lighter and more storage-efficient, which is especially helpful for users with limited space on their mobile devices.

Enhanced Functionality: Multifunction and stateful models make more complex features possible, allowing apps to offer more sophisticated and more intelligent behavior.

Better Battery Life: Optimized model execution consumes less energy, which extends battery life on mobile devices during heavy AI workloads.

Enhanced Privacy: Running AI models on the device allows user data to be processed locally, so it does not need to be transmitted to external servers.

Conclusion

The pre-release of coremltools 8.0b1 represents a significant step forward in on-device AI model deployment. Now, developers can create more efficient, compact, and versatile ML models with enhanced compression techniques, stateful model support, and multifunction model utilities. These advancements highlight Apple's commitment to providing robust tools for developers to leverage the power of Apple silicon, ultimately delivering faster, more efficient, and more capable on-device AI applications.

As Core ML and its ecosystem evolve, the possibilities for innovation in AI-powered apps continue to expand, opening the door to more sophisticated and user-friendly experiences.

In an upcoming article, we will demonstrate these new features in a sample project, showing how to apply them in real-life scenarios. Stay tuned!