Written by Tatiana Passali, NLP Engineer, Medoid AI
Introduction
Balancing model accuracy with computational efficiency is a significant challenge in machine learning. Model quantization offers a key solution: it makes models faster and more efficient and enables deployment in resource-constrained environments, typically with little to no drop in accuracy. Whether you’re a developer aiming to optimize your AI applications or simply curious about advanced AI techniques, this guide will help you make your models lighter, faster and more effective.
In this post, we will explore the concept of model quantization, starting with an overview of the different data types used in training neural networks. We will then examine the process of model quantization, its practical benefits and potential limitations. Finally, we will provide a hands-on example with code to demonstrate how to easily implement quantization in your own projects. By the end, you’ll understand how quantization works and how it can optimize your models for faster inference and a smaller memory footprint.
Floating-Point and Integer Data Formats
To get started, let's break down the different data types that neural networks use to store model weights. Neural networks typically contain a large number of weights, so the choice of data type can significantly impact performance and computational resources. You might have heard of FP32, FP16, INT8 and INT4, but what do these terms actually mean? FP32 (32-bit floating point), aka single precision, is the gold standard for precision and is commonly used when training machine learning models. FP16 (16-bit floating point), aka half precision, is often preferred for inference as a compromise between accuracy and speed. On the other hand, INT8 (byte) and INT4 (nibble) use fewer bits to store data compared to FP32 and FP16, which enhances speed and memory efficiency but might reduce precision.
Floating point numbers, like FP32 and FP16, adhere to the IEEE 754 standard and consist of three parts:
- Sign: The sign bit is a single bit that determines whether the number is positive or negative.
- Exponent: The exponent scales the number by a power of two, allowing the representation of very large or very small values. In other words, the exponent determines where the binary point falls.
- Fraction (aka mantissa): The fraction indicates the precision of the number. More bits allocated to the fraction increase the precision of the represented number.
In contrast to floating point numbers, integer data types such as INT8 and INT4 use fewer bits and a different structure. More specifically, they use the two’s complement format, the most common method for representing signed integers. In this format, all bits except the leftmost one (sign bit) are used to store the value, making computations much faster but sacrificing some accuracy in the process.
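To make the bit layouts above concrete, here is a small Python sketch (standard library only, written for this post; the example values -6.25 and -42 are arbitrary) that unpacks the sign, exponent, and fraction bits of an FP32 number and prints the two's complement encoding of an INT8 value.
import struct
def fp32_bits(x: float) -> str:
    # Pack x as an IEEE 754 single-precision float and return its 32-bit pattern.
    [raw] = struct.unpack(">I", struct.pack(">f", x))
    return f"{raw:032b}"
bits = fp32_bits(-6.25)
sign, exponent, fraction = bits[0], bits[1:9], bits[9:]
print(f"sign={sign} exponent={exponent} fraction={fraction}")
# sign=1 exponent=10000001 fraction=10010000000000000000000
# Two's complement: INT8 covers [-128, 127], with the leftmost bit acting as the sign bit.
print(f"{-42 & 0xFF:08b}")  # 11010110, the 8-bit two's complement encoding of -42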
FP32 (32-bit Floating Point)
The most common data type used for training neural networks is FP32. It allocates one bit for the sign, 8 bits for the exponent, and 23 bits for the fraction. This allocation offers a high level of precision, which is critical for capturing patterns and complex relationships in the training data. The precision of FP32 ensures that the model learns effectively and achieves high accuracy. However, the downside of FP32 is that it requires significant computational power and memory, making it less suitable for deployment on devices with limited resources.
FP16 (16-bit Floating Point)
FP16 is a lower-precision alternative to FP32, with one bit for the sign, 5 bits for the exponent and 10 bits for the fraction. By decreasing the number of bits required for encoding each number, FP16 helps to speed up computations and reduce memory use. This makes it a common choice for training large models on modern hardware, like GPUs, which typically support mixed-precision training. The main advantage of FP16 is that it achieves a good balance between precision and efficiency, allowing for faster training time without significantly reducing model accuracy.
INT8 (8-bit Integer)
INT8 is an even lower-precision format, requiring only 8 bits to represent each number. This reduction in precision leads to significant improvements in computational speed and memory usage. INT8 is particularly useful during the inference phase, where the model is used to make predictions rather than being trained. By converting model weights and activations to INT8, we can achieve a significantly faster inference time and deploy models on resource-constrained devices. However, the challenge with INT8 is managing the potential loss in accuracy due to the lower precision.
INT4 (4-bit Integer)
INT4 requires only 4 bits to represent each number. This significant reduction in bit allocation offers even greater improvements in computational speed and memory efficiency compared to INT8. INT4 can be particularly useful in very resource-constrained environments or when deploying large-scale models that require significant memory savings. However, the precision loss with INT4 is more pronounced and can lead to a larger drop in accuracy. This makes it suitable for specific applications where speed and memory efficiency are prioritized over the highest possible accuracy.
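To make the footprint differences concrete, the short PyTorch sketch below (written for illustration; the one-million-element tensor is an arbitrary choice, and INT4 is omitted because it is not a native PyTorch dtype) compares the bit width, numeric range, and storage cost of FP32, FP16, and INT8.
import torch
# Bit width and largest representable value of the floating-point types.
for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    print(dtype, "bits:", info.bits, "max:", info.max)
# Bit width and range of INT8.
info = torch.iinfo(torch.int8)
print(torch.int8, "bits:", info.bits, "range:", info.min, "to", info.max)
# Storage for one million weights: about 4 MB in FP32, 2 MB in FP16, 1 MB in INT8.
w = torch.randn(1_000_000)
for dtype in (torch.float32, torch.float16, torch.int8):
    print(dtype, w.to(dtype).element_size() * w.nelement(), "bytes")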
Precision vs. Performance Trade-Off
The main trade-off when selecting a data type is between precision and performance. Higher precision (FP32) ensures better accuracy but at the cost of increased computational load and memory consumption. Lower precision (INT8 or INT4) reduces memory usage but may lead to a slight drop in accuracy. The challenge lies in finding the right balance that meets the specific requirements of your application. For instance, in scenarios where real-time performance is critical, the speed and efficiency gains from using INT8 can outweigh the minor loss in precision.
In the following sections, we will delve deeper into the process of quantization, its benefits, and its limitations, wrapping up with a hands-on example to illustrate its practical application.
What is Model Quantization?
Quantization is a technique used to reduce the computational and memory overhead of a machine learning model by reducing the precision of the numbers used to represent the model’s parameters. Typically, models use 32-bit floating-point numbers, but with quantization we can convert these to 8-bit integers (or 4-bit integers). This can significantly reduce the model size and increase the inference speed, especially on CPUs and other hardware with limited computational resources. While this can lead to a slight reduction in model accuracy, the trade-off is often worthwhile for faster and more efficient deployments.
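At its core, quantization maps floating-point values onto a small integer grid through a scale factor (and, in asymmetric schemes, a zero point), and dequantization maps them back. The snippet below is a minimal, hand-written sketch of symmetric per-tensor INT8 quantization for illustration only; it is not how PyTorch implements quantization internally.
import torch
def quantize_int8(x: torch.Tensor):
    # Map the largest absolute value in the tensor to 127.
    scale = x.abs().max() / 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale
def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
x = torch.randn(5)
q, scale = quantize_int8(x)
print(x)
print(dequantize(q, scale))  # close to x, up to a small rounding (quantization) error
The gap between the two printed tensors is the quantization error that, accumulated across millions of weights, accounts for the accuracy considerations discussed below.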
Practical Benefits of Model Quantization
- Reduced Model Size: Quantization can shrink a model to as little as a quarter of its original size (e.g., FP32 weights stored as INT8). In our experiments conducted on Google Colab (see the next section), the quantization process reduced the model size from over 2GB to less than 1GB; a simple way to measure this yourself is sketched after this list.
- Increased Inference Speed: Quantized models can significantly improve inference speed by reducing the precision of weights and activations, thereby decreasing the computational workload during inference. These improvements are even more noticeable on CPUs, making quantization particularly important in scenarios where access to a GPU is limited or not available. In our experiments, we observed that quantization reduced the inference time by one-third when running on a CPU.
- Scalability: Smaller model sizes and reduced computational demands allow for more efficient use of cloud computing resources and facilitate faster deployment cycles. This scalability is particularly important when rapid prototyping and deployment of machine learning models are required.
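If you want to verify the size reduction on your own model, one simple approach is to serialize the state dict to disk and compare file sizes. The helper below is a rough sketch written for this post (the temporary file name is arbitrary); you can call it on the original and quantized models from the hands-on section.
import os
import torch
def model_size_mb(model: torch.nn.Module, tmp_path: str = "tmp_weights.pt") -> float:
    # Serialize the weights, read the file size in megabytes, then clean up.
    torch.save(model.state_dict(), tmp_path)
    size_mb = os.path.getsize(tmp_path) / 1e6
    os.remove(tmp_path)
    return size_mb
# Usage, once model and model_quantized exist (see the hands-on section below):
# print(model_size_mb(model), model_size_mb(model_quantized))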
Limitations of Model Quantization
While quantization offers significant advantages, it also comes with certain limitations. One of the main concerns is the potential loss of accuracy due to the reduced precision. Some models may experience a noticeable drop in performance, which can be critical depending on the application. At the same time, not all frameworks support quantization, which might lead to compatibility issues.
In the experiments presented here, we observed only a very slight drop in accuracy after applying quantization. This suggests that the impact on accuracy can be kept small, but quantization should still be approached with caution. Numerical precision can influence the model's decisions, particularly in sequence-to-sequence models, where small differences during decoding can send generation down a different path and produce different text outputs. Before applying quantization, it's important to re-evaluate your model to ensure that it doesn't compromise the quality of its outputs.
Hands-On Example: Implementing Model Quantization
Let's explore a hands-on example using PyTorch and the Hugging Face Transformers library. For this example, we will use the financial-summarization-pegasus model, developed by Medoid AI. This model was fine-tuned on a novel financial news dataset of 2K Bloomberg articles covering topics such as stocks, markets, currencies, rates and cryptocurrencies, and it has been highly praised, achieving a top 5 ranking out of 1,737 models on Hugging Face. Feel free to adjust the code to use any model from the Hugging Face Hub or load your own model from a local path.
Load Model and Tokenizer
In this example, we will run inference on a CPU, shifting the model weights from FP32 to INT8. Note that this kind of INT8 quantization is less commonly used on GPUs; in most cases, running inference in FP16 instead of FP32 already provides sufficient speed improvements there.
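As an aside, if you do have a GPU available, loading the weights directly in half precision is often all you need. The snippet below is a hedged side note (it requires a CUDA device) and is not part of the CPU walkthrough that follows.
import torch
from transformers import AutoModelForSeq2SeqLM
# Load the same model in FP16 on a GPU instead of quantizing it.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    'human-centered-summarization/financial-summarization-pegasus',
    torch_dtype=torch.float16,
).to("cuda")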
import torch, time
from torch.quantization import quantize_dynamic
from transformers import AutoConfig, AutoTokenizer, AutoModelForSeq2SeqLM
device = "cpu"
model_name = 'human-centered-summarization/financial-summarization-pegasus'
We will first load the pre-trained model and tokenizer. The tokenizer will convert our text into a format the model can understand. Then, the model will process this input to produce the desired output.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Set Text to Summarize
Here, we set an example financial article for the model to summarize. Note that you can provide any content you want to summarize.
text = '''Abu Dhabi National Oil Co. is close to hiring JPMorgan Chase & Co. and First Abu Dhabi Bank PJSC to help arrange the potential listing of its drilling business, according to people familiar with the matter. Adnoc, as the company is known, is looking to sell a minority stake in its drilling unit in a deal that could value the business at up to $10 billion, the people said, declining to be named because the matter is private. In 2018, when Baker Hughes bought a 5% stake in Adnoc Drilling, that deal valued the company at about $11 billion, including $1 billion of debt. Although the state energy firm has yet to award formal mandates, the two banks are in pole position for a role on the IPO at the Abu Dhabi Securities Exchange, the people said. Adnoc may also appoint additional advisers, they said. Adnoc and JPMorgan declined to comment. FAB didn't immediately respond to emails seeking comment. Alongside tapping new revenue sources, Abu Dhabi is looking to revive its dormant stock market by bringing in local or international investors. Government entities such as Adnoc, Mubadala Investment Co. and ADQ have also been exploring different ways to raise cash for their owner. Adnoc recently picked banks for the initial public offering of a fertilizer joint venture called Fertiglobe, while wealth fund Mubadala hired advisers for the listing of satellite operator Yahsat. Abu Dhabi is the capital of the United Arab Emirates and holds most of the country's crude deposits. The UAE is the third biggest producer in the Organization of Petroleum Exporting Countries, behind Saudi Arabia and Iraq. Adnoc's drilling division is responsible for unlocking the UAE's oil and gas resources on land and at sea, according to its website. It has a fleet of 95 rigs in the Middle East and a workforce of about 7,000 engineers. In recent years, international and local funds have invested more than $20 billion in Adnoc assets such as pipelines and property. Last June, the company sold leasing rights over natural-gas pipelines to a consortium including Global Infrastructure Partners and Brookfield Asset Management Inc., in a deal worth $10.1 billion. Still, its sole IPO to date was the listing of its fuel-retailing unit, Abu Dhabi National Oil Co. for Distribution PJSC, in 2017. (Updates with Adnoc Drilling valuation in 2018 in second paragraph.)'''
Quantize Model
We can easily convert the model into a quantized form with the quantize_dynamic function, which replaces the weights of the selected layer types (here, torch.nn.Linear) with 8-bit integers (INT8) instead of 32-bit floating point (FP32). We will then run both the original and the quantized model and time their inference; you should observe a notable decrease in inference time for the quantized version.
model_quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# Run the original FP32 model and time its inference.
start_original = time.time()
model_input = tokenizer(text, truncation=True, return_tensors='pt').to(device)
model_output = model.generate(**model_input, max_new_tokens=64)
summary_original = tokenizer.decode(model_output[0], skip_special_tokens=True)
end_original = time.time()
# Run the quantized INT8 model and time its inference.
start_quantized = time.time()
model_input = tokenizer(text, truncation=True, return_tensors='pt').to(device)
model_output = model_quantized.generate(**model_input, max_new_tokens=64)
summary_quantized = tokenizer.decode(model_output[0], skip_special_tokens=True)
end_quantized = time.time()
print(f"Summary (quantized model): {summary_quantized}")
print(f"Inference time using original model: {end_original - start_original:.2f}s")
print(f"Inference time using quantized model: {end_quantized - start_quantized:.2f}s")
Save Quantized Model
In addition to running the quantized model, we can also save it for future use. Set the placeholder save_model_path below to the directory where you want to store the quantized model and its configuration.
save_model_path = "quantized-model/"  # placeholder: replace with your own directory (keep the trailing slash)
model_quantized.config.save_pretrained(save_model_path)  # save the model configuration
quantized_state_dict = model_quantized.state_dict()
torch.save(quantized_state_dict, save_model_path + "model-quantized.pt")  # save the quantized weights
That’s it! You have now saved your quantized model and can load it at any time to make predictions.
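If you later want to reload the saved weights, one common pattern is to rebuild the quantized architecture first and then load the saved state dict into it. The sketch below assumes the same model_name and save_model_path variables used above; depending on your PyTorch version, you may need to pass weights_only=False to torch.load.
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModelForSeq2SeqLM
# Re-create the quantized module structure, then load the saved quantized weights into it.
reloaded = AutoModelForSeq2SeqLM.from_pretrained(model_name)
reloaded = quantize_dynamic(reloaded, {torch.nn.Linear}, dtype=torch.qint8)
state_dict = torch.load(save_model_path + "model-quantized.pt")  # add weights_only=False if your PyTorch version requires it
reloaded.load_state_dict(state_dict)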
Ready to dive deeper into model quantization? You can find the full notebook here and start experimenting with quantization on your own models. It provides the full code to perform inference with a quantized model, as well as to save the quantized model and reload it for making predictions.
If you have any questions or comments, feel free to contact us. We are here to help you optimize your models for maximum performance and efficiency.
Conclusion
In this post, we have explored the concept of model quantization, understanding its significance, benefits and limitations. We provided a hands-on example to demonstrate how to implement quantization in practice. Choosing the right quantization strategy can significantly enhance model performance, especially in environments with limited computational resources.
At Medoid AI, we specialize in optimizing models to meet specific hardware and system requirements, ensuring the best performance for your needs! We hope you're excited to try out quantization for yourself and see the benefits first hand.