AI Quantization

Quantization is a powerful technique that can make AI models more efficient and accessible. By reducing the precision of the numerical values a model stores and computes with, quantization can cut memory usage and computation time by up to four times. This enables models to run on devices with limited resources and respond to users faster. It also shrinks the model file itself, making it easier to download and update. Quantization can be applied in different ways and at different levels, depending on the needs and goals of developers and users. According to LatentAI, by 2025 more than 50% of microcontroller-based embedded products will ship with on-device AI capabilities, such as those powered by tinyML. [1] This means that billions of devices, from smart home appliances to wearable gadgets, will be able to perform intelligent tasks locally and efficiently, enhancing user experience and privacy while reducing energy consumption and latency.
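
To make the idea concrete, here is a minimal sketch of affine quantization in NumPy: a tensor of 32-bit floats is mapped onto 8-bit integers, shrinking its storage by a factor of four. The tensor and its shape are invented for illustration; real frameworks apply this per layer and with carefully calibrated ranges.

```python
import numpy as np

# Hypothetical weight tensor in 32-bit floating point.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# Affine (asymmetric) quantization: map the observed float range
# [w_min, w_max] onto the int8 range [-128, 127].
w_min, w_max = weights_fp32.min(), weights_fp32.max()
scale = (w_max - w_min) / 255.0
zero_point = int(round(-128 - w_min / scale))

weights_int8 = np.clip(
    np.round(weights_fp32 / scale) + zero_point, -128, 127
).astype(np.int8)

# Dequantize to approximate the original values when needed.
weights_dequant = (weights_int8.astype(np.float32) - zero_point) * scale

print(f"fp32 size: {weights_fp32.nbytes / 1e6:.1f} MB")  # ~4.2 MB
print(f"int8 size: {weights_int8.nbytes / 1e6:.1f} MB")  # ~1.0 MB, 4x smaller
```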


Artificial intelligence (AI) models have gained widespread use in recent years, with applications ranging from image and speech recognition to autonomous vehicles and medical diagnosis. These models are typically trained and deployed using 32-bit floating-point numbers, which can represent a wide range of values with high accuracy. That precision comes at a cost: 32-bit values consume significant memory and computational resources, which is a major drawback for applications that need real-time processing or that run on resource-constrained hardware such as mobile phones and IoT edge devices, limiting the efficiency and effectiveness of AI-powered applications.

In a case study by Mukhammed Garifulla, Juncheol Shin, Chanho Kim, Won Hwa Kim, Hye Jung Kim, Jaeil Kim, and Seokin Hong, the authors acquired 1,400 ultrasound images of patients with malignant or benign breast tumors from Kyungpook National University Chilgok Hospital. Senior radiologists with more than 10 years of experience reviewed all acquired images together with the associated radiological reports. All breast ultrasound images were anonymized with in-house software that erased personal information such as patient name, patient ID, acquisition date, and manufacturer. The data set was randomly divided into training, validation, and test sets of 1,000, 200, and 200 images, respectively, with the validation set used to tune the hyperparameters of the classification models. All images were resized to 224 × 224 pixels using bilinear interpolation, and pixel intensities were normalized to the range 0 to 1 by dividing by the maximum intensity of each image. During this normalization, the image data type was converted from unsigned integer to 32-bit floating point to prevent loss of image information through digitization.

After applying dynamic range and full-integer quantization, the model sizes were reduced by around four times: VGG shrank to about 18.6 MB, while the GoogLeNet and ResNet models were reduced to 15.6 MB and 21.5 MB, respectively. This significant reduction in model size lowers memory and storage consumption, enabling complex CNN models to be deployed on mobile devices. Dynamic range and full-integer quantization result in the same model size because both store the weights in the same 8-bit integer target format; full-integer quantization additionally quantizes the activations to 8-bit integers. [2]
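
A rough sketch of that preprocessing step is below, assuming the images are loaded as unsigned-integer NumPy arrays; the function name and the use of OpenCV for bilinear resizing are my own choices for illustration, not taken from the paper.

```python
import numpy as np
import cv2  # OpenCV, assumed available for bilinear resizing

def preprocess_ultrasound(image_u8: np.ndarray) -> np.ndarray:
    """Resize to 224 x 224 with bilinear interpolation and normalize
    pixel intensities to [0, 1] per image, as in the case study."""
    resized = cv2.resize(image_u8, (224, 224), interpolation=cv2.INTER_LINEAR)
    # Convert from unsigned int to float32 *before* dividing, so no
    # information is lost to integer truncation.
    as_float = resized.astype(np.float32)
    return as_float / as_float.max()

# Example: a fake 8-bit grayscale ultrasound frame.
frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
normalized = preprocess_ultrasound(frame)  # float32, shape (224, 224)
```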

Because quantized values take fewer bits to store, the model file shrinks as well. For example, a model that uses 32-bit floating-point numbers can be quantized to 8-bit integers, reducing the file size by as much as 75% and saving bandwidth and storage space for users and developers alike. There are different types of quantization, such as post-training quantization, which applies quantization after the model is trained, and quantization-aware training, which simulates quantization during the training process so the model learns to tolerate the reduced precision. There are also different levels of quantization, such as full quantization, which converts all the values in the model to lower-bit integers or binary numbers, and hybrid quantization, which converts only some of the values and keeps the rest in floating point.
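
The two post-training variants from the case study map directly onto TensorFlow Lite's converter options. The sketch below uses a tiny stand-in Keras model and random calibration data as placeholders for a real trained network and representative inputs.

```python
import numpy as np
import tensorflow as tf

# A tiny stand-in model; in practice this would be the trained CNN.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Dynamic range quantization: weights stored as 8-bit integers,
# activations quantized dynamically at inference time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_tflite = converter.convert()

# Full-integer quantization: weights AND activations become 8-bit
# integers, calibrated against a small representative dataset.
def representative_data_gen():
    for _ in range(10):  # placeholder calibration inputs
        yield [np.random.rand(1, 224, 224, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
full_integer_tflite = converter.convert()
```

Both converted files end up roughly the same size, since the weights dominate the file and are stored as 8-bit integers in either case.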


The process of quantizing AI models is proving highly beneficial for many established players in the market. It gives cloud service providers an opportunity to scale up their AI models and offer them as services to customers, opening new revenue streams. With the help of this technology, generative AI tools can produce diverse, high-quality content such as images, text, audio, and video while consuming fewer resources and taking less time. Although artificial intelligence has transformational potential, it presents challenges such as high computational costs, large memory requirements, and complex deployment processes. Many market leaders are actively adopting quantization to overcome these challenges and unlock the full potential of AI, since it helps companies achieve their goals while minimizing costs and maximizing efficiency.
