PyTorch Quantization

Last Updated: 22 Jul, 2025

Quantization is a core method for deploying large neural networks such as Llama 2 efficiently on constrained hardware, especially embedded systems and edge devices. The aim is to reduce computational and memory costs by converting high-precision floating-point representations (such as float32) into lower-precision integer types (such as int8). This significantly reduces inference time and energy usage, often with negligible impact on model accuracy, making it possible to deploy even billion-parameter models on devices that do not natively support floating-point math.

Why Quantization?

- Model Size Reduction: Quantization compresses neural network weights/activations from 32-bit floats to 8-bit integers, reducing storage and memory requirements by up to 4x.
- Speed: Int8 matrix multiplications are much faster on most hardware (especially CPUs and embedded accelerators), accelerating inference significantly.
- Hardware Support: Embedded systems (e.g., microcontrollers, NPUs) often lack native floating-point units, so integer-only arithmetic is required.
- Energy Efficiency: Less computation and lower memory bandwidth translate into a lower energy draw, which is critical for mobile and IoT devices.

How Quantization Works

1. Representation Mapping

- Original: Weights W and biases b are stored as 32-bit floats.
- Quantization: These values are mapped to lower-precision integer representations (usually int8 for weights, int32 for biases).
- Dequantization: Before results are fed to the next layer (which often still expects float values), they are mapped back to floating point via the scale and zero-point parameters.

Example (as in Llama 2 7B or similar large models):

y = xW + b

where:

- W: quantized to an 8-bit integer (int8)
- b: quantized to a 32-bit integer (int32, matching the accumulator width)

Computation is performed in lower precision, then dequantized for subsequent operations.

2. Formal Quantization Equation

Forward quantization:

q_x = \mathrm{round}\left(\frac{x}{\text{scale}}\right) + \text{zero\_point}

Dequantization:

x \approx \text{scale} \times (q_x - \text{zero\_point})

- scale: the step size between adjacent integer values, expressed in terms of their float counterparts.
- zero_point: the integer that float zero maps to, aligning the integer range with the network's value distribution.

Types of Quantization

- Dynamic: Quantizes weights post-training; activations are quantized on the fly during inference. Fast, with minimal code changes.
- Static (PTQ): Requires calibration data; quantizes both weights and activations ahead of inference for the best runtime efficiency.
- QAT (Quantization-Aware Training): Simulates quantization noise during training for the highest post-quantization accuracy.
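To make the scale/zero-point mapping above concrete before walking through the full workflow, here is a minimal sketch of the quantize/dequantize round trip using torch.quantize_per_tensor. The tensor values, scale and zero_point are illustrative placeholders, not calibration results from a real model.

Python
import torch

# Illustrative values only: in a real model, scale and zero_point come from calibration/observers.
x = torch.tensor([0.0, 0.5, 1.0, -0.3])
scale, zero_point = 0.01, 128  # hypothetical parameters for an unsigned 8-bit range

# Forward quantization: q_x = round(x / scale) + zero_point, clamped to [0, 255] for quint8
q_x = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=torch.quint8)
print(q_x.int_repr())   # the stored integer values

# Dequantization: x is approximately scale * (q_x - zero_point)
x_hat = q_x.dequantize()
print(x_hat)            # close to x, up to quantization error

Running this shows that each float is recovered only up to the quantization step (here 0.01); choosing scale and zero_point well is exactly what the calibration step in the workflow below is for.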
Implementation: PyTorch Workflow (Post-Training Quantization)

Step 1: Data Preparation

- Loads the MNIST handwritten digits dataset.
- Converts images to PyTorch tensors.
- Prepares DataLoaders for batching and iterating through the data during training and testing.

Python
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor()
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
test_loader = DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)

Step 2: Define the CNN Model (with Quantization Stubs)

Python
import torch
import torch.nn as nn
import torch.quantization

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # Mark where quantization/dequantization happens
        self.quant = torch.quantization.QuantStub()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 14 * 14, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)                       # Quantize input to int8
        x = self.relu1(self.conv1(x))
        x = self.pool(self.relu2(self.conv2(x)))
        x = x.reshape(x.size(0), -1)
        x = self.relu3(self.fc1(x))
        x = self.fc2(x)
        x = self.dequant(x)                     # Dequantize output back to float32
        return x

model_fp32 = SimpleCNN()

Why QuantStub/DeQuantStub: they mark the input and output boundaries for quantization and dequantization so PyTorch knows where to apply quantized ops.

Step 3: (Optional) Quick Training

- The model trains for a couple of epochs; even for a quantization demo, reasonable weights are needed.
- The code will still run if you skip training (the quantization steps are independent), but accuracy will be poor.

Python
import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_fp32.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_fp32.parameters(), lr=0.001)

print("Training (just a few epochs for demo)...")
for epoch in range(2):  # Just 2 epochs for time; increase for better accuracy!
    model_fp32.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model_fp32(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Step 4: Fuse Layers

Python
def fuse_model(model):
    torch.quantization.fuse_modules(
        model,
        [['conv1', 'relu1'], ['conv2', 'relu2'], ['fc1', 'relu3']],
        inplace=True
    )

model_fp32.eval()  # fusion for post-training quantization expects eval mode
fuse_model(model_fp32)

Rationale:

- Fusing Conv + ReLU (or Conv + BN + ReLU) combines them into a single operation, improving both the accuracy and the speed of quantized models.
- Fusion must be done before quantization.

Step 5: Set the Quantization Configuration

Python
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

What is qconfig?

- It tells PyTorch how to observe and quantize the model.
- 'fbgemm' is preferred for x86 CPUs. For ARM CPUs, use 'qnnpack'.

Step 6: Prepare for Quantization

prepare() inserts observer modules that record the ranges of activations and weights during calibration.

Python
model_fp32.cpu()  # Eager-mode quantized inference runs on CPU in PyTorch
torch.quantization.prepare(model_fp32, inplace=True)
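As a quick sanity check before calibration (not part of the original workflow), you can inspect the prepared model to confirm that observers were attached. The exact observer classes shown depend on the chosen qconfig and your PyTorch version.

Python
# Sketch: verify that prepare() attached observers to the quantizable modules.
print(model_fp32.qconfig)                                      # observer factories selected by 'fbgemm'
print(model_fp32.conv1)                                        # repr now lists an attached activation observer
print(hasattr(model_fp32.conv1, 'activation_post_process'))    # True once observers have been inserted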
Python print("Calibrating...") model_fp32.eval() with torch.no_grad(): for images, _ in train_loader: model_fp32(images) break # In practical cases, use more calibration data; here just a few for demo Step 8 : Convert to Quantized ModelAll eligible layers (Conv, Linear, etc.) are replaced with quantized (int8) modules.Model is now ready for fast, int8 inference on CPU. Python quantized_model = torch.quantization.convert(model_fp32, inplace=False) Step 9 : Evaluate AccuracyRuns the test data through each model.Compare float32 and quantized int8 accuracy; usually, there is little loss (<1%). Python def evaluate(model, test_loader): model.eval() correct, total = 0, 0 with torch.no_grad(): for images, labels in test_loader: outputs = model(images) _, predicted = outputs.max(1) total += labels.size(0) correct += (predicted == labels).sum().item() print('Accuracy:', 100.0 * correct / total, '%') return correct / total print("\nEvaluating original (float32) model:") evaluate(model_fp32, test_loader) # Note: after quantization, model_fp32 is already quantized if inplace=True print("\nEvaluating quantized model:") evaluate(quantized_model, test_loader) Step 10: Check Model File SizesQuantized model file is about 1/4 the size.File size matches your expected storage/compression benefits from quantization. Python import os torch.save(model_fp32.state_dict(), "float_model.pth") torch.save(quantized_model.state_dict(), "quantized_model.pth") float_size = os.path.getsize("float_model.pth") / 1024 quant_size = os.path.getsize("quantized_model.pth") / 1024 print(f"\nModel size (float32): {float_size:.1f} KB") print(f"Model size (quantized): {quant_size:.1f} KB") Output:OutputGoogle Colab link : Pytorch Quantisation Key Technical PointsBias Quantization: Biases typically use int32 to prevent overflow during accumulation, as they sum many int8 products.Operations on Embedded Devices: Most embedded AI accelerators strictly support integer arithmetic, making quantization the de facto deployment path.Dequantization: At each layer’s output, results are dequantized (float recovered) if further float computation is needed, e.g., for softmax or loss calculation.Precision and Loss: Well-calibrated quantization (especially QAT or with asymmetric scaling and dynamic range selection) can keep final accuracy very close to original float network performance.Real-World Example: Llama 2 QuantizationLlama 2 7B, originally a multi-billion parameter float model, is quantized down (often to 8-bit int for weights and 32-bit int for biases and accumulators), enabling deployment on hardware-constrained servers, mobile or edge devices with minimal accuracy drop.Quantization reduces model size drastically and speeds up inference, since y= xW+ b (with W, b quantized) becomes a pure int8/int32 computation, replaced by a single scale factor per layer for seamless dequantization. Comment More infoAdvertise with us Next Article Introduction to Deep Learning S shambhava9ex Follow Improve Article Tags : Deep Learning AI-ML-DS With Python Similar Reads Deep Learning Tutorial Deep Learning is a subset of Artificial Intelligence (AI) that helps machines to learn from large datasets using multi-layered neural networks. It automatically finds patterns and makes predictions and eliminates the need for manual feature extraction. Deep Learning tutorial covers the basics to adv 5 min read Deep Learning BasicsIntroduction to Deep LearningDeep Learning is transforming the way machines understand, learn and interact with complex data. 