Quantization
Quantization is a model optimization technique that reduces the numerical precision of a model's weights and activations, for example from 16-bit floating point to 8-bit integers, to lower memory and compute requirements.
It enables faster and cheaper inference, especially on edge devices or cost-sensitive production systems.
A team quantizes a chatbot model to run on a mobile device without cloud inference.
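The core idea can be sketched in a few lines. Below is a minimal, illustrative example of symmetric post-training quantization to int8 using NumPy; the function names are hypothetical and real frameworks (PyTorch, TensorFlow Lite, ONNX Runtime) handle many more details, such as per-channel scales and activation calibration.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to int8 via one scale."""
    scale = np.max(np.abs(weights)) / 127.0  # largest magnitude maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.30, 0.07, 0.99], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# int8 storage needs 1 byte per weight instead of 4 for float32,
# at the cost of a small rounding error bounded by about scale / 2
```

The memory saving here is 4x; the round-trip error per weight is at most half a quantization step, which is why well-quantized models usually lose little accuracy.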
Inference Time
Inference time is the time an AI model takes to generate an output after receiving input. It is often measured in milliseconds for real-time systems or seconds for complex generative tasks.
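Measuring inference time is straightforward: timestamp before and after the model call. A minimal sketch, where `model_predict` is a stand-in for a real model invocation:

```python
import time

def model_predict(prompt):
    """Stand-in for a real model call; sleeps to simulate compute."""
    time.sleep(0.05)
    return "response"

start = time.perf_counter()
output = model_predict("Hello")
elapsed_ms = (time.perf_counter() - start) * 1000  # inference time in ms
```

`time.perf_counter()` is preferred over `time.time()` for interval measurement because it is monotonic and high-resolution.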
Latency
Latency is the delay between a user's request and the system's response. In AI systems, latency includes model processing time plus network and infrastructure overhead.
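The distinction matters when debugging slow systems: a fast model can still feel slow to users. A sketch of the decomposition, with purely illustrative numbers:

```python
model_time_ms = 120.0  # inference time: work done by the model itself
network_ms = 35.0      # request and response transit over the network
queueing_ms = 10.0     # waiting in the serving infrastructure

# End-to-end latency is what the user actually experiences
latency_ms = model_time_ms + network_ms + queueing_ms
```

Optimizing the model only reduces the first term; the others require infrastructure changes such as regional deployment or better request batching.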
Model
In AI, a model is the mathematical representation that a machine learning system builds from training data. It captures the patterns, relationships, and rules discovered during training and uses them to make predictions or generate outputs on new data.
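In the simplest case, the "mathematical representation" is just a set of fitted parameters. A minimal sketch using a linear fit, where the learned slope and intercept are the model:

```python
import numpy as np

# Training data: inputs and targets with a roughly linear relationship
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# "Training" fits the model's parameters to the data
slope, intercept = np.polyfit(X, y, deg=1)

# The fitted parameters ARE the model; prediction applies them to new input
prediction = slope * 5.0 + intercept
```

Large neural networks follow the same pattern at vastly greater scale: billions of parameters instead of two, but still values learned from data and applied to new inputs.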