Inference time is the time an AI model takes to generate an output after receiving input. It is often measured in milliseconds for real-time systems or seconds for complex generative tasks.
Fast inference is critical for user experience and operational cost, especially in high-traffic products.
For example, an AI autocomplete feature may be abandoned when suggestions take more than a second to appear: users simply stop typing and move on.
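A minimal sketch of how inference time is typically measured: wrap the model call in a high-resolution timer and report milliseconds. The `fake_model` function here is a hypothetical stand-in (it sleeps to simulate compute); in practice you would time a real model call.

```python
import time

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; sleeps to simulate compute."""
    time.sleep(0.05)  # pretend the model takes ~50 ms
    return prompt.upper()

def timed_inference(model, prompt: str):
    """Return the model output and the inference time in milliseconds."""
    start = time.perf_counter()
    output = model(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return output, elapsed_ms

output, ms = timed_inference(fake_model, "hello")
print(f"output={output!r}, inference time={ms:.1f} ms")
```

`time.perf_counter()` is preferred over `time.time()` for this because it is monotonic and has the highest available resolution.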
Inference
Inference is the process of using a trained AI model to make predictions or generate outputs on new, unseen data. While training is about learning patterns, inference is about applying what the model has learned to real-world inputs.
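The training/inference split can be illustrated with a toy linear model: training derives a weight from example data, and inference applies that fixed weight to inputs the model has never seen. The data values below are made up for illustration.

```python
import numpy as np

# Training: learn a linear mapping y ≈ w * x from example data
# (closed-form least-squares solution for a single weight).
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
w = (x_train @ y_train) / (x_train @ x_train)

# Inference: apply the learned weight to new, unseen inputs.
# No learning happens here; w is frozen.
x_new = np.array([5.0, 10.0])
y_pred = w * x_new
print(y_pred)  # → [10. 20.]
```

Training is done once (and is expensive); inference like the last two lines runs on every request, which is why its speed dominates operational cost.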
Latency
Latency is the delay between a user's request and the system's response. In AI systems, latency includes model processing time plus network and infrastructure overhead.
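A sketch of a per-request latency budget, using hypothetical numbers: end-to-end latency is the sum of stages, and model inference is often only one part of it.

```python
# Hypothetical latency budget for one request (all values in ms).
budget = {
    "network": 40.0,    # client <-> server round trip
    "queueing": 15.0,   # waiting for a free worker or GPU
    "model": 120.0,     # actual inference time
}

total = sum(budget.values())
print(f"total latency: {total:.0f} ms")
for stage, ms in budget.items():
    print(f"  {stage}: {ms:.0f} ms ({ms / total:.0%})")
```

Breaking latency down this way shows where optimisation effort pays off: halving model time here saves far more than eliminating queueing entirely.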
Quantization
Quantization is a model optimisation technique that reduces the numerical precision of weights and activations, such as from 16-bit floating point to 8-bit integers, to lower memory and compute requirements, usually at a small cost in accuracy.
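The idea can be sketched with simple symmetric per-tensor int8 quantization: map each weight to an 8-bit integer via a single scale factor, then recover an approximation by multiplying back. This is a minimal illustration, not how production quantizers (per-channel scales, calibration, etc.) are implemented.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float weights -> int8 plus a scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q, "max round-trip error:", np.abs(w - w_hat).max())
```

The int8 tensor uses a quarter of the memory of 16-bit weights (an eighth of 32-bit), and the round-trip error is bounded by half the scale, which is why quantized models run faster with only a small accuracy drop.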