Tokenisation
Tokenisation is the process of breaking text into smaller units called tokens — which can be words, subwords, or characters — so that an AI model can process them numerically. Each token is mapped to a number that the model uses for computation.
Tokenisation determines how a model 'reads' text and directly affects how much text fits in its context window, its processing speed, and its ability to handle different languages.
The word 'unhappiness' might be tokenised into three tokens: 'un', 'happi', and 'ness', letting the model recognise word parts it can recombine to handle new words.
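The sketch below shows this in practice using OpenAI's tiktoken library (one tokeniser among many, an assumption about tooling; the exact subword splits and token IDs vary between models, so the values in the comments are illustrative, not guaranteed).

```python
# Minimal sketch: tokenising text with the tiktoken library
# (pip install tiktoken). Splits and IDs depend on the tokeniser;
# the values in the comments are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several GPT models

token_ids = enc.encode("unhappiness")       # text -> list of integer token IDs
print(token_ids)                            # a short list of integers

# Decode each ID individually to see the subword pieces
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)                               # e.g. ['un', 'happiness'] or ['un', 'happi', 'ness']
```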
Natural Language Processing (NLP)
Natural language processing is a branch of AI that enables machines to read, understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding, covering tasks like translation, summarisation, and sentiment analysis.
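As a toy illustration of one NLP task mentioned above, the sketch below performs lexicon-based sentiment analysis with a hand-made word list (the lists are invented for this example; real systems learn from data rather than relying on fixed lexicons).

```python
# Toy sketch of sentiment analysis using a tiny hand-made lexicon.
# Purely illustrative; production NLP systems use trained models.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "unhappy"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    # Count positive hits minus negative hits
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this excellent course"))  # positive
print(sentiment("The service was terrible"))      # negative
```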
Embedding
An embedding is a way of representing data — such as words, sentences, or images — as a list of numbers (a vector) in a continuous space. Items that are semantically similar end up close together in this space, allowing machines to understand relationships between concepts.
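A small sketch of this idea follows: the 4-dimensional vectors are invented for illustration (real embeddings usually have hundreds or thousands of dimensions), but cosine similarity is a standard way to measure closeness in embedding space.

```python
# Toy sketch: comparing word embeddings with cosine similarity.
# The 4-dimensional vectors below are made up for illustration.
import math

embeddings = {
    "king":  [0.8, 0.6, 0.1, 0.2],
    "queen": [0.7, 0.7, 0.1, 0.3],
    "apple": [0.1, 0.2, 0.9, 0.8],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Semantically related words sit closer together (similarity nearer 1)
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~0.99)
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower (~0.36)
```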
Large Language Model (LLM)
A large language model is an AI system trained on vast quantities of text data that can understand, generate, and reason about human language. Most LLMs are built on the transformer architecture and contain billions of parameters, enabling them to perform a wide range of language tasks.
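For a hands-on flavour, the sketch below generates text with GPT-2 via the Hugging Face transformers library (an assumption about tooling; GPT-2 is chosen only because it is small and freely downloadable, and it is far less capable than modern billion-parameter LLMs).

```python
# Minimal sketch: text generation with a small pretrained language model
# via the Hugging Face transformers library (pip install transformers).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Continue a prompt; output varies between runs because sampling is random
result = generator("Tokenisation is", max_new_tokens=20)
print(result[0]["generated_text"])
```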