The Cat Sat on the Mat: How LLMs Work

Jul 05, 2025

The Meaning of Words

We all intuitively know that the meaning of a word depends on its context, typically the words around it. For example, sending an email that only contains the word ‘cat’ communicates nothing more than that you are referring to a kind of animal.

Conversely, if I asked you to complete the sentence, "The cat sat on the...", there's a good chance you'd say 'mat', and so did my preferred Large Language Model (LLM) when I tried this out.

But why?

How LLMs Learn Word Relationships

As you probably know, LLMs are 'trained' by processing massive amounts of text from the internet, books, and other sources. An LLM doesn't 'understand' words in a human sense (it doesn’t even see them); instead, each word is first converted into a numerical identifier, called a 'token'. During this training, the LLM's main goal is to learn how these tokens relate to each other through probability calculations.
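To make that concrete, here is a minimal Python sketch of what that conversion might look like, using a made-up five-word vocabulary (real tokenizers split text into subwords and use vocabularies of tens of thousands of entries):

```python
# A toy illustration, not a real tokenizer: map each word to an integer ID.
# Real LLMs use subword tokenizers with much larger vocabularies.
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def encode(text: str) -> list[int]:
    """Convert a lowercase, space-separated sentence into token IDs."""
    return [toy_vocab[word] for word in text.split()]

print(encode("the cat sat on the"))  # [0, 1, 2, 3, 0]
```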

For example, when given 'the cat sat on the...', the LLM processes these words as a sequence of tokens. It then calculates a probability for every possible next token (representing a word) in its vocabulary. Based on whether its prediction is right or wrong, the model adjusts its internal settings. This continuous adjustment helps it get better at calculating probabilities, essentially learning the patterns and relationships between words in the vast training data.
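As a rough illustration, here is a small sketch of turning scores into next-word probabilities. The four-word vocabulary and the raw scores are invented; a real model computes such scores from the whole context, and the loss at the end is the kind of signal used to adjust its settings:

```python
import numpy as np

# A minimal sketch of next-token prediction with made-up numbers.
vocab = ["mat", "roof", "sofa", "moon"]
logits = np.array([4.0, 2.5, 2.0, -1.0])   # invented raw scores for each word

# Softmax turns the scores into a probability for every token in the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum()
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")

# During training, the loss measures how little probability went to the word
# that actually came next; the model's settings are nudged to increase it.
target = vocab.index("mat")
loss = -np.log(probs[target])
print("loss:", round(loss, 3))
```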

The core of this process lies in 'embeddings'. These are not the simple token numbers themselves, but rather long lists of numbers (vectors) assigned to each token. Initially, these values are random, but as the model trains it constantly adjusts them based on which words appear together. This allows it to ‘model’ the statistical relationships between words. Patterns form in these embeddings because language itself has patterns in how words relate to ideas and concepts.
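Here is a toy sketch of what an embedding table might look like. The sizes and values are invented (four dimensions per token, five tokens); real models use thousands of dimensions and far larger vocabularies:

```python
import numpy as np

# Sketch of an embedding table: one vector of numbers per token ID.
# Values start random and are adjusted during training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(5, 4))  # 5 tokens in the toy vocabulary, 4 dims each

cat_id = 1
print("embedding for 'cat':", embedding_table[cat_id])

# Training repeatedly nudges these values (a gradient step, grossly simplified),
# so that words appearing in similar contexts end up with similar vectors.
gradient = rng.normal(size=4)              # stand-in for a real computed gradient
embedding_table[cat_id] -= 0.01 * gradient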

For example, if you list animals, you're listing words that relate to the concept of an animal. While we might think of these relationships as 'concepts' or ‘semantics,’ the model itself doesn't 'understand' them like a human does. It's simply a complex system of statistical patterns that, when combined, influence the probability of one word following another.
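One way to picture those patterns: after training, the vectors for related words end up pointing in similar directions. The three vectors below are made up purely for illustration:

```python
import numpy as np

# Hypothetical trained vectors, invented for illustration: words used in similar
# contexts (here, two animals) end up closer together than unrelated words.
cat  = np.array([0.9, 0.1, 0.8])
dog  = np.array([0.8, 0.2, 0.7])
jazz = np.array([0.1, 0.9, 0.2])

def cosine(a, b):
    """Similarity between two vectors: values near 1.0 mean 'pointing the same way'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog :", round(cosine(cat, dog), 2))   # high similarity
print("cat vs jazz:", round(cosine(cat, jazz), 2))  # low similarity
```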

Changing the Meaning of Words

The truly amazing part is how these large language models dynamically adjust the meaning of words based on their immediate context. Think of it like this: the meaning of a word isn't fixed; it shifts depending on the other words around it in a sentence. This is why ‘context’ (the words you put in) is so important: it's fundamental to how every word gains its meaning through the LLM’s mathematical mechanisms, and that meaning, in turn, determines the next word produced.

For example, consider the word "cat."

  • If I say, "The cat purred contentedly on the rug," the word "purred" influences the meaning of "cat" towards a domestic animal. The LLM adjusts the numerical representation (embedding) of "cat" to reflect this "pet animal" meaning.
  • Similarly, in "He's a cool cat with a smooth jazz tune," the words "cool" and "jazz" influence the meaning of "cat" towards a stylish or hip person. The LLM adjusts the embedding of "cat" to reflect this "stylish person" meaning.

This dynamic adjustment of a word's embedding based on its context is crucial for predicting the next word. If the LLM has adjusted "cat" to mean "pet animal" because of words like "purred," it will then give a higher probability to words like "food" or "toy" as the next word. And if "cat" is adjusted to mean "stylish person" due to words like "cool," then "music" would become a more probable next word.
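Here is a deliberately crude sketch of that idea: blending a word's vector with the vector of a nearby context word. Real models do this with learned attention weights across the whole sentence; every number and the blending rule here are invented for illustration:

```python
import numpy as np

# A crude stand-in for attention: shift a word's vector towards its context.
cat_base = np.array([0.5, 0.5])   # ambiguous: part "pet animal", part "stylish person"
purred   = np.array([1.0, 0.0])   # strongly "pet animal"
jazz     = np.array([0.0, 1.0])   # strongly "stylish person"

def contextualize(word_vec, context_vec, weight=0.6):
    """Blend a word's vector with a context word's vector."""
    return (1 - weight) * word_vec + weight * context_vec

print("'cat' near 'purred':", contextualize(cat_base, purred))  # leans pet animal
print("'cat' near 'jazz'  :", contextualize(cat_base, jazz))    # leans stylish person
```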

We don't actually ‘know’ which concepts (like pet animal or stylish person) words are being linked to; the model builds these concepts through a statistical process, based on how words are associated with other words in the training data (website content, book texts, and so on).

How the LLM Predicts the Next Word: Back to our Cat

This dynamic process means the meaning of each word is constantly shaped by its surroundings. When the LLM predicts 'mat' for "the cat sat on the...", it's because the preceding words, through statistical patterns learned during training, point towards concepts related to where a cat might sit.

Here's the truly mind-bending part: when the LLM predicts the next word, it doesn't re-process the entire sentence from the beginning. Instead, as each word is processed, the full context of the sentence up to that point is cleverly encoded into the numerical representation (embedding) of the last word.

So, in "the cat sat on the...", the final word 'the' holds all the contextual information that a cat is looking to sit on something. It is only this last, context-rich word (token) that the model uses to calculate the probability of the very next word.

This might seem counter-intuitive, but it's possible due to the enormous complexity of the mathematical 'embeddings'. Imagine representing each word's meaning not just in 3D space, but in thousands of dimensions (e.g., 3000). This allows for an incredibly rich and unique numerical representation for every possible thought, idea, or sentence context.

Each unique combination of words creates a unique value in the embedding of the last token that captures the full context, which then guides the prediction of the next. In other words, the full idea is captured in one set of numbers; this is possible because those numbers describe a relationship between that word and all the other words in the vocabulary.

This brings us back to the idea of context. A word means what it means through its relationships to other words. In a large language model, that full context, the meaning a word takes on given everything that has come before it, is captured in the numbers attached to each and every token, and that is what drives the next word in the sequence.

It’s an amazing, almost mind-bending idea, but the key takeaways are far more pragmatic:

  • The models never retrieve data; they only provide probabilistic outputs.
  • Where facts matter, you need to verify the outputs.
  • How relevant and valuable that probabilistic output is depends largely on how well the model was trained and on the words you put in, which set the context for the next word in the sequence.
  • The workflows that can most benefit from this are those where creativity and the surfacing of previously unseen patterns and concepts are valuable.