Embeddings, Explained Without the Hype
How large language models turn words into numbers – and why that matters.
Why this matters
If you have ever asked a chatbot a question and felt a small flicker of surprise that it actually understood you, there is a single idea sitting underneath that moment: the embedding. It is not the flashiest concept in AI, and it does not show up in marketing slides. But almost everything an LLM (large language model) appears to “understand” about your words runs through it first.
The intuition is simple. Humans understand words by their meaning. Computers, deep down, only understand numbers. Embeddings are the bridge between the two: they convert text into meaningful numbers.
What is an embedding?
An embedding is a numerical representation of a piece of text. Instead of storing the word “dog” as three letters, the model stores it as a long list of numbers – a vector. A vector is just an ordered list of numbers; you can think of it as coordinates, except in many more dimensions than the three we are used to.
For illustration, the numbers might look something like this:
"dog" -> [0.21, -0.44, 0.89, ...]
"cat" -> [0.19, -0.40, 0.85, ...]
"car" -> [-0.72, 0.11, -0.33, ...]
The interesting part is not the numbers themselves – nobody hand-picks them. It is what those numbers do once you compare them. Words with similar meanings end up with vectors that are close to each other in this space (often called the vector space). Words with unrelated meanings end up far apart.
So in our tiny example:
- “dog” and “cat” land near each other – both are pets, both are animals.
- “dog” and “car” land far apart – one barks, the other needs petrol.
This is what people mean when they say AI understands semantic meaning. It is not really “understanding” in the human sense; it is geometry. Similar meanings sit in similar neighbourhoods. That similarity property is the workhorse behind almost every clever thing LLMs do later.
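That closeness can be measured directly. A common measure is cosine similarity, which compares the directions of two vectors: values near 1 mean similar meanings, values near 0 or below mean unrelated ones. Here is a minimal sketch using the toy vectors from above (the numbers are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, made up for illustration.
dog = [0.21, -0.44, 0.89]
cat = [0.19, -0.40, 0.85]
car = [-0.72, 0.11, -0.33]

print(cosine_similarity(dog, cat))  # very close to 1.0
print(cosine_similarity(dog, car))  # negative: pointing away from each other
```

Everything from semantic search to recommendations is, under the hood, a variation on this one comparison.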
Why embeddings are needed
An LLM cannot directly chew on text. The pipeline always looks roughly the same:
Text → Tokens → Embeddings → Neural network processing
Take the sentence “I love coffee.”
First, it is broken into small chunks of text (tokens), usually a word or part of a word:
["I", "love", "coffee"]
Then each token is turned into an embedding:
"I" -> vector
"love" -> vector
"coffee" -> vector
From this point on, the model is no longer working with letters or words. It is working entirely with numbers it can multiply, add, and compare. Everything downstream – understanding, reasoning, generating a reply – happens in this numerical world.
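The steps above can be sketched in a few lines. The vocabulary and vectors here are invented, and real tokenizers split text into sub-word pieces rather than on whitespace, but the shape of the pipeline is the same:

```python
# A toy version of the text -> tokens -> vectors pipeline.
# The vocabulary and vectors are invented for illustration.
embedding_table = {
    "I":      [0.10, 0.30, -0.20],
    "love":   [0.80, -0.10, 0.40],
    "coffee": [0.05, 0.90, 0.30],
}

def tokenize(text):
    # Real tokenizers use learned sub-word splits; whitespace is enough here.
    return text.replace(".", "").split()

tokens = tokenize("I love coffee.")
vectors = [embedding_table[t] for t in tokens]

print(tokens)  # ['I', 'love', 'coffee']
```

From here on, the model only ever sees the `vectors` list, never the original letters.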
How embeddings are created
Embeddings are not written by hand. They are learned. During training, the model reads billions of sentences and gradually nudges the numbers so that words used in similar contexts end up with similar vectors.
The classic party-trick example shows how much structure this produces:
King - Man + Woman ≈ Queen
Take the vector for “King,” subtract “Man,” add “Woman,” and you land very close to “Queen.” Nobody told the model what royalty or gender are. It just noticed, across enough text, how those words tend to keep company.
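You can see the arithmetic with toy 2-D vectors, chosen by hand so that one axis loosely stands for “royalty” and the other for “gender” (a real model learns hundreds of such directions from data, and the result only lands approximately on “Queen”):

```python
# Hand-picked toy vectors: axis 0 ~ "gender", axis 1 ~ "royalty".
king  = [0.9, 0.8]
man   = [0.9, 0.1]
woman = [0.1, 0.1]
queen = [0.1, 0.8]

# King - Man + Woman, component by component.
result = [k - m + w for k, m, w in zip(king, man, woman)]

# result lands (approximately) on the same point as queen.
print(result)
```

Subtracting “Man” removes the gender direction; adding “Woman” puts a different one back, and the royalty direction is untouched.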
From static to contextual
Older systems gave each word a single, fixed embedding. That works – until you hit a word like “Apple,” which can be a fruit on a kitchen counter or a company in California. A single vector cannot be both.
Modern LLMs solve this with contextual embeddings. The same word gets a different vector depending on the words around it. So:
- In “Apple is tasty,” the embedding for “Apple” leans towards the fruit neighbourhood.
- In “Apple launched the iPhone,” it leans towards the technology-company neighbourhood.
The same trick handles other tricky words too – “bank” as a riverbank versus a place that holds your money, for instance. Context shifts the coordinates.
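One crude way to picture this shift (not how transformers actually compute it, which involves attention layers) is to blend a word’s base vector with the average of its neighbours’ vectors. All the vectors below are invented:

```python
# A crude sketch of contextual embeddings: blend a word's base vector
# with the average of its neighbours'. All vectors here are invented;
# real models do this with stacked attention layers, not simple averaging.
base = {
    "apple":  [0.5, 0.5],
    "tasty":  [0.9, 0.0],   # the "food" direction
    "iphone": [0.0, 0.9],   # the "tech" direction
}

def contextual(word, context, mix=0.5):
    ctx = [sum(base[c][i] for c in context) / len(context) for i in range(2)]
    return [(1 - mix) * b + mix * c for b, c in zip(base[word], ctx)]

print(contextual("apple", ["tasty"]))   # pulled toward the food direction
print(contextual("apple", ["iphone"]))  # pulled toward the tech direction
```

Same word, two different coordinates, depending entirely on what sits next to it.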
How an LLM understands a question
Imagine you ask: “How do I fix slow laptop performance?” Roughly, here is what happens.
1. Tokenization
The sentence is split into tokens:
["How", "do", "I", "fix", "slow", "laptop", "performance"]
2. Embedding layer
Each token is converted into a vector. These vectors carry meaning – not just the identity of the word, but a hint of what kind of word it is and what kinds of words usually surround it.
3. Transformer processing
This is the brain of an LLM. The transformer combines several ideas: an attention mechanism, layers of neural processing, and a lot of pattern matching learned during training. Together, they let the model spot relationships between words even when those words sit far apart in the sentence.
In our laptop example, the model picks up on associations like:
slow ↔ performance
laptop ↔ fix
And quietly assembles a picture: the user has a problem, the device is a laptop, and the issue is about speed.
The attention mechanism
Attention is worth its own moment, because it is the single piece that makes modern LLMs feel so different from older chatbots. The specific flavour used inside transformers is called self-attention.
Self-attention helps the model answer one question for every word: “Which other words in this sentence should I pay attention to in order to understand this one?”
Take a classic ambiguous sentence:
“The animal didn’t cross the road because it was tired.”
What does “it” refer to? A human reads this and instantly thinks: the animal. The model arrives at the same answer because attention links “it” back to “animal” rather than “road.” Multiply that little decision across every word in every sentence, and you start to see how a stack of these layers can build up a surprisingly nuanced view of language.
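The core of self-attention is scaled dot-product scoring: compare a word’s vector against every other word’s vector, then turn the scores into weights that sum to 1. The vectors below are invented so that “it” points mostly in the “animal” direction; real models compute separate query and key projections first, which this sketch omits:

```python
from math import exp, sqrt

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(query, keys):
    """Scaled dot-product: how much the query word attends to each key word."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    return softmax(scores)

# Invented vectors: "it" leans toward the "animal" direction.
vectors = {
    "animal": [0.9, 0.1],
    "road":   [0.1, 0.9],
    "it":     [0.8, 0.2],
}

weights = attention_weights(vectors["it"], [vectors["animal"], vectors["road"]])
print(weights)  # "it" attends more to "animal" than to "road"
```

The first weight (for “animal”) comes out larger, which is exactly the little decision that resolves the pronoun.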
How an LLM generates a response
Once the model has “understood” the input, generation is almost anti-climactic. It predicts the next token. Then the next. Then the next.
Ask: “What is the capital of France?” and the response grows one token at a time:
"The"
"The capital"
"The capital of"
"The capital of France"
"The capital of France is"
"The capital of France is Paris"
It really is that incremental. There is no moment where the model writes the full sentence and then hands it over. It commits to one token, looks at everything so far, and picks the next.
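The loop itself is almost trivially simple. Here the “model” is just a lookup table from the last token to its most likely successor, which is a drastic simplification (a real LLM conditions on the entire sequence so far and outputs a probability over every token in its vocabulary), but the generate-one-token-then-repeat structure is faithful:

```python
# A toy next-token predictor: a table mapping the last token to the most
# likely next one. Real models condition on everything generated so far.
most_likely_next = {
    "<start>": "The",
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<end>",
}

tokens = ["<start>"]
while tokens[-1] != "<end>":
    tokens.append(most_likely_next[tokens[-1]])

print(" ".join(tokens[1:-1]))  # The capital of France is Paris
```

Swap the lookup table for a transformer and you have, structurally, how every chatbot reply is produced.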
The full pipeline, end to end
User input
↓
Tokenization
↓
Embeddings (text → vectors)
↓
Transformer + attention
↓
Context understanding
↓
Next-token prediction
↓
Generated response
Why responses feel intelligent
From its training data, the model has absorbed grammar, facts, patterns, reasoning structures, and the rhythms of human conversation. So when it picks the next token, it is choosing the one that is most likely to make sense given everything before it. That is why the output reads like a thoughtful reply, even though, mechanically, it is a very fast game of “guess what comes next.”
Where embeddings show up in the real world
Embeddings are not a curiosity buried inside chatbots. They quietly power a lot of the AI features people use every day, in what are often called downstream tasks: anything built on top of these learned representations.
| Use Case | What embeddings do there |
|---|---|
| Semantic search | Match by meaning, not just keywords |
| RAG systems | Pull in the most relevant documents for a question |
| Recommendations | Surface items similar to what you liked |
| Chatbots | Recognise what the user is actually asking for |
| Vector databases | Store and search large libraries of embeddings |
| Fraud detection | Spot patterns that look suspiciously alike |
If you have ever typed a vague description into a search box and still found the right product, or watched a recommendation feed serve up the next thing you wanted before you knew you wanted it, embeddings were almost certainly involved.
Embeddings in RAG
One use case deserves a closer look: RAG, short for retrieval-augmented generation. It is the pattern behind most enterprise chatbots that can answer questions about a company’s own documents.
The flow is straightforward:
- Documents are converted into embeddings.
- Those embeddings are stored in a vector database.
- When the user asks a question, the question is converted into an embedding too.
- The system finds the document embeddings closest to the question’s embedding.
- Those relevant documents are passed to the LLM.
- The LLM uses them to generate an answer grounded in the company’s actual content.
It is a bit like asking a helpful colleague a question – except the colleague first scans the right shelf of the filing cabinet, pulls out the three most relevant pages, and only then starts talking. Embeddings are what make that scanning step possible.
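The retrieval step in that flow is just the nearest-neighbour search from earlier, applied to documents. Everything below is invented for illustration: in a real system the vectors would come from an embedding model and live in a vector database rather than a Python dict, and the filenames are made up:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Pretend these vectors came from an embedding model; they are invented,
# as are the document names.
documents = {
    "reset_password.md": [0.9, 0.1, 0.0],
    "expense_policy.md": [0.0, 0.2, 0.9],
    "vpn_setup.md":      [0.7, 0.6, 0.1],
}

# The user's question, already embedded: "How do I reset my password?"
question_vector = [0.8, 0.2, 0.1]

# The retrieval step: rank documents by closeness to the question.
ranked = sorted(documents,
                key=lambda d: cosine(question_vector, documents[d]),
                reverse=True)

print(ranked[0])  # the document that gets handed to the LLM as context
```

The top-ranked documents are then pasted into the LLM’s prompt, which is the “pulls out the three most relevant pages” step in the analogy.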
A simpler way to picture it
If all this still feels abstract, here is a friendlier mental model: think of embeddings as GPS coordinates of meaning. Every word, sentence, or document gets dropped onto a giant map. Things that mean similar things land near each other.
dog -> near cat
king -> near queen
apple (company) -> near microsoft
apple (fruit) -> near banana
The closer two points are, the more similar their meaning. Most of what an LLM does – search, recommend, answer, summarise – is, at heart, a question about distance on this map.
In summary
Embeddings are the foundation of how LLMs make sense of language. They turn words into mathematical objects so that transformers can understand context, compare meanings, retrieve related information, and generate replies that feel coherent.
They are not the part of AI that gets the headlines. But strip them away, and the rest of the stack has nothing to stand on. Once you can see the embedding step, a lot of what looks like AI magic starts to look like something more interesting: careful geometry, at scale.