Embeddings, Explained Without the Hype
How large language models turn words into numbers – and why that matters.
Why this matters
If you have ever asked a chatbot a question and felt a small flicker of surprise that it actually understood you, there is a single idea sitting underneath that moment: the embedding. It is not the flashiest concept in AI, and it does not show up in marketing slides. But almost everything an LLM (large language model) appears to “understand” about your words runs through it first.
The intuition is simple. Humans understand words by their meaning. Computers, deep down, only understand numbers. Embeddings are the bridge between the two: they convert text into meaningful numbers.
What is an embedding?
An embedding is a numerical representation of a piece of text. Instead of storing the word “dog” as three letters, the model stores it as a long list of numbers – a vector. A vector is just an ordered list of numbers; you can think of it as coordinates, except in many more dimensions than the three we are used to.
For illustration, the numbers might look something like this:
"dog" -> [0.21, -0.44, 0.89, ...]
"cat" -> [0.19, -0.40, 0.85, ...]
"car" -> [-0.72, 0.11, -0.33, ...]
The interesting part is not the numbers themselves – nobody hand-picks them. It is what those numbers do once you compare them. Words with similar meanings end up with vectors that are close to each other in this space (often called the vector space). Words with unrelated meanings end up far apart.
So in our tiny example:
- “dog” and “cat” land near each other – both are pets, both are animals.
- “dog” and “car” land far apart – one barks, the other needs petrol.
This is what people mean when they say AI understands semantic meaning. It is not really “understanding” in the human sense; it is geometry. Similar meanings sit in similar neighbourhoods. That similarity property is the workhorse behind almost every clever thing LLMs do later.
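That closeness can be measured directly. A common measure is cosine similarity, which compares the directions of two vectors: values near 1 mean similar meanings, values near 0 or below mean unrelated ones. Here is a minimal sketch using the toy vectors from above (the numbers are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, made up for illustration.
dog = [0.21, -0.44, 0.89]
cat = [0.19, -0.40, 0.85]
car = [-0.72, 0.11, -0.33]

print(cosine_similarity(dog, cat))  # very close to 1.0
print(cosine_similarity(dog, car))  # negative: pointing away from each other
```

Everything from semantic search to recommendations is, under the hood, a variation on this one comparison.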
Why embeddings are needed
An LLM cannot directly chew on text. The pipeline always looks roughly the same:
Text → Tokens → Embeddings → Neural network processing
Take the sentence “I love coffee.”
First, it is broken into small chunks of text (tokens), usually a word or part of a word:
["I", "love", "coffee"]
Then each token is turned into an embedding:
"I" -> vector
"love" -> vector
"coffee" -> vector
From this point on, the model is no longer working with letters or words. It is working entirely with numbers it can multiply, add, and compare. Everything downstream – understanding, reasoning, generating a reply – happens in this numerical world.
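The steps above can be sketched in a few lines. The vocabulary and vectors here are invented, and real tokenizers split text into sub-word pieces rather than on whitespace, but the shape of the pipeline is the same:

```python
# A toy version of the text -> tokens -> vectors pipeline.
# The vocabulary and vectors are invented for illustration.
embedding_table = {
    "I":      [0.10, 0.30, -0.20],
    "love":   [0.80, -0.10, 0.40],
    "coffee": [0.05, 0.90, 0.30],
}

def tokenize(text):
    # Real tokenizers use learned sub-word splits; whitespace is enough here.
    return text.replace(".", "").split()

tokens = tokenize("I love coffee.")
vectors = [embedding_table[t] for t in tokens]

print(tokens)  # ['I', 'love', 'coffee']
```

From here on, the model only ever sees the `vectors` list, never the original letters.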
How embeddings are created
Embeddings are not written by hand. They are learned. During training, the model reads billions of sentences and gradually nudges the numbers so that words used in similar contexts end up with similar vectors.
The classic party-trick example shows how much structure this produces:
King - Man + Woman ≈ Queen
Take the vector for “King,” subtract “Man,” add “Woman,” and you land very close to “Queen.” Nobody told the model what royalty or gender are. It just noticed, across enough text, how those words tend to keep company.
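You can see the arithmetic with toy 2-D vectors, chosen by hand so that one axis loosely stands for “royalty” and the other for “gender” (a real model learns hundreds of such directions from data, and the result only lands approximately on “Queen”):

```python
# Hand-picked toy vectors: axis 0 ~ "gender", axis 1 ~ "royalty".
king  = [0.9, 0.8]
man   = [0.9, 0.1]
woman = [0.1, 0.1]
queen = [0.1, 0.8]

# King - Man + Woman, component by component.
result = [k - m + w for k, m, w in zip(king, man, woman)]

# result lands (approximately) on the same point as queen.
print(result)
```

Subtracting “Man” removes the gender direction; adding “Woman” puts a different one back, and the royalty direction is untouched.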
From static to contextual
Older systems gave each word a single, fixed embedding. That works – until you hit a word like “Apple,” which can be a fruit on a kitchen counter or a company in California. A single vector cannot be both.
Modern LLMs solve this with contextual embeddings. The same word gets a different vector depending on the words around it. So:
- In “Apple is tasty,” the embedding for “Apple” leans towards the fruit neighbourhood.
- In “Apple launched the iPhone,” it leans towards the technology-company neighbourhood.
The same trick handles other tricky words too – “bank” as a riverbank versus a place that holds your money, for instance. Context shifts the coordinates.
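One crude way to picture this shift (not how transformers actually compute it, which involves attention layers) is to blend a word’s base vector with the average of its neighbours’ vectors. All the vectors below are invented:

```python
# A crude sketch of contextual embeddings: blend a word's base vector
# with the average of its neighbours'. All vectors here are invented;
# real models do this with stacked attention layers, not simple averaging.
base = {
    "apple":  [0.5, 0.5],
    "tasty":  [0.9, 0.0],   # the "food" direction
    "iphone": [0.0, 0.9],   # the "tech" direction
}

def contextual(word, context, mix=0.5):
    ctx = [sum(base[c][i] for c in context) / len(context) for i in range(2)]
    return [(1 - mix) * b + mix * c for b, c in zip(base[word], ctx)]

print(contextual("apple", ["tasty"]))   # pulled toward the food direction
print(contextual("apple", ["iphone"]))  # pulled toward the tech direction
```

Same word, two different coordinates, depending entirely on what sits next to it.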
How an LLM understands a question
Imagine you ask: “How do I fix slow laptop performance?” Roughly, here is what happens.
1. Tokenization
The sentence is split into tokens:
["How", "do", "I", "fix", "slow", "laptop", "performance"]
2. Embedding layer
Each token is converted into a vector. These vectors carry meaning – not just the identity of the word, but a hint of what kind of word it is and what kinds of words usually surround it.
3. Transformer processing
This is the brain of an LLM. The transformer combines several ideas: an attention mechanism, layers of neural processing, and a lot of pattern matching learned during training. Together, they let the model spot relationships between words even when those words sit far apart in the sentence.
In our laptop example, the model picks up on associations like:
slow ↔ performance
laptop ↔ fix
And quietly assembles a picture: the user has a problem, the device is a laptop, and the issue is about speed.
The attention mechanism
Attention is worth its own moment, because it is the single piece that makes modern LLMs feel so different from older chatbots. The specific flavour used inside transformers is called self-attention.
Self-attention helps the model answer one question for every word: “Which other words in this sentence should I pay attention to in order to understand this one?”
Take a classic ambiguous sentence:
“The animal didn’t cross the road because it was tired.”
What does “it” refer to? A human reads this and instantly thinks: the animal. The model arrives at the same answer because attention links “it” back to “animal” rather than “road.” Multiply that little decision across every word in every sentence, and you start to see how a stack of these layers can build up a surprisingly nuanced view of language.
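The core of self-attention is scaled dot-product scoring: compare a word’s vector against every other word’s vector, then turn the scores into weights that sum to 1. The vectors below are invented so that “it” points mostly in the “animal” direction; real models compute separate query and key projections first, which this sketch omits:

```python
from math import exp, sqrt

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(query, keys):
    """Scaled dot-product: how much the query word attends to each key word."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    return softmax(scores)

# Invented vectors: "it" leans toward the "animal" direction.
vectors = {
    "animal": [0.9, 0.1],
    "road":   [0.1, 0.9],
    "it":     [0.8, 0.2],
}

weights = attention_weights(vectors["it"], [vectors["animal"], vectors["road"]])
print(weights)  # "it" attends more to "animal" than to "road"
```

The first weight (for “animal”) comes out larger, which is exactly the little decision that resolves the pronoun.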
How an LLM generates a response
Once the model has “understood” the input, generation is almost anti-climactic. It predicts the next token. Then the next. Then the next.
Ask: “What is the capital of France?” and the response grows one token at a time:
"The"
"The capital"
"The capital of"
"The capital of France"
"The capital of France is"
"The capital of France is Paris"
It really is that incremental. There is no moment where the model writes the full sentence and then hands it over. It commits to one token, looks at everything so far, and picks the next.
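The loop itself is almost trivially simple. Here the “model” is just a lookup table from the last token to its most likely successor, which is a drastic simplification (a real LLM conditions on the entire sequence so far and outputs a probability over every token in its vocabulary), but the generate-one-token-then-repeat structure is faithful:

```python
# A toy next-token predictor: a table mapping the last token to the most
# likely next one. Real models condition on everything generated so far.
most_likely_next = {
    "<start>": "The",
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<end>",
}

tokens = ["<start>"]
while tokens[-1] != "<end>":
    tokens.append(most_likely_next[tokens[-1]])

print(" ".join(tokens[1:-1]))  # The capital of France is Paris
```

Swap the lookup table for a transformer and you have, structurally, how every chatbot reply is produced.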
The full pipeline, end to end
User input
↓
Tokenization
↓
Embeddings (text → vectors)
↓
Transformer + attention
↓
Context understanding
↓
Next-token prediction
↓
Generated response
Why responses feel intelligent
From its training data, the model has absorbed grammar, facts, patterns, reasoning structures, and the rhythms of human conversation. So when it picks the next token, it is choosing the one that is most likely to make sense given everything before it. That is why the output reads like a thoughtful reply, even though, mechanically, it is a very fast game of “guess what comes next.”
Where embeddings show up in the real world
Embeddings are not a curiosity buried inside chatbots. They quietly power a lot of the AI features people use every day, in what are often called downstream tasks: anything built on top of these learned representations.
| Use Case | What embeddings do there |
|---|---|
| Semantic search | Match by meaning, not just keywords |
| RAG systems | Pull in the most relevant documents for a question |
| Recommendations | Surface items similar to what you liked |
| Chatbots | Recognise what the user is actually asking for |
| Vector databases | Store and search large libraries of embeddings |
| Fraud detection | Spot patterns that look suspiciously alike |
If you have ever typed a vague description into a search box and still found the right product, or watched a recommendation feed serve up the next thing you wanted before you knew you wanted it, embeddings were almost certainly involved.
Embeddings in RAG
One use case deserves a closer look: RAG, short for retrieval-augmented generation. It is the pattern behind most enterprise chatbots that can answer questions about a company’s own documents.
The flow is straightforward:
- Documents are converted into embeddings.
- Those embeddings are stored in a vector database.
- When the user asks a question, the question is converted into an embedding too.
- The system finds the document embeddings closest to the question’s embedding.
- Those relevant documents are passed to the LLM.
- The LLM uses them to generate an answer grounded in the company’s actual content.
It is a bit like asking a helpful colleague a question – except the colleague first scans the right shelf of the filing cabinet, pulls out the three most relevant pages, and only then starts talking. Embeddings are what make that scanning step possible.
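The retrieval step in that flow is just the nearest-neighbour search from earlier, applied to documents. Everything below is invented for illustration: in a real system the vectors would come from an embedding model and live in a vector database rather than a Python dict, and the filenames are made up:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Pretend these vectors came from an embedding model; they are invented,
# as are the document names.
documents = {
    "reset_password.md": [0.9, 0.1, 0.0],
    "expense_policy.md": [0.0, 0.2, 0.9],
    "vpn_setup.md":      [0.7, 0.6, 0.1],
}

# The user's question, already embedded: "How do I reset my password?"
question_vector = [0.8, 0.2, 0.1]

# The retrieval step: rank documents by closeness to the question.
ranked = sorted(documents,
                key=lambda d: cosine(question_vector, documents[d]),
                reverse=True)

print(ranked[0])  # the document that gets handed to the LLM as context
```

The top-ranked documents are then pasted into the LLM’s prompt, which is the “pulls out the three most relevant pages” step in the analogy.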
A simpler way to picture it
If all this still feels abstract, here is a friendlier mental model: think of embeddings as GPS coordinates of meaning. Every word, sentence, or document gets dropped onto a giant map. Things that mean similar things land near each other.
dog -> near cat
king -> near queen
apple (company) -> near microsoft
apple (fruit) -> near banana
The closer two points are, the more similar their meaning. Most of what an LLM does – search, recommend, answer, summarise – is, at heart, a question about distance on this map.
In summary
Embeddings are the foundation of how LLMs make sense of language. They turn words into mathematical objects so that transformers can understand context, compare meanings, retrieve related information, and generate replies that feel coherent.
They are not the part of AI that gets the headlines. But strip them away, and the rest of the stack has nothing to stand on. Once you can see the embedding step, a lot of what looks like AI magic starts to look like something more interesting: careful geometry, at scale.