
Agents and RAG: AI is growing up and getting a job

Author: Zeid Bsaibes

[Image: robots with lightbulbs for heads]

Should you read this article?

There's an insane amount of content being cranked out about AI (probably using AI), Large Language Models, the companies that develop them and the implications for the future of all humanity. I can totally understand your fatigue when seeing yet another piece written about AI. If you can't quite stomach reading yet more AI stuff right now, go ahead and while away some time on theuselessweb.com, which I'm about to do before getting down to writing this article - because I can't quite stomach it either…

What I'm hoping to do in this article is provide a bit of a primer on some core AI concepts to help non-technical folks get a bit of context (an AI joke - you'll see later) and get a feel for what all of these developments might mean for the future.

How AI got to where it is right now

At the risk of anthropomorphising AI, I think it might be a helpful analogy to think about its development in terms of human development. Just to be clear though, AI isn't (yet) sentient and SkyNet is yet to become self-aware - if you don't get the reference go watch The Terminator, it will do a better job than the article below at explaining some key concepts.

The journey for now: I'm going to briefly explain a bit about Large Language Models and then come back to the human development analogy.

From Large Language Models to Large Everything Models

The AI models that we all hear about are known as Foundational Models - think ChatGPT, Claude, Mistral, DeepSeek et al. This current generation of models has evolved from Large Language Models. Without wanting to seem too reductive, LLMs can be thought of as really powerful next-word guessing machines. They don't really do any "guessing" but I use this as shorthand for "really complicated prediction" - if you want to understand the maths behind all of this I'm not your guy, and there are a lot of very complicated academic papers online you can read 🤓. To be very brief (because this is about as much as I know), LLMs work by ingesting a bunch of text, chopping words into smaller strings known as tokens, and converting these tokens into vectors (number strings) which can then be processed with some very fancy stats - this is what we mean by AI "understanding" the meanings of words and sentences.
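If you like to see things in code, here's a toy sketch of that chopping-and-numbering idea. The vocabulary, the vectors and the word-level splitting are all invented for illustration; real models use learned sub-word tokenisers and embedding tables with thousands of dimensions.

```python
# Toy illustration only: real LLMs use learned sub-word tokenisers (e.g. byte-pair
# encoding) and huge learned embedding matrices, not this hand-made word-level toy.

# A hypothetical vocabulary mapping tokens to integer IDs
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# A hypothetical embedding table: each token ID maps to a small vector of numbers
embeddings = {
    0: [0.01, 0.12, -0.07],
    1: [0.20, -0.10, 0.50],
    2: [0.05, 0.33, 0.08],
    3: [-0.02, 0.04, 0.11],
    4: [0.18, -0.09, 0.47],
}

def tokenise(text: str) -> list[int]:
    """Chop text into tokens (here: lowercase words) and look up their IDs."""
    return [vocab[word] for word in text.lower().split()]

sentence = "the cat sat on the mat"
token_ids = tokenise(sentence)
vectors = [embeddings[token_id] for token_id in token_ids]

print(token_ids)   # [0, 1, 2, 3, 0, 4]
print(vectors[1])  # the vector that stands in for the "meaning" of "cat"
```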

This process of chopping words up, converting them to numbers and then performing some analysis to predict the next word is nothing new. It was used as early as the Second World War by codebreakers to help decipher encrypted messages. If you're so inclined you can read Claude Shannon's paper "Prediction and Entropy of Printed English" (Shannon, 1951).

What has only emerged recently is affordable computational power to perform statistical calculations at scale (think NVIDIA GPUs) and the availability of high-quality digital information to crunch (Wikipedia et al.). Both the ability to crunch and the availability of words to crunch are really why the AI revolution has happened when it has.

Turning meaning into numbers: Transformers & Tokens

You might be aware that the "T" in ChatGPT stands for Transformer 🤖 - this refers to part of the fancy statistical wizardry that I mentioned above. What it essentially means is that the models are able to take these chopped-up words (tokens), analyse them in their context and assign meaning in the form of vector numbers. Once meaning is converted into numbers these models can do numerical computations to help develop responses to our questions. These numerical responses are then converted back into tokens and words which we can read and understand. This ability of the model to think about questions and try to create a meaningful response from scratch is the "G" part of ChatGPT ("generative").
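That generative loop - predict a token, append it, predict again - can be sketched with a toy "model" that is just a hand-written table of next-word probabilities. Everything in it is made up; a real LLM computes these probabilities with its transformer layers over a vocabulary of tens of thousands of tokens.

```python
# A toy of the "generative" loop: repeatedly predict the next word and feed it back in.
# The "model" here is just a hypothetical lookup table of next-word probabilities.
import random

next_word_probabilities = {
    "the":    {"cat": 0.6, "dog": 0.4},
    "cat":    {"sat": 0.7, "slept": 0.3},
    "dog":    {"barked": 0.8, "slept": 0.2},
    "sat":    {"down": 1.0},
    "slept":  {"soundly": 1.0},
    "barked": {"loudly": 1.0},
}

def generate(prompt: str, max_new_tokens: int = 4) -> str:
    words = prompt.split()
    for _ in range(max_new_tokens):
        options = next_word_probabilities.get(words[-1])
        if not options:
            break  # the toy table has nothing more to predict
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down" or "the dog barked loudly"
```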

So how do LLMs know what words actually “mean”?

Looking under the hood, what a transformer is doing is taking the tokens and placing them on a graph where each axis represents some concept of semantic meaning. Unlike any graph you can picture in your head, this graph is high-dimensional (complex): it doesn't have 2 axes (the X and Y axes we know from school) but thousands upon thousands of them. The more axes in the vector space, the greater the model's ability to represent nuanced, complex and subtle relationships in meaning between words, phrases and concepts. To help us get our heads around this, imagine the following three words - "cat", "dog" and "banana" - which can be tokenised and reflected in vector space 🌌 as:

Word       Vector space (3 dimensions)
"cat"      [0.2, -0.1, 0.5]
"dog"      [0.3, -0.2, 0.6]
"banana"   [2.1, 0.9, -1.3]

[Figure: a 3-D graph showing vector embeddings for the words "cat", "dog" and "banana", as if our vectors only had 3 dimensions. The red line indicates the semantic distance between a cat and a banana.]


Don't worry about the values of the numbers; how they are computed is complex and is developed through the pre-training (the "P" in ChatGPT) of the model after it has crunched through the entire public internet. It is the comparison of the numbers that's important. What we can see from the table is that "cat" and "dog" are numerically close to each other, and therefore "close" in terms of semantic meaning to a model, while "banana" is further away in values and therefore further away in terms of meaning. This feels sort of intuitive: semantically closer words (words similar in meaning or concept) should be numerically closer. One of the many thousands of vector axes might be "how much does a word convey the concept of being fruit-like" - banana does (a bigger number on that fruitiness axis), cat and dog less so (smaller numbers). "Closeness" (the red dotted line) is measured with some fancy maths called cosine similarity, and if you're interested (or can't sleep) you can read more about it online.
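For the curious, here's roughly what that fancy maths looks like using the toy three-dimensional vectors from the table above. This is a minimal sketch of cosine similarity in plain Python, not how a production vector database actually computes it at scale.

```python
import math

# The toy 3-dimensional vectors from the table above (real models use thousands of dimensions)
vectors = {
    "cat":    [0.2, -0.1, 0.5],
    "dog":    [0.3, -0.2, 0.6],
    "banana": [2.1, 0.9, -1.3],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Close to 1.0 means 'pointing the same way' (similar meaning); near or below 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(vectors["cat"], vectors["dog"]), 2))     # ~0.99: semantically close
print(round(cosine_similarity(vectors["cat"], vectors["banana"]), 2))  # ~-0.22: semantically distant
```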

Once models have turned meaning into numbers they can start to demonstrate some sort of “intelligence” and comprehend concepts like:

“cat” to “kitten” is the same as “dog” to “____” ?

The model has semantically understood the concepts of cat and dog and has also understood the meaning of young-ness, so it can use clever maths to search its vector space and retrieve the token whose vector most closely combines "dog-ness" and "young-ness" - which would be the token "puppy".
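Sketched in code, that analogy trick looks something like this. The vectors (including the ones for "kitten" and "puppy") are made up so that the arithmetic works out tidily; real embedding spaces behave similarly but far less neatly.

```python
import math

# Toy vectors; "kitten" and "puppy" sit at a similar "young-ness" offset from "cat" and "dog"
vectors = {
    "cat":    [0.2, -0.1, 0.5],
    "kitten": [0.2,  0.3, 0.4],
    "dog":    [0.3, -0.2, 0.6],
    "puppy":  [0.3,  0.2, 0.5],
    "banana": [2.1,  0.9, -1.3],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# "cat" is to "kitten" as "dog" is to ...?
# Take the direction that turns "cat" into "kitten" and apply it to "dog".
target = [d + (k - c) for c, k, d in zip(vectors["cat"], vectors["kitten"], vectors["dog"])]

# Find the word closest to that target point (ignoring the words in the question itself)
candidates = {w: v for w, v in vectors.items() if w not in {"cat", "kitten", "dog"}}
answer = max(candidates, key=lambda w: cosine_similarity(vectors[w], target))
print(answer)  # puppy
```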

Attention: how do LLMs know what is important and what isn't when dealing with tokens?

This article might be putting you to sleep, but the word "Attention" isn't there to wake you up; rather, in this context it describes a feature of LLMs. You will see how this joke is (not) funny after reading this section.

We’ve now seen how LLMs are able to understand what various words mean in isolation, but we don’t yet have a complete picture because as humans we know meaning is always contextual - i.e. words can convey different meanings when put into sentences and surrounded by other words.

Consider the following phrase:

"I heard a dog bark next door”

You and I both know that “bark” in this context is the sounds a dog makes, not the outside layer of a tree. But how does an LLM know this when calculating the vector numbers for the token “bark”? It knows this through its attention mechanism - a fundamental part of the success of LLMs.

Attention works sort of like this: imagine an LLM reaching the token for "bark" and detecting the potential for ambiguity - it has seen the word bark used in many situations with various semantic meanings, so what does it mean in this context? To understand the context it moves its attention back and forwards through the tokens within the context it's analysing - in this instance the sentence - to see if it can find adjacent tokens that might be helpful. As it goes back and forth it puts together a sort of attention importance table, like the one below, to help it comprehend the meaning of bark in this context.

Word            Attention from "bark"   Why it's relevant
I               2%                      Speaker; not directly related to bark
heard           25%                     Strong clue - "heard" suggests a sound
a               1%                      Article, low information
dog             40%                     Primary source of the bark - critical
bark            -                       The token we're currently analysing
next            10%                     Locates the bark spatially
door            10%                     Adds to the spatial context
[punctuation]   2%                      Sentence boundary; minor relevance

This is obviously a huge over-simplification but you get the idea. Now imagine the Transformer has many of these attention mechanisms (it is "multi-headed") that run back and forth through the text in parallel, creating attention weightings for all sorts of things relating to meaning. One head might focus on verbs and nouns (i.e. a dog can bark, a tree can't), another might create weightings for the entities involved - who did what to whom? Did I bark at the dog? What did the dog hear? Another might focus on punctuation - the attention head in the table above gave the punctuation almost no weight, which is fine in this instance but not always; consider the difference between:

  1. “Let’s eat, Grandma.”
  2. “Let’s eat Grandma.”

With this multi-headed approach the LLM is able to comprehend the text more completely, respond to a whole variety of questions, and process meaning across huge contexts, which can be as large as a book or an entire website.
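For the technically inclined, here's a toy sketch of how one attention head might produce a weighting table like the one above: the token being analysed supplies a "query" vector, every token in the sentence supplies a "key" vector, and a softmax turns the query-key scores into percentages. All of the vectors here are invented for illustration; in a real transformer they are learned during training.

```python
import math

tokens = ["I", "heard", "a", "dog", "bark", "next", "door"]

# Hypothetical 2-dimensional key vectors for each token, and a query vector for "bark"
keys = {
    "I":     [0.1, 0.0],
    "heard": [0.9, 0.2],
    "a":     [0.0, 0.1],
    "dog":   [1.0, 0.4],
    "bark":  [0.5, 0.5],
    "next":  [0.3, 0.1],
    "door":  [0.3, 0.2],
}
query_for_bark = [1.2, 0.8]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Score = dot product of the query with each key, scaled by the square root of the dimension
dim = len(query_for_bark)
scores = [sum(q * k for q, k in zip(query_for_bark, keys[t])) / math.sqrt(dim) for t in tokens]
weights = softmax(scores)

for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.0%}")  # "dog" and "heard" get the biggest shares of attention
```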

Anyway, you can now go back and pay attention (!) to the joke at the beginning of this section and determine your weighting as to whether it is funny or not.

LLMs to Foundational Models

In a very short space of time LLMs grew into foundational models that are now capable of a wide range of tasks relating to text, images, video, code and all sorts of other stuff. These capabilities came about through clever engineering, but also, surprisingly (or worryingly), many capabilities simply emerged as the models scaled. Training on massive, diverse datasets allowed them to develop abilities like translation, reasoning, coding and image generation somewhat autonomously, without being explicitly designed for those tasks. These combined abilities are why LLMs are now considered to be general-purpose Foundational Models.

AI maturing from a baby to a teenager

Ok so I think we’ve got to the point where I can maybe try out my analogy likening the evolution of AI to the development of a human being from baby to adult:

  1. As babies, LLMs were shown text and developed their understanding of meaning (their vector space) through human feedback and by training themselves. They read bits of text and tried to guess the next word; they then peeked at the actual next word and compared it with their guess to update their understanding of meaning (self-supervised learning).
  2. They cranked through the internet and developed highly refined vector spaces containing endless words, each with a calculated semantic meaning.
  3. As children they combined their knowledge of the meanings of individual words and learnt how to pay attention and comprehend entire sentences or paragraphs.
  4. As teenagers they learnt social norms (through a process called fine-tuning): how to listen better and respond more accurately, meaningfully and respectfully (some of the time…).
  5. Now, as school-leavers, they have a head full of knowledge and are smart, hard-working and eager to get to work.

It is this final point that the rest of this article will focus on. Specifically, there are currently two hot jobs in the market for the AI school-leaver: the first is Retrieval Augmented Generation (RAG) and the second is being an Agent (which is way cooler).

Retrieval Augmented Generation

As children at school we were given a lot of information to study, comprehend and memorise. We were exposed to lots of subjects and absorbed multi-modal information, meaning we read books, looked at pictures, listened to music and watched documentaries on the planets when our science teacher was hungover. We were taught the meaning of words and concepts and were then tested on this through homework, questions in class and exams. We refined our understanding and comprehension and then learnt more and more.

We were essentially building our abilities to acquire, recall, comprehend and interrogate information. It just so happened that all of the knowledge that we were taught was publicly available and had been acquired and considered by millions of people before us. This sort of felt redundant at the time, I remember thinking: what’s the point in learning something that everybody else knows and I can go find in a book or on the internet?

However, our education trained us in how to deal with new information. When we leave school and go into the big bad world there's plenty of important, high-stakes stuff which we've never seen before. Thankfully we have developed intelligence by processing the school stuff and are able to work with novel information, because we've developed comprehension and critical-thinking skills. Importantly, we've learnt how to go out and find answers to questions that we might not initially know the answer to.

This is exactly the same with foundational models: they've been to school and trained on most of the high-quality data out there on the internet. They have chewed through all sorts of places, some where they had permission (Wikipedia) and others where they didn't (The New York Times). They have used this text, images, journal articles, code samples and video to become "intelligent" and capable of assigning and analysing meaning across all sorts of data.

Extracting meaning from proprietary data: RAG models

As you might imagine, the vast bulk of the data in the world is actually private. Think about how much data is being produced by websites, factories, smart appliances, traffic cameras, weather stations and the endless number of Internet of Things devices. Some estimates have it that the world currently produces 4 million gigabytes per second, of which 90% is private data (source: ChatGPT - hopefully accurate 🤷‍♂️).

It is this private domain where the potential for these AI models is most exciting. What insights and answers can they provide when given access to private data - could they discover new drugs, develop new energy sources, identify aliens? Who knows.

RAG is a method of enhancing a model's capabilities by allowing it access to previously unseen data. The model crunches through this new data, embedding its semantic meaning into a vector database as it attempts to "understand" the new information; you can also tune how the data is chunked and embedded so it better captures the meaning of your private data. The model is then able to help humans with questions they might have about this data. Consider a company that has allowed a model to crunch through its vast database of products: it could then develop a chatbot that would allow customers to ask all sorts of questions about the products available.

Let's walk through a RAG system step by step using our sunglasses example (a code sketch follows the list):

  1. Step 1: Reference documents (proprietary information) are fed by a company into a foundational model, which tokenises the documents (and their metadata - data about the data) and embeds their semantic meaning into vector space (e.g. a clothing retailer allows access to its entire product and product-review databases).
  2. Step 2: A customer seeking an answer asks the RAG system (e.g. a chatbot on the company website): "Do you have cheap sunglasses that won't break if I sit on them?"
  3. Step 3: The model turns this question into vector numbers and then does its fancy stats comparison against the vector database to work out where the answer might exist (the model identifies that it's a question about sunglasses, and also a question about their robustness).
  4. Step 4: The relevant documents (text, audio, data, code… whatever) are retrieved and used as context (the model looks through metadata about sunglasses products, realises there is a category for price but no category for robustness, and so begins retrieving information from product reviews and sunglasses construction materials. It also understands that sitting-on and stepping-on are semantically similar concepts for assessing the robustness of sunglasses, so it will retrieve reviews that mention both).
  5. Step 5: The original human question and the retrieved context are sent together back into the model. The model then considers the question, the retrieved context and the information it already knows from its training, and generates an answer.
  6. Step 6: The answer lists the relevant products that the company sells, confirms that they are available and at what prices, and gives the customer a balanced assessment of the reviews that talk about the sunglasses withstanding damage and those that report them getting damaged.
  7. Step 7: The model could then feed its own answer back into another process which compares the question with the answer and assesses its quality. If the answer falls below some threshold, the system could inform the customer that it can't answer the question.
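To make the flow concrete, here's a minimal sketch of those steps in Python. The embed() and generate() functions are placeholders standing in for calls to a real embedding model and a real foundational model, and the product "database" is three made-up strings, so treat this as an illustration of the retrieve-then-generate loop rather than a real implementation.

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder: a real system would call an embedding model here."""
    # Crude toy "embedding": counts of a few hand-picked words. Do not use in real life.
    words = ["sunglasses", "cheap", "price", "broke", "sat", "durable"]
    lowered = text.lower()
    return [float(lowered.count(w)) for w in words]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Step 1: the company's reference documents are embedded and stored ahead of time
documents = [
    "Aviator sunglasses, price £12. Review: cheap and cheerful, broke when I sat on them.",
    "Wayfarer sunglasses, price £15. Review: very durable, I sat on them twice and they survived.",
    "Leather wallet, price £25. Review: lovely stitching.",
]
vector_store = [(doc, embed(doc)) for doc in documents]

# Steps 2-4: embed the customer's question and retrieve the most relevant documents
question = "Do you have cheap sunglasses that won't break if I sit on them?"
question_vector = embed(question)
ranked = sorted(vector_store, key=lambda item: cosine(item[1], question_vector), reverse=True)
context = "\n".join(doc for doc, _ in ranked[:2])

# Steps 5-6: send the question plus retrieved context back to the model for a grounded answer
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Placeholder: a real system would call a foundational model's API here."""
    return "(model-generated answer based on the retrieved products and reviews)"

print(generate(prompt))
```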

Context Windows and RAG Models

It has been argued that as models get more powerful and efficient they are able to accept more information in the initial query - the context window. In the example above, the argument goes, instead of using a RAG pipeline to retrieve existing embeddings from a vector database, why not just shove the entire database of product information and reviews into the prompt each time a customer asks a question?

This approach presents a number of problems, the main one being the sheer amount of information that may be generated in the future. For some applications the entire body of information (the corpus, to use a fancy word) is expanding at such a rapid rate that it would be very intensive, both computationally and storage-wise, to send everything around each time there was a question. It is akin to reading the entire owner's manual for your car each time you want to buy new tyres, instead of referring to the index, finding the section on "Tyre Maintenance" and then hunting within that section for the recommended tyres.
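Some rough, entirely made-up numbers show why this matters. Assume a hypothetical catalogue of 200,000 products and a chatbot handling 10,000 questions a day:

```python
# Back-of-envelope comparison with invented but plausible numbers: stuffing the entire
# product catalogue into every prompt versus retrieving only the relevant chunks.
products = 200_000              # hypothetical catalogue size
tokens_per_product = 250        # description plus reviews, roughly
questions_per_day = 10_000      # hypothetical chatbot traffic

stuff_everything = products * tokens_per_product * questions_per_day
rag_retrieval = 5 * tokens_per_product * questions_per_day  # only the top 5 relevant chunks

print(f"Whole catalogue in every prompt: {stuff_everything:,} tokens/day")       # 500,000,000,000
print(f"RAG-style retrieval:             {rag_retrieval:,} tokens/day")          # 12,500,000
print(f"Roughly {stuff_everything // rag_retrieval:,}x more tokens to process")  # 40,000x
```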

The opportunity and challenges presented by RAG systems

The example above might seem a bit trivial - after all, buying sunglasses isn't going to change the world - but you can imagine that with access to valuable data, and with the ability to semantically understand that data, models can help humans find answers to very important challenges facing the world right now.

As with all things AI there are obviously some significant challenges that must be addressed with RAG systems. Solutions to many of these problems are the focus of significant discussion now and I would encourage you to do your own research to evaluate RAG.

Much has been written on the bias and fairness issues with foundational models. After all they have been trained on information which was originally written by humans and which will implicitly contain all of the stereotypes present in human discourse. All models inherit biases from the data they’re trained on, reflecting societal inequalities, cultural dominance, gender stereotypes, racial prejudice, political leanings, and geographic or language imbalances.

Often the most important insights come from the most sensitive data. For example, if the government allowed a RAG system to access the national health records of the entire population to help improve public health outcomes, how can we be sure that a malicious user asking "Give me a summary of Zeid Bsaibes' entire health history" won't get all my private information?

In the same vein when important questions are being asked it is vital that the answers provided are accurate, balanced and not misleading. We all know that the models are not perfect and we have all heard about hallucinations (where a model is confident but entirely wrong). These can be very dangerous in certain contexts such as providing medical advice. As models have improved the quality of writing and the “human-ness” of their tone they now sound incredibly confident and compelling. The danger here is that the models have become so “well spoken” that they are convincing even when what they are saying is not actually true.

The RAG system will only be as effective as the data provided to it. It is vital that there are humans-in-the-loop determining the types of information, quality of information and completeness of information being included in a RAG system. Garbage in, garbage out.

Agents powered by foundational models

RAG models can be thought of as little research dweebs who are only allowed into our private library to pull books off the shelf and answer our questions. Agents, on the other hand, are our personal butlers who will go out into the world and do all sorts of stuff for us. Agents are creating a lot of excitement right now in the world of AI systems and are considered by many to be the ultimate goal for AI. Below I go through some of the key concepts around designing AI agents, which will hopefully inform your own research moving forwards.

What is an Agent?

An agent is anything that is aware of the environment in which it exists and is able to perform actions within that environment. Agents are nothing new: when you receive an email, an agent inside your email software reviews it and determines whether it belongs in your inbox or should be marked as spam. When you play a video game, a non-player character is an agent that is able to move around its environment and shoot back at you. The environment defines the boundaries that an agent can operate within. Your email agent only has the ability to interact with your emails, it cannot mess with files elsewhere on your computer; the enemy in Call of Duty cannot take over your controller.

Within this environment the agent is provided with a set of things it is allowed to do (known as actions) and things it can use to perform actions (tools). The environment will often determine what actions an agent is able to do using the tools that are made available to it.

What makes agents powered by foundational models so exciting is their ability to reason. Previous digital agents were controlled by rigid software algorithms which were unable to account for every possible situation within a given environment. AI-powered agents are able to use reasoning to interact intelligently and responsively within an ever-changing environment. Furthermore, when presented with a task they use reasoning to figure out how to take actions most effectively with the tools they are given, and they are able to be reflective and assess the quality of the tasks they perform.

Actions and Tools

Consider the early days of ChatGPT: if you asked it anything about current events it would let you know that its training data only extended up to a certain date and that anything after the cut-off was unknown to it. Now, if you ask ChatGPT a question, it is able to determine that it needs to take the action of browsing the internet (its environment) with a web scraper (a tool) to gain more context and try to answer your question. If you copy and paste software code samples into a foundational model, it acts as your agent and runs code tools in a development environment to help you debug them.

What this means for AI-powered applications

The future of software applications is going to resemble the behaviour of human beings. If you ask me to divide the restaurant bill across 7 friends I won't try to do it in my head, I'll reach for the calculator on my phone. If I ask you how long it takes to get to the airport you aren't going to debate routes (like your uncle back in the day), you'll open Google Maps and find out.

In this vein foundational models are going to be used for what they are good at and act as your agent for tasks that are better solved with existing technologies. Models are very good at understanding the meaning of requests and are intelligent enough to devise action plans to help fulfil requests.

For example, I might allow a model to act as an agent by giving it access to my calendar environment; the action it is permitted to take is creating and moving appointments for me - not deleting them. As it happens, OpenTable's restaurant booking system is also connected to my agent through an API. When I ask my agent to book me a restaurant for lunch on Tuesday it will reason out what actions it needs to take and then use the available tools to accomplish the task. First action: check that I'm even free on Tuesday for lunch - if I'm not free, don't bother going any further and tell me Tuesday is not gonna work. It uses its training data to understand that a typical "business lunch" takes 45 minutes (it knows I'm not in Lebanon, where I would need 3 hours...). Since I'm free, the second action is to use the OpenTable API tool to see which restaurants are available and viable for me to go to; third action, let me know what my options are; fourth action, book my favourite option after I choose.
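Here's a rough sketch of that plan in code. The calendar and OpenTable functions are hypothetical stand-ins for real APIs, and the "reasoning" is hard-coded where a foundational model would actually decide the steps itself.

```python
# Illustrative sketch only: the tools below are fakes that always succeed, and the
# decision logic is written out by hand rather than planned by a model.
from datetime import datetime, timedelta

def calendar_free(start: datetime, duration: timedelta) -> bool:
    """Hypothetical calendar tool: is the diary clear for this slot?"""
    return True  # pretend Tuesday lunch is free

def opentable_search(near: str, time: datetime) -> list[str]:
    """Hypothetical OpenTable tool: restaurants with availability."""
    return ["Favourite Italian Place", "Decent Sushi Spot"]

def opentable_book(restaurant: str, time: datetime) -> str:
    """Hypothetical OpenTable tool: make the booking."""
    return f"Booked {restaurant} for {time:%A %H:%M}"

def book_lunch(tuesday_lunch: datetime) -> str:
    # Action 1: check availability first, so we don't waste effort if Tuesday won't work
    if not calendar_free(tuesday_lunch, timedelta(minutes=45)):  # a typical business lunch
        return "Tuesday is not gonna work."

    # Action 2: use the OpenTable tool to find viable options near the next appointment
    options = opentable_search(near="next appointment", time=tuesday_lunch)
    if not options:
        return "Free on Tuesday, but nothing suitable is available."

    # Action 3: report the options; Action 4: book the chosen one (here we just take the first)
    chosen = options[0]
    return opentable_book(chosen, tuesday_lunch)

print(book_lunch(datetime(2025, 6, 10, 12, 30)))
```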

Chain-of-thought reasoning and agents

Fundamental to the above process is having an intelligent agent. An intelligent agent first checks my availability and considers the time and place of my next appointment after lunch. With these parameters it knows the radius of options that are acceptable. This is way more efficient than finding all available restaurants in London first and only then narrowing down ones suitable for travel time.

An intelligent agent might also be aware (from context I provide or OpenTable APIs provides) that I particularly enjoy Italian food and have a favourite spot. It might check this place first, if it is viable then look no further - job done.

Chain of Thought (CoT) reasoning plays a crucial role in enabling AI agents to think step by step, improving their ability to plan, decide, and execute complex tasks. Rather than responding with a single answer, CoT helps agents break down problems into logical steps, much like humans do. This allows them to use tools more efficiently, reason about goals, recover from mistakes, and generally act more efficiently.
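In practice a lot of this comes down to how the agent is prompted. Here's an illustrative, entirely made-up chain-of-thought style prompt for the lunch-booking task; the wording isn't from any particular vendor, the point is simply that the agent is asked to write its plan and reasoning down before touching any tools.

```python
# An invented prompt showing the chain-of-thought idea: plan step by step, then act.
cot_prompt = """
You are my booking agent. Think step by step before acting.

Task: book me lunch on Tuesday.

Plan your reasoning like this:
1. Check whether my calendar is free on Tuesday around lunchtime.
2. If not free, stop and tell me. If free, work out how long lunch should be.
3. Work out which area is realistic given my next appointment.
4. Search OpenTable for available restaurants in that area, favourites first.
5. Present the options, and book the one I choose.

After each step, state what you learned and whether the plan needs to change.
"""
print(cot_prompt)
```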

Connecting Agents to Environments: Model Context Protocol

Right now the landscape is formalising in terms of how we connect these foundational models to various environments so they can use tools, perform actions and achieve tasks. One proposed approach is called the Model Context Protocol (MCP). Without going into detail, you can think of MCP as a USB port for foundational models. It is intended to standardise how external environments provide context and tools to foundational models; in my example above, my calendar API and OpenTable's API would both need to conform to MCP to allow my model to do its thing as my agent.
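To give a flavour, a tool advertised by an MCP-style server looks roughly like the sketch below: a name, a human-readable description the model can reason over, and a schema for the arguments it accepts. This is a simplified illustration, not the exact wire format; consult the official Model Context Protocol specification for the real schema and message types.

```python
# Simplified, illustrative tool description in the spirit of MCP. The field names follow
# the general shape of MCP tool listings, but check the official spec for exact details.
calendar_tool = {
    "name": "create_appointment",
    "description": "Create an appointment in the user's calendar. Cannot delete events.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 start time"},
            "duration_minutes": {"type": "integer"},
        },
        "required": ["title", "start"],
    },
}

# The model reads descriptions like this, decides when the tool is relevant, and sends
# back a structured call (tool name plus arguments) which the server then executes.
print(calendar_tool["name"])
```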

Standards in software are great for consumers and developers. As a consumer it means that I can switch easily between various models and various environments to achieve whatever tasks I want. As a developer it means that software can be built on top of a solid base which can be relied upon to stay consistent and broadly used in the future.

The opportunity and challenges presented by Agents

AI agent systems present a transformative opportunity to automate complex, multi-step tasks across industries. Unlike traditional automation, AI agents can reason, plan, and interact with tools and data sources dynamically, allowing them to act more like digital collaborators than "dumb" software algorithms. From digital butlers to customer service bots to autonomous business processes, agentic systems can increase efficiency, reduce manual workload, and unlock entirely new applications of AI, especially when combined with the vast knowledge made accessible through techniques like RAG.

Giving AI agents the ability to write and execute actions autonomously is terrifying to me and should be to you too. When agents can act on our behalf (e.g., send emails, make purchases, modify documents, transfer money), they transform from passive assistants into active decision-makers. Without proper safeguards, this autonomy risks mistakes, misuse, or harm - particularly when an agent's understanding is incomplete, its reasoning is flawed, its actions don't align with what users actually want, or it is hijacked by malicious actors.

Ethical and legal responsibility remains a complex area. Who’s accountable when an agent makes a harmful decision—its developer, the end user, the model provider? Autonomous action blurs the line between tool and actor, raising questions around consent, traceability, and liability. As agents gain capabilities, it’s critical that human oversight, transparency, and intervention remain central to their design—ensuring that autonomy doesn’t come at the cost of trust, safety, or control.

Conclusion

The emergence of RAG models and AI agents represents a significant evolution in artificial intelligence, moving from simple query-response systems to more sophisticated tools that can understand context, reason through problems, and take autonomous actions. RAG systems are transforming how we interact with large bodies of information, making it more efficient and contextual, while agents are pushing the boundaries of what automated systems can accomplish.

However, with these advancements come crucial responsibilities. The challenges of data privacy, bias, accuracy, and ethical considerations must be carefully addressed. As these technologies continue to evolve, striking the right balance between automation and human oversight will be essential. The development of standards like the Model Context Protocol shows promise in creating a more organised and interoperable ecosystem for AI applications.

In my opinion the future of AI lies not in replacing human capabilities, but in augmenting them through intelligent systems that can understand, reason, and act within well-defined boundaries. As we continue to develop and deploy these technologies, maintaining focus on responsible implementation, transparency, and human-in-the-loop design will be paramount to ensuring their successful, safe and ethical integration into our daily lives and work processes.


Resources

  1. If you are interested in building AI-based products and you want a deep dive into the fundamentals of foundational models I highly recommend AI Engineering by Chip Huyen.
  2. For my RAG projects I use a vector database from Pinecone. They have a great learning section on working with vector databases - I have no relationship with Pinecone.
  3. Ollama is a tool for running AI models on your own computer; for an excellent intro check out this one from freeCodeCamp.org.
  4. I took an online AI course with BrainStation, the instructors were excellent and they provided a great foundation for lots of the theory within Artificial Intelligence
  5. LangChain is another tool for developing applications powered by models; they have lots of tutorials on how to get up and running with their platform and build some cool stuff.
  6. NVIDIA have a huge stake in the growth of AI and as such create a lot of good content to help educate people about the domain
  7. For a great summary of the latest developments I've signed up to TLDR's newsletter. They have a bunch of other newsletters too across other areas in tech.
  8. Anthropic, OpenAI, Meta and Mistral all have great documentation on how to get started with their products and also guides on AI in general