
Harnessing AI To Build Applications That Can Reason: ZeidBot and LangChain

Author

Zeid Bsaibes


Should you read this?

We’ve all been wowed by the performance of chatbots like ChatGPT and their ability to answer almost any question on any topic with incredible accuracy and speed. Even more impressive is their ability to really understand the intent behind the questions we ask and then formulate responses in a way that they “think” will be most insightful for us. All that being said, in the grand scheme of things AI can and will do, these chatbots are really just scratching the surface.

This article is aimed at going a bit deeper and discusses how AI models can be combined with pipeline orchestration tools and proprietary data to create powerful and focused reasoning applications. By adding a few more elements to your AI stack we’re able to create outcomes which are far more context-aware, reliable and tailored to our specific needs.

What this article is and isn’t

This isn’t a step-by-step walkthrough on how to build a retrieval augmented generation (RAG) application from scratch. It will contain some code snippets to illustrate specific points around key tools that I use, but it won’t provide a ready-made GitHub repo that you can clone. Since the nature of RAG applications is very domain-specific this article is intended to be more of a conceptual guide to help you get to grips with the major moving pieces and then reflect on how you might integrate AI-powered reasoning capabilities with your own projects.

The Tools Involved in Building Reasoning Applications

Using AI models and the ecosystems of tools around them can unlock really exciting opportunities to create functionality that even a few years ago would only have been within reach of the mega-tech companies. At the core of most reasoning applications is a large language model (such as OpenAI’s gpt-4o-mini), but the really interesting stuff is actually to be found in the other areas of the AI stack where you can exert precise control and make these AI models really sing for you.

I have no relationships, commercial or otherwise, with any of the companies or tools mentioned anywhere in this article. I’ve used them myself and I like working with them.

LangChain as the orchestrator of the reasoning chain

LangChain’s website claims their framework helps you build “applications that can reason”, a pretty bold claim for a framework but, for once, a seemingly reasonable one. As we shall see, both LangChain and its companion package LangGraph make developing AI-powered reasoning applications much easier than building one from scratch.

LangChain allows you to build modular AI applications by “chaining” together components of the AI stack like prompt templates, memory, databases, tools, and models. What this means in reality is that you have flexibility in building pipelines (chains) with huge customisability and specificity for each link in the chain. LangGraph lets you structure those steps as an explicit graph of nodes and edges, and, as we’ll see later, LangSmith lets you see what’s happening at each step in your chain.
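To make “chaining” a little more concrete, here is a minimal sketch (not part of ZeidBot, purely illustrative) of a tiny two-step chain in LangChain.js: a prompt template is piped into a chat model and the model’s reply is parsed back into a plain string.

import { ChatOpenAI } from '@langchain/openai'
import { ChatPromptTemplate } from '@langchain/core/prompts'
import { StringOutputParser } from '@langchain/core/output_parsers'

// A hypothetical two-link chain: prompt template -> chat model -> string parser
const prompt = ChatPromptTemplate.fromTemplate(
  'Summarise the following in one sentence: {text}'
)
const llm = new ChatOpenAI({ model: 'gpt-4o-mini', apiKey: process.env.OPENAI_API_KEY })

const chain = prompt.pipe(llm).pipe(new StringOutputParser())

// Each link runs in order: the template fills in {text}, the model answers,
// and the parser returns the answer as a plain string
const summary = await chain.invoke({
  text: 'LangChain lets you compose prompts, models, parsers and retrievers into pipelines.',
})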

For a variety of reasons the language of choice for most AI applications has historically been Python; thankfully, LangChain has built a JavaScript package to help web developers who are probably building the rest of their applications in JavaScript. Should I learn Python? Yes. I probably won’t though, so thank you LangChain.js team!

ZeidBot: an AI-powered reasoning chain in action ⛓️

Here is ZeidBot in action: he’s great, he never complains, never sleeps, never makes excuses and works for peanuts - the complete opposite of the actual Zeid.

As an example of a chain we’re going to explore how I built ZeidBot, a RAG ChatBot that has access to all the content contained within the CMS that powers this website. Below is a summary of the elements which are orchestrated by LangChain allowing ZeidBot to do his thing. We go into more detail about each step later, but I think it’s useful at this stage to give you an overview of the whole picture. Here’s a run-through of the chain:

  1. A document loader scrapes the latest data from my content management system and feeds it into…
  2. …an AI-powered embedding model that converts text chunks into vector embeddings to be stored in a…
  3. …vector database with metadata tags per embedded chunk, which is filtered by an…
  4. …AI-powered classification system which considers the human question and determines the most relevant metadata filters to apply to the Pinecone database, sending them to the…
  5. …retrieval system, which takes the human question posed to ZeidBot and tells the vector store to perform a similarity search on the metadata-filtered documents and return the most meaningful chunks…
  6. …for the AI-powered text generation function to combine with the original question to create responses for the human user…
  7. …based on a conversation history that has been saved after each question-response cycle and is fed back into context.

The annoying dots above are really my attempt to show how all these steps form part of a chain ⛓️.

Converting content into meaning that computers understand

There is a lot of stuff going on in the summary above, but if we break it down step-by-step it isn’t that complicated in reality. Steps 1-3 can be visualised by the below diagram from LangChain’s website:

The graphic describes the process of converting information, in my case text from my website’s content management system, into smaller chunks which can be turned into little packets of semantic meaning and embedded into a vector database along with relevant metadata. If it’s not clear to you why we would need to chunk and embed information it might be worth reading my article on RAG systems. Essentially the embedding process uses an LLM to convert words (how humans understand meaning) into vectors, which are how computers understand meaning.

After loading my data (Load step) and chopping it up (Split step) the first real AI-powered step in the chain is the Embed step. This is where the little chunks of my documents are sent to an LLM which turns them into vectors and sends these vectors for storage in a special kind of database - known as a vector database.

Embedding: converting semantic meaning into vectors

LangChain provides built-in support for embedding models, abstracting away the complex logic involved in generating vector representations of text. If you’re using OpenAI, all you need to do is pass in your API key and select an embedding model (like text-embedding-3-large). LangChain then handles the rest — transforming each chunk of text into a high-dimensional vector.

The text-embedding-3-large embedder produces vectors with 3,072 dimensions, meaning each text chunk is converted into an array of 3,072 numbers. Pretty mind-bending stuff since humans can really only visualise just 3 dimensions. These embeddings capture the semantic meaning of the text and can be stored in a vector database for similarity search and retrieval (more on this later).

// LangChain configures the embedder
import { OpenAIEmbeddings } from '@langchain/openai'

const embeddings = new OpenAIEmbeddings({
  // It uses OpenAI's specialised embedding model which creates
  // vector embeddings of up to 3,072 dimensions
  model: 'text-embedding-3-large',
  apiKey: process.env.OPENAI_API_KEY,
})

export default embeddings

The vector database - where my semantic meaning is saved

So once we can convert text into high-dimensional vector embeddings, we need an efficient way to store and query those vectors. This is where a vector database like Pinecone comes in. Pinecone is optimised for fast, scalable similarity search across millions, or even billions, of vector embeddings. LangChain provides native support for Pinecone, allowing easy integration by simply providing your API key and index name.

import { PineconeStore } from '@langchain/pinecone'
import { Pinecone as PineconeClient } from '@pinecone-database/pinecone'
import embeddings from './embedding-model'

// Pinecone is a hosted vector database service
const pinecone = new PineconeClient({ apiKey: process.env.PINECONE_API_KEY || '' })

const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX!)

// my embeddings function is passed into the vector store
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
  maxConcurrency: 5,
})

export default vectorStore

When we initialise a VectorStore using PineconeStore.fromExistingIndex(), we also pass in our embeddings model. This tells LangChain how to generate embeddings in a consistent format - 3,072-dimensional arrays in our case. These embeddings are used in two key operations:

  • Embedding (Adding Documents): Each document is chunked into smaller segments, embedded using our embedder, and the resulting vectors—along with their metadata—are stored in Pinecone.
  • Retrieval (Querying): When the chatbot receives a human question, the query is embedded using the same model to produce a 3,072-dimensional vector representation of the question. Pinecone then performs a search using this query vector to return the most semantically similar document chunks based on cosine similarity (or another metric).

The embeddings model ensures that semantically similar content (e.g., “capital of France” and “Paris”) are placed close together in vector space, allowing Pinecone to efficiently return the most relevant content to support accurate, relevant responses from ZeidBot.
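To make that concrete, here is a small sketch using the embedder configured earlier. LangChain calls embedDocuments when adding chunks and embedQuery when a question arrives; with text-embedding-3-large both return arrays of 3,072 numbers living in the same vector space (the example texts are obviously made up).

import embeddings from './embedding-model'

// Indexing time: each chunk of website content becomes a 3,072-number vector
const docVectors = await embeddings.embedDocuments([
  'Paris is the capital of France.',
])

// Question time: the query is embedded with the same model, so it lands
// in the same vector space as the stored chunks
const queryVector = await embeddings.embedQuery('What is the capital of France?')

console.log(docVectors[0].length, queryVector.length) // 3072 3072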

Capturing Meaning: Embedding in Action

Here’s a bit of a complicated code snippet, but we will go through each concept in turn:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { Document } from 'langchain/document'

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 600,
  chunkOverlap: 200,
})

const allChunks: Document[] = []

// Imagine some logic which fetches data from my CMS, labels it
// with some metadata and plops it into a 'documents' array, see an explanation below

for (const doc of documents) {
  const chunks = await splitter.splitDocuments([doc])

  chunks.forEach((chunk, i) => {
    allChunks.push(
      new Document({
        pageContent: chunk.pageContent,
        metadata: {
          ...chunk.metadata,
          docId: doc.id, // keep the parent document's id so we can build stable chunk ids below
          chunkIndex: i,
        },
      })
    )
  })
}

if (allChunks.length > 0) {
  await vectorStore.addDocuments(allChunks, {
    ids: allChunks.map((chunk, i) =>
      `${chunk.metadata.docId}-${chunk.metadata.section || 'content'}-${i}`
    ),
  })
}

Chunking it up: smaller bites mean more focused semantics

The first thing you will notice is the splitter we have created using RecursiveCharacterTextSplitter. This is simply the tool that LangChain gives us to “chunk” up our big blocks of text, as we mentioned before. We want to chunk up large blog articles and then embed each of these chunks separately for a couple of reasons:

  1. Performance: Embedding models and large language models have strict size limits; they can only process a certain number of tokens (units of text) at a time. To stay within those limits, we break large documents into smaller, manageable chunks. This ensures each piece of text fits within the model’s capabilities and can be processed effectively.
  2. Semantic Accuracy: Chunking also significantly improves the quality of information retrieval. By dividing the text into smaller, more focused sections, the system can match a user’s query with the most relevant part of the original document, rather than searching within a single, large, unfocused block of content. This leads to more accurate and contextually relevant responses. It is kind of like the difference between knowing the important chapter in a textbook versus knowing the exact paragraph in that chapter that you’re looking for.

When we split large blocks of text into smaller chunks for processing, we often use a technique called chunk overlap. This means that instead of cutting the text into strictly separate 600-character pieces, we allow some of the content at the end of one chunk to repeat at the beginning of the next. The reason for this is simple: important context or meaning often lives near the edges of chunks, and if we were to split the text cleanly, we would risk losing that continuity.

Imagine chopping up a book into a series of 600-character bits and writing them on separate Post-It notes. If you tried to read these notes in isolation you would inevitably lose the context on some, because they started or finished mid-sentence, and you might lose track of who’s speaking or what is happening. Overlapping preserves the flow of ideas and ensures that the AI model has access to the full context it needs, even when working with chunked text. This improves both understanding and the quality of any retrieval or response based on that text. If you want to visualise chunking check out this demo from ChunkViz.
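If you’d rather see the overlap than imagine the Post-It notes, here is a tiny sketch using deliberately small chunk sizes (not ZeidBot’s real settings) so the repeated text is easy to spot:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'

// Deliberately tiny sizes purely for illustration - real chunks are hundreds of characters
const demoSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 60,
  chunkOverlap: 20,
})

const chunks = await demoSplitter.splitText(
  'Important context often lives near the edges of a chunk, so neighbouring chunks repeat a little of each other to preserve the flow of ideas.'
)

// Adjacent chunks share some text, so no sentence is read completely out of context
console.log(chunks)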

Saving to the Vector Database

I am going to repeat a bit of the code snippet above, specifically the code that prepares the data for saving into the vector database:

import { Document } from 'langchain/document'

const allChunks: Document[] = []

// Imagine some logic which fetches documents from my CMS, labels them
// with some metadata and plops them into a 'documents' array, see an
// explanation below

for (const doc of documents) {
  const chunks = await splitter.splitDocuments([doc])

  chunks.forEach((chunk, i) => {
    allChunks.push(
      new Document({
        pageContent: chunk.pageContent,
        metadata: {
          ...chunk.metadata,
          docId: doc.id, // carry the parent document's id through to the chunk
          chunkIndex: i,
        },
      })
    )
  })
}

if (allChunks.length > 0) {
  await vectorStore.addDocuments(allChunks, {
    ids: allChunks.map((chunk, i) =>
      `${chunk.metadata.docId}-${chunk.metadata.section || 'content'}-${i}`
    ),
  })
}

The first thing we notice is the Document object we import from LangChain. This is an object type that contains two things: the pageContent - the actual text content we want to vector-embed - and the metadata, which is non-embedded data about the text chunk itself, e.g. is the embedded chunk from a career page or from a blog page on zeidbsaibes.com.

For brevity I have excluded the code which queries my CMS and returns the content; in reality I traverse through the API JSON response for each website page and pull out individual components like <h1>, <h2>, <p>, captions, callouts and so on. As I traverse through the responses I save all these bits and pieces in their own Document objects. Many of them will have large pageContents, because a <p> could be hundreds of characters long.

This is where the chunking comes in: I pass each Document object through my splitter (in a for loop) and chunk up the pageContent, ensuring that the original metadata from the big Document is preserved and stuck on the individual mini-Document chunks.

Once I’ve got all my mini-Documents with associated metadata, I shove them into the vectorStore using the addDocuments method. This turns the pageContent in each document into one of those 3,072-dimensional vectors and attaches the metadata to each of them. What this actually looks like in Pinecone (my vector db of choice) is shown below:

As you can see from above, each chunk in Pinecone has a number of data attributes. First you can see the Dense values: these are the 3,072 numbers which embed the semantic meaning of the pageContent information we talked about before. They are the vector equivalent of the text we can see within the text field of the records. We can also see all the metadata that we added during our embedding process. As things currently stand, the entirety of zeidbsaibes.com is made up of 596 such chunks stored within Pinecone. Each chunk has a handy ID value which allows for easy updating should any of the information on my embedded pages change. My chatbot has access to all of these chunks when formulating responses to user questions, and this is what powers its ability to give very context-aware answers.
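To give a rough idea of what one of those records looks like, here is an illustrative, made-up example. The exact metadata keys depend on what you attach during embedding (the collection field name below is an assumption), and LangChain also stores the chunk’s text itself under a metadata key so it can be handed back to the LLM later.

// Illustrative shape of a single Pinecone record - not a real export from my index
const exampleRecord = {
  id: 'about-page-content-3',
  values: [0.0132, -0.0457, 0.0981 /* ...3,069 more numbers... */],
  metadata: {
    collection: 'posts', // assumed field name for the CMS collection
    section: 'content',
    chunkIndex: 3,
    text: 'Zeid Bsaibes is a fullstack engineer working with React and Next.js...',
  },
}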

Using embeddings to provide chatbot answers

So at this point information from my website has been successfully converted into meaning that AI models can search, retrieve and understand. Note: this embedding process doesn’t occur on every interaction with ZeidBot; I trigger it manually after I’ve made some content changes somewhere on zeidbsaibes.com.

Now we have to work on the answering part of the reasoning pipeline, which handles human questions posed to ZeidBot. Let’s recap how this works and then go into detail about each step:

  1. A human poses a question to ZeidBot.
  2. The classify step in my LangChain determines the intent of this question, meaning: is it about my career, writing, education or a more general question?
  3. This intent is passed into the retrieve step of my LangChain as a metadata filter, which then constrains retrieval to chunks with metadata labelled either posts, education, career or null. These metadata labels correspond to the various collections (or database tables) within my Payload CMS (my CMS of choice).
  4. The retrieved documents and previous chat history (including the latest question) are passed into my generate step and the LLM then creates a response which is sent back to the user of ZeidBot.
  5. The whole process loops again if another question is posed to ZeidBot.

The graphic below gives an overview of all of this:

A graphic showing the vector retrieval process for a conversational chatbot

Determining the intent of the questions

Before we start doing anything here we need to define the large language model that we are using to generate ZeidBot’s responses. At this point you might be thinking: didn’t we connect to an OpenAI language model already? You’re right, we did, but we used a very specific embedding model to help us turn our website content into vectors (specifically, text-embedding-3-large). That model is fine-tuned for representing meaning across large documents, but not optimised for reasoning and response generation. This is where gpt-4o-mini comes in, and once again LangChain helps us out with a handy package to connect specifically to OpenAI’s chat models.

import { ChatOpenAI } from '@langchain/openai'

const llm = new ChatOpenAI({
  model: 'gpt-4o-mini',
  temperature: 0,
  apiKey: process.env.OPENAI_API_KEY,
})

export default llm

Very briefly, the apiKey and model are self-explanatory; you can read more about gpt-4o-mini here if you want, I mainly chose it for its performance-cost balance. What might be a new concept is temperature. A lot has been written about this and I would direct you to this IBM article on temperature for a deep-dive, but essentially it is a number between 0 and 1 which determines how creative versus factual the output of an AI model is. I want ZeidBot to be as precise as possible when giving answers about me and not to be overly creative and freestyle and say things which might not be accurate. If I gave ZeidBot a temperature nearer to 1, he would get creative and inevitably start throwing some real shade and it would be no time at all before he’s calling me a “known dickhead” - possibly true, but not technically within his vector embeddings (well now it is since I’ve written it in this article, oops).

Now that we’re connected to an LLM, ZeidBot has the power to reason over human text, comprehend the intent behind questions and create responses. Let’s jump into the classification step and see how ZeidBot determines the intent of a user question:

const classify = async (state: typeof InputStateAnnotation.State) => {
  const classificationPrompt = [
    {
      role: 'user',
      content: state.question,
    },
    {
      role: 'system',
      content: `Classify the user's question into one of the following categories:
- education (if about studies, degrees, courses, institutions)
- career (if about work, job titles, roles, companies, Hawkker, Kind Consumer or Teaching)
- posts (if about written content, blog posts, engineering, start-ups)
- general (if it doesn't fit clearly)

Return only one word: "education", "career", "posts", or "general".`,
    },
  ]

  const response = await llm.invoke(classificationPrompt)

  // Normalize output
  let intent = ''
  if (typeof response?.content === 'string') {
    intent = response.content.trim().toLowerCase()
  }

  // Fallback to 'general' if model gives something unexpected
  if (!['education', 'career', 'posts'].includes(intent)) {
    intent = 'general'
  }

  return { intent }
}

So what’s happening above looks pretty complicated, but essentially the classify function does its job by sending a structured prompt to the language model we set up, with two components:

  1. A user message containing the human question posed to ZeidBot.
  2. A system message containing instructions for the LLM—specifically, to categorise the question as either "education," "career," "posts," or "general".

After the model returns one of these categories, the function implements several safety checks since LLMs can sometimes provide imprecise responses (even with temperatures of 0):

  • It normalises the model's response by removing extra spaces and converting to lowercase
  • It validates the output, defaulting to "general" if the response doesn't match the expected categories

It spits out a nice clean intent which can then be neatly passed on to the next step in the chain, helping our retrieval function do its job better by filtering the chunks in our vector database to only those chunks whose collection matches the intent string.
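My full retrieve step isn’t reproduced in this article, but a minimal sketch of the idea looks something like the below. The collection metadata field name, the file path and the number of chunks returned are all assumptions; the filter uses Pinecone’s metadata filter syntax.

import vectorStore from './vector-store'

const retrieve = async (state: { question: string; intent: string }) => {
  // Only constrain the search when the classifier picked a specific collection
  const filter =
    state.intent !== 'general' ? { collection: { $eq: state.intent } } : undefined

  // Similarity search over the (optionally) metadata-filtered chunks,
  // returning the most semantically similar Documents
  const retrievedDocs = await vectorStore.similaritySearch(state.question, 4, filter)

  return { context: retrievedDocs }
}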

LangSmith, our ability to look under the hood in our chains

At this point you might be forgiven for assuming that the only way to understand whether we are building an effective reasoning chain is to just see what ZeidBot says when we ask him a bunch of questions. This is definitely an important strategy for assessing what the experience will be like for end users, but we can do a bit better; this is where LangSmith comes in. LangSmith acts like a terminal with logs for your LangChain chains: at every step of the chain you can see what’s happening behind the scenes. If we go all the way back to the brief demo of ZeidBot, the conversation went like this (I’ve truncated for brevity, you can go back to the video for the full convo):

Human: “Tell me about yourself”

ZeidBot: “I am Zeid Bsaibes, a professional with a diverse background in technology, entrepreneurship, and management consulting. I have a strong foundation in full-stack engineering, currently working as a Fullstack Engineer specializing in React and Next.js, with experience in AI integration and product management…..

Human: “what articles have you written?”

ZeidBot: “I have written several articles that cover a range of topics related to technology, startups, and AI. Here are the articles along with their links:

1. A Primer on Running AI Models on Your Own Computer - Read here

2. Agents and RAG: AI is growing up and getting a job - Read here ……”

The below extract from LangSmith shows a much more detailed version of what’s going on within ZeidBot:

It shows a trace of the second question: “what articles have you written?”. From the trace in the graphic we can see all the steps in our chain: classification, retrieval and generation. This allows us to see what is happening at each stage. For example, I can confirm that the full message history is being inputted at the beginning of each chat cycle. I can also see that our classify step correctly understands that the question relates to the metadata filter: posts.
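For reference, LangSmith tracing doesn’t require changes to the chain code itself; it is switched on with a few environment variables, roughly like this (the project name here is just illustrative):

# .env - turns on LangSmith tracing for every chain run
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-api-key
LANGCHAIN_PROJECT=zeidbot   # optional: groups traces under a named project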

LangSmith as a diagnostic tool

You may wonder why LangSmith is so useful; after all, isn’t it just showing in long-form what I’ve already told my chain to do in code? This is true, but the current reality of LLMs and embeddings is that they are far from perfect and you cannot blindly rely on them to perfectly answer any human question; they need a little bit of help. One day in the future this might be possible, but not right now.

In early testing ZeidBot would hallucinate and give answers that conflated my career, writing and education. For example, it would get confused and respond with answers claiming that I worked at Next.js or AWS, which is certainly not true - I work with those technologies, not at the companies which develop them. I could see in LangSmith that documents from the wrong collections were being retrieved into context, which would lead to confusing answers. This issue helped me come up with the classify step and also refine my system prompt.

I realised that I needed another step to help ZeidBot conceptually separate my education, my writing and my career when determining how to answer questions. What LangSmith helped me do is create a multi-turn process with my LLM: first use it to understand the type of question, which lets Pinecone perform the vector search on only a subset of vectors, and then return those results to my generate function for the LLM to create the best possible response.

LangSmith also helped me understand how to tweak my SYSTEM_PROMPT for the best response performance; it involved passing in objects summarising the various documents available to the ChatBot.

const postsInfo = await getPostSummaryForPrompt()
const careersInfo = await getCareerSummaryForPrompt()
const educationInfo = await getEducationForPrompt()

const SYSTEM_PROMPT = `You are an intelligent assistant for the personal website of Zeid Bsaibes. This website contains a number of different page types. You are able to answer as if you are Zeid.
These are the current posts:
${postsInfo}

This is a summary of my career info, please ensure you list them chronologically
${careersInfo}

This is a summary of my education, please ensure you list them chronologically
${educationInfo}

.....
//other system prompt information
`

One issue with this system prompt is that it is sent alongside every question posed to ZeidBot, and this incurs token fees. As the size of postsInfo, for example, increases, this is going to bloat my token expenditure and impact the cost and performance of ZeidBot. LangSmith will be useful in the future for understanding how best to balance cost and performance.
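For completeness, the generate step that consumes this system prompt is roughly shaped like the sketch below. The state field names and the './llm' import path are assumptions, and the real version streams its answer rather than returning it in one go.

import { Document } from 'langchain/document'
import llm from './llm'

const generate = async (state: {
  question: string
  context: Document[]
  history: { role: string; content: string }[]
}) => {
  // Squash the retrieved chunks into one block of context text
  const docsContent = state.context.map((doc) => doc.pageContent).join('\n\n')

  const messages = [
    { role: 'system', content: `${SYSTEM_PROMPT}\n\nRelevant content:\n${docsContent}` },
    ...state.history, // previous question-response cycles
    { role: 'user', content: state.question },
  ]

  const response = await llm.invoke(messages)
  return { answer: response.content }
}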

LangSmith also helped me optimise the embeddings stage. Originally I was using a tool called Cheerio to scrape the live URLs of my website and then convert and embed the content into my vector database. Without going into too much detail, LangSmith helped me see how this was not a great approach for the quality of ZeidBot’s responses. Since I have access to the API behind my own website, I didn’t need to scrape public URLs; instead I could query my own API, process the response and then embed (the current approach, as explained above).

Bringing it all together and ZeidBot’s responses

No doubt you’re getting pretty tired of this long read about now. If so, I recommend going over to ZeidBot and asking him to summarise it all for you; then you can get on with the rest of your day, or go to sleep, or fire up your PlayStation.

That said, we’re getting to the end of the story about now. The code below is basically the amalgamation of everything we’ve gone through in this article:

// Imagine the below existing within an API endpoint for the chatbot

// All the other stuff we've already been through omitted for brevity

const graph = new StateGraph(StateAnnotation)
  .addNode('classify', classify)
  .addNode('retrieve', retrieve)
  .addNode('generate', generate)
  .addEdge('__start__', 'classify')
  .addEdge('classify', 'retrieve')
  .addEdge('retrieve', 'generate')
  .addEdge('generate', '__end__')
  .compile()

// Pseudocode for the endpoint
// I've included actual code for the interesting bit

export async function POST(req: NextRequest) {
  // When a POST request is received:
  // 1. Extract the user's question and chat history from the request.
  //
  // 2. If the question is missing or not text:
  //    → Send an error response.
  //
  // 3. Check if the history is valid
  //    (a list of messages from the user and assistant).
  //    → If not valid, use an empty history.
  //
  // 4. Pass the question and history into the LangGraph workflow.

  // the interesting bit:
  const { answerStream } = await graph.invoke({
    question,
    history: validatedHistory,
  })
  // → The graph decides what to do
  //   (classify, retrieve info, generate an answer).

  // 5. Set up a stream to send the assistant's response back gradually,
  //    like typing.
  //
  // 6. Send the streamed response to the user.
  //
  // If something goes wrong:
  //    → Log the error and send an error message back.
}

In LangGraph, a graph defines the flow of operations using nodes and edges, making it easy to structure complex language model workflows. Nodes represent the individual steps we’ve talked about above, like classifying a question, retrieving relevant documents, or generating a response. Edges define the transitions between nodes, controlling the order in which steps run (e.g., from classification to retrieval to generation). What this means is that I could add other fun steps to my chain, like a step which determines if the question has been asked before so that a “cheaper” cached answer can be provided, or a step which determines whether ZeidBot is likely to hurt the real Zeid’s feelings…
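As a hedged sketch of how such an extra step could slot in (checkCache and the cachedAnswer state field are hypothetical, and this relies on LangGraph’s conditional edges), the graph might branch like this:

// Hypothetical variant of the graph: skip the expensive steps
// when a cached answer for this question already exists
const graphWithCache = new StateGraph(StateAnnotation)
  .addNode('checkCache', checkCache) // hypothetical cache-lookup node
  .addNode('classify', classify)
  .addNode('retrieve', retrieve)
  .addNode('generate', generate)
  .addEdge('__start__', 'checkCache')
  .addConditionalEdges('checkCache', (state) =>
    state.cachedAnswer ? '__end__' : 'classify'
  )
  .addEdge('classify', 'retrieve')
  .addEdge('retrieve', 'generate')
  .addEdge('generate', '__end__')
  .compile()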

I have omitted the API endpoint logic since it is mainly Next.js-specific stuff; the key point here is that each question-and-answer cycle with ZeidBot sends the question and the previous chat history to the endpoint, both of which are piped down the LangChain graph every turn. The API endpoint also receives the generated response as a ReadableStream rather than a single JSON object. This means that the user starts seeing the response relatively quickly and it gives the appearance of ZeidBot typing (similar to the AI chatbots we know and love). The alternative here would be waiting for the entire response and just plonking a big block of text on the screen after some longer time period.
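For anyone curious about the streaming half, here is a minimal sketch of the idea (not my actual endpoint code); it assumes the graph hands back an async iterable of text chunks called answerStream, as in the snippet above.

// Inside the POST handler, after graph.invoke(...) has returned answerStream
const encoder = new TextEncoder()

const stream = new ReadableStream({
  async start(controller) {
    // Push each piece of generated text to the client as soon as it arrives
    for await (const chunk of answerStream) {
      controller.enqueue(encoder.encode(chunk))
    }
    controller.close()
  },
})

// The browser renders the text as it streams in, giving the "typing" effect
return new Response(stream, {
  headers: { 'Content-Type': 'text/plain; charset=utf-8' },
})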

Summing Up

This article has explored the process of building ZeidBot, a RAG (Retrieval-Augmented Generation) chatbot that can communicate about the content of zeidbsaibes.com. By combining LLMs with pipeline orchestration tools and proprietary data within vector stores, we've created an AI-powered reasoning application that interprets questions in the context of a chat history and answers them based on information that was never within the original training data of the underlying LLM.

LangChain and LangSmith are powerful tools for orchestrating AI pipelines. LangChain (with LangGraph) structures logic with modular nodes and edges, enabling step-by-step control. LangSmith offers observability, tracking inputs, outputs and model behaviour, making debugging and evaluation easy. Together, they streamline building, testing, and maintaining robust, reliable, and traceable LLM-based applications.

While there's always room for optimisation - particularly in areas like token efficiency - this project shows how modern AI tools can be combined to create practical, focused applications for almost any use-case.

