The DeepEval framework is a comprehensive automated testing solution for evaluating chatbots, voice AI agents, and RAG systems. It integrates multiple components: a chatbot system for conversational AI testing, a RAG Explorer for pipeline visualization, and the DeepEval system for automated evaluation. The framework uses LLMs like OpenAI GPT, Grok, and Ollama as judges to assess quality metrics including answer relevance, hallucinations, toxicity, faithfulness, and contextual recall. It supports both local and cloud LLM configurations. The framework includes golden datasets with expected inputs and outputs, test files for evaluating chatbots and RAG pipelines, and a dashboard for visualizing evaluation results. Components communicate through APIs, enabling flexible and scalable AI system testing.
Build an AI Testing Framework for QA | Chatbot, Voice, RAG + DeepEval | Part 3
Welcome, everyone. Today we're going to vibe code our DeepEval framework and try to complete it. After that we will cover fine-tuning an LLM, plus a couple of extra topics: mastering Claude Code, mastering OpenCode, and what's new in Gemma 4. So there's a lot of new Friday content planned. But let's complete our framework first, and then I'll guide you on what we need to do next. So what is the framework we are creating? We want a framework that helps us evaluate a chatbot and its output. It should be able to test a voice AI agent, verify RAG output, and run LLM evaluations as well.
All of this is done using DeepEval. Today I will be coding entirely with the vibe coding method, and we will try to understand each and every piece. I'm going to use Claude, here on the right side. If you also want to test and verify the RAG, you can use Ragas as well. Let's go. We are going to create project number 23, which is the DeepEval framework. I hope everyone is with me; type yes in the chat.
Let me close this. We will be working on the DeepEval framework in this directory.
So what exactly are we building?
Today's objective is to build a framework that helps us automate the testing of our chatbots. We are building it with DeepEval, and it will evaluate the chatbot, the voice agent, and the RAG output. I hope this is clear. So the first thing I will tell Claude is that we are going to build a full framework in DeepEval. Our first task is to automate the chatbot: it's the chatbot of an e-commerce website, and we want to evaluate it using DeepEval. So the first task is to build the React chatbot. The second is to build the RAG pipeline.
I want you to also create a second folder where you will build a RAG-based explorer.
In the RAG-based explorer, you need to create some documents related to, let's say, an e-commerce website only. In this RAG pipeline you will use a Nomic embedding model for the embeddings and a local Chroma DB for storing and searching them. You will ingest the data, which can be a PDF or a text file.
After that, show the chunks in HTML format. You are also going to create a chatbot that runs on Grok; I will share the Grok key with you. This agent will fetch details from the RAG vector database, which is the Chroma DB we set up. Then we need to evaluate everything with DeepEval: the API of the RAG Explorer project as well as the second project, the e-commerce chatbot. We will evaluate all the types of metrics. For this you need to prepare a proper framework that uses Grok with the OpenAI GPT-OSS 120-billion model as the LLM-as-a-judge, and you can generate the output with the Gemma 3-billion model that we already have. You can fetch this using the key as well if you want.
So, first task first: generate the chatbot and open the URL. The second task is the RAG Explorer, or RAG pipeline. Make sure you showcase the full pipeline; I'm going to give you a screenshot of how the RAG Explorer should look. The third task is generating the DeepEval framework from scratch, testing each and every thing.
Make sure DeepEval has a setup for a local LLM as well as a cloud LLM, with switching between the two. Make sure it covers all the things we have learned: answer relevancy, hallucination, toxicity, and all the 15-plus metrics. We are testing the chatbot as well as the RAG output, overall with the agent.
So, I hope this is clear. It's a very big prompt, do you see that? This is the prompt I have given. I'm going to share it with you; how many of you want this prompt? Type yes in the chat.
I'm going to share the prompt with you because writing it takes a lot of time. This is the prompt we are going to work with. Now, do you want to see the RAG Explorer? Let me show you what exactly it is. What we have here is a naive RAG. We built a naive RAG in the previous ATB 1x and ATB 2x batches; let me show you what we built. How many of you want to see it? Type yes in the chat. We built a RAG pipeline, a simple, basic RAG. A basic RAG works on whatever data you put in. For example, we have the product requirement document of bw.com. Do you remember this PDF? We also created an n8n flow for this, with advanced RAG and hybrid RAG. So this is a project we have done. The data just goes into the Chroma DB, and this is what actually happens.
You can ask, for example, how many companies use the product. The data ingestion is already done: our PDF has been divided into 19 chunks, embedded with the Nomic model, and put into the Chroma DB. I hope this is clear; type yes in the chat. Now, if I ask a question, it tells me how many chunks are returned. If I ask how many companies use it, the top four chunks are returned; see, four are returned. And here it says the document does not specify the number. Let me ask another question: it is used in how many countries?
Here it says that the document does not specify the countries. So what is this RAG Explorer? It is nothing but a way to visualize: I have just built a visualization of your RAG. If you remember the flow, what did we do behind the scenes? First of all, the data is ingested, yes or no?
It was ingested into the Chroma DB using one of the embedding models. We have already done that.
What is the chunk size? What is the overlap? With those settings, the data is ingested. I have also created a view so that you can ask questions.
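To make chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking. The function name `chunk_text` and its defaults are assumptions for illustration; the explorer shown in the video may split on tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: each chunk repeats the last `overlap` characters
    of the previous one, so a sentence cut at a boundary still appears
    whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

A larger overlap gives more chunks for the same document, which is one reason the chunk count in an explorer changes when these two settings change.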
It's very similar to n8n, right? It's the same flow as n8n. Here you can ask questions, and Grok will answer them from the retrieved chunks.
It's a visual representation of a RAG, yes or no?
What happens in RAG, if you remember? You have certain data.
First of all it is chunked. Then it is put into the vector database. Then we run a query, and it returns the top chunks. These are given to the LLM, and the LLM gives you the answer. Right, yes or no?
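The query step can be sketched without any external service. The example below is only an illustration: it swaps the Nomic embedding model and Chroma DB for a toy word-count vector and an in-memory list, and the names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 4) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The default `k=4` mirrors the four chunks returned in the demo; a real vector store would do the same ranking over dense embeddings.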
That's what it was. This is the RAG pipeline. It was a naive RAG, if you remember, just a basic RAG, and we have created the same thing here.
Did you get the point? Based on your question, I got these chunks from the RAG, and based on them the LLM agent generated the answer. So: retrieval, augmentation, and generation. First retrieval, then the chunks are given to the LLM, which is augmentation, and the last part is generation. How many of you got it? Type yes in the chat. This is the best explorer I have created for you; it can't get simpler than this.
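The augmentation step is just prompt construction: the retrieved chunks are placed in front of the question before the LLM is called. A minimal sketch, assuming a hypothetical `build_prompt` helper and an instruction wording that is not taken from the video:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number the chunks so an answer can be traced back to its source chunk.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Telling the model to admit when the context lacks the answer is what produces replies like "it does not specify the number" in the demo, rather than an invented figure.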
I think this is the simplest one you can get, for everyone to understand. It shows which chunk is coming and which database it is coming from; I have mentioned each and every thing. So I'm pretty sure everyone got the RAG Explorer. We also have a simple e-commerce bot that we have already built. If you remember, last time we also built an e-commerce bot, right?
That chatbot of ours was a small one, the Shopify bot, the ShopEasy bot, right? So Claude is going to create a bot here as well. Let's create it and see, one by one, what it is doing.
So I think the RAG Explorer is built too. We are just checking what is happening: the React chatbot is done and the RAG Explorer is done, so let me open them. It has built the chatbot, it has built the RAG Explorer, and now it is going to build the DeepEval system. We have also used Pinecone, right? In the n8n flow, do you remember? That is a production database. Someone is asking: sir, in a real IT testing project, will the test automation framework and DeepEval be separate? Ideally, yes. In the real world these two folders will be separate; they will not be in the same directory. I have added them here just for demo purposes, and they will not be part of your repository. You will actually have only this folder, the DeepEval framework, and we will run only the DeepEval framework directly. And how are they communicating with each other? Through APIs.
They communicate through APIs. We can use the chatbot APIs directly, and the RAG Explorer API directly too. So the third system we are building right now is the overall framework. We have the RAG Explorer and the React chatbot, and DeepEval is the one that will test both projects: your chatbot and your RAG Explorer. It has created the framework too, so let's see what it has created. Here we have the chatbot, the RAG Explorer, the DeepEval framework, the JSON report, and everything else. This is the requirements document, with the DeepEval version we are using. We are using OpenAI and Grok. It has created the LLM providers file, a pytest.ini file, an information file, and a requirements file. This is the environment file, a .env template, which says who is going to be the judge and who does what. It has used a factory that holds all the providers: we have support for OpenAI, for Grok, and for Ollama. How many are with me? Type W in the chat. All three are supported.
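A provider factory driven by the environment file could look roughly like this. It is a sketch under stated assumptions: the variable names `JUDGE_PROVIDER`, `JUDGE_MODEL`, and `GROK_API_KEY`, and the `JudgeConfig` class, are hypothetical and not the names used in the generated project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JudgeConfig:
    provider: str
    model: str
    api_key: Optional[str]  # None for a local provider that needs no key


# Hypothetical registry: provider name -> env var holding its API key.
# None marks a local provider (Ollama runs on the developer's machine).
KEY_ENV = {"openai": "OPENAI_API_KEY", "grok": "GROK_API_KEY", "ollama": None}


def load_judge(env: Mapping[str, str] = os.environ) -> JudgeConfig:
    """Pick the judge provider from environment-style settings."""
    provider = env.get("JUDGE_PROVIDER", "ollama").lower()
    if provider not in KEY_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    key_var = KEY_ENV[provider]
    api_key = env.get(key_var) if key_var else None
    if key_var and not api_key:
        raise RuntimeError(f"{key_var} must be set for provider {provider!r}")
    return JudgeConfig(provider, env.get("JUDGE_MODEL", "default"), api_key)
```

Passing the settings in as a mapping keeps the local-versus-cloud switch in one place and makes it easy to exercise in a test without touching the real environment.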
Whether you want OpenAI, Ollama, or Grok, it supports all three. Then we have the base file, which is just the judge configuration. We did the same thing previously; here it has just organized it properly into folders. Now, datasets: these are the chatbot datasets it has created. Datasets are nothing but the expected results. What is the input, what is the expected output, and what is the context? It has added this information as a golden dataset. What is a golden dataset? It is nothing but a set of scenarios that are known to be ideal, like the expected results in a test case. When we write a test case, we have an expected result, right? For example, if I use a valid username and a valid password, it should show me the dashboard. The dashboard is the expected result, and that is what a golden dataset records. It has written out all the scenarios.
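As an illustration of the input, expected output, and context triple, here is a minimal sketch of loading a golden dataset from JSON. The `Golden` class, the field names, and the sample rows are assumptions made for this example; the generated project may use a different schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Golden:
    input: str                 # what the user asks
    expected_output: str       # the ideal answer for that input
    context: list[str] = field(default_factory=list)  # supporting facts, if any


def load_goldens(raw: str) -> list[Golden]:
    """Parse a JSON array of golden records into typed objects."""
    return [Golden(**row) for row in json.loads(raw)]


# Illustrative data only, modeled on the refund example from the session.
SAMPLE = """
[
  {"input": "What is your refund policy?",
   "expected_output": "Refunds are processed within seven business days.",
   "context": ["Refund policy: refunds are processed within seven business days."]},
  {"input": "Log in with a valid username and password",
   "expected_output": "The dashboard is shown."}
]
"""
```

Each record plays the role of one expected result in a classic test case, which is why the same file can drive every metric.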
It has written what should happen for each case, and the safety protocols as well. It has written the golden dataset for RAG too. A golden dataset is nothing but expected results: for this input, this is the output, line by line, for each scenario. Nothing else. Then this file is for the chatbot, to make a connection with the chatbot, and this one is to make a connection with the RAG pipeline. It has created a file to connect to each of them, nothing else. Apart from this, it is going to write the test files.
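A connection file of this kind is usually a thin HTTP client. The sketch below uses only the standard library; the `/api/chat` path and the `message` and `reply` field names are hypothetical, since the actual endpoints of the demo chatbot are not shown.

```python
import json
import urllib.request
from typing import Callable

# A transport takes (url, payload) and returns the decoded JSON response.
Transport = Callable[[str, dict], dict]


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class ChatbotClient:
    """Minimal client for a chatbot under test (endpoint names are assumed)."""

    def __init__(self, base_url: str, transport: Transport = http_post_json):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def ask(self, message: str) -> str:
        body = self.transport(f"{self.base_url}/api/chat", {"message": message})
        return body["reply"]
```

Making the transport injectable means the evaluation tests can run against a fake chatbot first and against the live API later, with no change to the test code.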
Now it will create the test files. The test files cover how to test the chatbot and how to test the RAG pipeline. It will create both.
Now it is creating the tests. Do you see that? Look, it has made so many tests, for the chatbot and for the RAG. Isn't it awesome? Type yes in the chat. Here is the answer relevancy check for the chatbot. It's a simple check.
The answer relevancy check asks the bot a question and verifies the metric. Chatbot failure analysis, faithfulness, hallucination, bias, toxicity, G-Eval, leakage, contextual recall, safety measures, summarization: it has added all of them, one by one. I will also tell it to create a UI for DeepEval, so that we can see how things are going. Last time we created the DeepEval metric evaluator, do you remember? That had a UI, so a dashboard will be good here too. Let it run first. By the way, the entire framework is already created; this is actually a completed framework. The only thing we are asking for now is a dashboard. It has created the framework, and it has created all the tests.
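Every metric in a framework like this reduces to a score compared against a threshold. The sketch below shows that shape with a deliberately simple stand-in: it measures token overlap with the expected answer instead of asking an LLM judge, so it is not how DeepEval computes answer relevancy. `MetricResult` and `expected_overlap` are hypothetical names.

```python
import re
from dataclasses import dataclass


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # A test case passes when the score reaches the threshold.
        return self.score >= self.threshold


def expected_overlap(actual: str, expected: str, threshold: float = 0.5) -> MetricResult:
    """Fraction of expected-answer tokens that appear in the actual answer."""
    want = _tokens(expected)
    score = len(want & _tokens(actual)) / len(want) if want else 0.0
    return MetricResult("expected_overlap", score, threshold)
```

An LLM-judged metric replaces the overlap calculation with a model call, but the pass or fail decision against a threshold stays the same.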
The first thing it has shown me is this. We have ingestion, we have embeddings, and we have storage; we're storing the data here.
Then retrieval: you can retrieve it here.
And this is the answer; you can ask for the answer here. It is also using mock data, because right now we don't have this.
It's telling me that data is already added; it has added the data itself.
By the way, this is already-inserted data. The total number of chunks available is 21, from five distinct sources.
Right now we are in mock mode. Why? Because we have not added the Grok key yet, so the key has not been injected. If you want to add more files, you can upload a new file: upload, chunk, and add. If you select a new file, say I have a DeepEval file, it will add that file too. If I choose append, it adds the file, and if you go back to the dashboard, it shows six. Do you see that? The DeepEval PDF has gone in as well.
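Mock mode can be sketched as a simple fallback: when no API key is configured, return the retrieved chunks without calling the LLM. The function name `answer` and the result shape are assumptions for illustration, not the explorer's actual code.

```python
from typing import Callable, Optional


def answer(
    question: str,
    chunks: list[str],
    api_key: Optional[str],
    generate: Callable[[str, list[str], str], str],
) -> dict:
    """Return retrieval only in mock mode, or retrieval plus generation when a key is set."""
    if not api_key:
        # No key configured: skip generation and show only what was retrieved.
        return {"mode": "mock", "answer": None, "chunks": chunks}
    return {"mode": "live", "answer": generate(question, chunks, api_key), "chunks": chunks}
```

This matches what the demo shows before the key is added: the chunks come back, but no generated answer.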
So the DeepEval PDF is in too. We have a RAG, and we can ask questions directly. Tell me about the refunds for electronics items. It's telling me the top retrieved chunks are from the refund policy and so on.
These chunks are returned from the refund policy. And what is the answer? Well, it just gave the retrieval.
Anything related to DeepEval?
Look, the DeepEval PDF comes back as chunk zero. Do you see that? It is not giving an answer as of now, but the chunk view tells you which chunk was retrieved. To be honest, that is still correct: the chunk is correct, and it is telling me which file and chunk it would search from.
So that is very good in this case. I will ask it to modify this. Let's go to search. Now let's do a vector search: tell me about the refunds.
So the top results are about the refund policy: seven business days. Let me ask in chat. Let's open it now and ask a question: tell me about the refunds. It replied here, and the reply is correct: refunds are made within seven days. It replied properly, this is the refund policy, and four chunks were retrieved; we could make it three as well. Let me ask another question: tell me about which products you have. The chatbot says it doesn't have any information about that. Can I get a refund after seven days? The answer is that, according to the refund policy, the refund must be processed, but the policy does not specify the time frame for that case. So it is working fine, isn't it? We will check relevancy and everything later, but it is returning data from the context.

In production you will never know whether the chatbot is connected to a RAG or not. But when you are developing the application internally, if this is your product, then ideally the chatbot is connected to a RAG. So, in a nutshell, you are testing the answers: what retrieval is happening, how the answer comes out, and whether the LLM we are using is good or not. With DeepEval, it is checking the 15-plus metrics, nothing else. Here it is now verifying the live mode by itself, and our RAG pipeline is working.

Let me ask one more question: how many franchise partners? Dalchini has 60-plus franchise partners, but here you will not see anything at all. Now a toxic input: "You fools, you don't know anything." Let me see how it replies: "I apologize, I could not help." So I think toxicity and everything is handled as well. Another one: "I have already returned my hoodie after 36 days, why are you guys still not doing it? Worst experience." Let me see how it replies: it apologizes and replies properly. How many franchises do we have in Dalchini? It doesn't know. That's fine; the poor thing hasn't been trained on that yet, it is only trained for customer support.

So building the dashboard is almost the last part: the DeepEval live dashboard UI, just for our purposes. The dashboard is ready. Do you want to see it? It has made a beautiful dashboard.
I love it. Isn't it good? So here we have the chatbot, which is connected, the RAG, which is connected, and the judge, which is connected. The judge is OpenAI, and this is how many passed and how many failed. We are checking answer relevancy: if you run it, it asks the chatbot, and answer relevancy for the chatbot is fine. If you go to the detail view, I asked about the refund, it gave me the answer, and the answer is absolutely correct. By the way, most of the test cases will pass, because we have mostly set things up that way. Answer relevancy for RAG is the second one, so you can check RAG as well. Someone is asking: sir, can the UI only be created in Claude? No, it can be created in other tools too. Isn't it super cool? Tell me yes in the chat. So we can check summarization and each and every thing. We can check with synthetic data, with the RAG application, with the chatbot; we can check all of them now. You can select the target you want: synthetic data, summarization, the RAG Explorer, or the chatbot. You can choose the judge you want to use: OpenAI, Ollama, or Grok. If you want to change the model, you can apply that, or you can click Run All. It will run all of them and score every run.
Right now it is running all of them, and it will score all the runs at once. Isn't it super awesome? Tell me in the chat.
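The pass and fail counts on a dashboard like this can be derived from the raw run results. A minimal sketch, assuming each result is a dictionary with hypothetical `target`, `score`, and `threshold` keys:

```python
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Count passed and failed runs overall and per target (chatbot, rag, ...)."""
    per_target = defaultdict(lambda: {"passed": 0, "failed": 0})
    for result in results:
        bucket = "passed" if result["score"] >= result["threshold"] else "failed"
        per_target[result["target"]][bucket] += 1
    total = len(results)
    passed = sum(counts["passed"] for counts in per_target.values())
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "by_target": dict(per_target),
    }
```

A "Run All" button would collect one result per metric and target, then feed the list to a function like this to fill in the summary tiles.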
All the DeepEval test cases are running: faithfulness, answer relevancy.
Someone asks: sir, I am not getting why it is only this refund question. It's the refund question because that is what we asked. This is actually a dummy one.
Mansi, this is actually a dummy; we have set it up in such a way that it asks the refund question. What is the refund question? The answer is seven days, which it verified against the golden dataset, that's it. You can test other LLMs and other chatbots as well. If you have a chatbot running, the framework can connect to different chatbots and check them.
If I plug a real hallucination in here, the hallucination test will fail, yes or no? Our LLM judge will fail it. It will definitely catch it; the hallucination check will absolutely fail here.
Here you can see the errors; some of them have an error message too. So I hope you got the point. How amazing is this? I don't think you have seen a framework like this, and how we have created it in the form of a proper dashboard.
We have made it properly in the form of a dashboard. This is your chatbot.
I am also keeping the chatbot as a screenshot, so that you can see it properly.
Is it clear, everyone? The ingestion and everything else I have also kept in the form of screenshots. You can run it. I'll add all of this to the screenshots.
Our repository will be the same, and I will push it. Tell me, how many of you have loved the overall journey so far?
And how many of you are going to refer other people? "Sir, totally enjoyed it." Yogesh is saying, "Sir, I am thinking the future of testing will be QA." Let's get this done.
Let's update it.