Intelligent Routing for Chatbot Questions: How We Cut AI API Costs by 95%
Your AI chatbot works perfectly — it retrieves relevant documents, grades their quality, and generates accurate answers. But your monthly OpenAI bill shows $3,000, and when you analyze the logs, a disturbing pattern emerges: 30% of queries are simple questions like "What are you?" or "Hello" that trigger your entire expensive RAG pipeline. Each "Hi" costs $0.05 and takes 25 seconds while the system runs a full vector search, document grading, and LLM generation for a simple greeting.
This is the hidden waste in naive chatbot implementations: every query follows the same path, regardless of complexity. A user asking "How does your system work?" triggers the same document retrieval and grading process as someone asking "Explain the GDPR compliance requirements for API data processing in multi-region deployments." The first needs a simple pre-written response; the second demands your full RAG capabilities.
In this article, I'll show you how to implement intelligent routing for questions, explain the classification logic, share production metrics and code examples, and help you achieve similar cost savings without sacrificing functionality.
In this article:
- Why do chatbots without intelligent routing generate unnecessary costs?
- How does a three-tier question classification strategy work?
- How can you implement intelligent routing with structured output and LangGraph?
- What production results can intelligent routing deliver? Up to 85% cost reduction
- When intelligent routing makes sense (and when it doesn't)
- How can you get started with intelligent routing in your chatbot?
- Intelligent routing for chatbot questions – conclusion
- Want to implement intelligent routing in your RAG system?
Why do chatbots without intelligent routing generate unnecessary costs?
Most RAG chatbots treat every user message identically. Whether someone types "Hello" or asks a complex technical question, the system executes the full pipeline: embed the query, search the vector database, grade documents, prepare context, and generate a response. This uniform approach is simple to implement but wasteful in production.
How do RAG processing costs differ across query types?
Let's break down what actually happens for different query types in a naive implementation and compare this to a routed approach.
Generic query without intelligent routing: "What can you do?"
1. Query embedding: 50 tokens → $0.000025
2. Vector search: Database query → processing time + infrastructure
3. Document retrieval: 20 candidate chunks → retrieved but useless
4. Document grading: 20 LLM calls × 250 tokens → $0.012500
5. Answer generation: 15,000 tokens → $0.039550
- Total cost: $0.052075 per generic query
- Time: 25-30 seconds
- Value: Zero - a simple pre-written response would suffice
Generic query with intelligent routing: "What can you do?"
1. Classification: 150 tokens → $0.000375
2. Direct generation: 766 tokens → $0.001900
- Total cost: $0.00247 per routed query
- Time: 2-3 seconds
- Value: Identical answer quality, 95% cost reduction
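To make the comparison concrete, here is a tiny sketch that reproduces the arithmetic from the two breakdowns above. The unit costs are the illustrative per-query totals quoted in this article, not official API pricing:
# Per-query totals quoted above (illustrative figures, not official API pricing)
naive_cost = 0.052075    # full RAG pipeline triggered by "What can you do?"
routed_cost = 0.00247    # classification + direct generation only

saving = naive_cost - routed_cost
print(f"Saved per generic query: ${saving:.5f} (~{saving / naive_cost:.0%} reduction)")
# -> roughly $0.0496 saved, about 95% per generic query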
What impact does intelligent routing have on savings at scale?
For a moderate-traffic chatbot handling 10,000 queries monthly with 10% being generic:
Without routing:
- 1,000 generic queries × $0.052075 = $52.08/month wasted.
- 1,000 queries × 25 seconds = 6.9 hours of cumulative user wait time.
- Unnecessary load on vector database and infrastructure.
- Higher rate limit consumption.
With intelligent routing:
- 1,000 generic queries × $0.00247 = $2.47/month.
- Savings: $49.61/month on generic queries alone.
- 1,000 queries × 2 seconds = 33 minutes total wait time.
- Reduced infrastructure load.
- 95% fewer LLM API calls for these queries.
For higher-volume deployments, savings scale into the thousands of dollars, justifying the additional development and maintenance effort.
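As a rough illustration of that scaling, the sketch below multiplies the same illustrative per-query figures by different traffic volumes. The 10% generic share is an assumption you should replace with numbers from your own logs:
def estimated_monthly_savings(total_queries: int, generic_share: float = 0.10,
                              naive_cost: float = 0.052075,
                              routed_cost: float = 0.00247) -> float:
    """Estimate monthly savings from routing generic queries (illustrative figures)."""
    generic_queries = total_queries * generic_share
    return generic_queries * (naive_cost - routed_cost)

for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} queries/month -> ~${estimated_monthly_savings(volume):,.2f} saved")
# roughly $50, $500, and $5,000 per month respectively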

An example of a project in which we implemented intelligent routing to optimize query handling and reduce RAG processing costs
Read the full AI Document Chatbot case study here →
How does a three-tier question classification strategy work?
Our production implementation uses a three-tier classification system, each tier handling different query patterns with varying processing requirements.
Tier 1: Generic questions
These are fundamental queries about the chatbot itself: "What is this?", "How do you work?", "Who built you?". Every user asks these questions when first encountering the chatbot.
Characteristics:
- About the chatbot system itself, not the knowledge base content.
- Highly repetitive across all users.
- Answers don't require document retrieval.
- Best served with pre-written responses.
Examples from production:
- "Hello"
- "Who are you?"
- "What are you?"
- "How do you function?"
- "What can you help me with?"
Routing decision:
- Skip RAG entirely → direct response generation.
Tier 2: Conversational/social
These are small talk, gratitude, and social pleasantries that don't require knowledge base access.
Characteristics:
- Social conventions and politeness.
- Don't seek information.
- Should be acknowledged but brief.
- No document context needed.
Examples from production:
- "Thank you" / "Danke"
- "That was helpful"
- "Have a nice day"
- "How are you doing?"
Routing decision:
- Skip RAG → simple acknowledgment.
Tier 3: Document search queries
Questions that require searching the knowledge base and retrieving specific information.
Characteristics:
- Information-seeking intent.
- Reference specific topics, concepts, or procedures.
- Benefit from document context and source attribution.
- Require full RAG pipeline.
Examples from production:
- "What is the process for GDPR compliance?"
- "Tell me about zero-trust architecture"
- "How do I configure API authentication?"
- "What are the best practices for data encryption?"
Routing decision:
- Execute full RAG pipeline → retrieve, grade, generate.
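Before looking at the full implementation, the three tiers and their routing decisions can be summarized in a small lookup structure. This is only a sketch: the category values match the classifier output used later in this article, while the handler descriptions are informal summaries.
from enum import Enum

class QuestionCategory(str, Enum):
    GENERIC = "generic"                  # Tier 1: about the chatbot itself
    CONVERSATIONAL = "conversational"    # Tier 2: small talk and gratitude
    DOCUMENT_SEARCH = "document_search"  # Tier 3: needs the knowledge base

# Informal summary of what each tier triggers downstream
ROUTING_DECISIONS = {
    QuestionCategory.GENERIC: "skip RAG, generate a direct response",
    QuestionCategory.CONVERSATIONAL: "skip RAG, return a brief acknowledgment",
    QuestionCategory.DOCUMENT_SEARCH: "run the full RAG pipeline: retrieve, grade, generate",
}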
Read also: A 40% Accuracy Boost for RAG Chatbots Using Document Grading →
How can you implement intelligent routing with structured output and LangGraph?
Let's explore how to implement intelligent routing using GPT-4o for classification and LangGraph for workflow orchestration.
Step 1: Define question types with structured output
We use OpenAI's structured output feature to ensure consistent, parseable classification results:
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

class QuestionType(BaseModel):
    """Structured output for question classification"""
    category: str = Field(
        description="Question category: 'generic', 'conversational', or 'document_search'"
    )
    confidence: float = Field(
        description="Confidence score between 0.0 and 1.0"
    )
    reasoning: str = Field(
        description="Brief explanation of classification decision"
    )

# Initialize LLM with structured output
classifier_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,  # Deterministic classification
).with_structured_output(QuestionType)
Step 2: Create classification prompt
The classification prompt teaches the LLM to recognize different question types:
classification_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert question classifier for a knowledge base chatbot.
Classify each question into one of three categories:
1. GENERIC: Questions about the chatbot itself
- Examples: "What are you?", "How do you work?", "Who built you?"
- These don't require searching documents
2. CONVERSATIONAL: Social pleasantries and gratitude
- Examples: "Thank you", "Hello", "Have a nice day"
- These are acknowledgments, not information requests
3. DOCUMENT_SEARCH: Information-seeking questions
- Examples: "What is X?", "How do I configure Y?", "Tell me about Z"
- These require searching the knowledge base
Consider:
- Intent: Is the user seeking information or just conversing?
- Context: Does answering require knowledge base access?
- Specificity: Generic questions about the system vs. specific content questions
Provide:
- category: The classification
- confidence: Your confidence (0.0 to 1.0)
- reasoning: Brief explanation
If confidence is low (<0.7), default to 'document_search' to ensure users get helpful answers.
"""),
    ("human", "Classify this question: {question}")
])
Step 3: Implement classifier function
With the prompt defined, the classifier function ties the prompt and the structured-output model together and updates the workflow state:
def classify_question(state: dict) -> dict:
    """
    Classify user question to determine routing path.

    Args:
        state: Dict containing 'question' key

    Returns:
        Updated state with 'question_type', 'confidence', 'reasoning'
    """
    question = state["question"]

    # Get classification from LLM
    chain = classification_prompt | classifier_llm
    result = chain.invoke({"question": question})

    # Low confidence? Default to document_search for safety
    if result.confidence < 0.7:
        result.category = "document_search"
        result.reasoning += " (Low confidence - defaulting to document search)"

    # Update state
    return {
        **state,
        "question_type": result.category,
        "confidence": result.confidence,
        "classification_reasoning": result.reasoning,
    }
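A quick way to sanity-check the classifier node is to call it directly on a minimal state dict before wiring it into the graph. The sample question and the printed values below are illustrative:
# Smoke test for the classifier node (example values are illustrative)
sample_state = {"question": "What can you help me with?"}
classified = classify_question(sample_state)

print(classified["question_type"])             # e.g. "generic"
print(classified["confidence"])                # e.g. 0.93
print(classified["classification_reasoning"])  # short explanation from the LLM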
Step 4: Build LangGraph intelligent routing workflow
LangGraph orchestrates the conditional workflow based on classification:
from typing import List, TypedDict

from langchain_core.documents import Document
from langgraph.graph import StateGraph, END

# Define workflow state
class ChatbotState(TypedDict):
    question: str
    question_type: str
    confidence: float
    classification_reasoning: str
    retrieved_docs: List[Document]
    answer: str
# Build workflow
workflow = StateGraph(ChatbotState)

# Add nodes
workflow.add_node("classify_question", classify_question)
workflow.add_node("handle_generic", generate_generic_response)
workflow.add_node("handle_conversational", generate_conversational_response)
workflow.add_node("document_search", execute_full_rag_pipeline)

# Define routing logic
def route_question(state: dict) -> str:
    """
    Route to appropriate handler based on classification.
    """
    question_type = state["question_type"]

    routing_map = {
        "generic": "handle_generic",
        "conversational": "handle_conversational",
        "document_search": "document_search",
    }

    return routing_map.get(question_type, "document_search")

# Set entry point and routing
workflow.set_entry_point("classify_question")
workflow.add_conditional_edges(
    "classify_question",
    route_question,
    {
        "handle_generic": "handle_generic",
        "handle_conversational": "handle_conversational",
        "document_search": "document_search",
    }
)

# All paths end after their handler
workflow.add_edge("handle_generic", END)
workflow.add_edge("handle_conversational", END)
workflow.add_edge("document_search", END)

# Compile workflow
app = workflow.compile()
Step 5: Implement response handlers
Each route has a dedicated handler optimized for that query type:
def generate_generic_response(state: dict) -> dict:
    """
    Generate response for generic questions about the chatbot.
    No document retrieval needed.
    """
    question = state["question"]

    # Simple system prompt for generic questions
    generic_prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful AI assistant for a knowledge base.
Briefly explain what you do and how you can help users.
Keep responses concise (2-3 sentences)."""),
        ("human", "{question}")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
    chain = generic_prompt | llm
    response = chain.invoke({"question": question})

    return {
        **state,
        "answer": response.content,
        "retrieved_docs": []  # No docs retrieved
    }

def generate_conversational_response(state: dict) -> dict:
    """
    Handle conversational/gratitude messages.
    Very lightweight - just acknowledge politely.
    """
    question = state["question"]

    # Ultra-lightweight responses
    conversational_responses = {
        "thank": "You're welcome! Happy to help.",
        "danke": "Gern geschehen!",
        "hello": "Hello! How can I help you today?",
        "hi": "Hi there! What can I assist you with?",
    }

    # Simple keyword matching for common phrases
    question_lower = question.lower()
    for keyword, response in conversational_responses.items():
        if keyword in question_lower:
            return {
                **state,
                "answer": response,
                "retrieved_docs": []
            }

    # Default conversational response
    return {
        **state,
        "answer": "Thank you! Is there anything else I can help you with?",
        "retrieved_docs": []
    }
def execute_full_rag_pipeline(state: dict) -> dict:
    """
    Execute complete RAG pipeline for document search queries.
    This is the expensive path with retrieval and grading.
    """
    question = state["question"]

    # Full RAG implementation (simplified for clarity)

    # 1. Generate optimized search phrase
    search_phrase = generate_search_phrase(question)

    # 2. Retrieve candidate documents (~20) using MMR for diversity
    candidates = vector_store.max_marginal_relevance_search(
        search_phrase,
        k=20
    )

    # 3. Grade documents for relevance
    graded_docs = grade_documents(question, candidates)

    # 4. Select top documents (12)
    relevant_docs = graded_docs[:12]

    # 5. Generate answer with context
    answer = generate_answer(question, relevant_docs)

    return {
        **state,
        "answer": answer,
        "retrieved_docs": relevant_docs
    }
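With the handlers in place and the graph compiled, running a query is a single invoke call. A minimal usage sketch (the question is just an example):
# Run the routed workflow end to end
result = app.invoke({"question": "How do I configure API authentication?"})

print(result["question_type"])        # expected: "document_search"
print(result["answer"])               # answer produced by the selected handler
print(len(result["retrieved_docs"]))  # 0 for generic/conversational routes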
Read also: A Comparison of RAG Stacks — LangChain vs. LangGraph vs. Raw OpenAI →
What production results can intelligent routing deliver? Up to 85% cost reduction
Our production deployment revealed substantial cost and performance improvements from intelligent routing.
How do costs change before and after routing?
Generic query (10% of traffic):
- Before routing: $0.052075 per query.
- After routing: $0.00247 per query.
- Savings: $0.04961 per query (95% reduction).
- Token usage: 19,045 tokens → 766 tokens (96% reduction).
Document search query (90% of traffic):
- Cost: $0.052075 per query (unchanged—these need full RAG).
- No optimization applied: these queries benefit from a full pipeline.
By how much does intelligent routing improve response time?
Response time reduction for routed queries:
- Generic questions: 25 seconds → 2-3 seconds (88% faster).
- Conversational: 25 seconds → <1 second (96% faster).
- Document search: No change (still requires full processing).
Infrastructure benefits:
- 10% reduction in vector database queries.
- 10% reduction in Elasticsearch load.
- Better rate limit headroom with OpenAI API.
- Lower overall infrastructure costs.
How accurate is the question classifier in real conditions?
From monitoring 1,000 classified queries:
- Correct classifications: 94% (queries routed to the appropriate handler).
- Misclassifications: 6% (queries routed to a suboptimal handler).
- Confidence threshold fallback: Used in 8% of cases.
- User complaints: 0 (no reports of inappropriate routing).
Edge cases handled:
- Ambiguous questions default to document_search (confidence <0.7).
- Mixed queries ("Thanks, and can you tell me about X?") route to document_search.
- Multilingual queries correctly classified by intent.
Read also: Intelligent Caching Techniques for Faster AI Chatbot Responses →
When intelligent routing makes sense (and when it doesn't)
Intelligent routing for questions isn't universally beneficial. Understanding when it helps and when it adds unnecessary complexity ensures you invest effort appropriately.
Ideal use cases for intelligent routing
Intelligent routing isn’t necessary for every chatbot, but in certain environments it delivers substantial and measurable benefits. Below are the scenarios where this optimization provides the strongest return on investment.
High-volume chatbots (1,000+ queries/month)
The more queries you handle, the more repetitive patterns emerge. With significant traffic, even small percentage savings compound into meaningful cost reductions. A chatbot answering 100 queries daily will see different economics than one handling 10,000.
Mixed query types
If your users ask both simple questions ("What is this?") and complex information requests, routing provides clear value. Knowledge base chatbots, customer support bots, and educational assistants typically exhibit this pattern.
Cost-sensitive deployments
When API costs are a significant budget concern, routing offers immediate savings. Startups, non-profits, and organizations with tight budgets benefit most from optimization that reduces expenses without sacrificing quality.
Predictable conversation patterns
If analytics reveal that 10-30% of queries fall into generic or conversational categories, routing will deliver measurable savings. Review your logs – if you see repetitive simple questions, routing makes sense.
When to avoid or delay intelligent routing
In some situations, the gains from intelligent routing are minimal or the added complexity outweighs the advantages. Here are the cases where implementing it may not be the right move.
Low-volume deployments (<100 queries/month)
If you're handling fewer than 100 queries monthly, routing optimization likely won't justify the implementation effort. The absolute savings will be too small to matter ($5-10/month), and your time is better spent on other improvements.
Uniform query complexity
If essentially all queries require document search (technical documentation bots, research assistants), routing adds complexity without benefit. When 95%+ of queries need full RAG, just execute the full pipeline for everything.
Early development stage
Get basic RAG working first before optimizing. Implement core retrieval, grading, and generation functionality. Prove the concept works. Add routing only after you have production traffic data showing where optimization would help.
Highly specialized domains
Narrow-domain chatbots where users only ask technical questions may not benefit. A molecular biology research assistant or legal document analyzer likely won't receive "Hello" or "What are you?" very often – users know exactly what they're querying.
Read also: Real-time Data Synchronization Methods to Keep Your RAG Chatbot’s Knowledge Fresh →
How can you get started with intelligent routing in your chatbot?
Ready to implement intelligent question routing? Here's a practical roadmap to follow.
Step 1: Analyze your query patterns
Before implementing routing, understand your chatbot traffic:
- Enable comprehensive logging if you haven't already.
- Capture 2-4 weeks of production queries (minimum 500 queries).
- Manually categorize 100-200 queries into potential categories.
- Calculate percentages: How many generic queries? Conversational? Document search?
- Estimate savings: Generic % × query volume × cost per query.
If generic + conversational queries represent less than 5% of traffic, routing may not be worth implementing.
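If your queries are logged as plain text, a rough first pass with a few keyword heuristics is enough to estimate the category split before you build anything. This is a minimal sketch; the keyword lists and the log file name are placeholders to adapt to your own data:
from collections import Counter

# Placeholder heuristics - tune these to your own traffic
GENERIC_HINTS = ("what are you", "who are you", "how do you work")
CONVERSATIONAL_HINTS = ("thank", "hello", "have a nice day")

def rough_category(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in GENERIC_HINTS):
        return "generic"
    if any(hint in q for hint in CONVERSATIONAL_HINTS):
        return "conversational"
    return "document_search"

# Assumes one logged query per line in query_log.txt (placeholder file name)
with open("query_log.txt", encoding="utf-8") as log_file:
    counts = Counter(rough_category(line.strip()) for line in log_file if line.strip())

total = sum(counts.values())
for category, count in counts.most_common():
    print(f"{category:16s} {count:6d}  ({count / total:.1%})")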
Step 2: Start with simple classification
Don't over-engineer initially. Begin with basic keyword classification:
def simple_classify(question: str) -> str:
    """
    Simple keyword-based classification for MVP.
    """
    question_lower = question.lower()

    # Generic keywords
    generic_keywords = ["what are you", "who are you", "how do you work"]
    if any(kw in question_lower for kw in generic_keywords):
        return "generic"

    # Conversational keywords
    conversational_keywords = ["thank", "thanks", "hello", "hi"]
    if any(kw in question_lower for kw in conversational_keywords):
        return "conversational"

    # Default to document search
    return "document_search"

Deploy this simple classifier, monitor accuracy, and iterate based on results.
Step 3: Implement LLM-based classification
Once simple routing proves valuable, upgrade to LLM classification:
- Define structured output model (as shown in implementation section).
- Create a classification prompt with examples from your domain.
- Add confidence threshold (0.7 works well as default).
- Implement fallback logic to document_search on low confidence.
- Log all classifications for monitoring and improvement.
Step 4: Integrate with LangGraph
Build conditional workflow routing:
- Add classification node as workflow entry point.
- Create routing function based on classification result.
- Implement separate handlers for each route.
- Connect handlers to workflow end state.
- Compile and test workflow with diverse queries.
Step 5: Monitor and optimize
Track performance and refine based on real data.
Key metrics to track:
- Classification accuracy (sample manual review).
- Cost per query by category.
- Response time by route.
- False positive/negative rate.
- User satisfaction by query type.
Weekly reviews:
- Review misclassified queries.
- Identify new patterns.
- Update classification prompt.
- Adjust confidence threshold if needed.
Monthly optimization:
- Analyze cost savings achieved.
- Review edge cases and failures.
- Update routing logic based on learnings.
- Consider adding new categories if patterns emerge.
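One simple way to support these reviews is to log every routing decision as structured data, so classification accuracy and cost per category can be sampled straight from the logs. A minimal sketch; the field names and the cost/latency inputs are assumptions you would wire to your own instrumentation:
import json
import logging
from datetime import datetime, timezone

routing_logger = logging.getLogger("routing")

def log_routing_decision(state: dict, cost_usd: float, latency_s: float) -> None:
    """Record one routed query as a JSON line for later accuracy and cost review."""
    routing_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": state["question"],
        "category": state["question_type"],
        "confidence": state.get("confidence"),
        "reasoning": state.get("classification_reasoning"),
        "cost_usd": cost_usd,
        "latency_s": latency_s,
    }))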
Intelligent routing for chatbot questions – conclusion
Intelligent question routing is a high-ROI optimization for RAG chatbots with mixed query types. By classifying questions before processing, we achieved 95% cost reduction and 88% latency improvement for generic queries without changing answer quality or user experience.
The key insight is recognizing that not all questions require the same processing. Simple questions about your chatbot don't need document retrieval, grading, and complex context assembly. Routing those queries to lightweight handlers saves money and improves response times.
For our production deployment handling 10,000+ monthly queries with 10% generic traffic, routing saves approximately $50/month, cuts hours of cumulative user wait time, and reduces unnecessary infrastructure load. More importantly, users asking simple questions now receive near-instant responses instead of waiting 25 seconds for full RAG processing they don't need.
The implementation is straightforward: classify with structured LLM output, route with LangGraph conditional edges, and handle each type appropriately. Start simple with keyword matching, upgrade to LLM classification when justified, and optimize based on your actual traffic patterns.
If your chatbot handles mixed query types and you're concerned about API costs or response times, intelligent routing should be one of your first optimizations. Review your query logs, calculate potential savings, and implement classification. The combination of cost reduction and performance improvement makes this one of the most impactful changes you can make.
Want to implement intelligent routing in your RAG system?
This blog post is based on our real production implementation of an AI document chatbot serving thousands of users, where intelligent routing for questions is one of several optimizations we deployed. For a deeper look at the full architecture and results, see our AI document chatbot case study.
Interested in building a high-performance RAG system with intelligent routing and other production-grade optimizations? Our team specializes in creating cost-effective AI applications that balance quality, speed, and operational efficiency. Visit our generative AI development services to discover how we can help you design and implement the right solution for your project.