Intelligent Routing for Chatbot Questions: How We Cut AI API Costs by 95%
Your AI chatbot works perfectly — it retrieves relevant documents, grades their quality, and generates accurate answers. But your monthly OpenAI bill shows $3,000, and when you analyze the logs, a disturbing pattern emerges: 30% of queries are simple questions like "What are you?" or "Hello" that trigger your entire expensive RAG pipeline. Each "Hi" costs $0.05 and takes 25 seconds while the system runs a full vector search, document grading, and LLM generation for a simple greeting.
This is the hidden waste in naive chatbot implementations: every query follows the same path, regardless of complexity. A user asking "How does your system work?" triggers the same document retrieval and grading process as someone asking "Explain the GDPR compliance requirements for API data processing in multi-region deployments." The first needs a simple pre-written response; the second demands your full RAG capabilities.
In this article, I'll show you how to implement intelligent routing for questions, explain the classification logic, share production metrics and code examples, and help you achieve similar cost savings without sacrificing functionality.
In this article:
- Why do chatbots without intelligent routing generate unnecessary costs?
- How does a three-tier question classification strategy work?
- How can you implement intelligent routing with structured output and LangGraph?
- What production results can intelligent routing deliver? Up to 85% cost reduction
- When intelligent routing makes sense (and when it doesn't)
- How can you get started with intelligent routing in your chatbot?
- Intelligent routing for chatbot questions – conclusion
- Want to implement intelligent routing in your RAG system?
Why do chatbots without intelligent routing generate unnecessary costs?
Most RAG chatbots treat every user message identically. Whether someone types "Hello" or asks a complex technical question, the system executes the full pipeline: embed the query, search the vector database, grade documents, prepare context, and generate a response. This uniform approach is simple to implement but wasteful in production.
How do RAG processing costs differ across query types?
Let's break down what actually happens for different query types in a naive implementation and compare this to a routed approach.
Generic query without intelligent routing: "What can you do?"
1. Query embedding: 50 tokens → $0.000025
2. Vector search: Database query → processing time + infrastructure
3. Document retrieval: 20 candidate chunks → retrieved but useless
4. Document grading: 20 LLM calls × 250 tokens → $0.012500
5. Answer generation: 15,000 tokens → $0.039550
- Total cost: $0.052075 per generic query
- Time: 25-30 seconds
- Value: Zero - a simple pre-written response would suffice
Generic query with intelligent routing: "What can you do?"
1. Classification: 150 tokens → $0.000375
2. Direct generation: 766 tokens → $0.001900
- Total cost: $0.00247 per routed query
- Time: 2-3 seconds
- Value: Identical answer quality, 95% cost reduction
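To make the comparison concrete, here is a tiny sketch that reproduces the arithmetic from the two breakdowns above. The unit costs are the illustrative per-query totals quoted in this article, not official API pricing:
# Per-query totals quoted above (illustrative figures, not official API pricing)
naive_cost = 0.052075    # full RAG pipeline triggered by "What can you do?"
routed_cost = 0.00247    # classification + direct generation only

saving = naive_cost - routed_cost
print(f"Saved per generic query: ${saving:.5f} (~{saving / naive_cost:.0%} reduction)")
# -> roughly $0.0496 saved, about 95% per generic query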
What impact does intelligent routing have on savings at scale?
For a moderate-traffic chatbot handling 10,000 queries monthly with 10% being generic:
Without routing:
- 1,000 generic queries × $0.052075 = $52.08/month wasted.
- 1,000 queries × 25 seconds = 6.9 hours of cumulative user wait time.
- Unnecessary load on vector database and infrastructure.
- Higher rate limit consumption.
With intelligent routing:
- 1,000 generic queries × $0.00247 = $2.47/month.
- Savings: $49.61/month on generic queries alone.
- 1,000 queries × 2 seconds = 33 minutes total wait time.
- Reduced infrastructure load.
- 95% fewer LLM API calls for these queries.
For higher-volume deployments, savings scale into the thousands of dollars, justifying the additional development and maintenance effort.
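As a rough illustration of that scaling, the sketch below multiplies the same illustrative per-query figures by different traffic volumes. The 10% generic share is an assumption you should replace with numbers from your own logs:
def estimated_monthly_savings(total_queries: int, generic_share: float = 0.10,
                              naive_cost: float = 0.052075,
                              routed_cost: float = 0.00247) -> float:
    """Estimate monthly savings from routing generic queries (illustrative figures)."""
    generic_queries = total_queries * generic_share
    return generic_queries * (naive_cost - routed_cost)

for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} queries/month -> ~${estimated_monthly_savings(volume):,.2f} saved")
# roughly $50, $500, and $5,000 per month respectively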

An example of a project in which we implemented intelligent routing to optimize query handling and reduce RAG processing costs
Read the full AI Document Chatbot case study here →
How does a three-tier question classification strategy work?
Our production implementation uses a three-tier classification system, each tier handling different query patterns with varying processing requirements.
Tier 1: Generic questions
These are fundamental queries about the chatbot itself: "What is this?", "How do you work?", "Who built you?". Every user asks these questions when first encountering the chatbot.
Characteristics:
- About the chatbot system itself, not the knowledge base content.
- Highly repetitive across all users.
- Answers don't require document retrieval.
- Best served with pre-written responses.
Examples from production:
- "Hello"
- "Who are you?"
- "What are you?"
- "How do you function?"
- "What can you help me with?"
Routing decision:
- Skip RAG entirely → direct response generation.
Tier 2: Conversational/social
These are small talk, gratitude, and social pleasantries that don't require knowledge base access.
Characteristics:
- Social conventions and politeness.
- Don't seek information.
- Should be acknowledged but brief.
- No document context needed.
Examples from production:
- "Thank you" / "Danke"
- "That was helpful"
- "Have a nice day"
- "How are you doing?"
Routing decision:
- Skip RAG → simple acknowledgment.
Tier 3: Document search queries
Questions that require searching the knowledge base and retrieving specific information.
Characteristics:
- Information-seeking intent.
- Reference specific topics, concepts, or procedures.
- Benefit from document context and source attribution.
- Require full RAG pipeline.
Examples from production:
- "What is the process for GDPR compliance?"
- "Tell me about zero-trust architecture"
- "How do I configure API authentication?"
- "What are the best practices for data encryption?"
Routing decision:
- Execute full RAG pipeline → retrieve, grade, generate.
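Before looking at the full implementation, the three tiers and their routing decisions can be summarized in a small lookup structure. This is only a sketch: the category values match the classifier output used later in this article, while the handler descriptions are informal summaries.
from enum import Enum

class QuestionCategory(str, Enum):
    GENERIC = "generic"                  # Tier 1: about the chatbot itself
    CONVERSATIONAL = "conversational"    # Tier 2: small talk and gratitude
    DOCUMENT_SEARCH = "document_search"  # Tier 3: needs the knowledge base

# Informal summary of what each tier triggers downstream
ROUTING_DECISIONS = {
    QuestionCategory.GENERIC: "skip RAG, generate a direct response",
    QuestionCategory.CONVERSATIONAL: "skip RAG, return a brief acknowledgment",
    QuestionCategory.DOCUMENT_SEARCH: "run the full RAG pipeline: retrieve, grade, generate",
}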
Read also: A 40% Accuracy Boost for RAG Chatbots Using Document Grading →
How can you implement intelligent routing with structured output and LangGraph?
Let's explore how to implement intelligent routing using GPT-4o for classification and LangGraph for workflow orchestration.
Step 1: Define question types with structured output
We use OpenAI's structured output feature to ensure consistent, parseable classification results:
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

class QuestionType(BaseModel):
    """Structured output for question classification"""
    category: str = Field(
        description="Question category: 'generic', 'conversational', or 'document_search'"
    )
    confidence: float = Field(
        description="Confidence score between 0.0 and 1.0"
    )
    reasoning: str = Field(
        description="Brief explanation of classification decision"
    )

# Initialize LLM with structured output
classifier_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,  # Deterministic classification
).with_structured_output(QuestionType)
Step 2: Create classification prompt
The classification prompt teaches the LLM to recognize different question types:
classification_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert question classifier for a knowledge base chatbot.
Classify each question into one of three categories:
1. GENERIC: Questions about the chatbot itself
- Examples: "What are you?", "How do you work?", "Who built you?"
- These don't require searching documents
2. CONVERSATIONAL: Social pleasantries and gratitude
- Examples: "Thank you", "Hello", "Have a nice day"
- These are acknowledgments, not information requests
3. DOCUMENT_SEARCH: Information-seeking questions
- Examples: "What is X?", "How do I configure Y?", "Tell me about Z"
- These require searching the knowledge base
Consider:
- Intent: Is the user seeking information or just conversing?
- Context: Does answering require knowledge base access?
- Specificity: Generic questions about the system vs. specific content questions
Provide:
- category: The classification
- confidence: Your confidence (0.0 to 1.0)
- reasoning: Brief explanation
If confidence is low (<0.7), default to 'document_search' to ensure users get helpful answers.
"""),
    ("human", "Classify this question: {question}")
])
Step 3: Implement classifier function
With the prompt defined, the classifier function ties the prompt and the structured-output model together and updates the workflow state:
def classify_question(state: dict) -> dict:
    """
    Classify user question to determine routing path.

    Args:
        state: Dict containing 'question' key

    Returns:
        Updated state with 'question_type', 'confidence', 'reasoning'
    """
    question = state["question"]

    # Get classification from LLM
    chain = classification_prompt | classifier_llm
    result = chain.invoke({"question": question})

    # Low confidence? Default to document_search for safety
    if result.confidence < 0.7:
        result.category = "document_search"
        result.reasoning += " (Low confidence - defaulting to document search)"

    # Update state
    return {
        **state,
        "question_type": result.category,
        "confidence": result.confidence,
        "classification_reasoning": result.reasoning,
    }
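A quick way to sanity-check the classifier node is to call it directly on a minimal state dict before wiring it into the graph. The sample question and the printed values below are illustrative:
# Smoke test for the classifier node (example values are illustrative)
sample_state = {"question": "What can you help me with?"}
classified = classify_question(sample_state)

print(classified["question_type"])             # e.g. "generic"
print(classified["confidence"])                # e.g. 0.93
print(classified["classification_reasoning"])  # short explanation from the LLM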
Step 4: Build LangGraph intelligent routing workflow
LangGraph orchestrates the conditional workflow based on classification:
from typing import List, TypedDict

from langchain_core.documents import Document
from langgraph.graph import StateGraph, END

# Define workflow state
class ChatbotState(TypedDict):
    question: str
    question_type: str
    confidence: float
    classification_reasoning: str
    retrieved_docs: List[Document]
    answer: str
# Build workflow
workflow = StateGraph(ChatbotState)

# Add nodes
workflow.add_node("classify_question", classify_question)
workflow.add_node("handle_generic", generate_generic_response)
workflow.add_node("handle_conversational", generate_conversational_response)
workflow.add_node("document_search", execute_full_rag_pipeline)

# Define routing logic
def route_question(state: dict) -> str:
    """
    Route to appropriate handler based on classification.
    """
    question_type = state["question_type"]

    routing_map = {
        "generic": "handle_generic",
        "conversational": "handle_conversational",
        "document_search": "document_search",
    }

    return routing_map.get(question_type, "document_search")

# Set entry point and routing
workflow.set_entry_point("classify_question")
workflow.add_conditional_edges(
    "classify_question",
    route_question,
    {
        "handle_generic": "handle_generic",
        "handle_conversational": "handle_conversational",
        "document_search": "document_search",
    }
)

# All paths end after their handler
workflow.add_edge("handle_generic", END)
workflow.add_edge("handle_conversational", END)
workflow.add_edge("document_search", END)

# Compile workflow
app = workflow.compile()
Step 5: Implement response handlers
Each route has a dedicated handler optimized for that query type:
def generate_generic_response(state: dict) -> dict:
    """
    Generate response for generic questions about the chatbot.
    No document retrieval needed.
    """
    question = state["question"]

    # Simple system prompt for generic questions
    generic_prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful AI assistant for a knowledge base.
Briefly explain what you do and how you can help users.
Keep responses concise (2-3 sentences)."""),
        ("human", "{question}")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
    chain = generic_prompt | llm
    response = chain.invoke({"question": question})

    return {
        **state,
        "answer": response.content,
        "retrieved_docs": []  # No docs retrieved
    }

def generate_conversational_response(state: dict) -> dict:
    """
    Handle conversational/gratitude messages.
    Very lightweight - just acknowledge politely.
    """
    question = state["question"]

    # Ultra-lightweight responses
    conversational_responses = {
        "thank": "You're welcome! Happy to help.",
        "danke": "Gern geschehen!",
        "hello": "Hello! How can I help you today?",
        "hi": "Hi there! What can I assist you with?",
    }

    # Simple keyword matching for common phrases
    question_lower = question.lower()
    for keyword, response in conversational_responses.items():
        if keyword in question_lower:
            return {
                **state,
                "answer": response,
                "retrieved_docs": []
            }

    # Default conversational response
    return {
        **state,
        "answer": "Thank you! Is there anything else I can help you with?",
        "retrieved_docs": []
    }
def execute_full_rag_pipeline(state: dict) -> dict:
    """
    Execute complete RAG pipeline for document search queries.
    This is the expensive path with retrieval and grading.
    """
    question = state["question"]

    # Full RAG implementation (simplified for clarity)

    # 1. Generate optimized search phrase
    search_phrase = generate_search_phrase(question)

    # 2. Retrieve candidate documents (~20) using MMR for diversity
    candidates = vector_store.max_marginal_relevance_search(
        search_phrase,
        k=20
    )

    # 3. Grade documents for relevance
    graded_docs = grade_documents(question, candidates)

    # 4. Select top documents (12)
    relevant_docs = graded_docs[:12]

    # 5. Generate answer with context
    answer = generate_answer(question, relevant_docs)

    return {
        **state,
        "answer": answer,
        "retrieved_docs": relevant_docs
    }
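With the handlers in place and the graph compiled, running a query is a single invoke call. A minimal usage sketch (the question is just an example):
# Run the routed workflow end to end
result = app.invoke({"question": "How do I configure API authentication?"})

print(result["question_type"])        # expected: "document_search"
print(result["answer"])               # answer produced by the selected handler
print(len(result["retrieved_docs"]))  # 0 for generic/conversational routes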
Read also: A Comparison of RAG Stacks — LangChain vs. LangGraph vs. Raw OpenAI →
What production results can intelligent routing deliver? Up to 85% cost reduction
Our production deployment revealed substantial cost and performance improvements from intelligent routing.
How do costs change before and after routing?
Generic query (10% of traffic):
- Before routing: $0.052075 per query.
- After routing: $0.00247 per query.
- Savings: $0.04961 per query (95% reduction).
- Token usage: 19,045 tokens → 766 tokens (96% reduction).
Document search query (90% of traffic):
- Cost: $0.052075 per query (unchanged—these need full RAG).
- No optimization applied: these queries benefit from a full pipeline.
By how much does intelligent routing improve response time?
Response time reduction for routed queries:
- Generic questions: 25 seconds → 2-3 seconds (88% faster).
- Conversational: 25 seconds → <1 second (96% faster).
- Document search: No change (still requires full processing).
Infrastructure benefits:
- 10% reduction in vector database queries.
- 10% reduction in Elasticsearch load.
- Better rate limit headroom with OpenAI API.
- Lower overall infrastructure costs.
How accurate is the question classifier in real conditions?
From monitoring 1,000 classified queries:
- Correct classifications: 94% (queries routed to the appropriate handler).
- Misclassifications: 6% (queries routed to a suboptimal handler).
- Confidence threshold fallback: Used in 8% of cases.
- User complaints: 0 (no reports of inappropriate routing).
Edge cases handled:
- Ambiguous questions default to document_search (confidence <0.7).
- Mixed queries ("Thanks, and can you tell me about X?") route to document_search.
- Multilingual queries correctly classified by intent.
Read also: Intelligent Caching Techniques for Faster AI Chatbot Responses →
When intelligent routing makes sense (and when it doesn't)
Intelligent routing for questions isn't universally beneficial. Understanding when it helps and when it adds unnecessary complexity ensures you invest effort appropriately.
Ideal use cases for intelligent routing
Intelligent routing isn’t necessary for every chatbot, but in certain environments it delivers substantial and measurable benefits. Below are the scenarios where this optimization provides the strongest return on investment.
High-volume chatbots (1,000+ queries/month)
The more queries you handle, the more repetitive patterns emerge. With significant traffic, even small percentage savings compound into meaningful cost reductions. A chatbot answering 100 queries daily will see different economics than one handling 10,000.
Mixed query types
If your users ask both simple questions ("What is this?") and complex information requests, routing provides clear value. Knowledge base chatbots, customer support bots, and educational assistants typically exhibit this pattern.
Cost-sensitive deployments
When API costs are a significant budget concern, routing offers immediate savings. Startups, non-profits, and organizations with tight budgets benefit most from optimization that reduces expenses without sacrificing quality.
Predictable conversation patterns
If analytics reveal that 10-30% of queries fall into generic or conversational categories, routing will deliver measurable savings. Review your logs – if you see repetitive simple questions, routing makes sense.
When to avoid or delay intelligent routing
In some situations, the gains from intelligent routing are minimal or the added complexity outweighs the advantages. Here are the cases where implementing it may not be the right move.
Low-volume deployments (<100 queries/month)
If you're handling fewer than 100 queries monthly, routing optimization likely won't justify the implementation effort. The absolute savings will be too small to matter ($5-10/month), and your time is better spent on other improvements.
Uniform query complexity
If essentially all queries require document search (technical documentation bots, research assistants), routing adds complexity without benefit. When 95%+ of queries need full RAG, just execute the full pipeline for everything.
Early development stage
Get basic RAG working first before optimizing. Implement core retrieval, grading, and generation functionality. Prove the concept works. Add routing only after you have production traffic data showing where optimization would help.
Highly specialized domains
Narrow-domain chatbots where users only ask technical questions may not benefit. A molecular biology research assistant or legal document analyzer likely won't receive "Hello" or "What are you?" very often – users know exactly what they're querying.
Read also: Real-time Data Synchronization Methods to Keep Your RAG Chatbot’s Knowledge Fresh →
How can you get started with intelligent routing in your chatbot?
Ready to implement intelligent question routing? Here's a practical roadmap to follow.
Step 1: Analyze your query patterns
Before implementing routing, understand your chatbot traffic:
- Enable comprehensive logging if you haven't already.
- Capture 2-4 weeks of production queries (minimum 500 queries).
- Manually categorize 100-200 queries into potential categories.
- Calculate percentages: How many generic queries? Conversational? Document search?
- Estimate savings: Generic % × query volume × cost per query.
If generic + conversational queries represent less than 5% of traffic, routing may not be worth implementing.
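If your queries are logged as plain text, a rough first pass with a few keyword heuristics is enough to estimate the category split before you build anything. This is a minimal sketch; the keyword lists and the log file name are placeholders to adapt to your own data:
from collections import Counter

# Placeholder heuristics - tune these to your own traffic
GENERIC_HINTS = ("what are you", "who are you", "how do you work")
CONVERSATIONAL_HINTS = ("thank", "hello", "have a nice day")

def rough_category(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in GENERIC_HINTS):
        return "generic"
    if any(hint in q for hint in CONVERSATIONAL_HINTS):
        return "conversational"
    return "document_search"

# Assumes one logged query per line in query_log.txt (placeholder file name)
with open("query_log.txt", encoding="utf-8") as log_file:
    counts = Counter(rough_category(line.strip()) for line in log_file if line.strip())

total = sum(counts.values())
for category, count in counts.most_common():
    print(f"{category:16s} {count:6d}  ({count / total:.1%})")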
Step 2: Start with simple classification
Don't over-engineer initially. Begin with basic keyword classification:
def simple_classify(question: str) -> str:
    """
    Simple keyword-based classification for MVP.
    """
    question_lower = question.lower()

    # Generic keywords
    generic_keywords = ["what are you", "who are you", "how do you work"]
    if any(kw in question_lower for kw in generic_keywords):
        return "generic"

    # Conversational keywords
    conversational_keywords = ["thank", "thanks", "hello", "hi"]
    if any(kw in question_lower for kw in conversational_keywords):
        return "conversational"

    # Default to document search
    return "document_search"

Deploy this simple classifier, monitor accuracy, and iterate based on results.
Step 3: Implement LLM-based classification
Once simple routing proves valuable, upgrade to LLM classification:
- Define structured output model (as shown in implementation section).
- Create a classification prompt with examples from your domain.
- Add confidence threshold (0.7 works well as default).
- Implement fallback logic to document_search on low confidence.
- Log all classifications for monitoring and improvement.
Step 4: Integrate with LangGraph
Build conditional workflow routing:
- Add classification node as workflow entry point.
- Create routing function based on classification result.
- Implement separate handlers for each route.
- Connect handlers to workflow end state.
- Compile and test workflow with diverse queries.
Step 5: Monitor and optimize
Track performance and refine based on real data.
Key metrics to track:
- Classification accuracy (sample manual review).
- Cost per query by category.
- Response time by route.
- False positive/negative rate.
- User satisfaction by query type.
Weekly reviews:
- Review misclassified queries.
- Identify new patterns.
- Update classification prompt.
- Adjust confidence threshold if needed.
Monthly optimization:
- Analyze cost savings achieved.
- Review edge cases and failures.
- Update routing logic based on learnings.
- Consider adding new categories if patterns emerge.
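One simple way to support these reviews is to log every routing decision as structured data, so classification accuracy and cost per category can be sampled straight from the logs. A minimal sketch; the field names and the cost/latency inputs are assumptions you would wire to your own instrumentation:
import json
import logging
from datetime import datetime, timezone

routing_logger = logging.getLogger("routing")

def log_routing_decision(state: dict, cost_usd: float, latency_s: float) -> None:
    """Record one routed query as a JSON line for later accuracy and cost review."""
    routing_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": state["question"],
        "category": state["question_type"],
        "confidence": state.get("confidence"),
        "reasoning": state.get("classification_reasoning"),
        "cost_usd": cost_usd,
        "latency_s": latency_s,
    }))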
Intelligent routing for chatbot questions – conclusion
Intelligent question routing is a high-ROI optimization for RAG chatbots with mixed query types. By classifying questions before processing, we achieved 95% cost reduction and 88% latency improvement for generic queries without changing answer quality or user experience.
The key insight is recognizing that not all questions require the same processing. Simple questions about your chatbot don't need document retrieval, grading, and complex context assembly. Routing those queries to lightweight handlers saves money and improves response times.
For our production deployment handling 10,000+ monthly queries with 10% generic traffic, routing saves approximately $50/month, cuts hours of cumulative user wait time, and reduces unnecessary infrastructure load. More importantly, users asking simple questions now receive near-instant responses instead of waiting 25 seconds for full RAG processing they don't need.
The implementation is straightforward: classify with structured LLM output, route with LangGraph conditional edges, and handle each type appropriately. Start simple with keyword matching, upgrade to LLM classification when justified, and optimize based on your actual traffic patterns.
If your chatbot handles mixed query types and you're concerned about API costs or response times, intelligent routing should be one of your first optimizations. Review your query logs, calculate potential savings, and implement classification. The combination of cost reduction and performance improvement makes this one of the most impactful changes you can make.
Want to implement intelligent routing in your RAG system?
This blog post is based on our real production implementation of an AI document chatbot serving thousands of users, where intelligent routing for questions is one of several optimizations we deployed. For a deeper look at the full architecture and results, see our AI document chatbot case study.
Interested in building a high-performance RAG system with intelligent routing and other production-grade optimizations? Our team specializes in creating cost-effective AI applications that balance quality, speed, and operational efficiency. Visit our generative AI development services to discover how we can help you design and implement the right solution for your project.