
Build a TF-IDF Chatbot in Python

Build a real chatbot in Python with TF-IDF and cosine similarity. No deep learning required — just scikit-learn, runnable instantly in your browser.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.


How it works

Despite all the hype around large language models, a huge fraction of the chatbots running in production today are not LLMs at all. FAQ bots, support assistants, internal helpdesks — many of them are built on a technique that predates ChatGPT by decades: TF-IDF + cosine similarity. It's fast, predictable, has zero hallucinations, and you can build a working version in about 30 lines of Python.

The Core Idea

The bot has a list of (question, answer) pairs. When the user types something, the bot:

1. Turns the user's message into a numeric vector.

2. Turns every known question into a numeric vector (done once, up front).

3. Computes how similar the user's vector is to each known question.

4. Returns the answer to whichever question is closest.

No training. No GPU. No model weights. Just math on word counts.
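As a sketch, the four steps map almost one-to-one onto scikit-learn calls. The knowledge base below is made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base of (question, answer) pairs -- yours would be larger
faq = [
    ("How do I install a package?", "Use pip install <package>."),
    ("How do I create a virtual environment?", "Run python -m venv .venv."),
    ("What is a list comprehension?", "A compact way to build a list in one expression."),
]
questions = [q for q, _ in faq]

# Step 2 (done once, up front): vectorize every known question
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def answer(user_message: str) -> str:
    # Step 1: the user's message becomes a vector in the same space
    user_vector = vectorizer.transform([user_message])
    # Step 3: cosine similarity against every known question
    scores = cosine_similarity(user_vector, question_vectors)[0]
    # Step 4: return the answer attached to the closest question
    return faq[scores.argmax()][1]

print(answer("how can I install packages?"))
```

Note that the user's phrasing doesn't need to match exactly — sharing a couple of distinctive words with one knowledge-base question is enough to win the similarity comparison.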

What TF-IDF Actually Does

TF-IDF stands for Term Frequency – Inverse Document Frequency. It assigns each word in each document a weight that captures two things at once:

  • Term Frequency — how often does this word appear in this document? Words that show up a lot are probably important to that document.
  • Inverse Document Frequency — how rare is this word across all documents? Words like "the" and "is" appear everywhere, so they get a low weight. Words like "matplotlib" or "recursion" appear in only a few documents, so they get a high weight.

Multiply them together and you get a number that captures "how important is this word to this specific document, given how special the word is overall?" Every document becomes a vector in a high-dimensional space — one dimension per word in the vocabulary.
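You can inspect these weights directly. A minimal sketch on a made-up three-document corpus, using scikit-learn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: "the" appears in every document, "cat" in just one
docs = ["the cat sat", "the dog ran", "the bird flew"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # one row per document, one column per word

# "the" is everywhere (low IDF), "cat" is rare (high IDF)
the_w = X[0, vec.vocabulary_["the"]]
cat_w = X[0, vec.vocabulary_["cat"]]
print(f"'the': {the_w:.3f}  'cat': {cat_w:.3f}")
```

In the first document both words appear once (same term frequency), so the difference in weight comes entirely from the IDF term.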

Cosine Similarity — Comparing Two Vectors

Once every question is a vector, you need a way to ask: how similar is the user's question vector to each knowledge-base question vector?

Cosine similarity measures the angle between two vectors:

  • 1.0 — the vectors point in exactly the same direction. The two pieces of text use the same words in roughly the same proportions.
  • 0.0 — the vectors are perpendicular. Completely unrelated.
  • −1.0 — opposite directions. (Doesn't happen with TF-IDF, since all values are non-negative.)

The magic: cosine similarity ignores length. A 5-word question and a 50-word question can still have a similarity of 1.0 if they're about the same thing. That's exactly the property you want for a chatbot — users type short messages, and your knowledge base might have longer entries.
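The length-invariance is easy to check — repeating the same made-up sentence five times changes the word counts but not the vector's direction:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

short = "how do I install packages"
long_text = " ".join([short] * 5)   # same words, five times the length

vec = TfidfVectorizer()
X = vec.fit_transform([short, long_text])

# Counts differ by a factor of 5, but the direction is identical
sim = cosine_similarity(X[0], X[1])[0, 0]
print(round(sim, 3))
```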

Why The User's Vector Has To Use The Same Vocabulary

The single most common bug when building this kind of bot:

    user_vector = vectorizer.transform([user_message])              # CORRECT
    user_vector = TfidfVectorizer().fit_transform([user_message])   # WRONG

You must call transform, not fit_transform, on user input. The vectorizer was fit on your knowledge base — it has a fixed vocabulary. The user's message has to be converted into that same vocabulary, even if it contains words the vectorizer has never seen (those words just get ignored).
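A quick sketch of the difference, with invented example strings: transform keeps the knowledge-base vocabulary and silently drops unknown words, while a fresh fit_transform builds a new, incompatible vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

kb_questions = ["how do I install packages", "how do I write a loop"]
vectorizer = TfidfVectorizer()
vectorizer.fit(kb_questions)

# transform: the unknown word "quickly" is simply ignored
v = vectorizer.transform(["install packages quickly"])
print(v.shape)      # width matches the knowledge-base vocabulary

# A fresh fit_transform builds a *different* vocabulary -- its vectors
# are not comparable to the knowledge base at all
wrong = TfidfVectorizer().fit_transform(["install packages quickly"])
print(wrong.shape)  # a different, smaller width
```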

The Confidence Threshold — Knowing When To Say "I Don't Know"

Without a threshold, your bot will always return something — even when the user asked about quantum physics and your knowledge base is about Python. The fix is a simple cutoff:

    if best_score < 0.2:
        return "I don't know the answer to that."

The threshold is the most important parameter to tune. Too high, and the bot refuses to answer easy questions. Too low, and it confidently gives the wrong answer to anything. Start at 0.2 and adjust based on real user queries.
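Wired into the lookup, the cutoff is a two-line guard. The knowledge base and the respond name here are stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [
    ("how do I install a package", "Use pip install <package>."),
    ("what is a list comprehension", "A compact way to build a list in one expression."),
]
questions = [q for q, _ in faq]
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

THRESHOLD = 0.2  # tune this on real user queries

def respond(message: str) -> str:
    scores = cosine_similarity(vectorizer.transform([message]), question_vectors)[0]
    best = int(scores.argmax())
    if scores[best] < THRESHOLD:
        return "I don't know the answer to that."
    return faq[best][1]

print(respond("install a package"))             # on-topic, above threshold
print(respond("explain quantum entanglement"))  # off-topic, below threshold
```

A message with no vocabulary overlap at all produces an all-zero vector, so its similarity to everything is 0.0 and the guard fires.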

Tuning Knobs Worth Knowing

The TfidfVectorizer has a handful of parameters that meaningfully change behavior:

  • lowercase=True — treats "Python" and "python" as the same word. Almost always what you want.
  • stop_words='english' — drops common words (the, is, a, of). Boosts signal-to-noise.
  • ngram_range=(1, 2) — captures both single words and two-word phrases. "data science" becomes its own term.
  • min_df=2 — ignores words that appear in fewer than 2 documents. Kills rare typos.
  • max_df=0.9 — ignores words that appear in more than 90% of documents. An extra stop-word filter.
  • sublinear_tf=True — uses 1 + log(tf) instead of raw counts. Softens the impact of repeated words.

For most chatbots, the defaults plus stop_words='english' and ngram_range=(1, 2) work well.
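For instance, two of those knobs in action on a made-up one-line corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["data science is fun"]

default = TfidfVectorizer().fit(doc)
tuned = TfidfVectorizer(stop_words="english", ngram_range=(1, 2)).fit(doc)

print(sorted(default.vocabulary_))  # single words only, "is" included
print(sorted(tuned.vocabulary_))    # "is" dropped; bigrams like "data science" added
```

Stop words are removed before the n-grams are built, so the bigrams in the tuned vocabulary span only the surviving words.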

Strengths Of This Approach

  • Zero hallucinations — the bot can only return answers you wrote. It cannot make things up.
  • Instant — no model loading, no GPU, no API calls. Responses in milliseconds.
  • Auditable — you can always see why the bot picked an answer (the cosine similarity score and the matched question).
  • Easy to update — add a new (question, answer) pair, refit the vectorizer, done. No retraining.
  • Works offline — no API keys, no internet needed.

Where It Falls Short

  • Synonym blindness — TF-IDF only matches words that literally appear. "car" and "automobile" look unrelated to it. Fix: use sentence embeddings (sentence-transformers) instead. Same overall architecture, much smarter matching — but doesn't run in Pyodide.
  • Word order is ignored — "dog bites man" and "man bites dog" have identical TF-IDF vectors.
  • No reasoning — it can't combine information from multiple entries. If the answer requires synthesizing two facts, it will fail.
  • Doesn't generate — it can only return text you've already written.
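The word-order point above is easy to demonstrate — the two sentences produce identical vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = TfidfVectorizer()
X = vec.fit_transform(["dog bites man", "man bites dog"])

# Same words, same counts -> identical bags of words, similarity 1.0
sim = cosine_similarity(X[0], X[1])[0, 0]
print(round(sim, 3))
```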

Real-World Upgrade Path

When you're ready to graduate from this:

1. Replace TF-IDF with sentence embeddings — same code structure, just swap the vectorizer for sentence-transformers. Gets you synonym understanding and much better paraphrase matching.

2. Add a fallback to an LLM — when confidence is low, hand the question off to GPT/Claude with the top-3 matching knowledge-base entries as context. This is the standard RAG (retrieval-augmented generation) pattern that powers most production AI chatbots today.

3. Index with a vector database — when your knowledge base grows past a few thousand entries, switch from cosine_similarity to FAISS, Chroma, or Pinecone for fast nearest-neighbor search.

Run the snippet above and you'll see a real chatbot answer questions about Python — including ones phrased completely differently from the original entries — and correctly admit when it has no idea what the user is asking.
