Build a TF-IDF Chatbot in Python
Build a real chatbot in Python with TF-IDF and cosine similarity. No deep learning required — just scikit-learn, runnable instantly in your browser.
Try it yourself
Run this code directly in your browser. Click "Open in full editor" to experiment further.
How it works
Despite all the hype around large language models, a huge fraction of the chatbots running in production today are not LLMs at all. FAQ bots, support assistants, internal helpdesks — many of them are built on a technique that predates ChatGPT by decades: TF-IDF + cosine similarity. It's fast, predictable, has zero hallucinations, and you can build a working version in about 30 lines of Python.
The Core Idea
The bot has a list of (question, answer) pairs. When the user types something, the bot:
1. Turns the user's message into a numeric vector.
2. Turns every known question into a numeric vector (done once, up front).
3. Computes how similar the user's vector is to each known question.
4. Returns the answer to whichever question is closest.
No training. No GPU. No model weights. Just math on word counts.
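The four steps above can be sketched in roughly 30 lines with scikit-learn. The (question, answer) pairs here are illustrative placeholders; swap in your own knowledge base:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative knowledge base: (question, answer) pairs.
faq = [
    ("What is Python?", "Python is a general-purpose programming language."),
    ("How do I install a package?", "Use pip: pip install <package>."),
    ("What is a list comprehension?", "A compact syntax for building lists."),
]
questions = [q for q, _ in faq]
answers = [a for _, a in faq]

# Step 2, done once up front: vectorize every known question.
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def reply(user_message, threshold=0.2):
    # Step 1: turn the user's message into a vector (same vocabulary).
    user_vector = vectorizer.transform([user_message])
    # Step 3: similarity between the user's vector and every known question.
    scores = cosine_similarity(user_vector, question_vectors)[0]
    best = scores.argmax()
    # Step 4: answer the closest question, unless confidence is too low.
    if scores[best] < threshold:
        return "I don't know the answer to that."
    return answers[best]

print(reply("how can i install packages?"))    # finds the pip entry
print(reply("tell me about quantum physics"))  # falls below the threshold
```

Note that fit_transform runs once on the knowledge base, while each incoming message only goes through transform; the sections below explain why.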
What TF-IDF Actually Does
TF-IDF stands for Term Frequency – Inverse Document Frequency. It assigns each word in each document a weight that captures two things at once:
- Term frequency (TF): how often the word appears in this particular document.
- Inverse document frequency (IDF): how rare the word is across all documents. Common words like "the" score low; distinctive words score high.
Multiply them together and you get a number that captures "how important is this word to this specific document, given how special the word is overall?" Every document becomes a vector in a high-dimensional space — one dimension per word in the vocabulary.
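To make that concrete, here is a toy corpus run through scikit-learn's TfidfVectorizer with default settings (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "python is a programming language",
    "python is popular for data science",
    "cats are popular pets",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary word.
print(matrix.shape)  # (3, 11): 3 documents, 11 vocabulary words

# In the first document, "language" (unique to that document) gets a
# higher weight than "is" (shared with another document).
vocab = vectorizer.vocabulary_
row0 = matrix.toarray()[0]
print(row0[vocab["language"]] > row0[vocab["is"]])  # True
```

Words shared by many documents get a low IDF and therefore a low weight, which is exactly the "how special is this word overall" part of the formula.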
Cosine Similarity — Comparing Two Vectors
Once every question is a vector, you need a way to ask: how similar is the user's question vector to each knowledge-base question vector?
Cosine similarity measures the angle between two vectors:

cos(θ) = (A · B) / (‖A‖ ‖B‖)
The magic: cosine similarity ignores length. A 5-word question and a 50-word question can still have a similarity of 1.0 if they're about the same thing. That's exactly the property you want for a chatbot — users type short messages, your knowledge base might have longer entries.
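A quick numeric check of both properties, using hand-picked vectors rather than TF-IDF output:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[10.0, 20.0, 0.0]])  # same direction, 10x the length
c = np.array([[0.0, 0.0, 5.0]])    # orthogonal: nothing in common

print(cosine_similarity(a, b)[0, 0])  # ~1.0: length is ignored
print(cosine_similarity(a, c)[0, 0])  # 0.0: no shared dimensions
```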
Why The User's Vector Has To Use The Same Vocabulary
The single most common bug when building this kind of bot:
```python
user_vector = vectorizer.transform([user_message])             # CORRECT
user_vector = TfidfVectorizer().fit_transform([user_message])  # WRONG
```

You must call transform, not fit_transform, on user input. The vectorizer was fit on your knowledge base — it has a fixed vocabulary. The user's message has to be converted into that same vocabulary, even if it contains words the vectorizer has never seen (those words just get ignored).
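A small demonstration of why this matters, with a toy two-question knowledge base:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

kb = ["how do I install python", "how do I write a loop"]
vectorizer = TfidfVectorizer()
vectorizer.fit(kb)

# transform() maps the message into the knowledge-base vocabulary;
# unknown words like "please" are silently ignored.
v = vectorizer.transform(["please install python"])
print(v.shape[1])  # 6: same width as the knowledge-base vectors

# fit_transform() on the user message builds a brand-new vocabulary,
# so its vectors live in a different space and can't be compared.
w = TfidfVectorizer().fit_transform(["please install python"])
print(w.shape[1])  # 3: a different, incompatible width
```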
The Confidence Threshold — Knowing When To Say "I Don't Know"
Without a threshold, your bot will always return something — even when the user asked about quantum physics and your knowledge base is about Python. The fix is a simple cutoff:
```python
if best_score < 0.2:
    return "I don't know the answer to that."
```

The threshold is the most important parameter to tune. Too high and the bot refuses to answer easy questions. Too low and it confidently gives the wrong answer to anything. Start at 0.2 and adjust based on real user queries.
Tuning Knobs Worth Knowing
The TfidfVectorizer has a handful of parameters that meaningfully change behavior:
| Parameter | What it does |
|---|---|
| `lowercase=True` | Treats "Python" and "python" as the same word. Almost always what you want. |
| `stop_words='english'` | Drops common words (the, is, a, of). Boosts signal-to-noise. |
| `ngram_range=(1, 2)` | Captures both single words and two-word phrases. "data science" becomes its own term. |
| `min_df=2` | Ignore words that appear in fewer than 2 documents — kills rare typos. |
| `max_df=0.9` | Ignore words that appear in more than 90% of documents — extra stop word filter. |
| `sublinear_tf=True` | Use 1 + log(tf) instead of raw counts — softens the impact of repeated words. |
For most chatbots, the defaults plus stop_words='english' and ngram_range=(1, 2) work well.
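That recommended starting point, sketched with a couple of made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Defaults plus English stop words and unigram + bigram features.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
docs = ["data science is fun", "python makes data science easy"]
vectorizer.fit(docs)

# The two-word phrase "data science" is now a term of its own.
print("data science" in vectorizer.vocabulary_)  # True
```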
Strengths Of This Approach
- No training, no GPU, no model weights: it runs instantly, even in the browser.
- Fully predictable: the bot can only return answers you wrote, so there are zero hallucinations.
- Tiny and debuggable: a working version is about 30 lines of Python.

Where It Falls Short
TF-IDF matches words, not meaning. It has no concept of synonyms, so a paraphrase that shares few words with a knowledge-base entry scores poorly. If you need semantic matching, use sentence embeddings (sentence-transformers) instead. Same overall architecture, much smarter matching — but doesn't run in Pyodide.

Real-World Upgrade Path
When you're ready to graduate from this:
1. Replace TF-IDF with sentence embeddings — same code structure, just swap the vectorizer for sentence-transformers. Gets you synonym understanding and much better paraphrase matching.
2. Add a fallback to an LLM — when confidence is low, hand the question off to GPT/Claude with the top-3 matching knowledge base entries as context. This is the standard "RAG" (retrieval-augmented generation) pattern that powers most production AI chatbots today.
3. Index with a vector database — when your knowledge base grows past a few thousand entries, switch from cosine_similarity to FAISS, Chroma, or Pinecone for fast nearest-neighbor search.
Run the snippet above and you'll see a real chatbot answer questions about Python — including ones phrased completely differently from the original entries — and correctly admit when it has no idea what the user is asking.