Project: AI PDF Search
An AI-powered PDF search tool built with Streamlit and OpenAI embeddings that enables semantic search across PDF documents using natural language queries.
Project Overview
AI PDF Search addresses a common frustration: finding specific information buried inside long PDF documents. Traditional keyword search often fails when the wording of your query doesn't exactly match the document text. This project uses OpenAI embeddings and cosine similarity to enable semantic search, meaning you can ask questions in natural language and get relevant results even when the terminology differs.
Architecture
The application follows a simple but powerful pipeline architecture:
+-------------------+    +------------------+    +-------------------+    +------------------+
|    PDF Upload     | -> | Text Extraction  | -> |   Chunk & Embed   | -> |  Vector Search   |
|  (Streamlit UI)   |    |    (PyMuPDF)     |    |   (OpenAI API)    |    |   (Cosine Sim)   |
+-------------------+    +------------------+    +-------------------+    +------------------+
                                                                                   |
                                                                                   v
+-------------------+    +------------------+    +-------------------+    +------------------+
|  Display Results  | <- |  Rank & Format   | <- |  Query Embedding  | <- |    User Query    |
| (Context + Score) |    |  (Top-K Filter)  |    |   (OpenAI API)    |    |  (Natural Lang)  |
+-------------------+    +------------------+    +-------------------+    +------------------+
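The two rows of the diagram can be read as an indexing pass followed by a query pass. The sketch below illustrates that flow only. It assumes a hypothetical `toy_embed` function in place of the OpenAI embedding call (it merely counts a few keywords, so it is not semantic), and the helper names `build_index` and `run_query` are made up for this example.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]

# Hypothetical stand-in for the embedding API: counts words from a tiny vocabulary.
VOCAB = ["notice", "resign", "salary", "patent"]


def toy_embed(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(pages: List[Dict], embed: Callable[[str], Vector]) -> List[Tuple[Dict, Vector]]:
    """Top row of the diagram: extracted pages become (chunk, vector) pairs."""
    return [(page, embed(page["text"])) for page in pages]


def run_query(query: str, index: List[Tuple[Dict, Vector]],
              embed: Callable[[str], Vector], top_k: int = 2) -> List[Dict]:
    """Bottom row: embed the query, score every chunk, keep the best top_k."""
    q = embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": round(score, 4), "page": chunk["page"], "text": chunk["text"]}
        for score, chunk in scored[:top_k]
    ]


pages = [
    {"page": 1, "text": "salary is paid monthly"},
    {"page": 2, "text": "to resign you must give notice"},
    {"page": 3, "text": "patent rights belong to the employer"},
]
index = build_index(pages, toy_embed)
results = run_query("resign without notice", index, toy_embed)
print(results[0]["page"])  # 2
```

Passing the embedding function as a parameter keeps the flow testable without network access; in the real application that role is played by the OpenAI API.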
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Frontend UI | Streamlit | Interactive web interface for upload and search |
| PDF Parsing | PyMuPDF (fitz) | Fast, accurate text extraction from PDF documents |
| Embeddings | OpenAI API (text-embedding-ada-002) | Convert text to high-dimensional vectors |
| Vector Search | NumPy + Cosine Similarity | Local similarity computation, no external DB needed |
| Text Processing | tiktoken | Token counting for chunk boundary management |
| Language | Python 3.9+ | Core application logic |
How It Works
Step 1: PDF Text Extraction with PyMuPDF
PyMuPDF (imported as fitz) is a fast PDF library. It extracts text page by page, generally in the order the text is stored in the file, which usually matches reading order.
import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    """
    Extract all text from a PDF file, preserving page boundaries.
    Returns a list of dicts with page number and text content.
    """
    pages = []
    # The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as document:
        for page_num in range(len(document)):
            page = document.load_page(page_num)
            text = page.get_text()
            pages.append({
                "page": page_num + 1,  # 1-based page number for display
                "text": text
            })
    return pages
Step 2: Text Chunking and OpenAI Embedding Generation
PDF pages are split into overlapping chunks (typically 500 tokens with 100-token overlap) to ensure context is preserved at boundaries. Each chunk is converted to an embedding vector using OpenAI's embedding model.
import openai
import numpy as np
import tiktoken


def chunk_text(text, max_tokens=500, overlap=100, model="text-embedding-ada-002"):
    """
    Split text into overlapping chunks based on token count.
    Uses tiktoken for accurate OpenAI token counting.
    """
    if not 0 <= overlap < max_tokens:
        # A non-positive step would make the loop below run forever.
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokenizer = tiktoken.encoding_for_model(model)
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        # Decode the token window back into a string for embedding.
        chunks.append(tokenizer.decode(tokens[start:end]))
        if end == len(tokens):
            break  # Avoid a trailing chunk made only of overlap tokens
        start += max_tokens - overlap  # Slide window with overlap
    return chunks


def get_embedding(text, api_key, model="text-embedding-ada-002"):
    """
    Generate an embedding vector for a text chunk using OpenAI API.
    Returns a NumPy array (1536 dimensions for ada-002).
    """
    client = openai.OpenAI(api_key=api_key)
    response = client.embeddings.create(
        model=model,
        input=text.replace("\n", " ")  # Older OpenAI guidance suggested replacing newlines
    )
    embedding = response.data[0].embedding
    return np.array(embedding, dtype=np.float32)
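To see how a 500-token window with a 100-token overlap moves through a page, it can help to look at the token offsets alone. The following is a minimal sketch with a hypothetical helper, `window_bounds`, that works on token counts and needs no tokenizer. It assumes the window stops as soon as it reaches the final token, so no chunk consists purely of overlap.

```python
from typing import List, Tuple


def window_bounds(n_tokens: int, max_tokens: int = 500, overlap: int = 100) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for each chunk of a sliding window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap  # how far the window advances each iteration
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end of the text
        start += step
    return bounds


print(window_bounds(1200))  # [(0, 500), (400, 900), (800, 1200)]
```

Each window starts 400 tokens after the previous one, so neighbouring chunks share exactly 100 tokens.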
Step 3: Vector Similarity Search (Cosine Similarity)
Cosine similarity measures the cosine of the angle between two vectors, producing a score between -1 and 1. For OpenAI embeddings, similar texts score close to 1. This implementation uses NumPy for fast local computation without requiring a vector database.
def cosine_similarity(a, b):
    """
    Compute cosine similarity between two vectors.
    Returns a float between -1 (opposite) and 1 (same direction).
    """
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


def search_chunks(query, chunk_embeddings, chunks, api_key, top_k=5):
    """
    Embed the query and find the top-k most similar chunks.
    Each chunk is expected to be a dict with "page" and "text" keys.
    Returns a list of results with text, page, and similarity score.
    """
    query_embedding = get_embedding(query, api_key)
    scores = []
    for chunk_emb, chunk in zip(chunk_embeddings, chunks):
        score = cosine_similarity(query_embedding, chunk_emb)
        scores.append((score, chunk))
    # Sort by similarity descending and return top-k
    scores.sort(key=lambda x: x[0], reverse=True)
    return [
        {
            "score": round(float(score), 4),
            "page": chunk["page"],
            # Truncate long chunks to a 300-character preview
            "text": chunk["text"][:300] + "..." if len(chunk["text"]) > 300 else chunk["text"]
        }
        for score, chunk in scores[:top_k]
    ]
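A small worked example may make the scoring concrete. The sketch below uses made-up three-dimensional vectors rather than real embeddings and plain Python instead of NumPy; `top_k_indices` is a hypothetical helper name. It shows that cosine similarity depends only on direction, not on vector length.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for zero vectors; treated as no match here
    return dot / (norm_a * norm_b)


def top_k_indices(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k vectors most similar to the query."""
    scores = [cosine_similarity(query, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k]


query = [1.0, 0.0, 1.0]
chunk_vectors = [
    [2.0, 0.0, 2.0],  # same direction as the query, twice the length
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned
]
print(top_k_indices(query, chunk_vectors, 2))  # [0, 2]
```

The first vector scores approximately 1.0 even though it is longer than the query, the third scores approximately 0.5, and the orthogonal one scores 0.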
Step 4: Ranked Results with Context Snippets
Results are presented with similarity scores, source page numbers, and contextual text snippets. Users can click through to view the full context for each match.
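One possible way to turn raw results into display lines is sketched below. The helpers `make_snippet` and `format_results` are hypothetical and not part of the project code; they assume result dicts with "score", "page", and "text" keys and cut snippets at a word boundary rather than mid-word.

```python
from typing import Dict, List


def make_snippet(text: str, limit: int = 300) -> str:
    """Shorten text to at most `limit` characters, cutting at a word boundary."""
    if len(text) <= limit:
        return text
    # Drop the partial word at the end of the slice, if there is one.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."


def format_results(results: List[Dict], limit: int = 300) -> List[str]:
    """Render one line per result: rank, similarity score, page, snippet."""
    lines = []
    for rank, result in enumerate(results, start=1):
        snippet = make_snippet(result["text"], limit)
        lines.append(f"#{rank}  score={result['score']:.3f}  page {result['page']}: {snippet}")
    return lines


example = [{
    "score": 0.8731,
    "page": 4,
    "text": "Either party may terminate this agreement with thirty days written notice.",
}]
for line in format_results(example, limit=40):
    print(line)  # #1  score=0.873  page 4: Either party may terminate this...
```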
Complete Code Walkthrough
File: code.py (Main Application)
#!/usr/bin/env python3
"""
AI PDF Search - Semantic search across PDF documents using OpenAI embeddings.

Author: John Ian Medilo (j1-medilo06)
GitHub: https://github.com/j1-medilo06/ai-pdfsearch
License: MIT

Usage:
    export OPENAI_API_KEY="sk-..."
    streamlit run code.py
"""
import os
import re
from typing import List, Dict

import fitz  # PyMuPDF for PDF text extraction
import numpy as np  # Vector operations and cosine similarity
import openai  # OpenAI API for embeddings
import streamlit as st
import tiktoken  # Token counting for chunk boundaries

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
DEFAULT_EMBEDDING_MODEL = "text-embedding-ada-002"
CHUNK_SIZE_TOKENS = 500
CHUNK_OVERLAP_TOKENS = 100
TOP_K_RESULTS = 5


def get_openai_client() -> openai.OpenAI:
    """Initialize OpenAI client from environment or Streamlit secrets."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        try:
            api_key = st.secrets["OPENAI_API_KEY"]
        except Exception:
            # No secrets file, or the key is not defined in it
            api_key = None
    if not api_key:
        st.error("OpenAI API key not found. Set OPENAI_API_KEY environment variable.")
        st.stop()
    return openai.OpenAI(api_key=api_key)


# ---------------------------------------------------------------------------
# PDF Text Extraction
# ---------------------------------------------------------------------------
def extract_pages_from_pdf(pdf_bytes: bytes) -> List[Dict]:
    """
    Extract text from each page of a PDF document.

    Args:
        pdf_bytes: Raw PDF file contents as bytes.
    Returns:
        List of dicts: [{"page": 1, "text": "..."}, ...]
    """
    pages = []
    # PyMuPDF can open a document directly from bytes, so no temporary
    # file has to be written, reopened, and deleted.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            text = page.get_text("text")
            # Collapse runs of blank lines into single newlines
            text = re.sub(r'\n+', '\n', text).strip()
            if text:  # Skip pages without extractable text
                pages.append({"page": page_num + 1, "text": text})
    return pages
# ---------------------------------------------------------------------------
# Text Chunking with Token-Aware Boundaries
# ---------------------------------------------------------------------------
def chunk_pages(
    pages: List[Dict],
    max_tokens: int = CHUNK_SIZE_TOKENS,
    overlap: int = CHUNK_OVERLAP_TOKENS,
    model: str = DEFAULT_EMBEDDING_MODEL
) -> List[Dict]:
    """
    Split page text into overlapping chunks based on token count.
    Uses tiktoken for accurate token counting matching the embedding model.
    Chunks maintain page attribution for result referencing.

    Args:
        pages: List of page dicts from extract_pages_from_pdf.
        max_tokens: Maximum tokens per chunk.
        overlap: Overlapping tokens between consecutive chunks
            (must be smaller than max_tokens).
        model: OpenAI model name for tokenizer selection.
    Returns:
        List of chunk dicts: [{"page": 1, "text": "...", "tokens": 420}, ...]
    """
    if not 0 <= overlap < max_tokens:
        # A non-positive step would make the window loop run forever.
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokenizer = tiktoken.encoding_for_model(model)
    chunks = []
    for page in pages:
        tokens = tokenizer.encode(page["text"])
        start = 0
        while start < len(tokens):
            end = min(start + max_tokens, len(tokens))
            chunk_tokens = tokens[start:end]
            chunks.append({
                "page": page["page"],
                "text": tokenizer.decode(chunk_tokens),
                "tokens": len(chunk_tokens)
            })
            if end == len(tokens):
                break  # Avoid a trailing chunk made only of overlap tokens
            start += max_tokens - overlap
    return chunks
# ---------------------------------------------------------------------------
# OpenAI Embedding Generation
# ---------------------------------------------------------------------------
@st.cache_data(show_spinner=False)
def generate_embeddings(
    chunk_texts: tuple,
    api_key: str,
    model: str = DEFAULT_EMBEDDING_MODEL
) -> np.ndarray:
    """
    Generate embedding vectors for all text chunks via OpenAI API.
    Cached by Streamlit to avoid re-computing on every query.
    Sends chunks in batches to limit the number of API requests.

    Args:
        chunk_texts: Tuple of chunk text strings (tuple for hashability).
        api_key: OpenAI API key.
        model: Embedding model name.
    Returns:
        NumPy array of shape (n_chunks, dim) with float32 embeddings
        (dim is 1536 for text-embedding-ada-002).
    """
    client = openai.OpenAI(api_key=api_key)
    texts = list(chunk_texts)
    embeddings = []
    batch_size = 100  # The API accepts up to 2048 inputs per request; 100 is safe
    total_batches = (len(texts) - 1) // batch_size + 1
    progress_bar = st.progress(0, text="Generating embeddings...")
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Replace newlines, following older OpenAI guidance
        batch = [t.replace("\n", " ") for t in batch]
        response = client.embeddings.create(model=model, input=batch)
        embeddings.extend(item.embedding for item in response.data)
        progress_bar.progress(
            min((i + batch_size) / len(texts), 1.0),
            text=f"Embedding batch {i // batch_size + 1}/{total_batches}..."
        )
    progress_bar.empty()
    return np.array(embeddings, dtype=np.float32)
# ---------------------------------------------------------------------------
# Vector Similarity Search
# ---------------------------------------------------------------------------
def cosine_similarity_batch(query_vec: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between query vector and all embeddings efficiently.
    Uses vectorized NumPy operations for performance.

    Args:
        query_vec: Shape (dim,) query embedding.
        embeddings: Shape (n_chunks, dim) document embeddings.
    Returns:
        Shape (n_chunks,) array of similarity scores.
    """
    dot_products = np.dot(embeddings, query_vec)
    query_norm = np.linalg.norm(query_vec)
    embedding_norms = np.linalg.norm(embeddings, axis=1)
    denom = query_norm * embedding_norms
    # Zero-length vectors have a zero dot product; dividing by 1 yields a score of 0
    return dot_products / np.where(denom == 0, 1.0, denom)


def semantic_search(
    query: str,
    chunks: List[Dict],
    embeddings: np.ndarray,
    client: openai.OpenAI,
    model: str = DEFAULT_EMBEDDING_MODEL,
    top_k: int = TOP_K_RESULTS
) -> List[Dict]:
    """
    Execute semantic search: embed query, compute similarities, return top-k.

    Args:
        query: Natural language query string.
        chunks: List of chunk dicts with page and text.
        embeddings: Pre-computed chunk embeddings array.
        client: OpenAI client instance.
        model: Embedding model name (must match the model used for the chunks).
        top_k: Number of results to return.
    Returns:
        List of result dicts with score, page, and text snippet.
    """
    # Generate query embedding
    response = client.embeddings.create(
        model=model,
        input=query.replace("\n", " ")
    )
    query_embedding = np.array(response.data[0].embedding, dtype=np.float32)
    # Compute similarities
    similarities = cosine_similarity_batch(query_embedding, embeddings)
    # Get top-k indices, highest similarity first
    top_indices = np.argsort(similarities)[::-1][:top_k]
    results = []
    for idx in top_indices:
        chunk = chunks[idx]
        results.append({
            "score": round(float(similarities[idx]), 4),
            "page": chunk["page"],
            "text": chunk["text"]
        })
    return results
# ---------------------------------------------------------------------------
# Streamlit UI
# ---------------------------------------------------------------------------
def main():
    """Main Streamlit application entry point."""
    st.set_page_config(
        page_title="AI PDF Search",
        layout="wide"
    )
    st.title("AI PDF Search")
    st.markdown(
        "Upload a PDF and ask questions in natural language. "
        "Powered by OpenAI embeddings and cosine similarity."
    )
    # Sidebar configuration
    with st.sidebar:
        st.header("Configuration")
        api_key = st.text_input(
            "OpenAI API Key (optional)",
            type="password",
            help="Leave blank to use OPENAI_API_KEY env var"
        )
        if api_key:
            os.environ["OPENAI_API_KEY"] = api_key
        model = st.selectbox(
            "Embedding Model",
            ["text-embedding-ada-002", "text-embedding-3-small", "text-embedding-3-large"],
            index=0
        )
        chunk_size = st.slider("Chunk Size (tokens)", 100, 1000, 500, 50)
        overlap = st.slider("Chunk Overlap (tokens)", 0, 200, 100, 10)
        if overlap >= chunk_size:
            # The chunking window cannot advance unless overlap < chunk size
            st.warning("Overlap must be smaller than chunk size; using half the chunk size.")
            overlap = chunk_size // 2
        top_k = st.slider("Results to Show", 1, 10, 5)
        st.markdown("---")
        st.markdown(
            "[GitHub Repo](https://github.com/j1-medilo06/ai-pdfsearch)"
        )
    # Initialize OpenAI client
    try:
        client = get_openai_client()
    except Exception as e:
        st.error(f"Failed to initialize OpenAI client: {e}")
        return
    # PDF Upload
    uploaded_file = st.file_uploader(
        "Upload PDF Document",
        type=["pdf"],
        help="Upload a PDF to index for semantic search"
    )
    if uploaded_file is not None:
        # Process PDF
        with st.spinner("Extracting text from PDF..."):
            pdf_bytes = uploaded_file.read()
            pages = extract_pages_from_pdf(pdf_bytes)
        if not pages:
            st.warning("No extractable text found in this PDF.")
            return
        st.success(f"Extracted {len(pages)} pages with text content")
        # Chunk text
        with st.spinner("Chunking text..."):
            chunks = chunk_pages(pages, chunk_size, overlap, model)
        st.success(f"Created {len(chunks)} text chunks")
        # Generate embeddings
        chunk_texts = tuple(c["text"] for c in chunks)
        # Reuse the key the client resolved (environment variable or Streamlit secrets)
        api_key_for_cache = client.api_key
        try:
            embeddings = generate_embeddings(chunk_texts, api_key_for_cache, model)
        except Exception as e:
            st.error(f"Embedding generation failed: {e}")
            return
        st.success(f"Generated {len(embeddings)} embedding vectors")
        # Search interface
        st.markdown("---")
        st.subheader("Ask a Question")
        query = st.text_input(
            "Your question",
            placeholder="e.g., What are the termination clauses in this contract?"
        )
        if query:
            with st.spinner("Searching..."):
                results = semantic_search(query, chunks, embeddings, client, model, top_k)
            st.subheader(f"Top {len(results)} Results")
            for i, result in enumerate(results, 1):
                with st.container():
                    col1, col2 = st.columns([1, 4])
                    with col1:
                        st.metric(
                            label=f"Result #{i}",
                            value=f"{result['score']:.3f}",
                            delta=f"Page {result['page']}"
                        )
                    with col2:
                        # Truncate long chunks to a 500-character preview
                        snippet = result["text"]
                        if len(snippet) > 500:
                            snippet = snippet[:500] + "..."
                        st.markdown(f"> {snippet}")
                    st.markdown("---")


if __name__ == "__main__":
    main()
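The batching arithmetic used for embedding generation (slice the chunk list, count batches, report progress) can be looked at in isolation. This is a minimal sketch with a hypothetical generator, `iter_batches`; it makes no API calls and only shows how 250 chunks would be split with a batch size of 100.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(items: Sequence[str], batch_size: int = 100) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (batch_number, total_batches, batch) for consecutive slices of items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for number, start in enumerate(range(0, len(items), batch_size), start=1):
        yield number, total, list(items[start:start + batch_size])


texts = [f"chunk {i}" for i in range(250)]
sizes = [(number, total, len(batch)) for number, total, batch in iter_batches(texts)]
print(sizes)  # [(1, 3, 100), (2, 3, 100), (3, 3, 50)]
```

The progress fraction shown to the user would then simply be `number / total` for each batch.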
File: install.sh (Installation Script)
#!/usr/bin/env bash
# =============================================================================
# AI PDF Search - Installation Script
# Author: John Ian Medilo (j1-medilo06)
# GitHub: https://github.com/j1-medilo06/ai-pdfsearch
# =============================================================================
set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo -e "${BLUE}=== AI PDF Search Installer ===${NC}"
echo ""

# Check Python version (3.9 or higher is required)
if ! python3 -c "import sys; sys.exit(0 if sys.version_info >= (3, 9) else 1)" 2>/dev/null; then
    echo -e "${RED}Error: Python 3.9 or higher is required.${NC}"
    echo "Current version: $(python3 --version 2>/dev/null || echo 'not found')"
    exit 1
fi
echo -e "${GREEN}$(python3 --version) detected${NC}"

# Create virtual environment
VENV_DIR="venv"
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv "$VENV_DIR"
fi
echo -e "${GREEN}Virtual environment ready${NC}"

# Activate and install dependencies
echo "Installing dependencies..."
source "$VENV_DIR/bin/activate"
pip install --upgrade pip setuptools wheel
pip install streamlit PyMuPDF openai numpy tiktoken
echo -e "${GREEN}Dependencies installed${NC}"

# Check for OpenAI API key
if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo ""
    echo -e "${RED}Warning: OPENAI_API_KEY environment variable not set.${NC}"
    echo "Set it before running the app:"
    echo "    export OPENAI_API_KEY='sk-your-key-here'"
fi

# Create .env template
if [ ! -f ".env" ]; then
    cat > .env << 'EOF'
# AI PDF Search Configuration
# Copy your OpenAI API key here or set as environment variable
OPENAI_API_KEY=sk-your-key-here
EOF
    echo -e "${GREEN}Created .env template${NC}"
fi

echo ""
echo -e "${GREEN}=== Installation Complete ===${NC}"
echo ""
echo "To start the application:"
echo "    source venv/bin/activate"
echo "    export OPENAI_API_KEY='sk-your-key-here'"
echo "    streamlit run code.py"
echo ""
echo "Or use the .env file:"
# Single quotes keep the command literal, so .env contents are not expanded and printed here.
# 'set -a' exports every variable that 'source .env' defines and skips comment lines.
echo '    source venv/bin/activate && set -a && source .env && set +a && streamlit run code.py'
Installation Instructions
1. Clone the repository
   git clone https://github.com/j1-medilo06/ai-pdfsearch.git
   cd ai-pdfsearch
2. Run the installation script
   chmod +x install.sh
   ./install.sh
3. Set your OpenAI API key
   export OPENAI_API_KEY="sk-your-openai-api-key"
   Security Note: Never commit API keys to version control. Use environment variables, .env files (added to .gitignore), or a secrets manager. The install script creates a .env template for local development.
4. Launch the application
   source venv/bin/activate
   streamlit run code.py
   The app will open in your browser at http://localhost:8501.
5. Upload a PDF and search
   - Click "Browse files" to upload any PDF document.
   - Wait for text extraction and embedding generation (progress bar shown).
   - Type a natural language question in the search box.
   - Review ranked results with similarity scores and page references.
Configuration
| Parameter | Default | Description |
|---|---|---|
| OPENAI_API_KEY | Required | OpenAI API key for embedding generation |
| CHUNK_SIZE_TOKENS | 500 | Tokens per chunk (100 to 1000 recommended) |
| CHUNK_OVERLAP_TOKENS | 100 | Overlapping tokens between chunks |
| TOP_K_RESULTS | 5 | Number of results to display |
| DEFAULT_EMBEDDING_MODEL | text-embedding-ada-002 | OpenAI embedding model |
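These parameters constrain each other: the overlap has to stay below the chunk size, otherwise the chunking window never advances. One way to make that explicit is a small validated settings object. The `SearchConfig` class below is a hypothetical illustration using the defaults from the table, not something the project defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the tunable parameters listed above."""
    chunk_size_tokens: int = 500
    chunk_overlap_tokens: int = 100
    top_k_results: int = 5
    embedding_model: str = "text-embedding-ada-002"

    def __post_init__(self) -> None:
        if self.chunk_size_tokens <= 0:
            raise ValueError("chunk_size_tokens must be positive")
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")
        if self.top_k_results < 1:
            raise ValueError("top_k_results must be at least 1")

    @property
    def step(self) -> int:
        """Number of new tokens each successive chunk contributes."""
        return self.chunk_size_tokens - self.chunk_overlap_tokens


config = SearchConfig()
print(config.step)  # 400
```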
Usage Examples
Below are example queries demonstrating the semantic search capabilities:
| Document Type | Example Query | Expected Result |
|---|---|---|
| Employment Contract | "What happens if I resign without notice?" | Termination clause section with notice period details |
| Technical Whitepaper | "How does the consensus algorithm handle byzantine faults?" | Relevant section on consensus mechanism |
| Legal Agreement | "Intellectual property ownership terms" | IP assignment and licensing clauses |
| Research Paper | "Summary of experimental methodology" | Methods section with procedure details |

Try It Yourself
Copy and run the following commands to get started in under 2 minutes:
# Clone and setup
git clone https://github.com/j1-medilo06/ai-pdfsearch.git
cd ai-pdfsearch && ./install.sh
# Set API key and launch
export OPENAI_API_KEY="sk-your-key-here"
source venv/bin/activate && streamlit run code.py
Future Enhancements
- Multi-document search: Index multiple PDFs simultaneously with document-level filtering
- Local LLM support: Integration with Ollama or llama.cpp for fully offline operation
- Persistent vector storage: ChromaDB or Weaviate integration for large document collections
- Hybrid search: Combine keyword (BM25) and semantic search for optimal relevance
- Source highlighting: Visual highlight of matching regions on original PDF pages
- Chat interface: Conversational follow-up questions with context memory
References & Links
| Resource | Link |
|---|---|
| GitHub Repository | github.com/j1-medilo06/ai-pdfsearch |
| OpenAI Embeddings API | platform.openai.com/docs/guides/embeddings |
| Streamlit Documentation | docs.streamlit.io |
| PyMuPDF Documentation | pymupdf.readthedocs.io |
| Author GitHub | github.com/j1-medilo06 |