Project: AI PDF Search

An AI-powered PDF search tool built with Streamlit and OpenAI embeddings that enables semantic search across PDF documents using natural language queries.

Project Overview

AI PDF Search addresses a common frustration: finding specific information buried inside long PDF documents. Traditional keyword search often fails when the wording in your query doesn't exactly match the document text. This project uses OpenAI embeddings and cosine similarity to enable true semantic search — meaning you can ask questions in natural language and get relevant results even when the terminology differs.

Motivation: Built to solve the "haystack problem" in technical documentation, legal documents, and research papers where keyword search is insufficient. The tool grew out of the difficulty of finding specific clauses across 50+ page contracts and technical specs, and it demonstrates how vector embeddings can transform document retrieval workflows.

Architecture

The application is a linear pipeline: a PDF is extracted, chunked, and embedded once, then each query is embedded and compared against the stored chunk vectors:

+-------------------+    +------------------+    +-------------------+    +------------------+
|   PDF Upload      | -> |  Text Extraction | -> |  Chunk & Embed    | -> |  Vector Search   |
|   (Streamlit UI)  |    |  (PyMuPDF)       |    |  (OpenAI API)     |    |  (Cosine Sim)    |
+-------------------+    +------------------+    +-------------------+    +------------------+
                                                                                |
                                                                                v
+-------------------+    +------------------+    +-------------------+    +------------------+
|   Display Results | <- |  Rank & Format   | <- |  Query Embedding  | <- |  User Query      |
| (Context + Score) |    |  (Top-K Filter)  |    |  (OpenAI API)     |    |  (Natural Lang)  |
+-------------------+    +------------------+    +-------------------+    +------------------+
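
The two rows of the diagram can be sketched end to end without any API calls. The sketch below is illustrative only: toy_embed is a hypothetical stand-in that counts words from a tiny fixed vocabulary (it is not the OpenAI embedding model), and the sample chunks are made up.

```python
import math

# Hypothetical fixed vocabulary for the toy embedding (not part of the project).
VOCAB = ["termination", "notice", "payment", "invoice", "patent", "license"]

def toy_embed(text):
    """Stand-in for the embedding step: count vocabulary words in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Top row of the diagram: embed every chunk once per document."""
    return [(chunk, toy_embed(chunk["text"])) for chunk in chunks]

def search(query, index, top_k=2):
    """Bottom row: embed the query, score every chunk, keep the top-k."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": round(s, 4), **chunk} for s, chunk in scored[:top_k]]

# Made-up sample chunks standing in for extracted PDF text.
chunks = [
    {"page": 1, "text": "payment is due on receipt of invoice"},
    {"page": 2, "text": "termination requires written notice"},
    {"page": 3, "text": "the patent license is non-exclusive"},
]
index = build_index(chunks)
results = search("notice of termination", index)
print(results[0]["page"])  # 2
```

The index is built once and reused for every query, which mirrors why the app caches chunk embeddings and only embeds the query at search time.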

Technology Stack

Component	Technology	Purpose
Frontend UI	Streamlit	Interactive web interface for upload and search
PDF Parsing	PyMuPDF (fitz)	Fast, accurate text extraction from PDF documents
Embeddings	OpenAI API (text-embedding-ada-002)	Convert text to high-dimensional vectors
Vector Search	NumPy + Cosine Similarity	Local similarity computation, no external DB needed
Text Processing	tiktoken	Token counting for chunk boundary management
Language	Python 3.9+	Core application logic

How It Works

Step 1: PDF Text Extraction with PyMuPDF

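PyMuPDF (imported as fitz) is a fast PDF library. Its get_text() method returns each page's text in the order it is stored in the file, which usually, though not always, matches the visual reading order. Extracting page by page keeps the page number available for search results.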

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """
    Extract all text from a PDF file, preserving page boundaries.
    Returns a list of dicts with page number and text content.
    """
    document = fitz.open(pdf_path)
    pages = []
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text = page.get_text()
        pages.append({
            "page": page_num + 1,
            "text": text
        })
    document.close()  # Release the file handle once all pages are read
    return pages

Step 2: Text Chunking and OpenAI Embedding Generation

PDF pages are split into overlapping chunks (typically 500 tokens with 100-token overlap) to ensure context is preserved at boundaries. Each chunk is converted to an embedding vector using OpenAI's embedding model.

import openai
import numpy as np
import tiktoken

def chunk_text(text, max_tokens=500, overlap=100, model="text-embedding-ada-002"):
    """
    Split text into overlapping chunks based on token count.
    Uses tiktoken for accurate OpenAI token counting.
    """
    if not 0 <= overlap < max_tokens:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokenizer = tiktoken.encoding_for_model(model)
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end]))
        if end == len(tokens):
            break  # Final chunk reached; avoid a redundant overlap-only chunk
        start += max_tokens - overlap  # Slide window with overlap
    return chunks

def get_embedding(text, api_key, model="text-embedding-ada-002"):
    """
    Generate an embedding vector for a text chunk using OpenAI API.
    Returns a NumPy array of 1536 dimensions (for ada-002).
    """
    client = openai.OpenAI(api_key=api_key)
    response = client.embeddings.create(
        model=model,
        input=text.replace("\n", " ")  # OpenAI recommends replacing newlines
    )
    embedding = response.data[0].embedding
    return np.array(embedding, dtype=np.float32)
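
To make the sliding-window arithmetic concrete, here is a small sketch that computes only the (start, end) token offsets of each chunk. chunk_spans is a hypothetical helper, not a function from the project. It stops once a chunk reaches the end of the text and rejects an overlap that is not smaller than the chunk size.

```python
def chunk_spans(n_tokens, max_tokens=500, overlap=100):
    """Return (start, end) token offsets for overlapping chunks (hypothetical helper)."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk already covers the tail
        start += step
    return spans

# A 1,200-token page with the defaults yields three chunks;
# each consecutive pair shares 100 tokens.
print(chunk_spans(1200))  # [(0, 500), (400, 900), (800, 1200)]
```

With the defaults the window advances 400 tokens per step, so a larger overlap means more chunks and more embedding calls for the same document.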

Step 3: Vector Similarity Search (Cosine Similarity)

Cosine similarity measures the cosine of the angle between two vectors, producing a score between -1 and 1. For OpenAI embeddings, similar texts score close to 1. This implementation uses NumPy for fast local computation without requiring a vector database.

def cosine_similarity(a, b):
    """
    Compute cosine similarity between two vectors.
    Returns a float between -1 (opposite) and 1 (identical).
    """
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot / (norm_a * norm_b)

def search_chunks(query, chunk_embeddings, chunks, api_key, top_k=5):
    """
    Embed the query and find the top-k most similar chunks.
    Each item in chunks is a dict with "page" and "text" keys.
    Returns a list of results with text, page, and similarity score.
    """
    query_embedding = get_embedding(query, api_key)
    scores = []
    for chunk_emb, chunk in zip(chunk_embeddings, chunks):
        score = cosine_similarity(query_embedding, chunk_emb)
        scores.append((score, chunk))

    # Sort by similarity descending and return top-k
    scores.sort(key=lambda x: x[0], reverse=True)
    results = []
    for score, chunk in scores[:top_k]:
        text = chunk["text"]
        # Keep snippets short: first 300 characters plus an ellipsis
        snippet = text[:300] + "..." if len(text) > 300 else text
        results.append({
            "score": round(float(score), 4),
            "page": chunk["page"],
            "text": snippet
        })
    return results
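
A quick worked example may help calibrate what the scores mean. The vectors below are made-up two-dimensional stand-ins (real ada-002 embeddings have 1536 dimensions), and cosine_similarity_plain is a hypothetical plain-Python variant so the example runs without NumPy.

```python
import math

def cosine_similarity_plain(a, b):
    """Cosine similarity for two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [3.0, 4.0]  # made-up vector for illustration
print(cosine_similarity_plain(query, [6.0, 8.0]))    # same direction -> 1.0
print(cosine_similarity_plain(query, [-4.0, 3.0]))   # perpendicular  -> 0.0
print(cosine_similarity_plain(query, [-3.0, -4.0]))  # opposite       -> -1.0
```

Only direction matters: [6, 8] is twice as long as the query vector but still scores 1.0.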

Step 4: Ranked Results with Context Snippets

Results are presented with similarity scores, source page numbers, and contextual text snippets. In the app, each result shows its score and page next to a snippet truncated to 500 characters.
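
The presentation step can be sketched in a few lines. format_result is a hypothetical helper and the two sample results are made up for illustration. The app itself renders results with Streamlit widgets rather than plain strings.

```python
def format_result(rank, result, max_chars=300):
    """Render one search result as a single display line (hypothetical helper)."""
    text = " ".join(result["text"].split())  # collapse whitespace and newlines
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + "..."
    return f"#{rank}  score={result['score']:.3f}  page {result['page']}: {text}"

# Made-up results in the same shape the search functions return.
results = [
    {"score": 0.8731, "page": 12, "text": "Either party may terminate\nwith 30 days notice."},
    {"score": 0.8012, "page": 3, "text": "Definitions and interpretation."},
]
for rank, result in enumerate(results, start=1):
    print(format_result(rank, result))
```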

Complete Code Walkthrough

File: code.py — Main Application

#!/usr/bin/env python3
"""
AI PDF Search - Semantic search across PDF documents using OpenAI embeddings.
Author: John Ian Medilo (j1-medilo06)
GitHub: https://github.com/j1-medilo06/ai-pdfsearch
License: MIT

Usage:
    export OPENAI_API_KEY="sk-..."
    streamlit run code.py
"""

import os
import re
import tempfile
from typing import List, Dict

import fitz           # PyMuPDF for PDF text extraction
import numpy as np    # Vector operations and cosine similarity
import openai         # OpenAI API for embeddings
import streamlit as st
import tiktoken       # Token counting for chunk boundaries


# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
DEFAULT_EMBEDDING_MODEL = "text-embedding-ada-002"
CHUNK_SIZE_TOKENS = 500
CHUNK_OVERLAP_TOKENS = 100
TOP_K_RESULTS = 5


def get_openai_client() -> openai.OpenAI:
    """Initialize OpenAI client from environment or Streamlit secrets."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        try:
            api_key = st.secrets["OPENAI_API_KEY"]
        except Exception:
            # Secrets file or key is missing; fall through to the error below
            api_key = None
    if not api_key:
        st.error("OpenAI API key not found. Set OPENAI_API_KEY environment variable.")
        st.stop()
    return openai.OpenAI(api_key=api_key)


# ---------------------------------------------------------------------------
# PDF Text Extraction
# ---------------------------------------------------------------------------
def extract_pages_from_pdf(pdf_bytes: bytes) -> List[Dict]:
    """
    Extract text from each page of a PDF document.
    
    Args:
        pdf_bytes: Raw PDF file contents as bytes.
        
    Returns:
        List of dicts: [{"page": 1, "text": "..."}, ...]
    """
    pages = []
    # Write the bytes to a temp file and close it before PyMuPDF opens it,
    # so the file can be reopened and deleted on every platform.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        tmp_path = tmp.name
    try:
        doc = fitz.open(tmp_path)
        try:
            for page_num in range(len(doc)):
                page = doc.load_page(page_num)
                text = page.get_text("text")
                # Collapse runs of blank lines
                text = re.sub(r'\n+', '\n', text).strip()
                if text:
                    pages.append({"page": page_num + 1, "text": text})
        finally:
            doc.close()
    finally:
        os.unlink(tmp_path)
    return pages


# ---------------------------------------------------------------------------
# Text Chunking with Token-Aware Boundaries
# ---------------------------------------------------------------------------
def chunk_pages(
    pages: List[Dict],
    max_tokens: int = CHUNK_SIZE_TOKENS,
    overlap: int = CHUNK_OVERLAP_TOKENS,
    model: str = DEFAULT_EMBEDDING_MODEL
) -> List[Dict]:
    """
    Split page text into overlapping chunks based on token count.
    
    Uses tiktoken for accurate token counting matching the embedding model.
    Chunks maintain page attribution for result referencing.
    
    Args:
        pages: List of page dicts from extract_pages_from_pdf.
        max_tokens: Maximum tokens per chunk.
        overlap: Overlapping tokens between consecutive chunks.
        model: OpenAI model name for tokenizer selection.
        
    Returns:
        List of chunk dicts: [{"page": 1, "text": "...", "tokens": 420}, ...]
    """
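    if not 0 <= overlap < max_tokens:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokenizer = tiktoken.encoding_for_model(model)
    chunks = []
    for page in pages:
        tokens = tokenizer.encode(page["text"])
        start = 0
        while start < len(tokens):
            end = min(start + max_tokens, len(tokens))
            chunk_tokens = tokens[start:end]
            chunks.append({
                "page": page["page"],
                "text": tokenizer.decode(chunk_tokens),
                "tokens": len(chunk_tokens)
            })
            if end == len(tokens):
                break  # Final chunk of the page; skip the overlap-only remainder
            start += max_tokens - overlap
    return chunks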


# ---------------------------------------------------------------------------
# OpenAI Embedding Generation
# ---------------------------------------------------------------------------
@st.cache_data(show_spinner=False)
def generate_embeddings(
    chunk_texts: tuple,
    api_key: str,
    model: str = DEFAULT_EMBEDDING_MODEL
) -> np.ndarray:
    """
    Generate embedding vectors for all text chunks via OpenAI API.
    
    Cached by Streamlit to avoid re-computing on every query.
    Sends chunks in batches of 100 (the API accepts up to 2048 inputs per request).
    
    Args:
        chunk_texts: Tuple of chunk text strings (tuple for hashability).
        api_key: OpenAI API key.
        model: Embedding model name.
        
    Returns:
        NumPy array of shape (n_chunks, dim) with float32 embeddings, where dim
        is 1536 for ada-002 and 3-small and 3072 for text-embedding-3-large.
    """
    client = openai.OpenAI(api_key=api_key)
    texts = list(chunk_texts)
    embeddings = []
    batch_size = 100  # Well under the API limit of 2048 inputs per request
    
    progress_bar = st.progress(0, text="Generating embeddings...")
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Replace newlines as recommended by OpenAI
        batch = [t.replace("\n", " ") for t in batch]
        response = client.embeddings.create(model=model, input=batch)
        batch_embeddings = [item.embedding for item in response.data]
        embeddings.extend(batch_embeddings)
        progress_bar.progress(
            min((i + batch_size) / len(texts), 1.0),
            text=f"Embedding batch {i // batch_size + 1}/{(len(texts) - 1) // batch_size + 1}..."
        )
    progress_bar.empty()
    return np.array(embeddings, dtype=np.float32)


# ---------------------------------------------------------------------------
# Vector Similarity Search
# ---------------------------------------------------------------------------
def cosine_similarity_batch(query_vec: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between query vector and all embeddings efficiently.
    
    Uses vectorized NumPy operations for performance.
    
    Args:
        query_vec: Shape (dim,) query embedding.
        embeddings: Shape (n_chunks, dim) document embeddings from the same model.
        
    Returns:
        Shape (n_chunks,) array of similarity scores.
    """
    dot_products = np.dot(embeddings, query_vec)
    query_norm = np.linalg.norm(query_vec)
    embedding_norms = np.linalg.norm(embeddings, axis=1)
    return dot_products / (query_norm * embedding_norms)


def semantic_search(
    query: str,
    chunks: List[Dict],
    embeddings: np.ndarray,
    client: openai.OpenAI,
    model: str = DEFAULT_EMBEDDING_MODEL,
    top_k: int = TOP_K_RESULTS
) -> List[Dict]:
    """
    Execute semantic search: embed query, compute similarities, return top-k.
    
    Args:
        query: Natural language query string.
        chunks: List of chunk dicts with page and text.
        embeddings: Pre-computed chunk embeddings array.
        client: OpenAI client instance.
        model: Embedding model name.
        top_k: Number of results to return.
        
    Returns:
        List of result dicts with score, page, and text snippet.
    """
    # Generate query embedding
    response = client.embeddings.create(
        model=model,
        input=query.replace("\n", " ")
    )
    query_embedding = np.array(response.data[0].embedding, dtype=np.float32)
    
    # Compute similarities
    similarities = cosine_similarity_batch(query_embedding, embeddings)
    
    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        chunk = chunks[idx]
        results.append({
            "score": round(float(similarities[idx]), 4),
            "page": chunk["page"],
            "text": chunk["text"]
        })
    return results


# ---------------------------------------------------------------------------
# Streamlit UI
# ---------------------------------------------------------------------------
def main():
    """Main Streamlit application entry point."""
    st.set_page_config(
        page_title="AI PDF Search",
        page_icon="🔍",
        layout="wide"
    )
    
    st.title("🔍 AI PDF Search")
    st.markdown(
        "Upload a PDF and ask questions in natural language. "
        "Powered by OpenAI embeddings and cosine similarity."
    )
    
    # Sidebar configuration
    with st.sidebar:
        st.header("⚙️ Configuration")
        api_key = st.text_input(
            "OpenAI API Key (optional)",
            type="password",
            help="Leave blank to use OPENAI_API_KEY env var"
        )
        if api_key:
            os.environ["OPENAI_API_KEY"] = api_key
        
        model = st.selectbox(
            "Embedding Model",
            ["text-embedding-ada-002", "text-embedding-3-small", "text-embedding-3-large"],
            index=0
        )
        
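        chunk_size = st.slider("Chunk Size (tokens)", 100, 1000, 500, 50)
        overlap = st.slider("Chunk Overlap (tokens)", 0, 200, 100, 10)
        if overlap >= chunk_size:
            # A window that never advances would make chunking loop forever
            st.error("Chunk overlap must be smaller than chunk size.")
            st.stop()
        top_k = st.slider("Results to Show", 1, 10, 5)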
        
        st.markdown("---")
        st.markdown(
            "🔗 [GitHub Repo](https://github.com/j1-medilo06/ai-pdfsearch)"
        )
    
    # Initialize OpenAI client
    try:
        client = get_openai_client()
    except Exception as e:
        st.error(f"Failed to initialize OpenAI client: {e}")
        return
    
    # PDF Upload
    uploaded_file = st.file_uploader(
        "📄 Upload PDF Document",
        type=["pdf"],
        help="Upload a PDF to index for semantic search"
    )
    
    if uploaded_file is not None:
        # Process PDF
        with st.spinner("Extracting text from PDF..."):
            pdf_bytes = uploaded_file.read()
            pages = extract_pages_from_pdf(pdf_bytes)
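        if not pages:
            st.warning("No extractable text found. Scanned PDFs need OCR before they can be searched.")
            return
        st.success(f"Extracted {len(pages)} pages with text content")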
        
        # Chunk text
        with st.spinner("Chunking text..."):
            chunks = chunk_pages(pages, chunk_size, overlap, model)
        st.success(f"Created {len(chunks)} text chunks")
        
        # Generate embeddings
        chunk_texts = tuple(c["text"] for c in chunks)
        api_key_for_cache = os.environ.get("OPENAI_API_KEY", "")
        try:
            embeddings = generate_embeddings(chunk_texts, api_key_for_cache, model)
        except Exception as e:
            st.error(f"Embedding generation failed: {e}")
            return
        st.success(f"Generated {len(embeddings)} embedding vectors")
        
        # Search interface
        st.markdown("---")
        st.subheader("🔎 Ask a Question")
        
        query = st.text_input(
            "Your question",
            placeholder="e.g., What are the termination clauses in this contract?"
        )
        
        if query:
            with st.spinner("Searching..."):
                results = semantic_search(query, chunks, embeddings, client, model, top_k)
            
            st.subheader(f"📋 Top {len(results)} Results")
            for i, result in enumerate(results, 1):
                with st.container():
                    col1, col2 = st.columns([1, 4])
                    with col1:
                        st.metric(
                            label=f"Result #{i}",
                            value=f"{result['score']:.3f}",
                            delta=f"Page {result['page']}"
                        )
                    with col2:
                        # Truncate long chunks to a 500-character snippet
                        snippet = result["text"]
                        if len(snippet) > 500:
                            snippet = snippet[:500] + "..."
                        st.markdown(f"> {snippet}")
                    st.markdown("---")


if __name__ == "__main__":
    main()
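
The batching loop in generate_embeddings can be exercised without calling the API. In the sketch below, batched is a hypothetical helper that reproduces only the slicing logic, and the input strings are placeholders.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"chunk {n}" for n in range(250)]  # placeholder chunk texts
sizes = [len(batch) for batch in batched(texts)]
print(sizes)  # [100, 100, 50]

# Same ceiling division used for the progress-bar label in code.py
total_batches = (len(texts) - 1) // 100 + 1
print(total_batches)  # 3
```

A document with 250 chunks therefore needs three embedding requests instead of 250 single-chunk calls.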

File: install.sh — Installation Script

#!/usr/bin/env bash
# =============================================================================
# AI PDF Search — Installation Script
# Author: John Ian Medilo (j1-medilo06)
# GitHub: https://github.com/j1-medilo06/ai-pdfsearch
# =============================================================================

set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo -e "${BLUE}=== AI PDF Search Installer ===${NC}"
echo ""

# Check Python version
if ! python3 -c "import sys; exit(0 if sys.version_info >= (3, 9) else 1)" 2>/dev/null; then
    echo -e "${RED}Error: Python 3.9 or higher is required.${NC}"
    echo "Current version: $(python3 --version 2>/dev/null || echo 'not found')"
    exit 1
fi

echo -e "${GREEN}✓ $(python3 --version) detected${NC}"

# Create virtual environment
VENV_DIR="venv"
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv "$VENV_DIR"
fi
echo -e "${GREEN}✓ Virtual environment ready${NC}"

# Activate and install dependencies
echo "Installing dependencies..."
source "$VENV_DIR/bin/activate"
pip install --upgrade pip setuptools wheel
pip install streamlit PyMuPDF openai numpy tiktoken

echo -e "${GREEN}✓ Dependencies installed${NC}"

# Check for OpenAI API key
if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo ""
    echo -e "${RED}⚠ Warning: OPENAI_API_KEY environment variable not set.${NC}"
    echo "Set it before running the app:"
    echo "  export OPENAI_API_KEY='sk-your-key-here'"
fi

# Create .env template
if [ ! -f ".env" ]; then
    cat > .env << 'EOF'
# AI PDF Search Configuration
# Copy your OpenAI API key here or set as environment variable
OPENAI_API_KEY=sk-your-key-here
EOF
    echo -e "${GREEN}✓ Created .env template${NC}"
fi

echo ""
echo -e "${GREEN}=== Installation Complete ===${NC}"
echo ""
echo "To start the application:"
echo "  source venv/bin/activate"
echo "  export OPENAI_API_KEY='sk-your-key-here'"
echo "  streamlit run code.py"
echo ""
echo "Or use the .env file:"
echo '  source venv/bin/activate && export $(grep -v "^#" .env | xargs) && streamlit run code.py'

Installation Instructions

Clone the repository

git clone https://github.com/j1-medilo06/ai-pdfsearch.git
cd ai-pdfsearch

Run the installation script

```
chmod +x install.sh
./install.sh
```

Set your OpenAI API key

```
export OPENAI_API_KEY="sk-your-openai-api-key"
```

Security Note: Never commit API keys to version control. Use environment variables, .env files (added to .gitignore), or a secrets manager. The install script creates a .env template for local development.

Launch the application

```
source venv/bin/activate
streamlit run code.py
```

The app will open in your browser at http://localhost:8501.

Upload a PDF and search
1. Click "Browse files" to upload any PDF document.
2. Wait for text extraction and embedding generation (progress bar shown).
3. Type a natural language question in the search box.
4. Review ranked results with similarity scores and page references.
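
As an alternative to exporting the .env contents in the shell, the file can be read from Python before the app starts. This is a minimal sketch and not part of the project: load_env_file is a hypothetical helper that handles only simple KEY=VALUE lines, skips blank lines and # comments, and does not override variables that are already set. EXAMPLE_SETTING is a made-up key used for the demonstration.

```python
import os
import tempfile

def load_env_file(path=".env"):
    """Load simple KEY=VALUE pairs from a file into os.environ (hypothetical helper)."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")  # drop optional surrounding quotes
            loaded[key] = value
            os.environ.setdefault(key, value)  # variables already set win
    return loaded

# Demonstration against a throwaway file rather than a real .env
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# AI PDF Search Configuration\nEXAMPLE_SETTING=demo-value\n")
    print(load_env_file(env_path))  # {'EXAMPLE_SETTING': 'demo-value'}
```

For anything beyond simple pairs (multi-line values, variable expansion), a dedicated library such as python-dotenv is likely a better fit.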

Configuration

Parameter	Default	Description
`OPENAI_API_KEY`	Required	OpenAI API key for embedding generation
`CHUNK_SIZE_TOKENS`	500	Tokens per chunk (100–1000 recommended)
`CHUNK_OVERLAP_TOKENS`	100	Overlapping tokens between chunks
`TOP_K_RESULTS`	5	Number of results to display
`DEFAULT_EMBEDDING_MODEL`	text-embedding-ada-002	OpenAI embedding model
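
One way to keep these parameters together and validate them is a small dataclass. This is a sketch only: SearchConfig and its field names are hypothetical, and code.py defines these values as module-level constants rather than a class.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container mirroring the defaults in the table above."""
    chunk_size_tokens: int = 500
    chunk_overlap_tokens: int = 100
    top_k_results: int = 5
    embedding_model: str = "text-embedding-ada-002"

    def __post_init__(self):
        if self.chunk_size_tokens < 1:
            raise ValueError("chunk_size_tokens must be positive")
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")
        if self.top_k_results < 1:
            raise ValueError("top_k_results must be at least 1")

    @property
    def stride(self):
        """Tokens the chunk window advances between consecutive chunks."""
        return self.chunk_size_tokens - self.chunk_overlap_tokens

config = SearchConfig()
print(config.stride)  # 400
```

Validating at construction time surfaces an overlap that is not smaller than the chunk size before any chunking or API call happens.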

Usage Examples

Below are example queries demonstrating the semantic search capabilities:

Document Type	Example Query	Expected Result
Employment Contract	"What happens if I resign without notice?"	Termination clause section with notice period details
Technical Whitepaper	"How does the consensus algorithm handle byzantine faults?"	Relevant section on consensus mechanism
Legal Agreement	"Intellectual property ownership terms"	IP assignment and licensing clauses
Research Paper	"Summary of experimental methodology"	Methods section with procedure details

Try It Yourself

Copy and run the following commands to get started in under 2 minutes:

# Clone and setup
git clone https://github.com/j1-medilo06/ai-pdfsearch.git
cd ai-pdfsearch && chmod +x install.sh && ./install.sh

# Set API key and launch
export OPENAI_API_KEY="sk-your-key-here"
source venv/bin/activate && streamlit run code.py

Future Enhancements

Roadmap — Contributions Welcome!

Multi-document search: Index multiple PDFs simultaneously with document-level filtering
Local LLM support: Integration with Ollama or llama.cpp for fully offline operation
Persistent vector storage: ChromaDB or Weaviate integration for large document collections
Hybrid search: Combine keyword (BM25) and semantic search for optimal relevance
Source highlighting: Visual highlight of matching regions on original PDF pages
Chat interface: Conversational follow-up questions with context memory

References & Links

Resource	Link
GitHub Repository	github.com/j1-medilo06/ai-pdfsearch
OpenAI Embeddings API	platform.openai.com/docs/guides/embeddings
Streamlit Documentation	docs.streamlit.io
PyMuPDF Documentation	pymupdf.readthedocs.io
Author GitHub	github.com/j1-medilo06