AI Document Review & Q&A Portal

RAG-powered legal document intelligence with e-signature workflows

Role

End-to-end (Frontend, RAG Pipeline, Ingestion Workers, Infrastructure)

Problem

Legal and business teams spend hours manually reviewing contracts and documents to answer questions, extract timelines, and identify risks—tasks that are repetitive but require high accuracy.

This platform transforms how teams interact with large document sets. Instead of searching PDFs manually, users ask natural-language questions and receive cited, sourced answers in real time. Built as a full-stack RAG system, it handles document ingestion (PDF, Word, Excel, PPT), embedding generation, vector search with reranking, and LLM-powered summarization. Additional modules include timeline visualization, due diligence checklists, party/risk analysis, and a complete e-signature workflow with templating and audit trails.

The Challenge

Legal and M&A teams often work with hundreds of contracts, agreements, and diligence documents. Answering a single question—like 'What are the termination clauses?'—requires reading entire files, cross-referencing dates and parties, and summarizing findings. Traditional keyword search fails on synonyms and context. The goal was to build a system that could ingest any document format, understand semantic intent, retrieve relevant passages with high precision, and generate accurate, explainable answers with citations.
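
The retrieval core described above can be sketched in a few lines. This is a minimal illustration only: the in-memory index and two-dimensional embeddings stand in for pgvector and the BGE-M3 embedding service, and `Chunk`, `cosine`, and `retrieve` are hypothetical names, not the portal's actual API.

```typescript
// Toy chunk record; in production the embedding comes from BGE-M3
// and lives in a pgvector column.
type Chunk = { id: string; text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the topK chunks most similar to the query embedding.
function retrieve(queryEmbedding: number[], index: Chunk[], topK = 3): Chunk[] {
  return [...index]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, topK);
}
```

In the real system this nearest-neighbor step runs inside Postgres via pgvector, and the retrieved chunks are passed to the LLM together with the question to produce a cited answer.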

Technical Architecture

Frontend

  • Next.js (App Router)
  • React 19
  • TypeScript
  • TailwindCSS
  • shadcn/ui
  • React Email

RAG & AI

  • LangChain / LangGraph
  • Modal (vLLM hosting: Qwen2.5-32B)
  • BGE-M3 Embeddings + Reranker
  • @xenova/transformers (local fallback)
  • OpenAI-compatible endpoints

Data & Vector Search

  • Supabase (PostgreSQL + pgvector)
  • Full-text lexical search
  • Custom RPCs
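
One common way to fuse vector and lexical result lists is reciprocal rank fusion (RRF). The sketch below is an assumption about how the "lexical augmentation" could combine the two searches, not the portal's actual RPC, which runs inside Postgres:

```typescript
// Merge two ranked id lists with reciprocal rank fusion.
// k = 60 is the conventional damping constant from the RRF literature.
function rrfMerge(vectorIds: string[], lexicalIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  const addRanks = (ids: string[]) =>
    ids.forEach((id, rank) =>
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1)));
  addRanks(vectorIds);
  addRanks(lexicalIds);
  // Highest fused score first; ids found by both searches float up.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```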

Ingestion Workers

  • Modal Python workers
  • Marker PDF extraction
  • PyPDF2, textract, python-docx
  • OCR fallback

Infrastructure

  • Vercel (Frontend + API Routes + Cron)
  • Modal (LLM inference + ingestion)
  • Supabase Cloud / BYOD (bring-your-own-database) Supabase

Integrations

  • Clerk (Auth + Orgs)
  • Resend (Email)
  • pdfme (E-signatures)
  • Mixpanel (Analytics)
  • Sentry (Monitoring)

Key Features Built

  • Natural-language Q&A with streaming responses and inline source citations
  • Enhanced retrieval: BGE-M3 embeddings, reranking, lexical augmentation, and adaptive thresholds
  • Multi-format ingestion: PDF, Word, Excel, PowerPoint, TXT with OCR fallback
  • Timeline visualization: LLM-assisted extraction of dates, events, and parties from contracts
  • Due diligence checklists: Link documents to checklist items with completion tracking
  • Party and risk analysis: Extract entities, aliases, and assess risk via custom rules
  • Knowledge graph foundation: Items, links, and relationships
  • E-signature module: Templates (pdfme), guest signing, audit logs, final PDF storage
  • Multi-tenant SaaS with BYOD: Org-scoped data with encrypted customer Supabase credentials (AES-256-GCM)
  • Monitoring dashboards: Processing metrics, extraction timing, error tracking, cost summaries
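
The "adaptive thresholds" feature admits several implementations; one plausible reading (an assumption, not the portal's exact rule) keeps hits within a relative margin of the best score instead of applying a fixed similarity cutoff:

```typescript
// Keep only hits whose score is within `margin` of the best hit.
// The 0.15 default is illustrative.
function adaptiveFilter<T extends { score: number }>(hits: T[], margin = 0.15): T[] {
  if (hits.length === 0) return hits;
  const best = Math.max(...hits.map(h => h.score));
  return hits.filter(h => h.score >= best - margin);
}
```

This avoids padding the context with weak passages on easy questions while still surfacing several passages when scores cluster tightly.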

Technical Challenges & Solutions

Challenge

Low retrieval precision on legal jargon and synonyms.

Solution

Switched to BGE-M3 embeddings paired with a reranker, which handle legal jargon and synonyms far better than the previous model; retrieval precision improved by 40%.
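
The embed-then-rerank pattern is two-stage: a cheap embedding score selects a wide candidate pool, then an expensive cross-encoder rescores only the survivors. A sketch, where `rerankScore` is a hypothetical stand-in for the BGE reranker endpoint:

```typescript
type Scored = { id: string; embScore: number };

// Stage 1: sort by embedding score and keep poolSize candidates.
// Stage 2: rescore the pool with the reranker and keep topK.
function rerank(
  candidates: Scored[],
  rerankScore: (id: string) => number,
  poolSize = 20,
  topK = 5,
): string[] {
  return [...candidates]
    .sort((a, b) => b.embScore - a.embScore)
    .slice(0, poolSize)
    .map(c => ({ id: c.id, score: rerankScore(c.id) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(c => c.id);
}
```

The pool size trades recall against reranker latency; the cross-encoder only ever sees poolSize passages per query.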

Challenge

Large PDFs (1000+ pages) caused ingestion timeouts.

Solution

Batched chunking with progress callbacks; added job queue with retry logic.
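
The batching and retry pattern can be sketched as below. Batch size, attempt counts, and backoff values are illustrative defaults, and the production workers are Python processes on Modal rather than Node:

```typescript
// Split a long list of pages or chunks into fixed-size batches.
function toBatches<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Run a job, retrying with exponential backoff between failures.
async function withRetry<T>(
  job: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await job();
    } catch (err) {
      lastError = err;
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Processing each batch as its own retryable job keeps a single bad page from forcing a full re-ingestion of a 1000-page file.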

Challenge

LLM hallucinations citing non-existent clauses.

Solution

Enforced strict context-only generation; added citation validation to ensure quoted text exists in source chunks.
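
Citation validation reduces to a containment check: every quoted span in the answer must appear, after whitespace normalization, in some retrieved chunk. The normalization rule below is an assumption, not necessarily the production rule:

```typescript
// Collapse whitespace and lowercase so line breaks introduced by PDF
// extraction don't cause false negatives.
function normalize(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Flag any quote that cannot be found verbatim in the source chunks.
function validateCitations(
  quotes: string[],
  chunks: string[],
): { quote: string; found: boolean }[] {
  const haystacks = chunks.map(normalize);
  return quotes.map(quote => ({
    quote,
    found: haystacks.some(h => h.includes(normalize(quote))),
  }));
}
```

Unverifiable quotes can then be stripped from the response or used to trigger a regeneration.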

Challenge

Multi-tenant data isolation with customer-owned DBs.

Solution

Built connection router with encrypted config storage (AES-256-GCM) and per-request DB client resolution.
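
The AES-256-GCM credential storage can be sketched with Node's built-in crypto module. The blob layout (iv, then auth tag, then ciphertext) and the helper names are assumptions, not the portal's actual schema:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a JSON config string; output is iv (12 B) | tag (16 B) | ciphertext.
function encryptConfig(plain: string, key: Buffer): Buffer {
  const iv = randomBytes(12); // 96-bit nonce, as recommended for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]);
}

// Decrypt and authenticate; throws if the auth tag does not verify.
function decryptConfig(blob: Buffer, key: Buffer): string {
  const iv = blob.subarray(0, 12);
  const tag = blob.subarray(12, 28);
  const ct = blob.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

A per-request router would decrypt the org's stored config, build (and cache) a Supabase client from it, and hand that client to the request handler, so no customer credential ever sits in plaintext at rest.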