An open-source pipeline to scrape, download, extract, and parse every document from the DOJ's Epstein Files collection — enabling full-text search, entity extraction, and public accountability.
A complete, automated pipeline for accessing the DOJ's publicly released Epstein files — from initial web scrape to structured, searchable data.
Automatically scrapes all 12 DOJ data set pages for PDF download links. Handles age verification cookies and rate limiting.
Multi-worker async downloads with retry logic and resume support. Skips already-downloaded files automatically.
Converts all PDFs to searchable text using pdfplumber. Handles scanned documents and variable OCR quality.
Extracts people, organizations, locations, dates, case numbers, financial figures, and more from every document.
Searches across all extracted documents instantly. Finds references to specific names, locations, dates, or phrases.
Outputs to JSON and CSV formats. Entity summaries with frequency counts, ready for analysis in any tool.
Four-stage automated pipeline transforms raw DOJ web pages into structured, searchable data.
Scrape: scrapes all 12 data set pages for PDF links. Output: document_index.json, document_index.csv
Download: parallel downloads with retry & resume. Output: dataset_1/ … dataset_12/
Extract: PDF → text using pdfplumber. Output: EFTA00000001.txt ... 14,914 files
Parse: regex + keyword entity extraction. Output: parsed_documents.json, entity_summary.json
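As an illustration of the scrape stage, here is a minimal sketch of pulling PDF links out of a data set page. It uses only the Python standard library so it runs on its own; the project itself lists requests and beautifulsoup4 as dependencies, so its real implementation likely differs. The names `PdfLinkParser` and `extract_pdf_links`, the sample markup, and the example.org URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return de-duplicated PDF URLs in the order they appear on the page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


# Hypothetical page fragment; the real DOJ markup may differ.
sample = """
<a href="/files/EFTA00000001.pdf">EFTA00000001</a>
<a href="/files/EFTA00000001.pdf">duplicate link</a>
<a href="/about">About</a>
"""
links = extract_pdf_links(sample, "https://example.org/data-set-1")
print(links)
```

Resolving every link against the page URL with `urljoin` keeps relative and absolute hrefs in one consistent form before they are written to an index.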
epstein_data/
├── index/
│   ├── document_index.json     # Master file index with URLs, status
│   └── document_index.csv      # Same in CSV format
├── pdfs/
│   ├── dataset_1/              # EFTA00000001.pdf - EFTA00003150.pdf
│   ├── dataset_2/              # ...
│   └── dataset_12/
├── text/
│   ├── EFTA00000001.txt        # Extracted text per document
│   ├── EFTA00000002.txt
│   └── ...
└── parsed/
    ├── parsed_documents.json   # Structured entity data
    ├── parsed_documents.csv    # Flattened for spreadsheets
    └── entity_summary.json     # Frequency counts
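To show what the frequency counts in entity_summary.json might look like, here is a small sketch that aggregates entities across parsed documents. The record layout (an `entities` mapping of type to list of values) and the function name `summarize_entities` are assumptions made for illustration; the actual schema of parsed_documents.json is not documented here.

```python
import json
from collections import Counter


def summarize_entities(parsed_docs):
    """Count how often each entity value appears, grouped by entity type."""
    counts = {}
    for doc in parsed_docs:
        for entity_type, values in doc.get("entities", {}).items():
            counts.setdefault(entity_type, Counter()).update(values)
    # most_common() orders values from most to least frequent
    return {etype: dict(c.most_common()) for etype, c in counts.items()}


# Hypothetical records; the field names are assumptions.
docs = [
    {"id": "EFTA00000001", "entities": {"locations": ["Palm Beach", "New York"]}},
    {"id": "EFTA00000002", "entities": {"locations": ["Palm Beach"]}},
]
summary = summarize_entities(docs)
print(json.dumps(summary, indent=2))
```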
The DOJ released documents in 12 batches spanning December 2025 through January 2026, totaling over 3.5 million pages.
| Data Set | EFTA Range | Released | Notes |
|---|---|---|---|
| 1 | 00000001 – 00003150 | Dec 19, 2025 | Initial EFTA release |
| 2 | 00003151 – 00003785 | Dec 19, 2025 | |
| 3 | 00003786 – 00005380 | Dec 19, 2025 | |
| 4 | 00005381 – 00005855 | Dec 19, 2025 | |
| 5 | 00005856 – 00007430 | Dec 19, 2025 | |
| 6 | 00007431 – 00007443 | Dec 19, 2025 | ~13 files |
| 7 | 00007444 – 00009675 | Dec 19, 2025 | |
| 8 | 00009676 – 00039023 | Dec 20, 2025 | Largest set; includes videos, Excel |
| 9–12 | 00039024+ | Dec 22 – Jan 30 | Later releases, 3.5M pages total |
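Because the EFTA ranges in the table are contiguous, the data set for a given document number can be found with a binary search over the range starts. The sketch below is one way to do that; `dataset_for` is a hypothetical helper, and since the table does not give individual boundaries for data sets 9 through 12, anything past 00039023 is reported only as "9-12".

```python
from bisect import bisect_right

# First EFTA number of data sets 1-8, taken from the table.
RANGE_STARTS = [1, 3151, 3786, 5381, 5856, 7431, 7444, 9676]
LAST_KNOWN = 39023  # last EFTA number in data set 8


def dataset_for(efta_number):
    """Return the data set (1-8) holding an EFTA number, or '9-12' beyond set 8."""
    if efta_number < 1:
        raise ValueError("EFTA numbers start at 1")
    if efta_number > LAST_KNOWN:
        return "9-12"  # boundaries between sets 9-12 are not listed
    # bisect_right counts how many range starts are <= efta_number
    return bisect_right(RANGE_STARTS, efta_number)


for number in (1, 3150, 3151, 7443, 39023, 39024):
    print(f"EFTA{number:08d} -> data set {dataset_for(number)}")
```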
Each document is analyzed for a comprehensive set of structured fields, enabling deep analysis across the entire collection.
Named individuals (Epstein, Maxwell, etc.)
FBI, DOJ, courts, financial institutions
New York, Palm Beach, Virgin Islands
All date formats found in text
EFTA reference numbers
Email addresses discovered in text
US phone numbers
Court case references
Financial figures and transactions
Court filing, letter, transcript, FBI 302, flight log
Whether document contains redacted content
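Several of the fields above lend themselves to regular expressions. The following is a rough sketch of that approach for EFTA numbers, email addresses, US phone numbers, and dollar amounts. The patterns and the `extract_entities` function are illustrative assumptions, not the project's actual rules, and they will miss or mis-match text that OCR has garbled.

```python
import re

# Illustrative patterns only; real documents will need looser matching.
PATTERNS = {
    "efta_numbers": re.compile(r"\bEFTA\d{8}\b"),
    "emails": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phones": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return unique matches per field, in order of first appearance."""
    return {
        field: list(dict.fromkeys(pattern.findall(text)))
        for field, pattern in PATTERNS.items()
    }


# Made-up sample text for demonstration.
sample = (
    "Ref EFTA00000001: wire of $1,250,000.00 on file. "
    "Contact jdoe@example.com or (561) 555-0100 for details."
)
entities = extract_entities(sample)
for field, values in entities.items():
    print(field, values)
```

All groups in the patterns are non-capturing, so `findall` returns the full matched strings rather than fragments.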
Search, filter, and browse parsed EFTA documents. Entity extraction runs on extracted text from the pipeline.
Get up and running in minutes. Install dependencies and run the full pipeline with a single command.
pip install requests beautifulsoup4 pdfplumber aiohttp
# Run individual stages
python epstein_scraper.py scrape
python epstein_scraper.py download --workers 8
python epstein_scraper.py extract
python epstein_scraper.py parse
# Or run the full pipeline:
python epstein_scraper.py all
# Search across all extracted text
python epstein_scraper.py search "flight log"
python epstein_scraper.py search "Palm Beach"
# View collection stats
python epstein_scraper.py stats
# Download ZIP files from DOJ:
# DataSet 1.zip … DataSet 12.zip
python epstein_scraper.py zip
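For a sense of what a search over the extracted text involves, here is a minimal sketch that scans a directory of .txt files for a phrase and returns short snippets. It is not the code behind the `search` command; `search_texts`, its `context` parameter, and the sample files are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def search_texts(text_dir, phrase, context=40):
    """Return (filename, snippet) for each case-insensitive match of phrase."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            snippet = text[start:match.end() + context]
            # Collapse line breaks and runs of spaces for display
            hits.append((path.name, " ".join(snippet.split())))
    return hits


# Demo on throwaway files so the sketch runs without the real collection.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "EFTA00000001.txt").write_text(
        "Page 3 of the Flight Log lists departures.", encoding="utf-8"
    )
    Path(tmp, "EFTA00000002.txt").write_text(
        "No relevant content.", encoding="utf-8"
    )
    results = search_texts(tmp, "flight log")

for name, snippet in results:
    print(f"{name}: {snippet}")
```

A linear scan like this is fine for spot checks; at the scale of the full collection an index would be a more practical choice.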
The DOJ site requires age verification (18+). The scraper automatically sets the justiceGovAgeVerified=true cookie.
The complete collection is 50+ GB of PDFs. Ensure you have sufficient disk space.
Many documents are scanned images with variable OCR quality. Results may vary.
Heavy redactions throughout, especially victim information.
The scraper includes delays to be respectful to DOJ servers.
Downloads skip already-existing files automatically.
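The notes above (age-verification cookie, polite delays, skipping existing files) can be combined into one download helper. This sketch is synchronous and uses the standard library's urllib, whereas the project describes multi-worker async downloads with aiohttp, so treat it as a simplified stand-in. The `fetch` and `download` functions, the `.part` temporary suffix, and the retry and delay values are assumptions.

```python
import tempfile
import time
import urllib.request
from pathlib import Path

COOKIE = "justiceGovAgeVerified=true"  # cookie named in the notes above


def fetch(url, timeout=30):
    """Fetch a URL's bytes, sending the age-verification cookie."""
    req = urllib.request.Request(url, headers={"Cookie": COOKIE})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download(url, dest, fetcher=fetch, retries=3, delay=1.0):
    """Save url to dest. Returns False if dest already existed, else True."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False  # resume support: skip files we already have
    for attempt in range(1, retries + 1):
        try:
            data = fetcher(url)
            break
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # wait longer after each failure
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    tmp.write_bytes(data)
    tmp.replace(dest)  # only completed downloads get the final name
    time.sleep(delay)  # polite pause between requests
    return True


# Demo with a fake fetcher that fails once, so no network access is needed.
calls = []

def flaky_fetcher(url):
    calls.append(url)
    if len(calls) == 1:
        raise OSError("temporary failure")
    return b"%PDF-1.4 demo"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dataset_1" / "EFTA00000001.pdf"
    url = "https://example.org/EFTA00000001.pdf"  # placeholder URL
    first = download(url, target, fetcher=flaky_fetcher, delay=0)
    saved = target.read_bytes()
    second = download(url, target, fetcher=flaky_fetcher, delay=0)

print(first, second, len(calls))
```

Writing to a temporary name and renaming at the end means an interrupted run never leaves a truncated file that a later run would mistake for a finished download.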
These documents are publicly available. Here are the primary sources and community projects.
Official Department of Justice collection
justice.gov/epstein →
Searchable index of all documents
journaliststudio.google.com →
Community-organized document index
epstein-docs.github.io →
PDF Association case study
pdfa.org →
Tommy Carstensen's detailed file indexes
tommycarstensen.com →
Analyze documents with Gemini AI, search the web in real time with Grok, and monitor X/Twitter conversations, all powered by your API keys.
PDF uploads: max 50 MB, up to 1,000 pages.
Powered by Grok's web_search tool. Search for the latest news, analysis, and discussions about the EFTA documents.
Powered by Grok's x_search tool. Find real-time posts, threads, and discussions about the Epstein files on X.