14,914+ Documents • 12 Data Sets • 50+ GB

Epstein Files
Transparency Act

An open-source pipeline to scrape, download, extract, and parse every document from the DOJ's Epstein Files collection — enabling full-text search, entity extraction, and public accountability.

epstein_scraper.py
$ python epstein_scraper.py all
[scrape] Scanning 12 data set pages...
Found 14,914 document URLs
[download] Downloading PDFs (8 workers)...
Downloaded 14,914 files (52.3 GB)
[extract] Extracting text from PDFs...
Extracted 3.5M pages of text
[parse] Parsing entities across all documents...
Pipeline complete — 14,914 documents processed
$
14,914 Documents
12 Data Sets
3.5M Pages
50+ GB Total Size

What Is This?

A complete, automated pipeline for accessing the DOJ's publicly released Epstein files — from initial web scrape to structured, searchable data.

Web Scraping

Automatically scrapes all 12 DOJ data set pages for PDF download links, handling the age-verification cookie and rate limiting (see the sketch after these cards).

Parallel Downloads

Multi-worker async downloads with retry logic and resume support. Skips already-downloaded files automatically.

Text Extraction

Converts all PDFs to searchable text using pdfplumber. Handles scanned documents, where output depends on the underlying OCR quality.

Entity Parsing

Extracts people, organizations, locations, dates, case numbers, financial figures, and more from every document.

Structured Export

Outputs JSON and CSV, including entity summaries with frequency counts, ready for analysis in any tool.
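
To make the Web Scraping card concrete, here is a minimal sketch of the approach, assuming requests and BeautifulSoup; the link selector and cookie domain are assumptions, not the scraper's exact code.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
# The DOJ site gates these pages behind an 18+ prompt; this cookie skips it.
session.cookies.set("justiceGovAgeVerified", "true", domain="www.justice.gov")

def scrape_dataset_page(url: str) -> list[str]:
    """Return absolute URLs for every PDF linked from one data-set page."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")
    ]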

Processing Pipeline

A four-stage automated pipeline transforms raw DOJ web pages into structured, searchable data.

01

DOJScraper

Scrapes all 12 data set pages for PDF links

document_index.json document_index.csv
02

Downloader

Parallel downloads with retry & resume (sketched after this list)

dataset_1/ … dataset_12/
03

TextExtractor

PDF → text using pdfplumber

EFTA00000001.txt ... 14,914 files
04

DocumentParser

Regex + keyword entity extraction

parsed_documents.json entity_summary.json
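
As a rough illustration of stage 02, an aiohttp worker pool with skip-existing resume and bounded retries might look like the sketch below; the function names and retry policy are illustrative, not the pipeline's exact logic.

import asyncio
from pathlib import Path

import aiohttp

async def worker(queue: asyncio.Queue, session: aiohttp.ClientSession, out_dir: Path) -> None:
    while True:
        url = await queue.get()
        dest = out_dir / url.rsplit("/", 1)[-1]
        try:
            if dest.exists():          # resume support: skip finished files
                continue
            for attempt in range(3):   # simple bounded retry
                try:
                    async with session.get(url) as resp:
                        resp.raise_for_status()
                        dest.write_bytes(await resp.read())
                    break
                except aiohttp.ClientError:
                    await asyncio.sleep(2 ** attempt)
        finally:
            queue.task_done()

async def download_all(urls: list[str], out_dir: Path, workers: int = 8) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(queue, session, out_dir))
                 for _ in range(workers)]
        await queue.join()             # wait until every URL is processed
        for task in tasks:
            task.cancel()

A fixed worker count keeps concurrency bounded, matching the --workers 8 flag shown in the demo run above.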

Pipeline Output

epstein_data/
├── index/
│   ├── document_index.json      # Master file index with URLs, status
│   └── document_index.csv       # Same in CSV format
├── pdfs/
│   ├── dataset_1/               # EFTA00000001.pdf - EFTA00003150.pdf
│   ├── dataset_2/               # ...
│   └── dataset_12/
├── text/
│   ├── EFTA00000001.txt         # Extracted text per document
│   ├── EFTA00000002.txt
│   └── ...
└── parsed/
    ├── parsed_documents.json    # Structured entity data
    ├── parsed_documents.csv     # Flattened for spreadsheets
    └── entity_summary.json      # Frequency counts
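
Stage 03 fills the text/ directory shown above. A minimal sketch with pdfplumber, assuming one .txt per PDF named after the document stem:

from pathlib import Path

import pdfplumber

def extract_pdf(pdf_path: Path, text_dir: Path) -> None:
    """Write one EFTA########.txt per PDF into the text/ directory."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None for image-only (unOCRed) pages
        pages = [page.extract_text() or "" for page in pdf.pages]
    (text_dir / f"{pdf_path.stem}.txt").write_text("\n\n".join(pages), encoding="utf-8")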

12 Data Sets

The DOJ released documents in 12 batches spanning December 2025 through January 2026, totaling over 3.5 million pages.

Data Set   EFTA Range              Released                     Notes
1          00000001 — 00003150     Dec 19, 2025                 Initial EFTA release
2          00003151 — 00003785     Dec 19, 2025
3          00003786 — 00005380     Dec 19, 2025
4          00005381 — 00005855     Dec 19, 2025
5          00005856 — 00007430     Dec 19, 2025
6          00007431 — 00007443     Dec 19, 2025                 ~13 files
7          00007444 — 00009675     Dec 19, 2025
8          00009676 — 00039023     Dec 20, 2025                 Largest set; includes videos, Excel
9–12       00039024+               Dec 22, 2025 – Jan 30, 2026  Later releases, 3.5M pages total
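
For illustration, the ranges above translate directly into a lookup helper; sets 9–12 are collapsed into one bucket because only their combined starting number is published.

# EFTA-number ranges copied from the table above.
DATASET_RANGES = [
    (1, 1, 3150), (2, 3151, 3785), (3, 3786, 5380), (4, 5381, 5855),
    (5, 5856, 7430), (6, 7431, 7443), (7, 7444, 9675), (8, 9676, 39023),
]

def dataset_for(efta_number: int) -> str:
    """Map a numeric EFTA id (e.g. 3151 for EFTA00003151) to its data-set folder."""
    for ds, lo, hi in DATASET_RANGES:
        if lo <= efta_number <= hi:
            return f"dataset_{ds}"
    return "dataset_9-12" if efta_number >= 39024 else "unknown"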

Extracted Entities

Each document is analyzed for a comprehensive set of structured fields, enabling deep analysis across the entire collection.

👤

People

Named individuals (Epstein, Maxwell, etc.)

🏛

Organizations

FBI, DOJ, courts, financial institutions

📍

Locations

New York, Palm Beach, Virgin Islands

📅

Dates

All date formats found in text

🔖

Bates Numbers

EFTA reference numbers

📧

Emails

Email addresses discovered in text

📞

Phone Numbers

US phone numbers

⚖️

Case Numbers

Court case references

💰

Dollar Amounts

Financial figures and transactions

📄

Document Type

Court filing, letter, transcript, FBI 302, flight log

█▓

Redaction Status

Whether the document contains redacted content
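
A sketch of the regex pass stage 04 might run for a few of these fields; the patterns are deliberate simplifications, not the parser's actual definitions.

import re
from collections import Counter

PATTERNS = {
    "bates_numbers":  re.compile(r"\bEFTA\d{8}\b"),
    "emails":         re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_numbers":  re.compile(r"\(?\b\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "case_numbers":   re.compile(r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{3,5}\b", re.IGNORECASE),
    "dollar_amounts": re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?"),
}

def parse_entities(text: str) -> dict[str, Counter]:
    """Per-field frequency counts for one document's extracted text."""
    return {field: Counter(rx.findall(text)) for field, rx in PATTERNS.items()}

Per-field counts like these roll up naturally into the frequency summaries in entity_summary.json.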

Data Explorer

Search, filter, and browse parsed EFTA documents. All entity data comes from the text the pipeline extracts.

The explorer shows counts of parsed documents, documents with extracted text, documents with entities found, and total pages, alongside charts of the top people, organizations, and locations mentioned and a breakdown by document type.

Quick Start

Get up and running in minutes. Install dependencies and run the full pipeline with a single command.

1

Install Dependencies

Terminal
pip install requests beautifulsoup4 pdfplumber aiohttp
2

Run the Pipeline

Terminal
# Run individual stages
python epstein_scraper.py scrape
python epstein_scraper.py download --workers 8
python epstein_scraper.py extract
python epstein_scraper.py parse

# Or run the full pipeline:
python epstein_scraper.py all
Alt

Bulk ZIP Download

Terminal
# Download ZIP files from DOJ:
# DataSet 1.zip … DataSet 12.zip
python epstein_scraper.py zip
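
Under the hood, the zip stage could stream each archive to disk roughly like this; ZIP_URL_TEMPLATE is a made-up placeholder, since the real DOJ URLs are not reproduced here.

from pathlib import Path

import requests

ZIP_URL_TEMPLATE = "https://www.example.gov/efta/DataSet%20{n}.zip"  # hypothetical URL

def download_zips(dest_dir: str = "epstein_data/zips") -> None:
    out = Path(dest_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n in range(1, 13):  # DataSet 1.zip through DataSet 12.zip
        with requests.get(ZIP_URL_TEMPLATE.format(n=n), stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(out / f"DataSet {n}.zip", "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    f.write(chunk)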

Usage Notes

🔞

Age Gate

The DOJ site requires age verification (18+). The scraper automatically sets the justiceGovAgeVerified=true cookie.

💾

Total Size

The complete collection is 50+ GB of PDFs. Ensure you have sufficient disk space.

🔍

OCR Quality

Many documents are scanned images with variable OCR quality, so extracted text may be incomplete or noisy.

█▓░

Redactions

Heavy redactions appear throughout, especially for victim information.

⏱

Rate Limiting

The scraper waits between requests to avoid overloading DOJ servers.

🔄

Resume Support

Downloads skip already-existing files automatically.

Community & Sources

These documents are publicly available. Here are the primary sources and community projects.

Intelligence Tools

Analyze documents with Gemini AI, search the web in real-time with Grok, and monitor X/Twitter conversations — all powered by your API keys.

API Keys

A Gemini API key powers document analysis & chat; a Grok API key powers web search & X search. PDF uploads are limited to 50 MB and 1,000 pages.
🔍

Document Intelligence

Upload or link an EFTA document to start analyzing it with Gemini AI. Ask questions, extract entities, summarize content, or transcribe pages.