14,914+ Documents • 12 Data Sets • 50+ GB

Epstein Files
Transparency Act

An open-source pipeline to scrape, download, extract, and parse every document from the DOJ's Epstein Files collection — enabling full-text search, entity extraction, and public accountability.

epstein_scraper.py
$ python epstein_scraper.py all
[scrape] Scanning 12 data set pages...
Found 14,914 document URLs
[download] Downloading PDFs (8 workers)...
Downloaded 14,914 files (52.3 GB)
[extract] Extracting text from PDFs...
Extracted 3.5M pages of text
[parse] Parsing entities across all documents...
Pipeline complete — 14,914 documents processed
$
14,914 Documents
12 Data Sets
3.5M Pages
50+ GB Total Size

What Is This?

A complete, automated pipeline for accessing the DOJ's publicly released Epstein files — from initial web scrape to structured, searchable data.

Web Scraping

Automatically scrapes all 12 DOJ data set pages for PDF download links, handling the age-verification cookie and rate limiting (see the sketch after these cards).

Parallel Downloads

Multi-worker async downloads with retry logic and resume support. Skips already-downloaded files automatically.

Text Extraction

Converts all PDFs to searchable text using pdfplumber. Handles scanned documents, where output depends on the underlying OCR quality.

Entity Parsing

Extracts people, organizations, locations, dates, case numbers, financial figures, and more from every document.

Structured Export

Outputs JSON and CSV, including entity summaries with frequency counts, ready for analysis in any tool.
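
To make the Web Scraping card concrete, here is a minimal sketch of the approach, assuming requests and BeautifulSoup; the link selector and cookie domain are assumptions, not the scraper's exact code.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
# The DOJ site gates these pages behind an 18+ prompt; this cookie skips it.
session.cookies.set("justiceGovAgeVerified", "true", domain="www.justice.gov")

def scrape_dataset_page(url: str) -> list[str]:
    """Return absolute URLs for every PDF linked from one data-set page."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")
    ]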

Processing Pipeline

A four-stage automated pipeline transforms raw DOJ web pages into structured, searchable data.

01

DOJScraper

Scrapes all 12 data set pages for PDF links

document_index.json document_index.csv
02

Downloader

Parallel downloads with retry & resume (sketched after this list)

dataset_1/ … dataset_12/
03

TextExtractor

PDF → text using pdfplumber

EFTA00000001.txt ... 14,914 files
04

DocumentParser

Regex + keyword entity extraction

parsed_documents.json entity_summary.json
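
As a rough illustration of stage 02, an aiohttp worker pool with skip-existing resume and bounded retries might look like the sketch below; the function names and retry policy are illustrative, not the pipeline's exact logic.

import asyncio
from pathlib import Path

import aiohttp

async def worker(queue: asyncio.Queue, session: aiohttp.ClientSession, out_dir: Path) -> None:
    while True:
        url = await queue.get()
        dest = out_dir / url.rsplit("/", 1)[-1]
        try:
            if dest.exists():          # resume support: skip finished files
                continue
            for attempt in range(3):   # simple bounded retry
                try:
                    async with session.get(url) as resp:
                        resp.raise_for_status()
                        dest.write_bytes(await resp.read())
                    break
                except aiohttp.ClientError:
                    await asyncio.sleep(2 ** attempt)
        finally:
            queue.task_done()

async def download_all(urls: list[str], out_dir: Path, workers: int = 8) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(queue, session, out_dir))
                 for _ in range(workers)]
        await queue.join()             # wait until every URL is processed
        for task in tasks:
            task.cancel()

A fixed worker count keeps concurrency bounded, matching the --workers 8 flag shown in the demo run above.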

Pipeline Output

epstein_data/
├── index/
│   ├── document_index.json      # Master file index with URLs, status
│   └── document_index.csv       # Same in CSV format
├── pdfs/
│   ├── dataset_1/               # EFTA00000001.pdf - EFTA00003150.pdf
│   ├── dataset_2/               # ...
│   └── dataset_12/
├── text/
│   ├── EFTA00000001.txt         # Extracted text per document
│   ├── EFTA00000002.txt
│   └── ...
└── parsed/
    ├── parsed_documents.json    # Structured entity data
    ├── parsed_documents.csv     # Flattened for spreadsheets
    └── entity_summary.json      # Frequency counts
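
Stage 03 fills the text/ directory shown above. A minimal sketch with pdfplumber, assuming one .txt per PDF named after the document stem:

from pathlib import Path

import pdfplumber

def extract_pdf(pdf_path: Path, text_dir: Path) -> None:
    """Write one EFTA########.txt per PDF into the text/ directory."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None for image-only (unOCRed) pages
        pages = [page.extract_text() or "" for page in pdf.pages]
    (text_dir / f"{pdf_path.stem}.txt").write_text("\n\n".join(pages), encoding="utf-8")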

12 Data Sets

The DOJ released documents in 12 batches spanning December 2025 through January 2026, totaling over 3.5 million pages.

Data Set   EFTA Range              Released                     Notes
1          00000001 — 00003150     Dec 19, 2025                 Initial EFTA release
2          00003151 — 00003785     Dec 19, 2025
3          00003786 — 00005380     Dec 19, 2025
4          00005381 — 00005855     Dec 19, 2025
5          00005856 — 00007430     Dec 19, 2025
6          00007431 — 00007443     Dec 19, 2025                 ~13 files
7          00007444 — 00009675     Dec 19, 2025
8          00009676 — 00039023     Dec 20, 2025                 Largest set; includes videos, Excel
9–12       00039024+               Dec 22, 2025 – Jan 30, 2026  Later releases, 3.5M pages total
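
For illustration, the ranges above translate directly into a lookup helper; sets 9–12 are collapsed into one bucket because only their combined starting number is published.

# EFTA-number ranges copied from the table above.
DATASET_RANGES = [
    (1, 1, 3150), (2, 3151, 3785), (3, 3786, 5380), (4, 5381, 5855),
    (5, 5856, 7430), (6, 7431, 7443), (7, 7444, 9675), (8, 9676, 39023),
]

def dataset_for(efta_number: int) -> str:
    """Map a numeric EFTA id (e.g. 3151 for EFTA00003151) to its data-set folder."""
    for ds, lo, hi in DATASET_RANGES:
        if lo <= efta_number <= hi:
            return f"dataset_{ds}"
    return "dataset_9-12" if efta_number >= 39024 else "unknown"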

Extracted Entities

Each document is analyzed for a comprehensive set of structured fields, enabling deep analysis across the entire collection.

👤

People

Named individuals (Epstein, Maxwell, etc.)

🏛

Organizations

FBI, DOJ, courts, financial institutions

📍

Locations

New York, Palm Beach, Virgin Islands

📅

Dates

All date formats found in text

🔖

Bates Numbers

EFTA reference numbers

📧

Emails

Email addresses discovered in text

📞

Phone Numbers

US phone numbers

⚖️

Case Numbers

Court case references

💰

Dollar Amounts

Financial figures and transactions

📄

Document Type

Court filing, letter, transcript, FBI 302, flight log

█▓

Redaction Status

Whether the document contains redacted content
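
A sketch of the regex pass stage 04 might run for a few of these fields; the patterns are deliberate simplifications, not the parser's actual definitions.

import re
from collections import Counter

PATTERNS = {
    "bates_numbers":  re.compile(r"\bEFTA\d{8}\b"),
    "emails":         re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_numbers":  re.compile(r"\(?\b\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "case_numbers":   re.compile(r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{3,5}\b", re.IGNORECASE),
    "dollar_amounts": re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?"),
}

def parse_entities(text: str) -> dict[str, Counter]:
    """Per-field frequency counts for one document's extracted text."""
    return {field: Counter(rx.findall(text)) for field, rx in PATTERNS.items()}

Per-field counts like these roll up naturally into the frequency summaries in entity_summary.json.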

Data Explorer

Search, filter, and browse parsed EFTA documents. All entity data comes from the text the pipeline extracts.

The explorer shows counts of parsed documents, documents with extracted text, documents with entities found, and total pages, alongside charts of the top people, organizations, and locations mentioned and a breakdown by document type.

Quick Start

Get up and running in minutes. Install dependencies and run the full pipeline with a single command.

1

Install Dependencies

Terminal
pip install requests beautifulsoup4 pdfplumber aiohttp
2

Run the Pipeline

Terminal
# Run individual stages
python epstein_scraper.py scrape
python epstein_scraper.py download --workers 8
python epstein_scraper.py extract
python epstein_scraper.py parse

# Or run the full pipeline:
python epstein_scraper.py all
Alt

Bulk ZIP Download

Terminal
# Download ZIP files from DOJ:
# DataSet 1.zip … DataSet 12.zip
python epstein_scraper.py zip
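
Under the hood, the zip stage could stream each archive to disk roughly like this; ZIP_URL_TEMPLATE is a made-up placeholder, since the real DOJ URLs are not reproduced here.

from pathlib import Path

import requests

ZIP_URL_TEMPLATE = "https://www.example.gov/efta/DataSet%20{n}.zip"  # hypothetical URL

def download_zips(dest_dir: str = "epstein_data/zips") -> None:
    out = Path(dest_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n in range(1, 13):  # DataSet 1.zip through DataSet 12.zip
        with requests.get(ZIP_URL_TEMPLATE.format(n=n), stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(out / f"DataSet {n}.zip", "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    f.write(chunk)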

Usage Notes

🔞

Age Gate

The DOJ site requires age verification (18+). The scraper automatically sets the justiceGovAgeVerified=true cookie.

💾

Total Size

The complete collection is 50+ GB of PDFs. Ensure you have sufficient disk space.

🔍

OCR Quality

Many documents are scanned images with variable OCR quality, so extracted text may be incomplete or noisy.

█▓░

Redactions

Heavy redactions appear throughout, especially for victim information.

⏱

Rate Limiting

The scraper waits between requests to avoid overloading DOJ servers.

🔄

Resume Support

Downloads skip already-existing files automatically.

Community & Sources

These documents are publicly available. Here are the primary sources and community projects.

Intelligence Tools

Analyze documents with Gemini AI, search the web in real-time with Grok, and monitor X/Twitter conversations — all powered by your API keys.

API Keys

A Gemini API key powers document analysis & chat; a Grok API key powers web search & X search. PDF uploads are limited to 50 MB and 1,000 pages.
🔍

Document Intelligence

Upload or link an EFTA document to start analyzing it with Gemini AI. Ask questions, extract entities, summarize content, or transcribe pages.