FormatDrop
How-To Guide

How to Convert PDF to TXT

Pulling plain text out of a PDF is the first step in most PDF data pipelines: search indexing, LLM context loading, regex parsing, content moderation, document migration. The conversion itself is usually easy. Searchable PDFs (made from Word, Google Docs, or web pages) extract cleanly with one command. Scanned PDFs (made from cameras or scanners) contain only images of text and need OCR before anything readable comes out. Multi-column layouts, tables, footnotes, and right-to-left scripts add further complications. This guide covers the main PDF-to-TXT methods (browser, command line, Python, OCR for scanned PDFs) and the gotchas that consistently trip people up: missing characters, wrong column order, mangled tables, and encoding bugs.

Quick answer

For searchable PDFs, the fastest method is `pdftotext input.pdf output.txt` (install: `brew install poppler` on Mac, `apt install poppler-utils` on Linux). Add `-layout` to keep the page's physical layout, which helps with tables. For scanned PDFs containing only images, you need OCR: run `ocrmypdf input.pdf searchable.pdf`, then `pdftotext searchable.pdf output.txt`. The browser-based converter handles both with no install.

Method 1: Convert PDF to TXT online (free, in your browser)

  1. Open the FormatDrop PDF to TXT converter

    Open formatdrop.com/pdf-converter in any modern browser. Conversion runs locally in WebAssembly — no upload, no server. Works in Chrome, Safari, Firefox, Edge on Mac, Windows, Linux, iPhone, Android.

  2. Upload your PDF

    Drop your PDF onto the upload area or click to choose. The converter reads it locally. For searchable PDFs the extraction completes in milliseconds; for scanned PDFs it auto-detects the lack of text layer and offers OCR (English by default; multi-language available).

  3. Choose TXT as output and format options

    Select TXT (UTF-8). Optional: 'Preserve layout' tries to keep columns and tables in roughly the right place; 'Linear' flattens everything into reading order; 'Page-separated' inserts form-feed characters between pages for downstream parsing. UTF-8 is the right encoding for almost every modern use.

  4. Download or copy the TXT

    Single-page conversions show the text inline so you can copy it directly. Multi-page or larger conversions download as .txt. The output is ready to feed to grep, regex parsers, Pandas, Elasticsearch, or LLM context windows.


Method 2: Convert PDF to TXT with pdftotext (best for searchable PDFs)

pdftotext is the standard command-line tool for PDF text extraction and ships as part of the poppler-utils package. It is free, fast, and widely used in search and indexing pipelines that process PDFs.

  1. Install. Mac: `brew install poppler`. Linux: `apt install poppler-utils` or `dnf install poppler-utils`. Windows: use WSL or a prebuilt Poppler package for Windows (poppler.freedesktop.org hosts the project and its source releases).
  2. Basic extraction: `pdftotext input.pdf output.txt` — produces a UTF-8 text file with the PDF's text content in reading order.
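  3. Preserve layout (best for tables and forms): `pdftotext -layout input.pdf output.txt` keeps spacing approximately as in the original. Side-by-side columns stay side by side on each output line, so for multi-column prose the default mode usually reads better.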
  4. Extract specific pages: `pdftotext -f 5 -l 10 input.pdf output.txt` — pages 5 through 10.
  5. Force a specific encoding: `pdftotext -enc UTF-8 input.pdf output.txt` (default is usually UTF-8 on modern systems).
  6. Pipe to other tools: `pdftotext input.pdf - | grep -i 'invoice'` — extract and filter in one shot. The `-` means write to stdout.

Note: pdftotext is the right tool for most searchable PDFs. It fails on scanned PDFs (no text layer to extract) and on PDFs with unusual font embedding (rare), and it flattens tables into spaced text. For everything else, it's fast, accurate, and produces clean output.
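
If you drive pdftotext from a script, it can help to assemble the flags in one place. The following is a minimal sketch, not an official wrapper: the function names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example, and it uses only the flags listed above (`-enc`, `-layout`, `-f`, `-l`, and `-` for stdout). It assumes `pdftotext` is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(
    pdf_path: str,
    out_path: str = "-",  # "-" tells pdftotext to write to stdout
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    layout: bool = False,
) -> List[str]:
    """Assemble the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, out_path]
    return cmd


def pdf_to_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return the extracted text as a string."""
    cmd = build_pdftotext_cmd(pdf_path, "-", **options)
    # check=True raises CalledProcessError if pdftotext exits with an error
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `pdf_to_text("input.pdf", first_page=5, last_page=10)` would return pages 5 through 10 of a file named `input.pdf` (a placeholder name). Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.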


Method 3: Convert PDF to TXT with Python pdfplumber (best for tables and structured PDFs)

pdfplumber is a Python library that excels at PDFs with tables, columns, and structured layouts. Built on pdfminer.six but with a much more usable API. Best when you need programmatic control over what you extract.

  1. Install: `pip install pdfplumber`.
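  2. Basic extraction: open the file with `with pdfplumber.open('input.pdf') as pdf:`, build the text inside that block with `text = '\n'.join(page.extract_text() or '' for page in pdf.pages)`, then write it out with `open('output.txt', 'w', encoding='utf-8').write(text)`. The `or ''` guards against pages where `extract_text()` finds no text.
  3. Extract page by page with metadata: inside the `with` block, loop with `for page in pdf.pages:` and print `f'=== Page {page.page_number} ==='` followed by `page.extract_text()`.
  4. Extract tables specifically: call `page.extract_tables()` on each page. It returns a list of tables, each a list of rows (lists of cell strings) that you can write to CSV.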
  5. Filter text by position (e.g., skip headers and footers): `text = page.crop((0, 50, page.width, page.height - 50)).extract_text()`.
  6. For LLM workflows, combine with chunking: split the extracted text into 1000-token chunks for embedding generation.

Note: pdfplumber is slower than pdftotext but produces better results for complex layouts. Use it when pdftotext mangles tables or column order. For pure speed on simple PDFs, stick with pdftotext.
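
The pdfplumber steps above can be written out as a short script. This is a sketch under stated assumptions: the helper names are invented for this example, and `chunk_words` counts whitespace-separated words as a rough stand-in for tokens. The 750-word default is only a rule-of-thumb approximation of a 1000-token chunk; use your model's tokenizer if you need exact sizes.

```python
from typing import Iterator, List


def extract_pages(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page, or an empty string if a page has none."""
    import pdfplumber  # imported here so chunk_words works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None or "" for image-only pages
            yield page.extract_text() or ""


def chunk_words(text: str, max_words: int = 750) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def pdf_to_chunks(pdf_path: str, out_path: str, max_words: int = 750) -> List[str]:
    """Write the full text to out_path and return word-based chunks."""
    text = "\n".join(extract_pages(pdf_path))
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return chunk_words(text, max_words)
```

Calling `pdf_to_chunks("input.pdf", "output.txt")` (placeholder file names) writes the text file and returns the chunks ready for embedding.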


Method 4: OCR a scanned PDF to extract text (when pdftotext returns empty)

If pdftotext produces an empty file, your PDF is almost certainly scanned (just images of text). You need OCR to recognize the characters. The best free toolchain is ocrmypdf (which handles the PDF wrapping) + Tesseract (the OCR engine).

  1. Install. Mac: `brew install ocrmypdf tesseract`. Linux: `apt install ocrmypdf tesseract-ocr`. Windows: WSL or download from github.com/ocrmypdf/OCRmyPDF.
  2. Add an OCR text layer to a scanned PDF: `ocrmypdf input.pdf searchable.pdf`. The output is the same PDF visually, but with hidden text behind each image character.
  3. Then extract text normally: `pdftotext searchable.pdf output.txt`.
  4. For multi-language documents: `ocrmypdf -l eng+fra+deu input.pdf searchable.pdf` (English, French, German simultaneously). Available languages: `tesseract --list-langs`.
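  5. If your scan quality is poor, add deskew and cleanup: `ocrmypdf --deskew --clean input.pdf searchable.pdf`. The `--clean` option relies on the separate `unpaper` program being installed. Both options can noticeably improve accuracy on skewed or noisy scans.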
  6. For huge batches: `for f in *.pdf; do ocrmypdf "$f" "${f%.pdf}-searchable.pdf" && pdftotext "${f%.pdf}-searchable.pdf" "${f%.pdf}.txt"; done`.

Note: OCR accuracy depends heavily on input quality. Sharp 300 DPI scans of printed text typically come out nearly error-free; blurry phone-camera shots are noticeably worse. For mission-critical accuracy, consider a commercial OCR service like Adobe Acrobat OCR or Google Document AI; for everyday use, Tesseract is excellent and free.
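
In a pipeline you may not know in advance which PDFs are scanned. One possible approach is to try pdftotext first and fall back to OCR only when the result is nearly empty. The sketch below assumes `pdftotext` and `ocrmypdf` are installed; the function names and the 20-character threshold are arbitrary choices for illustration, not recommended values.

```python
import subprocess


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat near-empty pdftotext output as a sign there is no text layer."""
    # split() drops spaces, newlines, and form feeds, leaving visible characters
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def run_pdftotext(pdf_path: str) -> str:
    """Return the text pdftotext writes to stdout for the given PDF."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_ocr_fallback(
    pdf_path: str, searchable_path: str, languages: str = "eng"
) -> str:
    """Extract text directly, running ocrmypdf first only when needed."""
    text = run_pdftotext(pdf_path)
    if needs_ocr(text):
        # Same command as step 4 above: ocrmypdf -l LANGS input output
        subprocess.run(
            ["ocrmypdf", "-l", languages, pdf_path, searchable_path], check=True
        )
        text = run_pdftotext(searchable_path)
    return text
```

A character-count threshold is a blunt heuristic: a scanned PDF with a small text watermark would slip past it, so spot-check the results on your own documents.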


Method 5: Extract text on macOS using built-in tools (no install)

macOS has decent PDF text extraction built in via Quartz (Apple's PDF engine). No third-party install required. Works for searchable PDFs only.

  1. Open Automator → New → Quick Action.
  2. Add 'Extract PDF Text' action (built-in). Set output to 'Plain Text'.
  3. Add 'New Text File' action. Configure file naming as 'Same as original'.
  4. Save the Quick Action as 'PDF to TXT'.
  5. In Finder, right-click any PDF → Quick Actions → PDF to TXT. The .txt file appears next to the PDF.
  6. Alternative: open Terminal and use the built-in `mdimport` or AppleScript with the PDFKit framework for more control.

Note: macOS's built-in extractor is convenient but produces less consistent layout than pdftotext. For one-off use, it's fine; for repeatable workflows, install pdftotext.


Method 6: PDF to TXT online without installing anything

If you can't or don't want to install command-line tools, browser-based converters work. Choose carefully — most send your PDF to a server.

  1. Local browser conversion (privacy-respecting): formatdrop.com/pdf-converter — runs in WebAssembly inside your browser. No upload.
  2. Server-based conversion (faster for large PDFs but uploads your file): smallpdf.com, ilovepdf.com, cloudconvert.com.
  3. For each: drop the PDF, choose TXT or 'Extract Text', download the result.
  4. Verify the output looks right by opening in a text editor. Compare a few sentences with the source PDF to catch encoding bugs (mojibake from wrong codepage).

Note: For sensitive PDFs (legal documents, medical records, financial data), only use local browser tools or command-line extraction. Server-based services may retain uploaded files for a period, so check each service's privacy and retention policy before uploading.

When you need to convert PDF to TXT

  • Feeding PDFs to an LLM (Claude, GPT, Gemini)

    LLMs consume text, not PDFs. Extract with pdftotext (fast) or pdfplumber (better for tables), chunk the result into context-window-friendly sizes, and feed to the model. For RAG pipelines, this is the canonical first step.

  • Building a search index over a document corpus

    Elasticsearch, Solr, Meilisearch, and every other search engine index plain text. Extract once at ingestion time, store the .txt alongside the PDF, point your indexer at the .txt files. Searchable in milliseconds even for million-document corpora.

  • Migrating a PDF library to a wiki, CMS, or knowledge base

    Wikis and knowledge bases such as Notion, Confluence, and Outline generally handle plain text or Markdown imports more predictably than PDF. Bulk-extract your PDFs to TXT, optionally convert TXT to Markdown with a script, then bulk-import to the destination. Migrating thousands of pages becomes scriptable instead of manual.

  • Compliance and content moderation scanning

    Extract every PDF in a corpus to TXT, then run regex or NLP scanning for sensitive patterns (PII, credit cards, restricted terms). The text-based pipeline is much faster than visual analysis and integrates with existing data-loss-prevention tools.

  • Academic and research analysis

    Bulk download papers from arXiv or a journal corpus, extract to TXT, then run citation analysis, topic modeling, or term frequency analysis with Python (NLTK, spaCy). Text extraction is the usual first step in academic NLP pipelines.
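
To make the compliance-scanning use case above concrete, here is a small sketch that looks for likely credit card numbers in extracted text. The regular expression and function names are illustrative assumptions, not part of any data-loss-prevention product. It finds runs of 13 to 19 digits and keeps only those that pass the Luhn checksum, which filters out most random digit strings.

```python
import re
from typing import List

# 13 to 19 digits, optionally separated by single spaces or hyphens
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit strings in text that look like valid card numbers."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

A Luhn-valid match is still only a candidate: order numbers and other identifiers can pass the checksum by chance, so treat hits as items for review rather than confirmed findings.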

Troubleshooting common PDF to TXT problems

pdftotext outputs an empty file or just whitespace

Your PDF is scanned — it contains images of text, not text. pdftotext extracts only the text layer, which doesn't exist in scanned PDFs. Solution: run OCR first using ocrmypdf (Method 4 above), then pdftotext on the resulting searchable PDF. Verify by opening the original in any PDF viewer and trying to select text — if you can't select, you need OCR.

Multi-column PDF text comes out interleaved (line 1 col A, line 1 col B, line 2 col A, ...)

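pdftotext has three ordering modes. The default tries to undo the physical layout and emit text in reading order, which handles most simple column layouts. `-layout` preserves the physical layout, so line 1 of column A and line 1 of column B land on the same output line. `-raw` keeps content-stream order. If columns come out interleaved, first make sure you are not passing `-layout` or `-raw`. If the default mode still gets the order wrong, extract each column separately: crop with pdftotext's `-x`, `-y`, `-W`, and `-H` options, or use pdfplumber's `crop()` on each column's bounding box.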
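
One way to force column order is to crop each column and extract the crops left to right. The sketch below assumes equal-width columns with no gutter detection, which is a simplification; the helper names are invented for this example. It uses pdfplumber's `page.bbox`, `page.crop()`, and `extract_text()`.

```python
from typing import List, Tuple

# (x0, top, x1, bottom), the order pdfplumber's crop() expects
BBox = Tuple[float, float, float, float]


def column_bboxes(page_bbox: BBox, columns: int = 2) -> List[BBox]:
    """Split a page bounding box into equal-width column boxes, left to right."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    x0, top, x1, bottom = page_bbox
    col_width = (x1 - x0) / columns
    boxes = []
    for i in range(columns):
        left = x0 + i * col_width
        # Use the exact page edge for the last column to avoid float overshoot
        right = x1 if i == columns - 1 else x0 + (i + 1) * col_width
        boxes.append((left, top, right, bottom))
    return boxes


def extract_columns(pdf_path: str, columns: int = 2) -> str:
    """Extract each page column by column instead of line by line."""
    import pdfplumber  # imported here so column_bboxes works without pdfplumber

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for bbox in column_bboxes(page.bbox, columns):
                parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

Pages that mix a full-width title with two-column body text will need extra handling, for example cropping the title area separately.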

Special characters appear as ? or boxes (mojibake)

Encoding issue. Force UTF-8: `pdftotext -enc UTF-8 input.pdf output.txt`. If that doesn't help, the PDF may use a custom encoding or embedded CMap that pdftotext can't decode. Try pdfplumber instead, which uses pdfminer's more capable encoding handler. As a last resort, OCR the PDF — OCR ignores the original encoding entirely and recognizes characters visually.
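
Before feeding extracted text downstream, a quick automated check can flag files that probably have decoding problems. This is a heuristic sketch with a hypothetical function name: it counts the Unicode replacement character, private-use code points (which can appear when a font lacks a usable character map), and stray control characters.

```python
import unicodedata
from collections import Counter


def suspicious_characters(text: str) -> Counter:
    """Count characters that often indicate a failed or partial decode."""
    counts: Counter = Counter()
    for char in text:
        category = unicodedata.category(char)
        if char == "\ufffd":
            counts["replacement"] += 1      # U+FFFD, inserted for undecodable bytes
        elif category == "Co":
            counts["private_use"] += 1      # private-use area code points
        elif category == "Cc" and char not in "\n\r\t\f":
            counts["control"] += 1          # control characters other than layout ones
    return counts
```

If the counts are more than a handful for a document, compare a few sentences against the source PDF before trusting the output.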

Tables come out as garbled space-separated columns

Tables are the hardest PDF content to extract because they're just positioned text; the PDF stores no actual table structure. Solutions: (1) Use `pdftotext -layout` to preserve approximate spacing. (2) For real tables, use tabula-py or camelot, both of which extract tables to CSV. (3) For LLM consumption, pdfplumber's `extract_tables()` returns Python lists of rows you can format however you need.
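
As an illustration of option (3), the rows returned by `extract_tables()` can be turned into CSV with the standard library. The helper name `table_to_csv` is made up for this sketch. It assumes each table is a list of rows whose cells are strings or `None` for empty cells.

```python
import csv
import io
from typing import Optional, Sequence


def table_to_csv(table: Sequence[Sequence[Optional[str]]]) -> str:
    """Serialize one extracted table (rows of cells) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty cells may be None; cells can also contain line breaks
        writer.writerow([(cell or "").replace("\n", " ").strip() for cell in row])
    return buffer.getvalue()
```

With pdfplumber you would call it once per table, for example `table_to_csv(table)` for each `table` in `page.extract_tables()`. The `csv` module handles quoting for cells that contain commas.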

Footnotes and headers appear inline with body text

PDFs don't distinguish footnotes from body text in their text streams — it's all positioned text. Filter by position: pdfplumber's `crop()` method lets you exclude top and bottom margins. Or post-process with regex to remove lines matching footnote patterns (e.g., starting with a number followed by a tab).
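
For the post-processing route, one possible heuristic is to drop lines that repeat on most pages (running headers and footers) along with bare page-number lines. The function name, the 60% threshold, and the page-number pattern below are assumptions chosen for illustration; tune them against your own documents.

```python
import re
from collections import Counter
from typing import List

# Matches lines like "7", "Page 7", or "page 7 of 12"
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_running_lines(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Remove page-number lines and lines repeated on many pages."""
    seen: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        seen.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, count in seen.items() if count >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

The trade-off is that a body line consisting only of a number is also removed, so this suits prose documents better than documents full of numeric tables.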

Large PDF (1000+ pages) takes forever or runs out of memory

Process page-by-page rather than all at once. With pdftotext: `for i in $(seq 1 1000); do pdftotext -f $i -l $i input.pdf page-$i.txt; done`. With pdfplumber: open in a `with` block and process pages in a generator, never holding the whole document in memory.
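
An alternative to invoking pdftotext once per page is to run it once and split its output stream. By default pdftotext separates pages with a form-feed character (`\f`) unless you pass `-nopgbrk`. The sketch below, with made-up function names, reads stdout incrementally and yields one page at a time, so only the current page is held in memory.

```python
import subprocess
from typing import Iterable, Iterator


def split_on_form_feed(chunks: Iterable[str]) -> Iterator[str]:
    """Yield pages from text chunks that use form feed as the page separator."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\f" in buffer:
            page, buffer = buffer.split("\f", 1)
            yield page
    if buffer.strip():
        # Trailing text without a closing form feed still counts as a page
        yield buffer


def iter_pdf_pages(pdf_path: str) -> Iterator[str]:
    """Stream pages from a single pdftotext run (requires poppler on PATH)."""
    proc = subprocess.Popen(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    assert proc.stdout is not None
    try:
        # Iterating the pipe yields lines as pdftotext produces them
        yield from split_on_form_feed(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()
```

Because the PDF is parsed only once, this tends to be much cheaper than a shell loop that restarts pdftotext for every page.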

Text is in the wrong order (later page text appears earlier)

Some PDFs have unusual page rotations or layered content. Try `-layout` mode, which follows the visual position of the text. If that doesn't help, keep in mind that pdftotext works from where text sits on the page rather than from any tagged reading order, so unusual layouts can defeat it; extract by visual position instead using pdfplumber's bounding-box methods. For RTL languages (Arabic, Hebrew), use `pdftotext -enc UTF-8` and make sure your editor handles RTL.

Why convert PDF to TXT?

PDF-to-TXT is the most common PDF processing task and the foundation of every document data pipeline. Search indexers run it. RAG systems run it. Compliance scanners run it. Migration tools run it. The conversion is mostly easy — for searchable PDFs, `pdftotext input.pdf output.txt` does it in milliseconds.

The main catch is scanned PDFs, which contain only images of text. Those require OCR (Optical Character Recognition) to extract anything meaningful. The free Tesseract + ocrmypdf toolchain handles most everyday OCR needs. For mission-critical accuracy, commercial services (Adobe Acrobat Pro, Google Document AI) generally do better on poor-quality scans, at a per-page or subscription cost.

The other catch is layout. PDFs don't actually have 'columns' or 'tables'; they have positioned text that visually looks structured. pdftotext's default mode tries to rebuild reading order and copes with most simple column layouts, while `-layout` keeps the physical arrangement, which suits tables better than multi-column prose. Either way, tables come out as space-separated text instead of CSV. For programmatic table extraction, tabula or camelot are the right tools.

Whatever your use case — LLM context, search index, knowledge base migration, compliance scan — start with pdftotext, fall back to pdfplumber for complex layouts, add ocrmypdf for scans. That covers virtually every PDF-to-TXT scenario.

Your files never leave your device

FormatDrop runs the conversion engine entirely inside your browser using WebAssembly. No file upload. No server. Nothing stored. You can verify this by opening DevTools → Network tab and watching: zero upload requests.

Frequently asked questions

Is converting PDF to TXT free?
Yes. pdftotext, pdfplumber, ocrmypdf, Tesseract, and the FormatDrop browser converter are all free. Online tools are usually free for small files; some throttle large or batch jobs. The only paid options are commercial OCR services (Adobe Acrobat Pro, Google Document AI), which deliver higher OCR accuracy on poor-quality scans.
Will text extraction work on a scanned PDF?
Not directly. Scanned PDFs contain images of text, not actual text. pdftotext, pdfplumber, and similar tools return an empty result. You need OCR (Optical Character Recognition) to recognize the characters. ocrmypdf with Tesseract is the best free option; it adds a hidden text layer to your scanned PDF, which then extracts normally.
Best command-line tool for PDF to TXT?
pdftotext for searchable PDFs (fast, accurate, simple). pdfplumber for PDFs with complex layouts or tables (slower but more capable). ocrmypdf + pdftotext for scanned PDFs (handles OCR + extraction). For most searchable PDFs, pdftotext is the right choice.
How do I handle a multi-column PDF?
Start with plain `pdftotext input.pdf output.txt`: the default mode tries to undo the column layout and emit text in reading order. Avoid `-layout` for multi-column prose, because it preserves the physical layout and puts line 1 of column A and line 1 of column B on the same output line. For especially complex layouts, crop each column with pdftotext's `-x`, `-y`, `-W`, and `-H` options or pdfplumber's `crop()` and extract the columns one at a time.
Will my PDF tables convert correctly to TXT?
Approximately. Tables in PDFs aren't actual tables — they're positioned text that visually looks like a table. pdftotext with `-layout` preserves approximate spacing, but doesn't produce structured CSV-style output. For real table extraction (rows and columns as data), use tabula-py, camelot, or pdfplumber's `extract_tables()` method, which output CSV-compatible rows.
Can I extract text from a password-protected PDF?
Only if you have the password. With pdftotext: `pdftotext -opw owner-password input.pdf output.txt` for the owner password, `-upw user-password` for the user password. Without the password, none of the tools in this guide can decrypt the content.
Does converting PDF to TXT preserve formatting?
No — TXT is plain text with no formatting. Bold, italic, headings, font choices, and color all disappear. If you need to preserve some formatting, convert to RTF, DOCX, or Markdown instead. TXT is for downstream programmatic processing, not human reading.
What's the difference between pdftotext and pdfplumber?
pdftotext is C-based, fast (millisecond extraction), and simple — good for batch jobs and pipelines. pdfplumber is Python-based, slower, but produces better output for complex layouts, tables, and structured documents. Use pdftotext as the default; reach for pdfplumber when pdftotext mangles a specific document.
How do I OCR a PDF for free?
ocrmypdf + Tesseract. Mac install: `brew install ocrmypdf`. Then `ocrmypdf scanned.pdf searchable.pdf` — output PDF has a hidden text layer that extracts cleanly with pdftotext. Tesseract handles 100+ languages. For 90% of OCR needs, this stack is free, accurate, and reliable.