Extract text from a PDF in PHP

Pull the text out of any PDF that has a text layer, straight from PHP. rust-pdf walks the content stream, maps the shown glyph codes back to Unicode through each font's ToUnicode map, and infers spaces and line breaks. Two-byte Type0 fonts and CJK text are covered.

Why PHP needs this

PHP's classic libraries (TCPDF, FPDF) predate much of modern PDF and offer no real PDF/A support, no PAdES signatures and no AES-256 encryption, while the commercial alternatives are costly.

Reliable extraction is harder than it looks: codes in the stream are font-specific and must be mapped to Unicode, and spacing has to be inferred from positioning. rust-pdf does both, so the text you get back is the text a human reads, ready for search, indexing or data pipelines.

  • Maps glyph codes to Unicode via each font's ToUnicode map, including two-byte Type0 and CJK.
  • Infers spaces from large negative adjustments and line breaks from text positioning.
  • Pulls raster images out too: JPEGs verbatim as .jpg, everything else as .png.
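
To make the space-inference rule concrete, here is a minimal sketch in plain PHP. It is not rust-pdf's implementation: the function name `textFromTjArray`, the array representation of a TJ operand and the -200 threshold are assumptions chosen for illustration, and line-break inference is left out. In a PDF `TJ` array, numbers are expressed in thousandths of a text-space unit and subtracted from the pen position, so a large negative number opens a gap.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn an already-decoded TJ operand into text.
 * $items mixes strings with numeric position adjustments.
 * The threshold is an illustrative guess, not a rust-pdf setting.
 */
function textFromTjArray(array $items, float $spaceThreshold = -200.0): string
{
    $out = '';
    foreach ($items as $item) {
        if (is_string($item)) {
            $out .= $item;
            continue;
        }
        // Numbers are subtracted from the pen position, so a large
        // negative value pushes the next glyph right and reads as a space.
        $isGap = $item <= $spaceThreshold;
        if ($isGap && $out !== '' && substr($out, -1) !== ' ') {
            $out .= ' ';
        }
    }
    return $out;
}

echo textFromTjArray(['Hel', -15, 'lo', -310, 'wor', 20, 'ld']), "\n"; // Hello world
```

Small adjustments such as -15 or 20 are ordinary kerning and are ignored; only the -310 step is wide enough to count as a word gap under this threshold.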

Extract text in PHP with rust-pdf

Install the package, then call the same idiomatic API every rust-pdf binding shares. The snippet below is real PHP code from the reference docs.

PHP
use RustPdf\Pdf;

$data = file_get_contents('report.pdf');
echo Pdf::extractText($data);

$n = Pdf::extractImagesToDir($data, 'out_images/');   // returns how many were written
echo "wrote $n image(s)\n";
Validated by: pdftotext

This is part of the free tier in PHP. No license required.

Full PHP reference in the documentation.
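
For pipelines that process many files, a thin wrapper around the extraction call can collect failures instead of stopping at the first bad PDF. The sketch below is an assumption-laden example, not part of the rust-pdf API: `extractAllToText` is a hypothetical helper, the extractor is passed in as a callable, and how the binding reports a malformed PDF (assumed here to be an exception) is not confirmed by the reference snippet.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical batch helper: run an extractor over several PDFs and
 * write one .txt file per input. $extract is any callable that takes
 * the raw PDF bytes and returns the extracted text.
 *
 * @param string[] $pdfPaths
 * @return array<string, string> input path => error message, failures only
 */
function extractAllToText(array $pdfPaths, string $outDir, callable $extract): array
{
    if (!is_dir($outDir) && !mkdir($outDir, 0775, true) && !is_dir($outDir)) {
        throw new RuntimeException("cannot create $outDir");
    }
    $failures = [];
    foreach ($pdfPaths as $path) {
        $data = @file_get_contents($path);
        if ($data === false) {
            $failures[$path] = 'unreadable file';
            continue;
        }
        try {
            $text = $extract($data);
        } catch (Throwable $e) {
            // Assumption: extraction errors surface as exceptions.
            $failures[$path] = $e->getMessage();
            continue;
        }
        $name = pathinfo($path, PATHINFO_FILENAME) . '.txt';
        file_put_contents(rtrim($outDir, '/') . '/' . $name, $text);
    }
    return $failures;
}
```

With the binding installed, the extractor would presumably be `fn (string $data): string => \RustPdf\Pdf::extractText($data)`, assuming `extractText` returns a string as the `echo` in the snippet above suggests.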

Text extraction in PHP: FAQ

Does it extract Unicode and CJK text in PHP?

Yes. Each shown code is mapped back to Unicode through the font's ToUnicode CMap, including two-byte Type0 fonts, so Japanese, Greek, Arabic and other scripts come back correctly.
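
As an illustration of what that mapping involves, the following sketch parses a tiny ToUnicode CMap and decodes a two-byte string with it. It is a simplified example, not rust-pdf's code: the function names are hypothetical, only `bfchar` entries are handled (real CMaps also use `bfrange` and codespace ranges), and every code is assumed to be exactly two bytes wide, as with an Identity-H Type0 font.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified ToUnicode reader: handles only
 * "<code> <unicode>" pairs inside beginbfchar ... endbfchar.
 *
 * @return array<string, string> upper-case hex code => UTF-8 text
 */
function parseBfChars(string $cmap): array
{
    $map = [];
    if (!preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections)) {
        return $map;
    }
    $pairRe = '/<([0-9A-Fa-f]+)>\s*<((?:[0-9A-Fa-f]{2})+)>/';
    foreach ($sections[1] as $body) {
        preg_match_all($pairRe, $body, $pairs, PREG_SET_ORDER);
        foreach ($pairs as [, $code, $utf16]) {
            // Destination values are UTF-16BE code units written as hex.
            $map[strtoupper($code)] = iconv('UTF-16BE', 'UTF-8', hex2bin($utf16));
        }
    }
    return $map;
}

/** Decode a shown string made of fixed two-byte codes. */
function decodeTwoByteString(string $bytes, array $map): string
{
    if ($bytes === '') {
        return '';
    }
    $out = '';
    foreach (str_split($bytes, 2) as $code) {
        $key = strtoupper(bin2hex($code));
        $out .= $map[$key] ?? "\u{FFFD}"; // unmapped code: replacement character
    }
    return $out;
}

$cmap = <<<'CMAP'
2 beginbfchar
<0001> <65E5>
<0002> <672C>
endbfchar
CMAP;

echo decodeTwoByteString("\x00\x01\x00\x02", parseBfChars($cmap)), "\n"; // 日本
```

The codes 0001 and 0002 mean nothing outside this font; only the CMap ties them to U+65E5 and U+672C, which is why extraction without the ToUnicode step returns garbage.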

Can it extract scanned PDFs?

No. Extraction reads the text layer of a PDF. A scanned document is a page image with no text layer, so it needs OCR first. For PDFs that contain real text, extraction is exact.
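
One way to route scanned files to OCR is to check whether extraction returned any visible text. The helper below is a rough heuristic sketched for illustration, not a rust-pdf feature: the name `probablyNeedsOcr` and the 20-character cutoff are assumptions, and a document that mixes scanned and text pages can still slip through.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical heuristic: treat a document as "probably scanned" when
 * extraction yields almost no visible characters. The cutoff is arbitrary.
 */
function probablyNeedsOcr(string $extractedText, int $minVisibleChars = 20): bool
{
    // Count characters that are neither whitespace, separators nor control codes.
    $count = preg_match_all('/[^\s\p{Z}\p{C}]/u', $extractedText);
    if ($count === false) {
        // Input was not valid UTF-8: fall back to a byte count.
        $count = strlen(trim($extractedText));
    }
    return $count < $minVisibleChars;
}
```

It is meant to be fed the string returned by `Pdf::extractText($data)`; a `true` result suggests sending the file through OCR before retrying.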

Is text extraction free?

Yes. Text and image extraction are part of the free tier in PHP. No license token is required.

One Rust core, the same output across every language. Prototype for free, license the corporate features when you ship.