Extract text from a PDF in Ruby

Pull the text out of any PDF from Ruby. rust-pdf walks the content stream, maps shown glyph codes back to Unicode through each font's ToUnicode map (including two-byte Type0 fonts and CJK), and infers spaces and line breaks from text positioning.

Why Ruby needs this

Prawn generates beautiful PDFs but has no PDF/A, no digital signatures, no AES-256 encryption and no PDF/UA, so the features that regulated industries depend on have simply been missing from Ruby.

Reliable extraction is harder than it looks: codes in the stream are font-specific and must be mapped to Unicode, and spacing has to be inferred from positioning. rust-pdf does both, so the text you get back is the text a human reads, ready for search, indexing or data pipelines.

  • Maps glyph codes to Unicode via each font's ToUnicode map, including two-byte Type0 and CJK.
  • Infers spaces from large negative positioning adjustments between glyphs, and line breaks from text positioning.
  • Pulls raster images out too: JPEGs verbatim as .jpg, everything else as .png.
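
To make the first two points concrete, here is a minimal sketch in plain Ruby of how a single text-showing array might be decoded. It is an illustration of the general technique, not rust-pdf's implementation: the TO_UNICODE table, the decode_tj helper and the -200 threshold are all hypothetical names and values chosen for the example.

```ruby
# Hypothetical ToUnicode table for one two-byte (Type0) font: glyph code => Unicode text.
TO_UNICODE = {
  0x0001 => "H",
  0x0002 => "i",
  0x0003 => "\u65E5", # 日
  0x0004 => "\u672C", # 本
}.freeze

# Adjustments are expressed in thousandths of a text-space unit; a negative value
# shifts the next glyph to the right. This cutoff is an assumption for the sketch.
SPACE_THRESHOLD = -200

# Decode an array whose items are either raw glyph-code strings or numeric adjustments.
def decode_tj(items, to_unicode, threshold: SPACE_THRESHOLD)
  out = +""
  items.each do |item|
    case item
    when String
      # Split the raw bytes into big-endian two-byte codes and look each one up.
      item.unpack("n*").each do |code|
        out << to_unicode.fetch(code, "\uFFFD") # unmapped codes become U+FFFD
      end
    when Numeric
      # A large negative adjustment is treated as a word gap.
      out << " " if item <= threshold && !out.end_with?(" ")
    end
  end
  out
end

items = ["\x00\x01\x00\x02".b, -250, "\x00\x03\x00\x04".b, -20]
puts decode_tj(items, TO_UNICODE) # prints "Hi 日本"
```

The small -20 adjustment is ordinary kerning and produces no space, while -250 crosses the threshold and is read as a word gap. A production extractor would presumably scale the cutoff by font size and glyph widths rather than use a fixed number.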

Extract text in Ruby with rust-pdf

Install the package, then call the same idiomatic API every rust-pdf binding shares. The snippet below is real Ruby code from the reference docs.

Ruby
data = File.binread("report.pdf")
puts RustPdf.extract_text(data)

n = RustPdf.extract_images_to_dir(data, "out_images/")   # returns how many were written
puts "wrote #{n} image(s)"

Validated by: pdftotext

This is part of the free tier in Ruby. No license required.

Full Ruby reference in the documentation.

Text extraction in Ruby: FAQ

Does it extract Unicode and CJK text in Ruby?

Yes. Each shown code is mapped back to Unicode through the font's ToUnicode CMap, including two-byte Type0 fonts, so Japanese, Greek, Arabic and other scripts come back correctly.
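
For readers curious what a ToUnicode CMap looks like, the sketch below parses the simplest kind of entry, a bfchar section, into a Ruby hash. It is a simplified illustration and not how rust-pdf does it: parse_bfchar and the sample CMap are hypothetical, and real CMaps also contain bfrange sections and codespace declarations that this sketch ignores.

```ruby
# A tiny, hand-written ToUnicode fragment: <source code> <UTF-16BE destination>.
TOUNICODE_CMAP = <<~CMAP
  2 beginbfchar
  <0003> <65E5>
  <0004> <672C>
  endbfchar
CMAP

# Build a { glyph_code => unicode_string } table from every bfchar section.
def parse_bfchar(cmap)
  table = {}
  cmap.scan(/beginbfchar(.*?)endbfchar/m).flatten.each do |body|
    body.scan(/<(\h+)>\s*<(\h+)>/) do |src, dst|
      # The destination is hex-encoded UTF-16BE and may hold several code units.
      utf16 = [dst].pack("H*").force_encoding("UTF-16BE")
      table[src.hex] = utf16.encode("UTF-8")
    end
  end
  table
end

table = parse_bfchar(TOUNICODE_CMAP)
puts table[0x0003] + table[0x0004] # prints "日本"
```

Because the destination side is UTF-16BE, a single glyph code can map to more than one character, which is how ligatures such as "fi" can be turned back into two letters.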

Can it extract scanned PDFs?

No. Extraction reads the text layer of a PDF. A scanned document is an image with no text layer, which needs OCR first. For PDFs that contain real text, extraction is exact.
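
One way to route scanned files to OCR is to check how much visible text extraction returned. The helper below is a rough heuristic in plain Ruby; probably_scanned? and its 20-character cutoff are hypothetical, and it assumes that extracting a PDF without a text layer yields an empty or near-empty string rather than raising an error, which the reference docs should confirm.

```ruby
# Treat a document as "probably scanned" when extraction yields almost no
# non-whitespace characters. The cutoff is an arbitrary assumption.
def probably_scanned?(text, min_chars: 20)
  text.to_s.gsub(/\s+/, "").length < min_chars
end

# Usage with the extract_text call shown above (needs the rust-pdf gem loaded):
#   text = RustPdf.extract_text(File.binread("report.pdf"))
#   warn "no text layer found, run OCR first" if probably_scanned?(text)

puts probably_scanned?("")                                          # prints "true"
puts probably_scanned?("Quarterly report: revenue grew 12 percent") # prints "false"
```

A fixed cutoff will misclassify very short documents, so in a pipeline it may be worth combining this with a page count or an image count.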

Is text extraction free?

Yes. Text and image extraction are part of the free tier in Ruby. No license token is required.

One Rust core, the same output across every language. Prototype for free, license the corporate features when you ship.