Extract text from a PDF in Go

Pull the text out of any PDF that has a text layer, straight from Go. rust-pdf walks the content stream, maps each shown glyph code back to Unicode through the font's ToUnicode CMap (including two-byte Type0 fonts and CJK), and infers spaces and line breaks from text positioning.

Why Go needs this

Go's PDF story is fragmented: gofpdf is archived and unipdf is commercial, so production-grade PDF handling usually means a paid dependency or a brittle wrapper.

Reliable extraction is harder than it looks: codes in the stream are font-specific and must be mapped to Unicode, and spacing has to be inferred from positioning. rust-pdf does both, so the text you get back is the text a human reads, ready for search, indexing or data pipelines.

  • Maps glyph codes to Unicode via each font's ToUnicode CMap, including two-byte Type0 and CJK.
  • Infers spaces from large negative adjustments and line breaks from text positioning.
  • Pulls raster images out too: JPEGs verbatim as .jpg, everything else as .png.
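
To make the first two points concrete, here is a minimal sketch of the general technique, not rust-pdf's implementation. It assumes a simplified model of a TJ array (runs of one-byte glyph codes interleaved with numeric adjustments in thousandths of a text-space unit), a hand-written ToUnicode table, and an illustrative threshold of -200. The names tjItem, decodeTJ and spaceThreshold are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// tjItem is one element of a simplified TJ array: either a run of glyph
// codes, or a numeric position adjustment (used when codes is nil).
type tjItem struct {
	codes  []byte
	adjust float64
}

// spaceThreshold is an illustrative cutoff, not rust-pdf's actual value.
// In a TJ array a negative number moves the next glyph to the right.
const spaceThreshold = -200.0

// decodeTJ maps every glyph code through toUnicode and emits a space
// wherever an adjustment is at or below spaceThreshold.
func decodeTJ(items []tjItem, toUnicode map[byte]rune) string {
	var sb strings.Builder
	for _, it := range items {
		if it.codes == nil {
			if it.adjust <= spaceThreshold {
				sb.WriteByte(' ')
			}
			continue
		}
		for _, c := range it.codes {
			if r, ok := toUnicode[c]; ok {
				sb.WriteRune(r)
			} else {
				sb.WriteRune('\uFFFD') // unmapped code: replacement character
			}
		}
	}
	return sb.String()
}

func main() {
	// Hypothetical font-specific codes; real values come from the font's CMap.
	toUnicode := map[byte]rune{0x01: 'H', 0x02: 'i', 0x03: 't', 0x04: 'h', 0x05: 'e', 0x06: 'r'}
	items := []tjItem{
		{codes: []byte{0x01, 0x02}},
		{adjust: -250}, // wide gap: treated as a word space
		{codes: []byte{0x03, 0x04}},
		{adjust: -15}, // small kerning tweak: ignored
		{codes: []byte{0x05, 0x06, 0x05}},
	}
	fmt.Println(decodeTJ(items, toUnicode)) // prints "Hi there"
}
```

A fixed threshold is the simplest possible heuristic. A real extractor would likely scale it by font size and glyph widths.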

Extract text in Go with rust-pdf

Install the package, then call the same idiomatic API every rust-pdf binding shares. The snippet below is real Go code from the reference docs.

Go
data, _ := os.ReadFile("report.pdf")
text, _ := rustpdf.ExtractText(data)
fmt.Println(text)

n, _ := rustpdf.ExtractImagesToDir(data, "out_images/")   // returns how many were written
fmt.Printf("wrote %d image(s)\n", n)

Validated by: pdftotext
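
The .jpg versus .png split for extracted images can be pictured with a small sketch. It assumes JPEG data is recognised by the PDF DCTDecode stream filter, which is how JPEG is stored in PDF. Whether rust-pdf decides this way is an assumption, and imageExt is a hypothetical helper, not part of its API.

```go
package main

import "fmt"

// imageExt picks an output extension from a PDF image stream's filter name.
// Assumption: JPEG data is identified by the DCTDecode filter.
func imageExt(filter string) string {
	if filter == "DCTDecode" {
		return ".jpg" // JPEG bytes can be written out unchanged
	}
	return ".png" // everything else is decoded and re-encoded as PNG
}

func main() {
	for _, f := range []string{"DCTDecode", "FlateDecode", "CCITTFaxDecode"} {
		fmt.Printf("%-15s -> %s\n", f, imageExt(f))
	}
}
```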

This is part of the free tier in Go. No license required.

Full Go reference in the documentation.

Text extraction in Go: FAQ

Does it extract Unicode and CJK text in Go?

Yes. Each shown code is mapped back to Unicode through the font's ToUnicode CMap, including two-byte Type0 fonts, so Japanese, Greek, Arabic and other scripts come back correctly.
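
As an illustration of the two-byte case, the sketch below reads a shown string as big-endian 16-bit codes and looks each one up in a ToUnicode-style table. It assumes a fixed two-byte code space (as with Identity-H encoding). The table entries are made up, and decodeTwoByte is a hypothetical helper rather than a rust-pdf function.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeTwoByte reads raw as big-endian 16-bit codes and maps each one
// through toUnicode. Values are strings because a single code may map to
// several code points (for example a ligature). A trailing odd byte is ignored.
func decodeTwoByte(raw []byte, toUnicode map[uint16]string) string {
	var sb strings.Builder
	for i := 0; i+1 < len(raw); i += 2 {
		code := uint16(raw[i])<<8 | uint16(raw[i+1])
		if s, ok := toUnicode[code]; ok {
			sb.WriteString(s)
		} else {
			sb.WriteRune('\uFFFD') // unmapped code: replacement character
		}
	}
	return sb.String()
}

func main() {
	// Hypothetical ToUnicode entries; real code values are font-specific.
	toUnicode := map[uint16]string{
		0x0A3C: "日",
		0x0B71: "本",
		0x0C02: "語",
	}
	raw := []byte{0x0A, 0x3C, 0x0B, 0x71, 0x0C, 0x02}
	fmt.Println(decodeTwoByte(raw, toUnicode)) // prints "日本語"
}
```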

Can it extract scanned PDFs?

No. Extraction reads the text layer of a PDF. A scanned document is an image with no text layer, which needs OCR first. For PDFs that contain real text, extraction is exact.
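
In a pipeline, one simple way to route scanned documents to OCR is to check whether extraction returned anything but whitespace. The sketch below shows that check on plain strings standing in for extraction results. hasTextLayer is a hypothetical helper, and an empty result is only a hint that the file is a scan, not proof.

```go
package main

import (
	"fmt"
	"strings"
)

// hasTextLayer reports whether extracted text contains anything other
// than whitespace.
func hasTextLayer(text string) bool {
	return strings.TrimSpace(text) != ""
}

func main() {
	// Stand-ins for the strings an extraction call would return.
	samples := []struct{ name, text string }{
		{"report.pdf", "Quarterly results\nRevenue rose 4%."},
		{"scan.pdf", " \n"},
	}
	for _, s := range samples {
		if hasTextLayer(s.text) {
			fmt.Printf("%s: text layer found\n", s.name)
		} else {
			fmt.Printf("%s: no text, route to OCR\n", s.name)
		}
	}
}
```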

Is text extraction free?

Yes. Text and image extraction are part of the free tier in Go. No license token is required.

Extract Text From a PDF in Go

One Rust core, the same output across every language. Prototype for free, license the corporate features when you ship.