Extract text from a PDF in Node.js
Pull the text out of any PDF from Node.js. rust-pdf walks the content stream, maps the shown glyph codes back to Unicode through each font's ToUnicode map (including two-byte Type0 fonts and CJK), and infers spaces and line breaks.
Why Node.js and TypeScript need this
Node.js projects typically reach for heavy wrappers or headless Chrome for any PDF work beyond the basics, an approach that is slow and fragile and never archival-grade.
Reliable extraction is harder than it looks: codes in the stream are font-specific and must be mapped to Unicode, and spacing has to be inferred from positioning. rust-pdf does both, so the text you get back is the text a human reads, ready for search, indexing or data pipelines.
- Maps glyph codes to Unicode via each font's ToUnicode map, including two-byte Type0 and CJK.
- Infers spaces from large negative kerning adjustments between glyphs, and line breaks from changes in text position.
- Pull raster images out too: JPEGs verbatim as .jpg, everything else as .png.
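The spacing heuristic in the second bullet can be sketched in a few lines. This is a toy model for illustration, not rust-pdf's actual implementation. It assumes the operands of a PDF `TJ` operator have already been decoded into strings and numbers; the numbers are in thousandths of a text-space unit and are subtracted from the pen position, so a large negative value opens a gap. The names `textFromTJ` and `joinRuns`, the -200 threshold and the half-font-size rule are all assumptions of this example.

```javascript
// Toy model of space and line-break inference. Not rust-pdf's real code.
// A TJ array mixes decoded strings with numeric adjustments in thousandths
// of a text-space unit. Each number is subtracted from the pen position,
// so a large negative value opens a visible gap between glyphs.
const SPACE_THRESHOLD = -200; // assumed cutoff; real extractors tune this

function textFromTJ(items) {
  let out = "";
  for (const item of items) {
    if (typeof item === "string") {
      out += item;
    } else if (item <= SPACE_THRESHOLD && out !== "" && !out.endsWith(" ")) {
      out += " "; // gap wide enough to read as a word space
    }
    // Smaller adjustments are ordinary kerning and produce no output.
  }
  return out;
}

// Line breaks: start a new line whenever the baseline (y) moves by more
// than half the font size between consecutive text runs (assumed rule).
function joinRuns(runs, fontSize = 12) {
  const lines = [];
  let lastY = null;
  for (const run of runs) {
    if (lastY === null || Math.abs(run.y - lastY) > fontSize * 0.5) {
      lines.push(run.text);
    } else {
      lines[lines.length - 1] += run.text;
    }
    lastY = run.y;
  }
  return lines.join("\n");
}

console.log(textFromTJ(["Hello", -250, "W", 40, "orld"])); // Hello World
```

The small positive adjustment between "W" and "orld" is treated as kerning, which is why the word is not split.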
Extract text in Node.js with rust-pdf
Install the package, then call the same idiomatic API every rust-pdf binding shares. The snippet below is real Node.js code from the reference docs.
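const fs = require("node:fs");
// rustpdf: the rust-pdf binding, imported under the package name from the install step.
const data = fs.readFileSync("report.pdf"); // whole PDF as a Buffer
console.log(rustpdf.extractText(data)); // prints the extracted text
const n = rustpdf.extractImagesToDir(data, "out_images/"); // returns how many were written
console.log(`wrote ${n} image(s)`);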
This is part of the free tier in Node.js. No license required.
Full Node.js reference in the documentation.
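Because extraction yields plain text, it can feed a search or indexing step directly. Here is a minimal sketch, assuming only that `extractText(data)` returns a string synchronously, as the snippet above suggests. The extractor is passed in as a parameter, and `tokenize` and `buildIndex` are hypothetical helpers of this example, not part of rust-pdf.

```javascript
// Hypothetical helpers for illustration. Only extractText(data) comes from
// the snippet above, and it is injected so the sketch runs without the library.
function tokenize(text) {
  // Split on anything that is not a Unicode letter or digit.
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// docs: Map of document name -> PDF bytes.
// Returns a Map of word -> Set of document names containing that word.
function buildIndex(docs, extractText) {
  const index = new Map();
  for (const [name, data] of docs) {
    for (const word of tokenize(extractText(data))) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(name);
    }
  }
  return index;
}

// Usage with a stand-in extractor; swap in rustpdf.extractText for real PDFs.
const fakeExtract = (data) => data.toString("utf8");
const docs = new Map([
  ["a.pdf", Buffer.from("Quarterly report")],
  ["b.pdf", Buffer.from("Annual report")],
]);
const index = buildIndex(docs, fakeExtract);
console.log([...index.get("report")]); // [ 'a.pdf', 'b.pdf' ]
```

Injecting the extractor keeps the indexing logic testable on its own and independent of which binding supplies the text.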
Text extraction in Node.js: FAQ
Does it extract Unicode and CJK text in Node.js?
Yes. Each shown code is mapped back to Unicode through the font's ToUnicode CMap, including two-byte Type0 fonts, so Japanese, Greek, Arabic and other scripts come back correctly.
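To illustrate what mapping through a ToUnicode CMap involves, the sketch below parses the `beginbfchar` entries of a small hand-written CMap and decodes a string of two-byte codes with it. It is a simplified model, not rust-pdf's code: it ignores `bfrange` sections and codespace ranges and assumes every code is exactly two bytes wide. The function names `parseBfChars` and `decodeTwoByte` are hypothetical.

```javascript
// Simplified ToUnicode handling for illustration; not rust-pdf's real code.
// Only beginbfchar sections are read; bfrange and codespace ranges are ignored.
function parseBfChars(cmap) {
  const map = new Map();
  for (const [, body] of cmap.matchAll(/beginbfchar([\s\S]*?)endbfchar/g)) {
    const pairs = body.matchAll(/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g);
    for (const [, src, dst] of pairs) {
      // The destination is UTF-16BE: four hex digits per code unit.
      const units = (dst.match(/.{1,4}/g) || []).map((h) => parseInt(h, 16));
      map.set(parseInt(src, 16), String.fromCharCode(...units));
    }
  }
  return map;
}

// Decode a byte string of two-byte codes (fixed width assumed) through the map.
function decodeTwoByte(bytes, map) {
  let out = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    const code = (bytes[i] << 8) | bytes[i + 1];
    out += map.get(code) ?? "\uFFFD"; // unmapped codes become U+FFFD
  }
  return out;
}

const cmap = `
2 beginbfchar
<0001> <65E5>
<0002> <672C>
endbfchar`;
const toUnicode = parseBfChars(cmap);
console.log(decodeTwoByte(Uint8Array.of(0x00, 0x01, 0x00, 0x02), toUnicode)); // 日本
```

The codes 0001 and 0002 mean nothing on their own. Only the font's map ties them to U+65E5 and U+672C, which is why extraction without this step returns garbage for such fonts.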
Can it extract scanned PDFs?
No. Extraction reads the text layer of a PDF. A scanned document is an image with no text layer, which needs OCR first. For PDFs that contain real text, extraction is exact.
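In practice, a scanned PDF tends to show up as empty or near-empty extracted text, which gives a simple way to route documents to OCR. The check below is only a heuristic. `needsOcr` and its 20-character threshold are assumptions of this example, not part of rust-pdf.

```javascript
// Heuristic only: decide whether extracted text is too sparse to come from a
// real text layer. needsOcr and minChars are hypothetical, not rust-pdf API.
function needsOcr(extractedText, minChars = 20) {
  // Count characters that are not whitespace.
  const visible = extractedText.replace(/\s+/g, "").length;
  return visible < minChars;
}

console.log(needsOcr("")); // true
console.log(needsOcr("Invoice 2024-001, total due: 42 EUR")); // false
```

In a pipeline, pass the string returned by text extraction as the argument and send documents for which the check returns true to an OCR step.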
Is text extraction free?
Yes. Text and image extraction are part of the free tier in Node.js. No license token is required.
Extract Text From a PDF in Node.js
One Rust core, the same output across every language. Prototype for free, license the corporate features when you ship.