Extract Images from PDF Programmatically

A PDF can hold dozens of embedded raster images across its pages: product photos, diagrams, scanned pages, logos. Extracting them programmatically means walking the document structure, recovering each image in its original format, and writing the files without loss. This page explains how that works and shows the single call that does it in each supported language.

Like lifting photographs out of a printed magazine

A printed magazine is a finished artifact: text, layout and photographs fused together on paper. If you want the photographs on their own, you cannot simply copy them; you have to physically cut them out. A PDF works the same way at the file level. The images are embedded objects in the document structure, each stored with its own codec and color space, woven into the page content stream by reference.

Image extraction is the act of finding those embedded objects, recovering the original compressed data, and writing each one as a standalone image file. Done well, the originals come out intact: a JPEG that was embedded in the PDF comes back out as the same JPEG, not a re-compressed copy. Done carelessly, by decoding lossy data and re-encoding it, every extraction pass degrades the image. The difference matters for asset recovery, archiving and any pipeline that feeds the images into another system.

JPEG verbatim vs everything decoded to PNG

The right output format depends on how the image was stored in the PDF in the first place. Two clear rules keep the result lossless.

/Filter /DCTDecode

JPEG returned verbatim as .jpg

Zero generation loss

When an image is stored with /DCTDecode, the compressed JPEG bytes are sitting in the PDF unchanged. The extractor reads that stream body and writes it directly to a .jpg file without touching a single byte. No decode, no re-encode, no quality loss. The output is the same file that was embedded, down to the last bit. This is the only way to recover a JPEG without accumulating another round of lossy compression.
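The verbatim rule can be sketched in a few lines of Python. This is an illustration only, not the library's implementation: it assumes some PDF parser has already handed over the stream body and its /Filter name, and the helper name write_jpeg_verbatim is hypothetical.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # every JPEG stream starts with the start-of-image marker


def write_jpeg_verbatim(stream_body: bytes, filter_name: str, out_path: str) -> bool:
    """Write a /DCTDecode stream body to disk byte-for-byte.

    Hypothetical helper: `stream_body` and `filter_name` are assumed to come
    from a PDF parser. Returns False, and writes nothing, when the stream is
    not a plain JPEG.
    """
    if filter_name != "/DCTDecode":
        return False
    if not stream_body.startswith(JPEG_SOI):
        return False
    # No decode, no re-encode: the bytes on disk are the bytes in the PDF.
    Path(out_path).write_bytes(stream_body)
    return True
```

Because the function never decodes the data, the file on disk compares equal to the embedded stream, which is what "zero generation loss" means in practice.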

All other codecs

Everything else decoded to .png

Color space and alpha preserved

Flate-compressed (deflate/zlib), LZW and raw sample images are decoded to raw pixel data and re-encoded as PNG, a lossless format. Color space is honored throughout: DeviceGray stays gray, DeviceRGB stays RGB, DeviceCMYK is converted to RGB, and Indexed palettes and ICCBased spaces are supported. If the image has a separate /SMask soft-mask stream whose dimensions and bit depth match, it is merged as the alpha channel, producing a Gray+A or RGBA PNG. Transparency is not discarded.
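To make the decode-and-re-encode path concrete, here is a minimal sketch using only the Python standard library. It is not the library's code, and it covers only a narrow case that is assumed for illustration: 8 bits per component, DeviceGray or DeviceRGB, no /DecodeParms predictor, and an optional 8-bit gray soft mask of the same dimensions. The function name flate_image_to_png is hypothetical.

```python
import struct
import zlib


def _png_chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def flate_image_to_png(stream_body, width, height, channels, smask_body=None):
    """Decode a Flate image stream and return PNG bytes.

    Assumes 8 bits per component and no predictor. `channels` is 1
    (DeviceGray) or 3 (DeviceRGB). `smask_body` is an optional
    Flate-compressed 8-bit gray mask; when given, it becomes alpha.
    """
    if channels not in (1, 3):
        raise ValueError("this sketch handles DeviceGray and DeviceRGB only")
    pixels = zlib.decompress(stream_body)
    if len(pixels) != width * height * channels:
        raise ValueError("sample data does not match the declared dimensions")

    if smask_body is not None:
        alpha = zlib.decompress(smask_body)
        if len(alpha) != width * height:
            raise ValueError("soft mask does not match the image dimensions")
        merged = bytearray()
        for i in range(width * height):
            merged += pixels[i * channels:(i + 1) * channels]
            merged.append(alpha[i])  # interleave alpha after the color samples
        pixels = bytes(merged)
        channels += 1

    # PNG color types: 0 = gray, 2 = RGB, 4 = gray + alpha, 6 = RGBA.
    color_type = {1: 0, 3: 2, 2: 4, 4: 6}[channels]
    stride = width * channels
    # Each PNG scanline is prefixed with filter type 0 (None).
    raw = b"".join(
        b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height)
    )

    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(raw))
        + _png_chunk(b"IEND", b"")
    )
```

Both the inflate step and the PNG deflate step are lossless, so the pixel values in the output match the samples stored in the PDF. CMYK conversion, Indexed palettes, ICC profiles and other bit depths are deliberately left out of this sketch.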

What gets skipped, and why

Three legacy or specialized codecs cannot be cleanly decoded into PNG without a full codec implementation. Rather than producing corrupt or silently wrong output, the extractor skips these images entirely. The returned count reflects only the images that were successfully written.

JPEG 2000 — /JPXDecode

JPEG 2000 is a wavelet codec found in some scanned document workflows and in PDF/X files. Decoding it fully requires a JPEG 2000 library. These images are skipped rather than emitted corrupt.

CCITT Fax — /CCITTFaxDecode

CCITT Group 3 and Group 4 fax encoding is used almost exclusively for bilevel (black-and-white) scanned pages, particularly in older documents and fax-derived PDFs. These are skipped.

JBIG2 — /JBIG2Decode

JBIG2 is a bilevel compression standard common in high-compression scanned document archives. It uses a symbol dictionary and arithmetic coding that requires a dedicated decoder. These are skipped.
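The routing rules described so far (JPEG verbatim, three codecs skipped, everything else to PNG) amount to a small decision function. The sketch below is a simplified model rather than the library's code: it assumes each image carries a single filter name, whereas real PDFs can chain several filters, and the names output_extension and count_written are hypothetical.

```python
# Filters whose images are skipped, per the list above.
SKIPPED_FILTERS = {"/JPXDecode", "/CCITTFaxDecode", "/JBIG2Decode"}


def output_extension(filter_name):
    """Return "jpg", "png", or None (skip) for a PDF image filter.

    `filter_name` is None for raw, uncompressed sample data.
    """
    if filter_name in SKIPPED_FILTERS:
        return None
    if filter_name == "/DCTDecode":
        return "jpg"
    return "png"


def count_written(filter_names):
    """Count the images that would be written, ignoring skipped ones."""
    return sum(1 for name in filter_names if output_extension(name) is not None)
```

For example, a document holding one /DCTDecode image, one /FlateDecode image and one /JBIG2Decode image would yield a count of 2 under this model, matching the rule that skipped images are not counted.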

When you need to extract images from a PDF

The same one-call API covers a wide range of real workflows, from archiving to machine-learning pipelines.

Asset recovery

Retrieve product images, diagrams or photos that were placed in a PDF report years ago and are no longer available in the original source files.

Content migration

Move image content from PDF reports into a CMS, digital asset management system or image store without re-sourcing every file.

Building thumbnails

Extract the first meaningful image on each page to generate a visual preview without rendering the full page through a PDF renderer.

Feeding vision models

Supply individual images to a vision model or OCR pipeline by page, with clean per-image files rather than whole-page renders.

Document auditing

Inspect what images a document contains, including embedded images not visible in the rendered page, for compliance or quality checks.

Print production

Recover high-resolution images from a supplied PDF file so they can be placed back into a layout at original quality.

How to extract images from a PDF with rust-pdf

Extract every raster image into a directory; JPEGs come out untouched as .jpg, the rest as .png. Files are named page{N}_{name}.{ext}.

Python

# pip install rustpdf
import rustpdf

with open("document.pdf", "rb") as f:
    n = rustpdf.extract_images_to_dir(f.read(), "images/")
print(f"{n} images written to images/")

C#

// dotnet add package RustPdf
using RustPdf;

int n = Pdf.ExtractImagesToDir(File.ReadAllBytes("document.pdf"), "images");
Console.WriteLine($"{n} images written");

Go

// go get github.com/rustpdf/rustpdf-go@latest
data, _ := os.ReadFile("document.pdf")
n, _ := rustpdf.ExtractImagesToDir(data, "images")
fmt.Printf("%d images written\n", n)

Node.js

// npm install rustpdf
const { extractImagesToDir } = require("rustpdf");
const fs = require("fs");

const n = extractImagesToDir(fs.readFileSync("document.pdf"), "images");
console.log(`${n} images written`);

The library is available in nine languages: Python, C#/.NET, Go, Node.js, Java, PHP, Ruby, Delphi and Swift. One Rust core, thin native bindings. See the full documentation for each language.
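Once the files are on disk, the page{N}_{name}.{ext} naming pattern makes it easy to group them by page with nothing but the standard library. The sketch below assumes that {N} is a decimal page number and that the extension is always jpg or png; the helper name images_by_page is hypothetical and not part of the library.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches the documented pattern page{N}_{name}.{ext}, assuming a decimal N.
FILENAME_RE = re.compile(r"^page(\d+)_(.+)\.(jpg|png)$")


def images_by_page(out_dir):
    """Map page number -> sorted list of image file names in `out_dir`."""
    pages = defaultdict(list)
    for path in Path(out_dir).iterdir():
        match = FILENAME_RE.match(path.name)
        if path.is_file() and match:
            pages[int(match.group(1))].append(path.name)
    return {page: sorted(names) for page, names in sorted(pages.items())}
```

Files that do not match the pattern are ignored, so the function can be pointed at a directory that also holds other output.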

PDF image extraction FAQ

How do I extract images from a PDF?

Call extract_images_to_dir (Python), Pdf.ExtractImagesToDir (C#), rustpdf.ExtractImagesToDir (Go), or extractImagesToDir (Node) with the PDF bytes and an output directory path. The library walks every page's /Resources /XObject tree, including nested Form XObjects, and writes each raster image into the directory. JPEGs come out as .jpg files, everything else as .png.
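The walk over /Resources /XObject, including nested Form XObjects, is a short recursion. The sketch below models PDF objects as plain Python dicts, which is an assumed shape for illustration only; a real parser would also have to resolve indirect references. The function name collect_images is hypothetical.

```python
def collect_images(resources, _seen=None):
    """Return (name, xobject) pairs for every Image XObject reachable
    from a /Resources dictionary, descending into Form XObjects.

    PDF objects are modeled as plain dicts here (a hypothetical shape).
    """
    if _seen is None:
        _seen = set()
    found = []
    for name, xobj in resources.get("/XObject", {}).items():
        if id(xobj) in _seen:  # guard against cycles and repeated objects
            continue
        _seen.add(id(xobj))
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            found.append((name, xobj))
        elif subtype == "/Form":
            # Form XObjects carry their own /Resources; recurse into them.
            found.extend(collect_images(xobj.get("/Resources", {}), _seen))
    return found
```

The visited set matters because a Form XObject can, directly or indirectly, reference itself; without the guard the recursion would not terminate on such files.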

Are JPEGs re-encoded (lossy)?

No. JPEG images (stored with /Filter /DCTDecode in the PDF) are returned verbatim, byte-for-byte, without decoding or re-encoding. The output .jpg file is the exact same compressed data that was embedded in the PDF, so there is zero generation loss.

What formats are output?

Two formats: .jpg for images stored as JPEG (/DCTDecode), and .png for everything else. PNG output honors the image's color space (DeviceGray, DeviceRGB, DeviceCMYK converted to RGB, Indexed, or ICCBased), so the colors you see match what was in the PDF.

Are transparency and alpha masks preserved?

Yes. When a PDF image has a separate /SMask (soft mask) stream and the dimensions and bit depth match, the mask is merged into the output as the alpha channel, producing a Gray+Alpha or RGBA PNG. The transparency information is not discarded.

Which images cannot be extracted?

Images stored with JPEG 2000 (/JPXDecode), CCITT fax (/CCITTFaxDecode), or JBIG2 (/JBIG2Decode) compression cannot be cleanly converted to PNG, so they are skipped rather than emitted corrupt. The function returns a count of the images it did write; skipped images are not included.

Extract PDF images in your language

One Rust core, the same zero-loss image extraction across nine languages. Prototype for free, then license the corporate features when you ship.