PDF-Table

PDF-table is a Java utility library for parsing tabular data in PDF documents. Core processing of PDF documents is performed with Apache PDFBox and OpenCV.

License	MIT License
Categories	PDF Data
GroupId	com.github.rostrovsky
ArtifactId	pdf-table
Last Version	1.0.0
Release Date	03-May-2020
Type	jar
Project URL	https://github.com/rostrovsky/pdf-table
Source Code Management	https://github.com/rostrovsky/pdf-table

Download pdf-table

Filename	Size
pdf-table-1.0.0.pom
pdf-table-1.0.0.jar	14 KB
pdf-table-1.0.0-sources.jar	6 KB
pdf-table-1.0.0-javadoc.jar	407 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.rostrovsky/pdf-table/ -->
<dependency>
    <groupId>com.github.rostrovsky</groupId>
    <artifactId>pdf-table</artifactId>
    <version>1.0.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.rostrovsky/pdf-table/
implementation 'com.github.rostrovsky:pdf-table:1.0.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.rostrovsky/pdf-table/
implementation ("com.github.rostrovsky:pdf-table:1.0.0")

Apache Buildr

'com.github.rostrovsky:pdf-table:jar:1.0.0'

Apache Ivy

<dependency org="com.github.rostrovsky" name="pdf-table" rev="1.0.0">
  <artifact name="pdf-table" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.rostrovsky', module='pdf-table', version='1.0.0')
)

Scala SBT

libraryDependencies += "com.github.rostrovsky" % "pdf-table" % "1.0.0"

Leiningen

[com.github.rostrovsky/pdf-table "1.0.0"]

Dependencies

runtime (4)

Group / Artifact	Type	Version
org.apache.pdfbox : pdfbox	jar	2.0.19
org.apache.pdfbox : pdfbox-tools	jar	2.0.19
org.apache.commons : commons-lang3	jar	3.5
org.openpnp : opencv	jar	3.4.2-2

test (1)

Group / Artifact	Type	Version
org.testng : testng	jar	7.1.0

Project Modules

There are no modules declared in this project.

PDF-table

What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed with Apache PDFBox and OpenCV.

Prerequisites

JDK

Java 8 is required.

External dependencies

pdf-table requires a compiled build of OpenCV 3.4.2 to work properly:

- Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
- Unpack it and add the directory with the Java bindings to your system PATH:
  - Windows: <opencv dir>\build\java\x64
  - Linux: TODO
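
If the native library cannot be loaded at runtime, it can help to check whether the PATH entry actually contains it. The following is a minimal diagnostic sketch that uses only the JDK. The base name `opencv_java342` is an assumption based on how the OpenCV 3.4.2 Java bindings are usually named, and `NativeLibraryLocator` is a hypothetical helper, not part of pdf-table.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for a native library file in the directories of a PATH-style string. */
class NativeLibraryLocator {

    /** Returns every existing file named fileName found in the directories listed in pathValue. */
    static List<File> findOnPath(String pathValue, String fileName) {
        List<File> hits = new ArrayList<>();
        if (pathValue == null) {
            return hits;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // assumption: OpenCV 3.4.2 ships its Java native library under this base name
        String fileName = System.mapLibraryName("opencv_java342");
        List<File> hits = findOnPath(System.getenv("PATH"), fileName);
        if (hits.isEmpty()) {
            System.out.println(fileName + " was not found in any PATH directory");
        } else {
            hits.forEach(f -> System.out.println("found " + f.getAbsolutePath()));
        }
    }
}
```

`System.mapLibraryName` adds the platform-specific prefix and suffix, so the same check looks for a `.dll` on Windows and a `lib*.so` file on Linux.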

Installation

<dependency>
  <groupId>com.github.rostrovsky</groupId>
  <artifactId>pdf-table</artifactId>
  <version>1.0.0</version>
</dependency>

Usage

Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

1. The page is converted to a grayscale image [OpenCV].
2. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
3. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
4. The contour mask is XORed with the BIT image [OpenCV].
5. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
6. The final contours are drawn [OpenCV].
7. Bounding rectangles are detected from the final contours [OpenCV].
8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The algorithm above is mostly derived from http://stackoverflow.com/a/23106594.
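
To make the threshold and XOR steps more concrete, here is a toy sketch that applies them to a 3x3 grayscale matrix in plain Java, without OpenCV. It is not the library's implementation: `ThresholdXorSketch` is a hypothetical class, and the fully filled contour mask is an assumption chosen purely for illustration.

```java
import java.util.Arrays;

/** Toy, OpenCV-free illustration of the threshold and XOR steps on a tiny grayscale matrix. */
class ThresholdXorSketch {

    /** Binary inverted threshold: values above the threshold become 0, all others 255. */
    static int[][] binaryInvertedThreshold(int[][] gray, int threshold) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                out[y][x] = gray[y][x] > threshold ? 0 : 255;
            }
        }
        return out;
    }

    /** Pixel-wise XOR of two equally sized binary images (values 0 or 255). */
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new int[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] ^ b[y][x];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // a single table cell: dark border (0) around a light interior (255)
        int[][] gray = {
            {0, 0, 0},
            {0, 255, 0},
            {0, 0, 0}
        };
        int[][] bit = binaryInvertedThreshold(gray, 127);
        // hypothetical mask: the outer contour drawn as a filled region
        int[][] mask = {
            {255, 255, 255},
            {255, 255, 255},
            {255, 255, 255}
        };
        int[][] cells = xor(mask, bit);
        for (int[] row : cells) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

After the XOR, the border pixels are 0 and the cell interior is 255, so a second contour pass on this image would outline the cell rather than the table lines.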

For more information about the parsed output, refer to Output format.

single-threaded example

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();
        List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
    }
}

multi-threaded example

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> {
                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                return page;
            };
            futures.add(executor.submit(callable));
        }

        // collect parsed pages
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                ParsedTablePage page = f.get();
                unsortedParsedPages.add(page.getPageNum() - 1, page);
            }
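        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the pool threads; otherwise they keep the JVM from exiting
            executor.shutdown();
        }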

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
    }
}
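
The collection step relies on a property of `Future` lists that is easy to miss: when futures are read in the order they were submitted, the results come back in that order regardless of which task finished first. The sketch below demonstrates this with the JDK alone. `OrderedFuturesSketch` and `fakeParse` are hypothetical stand-ins for the reader call and are not part of pdf-table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Shows that reading futures in submission order yields results in page order. */
class OrderedFuturesSketch {

    /** Hypothetical stand-in for a page-parsing call. */
    static String fakeParse(int pageNum) {
        return "page-" + pageNum;
    }

    static List<String> parseAll(int pageCount, int threadCount) {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            // submit one task per page; page numbers start at 1
            List<Future<String>> futures = new ArrayList<>();
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int n = pageNum;
                Callable<String> task = () -> fakeParse(n);
                futures.add(executor.submit(task));
            }
            // get() blocks until each task is done, so the list fills in submission order
            List<String> results = new ArrayList<>(pageCount);
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // always release the pool threads
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(5, 3));
    }
}
```

Under this pattern an explicit sort is a safeguard rather than a requirement, which is worth knowing if only a sub-range of pages is parsed.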

Saving PDF pages as PNG images

PDF-Table provides methods for saving PDF pages as PNG images.
Rendering DPI can be modified in PdfTableSettings (see: Parsing settings).

single-threaded example

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
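        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the pool threads; otherwise they keep the JVM from exiting
            executor.shutdown();
        }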
    }
}

Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, you can save debug images that show the page at various stages of processing.
Using these images, you can adjust PdfTableSettings to achieve the desired results (see: Parsing settings).

single-threaded example

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
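        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the pool threads; otherwise they keep the JVM from exiting
            executor.shutdown();
        }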
    }
}

Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
                .setCannyFiltering(true)
                .setCannyApertureSize(5)
                .setCannyThreshold1(40)
                .setCannyThreshold2(190.5)
                .setPdfRenderingDpi(160)
                .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
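
When tuning the rendering DPI, it helps to know how it changes image size. PDF page dimensions are expressed in points (72 per inch by default), so a page rendered at a given DPI is scaled by dpi / 72. The sketch below shows that arithmetic only. `DpiScalingSketch` is a hypothetical helper, and whether pdf-table converts coordinates in exactly this way internally is an assumption.

```java
/** Converts between PDF points (72 per inch) and pixels at a given rendering DPI. */
class DpiScalingSketch {

    static final double POINTS_PER_INCH = 72.0;

    /** Size in pixels of a length given in PDF points when rendered at the given DPI. */
    static double pointsToPixels(double points, double dpi) {
        return points * dpi / POINTS_PER_INCH;
    }

    /** Inverse mapping: a pixel length in the rendered image expressed in PDF points. */
    static double pixelsToPoints(double pixels, double dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // a US Letter page is 612 points wide; at 144 DPI it renders 1224 pixels wide
        System.out.println(pointsToPixels(612, 144));
        // a 320-pixel-wide rectangle found at 160 DPI spans 144 points on the page
        System.out.println(pixelsToPoints(320, 160));
    }
}
```

A higher DPI gives the contour detection more pixels per table line but also increases rendering time and memory use, so it is worth raising it only when thin lines are being missed.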

Output format

Each parsed PDF page is returned as a ParsedTablePage object:

(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// the first page in the document has index 1, not 0
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains <CR><LF> characters,
// so it is recommended to trim it before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
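
Not every cell holds a number, so a direct conversion can throw NumberFormatException on headers or empty cells. The following is a small sketch of a more defensive conversion that uses only the JDK. `CellValueSketch` and `parseCell` are hypothetical names and are not part of the pdf-table API.

```java
import java.util.OptionalDouble;

/** Hypothetical helper for turning raw cell text into a number without throwing. */
class CellValueSketch {

    /** Trims surrounding whitespace (including CR/LF) and parses the rest as a double, if possible. */
    static OptionalDouble parseCell(String raw) {
        if (raw == null) {
            return OptionalDouble.empty();
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            // non-numeric content such as a header label
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCell("12.5\r\n"));
        System.out.println(parseCell("Total\r\n"));
    }
}
```

Note that `Double.parseDouble` expects a dot as the decimal separator, so locale-specific formats such as "12,5" would need separate handling.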

Versions

Version
1.0.0 03-May-2020
