PDF Extractor
Powered by BeeHyv Software Solutions Pvt Ltd and distributed under the Apache 2.0 License
Overview
Extracting meaningful data from documents is a long-standing problem, and the many attempts made to date have had only partial success. The goal of this project is to extract data and metadata in a structured manner from any given PDF document.
Features
- Table of contents: A TOC generally provides an overview of the content within the document. A PDF may or may not have one. For a PDF that lacks a TOC, this code builds one using a heuristic-based approach.
- Text: Entire text, or text from a particular page.
- Sections: Splitting PDF content (text, images, tables) into sections can help to extract more relevant content.
- Font information: Color, font type, font weight, font size, etc.
- Tables: Table heading, rows, cells.
- Images: Image files, text inside images.
- Metadata of PDF: Author info, creation date, size, etc.
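This README does not describe the heuristic used to build a table of contents, so the following is only a rough illustration of one common idea, not the project's implementation: treat the most frequent font size as body text and flag larger lines as heading candidates. The `HeadingHeuristic` class and its `Line` type are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names come from pdf-extractor.
class HeadingHeuristic {

    // Hypothetical input: one line of text plus its font size in points.
    static class Line {
        final String text;
        final float fontSize;

        Line(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Assumption: the most frequent font size belongs to the body text.
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (Line line : lines) {
            int count = counts.merge(line.fontSize, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = line.fontSize;
            }
        }
        return best;
    }

    // Lines set in a larger size than the body text become heading candidates.
    static List<String> headingCandidates(List<Line> lines) {
        float body = bodyFontSize(lines);
        List<String> headings = new ArrayList<>();
        for (Line line : lines) {
            if (line.fontSize > body) {
                headings.add(line.text);
            }
        }
        return headings;
    }
}
```

A real heuristic would likely combine several signals (font weight, numbering patterns, position on the page); font size alone is used here to keep the sketch short.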
Technologies Used
- This library uses the Apache PDFBox® library's PDF content stream engine to stream the PDF file.
- Tabula 1.2.1 (an open-source library) is used for table extraction.
Installation Instructions
Pre-requisites
- Java (>1.6)
- Maven
Installation and Setup
From Source
- Clone the project
- Add an environment variable for the tabula jar (used for table extraction and unit tests)
TABULA_JAR_LOCATION={Project-dir}/lib/tabula/tabula-0.9.1-jar-with-dependencies.jar
- Run
mvn clean install
to install it in your local environment. This might take some time (~15 mins), as there are ~400 unit tests within the project. To skip the tests, run with -DskipTests
Import the pdf-extractor dependency to your project
- Adding the Maven dependency
<dependency>
<groupId>com.beehyv</groupId>
<artifactId>pdf-extractor</artifactId>
<version>1.0</version>
</dependency>
- Adding the jar to the classpath
The pdf-extractor.jar file for the project can be found under {Project-dir}/target
Run Extraction
- Create a document object
HolmesPdfDocument pdfDocument = new HolmesPdfDocument(file);
- Create an extractor instance
PdfBoxExtractor pdfBoxExtractor = new PdfBoxExtractor();
- Extract text
String text = pdfBoxExtractor.getText(pdfDocument, startPage, endPage);
- Extract images
pdfBoxExtractor.getImages(pdfDocument, startPage, endPage);
- Extract tables
pdfBoxExtractor.getTabularData(pdfDocument, startPage, endPage);
- Extract Structured Text
This feature extracts data in a structured manner. The data is extracted in sections, with the section hierarchy kept intact. All text, images, tables, and paragraphs are assigned to their respective sections, which gives the extracted data a structure and makes it more meaningful.
All this information resides in an InfoNode:
model.InfoNode infoNode = pdfBoxExtractor.getStructuredText(pdfDocument);
- Sections
infoNode.getSections()
- Paragraphs
infoNode.getParagraphs()
- Content
infoNode.getContent()
- Section Heading
infoNode.getHeading()
- Section Images
infoNode.getImageSections()
- Lines
infoNode.getContentLineObjects()
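Because sections keep their hierarchy, a structured result can be walked recursively. The sketch below shows that traversal on a hypothetical stand-in class, `SectionOutline.Node`, which mirrors only the `getHeading()` and `getSections()` accessors listed above. The return types (`String` and `List`) are assumptions, since this README does not state them, and the real `InfoNode` may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Node is a stand-in, not the library's InfoNode.
class SectionOutline {

    static class Node {
        private final String heading;
        private final List<Node> sections = new ArrayList<>();

        Node(String heading) {
            this.heading = heading;
        }

        // Adds a child section and returns this node to allow chaining.
        Node add(Node child) {
            sections.add(child);
            return this;
        }

        String getHeading() {
            return heading;
        }

        List<Node> getSections() {
            return sections;
        }
    }

    // Depth-first walk: one output line per section, indented by nesting depth.
    static List<String> outline(Node root) {
        List<String> lines = new ArrayList<>();
        collect(root, 0, lines);
        return lines;
    }

    private static void collect(Node node, int depth, List<String> out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        out.add(indent + node.getHeading());
        for (Node child : node.getSections()) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Report")
            .add(new Node("Introduction"))
            .add(new Node("Results").add(new Node("Tables")));
        for (String line : outline(root)) {
            System.out.println(line);
        }
    }
}
```

The same recursive pattern would apply to paragraphs, images, or lines: read the data attached to the current section, then descend into its sub-sections.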
Feature Request
For new feature requests, please use the GitHub Issues page to raise tickets for bugs as well as enhancements. The community can then take up the functionality as needed.