lucene-pdf

A library enabling easy Lucene indexing of PDF text and metadata via integration with PDFxStream

License	MIT
Categories	IDE Development Tools PDF Data
GroupId	com.snowtide
ArtifactId	lucene-pdf
Last Version	3.0.0
Release Date	25-Nov-2014
Type	jar
Description	A library enabling easy Lucene indexing of PDF text and metadata via integration with PDFxStream
Project URL	http://github.com/snowtide/lucene-pdf
Source Code Management	https://github.com/snowtide/lucene-pdf

Download lucene-pdf

Filename	Size
lucene-pdf-3.0.0.pom
lucene-pdf-3.0.0.jar	15 KB
lucene-pdf-3.0.0-sources.jar	15 KB
lucene-pdf-3.0.0-javadoc.jar	7 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.snowtide/lucene-pdf/ -->
<dependency>
    <groupId>com.snowtide</groupId>
    <artifactId>lucene-pdf</artifactId>
    <version>3.0.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.snowtide/lucene-pdf/
implementation 'com.snowtide:lucene-pdf:3.0.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.snowtide/lucene-pdf/
implementation ("com.snowtide:lucene-pdf:3.0.0")

Apache Buildr

'com.snowtide:lucene-pdf:jar:3.0.0'

Apache Ivy

<dependency org="com.snowtide" name="lucene-pdf" rev="3.0.0">
  <artifact name="lucene-pdf" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.snowtide', module='lucene-pdf', version='3.0.0')
)

Scala SBT

libraryDependencies += "com.snowtide" % "lucene-pdf" % "3.0.0"

Leiningen

[com.snowtide/lucene-pdf "3.0.0"]

Dependencies

compile (2)

Group / Artifact	Type	Version
com.snowtide : pdfxstream	jar	3.1.1
org.apache.lucene : lucene-core	jar	1.9.1

test (3)

Group / Artifact	Type	Version
org.clojure : clojure	jar	1.6.0
org.clojure : tools.nrepl	jar	0.2.3
clojure-complete : clojure-complete	jar	0.2.3

Project Modules

There are no modules declared in this project.

lucene-pdf

lucene-pdf is a JVM (Java, Scala, Groovy, Clojure, etc.) library enabling easy Lucene indexing of PDF text and metadata via integration with PDFxStream.

Installation

lucene-pdf is available in Maven central; add it to your Maven project's pom.xml:

<dependency>
  <groupId>com.snowtide</groupId>
  <artifactId>lucene-pdf</artifactId>
  <version>3.0.0</version>
</dependency>

Or add the same Maven artifact coordinates to your Gradle, Leiningen, sbt, or other build tool's project file.

lucene-pdf is suitable for use with JDK 1.5+, and is tested against the latest release of each major revision of Lucene core (1.x, 2.x, 3.x, and 4.x). See the project file for the exact versions used under test.

lucene-pdf is suitable for many typical Lucene PDF indexing jobs, but it may not meet every requirement of your project (e.g. taking advantage of some of the more esoteric document indexing parameters available in more recent versions of Lucene). In that case, its source can serve as a useful starting point, showing how PDF data can be extracted using PDFxStream and turned into Lucene Documents; feel free to import it into your projects and modify it as needed.

Documentation

A detailed tutorial is available: Indexing PDF Documents with Lucene and PDFxStream
Javadocs are available at http://snowtide.github.io/lucene-pdf

Example usage

Given a PDF file stored on disk at /tmp/foo.pdf, this Java code will use lucene-pdf to construct a Lucene org.apache.lucene.document.Document populated with fields corresponding to the PDF's main body text and metadata attributes:

import java.io.File;

import com.snowtide.PDF;
import com.snowtide.pdf.lucene.LucenePDFDocumentFactory;
import org.apache.lucene.document.Document;

// ....

// Open the PDF with PDFxStream, build the Lucene document, then release the PDF
com.snowtide.pdf.Document pdf = PDF.open(new File("/tmp/foo.pdf"));
Document luceneDocument = LucenePDFDocumentFactory.buildPDFDocument(pdf);
pdf.close();

luceneDocument can then be added to a Lucene index.

This is the simplest sample possible, but it uses a default configuration to name the fields in the created Lucene document. You will likely want to provide your own names for:

the field containing the source PDF document's main body text
fields corresponding to various PDF document metadata attributes

This Java code does just that, using a LucenePDFConfiguration object to control the mapping:

import java.io.File;

import com.snowtide.PDF;
import com.snowtide.pdf.lucene.LucenePDFDocumentFactory;
import com.snowtide.pdf.lucene.LucenePDFConfiguration;
import org.apache.lucene.document.Document;

// ....

File f = new File("/tmp/foo.pdf");

// Name the body text field and map PDF metadata attributes to Lucene field names
LucenePDFConfiguration config = new LucenePDFConfiguration();
config.setBodyTextFieldName("mainText");
config.setMetadataFieldMapping("Author", "document_author");
config.setMetadataFieldMapping("Title", "document_title");

com.snowtide.pdf.Document pdf = PDF.open(f);
Document luceneDocument = LucenePDFDocumentFactory.buildPDFDocument(pdf, config);
pdf.close();

LucenePDFConfiguration provides a number of additional ways to control how Lucene fields and documents are created, including setting storage, tokenization, and indexing/analysis flags. See Indexing PDF Documents with Lucene and PDFxStream and the lucene-pdf javadoc for details.
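
To make the idea of a metadata field mapping concrete, here is a minimal plain-Java sketch of what renaming PDF metadata attributes to index field names amounts to. It does not use lucene-pdf or Lucene at all: the class MetadataMappingSketch, the method mapFields, and the sample values are hypothetical names invented for this illustration, and the choice to keep unmapped attributes under their original name is an assumption of the sketch, not a statement about how lucene-pdf behaves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataMappingSketch {

    // Hypothetical helper (not part of lucene-pdf): renames PDF metadata
    // attributes to index field names. Attributes without an explicit
    // mapping keep their original name; that fallback is an assumption
    // made for this sketch only.
    static Map<String, String> mapFields(Map<String, String> pdfMetadata,
                                         Map<String, String> fieldNames) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> attr : pdfMetadata.entrySet()) {
            String fieldName = fieldNames.get(attr.getKey());
            if (fieldName == null) {
                fieldName = attr.getKey();
            }
            fields.put(fieldName, attr.getValue());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up metadata values standing in for what a PDF might carry
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("Author", "Jane Doe");
        metadata.put("Title", "Annual Report");
        metadata.put("Producer", "ExampleWriter");

        // The same attribute-to-field pairs used in the configuration example
        Map<String, String> names = new LinkedHashMap<String, String>();
        names.put("Author", "document_author");
        names.put("Title", "document_title");

        System.out.println(mapFields(metadata, names));
    }
}
```

In the real library the equivalent pairs are supplied through setMetadataFieldMapping, and the resulting fields are created on the Lucene Document rather than collected in a map.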

License

Distributed under the terms of the MIT License.

Snowtide

Snowtide, makers of PDFxStream for Java and .NET

Versions

Version
3.0.0 25-Nov-2014
