SCIE PDF Text Extractor

This is an optimized version of Apache PDFBox. It extracts the rough structure of a document (pages, blocks of text, and paragraphs, as well as formatting information) and is intended to improve text extraction results for scientific papers. The output can easily be converted to plain text (toString) or to an XML format (toXML).
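To make the described structure concrete, here is a minimal, self-contained sketch of a page / block / paragraph hierarchy with a plain-text and an XML rendering. Every class and method name in it (DocumentSketch, Document, Page, Block, Paragraph, escape) is hypothetical and chosen for illustration only. It is not the API shipped in pdf-extractor, whose actual classes are documented in the javadoc jar listed below, and it leaves out formatting information entirely.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: these are NOT the classes of pdf-extractor.
class DocumentSketch {

    /** Smallest unit in this sketch: a run of text. */
    static final class Paragraph {
        final String text;
        Paragraph(String text) { this.text = text; }
    }

    /** A block groups consecutive paragraphs. */
    static final class Block {
        final List<Paragraph> paragraphs;
        Block(Paragraph... paragraphs) { this.paragraphs = Arrays.asList(paragraphs); }
    }

    /** A page groups blocks. */
    static final class Page {
        final List<Block> blocks;
        Page(Block... blocks) { this.blocks = Arrays.asList(blocks); }
    }

    /** The whole document: a list of pages with two renderings. */
    static final class Document {
        final List<Page> pages;
        Document(Page... pages) { this.pages = Arrays.asList(pages); }

        /** Plain text: one paragraph per line, in reading order. */
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            boolean first = true;
            for (Page page : pages) {
                for (Block block : page.blocks) {
                    for (Paragraph p : block.paragraphs) {
                        if (!first) sb.append('\n');
                        sb.append(p.text);
                        first = false;
                    }
                }
            }
            return sb.toString();
        }

        /** XML: nested elements that mirror the hierarchy. */
        String toXML() {
            StringBuilder sb = new StringBuilder("<document>");
            for (Page page : pages) {
                sb.append("<page>");
                for (Block block : page.blocks) {
                    sb.append("<block>");
                    for (Paragraph p : block.paragraphs) {
                        sb.append("<paragraph>").append(escape(p.text)).append("</paragraph>");
                    }
                    sb.append("</block>");
                }
                sb.append("</page>");
            }
            return sb.append("</document>").toString();
        }
    }

    /** Escapes the characters that are unsafe in XML text; '&' must go first. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Document doc = new Document(
            new Page(new Block(new Paragraph("Title"), new Paragraph("a < b & c"))));
        System.out.println(doc);         // plain-text rendering
        System.out.println(doc.toXML()); // XML rendering
    }
}
```

Keeping the hierarchy as plain nested lists means both renderings are simple traversals, which is one plausible reason a structure-preserving extractor can offer both output formats cheaply.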

License	The GNU Affero General Public License, Version 3
Categories	PDF Data
GroupId	de.cit-ec.scie
ArtifactId	pdf-extractor
Last Version	2.0.1
Release Date	10-Dec-2014
Type	jar
Project URL	http://openresearch.cit-ec.de/projects/scie/
Source Code Management	https://opensource.cit-ec.de/projects/scie/repository/revisions/master/show/modules/pdf-extractor

Download pdf-extractor

Filename	Size
pdf-extractor-2.0.1.pom
pdf-extractor-2.0.1.jar	33 KB
pdf-extractor-2.0.1-sources.jar	32 KB
pdf-extractor-2.0.1-javadoc.jar	127 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/de.cit-ec.scie/pdf-extractor/ -->
<dependency>
    <groupId>de.cit-ec.scie</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>2.0.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/de.cit-ec.scie/pdf-extractor/
implementation 'de.cit-ec.scie:pdf-extractor:2.0.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/de.cit-ec.scie/pdf-extractor/
implementation("de.cit-ec.scie:pdf-extractor:2.0.1")

Apache Buildr

'de.cit-ec.scie:pdf-extractor:jar:2.0.1'

Apache Ivy

<dependency org="de.cit-ec.scie" name="pdf-extractor" rev="2.0.1">
  <artifact name="pdf-extractor" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='de.cit-ec.scie', module='pdf-extractor', version='2.0.1')
)

Scala SBT

libraryDependencies += "de.cit-ec.scie" % "pdf-extractor" % "2.0.1"

Leiningen

[de.cit-ec.scie/pdf-extractor "2.0.1"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.apache.pdfbox : pdfbox	jar	1.8.2

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.11

Project Modules

There are no modules declared in this project.

Versions

Version	Release Date
2.0.1	10-Dec-2014
2.0	18-Nov-2014
