essence
An automatic web page content extractor for Kotlin and Java.
Given an HTML document, essence automatically extracts the main text content (and much more).
Try out the demo - a simple webapp to demonstrate essence.
This library is inspired by node-unfluff and its lineage
Usage
Java
import io.github.cdimascio.essence.Essence;
EssenceResult data = Essence.extract(html);
System.out.println(data.getText()); 
Kotlin
val data = Essence.extract(html)
println(data.text) 
See Extracted data elements for additional extracted metadata.
Install
Maven
<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
  <type>pom</type>
</dependency> 
Gradle
compile 'io.github.cdimascio:essence:0.13.0' 
Try the Essence web demo
Essence web is a simple web page that fetches content at a given url and passes the HTML to this essence library.
The essence web project lives here
Extracted data elements
essence attempts to extract the following content:
- title- The document's title
- softTitle- A version of- titlewith less truncation
- date- The document's publication date
- copyright- The document's copyright line, if present
- author- The document's author
- publisher- The document's publisher (website name)
- text- The main text of the document with all the junk thrown away
- image- The main image for the document (what's used by facebook, etc.)
- (coming soon...)videos- An array of videos that were embedded in the article. Each video has src, width and height.
- tags- Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
- canonicalLink- The canonical url of the document, if given.
- lang- The language of the document, either detected or supplied by you.
- description- The description of the document, from <meta> tags
- favicon- The url of the document's favicon.
- links- An array of links embedded within the article text. (text and href for each)
Credits
- node-unfluff by https://github.com/ageitgey
- python-goose by Xavier Grangier
- goose by Gravity Labs
License
Contributors 
   ✨ 
  
 
Thanks goes to these wonderful people (emoji key):
| Clément P. | 
This project follows the all-contributors specification. Contributions of any kind welcome!
 JarCasting
 JarCasting

