logtrix

Parses and summarises Heritrix crawl logs

Categories: Net
GroupId: org.netpreserve
ArtifactId: logtrix
Last Version: 0.1.0
Type: jar
Description: Parses and summarises Heritrix crawl logs
Project URL: https://github.com/iipc/logtrix
Source Code Management: https://github.com/iipc/logtrix

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/org.netpreserve/logtrix/ -->
<dependency>
    <groupId>org.netpreserve</groupId>
    <artifactId>logtrix</artifactId>
    <version>0.1.0</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/org.netpreserve/logtrix/
implementation 'org.netpreserve:logtrix:0.1.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/org.netpreserve/logtrix/
implementation ("org.netpreserve:logtrix:0.1.0")

Buildr:

'org.netpreserve:logtrix:jar:0.1.0'

Ivy:

<dependency org="org.netpreserve" name="logtrix" rev="0.1.0">
  <artifact name="logtrix" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='org.netpreserve', module='logtrix', version='0.1.0')
)

SBT:

libraryDependencies += "org.netpreserve" % "logtrix" % "0.1.0"

Leiningen:

[org.netpreserve/logtrix "0.1.0"]

Dependencies

compile (4)

Group / Artifact Type Version
org.slf4j : slf4j-api jar 1.7.25
com.fasterxml.jackson.core : jackson-databind jar 2.9.8
com.fasterxml.jackson.datatype : jackson-datatype-jsr310 jar 2.9.8
com.google.guava : guava jar 27.1-jre

test (2)

Group / Artifact Type Version
junit : junit jar 4.12
org.slf4j : slf4j-simple jar 1.7.25

Project Modules

There are no modules declared in this project.

logtrix

Examples

Parsing a log file

try (CrawlLogIterator log = new CrawlLogIterator(Paths.get("crawl.log"))) {
    for (CrawlDataItem line : log) {
        System.out.println(line.getStatusCode());
        System.out.println(line.getURL());
    }
}
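
For readers who want to see what a crawl log record looks like without the library, the following is a minimal sketch that splits one line by hand. It assumes the usual Heritrix crawl.log layout of twelve whitespace-separated columns (timestamp, status code, size, URI, discovery path, referrer, MIME type, worker thread, fetch timestamp and duration, digest, source tag, annotations). The class name CrawlLogLineSketch and the sample line are hypothetical, and this is not a description of how logtrix itself parses logs.

```java
// Hypothetical helper, not part of logtrix: reads two fields from one crawl.log line.
class CrawlLogLineSketch {

    /** Splits a line into at most 12 columns (assumed Heritrix crawl.log layout). */
    static String[] fields(String line) {
        // Heritrix pads columns with runs of spaces, so split on any whitespace.
        // The limit of 12 keeps everything after the 11th column in one piece.
        return line.trim().split("\\s+", 12);
    }

    /** Column 2: the fetch status code (negative values are Heritrix-specific). */
    static int statusCode(String line) {
        return Integer.parseInt(fields(line)[1]);
    }

    /** Column 4: the URI that was fetched. */
    static String url(String line) {
        return fields(line)[3];
    }

    public static void main(String[] args) {
        // Made-up sample record for illustration only.
        String sample = "2019-03-01T10:15:30.123Z   200       5120 http://example.org/ "
                + "- - text/html #001 20190301101530000+42 sha1:EXAMPLEDIGEST - -";
        System.out.println(statusCode(sample)); // prints 200
        System.out.println(url(sample));        // prints http://example.org/
    }
}
```

For real crawls, CrawlLogIterator is the better choice, since a hand-rolled split like this does not handle malformed lines or any extra columns a particular Heritrix configuration may add.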

Grouping the summary by registered domain, host, or a custom key

CrawlSummary.byRegisteredDomain(log);
CrawlSummary.byHost(log);
CrawlSummary.byKey(log, item -> item.getCaptureBegan().toString().substring(0, 4)); // by year
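
To illustrate the idea behind a custom key function, here is a small stand-alone sketch that counts items per key using only the Java standard library. It is not logtrix's implementation: GroupByKeySketch and countByKey are hypothetical names, and java.time.Instant stands in for whatever type getCaptureBegan() returns (an assumption). The key function mirrors the "by year" lambda above by taking the first four characters of the timestamp's string form.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of grouping by a caller-supplied key; not logtrix code.
class GroupByKeySketch {

    /** Counts how many items map to each key, with keys in sorted order. */
    static <T> Map<String, Long> countByKey(List<T> items, Function<T, String> keyFunction) {
        return items.stream()
                .collect(Collectors.groupingBy(keyFunction, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up capture times for illustration only.
        List<Instant> captures = Arrays.asList(
                Instant.parse("2018-12-31T23:59:59Z"),
                Instant.parse("2019-01-01T00:00:00Z"),
                Instant.parse("2019-06-15T12:00:00Z"));

        // Instant.toString() yields ISO-8601, so the first 4 characters are the year.
        Map<String, Long> byYear = countByKey(captures, t -> t.toString().substring(0, 4));
        System.out.println(byYear); // prints {2018=1, 2019=2}
    }
}
```

Any function from an item to a string works as the key, so the same shape would cover grouping by month, by MIME type, or by URL scheme.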

Limiting each breakdown to the top N results

CrawlSummary.build(log).topN(10); // top 10 status codes, mime-types etc
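
As a rough picture of what a top-N cut-off means for a table of counts, the sketch below keeps the N largest entries of a map, ordered from most to least frequent. It uses only the standard library; TopNSketch and its topN method are hypothetical names, and whether logtrix orders or breaks ties the same way is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of a top-N cut-off over counts; not logtrix code.
class TopNSketch {

    /** Returns the n entries with the highest counts, largest first. */
    static <K> LinkedHashMap<K, Long> topN(Map<K, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<K, Long>comparingByValue().reversed())
                .limit(n)
                // LinkedHashMap preserves the sorted order of the stream.
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Made-up status code counts for illustration only.
        Map<String, Long> statusCounts = new HashMap<>();
        statusCounts.put("200", 5L);
        statusCounts.put("404", 3L);
        statusCounts.put("500", 1L);

        System.out.println(topN(statusCounts, 2)); // prints {200=5, 404=3}
    }
}
```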

Working with status codes

StatusCodes.describe(404);      // "Not found"
StatusCodes.describe(-4);       // "HTTP timeout"
StatusCodes.isError(-4);        // true
StatusCodes.isServerError(503); // true

Command-line interface

Output a JSON crawl summary grouped by registered domain:

java -jar target/*.jar -g registered-domain crawl.log

For more options:

java -jar target/*.jar --help

Compiling

Install Maven and then run:

mvn package

org.netpreserve: IIPC (International Internet Preservation Consortium)

Versions

Version
0.1.0