bigdoc

This library lets you handle gigabyte-scale files easily and with high performance. You can search for bytes or words in huge files and read data or text from them.

License	MIT License
GroupId	org.riversun
ArtifactId	bigdoc
Last Version	0.3.0
Release Date	18-Dec-2016
Type	jar
Description	bigdoc: This library lets you handle gigabyte-scale files easily and with high performance. You can search for bytes or words in huge files and read data or text from them.
Project URL	https://github.com/riversun/bigdoc
Source Code Management	https://github.com/riversun/bigdoc

Download bigdoc

Filename	Size
bigdoc-0.3.0.pom
bigdoc-0.3.0.jar	25 KB
bigdoc-0.3.0-sources.jar	13 KB
bigdoc-0.3.0-javadoc.jar	88 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.riversun/bigdoc/ -->
<dependency>
    <groupId>org.riversun</groupId>
    <artifactId>bigdoc</artifactId>
    <version>0.3.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.riversun/bigdoc/
implementation 'org.riversun:bigdoc:0.3.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.riversun/bigdoc/
implementation ("org.riversun:bigdoc:0.3.0")

Apache Buildr

'org.riversun:bigdoc:jar:0.3.0'

Apache Ivy

<dependency org="org.riversun" name="bigdoc" rev="0.3.0">
  <artifact name="bigdoc" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.riversun', module='bigdoc', version='0.3.0')
)

Scala SBT

libraryDependencies += "org.riversun" % "bigdoc" % "0.3.0"

Leiningen

[org.riversun/bigdoc "0.3.0"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.riversun : finbin	jar	0.6.2

test (2)

Group / Artifact	Type	Version
org.hamcrest : hamcrest-all	jar	1.3
junit : junit	jar	4.7

Project Modules

There are no modules declared in this project.

Overview

'bigdoc' lets you handle gigabyte-scale files easily and with high performance. You can search for bytes or words in huge files and read data or text from them.

It is licensed under the MIT license.

Quick start

Search for a sequence of bytes in a big file quickly.

This works on files in the megabyte to gigabyte range.

package org.example;

import java.io.File;
import java.util.List;

import org.riversun.bigdoc.bin.BigFileSearcher;

public class Example {

	public static void main(String[] args) throws Exception {

		// The byte sequence to look for, encoded as UTF-8
		byte[] searchBytes = "hello world.".getBytes("UTF-8");

		// The big file to search
		File file = new File("/var/tmp/yourBigfile.bin");

		BigFileSearcher searcher = new BigFileSearcher();

		// Collect the positions at which the byte sequence was found
		List<Long> findList = searcher.searchBigFile(file, searchBytes);

		System.out.println("positions = " + findList);
	}
}
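
Once you have a position, you will often want to look at the bytes stored there. The following is a minimal sketch that uses only the JDK (`RandomAccessFile`), not bigdoc itself. The class name `MatchContext` and the helper `readAt` are hypothetical names chosen for illustration, and the sketch assumes a position is a byte offset from the start of the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class MatchContext {

    /** Reads up to `length` bytes starting at `position`; returns fewer bytes if the file ends first. */
    static byte[] readAt(File file, long position, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            if (position < 0 || position >= raf.length()) {
                return new byte[0]; // nothing to read outside the file
            }
            int toRead = (int) Math.min(length, raf.length() - position);
            byte[] buf = new byte[toRead];
            raf.seek(position); // jump straight to the offset without reading what precedes it
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // Small stand-in file; in practice this would be the big file that was searched.
        File tmp = File.createTempFile("bigdoc-demo", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "xxxxhello world.yyyy".getBytes(StandardCharsets.UTF_8));

        // Offset 4 stands in for one entry of the list of positions (assumption: byte offsets).
        byte[] found = readAt(tmp, 4L, 12);
        System.out.println(new String(found, StandardCharsets.UTF_8));
    }
}
```

Because the file is opened for random access, only the requested bytes are read, so this stays cheap even on a multi-gigabyte file.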

Performance Test

Search for a sequence of bytes in a big file

Environment

Tested on AWS EC2 t2.* instances.

Results

File Size	EC2 t2.2xlarge (8 vCPU, 32 GiB)	EC2 t2.xlarge (4 vCPU, 16 GiB)	EC2 t2.large (2 vCPU, 8 GiB)	EC2 t2.medium (2 vCPU, 4 GiB)
10MB	0.5s	0.6s	0.8s	0.8s
50MB	2.8s	5.9s	13.4s	12.8s
100MB	5.4s	10.7s	25.9s	25.1s
250MB	15.7s	32.6s	77.1s	74.8s
1GB	55.9s	120.5s	286.1s	-
5GB	259.6s	566.1s	-	-
10GB	507.0s	1081.7s	-	-

Please Note

Processing speed depends on the number of CPU cores (including hyper-threading), not on memory capacity.
Results vary with the Java environment: the Java version, the compiler, and runtime optimizations.
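
To make the first note concrete, the 10GB row of the results can be turned into throughput figures. The sketch below is plain arithmetic on the published timings. The class name `Throughput` is hypothetical, and it assumes 1 GB = 1024 MB. Under that assumption the 8 vCPU instance works out to roughly 20 MB/s and the 4 vCPU instance to roughly 9.5 MB/s, a speedup of about 2.1x for twice the cores. The two 2 vCPU instances, which differ only in memory, show nearly identical times in the table.

```java
class Throughput {

    /** Megabytes per second, taking 1 GB = 1024 MB (an assumption of this sketch). */
    static double mbPerSec(double sizeGb, double seconds) {
        return sizeGb * 1024.0 / seconds;
    }

    public static void main(String[] args) {
        double eightCores = mbPerSec(10, 507.0);  // t2.2xlarge, 10GB row
        double fourCores = mbPerSec(10, 1081.7);  // t2.xlarge, 10GB row

        System.out.printf("8 vCPU: %.1f MB/s%n", eightCores);
        System.out.printf("4 vCPU: %.1f MB/s%n", fourCores);
        System.out.printf("speedup: %.2fx%n", eightCores / fourCores);
    }
}
```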

Architecture and Tuning

You can tune performance with the following methods, adjusting them to the number of CPU cores and the memory capacity.

BigFileSearcher#setBlockSize
BigFileSearcher#setMaxNumOfThreads
BigFileSearcher#setBufferSizePerWorker
BigFileSearcher#setBufferSize
BigFileSearcher#setSubThreadSize

BigFileSearcher searches for a sequence of bytes by dividing a big file into multiple blocks and using multiple workers to search those blocks concurrently. Each worker thread searches one block sequentially. The number of workers is specified by #setMaxNumOfThreads. Within a single worker thread, data is read into memory and searched in chunks of the capacity specified by #setBufferSize. A small area used to compare byte sequences during the search is called a window, and the size of that window is specified by #setSubBufferSize. Multiple windows can be processed concurrently, and the number of concurrent operations in a worker is specified by #setSubThreadSize.
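
The block-and-worker idea can be sketched with the JDK alone. This is not bigdoc's implementation: the class `BlockSearchSketch` and its methods are hypothetical, the `blockSize` and `threads` parameters only loosely mirror what #setBlockSize and #setMaxNumOfThreads control, and the window and sub-thread layer is left out. One detail it has to handle is a match that straddles two blocks, which it covers by letting each block read a few bytes past its end.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BlockSearchSketch {

    /** Finds every offset of `pattern` in `file`, scanning fixed-size blocks on a thread pool. */
    static List<Long> search(Path file, byte[] pattern, int blockSize, int threads) throws Exception {
        if (pattern.length == 0 || blockSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("pattern, blockSize and threads must be non-empty/positive");
        }
        long fileSize = Files.size(file);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (long start = 0; start < fileSize; start += blockSize) {
                final long blockStart = start;
                // Read pattern.length - 1 extra bytes so a match that crosses into the
                // next block is still found by the block in which it starts.
                final int readLen = (int) Math.min((long) blockSize + pattern.length - 1, fileSize - blockStart);
                Callable<List<Long>> task = () -> searchBlock(file, pattern, blockStart, readLen);
                futures.add(pool.submit(task));
            }
            List<Long> result = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                result.addAll(f.get()); // futures are in block order, so offsets stay sorted
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    /** One worker: reads its block into memory and scans it sequentially. */
    static List<Long> searchBlock(Path file, byte[] pattern, long blockStart, int readLen) throws IOException {
        byte[] buf = new byte[readLen];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(blockStart);
            raf.readFully(buf);
        }
        List<Long> hits = new ArrayList<>();
        for (int i = 0; i + pattern.length <= buf.length; i++) {
            int j = 0;
            while (j < pattern.length && buf[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                hits.add(blockStart + i); // convert block-local index to file offset
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("block-search", ".bin");
        try {
            Files.write(tmp, "abcabcabc--abc".getBytes(StandardCharsets.UTF_8));
            // A tiny block size forces some matches to straddle block boundaries.
            System.out.println(search(tmp, "abc".getBytes(StandardCharsets.UTF_8), 4, 2));
            // prints [0, 3, 6, 11]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because each block only reports matches that start inside its own range, the overlap does not produce duplicate offsets. A real implementation would also bound the memory per worker instead of loading a whole block at once, which is the role the text gives to the buffer size.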

More Details

See the Javadoc:

https://riversun.github.io/javadoc/bigdoc/

Downloads

Maven

You can add the dependency to your Maven pom.xml file.

<dependency>
  <groupId>org.riversun</groupId>
  <artifactId>bigdoc</artifactId>
  <version>0.3.0</version>
</dependency>

Versions

Version
0.3.0 18-Dec-2016
