com.cldellow:manu

Utilities to manage timeseries data.

License:
GroupId: com.cldellow
ArtifactId: manu
Last Version: 0.2.2
Release Date:
Type: pom
Description: Utilities to manage timeseries data.
Project URL: https://github.com/cldellow/manu
Source Code Management: https://github.com/cldellow/manu/tree/master

Download manu

Filename: manu-0.2.2.pom (7 KB)

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.cldellow/manu/ -->
<dependency>
    <groupId>com.cldellow</groupId>
    <artifactId>manu</artifactId>
    <version>0.2.2</version>
    <type>pom</type>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.cldellow/manu/
implementation 'com.cldellow:manu:0.2.2'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.cldellow/manu/
implementation("com.cldellow:manu:0.2.2")

Buildr:

'com.cldellow:manu:pom:0.2.2'

Ivy:

<dependency org="com.cldellow" name="manu" rev="0.2.2">
  <artifact name="manu" type="pom" />
</dependency>

Grape:

@Grapes(
  @Grab(group='com.cldellow', module='manu', version='0.2.2')
)

SBT:

libraryDependencies += "com.cldellow" % "manu" % "0.2.2"

Leiningen:

[com.cldellow/manu "0.2.2"]

Dependencies

There are no dependencies for this project; it is standalone and does not depend on any other jars.

Project Modules

  • common
  • format
  • cli
  • serve
  • report

Manu: "Mostly archived, not updated"

A time series storage format for integers and floats, using efficient delta encodings from FastPFOR.

Examples: pageviews by article in Wikipedia, stock open/close/high/low prices, weather temperatures.
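
The delta encodings here presumably refer to the JavaFastPFOR integer-compression library. As a rough sketch of the underlying idea only (this calls JavaFastPFOR directly, not manu's own API, which this page does not document; the sample data is invented):

import java.util.Arrays;
import me.lemire.integercompression.differential.IntegratedIntCompressor;

public class DeltaEncodingSketch {
    public static void main(String[] args) {
        // A dense, non-decreasing series, e.g. cumulative pageviews for one
        // article; the "integrated" codecs are intended for sorted input.
        int[] cumulative = new int[3650];
        int total = 0;
        for (int i = 0; i < cumulative.length; i++) {
            total += 10_000 + (i % 7) * 120;   // made-up daily counts with a weekly shape
            cumulative[i] = total;
        }

        // IntegratedIntCompressor applies delta coding + bit packing in one pass.
        IntegratedIntCompressor iic = new IntegratedIntCompressor();
        int[] compressed = iic.compress(cumulative);
        int[] recovered = iic.uncompress(compressed);

        System.out.printf("raw: %d ints, compressed: %d ints%n",
                cumulative.length, compressed.length);
        System.out.println("lossless: " + Arrays.equals(cumulative, recovered));
    }
}

Because neighbouring values are close together, the deltas need far fewer bits than the raw 32-bit values, which is where the storage savings come from.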

Components

  • manu-format, a library for maintaining the data on disk
  • manu-cli, a command-line tool for ingesting data into the format
  • manu-serve, a web server to expose the data over REST

Design criteria

Priorities

  • Cheap
    • I'm doing this to drive a hobby project; my dream would be to host a variety of datasets for $10/month.
    • A Fermi estimate suggests the Wikipedia pageviews dataset holds roughly 100B datapoints over the last 10 years (a back-of-envelope version is sketched after this list). This implies that storage costs will dominate.
  • Doesn’t need to be always-on
    • This sort of follows from being cheap: the ability to load subsets of data, or to run on spot instances, will be a useful way to cut costs.
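
A back-of-envelope restatement of the Fermi estimate above; every input is an assumption chosen only to show the order of magnitude, not a measured number:

public class StorageFermiEstimate {
    public static void main(String[] args) {
        long pages = 30_000_000L;        // assumed: tens of millions of tracked pages
        long days = 10L * 365;           // ten years of daily datapoints
        long datapoints = pages * days;  // ~1.1e11, i.e. on the order of 100B
        long rawBytes = datapoints * Integer.BYTES;   // 4 bytes per uncompressed int
        System.out.printf("datapoints ~ %.1e, raw size ~ %d GB%n",
                (double) datapoints, rawBytes / 1_000_000_000L);
    }
}

Roughly 440 GB before compression: at that scale the recurring storage bill, not compute, dominates, which is what motivates the compact encodings.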

Non-priorities

  • Concurrent / fast writes
    • These can happen offline.
  • Fast reads
    • The Pareto principle will likely apply to queries - 1% of keys will get 99% of reads. We can use Varnish or similar to cache at the application level.

Assumptions

  • Dense datasets
    • Keys: if we see a key once, we expect to see it again.
    • Values: if key X has a datapoint at T1, we expect most other keys will as well.
  • Correlated values
    • Value for key X at T1 is likely related to value at T2.
  • Some datasets can be lossy
    • Wikipedia pageviews, e.g., are likely insensitive to precision so long as the trend is generally correct.
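
A generic illustration of why those last two assumptions matter for the encoding (plain Java with invented numbers; this is not manu's actual on-disk format): correlated neighbours produce small deltas, and lossy rounding, where acceptable, discards precision that the trend does not need.

import java.util.Arrays;

public class DeltaAndLossySketch {
    public static void main(String[] args) {
        int[] views = {10480, 10512, 10497, 10533, 10551};   // made-up daily pageviews

        // Correlated values: store the first value, then only small differences.
        int[] deltas = new int[views.length];
        deltas[0] = views[0];
        for (int i = 1; i < views.length; i++) {
            deltas[i] = views[i] - views[i - 1];
        }
        System.out.println("deltas:  " + Arrays.toString(deltas));   // [10480, 32, -15, 36, 18]

        // Lossy datasets: round to the nearest 10 before encoding; the trend
        // is preserved while exact counts are discarded.
        int[] rounded = Arrays.stream(views).map(v -> Math.round(v / 10f) * 10).toArray();
        System.out.println("rounded: " + Arrays.toString(rounded));
    }
}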

Obligatory

[Comic: "Manu". Credit: Our Greatest Asset, Saturday Morning Breakfast Cereal]

Versions

  • 0.2.2
  • 0.2.1
  • 0.2.0
  • 0.1.0