comet

GroupId: com.ebiznext
ArtifactId: comet_2.11
Last Version: 0.1.10
Type: jar
Description: comet
Project URL: https://github.com/ebiznext/comet-data-pipeline
Project Organization: Ebiznext
Source Code Management: https://github.com/ebiznext/comet-data-pipeline

Download comet_2.11

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/ -->
<dependency>
    <groupId>com.ebiznext</groupId>
    <artifactId>comet_2.11</artifactId>
    <version>0.1.10</version>
</dependency>

Gradle Groovy DSL:

// https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/
implementation 'com.ebiznext:comet_2.11:0.1.10'

Gradle Kotlin DSL:

// https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/
implementation("com.ebiznext:comet_2.11:0.1.10")

Apache Buildr:

'com.ebiznext:comet_2.11:jar:0.1.10'

Apache Ivy:

<dependency org="com.ebiznext" name="comet_2.11" rev="0.1.10">
  <artifact name="comet_2.11" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='com.ebiznext', module='comet_2.11', version='0.1.10')
)

Scala SBT:

libraryDependencies += "com.ebiznext" % "comet_2.11" % "0.1.10"

Leiningen:

[com.ebiznext/comet_2.11 "0.1.10"]

Dependencies

compile (19)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.11.12
org.scalatra.scalate : scalate-core_2.11 jar 1.9.6
com.typesafe.scala-logging : scala-logging_2.11 jar 3.9.2
com.github.kxbmap : configs_2.11 jar 0.4.4
com.github.pathikrit : better-files_2.11 jar 3.9.1
org.scalatest : scalatest_2.11 jar 3.2.0
com.github.scopt : scopt_2.11 jar 4.0.0-RC2
org.elasticsearch : elasticsearch-hadoop jar 7.8.0
com.softwaremill.sttp : core_2.11 jar 1.7.2
com.google.cloud.bigdataoss : gcs-connector jar hadoop3-2.1.4
com.google.cloud.bigdataoss : bigquery-connector jar hadoop3-1.0.0
com.google.cloud : google-cloud-bigquery jar 1.116.2
org.apache.poi : poi-ooxml jar 4.1.2
com.fasterxml.jackson.core : jackson-core jar 2.7.9
com.fasterxml.jackson.core : jackson-annotations jar 2.7.9
com.fasterxml.jackson.core : jackson-databind jar 2.7.9
com.fasterxml.jackson.module : jackson-module-scala_2.11 jar 2.7.9
com.fasterxml.jackson.dataformat : jackson-dataformat-yaml jar 2.7.9
org.scala-lang : scala-reflect jar 2.11.12

provided (12)

Group / Artifact Type Version
org.apache.hadoop : hadoop-common jar 3.2.0
org.apache.hadoop : hadoop-hdfs jar 3.2.0
org.apache.hadoop : hadoop-yarn-client jar 3.2.0
org.apache.hadoop : hadoop-mapreduce-client-app jar 3.2.0
org.apache.hadoop : hadoop-client jar 3.2.0
com.google.cloud.spark : spark-bigquery-with-dependencies_2.11 jar 0.16.1
org.apache.hadoop : hadoop-azure jar 3.3.0
com.microsoft.azure : azure-storage jar 8.6.5
org.apache.spark : spark-core_2.11 jar 2.4.6
org.apache.spark : spark-sql_2.11 jar 2.4.6
org.apache.spark : spark-hive_2.11 jar 2.4.6
org.apache.spark : spark-mllib_2.11 jar 2.4.6

test (2)

Group / Artifact Type Version
pl.allegro.tech : embedded-elasticsearch jar 2.10.0
com.h2database : h2 jar 1.4.200

Project Modules

There are no modules declared in this project.

About Comet Data Pipeline

Complete documentation is available on the project site: https://github.com/ebiznext/comet-data-pipeline

Introduction

The purpose of this project is to efficiently ingest various data sources in different formats and make them available for analytics. Traditionally, this ingestion is done with hand-written custom parsers that transform input files into datasets of records.

This project aims to automate this parsing task by making data ingestion purely declarative.
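
For contrast, here is a minimal sketch of the kind of hand-written Spark ingestion code that Comet makes unnecessary. The paths, column names, and table name are hypothetical; only the Spark API calls are real.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, LongType, StringType, StructField, StructType}

object ManualIngestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("manual-ingestion")
      .enableHiveSupport()
      .getOrCreate()

    // Without a declarative tool, every new source means hand-coding
    // a schema and a parser like this one.
    val schema = StructType(Seq(
      StructField("customer_id", LongType, nullable = false),
      StructField("email", StringType, nullable = true),
      StructField("signup_date", DateType, nullable = true)
    ))

    spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .schema(schema)
      .csv("/landing/customers")          // hypothetical landing path
      .write
      .mode("overwrite")
      .saveAsTable("staging.customers")   // hypothetical Hive table
  }
}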

The workflow below is a typical use case:

  • Export your data as a set of DSV (delimiter-separated values) or JSON files
  • Describe each DSV/JSON file with a schema, using YAML syntax (sketched below)
  • Configure the ingestion process
  • Watch your data become available as Hive tables in your datalake
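
The schema in the second step is what drives everything else. As a rough illustration of the information such a declarative schema carries, here is a hypothetical, simplified Scala model; these are not Comet's actual classes, and the real YAML layout is covered in the documentation.

// Hypothetical, simplified model of what a declarative file schema describes.
case class Attribute(
  name: String,            // column name in the input file
  `type`: String,          // semantic type, e.g. "string", "date", "email"
  required: Boolean,       // reject records that are missing this field
  privacy: Option[String]  // optional masking strategy (GDPR)
)
case class Schema(name: String, pattern: String, attributes: List[Attribute])
case class Domain(name: String, directory: String, schemas: List[Schema])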

The main advantages of the Comet Data Pipeline project are:

  • Eliminates manual coding for data ingestion
  • Assigns metadata to each dataset
  • Exposes data ingestion metrics and history
  • Transforms text files into strongly typed records
  • Supports semantic types
  • Enforces privacy on specific fields (GDPR); see the sketch after this list
  • Remains a very simple piece of software to administer
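
To make the privacy point concrete: Comet applies such masking declaratively, per field, but the effect is equivalent to the hand-coded Spark transformation below. The applyPrivacy helper and piiColumns parameter are hypothetical names; sha2 is Spark's built-in function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

// Replace each sensitive column with its SHA-256 digest before it reaches
// the working area. Hypothetical helper, not Comet's actual API.
def applyPrivacy(df: DataFrame, piiColumns: Seq[String]): DataFrame =
  piiColumns.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, sha2(col(name).cast("string"), 256))
  }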

How it works

Comet Data Pipeline automates the loading and parsing of files and their ingestion into a Hadoop Datalake where datasets become available as Hive tables.

(Diagram: the complete Comet Data Pipeline)

  1. Landing area: files are first stored in the local file system
  2. Staging area: files associated with a schema are imported into the datalake
  3. Working area: staged files are parsed against their schema; records are rejected or accepted and made available as parquet/orc/... files exposed as Hive tables
  4. Business area: tables in the working area may be joined to provide a holistic view of the data through the definition of an AutoJob (see the sketch below)
  5. Data visualization: parquet/orc/... tables may be exposed in data warehouses or Elasticsearch indexes
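
As an illustration of step 4: in Comet, the join below would be declared in an AutoJob definition rather than coded by hand. The table and view names are hypothetical, and a Hive-enabled SparkSession named spark is assumed.

// Business area: join two working-area Hive tables into a holistic view.
val ordersEnriched = spark.sql(
  """SELECT o.order_id, o.amount, c.email
    |FROM working.orders o
    |JOIN working.customers c ON o.customer_id = c.customer_id""".stripMargin)

// Persist the result where BI tools or Elasticsearch indexing can pick it up.
ordersEnriched.write.mode("overwrite").saveAsTable("business.orders_enriched")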

Versions

Version
0.1.10
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.1