comet

GroupId: com.ebiznext
ArtifactId: comet_2.11
Last Version: 0.1.10
Type: jar
Description: comet
Project URL: https://github.com/ebiznext/comet-data-pipeline
Project Organization: Ebiznext
Source Code Management: https://github.com/ebiznext/comet-data-pipeline

Download comet_2.11

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/ -->
<dependency>
    <groupId>com.ebiznext</groupId>
    <artifactId>comet_2.11</artifactId>
    <version>0.1.10</version>
</dependency>

Gradle Groovy DSL:

// https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/
implementation 'com.ebiznext:comet_2.11:0.1.10'

Gradle Kotlin DSL:

// https://jarcasting.com/artifacts/com.ebiznext/comet_2.11/
implementation("com.ebiznext:comet_2.11:0.1.10")

Apache Buildr:

'com.ebiznext:comet_2.11:jar:0.1.10'

Apache Ivy:

<dependency org="com.ebiznext" name="comet_2.11" rev="0.1.10">
  <artifact name="comet_2.11" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='com.ebiznext', module='comet_2.11', version='0.1.10')
)

Scala SBT:

libraryDependencies += "com.ebiznext" % "comet_2.11" % "0.1.10"

Leiningen:

[com.ebiznext/comet_2.11 "0.1.10"]

Dependencies

compile (19)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.11.12
org.scalatra.scalate : scalate-core_2.11 jar 1.9.6
com.typesafe.scala-logging : scala-logging_2.11 jar 3.9.2
com.github.kxbmap : configs_2.11 jar 0.4.4
com.github.pathikrit : better-files_2.11 jar 3.9.1
org.scalatest : scalatest_2.11 jar 3.2.0
com.github.scopt : scopt_2.11 jar 4.0.0-RC2
org.elasticsearch : elasticsearch-hadoop jar 7.8.0
com.softwaremill.sttp : core_2.11 jar 1.7.2
com.google.cloud.bigdataoss : gcs-connector jar hadoop3-2.1.4
com.google.cloud.bigdataoss : bigquery-connector jar hadoop3-1.0.0
com.google.cloud : google-cloud-bigquery jar 1.116.2
org.apache.poi : poi-ooxml jar 4.1.2
com.fasterxml.jackson.core : jackson-core jar 2.7.9
com.fasterxml.jackson.core : jackson-annotations jar 2.7.9
com.fasterxml.jackson.core : jackson-databind jar 2.7.9
com.fasterxml.jackson.module : jackson-module-scala_2.11 jar 2.7.9
com.fasterxml.jackson.dataformat : jackson-dataformat-yaml jar 2.7.9
org.scala-lang : scala-reflect jar 2.11.12

provided (12)

Group / Artifact Type Version
org.apache.hadoop : hadoop-common jar 3.2.0
org.apache.hadoop : hadoop-hdfs jar 3.2.0
org.apache.hadoop : hadoop-yarn-client jar 3.2.0
org.apache.hadoop : hadoop-mapreduce-client-app jar 3.2.0
org.apache.hadoop : hadoop-client jar 3.2.0
com.google.cloud.spark : spark-bigquery-with-dependencies_2.11 jar 0.16.1
org.apache.hadoop : hadoop-azure jar 3.3.0
com.microsoft.azure : azure-storage jar 8.6.5
org.apache.spark : spark-core_2.11 jar 2.4.6
org.apache.spark : spark-sql_2.11 jar 2.4.6
org.apache.spark : spark-hive_2.11 jar 2.4.6
org.apache.spark : spark-mllib_2.11 jar 2.4.6

test (2)

Group / Artifact Type Version
pl.allegro.tech : embedded-elasticsearch jar 2.10.0
com.h2database : h2 jar 1.4.200

Project Modules

There are no modules declared in this project.

About Comet Data Pipeline

Complete documentation is available on the project site: https://github.com/ebiznext/comet-data-pipeline

Introduction

The purpose of this project is to efficiently ingest various data sources in different formats and make them available for analytics. Traditionally, this ingestion is done with hand-written custom parsers that transform input files into datasets of records.

This project aims to automate this parsing task by making data ingestion purely declarative.
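
For contrast, here is a minimal sketch of the kind of hand-written Spark ingestion code that Comet makes unnecessary. The paths, column names, and table name are hypothetical; only the Spark API calls are real.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, LongType, StringType, StructField, StructType}

object ManualIngestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("manual-ingestion")
      .enableHiveSupport()
      .getOrCreate()

    // Without a declarative tool, every new source means hand-coding
    // a schema and a parser like this one.
    val schema = StructType(Seq(
      StructField("customer_id", LongType, nullable = false),
      StructField("email", StringType, nullable = true),
      StructField("signup_date", DateType, nullable = true)
    ))

    spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .schema(schema)
      .csv("/landing/customers")          // hypothetical landing path
      .write
      .mode("overwrite")
      .saveAsTable("staging.customers")   // hypothetical Hive table
  }
}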

The workflow below is a typical use case:

  • Export your data as a set of DSV (delimiter-separated values) or JSON files
  • Describe each DSV/JSON file with a schema, using YAML syntax (sketched below)
  • Configure the ingestion process
  • Watch your data become available as Hive tables in your datalake
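
The schema in the second step is what drives everything else. As a rough illustration of the information such a declarative schema carries, here is a hypothetical, simplified Scala model; these are not Comet's actual classes, and the real YAML layout is covered in the documentation.

// Hypothetical, simplified model of what a declarative file schema describes.
case class Attribute(
  name: String,            // column name in the input file
  `type`: String,          // semantic type, e.g. "string", "date", "email"
  required: Boolean,       // reject records that are missing this field
  privacy: Option[String]  // optional masking strategy (GDPR)
)
case class Schema(name: String, pattern: String, attributes: List[Attribute])
case class Domain(name: String, directory: String, schemas: List[Schema])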

The main advantages of the Comet Data Pipeline project are:

  • Eliminates manual coding for data ingestion
  • Assigns metadata to each dataset
  • Exposes data ingestion metrics and history
  • Transforms text files into strongly typed records
  • Supports semantic types
  • Enforces privacy on specific fields (GDPR); see the sketch after this list
  • Remains a very simple piece of software to administer
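
To make the privacy point concrete: Comet applies such masking declaratively, per field, but the effect is equivalent to the hand-coded Spark transformation below. The applyPrivacy helper and piiColumns parameter are hypothetical names; sha2 is Spark's built-in function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

// Replace each sensitive column with its SHA-256 digest before it reaches
// the working area. Hypothetical helper, not Comet's actual API.
def applyPrivacy(df: DataFrame, piiColumns: Seq[String]): DataFrame =
  piiColumns.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, sha2(col(name).cast("string"), 256))
  }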

How it works

Comet Data Pipeline automates the loading and parsing of files and their ingestion into a Hadoop Datalake where datasets become available as Hive tables.

(Diagram: the complete Comet Data Pipeline)

  1. Landing area: files are first stored in the local file system
  2. Staging area: files associated with a schema are imported into the datalake
  3. Working area: staged files are parsed against their schema; records are rejected or accepted and made available as parquet/orc/... files exposed as Hive tables
  4. Business area: tables in the working area may be joined to provide a holistic view of the data through the definition of an AutoJob (see the sketch below)
  5. Data visualization: parquet/orc/... tables may be exposed in data warehouses or Elasticsearch indexes
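
As an illustration of step 4: in Comet, the join below would be declared in an AutoJob definition rather than coded by hand. The table and view names are hypothetical, and a Hive-enabled SparkSession named spark is assumed.

// Business area: join two working-area Hive tables into a holistic view.
val ordersEnriched = spark.sql(
  """SELECT o.order_id, o.amount, c.email
    |FROM working.orders o
    |JOIN working.customers c ON o.customer_id = c.customer_id""".stripMargin)

// Persist the result where BI tools or Elasticsearch indexing can pick it up.
ordersEnriched.write.mode("overwrite").saveAsTable("business.orders_enriched")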

Versions

Version
0.1.10
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.1