Data Prep
A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interactive visual tool or are written by hand.
- Data Prep defines a few concepts that are useful if you are just getting started with it. Learn about them here.
- The Data Prep Transform is separately documented.
- Data Prep Cheatsheet
New Features
More on upcoming features can be found here.
- User Defined Directives (UDDs) allow you to create custom functions to transform records within CDAP Data Prep (also known as Wrangler). CDAP comes with a comprehensive library of functions; there are, however, some omissions, and some specific cases for which UDDs are the solution. Additional information on building your own directives is available here.
- A new capability allows CDAP administrators to restrict the directives that are accessible to their users. More information on configuring this can be found here.
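For illustration, a UDD is invoked in a recipe just like any system directive, after being loaded with a `#pragma` statement. The sketch below is hypothetical — the directive name `text-reverse` and the column names are invented for this example — so check the exact syntax against your Wrangler version:

```
#pragma load-directives text-reverse;
parse-as-csv :body ',' true
text-reverse :message
```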
Demo Videos and Recipes
Videos and screencasts are the best way to learn, so we have compiled simple, short screencasts that show some of the features of Data Prep. Additional videos can be found here.
Videos
- [SCREENCAST] Creating Lookup Dataset and Joining
- [SCREENCAST] Restricted Directives
- [SCREENCAST] Parse Excel files in CDAP
- [SCREENCAST] Parse File As AVRO File
- [SCREENCAST] Parsing Binary Coded AVRO Messages
- [SCREENCAST] Parsing Binary Coded AVRO Messages & Protobuf messages using schema registry
- [SCREENCAST] Quantize a column - Digitize
- [SCREENCAST] Data Cleansing capability with send-to-error directive
- [SCREENCAST] Building Data Prep from the GitHub source
- [VOICE-OVER] End-to-End Demo Video
- [SCREENCAST] Ingesting into Kudu
- [SCREENCAST] Realtime HL7 CCDA XML from Kafka into Time Partitioned Parquet
- [SCREENCAST] Parsing JSON file
- [SCREENCAST] Flattening arrays
- [SCREENCAST] Data cleansing with send-to-error directive
- [SCREENCAST] Publishing to Kafka
- [SCREENCAST] Fixed length to JSON
Recipes
Available Directives
These directives are currently available:
Directive | Description |
---|---|
Parsers | |
JSON Path | Uses a DSL (a JSON path expression) for parsing JSON records |
Parse as AVRO | Parses an AVRO-encoded message, either binary or JSON |
Parse as AVRO File | Parses an AVRO data file |
Parse as CSV | Parses an input record as comma-separated values |
Parse as Date | Parses dates using natural language processing |
Parse as Excel | Parses an Excel file |
Parse as Fixed Length | Parses a fixed-length record with specified widths |
Parse as HL7 | Parses Health Level 7 Version 2 (HL7 V2) messages |
Parse as JSON | Parses a JSON object |
Parse as Log | Parses access log files, such as those from Apache HTTPD and nginx servers |
Parse as Protobuf | Parses a Protobuf-encoded in-memory message using a descriptor |
Parse as Simple Date | Parses date strings |
Parse XML To JSON | Parses an XML document into a JSON structure |
Parse as Currency | Parses a string representation of currency into a number |
Parse as Datetime | Parses strings with datetime values into the CDAP datetime type |
Output Formatters | |
Write as CSV | Converts a record into CSV format |
Write as JSON | Converts the record into a JSON map |
Write JSON Object | Composes a JSON object from the specified fields |
Format as Currency | Formats a number as currency, as specified by locale |
Transformations | |
Changing Case | Changes the case of column values |
Cut Character | Selects parts of a string value |
Set Column | Sets the column value to the result of an expression execution |
Find and Replace | Transforms string column values using a "sed"-like expression |
Index Split | (Deprecated) |
Invoke HTTP | Invokes an HTTP Service (Experimental, potentially slow) |
Quantization | Quantizes a column based on specified ranges |
Regex Group Extractor | Extracts the data from a regex group into its own column |
Setting Character Set | Sets the encoding and then converts the data to a UTF-8 String |
Setting Record Delimiter | Sets the record delimiter |
Split by Separator | Splits a column based on a separator into two columns |
Split Email Address | Splits an email ID into an account and its domain |
Split URL | Splits a URL into its constituents |
Text Distance (Fuzzy String Match) | Measures the difference between two sequences of characters |
Text Metric (Fuzzy String Match) | Measures the difference between two sequences of characters |
URL Decode | Decodes from the application/x-www-form-urlencoded MIME format |
URL Encode | Encodes to the application/x-www-form-urlencoded MIME format |
Trim | Functions for trimming white spaces around string data |
Encoders and Decoders | |
Decode | Decodes a column value as one of base32 , base64 , or hex |
Encode | Encodes a column value as one of base32 , base64 , or hex |
Unique ID | |
UUID Generation | Generates a universally unique identifier (UUID). Recommended with Wrangler version 4.4.0 and above due to an important bug fix (CDAP-17732) |
Date Transformations | |
Diff Date | Calculates the difference between two dates |
Format Date | Custom patterns for date-time formatting |
Format Unix Timestamp | Formats a UNIX timestamp as a date |
DateTime Transformations | |
Current DateTime | Generates the current datetime using the given zone or UTC by default |
Datetime To Timestamp | Converts a datetime value to timestamp with the given zone |
Format Datetime | Formats a datetime value to custom date time pattern strings |
Timestamp To Datetime | Converts a timestamp value to datetime |
Lookups | |
Catalog Lookup | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
Table Lookup | Performs lookups into Table datasets |
Hashing & Masking | |
Message Digest or Hash | Generates a message digest |
Mask Number | Applies substitution masking on the column values |
Mask Shuffle | Applies shuffle masking on the column values |
Row Operations | |
Filter Row if Matched | Filters rows that match a pattern for a column |
Filter Row if True | Filters rows if the condition is true. |
Filter Row Empty or Null | Filters rows that are empty or null |
Flatten | Separates the elements in a repeated field |
Fail on condition | Fails processing when the condition evaluates to true |
Send to Error | Sends matching records to an error collector |
Send to Error And Continue | Sends matching records to an error collector and continues processing |
Split to Rows | Splits based on a separator into multiple records |
Column Operations | |
Change Column Case | Changes column names to either lowercase or uppercase |
Changing Case | Change the case of column values |
Cleanse Column Names | Sanitizes column names, following specific rules |
Columns Replace | Alters column names in bulk |
Copy | Copies values from a source column into a destination column |
Drop Column | Drops a column in a record |
Fill Null or Empty Columns | Fills column value with a fixed value if null or empty |
Keep Columns | Keeps specified columns from the record |
Merge Columns | Merges two columns by inserting a third column |
Rename Column | Renames an existing column in the record |
Set Column Header | Sets the names of columns, in the order they are specified |
Split to Columns | Splits a column based on a separator into multiple columns |
Swap Columns | Swaps column names of two columns |
Set Column Data Type | Converts the data type of a column |
NLP | |
Stemming Tokenized Words | Applies the Porter stemmer algorithm for English words |
Transient Aggregators & Setters | |
Increment Variable | Increments a transient variable with each record being processed |
Set Variable | Sets a transient variable with each record being processed |
Functions | |
Data Quality | Data quality check functions. Checks for date, time, etc. |
Date Manipulations | Functions that can manipulate dates |
DDL | Functions that can manipulate the definition of data |
JSON | Functions useful in transforming JSON data |
Types | Functions for detecting the type of data |
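To show how the directives above compose, here is a small, hypothetical recipe: the column names are invented for this example, and the directive spellings follow common Wrangler usage, so verify them against the directive reference for your version:

```
parse-as-csv :body ',' true
drop :body
trim :name
uppercase :country
fill-null-or-empty :zip '00000'
set-type :age integer
```

Directives are applied in order, each transforming the records produced by the previous one.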
Performance
Initial performance tests show that, with a set of directives of high complexity for transforming data, Data Prep processes roughly 106K records per second. The rates below are in records per second.
Directive Complexity | Column Count | Records | Size (bytes) | Mean Rate (records/s) |
---|---|---|---|---|
High (167 Directives) | 426 | 127,946,398 | 82,677,845,324 | 106,367.27 |
High (167 Directives) | 426 | 511,785,592 | 330,711,381,296 | 105,768.93 |
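The derived figures below are computed from the first row of the table above (they are implied by the reported numbers, not separately measured):

```python
# Sanity-check the reported mean rate using the first table row.
records = 127_946_398
size_bytes = 82_677_845_324
mean_rate = 106_367.27  # records per second, as reported

elapsed_s = records / mean_rate          # implied wall-clock time
bytes_per_record = size_bytes / records  # average record size
throughput_mb_s = mean_rate * bytes_per_record / 1e6

print(f"elapsed: {elapsed_s / 60:.1f} min")        # ~20.0 min
print(f"record size: {bytes_per_record:.0f} B")    # ~646 B
print(f"throughput: {throughput_mb_s:.1f} MB/s")   # ~68.7 MB/s
```

At roughly 646 bytes per record, the reported record rate corresponds to about 69 MB/s of raw data per run.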
Contact
Mailing Lists
CDAP User Group and Development Discussions:
The cdap-user mailing list is primarily for users who are using the product to develop applications or build plugins for applications. You can expect questions from users, release announcements, and any other discussions that we think will be helpful to users.
IRC Channel
CDAP IRC Channel: #cdap on irc.freenode.net
Slack Team
CDAP Users on Slack: cdap-users team
License and Trademarks
Copyright © 2016-2019 Cask Data, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Cask is a trademark of Cask Data, Inc. All rights reserved.
Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.