Description:

Parses CSV-formatted data, returning each row in the CSV file as a separate record. This reader assumes that the first line in the content is the column names and all subsequent lines are the values. See Controller Service’s Usage for further documentation.

Tags:

CSV, parse, record, row, reader, delimited, comma, separated, values

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the Expression Language Guide.

Name

Default Value

Allowable Values

Description

Schema Access Strategy csv-header-derived * Use 'Schema Name' Property
* Use 'Schema Text' Property
* HWX Schema Reference Attributes
* HWX Content-Encoded Schema Reference
* Use Embedded Avro Schema
Specifies how to obtain the schema that is to be used for interpreting the data.
Schema Registry Controller Service API: </br> SchemaRegistry </br> Implementations: AvroSchemaRegistry </br> HortonworksSchemaRegistry Specifies the Controller Service to use for the Schema Registry
Schema Name ${schema.name} Specifies the name of the schema to lookup in the Schema Registry property
Supports Expression Language: true
Schema Text ${avro.schema} The text of an Avro-formatted Schema
Supports Expression Language: true
Date Format Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java Simple Date Format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).
Time Format Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java Simple Date Format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).
Timestamp Format Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java Simple Date Format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).
CSV Format custom * Custom Format
* RFC 4180
* Microsoft Excel
* Tab-Delimited
* MySQL Format
* Informix Unload
* Informix Unload Escape Disabled
The text of an Avro-formatted Schema
Specifies which "format" the CSV data is in, or specifies if custom formatting should be used.
Value Separator The character that is used to separate values/fields in a CSV Record
Skip Header Line false * true
* false
Specifies whether or not the first line of CSV should be considered a Header and skipped. If the Schema Access Strategy indicates that the columns must be defined in the header, then this property will be ignored, since the header must always be present and won't be processed as a Record. Otherwise, this property should be 'true' if the first non-comment line of CSV contains header information that needs to be ignored.
Quote Character " The character that is used to quote values so that escape characters do not have to be used
Escape Character \ The character that is used to escape characters that would otherwise have a specific meaning to the CSV Parser.
Comment Marker The character that is used to denote the start of a comment. Any line that begins with this comment will be ignored.
Null String Specifies a String that, if present as a value in the CSV, should be considered a null field instead of using the literal value.
Trim Fields true * true
* false
Whether or not white space should be removed from the beginning and end of fields

State management:

This component does not store state.

Restricted:

This component is not restricted.

Summary:

The CSVReader Controller Service, expects input in such a way that the first line of a FlowFile specifies the name of each column in the data. Following the first line, the rest of the FlowFile is expected to be valid CSV data from which to form appropriate Records. The reader allows for customization of the CSV Format, such as which character should be used to separate CSV fields, which character should be used for quoting and when to quote fields, which character should denote a comment, etc.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.
  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.
  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch Midnight GMT, January 1, 1970).
  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured "Date Format," "Time Format," or "Timestamp Format."
  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.
  • A String value of "true" or "false" (regardless of case) can be coerced into a Boolean value.
  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.
  • Any "date/time" type (Date, Time, Timestamp) can be coerced into any other "date/time" type.
  • Any "date/time" type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).
  • Any "date/time" type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.

Examples

As an example, consider a FlowFile whose contents consists of the following:

id, name, balance, join_date, notes
1, John, 48.23, 04/03/2007 “Our very
first customer!”
2, Jane, 1245.89, 08/22/2009,
3, Frank Franklin, “48481.29”, 04/04/2016,

Additionally, let’s consider that this Controller Service is configured with the Schema Registry pointing to an AvroSchemaRegistry and the schema is configured as the following:

{
	  "namespace": "nifi",
	  "name": "balances",
	  "type": "record",
	  "fields": [
	  		{ "name": "id", "type": "int" },
	  		{ "name": "name": "type": "string" },
	  		{ "name": "balance": "type": "double" },
	  		{ "name": "join_date", "type": {
	  			"type": "int",
	  			"logicalType": "date"
	  		},
	  		{ "name": "notes": "type": "string" }
		]
	}

In the example above, we see that the ‘join_date’ column is a Date type. In order for the CSV Reader to be able to properly parse a value as a date, we need to provide the reader with the date format to use. In this example, we would configure the Date Format property to be MM/dd/yyyy to indicate that it is a two-digit month, followed by a two-digit day, followed by a four-digit year - each separated by a slash. In this case, the result will be that this FlowFile consists of 3 different records. The first record will contain the following values:

Field Name Field Value
id 1
name John
balance 48.23
join_date 04/03/2007
notes Our very
first customer!

The second record will contain the following values:

Field Name Field Value
id 2
name Jane
balance 1245.89
join_date 08/22/2009
notes

The third record will contain the following values:

Field Name Field Value
id 3
name Frank Franklin
balance 48481.29
join_date 04/04/2016
notes