Description:

Fetches files from the Hadoop Distributed File System (HDFS) into FlowFiles. By default, this Processor deletes each file from HDFS after fetching it; set Keep Source File to true to leave the file in place.

Tags:

hadoop, HDFS, get, fetch, ingest, source, filesystem

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values.

Name | Default Value | Allowable Values | Description
Hadoop Configuration Resources | | | A file or comma-separated list of files containing the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration.
Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties.
Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties.
Kerberos Relogin Period | 4 hours | | Period of time that should pass before attempting a Kerberos relogin.
Directory | | | The HDFS directory from which files should be read.
Recurse Subdirectories | true | true, false | Indicates whether to pull files from subdirectories of the HDFS directory.
Keep Source File | false | true, false | Determines whether the file is kept in HDFS after it has been successfully transferred. If false, the file is deleted from HDFS; if true, the file stays in place and will be fetched repeatedly. Setting this to true is intended for testing only.
File Filter Regex | | | A Java Regular Expression for filtering filenames. If a filter is supplied, only files whose names match that Regular Expression will be fetched; otherwise all files will be fetched.
Filter Match Name Only | true | true, false | If true, File Filter Regex will match on just the filename; otherwise subdirectory names will be included with the filename in the regex comparison.
Ignore Dotted Files | true | true, false | If true, files whose names begin with a dot (".") will be ignored.
Minimum File Age | 0 sec | | The minimum age a file must have in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored.
Maximum File Age | | | The maximum age a file may have in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored.
Polling Interval | 0 sec | | Indicates how long to wait between performing directory listings.
Batch Size | 100 | | The maximum number of files to pull in each iteration, based on run schedule.
IO Buffer Size | | | Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration.
Compression codec | NONE | NONE, DEFAULT, BZIP, GZIP, LZ4, SNAPPY, AUTOMATIC | No Description Provided.
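
The selection-related properties (File Filter Regex, Filter Match Name Only, Ignore Dotted Files, Minimum File Age, Maximum File Age) can be read together as one predicate over each candidate file. The Python sketch below is only an illustration of that reading, not the processor's actual code. The names FileSelection and is_selected are hypothetical, ages are expressed in seconds, and two assumptions are made that the descriptions above do not confirm: that the regular expression must match the whole compared string, and that Python's re module stands in adequately for Java regular expressions on simple patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileSelection:
    """Hypothetical stand-in for the selection-related properties."""
    file_filter_regex: Optional[str] = None   # File Filter Regex (unset = fetch all)
    filter_match_name_only: bool = True       # Filter Match Name Only
    ignore_dotted_files: bool = True          # Ignore Dotted Files
    min_age_sec: float = 0.0                  # Minimum File Age
    max_age_sec: Optional[float] = None       # Maximum File Age (unset = no limit)


def is_selected(cfg: FileSelection, relative_path: str,
                modified_at: float, now: float) -> bool:
    """Return True if a file would be picked up under this reading of the properties.

    relative_path is the file's path relative to the Directory property,
    e.g. "abc/data.csv"; modified_at and now are timestamps in seconds.
    """
    name = relative_path.rsplit("/", 1)[-1]

    # Ignore Dotted Files: skip names that begin with ".".
    if cfg.ignore_dotted_files and name.startswith("."):
        return False

    # Minimum / Maximum File Age, based on the last modification time.
    age = now - modified_at
    if age < cfg.min_age_sec:
        return False
    if cfg.max_age_sec is not None and age > cfg.max_age_sec:
        return False

    # File Filter Regex: compare against the bare name or the path
    # including subdirectories, depending on Filter Match Name Only.
    # Assumption: the pattern must match the entire string.
    if cfg.file_filter_regex is not None:
        candidate = name if cfg.filter_match_name_only else relative_path
        if re.fullmatch(cfg.file_filter_regex, candidate) is None:
            return False

    return True
```

For example, with FileSelection(file_filter_regex=r".*\.csv", min_age_sec=60), a file "abc/data.csv" modified 120 seconds ago passes the predicate, while the same file modified 30 seconds ago is skipped because it is younger than the minimum age.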

Relationships:

Name | Description
passthrough | If this processor has an input queue for some reason, then FlowFiles arriving on that input are transferred to this relationship.
success | All files retrieved from HDFS are transferred to this relationship.

Reads Attributes:

None specified.

Writes Attributes:

Name | Description
filename | The name of the file that was read from HDFS.
path | The path is set to the relative path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3".
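
The two path examples above can be reproduced with a small Python sketch. It only illustrates the described rule and is not the processor's implementation; the function name path_attribute and the file names used below are hypothetical.

```python
import posixpath


def path_attribute(directory: str, file_path: str) -> str:
    """Relative path of the file's parent directory, as described for 'path'.

    directory is the value of the Directory property; file_path is the
    absolute HDFS path of the fetched file.
    """
    parent = posixpath.dirname(file_path)
    relative = posixpath.relpath(parent, directory)
    # Files sitting directly in the configured directory are reported as "./".
    return "./" if relative == "." else relative
```

With Directory set to /tmp, path_attribute("/tmp", "/tmp/file.txt") gives "./" and path_attribute("/tmp", "/tmp/abc/1/2/3/file.txt") gives "abc/1/2/3", matching the examples in the description.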

See Also:

PutHDFS, ListHDFS