Description:
Fetch sequence files from Hadoop Distributed File System (HDFS) into FlowFiles
Tags:
hadoop, HDFS, get, fetch, ingest, source, sequence file
Properties:
In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values.
Name | Default Value | Allowable Values | Description |
Hadoop Configuration Resources | | | A file or comma-separated list of files that contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. |
Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. |
Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. |
Kerberos Relogin Period | 4 hours | | Period of time that should pass before attempting a Kerberos relogin. |
Directory | | | The HDFS directory from which files should be read. |
Recurse Subdirectories | true | true, false | Indicates whether to pull files from subdirectories of the HDFS directory. |
Keep Source File | false | true, false | Determines whether the file is kept in HDFS after it has been successfully transferred. If false, the file is deleted from HDFS; if true, it is left in place and will be fetched repeatedly. Setting this to true is intended for testing only. |
File Filter Regex | | | A Java regular expression for filtering filenames. If a filter is supplied, only files whose names match the regular expression will be fetched; otherwise all files will be fetched. |
Filter Match Name Only | true | true, false | If true, File Filter Regex is matched against the filename only; otherwise subdirectory names are included with the filename in the regex comparison. |
Ignore Dotted Files | true | true, false | If true, files whose names begin with a dot (".") will be ignored. |
Minimum File Age | 0 sec | | The minimum age a file must have in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored. |
Maximum File Age | | | The maximum age a file may have in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored. |
Polling Interval | 0 sec | | Indicates how long to wait between performing directory listings. |
Batch Size | 100 | | The maximum number of files to pull in each iteration, based on run schedule. |
IO Buffer Size | | | Amount of memory to use to buffer file contents during IO. This overrides the Hadoop configuration. |
Compression codec | NONE | NONE, DEFAULT, BZIP, GZIP, LZ4, SNAPPY, AUTOMATIC | No Description Provided. |
FlowFile Content | VALUE ONLY | VALUE ONLY, KEY VALUE PAIR | Indicates whether the content is to be both the key and value of the Sequence File, or just the value. |
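The way File Filter Regex, Filter Match Name Only, Ignore Dotted Files, and the two file-age properties combine can be hard to picture from the table alone. The following is a minimal Java sketch of that selection logic, not the processor's actual implementation. The class name HdfsFileSelector and its accept method are hypothetical, and the sketch assumes the regular expression must match the whole candidate string.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of NiFi): illustrates how the documented
// selection properties could combine when deciding whether to pull a file.
public class HdfsFileSelector {
    private final Pattern filter;        // File Filter Regex (null means no filter)
    private final boolean matchNameOnly; // Filter Match Name Only
    private final boolean ignoreDotted;  // Ignore Dotted Files
    private final long minAgeMillis;     // Minimum File Age
    private final long maxAgeMillis;     // Maximum File Age (Long.MAX_VALUE if unset)

    public HdfsFileSelector(String regex, boolean matchNameOnly, boolean ignoreDotted,
                            long minAgeMillis, long maxAgeMillis) {
        this.filter = (regex == null) ? null : Pattern.compile(regex);
        this.matchNameOnly = matchNameOnly;
        this.ignoreDotted = ignoreDotted;
        this.minAgeMillis = minAgeMillis;
        this.maxAgeMillis = maxAgeMillis;
    }

    /**
     * @param relativePath file path relative to the Directory property, e.g. "abc/data.seq"
     * @param lastModified last modification time in epoch milliseconds
     * @param now          current time in epoch milliseconds
     */
    public boolean accept(String relativePath, long lastModified, long now) {
        // The filename is everything after the last '/' (or the whole string).
        String name = relativePath.substring(relativePath.lastIndexOf('/') + 1);

        if (ignoreDotted && name.startsWith(".")) {
            return false;
        }

        // Age is measured from the last modification date.
        long age = now - lastModified;
        if (age < minAgeMillis || age > maxAgeMillis) {
            return false;
        }

        if (filter == null) {
            return true; // no filter supplied: all files are fetched
        }

        // Assumption: the regex must match the entire candidate string.
        String candidate = matchNameOnly ? name : relativePath;
        return filter.matcher(candidate).matches();
    }
}
```

With Filter Match Name Only left at true, a pattern such as `.*\.seq` is compared against `data.seq`; with it set to false, the same file would be compared as `abc/data.seq`, so a pattern like `abc/.*` can select by subdirectory.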
Relationships:
Name | Description |
passthrough | If this processor has an input queue for some reason, then FlowFiles arriving on that input are transferred to this relationship |
success | All files retrieved from HDFS are transferred to this relationship |
Reads Attributes:
None specified.
Writes Attributes:
Name | Description |
filename | The name of the file that was read from HDFS. |
path | The path is set to the relative path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3". |
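The two path examples above can be reproduced with a short Java sketch. This is an illustration of the documented behavior only, not the processor's code; the class HdfsPathAttribute and the method relativeDir are hypothetical names, and the second argument is assumed to be the HDFS directory that contains the file.

```java
// Hypothetical helper (not part of NiFi): derives a "path" attribute value
// that matches the examples given in the attribute description.
public class HdfsPathAttribute {

    /**
     * @param directoryProperty value of the Directory property, e.g. "/tmp"
     * @param fileParentDir     HDFS directory that contains the file, e.g. "/tmp/abc/1/2/3"
     * @return "./" for files directly under the Directory, otherwise the relative subdirectory
     */
    public static String relativeDir(String directoryProperty, String fileParentDir) {
        String root = stripTrailingSlash(directoryProperty);
        String parent = stripTrailingSlash(fileParentDir);

        if (parent.equals(root)) {
            return "./"; // file sits directly in the configured Directory
        }

        // Build a prefix that ends in exactly one '/', including for root "/".
        String prefix = root.endsWith("/") ? root : root + "/";
        if (!parent.startsWith(prefix)) {
            throw new IllegalArgumentException(parent + " is not under " + root);
        }
        return parent.substring(prefix.length());
    }

    // Removes a single trailing '/' unless the path is the root "/".
    private static String stripTrailingSlash(String p) {
        return (p.length() > 1 && p.endsWith("/")) ? p.substring(0, p.length() - 1) : p;
    }
}
```

For example, `relativeDir("/tmp", "/tmp")` yields `./` and `relativeDir("/tmp", "/tmp/abc/1/2/3")` yields `abc/1/2/3`, matching the values described for the path attribute.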
See Also:
PutHDFS