Description:
Retrieves a listing of files from HDFS. For each file listed, creates a FlowFile that represents the HDFS file so that it can be fetched in conjunction with FetchHDFS. This Processor is designed to run on the Primary Node only in a cluster. If the Primary Node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.
Tags:
hadoop, HDFS, get, list, ingest, source, filesystem
Properties:
In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values.
Name | Default Value | Allowable Values | Description |
Hadoop Configuration Resources | | | A file or comma-separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. |
Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties |
Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties |
Kerberos Relogin Period | 4 hours | | Period of time which should pass before attempting a Kerberos relogin |
Distributed Cache Service | | Controller Service API: DistributedMapCacheClient; Implementation: DistributedMapCacheClientService | Specifies the Controller Service that should be used to maintain state about what has been pulled from HDFS so that if a new node begins pulling data, it won't duplicate all of the work that has been done. |
Directory | | | The HDFS directory from which files should be read |
Recurse Subdirectories | true | true, false | Indicates whether to list files from subdirectories of the HDFS directory |
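The Distributed Cache Service property exists so that listing state survives a change of Primary Node. The following Python sketch illustrates the general idea only; it is not the Processor's actual implementation. It assumes the state is a single "latest modification timestamp seen" value, and it uses a plain dict as a hypothetical stand-in for the shared cache. The function name, the state key, and the file record layout are all made up for illustration.

```python
# Illustrative sketch only: a dict stands in for the shared cache, and the
# state is assumed to be one "latest modification timestamp" value.

def list_new_files(listing, cache, state_key="listhdfs.latest.timestamp"):
    """Return the files in `listing` that are newer than the stored state.

    `listing` is a list of dicts with hypothetical keys "path" and
    "lastModified" (milliseconds since the epoch). `cache` is any
    dict-like object shared between nodes.
    """
    latest_seen = cache.get(state_key, -1)
    new_files = sorted(
        (f for f in listing if f["lastModified"] > latest_seen),
        key=lambda f: f["lastModified"],
    )
    if new_files:
        # Persist the newest timestamp so that the next run, possibly on a
        # different node, skips everything already listed.
        cache[state_key] = new_files[-1]["lastModified"]
    return new_files


# One node lists two files; a "new" node sharing the cache lists nothing new.
shared_cache = {}
hdfs_listing = [
    {"path": "/tmp/a", "lastModified": 100},
    {"path": "/tmp/b", "lastModified": 200},
]
first_run = list_new_files(hdfs_listing, shared_cache)
second_run = list_new_files(hdfs_listing, shared_cache)
print(len(first_run), len(second_run))
```

A scheme this simple would skip a file whose modification time equals the stored timestamp but which appears after the listing ran, so a real implementation needs extra bookkeeping for that case.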
Relationships:
Name | Description |
success | All FlowFiles are transferred to this relationship |
Reads Attributes:
None specified.
Writes Attributes:
Name | Description |
filename | The name of the file that was read from HDFS. |
path | The path is set to the absolute path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "/tmp". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "/tmp/abc/1/2/3". |
hdfs.owner | The user that owns the file in HDFS |
hdfs.group | The group that owns the file in HDFS |
hdfs.lastModified | The timestamp of when the file in HDFS was last modified, as milliseconds since midnight Jan 1, 1970 UTC |
hdfs.length | The number of bytes in the file in HDFS |
hdfs.replication | The number of HDFS replicas for the file |
hdfs.permissions | The permissions for the file in HDFS. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example, rw-rw-r-- |
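As a rough illustration of how a downstream script might interpret these attributes, the Python sketch below rebuilds the full HDFS path from path and filename, converts hdfs.lastModified to a UTC datetime, and turns the hdfs.permissions string into a numeric mode. The helper names and the sample attribute values are hypothetical, and the permission parser assumes the string contains only the characters r, w, x, and - (it rejects anything else, such as a sticky-bit marker).

```python
import posixpath
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def full_hdfs_path(attrs):
    """Join the directory ("path") and the file name into one HDFS path."""
    return posixpath.join(attrs["path"], attrs["filename"])


def last_modified_utc(attrs):
    """Convert hdfs.lastModified (milliseconds since the epoch) to a datetime."""
    return EPOCH + timedelta(milliseconds=int(attrs["hdfs.lastModified"]))


def permissions_to_mode(perms):
    """Convert a 9-character string such as 'rw-rw-r--' to an integer mode."""
    if len(perms) != 9:
        raise ValueError("expected 9 permission characters")
    mode = 0
    for i, ch in enumerate(perms):
        expected = "rwx"[i % 3]
        if ch == expected:
            mode |= 1 << (8 - i)  # leftmost character is the highest bit
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return mode


# Hypothetical attribute values, for illustration only.
attrs = {
    "filename": "data.csv",
    "path": "/tmp/abc/1/2/3",
    "hdfs.lastModified": "86400000",
    "hdfs.permissions": "rw-rw-r--",
}
print(full_hdfs_path(attrs))                               # /tmp/abc/1/2/3/data.csv
print(last_modified_utc(attrs).isoformat())                # 1970-01-02T00:00:00+00:00
print(oct(permissions_to_mode(attrs["hdfs.permissions"])))  # 0o664
```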
See Also:
GetHDFS, FetchHDFS, PutHDFS