XML Reader Batch Source
Plugin version: 2.11.0
The XML Reader plugin is a source plugin that allows users to read XML files stored on HDFS.
A typical use case: XML files, ranging in size from small to very large, have been dropped into HDFS and need to be read. The XML Reader reads and parses the files; when it is used together with the XMLParser plugin, individual fields can be extracted. The reader emits each XML node matched by the Node Path property as a separate event.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Reference Name | No | Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc. |
Path | Yes | Required. Path to the file(s) to be read. If a directory is specified, terminate the path name with a '/'. This leverages glob syntax as described in the Java Documentation. |
Node Path | Yes | Required. Node path (XPath) to emit as an individual event from the XML schema. Example: '/book/price' to read only the price from under the book node. For more information about XPaths, see the Java Documentation. |
Action After Processing File | No | Required. Action to be taken after processing of the XML file. Possible actions are: (DELETE) delete from HDFS; (ARCHIVE) archive to the target location; and (MOVE) move to the target location. Default is None. |
Reprocessing Required | No | Required. Specifies whether the files should be reprocessed. If set to Default is Yes. |
Temporary Folder | Yes | Required. An existing folder path with read and write access for the current user. This is required for storing temporary files containing paths of the processed XML files. These temporary files will be read at the end of the job to update the file track table. Default is |
File Pattern | Yes | Optional. The regular expression pattern used to select specific files. This should be used in cases when the glob syntax in the |
Target Folder | Yes | Optional. Target folder path, used when the action selected for after processing is either ARCHIVE or MOVE. The target folder must be an existing directory. |
Enable processing external entities | Yes | Optional. Enables processing of external entities while reading the XML file. Default is Off. |
Enable XML parser to support DTDs | No | Optional. Enables DTD support while processing the XML file. This property needs to be set Default is Off. |
Output Schema | No | Required. The output schema for the data. |
Usage Notes
When specifying a regular expression for filtering files, you must use glob syntax in the folder path. This usually means ending the path with '/*'.
Here are some regular expression pattern examples:
Use '^' to select files with names starting with 'catalog', such as '^catalog'.
Use '$' to select files with names ending with 'catalog.xml', such as 'catalog.xml$'.
Use '.*' to select files with a name that contains 'catalogBook', such as 'catalogBook.*'.
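The three patterns above can be tried out with a short Python sketch. It is only an illustration, not the plugin's implementation: the file names are hypothetical, and it assumes the pattern may match anywhere in the file name (search semantics), which is what the '^' and '$' examples suggest.

```python
import re

# Hypothetical file names, as if matched by a folder path ending in '/*'.
file_names = [
    "catalog.xml",
    "catalogBook_2020.xml",
    "old_catalog.xml",
    "my_catalogBook.xml",
    "orders.xml",
]


def select_files(pattern, names):
    """Return the names in which the regular expression is found.

    Assumption: the pattern is searched for anywhere in the name,
    so '^' and '$' are needed to anchor it to the start or the end.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


starts_with = select_files(r"^catalog", file_names)
ends_with = select_files(r"catalog.xml$", file_names)
contains = select_files(r"catalogBook.*", file_names)

print(starts_with)
print(ends_with)
print(contains)
```

Note that the unescaped '.' in 'catalog.xml$' matches any single character; write 'catalog\.xml$' if only a literal dot should match.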
Example
This example reads data from the folder hdfs:/cdap/source/xmls/ and emits XML records on the basis of the node path /catalog/book/title. It will generate structured records with the fields offset, fileName, and record. It will move the XML files to the target folder hdfs:/cdap/target/xmls/ and update the processed file information in the table named trackingTable.
Property | Value |
---|---|
Reference Name | |
Path | |
Node Path | |
Action After Processing File | |
Reprocessing Required | |
Temporary Folder | |
File Pattern | |
Target Folder | |
For this XML as an input:
<catalog>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price><base>5.95</base><tax><surcharge>13.00</surcharge><excise>13.00</excise></tax></price>
<publish_date>2001-03-10</publish_date>
<description><name><name>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</name></name></description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price><base>5.95</base><tax><surcharge>14.00</surcharge><excise>14.00</excise></tax></price>
<publish_date>2001-09-10</publish_date>
<description><name>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</name></description>
</book>
</catalog>
The output records will be:
offset | filename | record |
---|---|---|
2 | hdfs:/cdap/source/xmls/catalog.xml | <title>Oberon's Legacy</title> |
13 | hdfs:/cdap/source/xmls/catalog.xml | <title>The Sundered Grail</title> |
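To make the mapping from node path to output records concrete, here is a minimal Python sketch that reproduces the record column for a shortened version of the input above. It is not the plugin's code: the function name emit_records is hypothetical, the standard xml.etree.ElementTree module stands in for the plugin's parser, and the offset field is left out because this document does not specify how it is computed.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample catalog shown above.
SAMPLE_XML = """<catalog>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
  </book>
  <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
  </book>
</catalog>"""


def emit_records(xml_text, node_path, file_name):
    """Return one record per node matched by an absolute node path.

    Hypothetical helper: only simple paths such as '/catalog/book/title'
    are handled, and the first segment must name the root element.
    """
    root = ET.fromstring(xml_text)
    segments = [s for s in node_path.split("/") if s]
    if not segments or segments[0] != root.tag:
        return []
    if len(segments) == 1:
        matches = [root]
    else:
        # ElementTree paths are relative to the element they are called on.
        matches = root.findall("./" + "/".join(segments[1:]))

    records = []
    for node in matches:
        node.tail = None  # drop the whitespace that follows the closing tag
        records.append({
            "fileName": file_name,
            "record": ET.tostring(node, encoding="unicode"),
        })
    return records


records = emit_records(
    SAMPLE_XML, "/catalog/book/title", "hdfs:/cdap/source/xmls/catalog.xml"
)
for rec in records:
    print(rec["fileName"], rec["record"])
```

Each matched title node becomes its own record, which is why a single input file yields two output rows in the table above.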
Created in 2020 by Google Inc.