Parser Configurations¶
Parsers convert information from files and other data sources into structured data.
FactoryTX has a few built-in parsers available, which are described below.
To learn more about creating a custom parser, please refer to the Creating a custom parser guide in Tutorials.
Attachment Parser¶
The attachment parser generates file attachment records from input
files.
Configuration:
Required and optional properties that can be configured for an attachment parser:
- parser_name: A customizable name used by streams to identify the parser.
- parser_type: Type of parser to apply. Should be `attachment`.
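
As a sketch, a complete attachment parser configuration only needs these two properties; the parser_name value below is illustrative:

```json
{
    "parsers": [
        {
            "parser_name": "Inspection photos",
            "parser_type": "attachment"
        }
    ]
}
```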
CSV Parser¶
The csv parser converts delimiter-separated files (e.g. .csv and .tsv)
into structured data.
Example:

Let's say we are given some tab-separated data:

```
Timestamp            Serial     Pass / Fail
2018-02-06 13:48:10  001029281  PASS
2018-02-06 13:52:29  001029283  PASS
2018-02-06 13:52:55  001029284  FAIL
```

To read the data, our csv parser configuration will look something like this:

```json
{
    "parsers": [
        {
            "parser_name": "Quality parser",
            "parser_type": "csv",
            "filter_stream": ["*"],
            "file_format": "csv",
            "separator": "\t"
        }
    ]
}
```
Configuration:
Required and optional properties that can be configured for a csv parser:
- parser_name: A customizable name used by streams to identify the parser. 
- parser_type: Type of parser to apply. Should be `csv`.
- file_encoding: Encoding (e.g. ‘utf-16’) of the content of the CSV file. Available encodings are derived from Python’s standard built-in codecs. If no encoding is specified, the system will make a best effort to guess the encoding, starting with ‘utf-8’. Encoding autodetection can be fragile, so it’s a good idea to configure the file encoding instead of letting the system guess. 
- separator: Delimiter for CSV columns. This defaults to `,`. Other common values include `\t` for tab-delimited (.tsv) files and `|` for pipe-delimited files.
- header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead. 
- column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.
- skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped. 
- column_data_types: Data type of the column. Although the parser will automatically guess the data type of every column, sometimes it guesses incorrectly. For example, a serial number may be treated as an integer and will lose leading zeros. By specifying the data type as `string`, the leading zeros will be kept (see the example following this list). This property is a list of column data types, each with the following properties:
  - column_name: Name of the column (e.g. `serial_number`)
  - data_type: Data type of the value. Available types are: `string`, `integer`, and `float`.
- skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to `false`.
- chunk_size: Maximum number of rows to load and process at a time. If the number of rows in an input file exceeds this value (which defaults to 10,000), then FactoryTX will load and process data from the file in batches of `chunk_size` rows. Increasing this value beyond the default will not significantly improve performance, but may lead to high memory use.
- compression: Compression format that the parser should use to decompress files on-the-fly while they're being processed. Supported compression formats are: `gzip`, `bz2`, `zip`, and `xz`.
- advanced: Additional parameters to pass directly to `pandas.read_csv` or `pandas.read_excel`. Advanced options override the corresponding named options, so setting `sep` as an advanced setting will override the definition of `separator`. Options injected by transforms will override advanced options, so beware!
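
Putting several of these options together, a csv parser that reads tab-delimited files and preserves leading zeros in a serial number column might look like the following sketch; the parser_name, filter_stream, and column_name values are illustrative:

```json
{
    "parsers": [
        {
            "parser_name": "Quality parser",
            "parser_type": "csv",
            "filter_stream": ["*"],
            "file_encoding": "utf-8",
            "separator": "\t",
            "column_data_types": [
                {
                    "column_name": "serial_number",
                    "data_type": "string"
                }
            ]
        }
    ]
}
```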
Excel Parser¶
The excel parser converts Excel files (.xls and .xlsx) into
structured data.
Configuration:
Required and optional properties that can be configured for an excel parser:
- parser_name: A customizable name used by streams to identify the parser. 
- parser_type: Type of parser to apply. Should be `excel`.
- header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead. 
- column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.
- skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped. Useful in combination with column_names if the header row is formatted oddly or if the headers span multiple rows. 
- column_data_types: Data type of the column. Although the parser will automatically guess the data type of every column, sometimes it guesses incorrectly. For example, a serial number may be treated as an integer and will lose leading zeros. By specifying the data type as `string`, the leading zeros will be kept. This property is a list of column data types, each with the following properties:
  - column_name: Name of the column (e.g. `serial_number`)
  - data_type: Data type of the value. Available types are: `string`, `integer`, and `float`.
- sheet_names: List of sheet name(s) to read from each Excel file (e.g. ["Sheet1", "Sheet5"]). If not specified, the first sheet will be used (see the example below).
- advanced: Additional parameters to pass directly to `pandas.read_csv` or `pandas.read_excel`. Advanced options override the corresponding named options, so setting `sep` as an advanced setting will override the definition of `separator`. Options injected by transforms will override advanced options, so beware!
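
For example, an excel parser that reads two named sheets and skips a two-row banner above the header row might be configured as follows; the parser and sheet names are illustrative:

```json
{
    "parsers": [
        {
            "parser_name": "Production report parser",
            "parser_type": "excel",
            "skip_rows": 2,
            "sheet_names": ["Daily", "Weekly"]
        }
    ]
}
```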
Parquet Parser¶
The parquet parser converts Parquet files into Pandas DataFrames. It can
parse data from local files or from S3/Azure data lake references.
Configuration:
Required and optional properties that can be configured for a parquet parser:
- parser_name: A customizable name used by streams to identify the parser. 
- parser_type: Type of parser to apply. Should be `parquet`.
- column_names: Names of the columns of the data to read from the file.
- chunk_size: Number of rows to load and process at a time (useful for large datasets).
- skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to `false`.
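
A parquet parser configuration using these properties might look like the following sketch; the parser and column names are illustrative:

```json
{
    "parsers": [
        {
            "parser_name": "Telemetry parser",
            "parser_type": "parquet",
            "column_names": ["timestamp", "sensor_id", "value"],
            "chunk_size": 10000,
            "skip_empty_files": true
        }
    ]
}
```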
Pickle Parser¶
The pickle parser loads data from pickle files (.pkl). It can be used
to load pickle files generated by the TapTransform.
Configuration:
Required and optional properties that can be configured for a pickle parser:
- parser_name: A customizable name used by streams to identify the parser. 
- parser_type: Type of parser to apply. Should be `pickle`.
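
Since the pickle parser takes only these two properties, its configuration is minimal; the parser_name value below is illustrative:

```json
{
    "parsers": [
        {
            "parser_name": "Tap data parser",
            "parser_type": "pickle"
        }
    ]
}
```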