Parsers Configurations
Parsers convert information from files and other data sources into structured data.
FactoryTX has a few built-in parsers available, each described in its own section below.
To learn more about creating a custom parser, please refer to the Creating a custom parser guide in Tutorials.
Attachment Parser
The attachment parser generates file attachment records from input files.
Configuration:
Required and optional properties that can be configured for an attachment parser:
parser_name: A customizable name used by streams to identify the parser.
parser_type: Type of parser to apply. Should be attachment.
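As a minimal sketch, an attachment parser entry could look something like the following. It follows the shape of the csv example elsewhere in this guide; the parser name is a made-up placeholder. That csv example also sets filter_stream and file_format, which are not part of the property list here, so they are left out of this sketch.

```json
{
  "parsers": [
    {
      "parser_name": "Attachment parser",
      "parser_type": "attachment"
    }
  ]
}
```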
CSV Parser
The csv parser converts delimiter-separated files (e.g. .csv and .tsv) into structured data.
- Example:
  Let’s say we are given some tab-separated data.

      Timestamp            Serial     Pass / Fail
      2018-02-06 13:48:10  001029281  PASS
      2018-02-06 13:52:29  001029283  PASS
      2018-02-06 13:52:55  001029284  FAIL

  To read the data, our csv parser configuration will look something like this:

      {
        "parsers": [
          {
            "parser_name": "Quality parser",
            "parser_type": "csv",
            "filter_stream": [ "*" ],
            "file_format": "csv",
            "separator": "\t"
          }
        ]
      }
Configuration:
Required and optional properties that can be configured for a csv parser:
parser_name: A customizable name used by streams to identify the parser.
parser_type: Type of parser to apply. Should be csv.
file_encoding: Encoding (e.g. ‘utf-16’) of the content of the CSV file. Available encodings are derived from Python’s standard built-in codecs. If no encoding is specified, the system will make a best effort to guess the encoding, starting with ‘utf-8’. Encoding autodetection can be fragile, so it’s a good idea to configure the file encoding instead of letting the system guess.
separator: Delimiter for CSV columns. This defaults to a comma (,). Other common values include \t (tab) for tab-delimited (.tsv) files and | for pipe-delimited files.
header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead.
column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.
skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped.
column_data_types: Data type of the column. Although the parser will automatically guess the data type of every column, sometimes it guesses incorrectly. For example, a serial number may be treated as an integer and will lose leading zeros. By specifying the data type as string, the leading zeros will be kept. This property is a list of column data types, each with the following properties:
  column_name: Name of the column (e.g. serial_number).
  data_type: Data type of the value. Available types are: string, integer, and float.
skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to false.
chunk_size: Maximum number of rows to load and process at a time. If the number of rows in an input file exceeds this value (which defaults to 10,000), then FactoryTX will load and process data from the file in batches of chunk_size rows. Increasing this value beyond the default will not significantly improve performance, but may lead to high memory use.
compression: Compression format that the parser should use to decompress files on-the-fly while they’re being processed. Supported compression formats are: gzip, bz2, zip, and xz.
advanced: Additional parameters to pass directly to pandas.read_csv or pandas.read_excel. Advanced options override the corresponding named options, so setting sep as an advanced setting will override the definition of separator. Options injected by transforms will override advanced options, so beware!
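To illustrate how several of these properties combine, here is a sketch of a csv parser entry for a gzip-compressed, tab-separated file whose serial numbers must keep their leading zeros. The Serial column name comes from the example data earlier in this section. The encoding, chunk size, and choice of gzip are illustrative assumptions. The advanced value assumes that the property takes a JSON object of pandas.read_csv keyword arguments (na_values is an existing pandas.read_csv parameter); check this against your FactoryTX version before relying on it.

```json
{
  "parsers": [
    {
      "parser_name": "Quality parser",
      "parser_type": "csv",
      "file_encoding": "utf-8",
      "separator": "\t",
      "column_data_types": [
        { "column_name": "Serial", "data_type": "string" }
      ],
      "skip_empty_files": true,
      "chunk_size": 5000,
      "compression": "gzip",
      "advanced": { "na_values": ["N/A"] }
    }
  ]
}
```

Setting file_encoding explicitly avoids relying on encoding autodetection, and declaring Serial as a string keeps a value such as 001029281 from being read as the integer 1029281.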
Excel Parser
The excel parser converts Excel files (.xls and .xlsx) into structured data.
Configuration:
Required and optional properties that can be configured for an excel parser:
parser_name: A customizable name used by streams to identify the parser.
parser_type: Type of parser to apply. Should be excel.
header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead.
column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.
skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped. Useful in combination with column_names if the header row is formatted oddly or if the headers span multiple rows.
column_data_types: Data type of the column. Although the parser will automatically guess the data type of every column, sometimes it guesses incorrectly. For example, a serial number may be treated as an integer and will lose leading zeros. By specifying the data type as string, the leading zeros will be kept. This property is a list of column data types, each with the following properties:
  column_name: Name of the column (e.g. serial_number).
  data_type: Data type of the value. Available types are: string, integer, and float.
sheet_names: List of sheet name(s) to read from each Excel file (e.g. [“Sheet1”, “Sheet5”]). If not specified, the first sheet will be used.
advanced: Additional parameters to pass directly to pandas.read_csv or pandas.read_excel. Advanced options override the corresponding named options, so setting sep as an advanced setting will override the definition of separator. Options injected by transforms will override advanced options, so beware!
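As a sketch of how skip_rows and column_names work together, the entry below reads two sheets, skips a header that spans two rows, and supplies its own column names instead. The sheet names are taken from the sheet_names description above; the parser name, the number of skipped rows, and the column names are hypothetical.

```json
{
  "parsers": [
    {
      "parser_name": "Inspection workbook parser",
      "parser_type": "excel",
      "sheet_names": ["Sheet1", "Sheet5"],
      "skip_rows": 2,
      "column_names": ["timestamp", "serial_number", "result"],
      "column_data_types": [
        { "column_name": "serial_number", "data_type": "string" }
      ]
    }
  ]
}
```

Because column_names is set, header_row_number is ignored and every row after the skipped ones is treated as data, so the list needs exactly one name per column in the sheet.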
Parquet Parser
The parquet parser converts Parquet files into pandas DataFrames.
It can parse from files or from S3 / Azure Data Lake references.
Configuration:
Required and optional properties that can be configured for a parquet parser:
- parser_name: A customizable name used by streams to identify the parser.
- parser_type: Type of parser to apply. Should be parquet.
- column_names: Names of the columns of the data to read from the file.
- chunk_size: Number of rows to process at one time (useful for large datasets).
- skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to false.
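A parquet parser entry using all of these properties might look like the sketch below. It assumes column_names takes a list of strings, as it does for the csv and excel parsers; the parser name, column names, and chunk size are hypothetical values.

```json
{
  "parsers": [
    {
      "parser_name": "Parquet parser",
      "parser_type": "parquet",
      "column_names": ["timestamp", "serial_number"],
      "chunk_size": 10000,
      "skip_empty_files": true
    }
  ]
}
```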
Pickle Parser
The pickle parser loads data from pickle files (.pkl). It can be used to load pickle files generated from the TapTransform.
Configuration:
Required and optional properties that can be configured for a pickle parser:
parser_name: A customizable name used by streams to identify the parser.
parser_type: Type of parser to apply. Should be pickle.
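Since the pickle parser has only these two properties, a minimal entry could be sketched as follows. The parser name is a placeholder, and the surrounding "parsers" list mirrors the csv example earlier in this guide.

```json
{
  "parsers": [
    {
      "parser_name": "Tap output parser",
      "parser_type": "pickle"
    }
  ]
}
```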