Parser Configurations

Parsers convert information from files and other data sources into structured data.

FactoryTX has several built-in parsers available, each described in the sections below.

To learn more about creating a custom parser, please refer to the Creating a custom parser guide in Tutorials.

Attachment Parser

The attachment parser generates file attachment records from input files.

Configuration:

Required and optional properties that can be configured for an attachment parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be attachment.
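
A minimal attachment parser configuration might look like the following sketch (the parser_name value is only an illustrative placeholder):

{
  "parsers": [
    {
      "parser_name": "Attachment parser",
      "parser_type": "attachment"
    }
  ]
}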

CSV Parser

The csv parser converts delimiter-separated files (e.g. .csv and .tsv) into structured data.

Example:

Let’s say we are given some tab-separated data.

Timestamp             Serial      Pass / Fail
2018-02-06 13:48:10   001029281   PASS
2018-02-06 13:52:29   001029283   PASS
2018-02-06 13:52:55   001029284   FAIL

To read the data, our csv parser configuration will look something like this:

{
  "parsers": [
    {
      "parser_name": "Quality parser",
      "parser_type": "csv",
      "filter_stream": [
        "*"
      ],
      "file_format": "csv",
      "separator": "        "
    }
  ]
}
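
Note that JSON strings cannot contain a literal tab character, so the tab separator is written using the escape sequence \t.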

Configuration:

Required and optional properties that can be configured for a csv parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be csv.

  • file_encoding: Encoding (e.g. ‘utf-16’) of the content of the CSV file. Available encodings are derived from Python’s standard built-in codecs. If no encoding is specified, the system will make a best effort to guess the encoding, starting with ‘utf-8’. Encoding autodetection can be fragile, so it’s a good idea to configure the file encoding instead of letting the system guess.

  • separator: Delimiter for CSV columns. This defaults to ,. Other common values include \t for tab-delimited (.tsv) files and | for pipe-delimited files.

  • header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead.

  • column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.

  • skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped.

  • column_data_types: Data types of specific columns. Although the parser will automatically guess the data type of every column, it sometimes guesses incorrectly. For example, a serial number may be treated as an integer and lose its leading zeros; specifying the data type as string keeps them (see the example after this list). This property is a list of column data types, each with the following properties:

    • column_name: Name of the column (e.g. serial_number)

    • data_type: Data type of the value. Available types are: string, integer, and float.

  • skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to false.

  • chunk_size: Maximum number of rows to load and process at a time. If the number of rows in an input file exceeds this value (which defaults to 10,000), then FactoryTX will load and process data from the file in batches of chunk_size rows. Increasing this value beyond the default will not significantly improve performance, but may lead to high memory use.

  • compression: Compression format that the parser should use to decompress files on-the-fly while they’re being processed. Supported compression formats are: gzip, bz2, zip, and xz.

  • advanced: Additional parameters to pass directly to pandas.read_csv or pandas.read_excel. Advanced options override the corresponding named options, so setting sep as an advanced setting will override the definition of separator. Options injected by transforms will override advanced options, so beware!
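
For instance, to keep the leading zeros in the Serial column of the tab-separated data above, a configuration along these lines could force that column to string (a sketch; the file_encoding and column name values are illustrative):

{
  "parsers": [
    {
      "parser_name": "Quality parser",
      "parser_type": "csv",
      "file_encoding": "utf-8",
      "separator": "\t",
      "column_data_types": [
        {
          "column_name": "Serial",
          "data_type": "string"
        }
      ]
    }
  ]
}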

Excel Parser

The excel parser converts Excel files (.xls and .xlsx) into structured data.

Configuration:

Required and optional properties that can be configured for an excel parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be excel.

  • header_row_number: Row number containing the column names. Defaults to the first line of the file. If all rows contain data, then the column names should be set via column_names instead.

  • column_names: Names of the columns of the data. If set, then header_row_number is ignored and all rows are treated as data. There must be one name per column of data in the file.

  • skip_rows: Number of rows to skip at the beginning of the file before reading column headers and data. Defaults to zero, so no rows are skipped. Useful in combination with column_names if the header row is formatted oddly or if the headers span multiple rows.

  • column_data_types: Data types of specific columns. Although the parser will automatically guess the data type of every column, it sometimes guesses incorrectly. For example, a serial number may be treated as an integer and lose its leading zeros; specifying the data type as string keeps them (see the sketch after this list). This property is a list of column data types, each with the following properties:

    • column_name: Name of the column (e.g. serial_number)

    • data_type: Data type of the value. Available types are: string, integer, and float.

  • sheet_names: List of sheet name(s) to read from each Excel file (e.g. ["Sheet1", "Sheet5"]). If not specified, the first sheet will be used.

  • advanced: Additional parameters to pass directly to pandas.read_csv or pandas.read_excel. Advanced options override the corresponding named options, so setting sep as an advanced setting will override the definition of separator. Options injected by transforms will override advanced options, so beware!
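
As a sketch, an excel parser that reads two specific sheets and preserves leading zeros in a serial-number column might be configured like this (the sheet and column names are illustrative):

{
  "parsers": [
    {
      "parser_name": "Excel parser",
      "parser_type": "excel",
      "sheet_names": ["Sheet1", "Sheet5"],
      "column_data_types": [
        {
          "column_name": "serial_number",
          "data_type": "string"
        }
      ]
    }
  ]
}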

Parquet Parser

The parquet parser converts Parquet files into pandas DataFrames. It can parse local files as well as S3 and Azure data lake references.

Configuration:

Required and optional properties that can be configured for a parquet parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be parquet.

  • column_names: Names of the columns of the data to read from each file.

  • chunk_size: Number of rows to process at one time (useful for large datasets).

  • skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to false.
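
A minimal parquet parser configuration might look like the following sketch (the chunk_size value is illustrative):

{
  "parsers": [
    {
      "parser_name": "Parquet parser",
      "parser_type": "parquet",
      "chunk_size": 10000,
      "skip_empty_files": true
    }
  ]
}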

Pickle Parser

The pickle parser loads data from pickle files (.pkl). It can be used to load pickle files generated by the TapTransform.

Configuration:

Required and optional properties that can be configured for a pickle parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be pickle.
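
Since the pickle parser takes no format-specific options, its configuration is minimal (the parser_name value is an illustrative placeholder):

{
  "parsers": [
    {
      "parser_name": "Pickle parser",
      "parser_type": "pickle"
    }
  ]
}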

GreenButton XML Parser

The green_button_xml parser converts GreenButton-schema XML files into pandas DataFrames.

Green Button Data Specification:

Example Configuration:

{
  "parsers": [
    {
      "parser_name": "GreenButton parser",
      "parser_type": "green_button_xml",
    }
  ]
}

Configuration:

Required and optional properties that can be configured for a green_button_xml parser:

  • parser_name: A customizable name used by streams to identify the parser.

  • parser_type: Type of parser to apply. Should be green_button_xml.

  • skip_empty_files: Suppress errors that occur when an empty file is parsed. If true, empty files will be skipped; otherwise an error will be raised. Defaults to false.