SMB Data Receiver

The smb data receiver connects to an SMB filesystem, e.g. a Windows file share or a network-attached storage device, and reads files stored on it.

Example:

If we want to retrieve data for WeldingMachine_1 from all CSV files with names that start with machine_1 on an SMB server at 1.2.3.4, our configuration will look something like this:

{
  "data_receiver": [
    {
      "data_receiver_name": "my_smb_receiver",
      "protocol": "smb",
      "delete_completed": true,
      "poll_interval": 4,
      "connections": [
        {
          "connection_name": "my_connection",
          "host": "1.2.3.4",
          "username": "admin",
          "password": "password",
          "shared_folder": "Folder Name"
        }
      ],
      "streams": [
        {
          "asset": "WeldingMachine_1",
          "stream_type": "cycle",
          "file_filter": [
            "machine_1*.csv"
          ],
          "parser": "My_CSV_Parser"
        }
      ],
      "parsers": [
        {
          "parser_name": "My_CSV_Parser",
          "parser_type": "csv"
        }
      ]
    }
  ]
}

Configuration:

The smb data receiver uses a file receiver that connects to an SMB filesystem.

All file receivers support these required and optional properties:

  • data_receiver_name: Unique name of the data receiver. This name will be used to track the progress state of the data stream.

  • protocol: Protocol to be used.

  • streams: List of input data streams to read from.

    • asset: Asset identifier

    • stream_type: Type of data stream

    • parser: Name of the parser used to convert the file. This has to match one of the parser_name values in the parsers section.

    • file_filter: List of files to filter on. Items can be regular expressions. You must select either file_filter or path_filter but not both.

    • path_filter: List of paths to filter on. Items can be regular expressions. You must select either file_filter or path_filter but not both.

  • connections: A list of connection settings used to connect to the data source. Please refer to the connection settings below.

  • process_files_alphabetically: true to process all of the files retrieved by the connection(s) in alphabetical order. By default, this option is false, so files are processed by last modified (or created) time with oldest files first.

  • process_ordered_connections: If there are multiple connections and you want to process (and transmit) the files received from one connection before another, set this option to true to process files in the order in which the connections are listed. By default, this option is set to false. This option mostly matters for the Kafka transmit, because events within a Kafka topic need to be in order. For example, if one connection receives historical files and another connection receives “realtime” files, you’ll want to enable this setting and list the historical connection first, so that events in the Kafka topic are in chronological order. Please note that the order of the files within each connection is maintained: if process_files_alphabetically is enabled, files will be parsed in alphabetical order; if disabled, files will be parsed in chronological order based on the last modified time.

  • parsers: A list of parsers that can be used for the input streams. Each parser has the following default properties:

    • parser_name: A customizable name used by streams to identify the parser.

    • parser_type: Type of parser to apply (e.g. csv)

    • parser_version: Version of the parser type to apply.

    FactoryTX has a few built-in parsers available for you to use. Please refer to the Parsers Configurations section of the manual for more details about them.

  • skip_parser_errors: Whether the receiver will allow parser errors when processing files. If true, parsing errors will be skipped and logged in a quarantine log file while the receiver continues processing. Otherwise, when a parsing error is encountered, the receiver will halt until the file is repaired. (Only works for CSV files.)

  • archive_completed: true to keep a local copy of each file received from the remote server, or false if local copies should not be kept. If enabled, files will be archived until their total size exceeds the amount specified in the max_archive_size_mb parameter.

  • max_archive_size_mb: If archive_completed is true, delete archived files once their total size (in MB) is greater than this value. A negative value means never delete any files. Defaults to -1.

  • delete_completed: true if files should be deleted from the data directory after they have been received by FactoryTX, or false if files should never be deleted. Defaults to false to avoid accidentally losing data.

    • archive_completed and delete_completed are independent of each other. Archiving will create a copy of the file in a new directory, while deleting will remove the original file.

  • read_in_progress_files: Whether to read files as soon as they are created, or wait for the upstream service to stop writing to the file before reading it.

  • emit_file_metadata: Whether or not to inject additional columns into each record containing metadata about the file it came from. If this setting is enabled then every record will contain fields named file_name, file_path, and file_timestamp.

  • poll_interval: Number of seconds to wait before fetching new data from the server.

  • temporary_file_directory: Specify the directory to temporarily store files that have been downloaded from the connection(s). By default, the directory used is /tmp.
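
As a rough sketch of how the optional file-handling properties above might be combined, the configuration below extends the earlier example so that files are processed alphabetically, a local archive of received files is kept (capped at an assumed 500 MB), the originals are left on the share, and CSV parsing errors are quarantined instead of halting the receiver. Every value here is illustrative rather than a recommended default, and the options are placed at the receiver level on the assumption that this matches the nesting of the property list.

```json
{
  "data_receiver": [
    {
      "data_receiver_name": "my_smb_receiver",
      "protocol": "smb",
      "poll_interval": 60,
      "process_files_alphabetically": true,
      "skip_parser_errors": true,
      "archive_completed": true,
      "max_archive_size_mb": 500,
      "delete_completed": false,
      "emit_file_metadata": true,
      "temporary_file_directory": "/tmp",
      "connections": [
        {
          "connection_name": "my_connection",
          "host": "1.2.3.4",
          "username": "admin",
          "password": "password",
          "shared_folder": "Folder Name"
        }
      ],
      "streams": [
        {
          "asset": "WeldingMachine_1",
          "stream_type": "cycle",
          "file_filter": [
            "machine_1*.csv"
          ],
          "parser": "My_CSV_Parser"
        }
      ],
      "parsers": [
        {
          "parser_name": "My_CSV_Parser",
          "parser_type": "csv"
        }
      ]
    }
  ]
}
```

Because archive_completed and delete_completed are independent, this sketch keeps both the archived copy and the original file. Setting delete_completed to true instead would remove the original from the share once it has been received.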

Connection Settings:

Required and optional properties that can be configured for an SMB connection:

  • connection_name: Unique name for the connection.

  • username: Username to use to connect to the SMB server.

  • password: Password to use to connect to the SMB server.

  • host: IP address or hostname of the SMB server.

  • shared_folder: Name of the shared folder to fetch files from.

  • connection_type: How to connect to the SMB server. Must be either direct_hosted or netbios. Servers running Windows 2000 or newer support direct hosting over TCP, which is easier to configure. Older systems may need to use NetBIOS instead.

  • min_smb_protocol: Minimum SMB version to try to use with the upstream server. See https://www.samba.org/samba/docs/current/man-html/smb.conf.5.html#CLIENTMAXPROTOCOL for the list of acceptable values. There is no known reason to try anything below ‘SMB2’, but some v3 servers will require ‘SMB3’ here.

  • max_smb_protocol: Maximum SMB version to try to use with the upstream server. See https://www.samba.org/samba/docs/current/man-html/smb.conf.5.html#CLIENTMAXPROTOCOL for the list of acceptable values.

  • netbios_name: NetBIOS machine name of the SMB server. Not required for direct hosted SMB.

  • netbios_domain: NetBIOS workgroup or domain name of the SMB server. Defaults to a blank string since it’s not used for direct hosted SMB and is optional for most connections over NetBIOS.

  • data_directory: Path to the subdirectory containing data files in the shared folder. If this option is not set then all files in all subdirectories will be scanned.

  • port: TCP port number for the SMB server. Defaults to port 445 for direct hosted SMB, or 139 for NetBIOS.

  • timeout: Maximum duration (in seconds) of each SMB request.
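
For an older server that needs NetBIOS rather than direct hosting, a connection entry might look roughly like the sketch below. The NetBIOS name (LEGACYNAS), the workgroup (WORKGROUP), the data_directory path, and the timeout value are all hypothetical placeholders, and the forward-slash form of the subdirectory path is an assumption. The port is written out only for clarity, since 139 is already the documented default for NetBIOS.

```json
{
  "connection_name": "legacy_nas",
  "host": "1.2.3.4",
  "username": "admin",
  "password": "password",
  "shared_folder": "Folder Name",
  "connection_type": "netbios",
  "netbios_name": "LEGACYNAS",
  "netbios_domain": "WORKGROUP",
  "port": 139,
  "data_directory": "exports/welding",
  "timeout": 30
}
```

This object would go inside the connections list of the data receiver. For a direct hosted connection, the netbios_name and netbios_domain entries can be dropped and connection_type set to direct_hosted.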

Please refer to Notes on ordered file processing for more details about file receiver options.
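
To illustrate the historical versus “realtime” scenario described under process_ordered_connections, here is a minimal sketch that reads from two subdirectories of the same share and lists the historical connection first. The connection names and the data_directory values (historical and realtime) are hypothetical, and the sketch assumes that the order of entries in the connections list is the order used for processing.

```json
{
  "data_receiver": [
    {
      "data_receiver_name": "my_ordered_smb_receiver",
      "protocol": "smb",
      "process_ordered_connections": true,
      "process_files_alphabetically": false,
      "connections": [
        {
          "connection_name": "historical_files",
          "host": "1.2.3.4",
          "username": "admin",
          "password": "password",
          "shared_folder": "Folder Name",
          "data_directory": "historical"
        },
        {
          "connection_name": "realtime_files",
          "host": "1.2.3.4",
          "username": "admin",
          "password": "password",
          "shared_folder": "Folder Name",
          "data_directory": "realtime"
        }
      ],
      "streams": [
        {
          "asset": "WeldingMachine_1",
          "stream_type": "cycle",
          "file_filter": [
            "machine_1*.csv"
          ],
          "parser": "My_CSV_Parser"
        }
      ],
      "parsers": [
        {
          "parser_name": "My_CSV_Parser",
          "parser_type": "csv"
        }
      ]
    }
  ]
}
```

With process_files_alphabetically left at false, files within each connection would be handled oldest first by last modified time, so the historical backlog is processed in chronological order before any realtime files.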