Notes on ordered file processing

File receivers have configuration settings that control the order in which files are retrieved and processed. This section walks through the combinations of these settings to clarify when and how to use each one. There are three primary settings, each of which can be toggled on or off:

  1. retrieve_ordered_files: A per-connection setting that can speed up the retrieval of files. Refer to the setting description for the conditions that files should meet for better performance. Not every file receiver connection supports this option; currently, only the AWS S3 receiver is officially supported.

  2. process_files_alphabetically: A per-receiver setting that determines whether files are copied into FactoryTX, parsed, and processed in order of file name or in order of last modified time.

  3. process_ordered_connections: A per-receiver setting that determines whether files are copied into FactoryTX, parsed, and processed in the order in which connections are listed in the FactoryTX configuration.

Flow of receiving files

When a file receiver looks for new files to ingest:

  1. The file receiver connection retrieves a list of available files from the data source. The receiver saves this list along with each file’s attributes, such as file name, size, and last modified time. This step is where the retrieve_ordered_files setting is applied. Some file receiver connections (e.g. the AWS S3 receiver) can retrieve only a sublist of available files from the data source, based on the last completed file; enabling the setting turns this feature on. The data source responds faster to a request for a sublist than to a request for all of the available files.

  2. After saving the list of available files, the file receiver will iterate through it, check if the file is new or recently modified, and add the file to the download queue. This step is where the process_files_alphabetically setting is applied. When the setting is enabled, the receiver will iterate through the list of available files in alphabetical order. In effect, files will be downloaded and processed in lexicographic order. When the setting is disabled, the receiver will iterate through the list of available files based on their last modified time, with older files first. In effect, files will be downloaded and processed in chronological order based on their last modified time.

  3. After determining which files are new or modified, the file receiver starts copying the files into FactoryTX and parsing them for transform and transmit components to process. This step is where the process_ordered_connections setting is applied. If it is enabled, the order of the file receiver connections in the FactoryTX configuration determines the download order: all files associated with one connection are downloaded and processed before moving on to files associated with the next connection in the list. If the setting is disabled, the order of connections is disregarded and files are downloaded based on the queue filled in step 2.
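The effect of steps 2 and 3 can be sketched in Python as a single sort. This is a simplified model for illustration, not FactoryTX's implementation: the RemoteFile class, the processing_order function, the integer last modified times, and the tie-breaking rule are all assumptions, and step 1 (listing the files) is taken as already done.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RemoteFile:
    connection: str  # hypothetical: name of the connection that listed the file
    name: str
    mtime: int       # last modified time as a comparable number


def processing_order(files: List[RemoteFile],
                     connection_order: List[str],
                     alphabetically: bool,
                     ordered_connections: bool) -> List[RemoteFile]:
    """Model the download order produced by the two per-receiver settings."""
    def key(f: RemoteFile):
        # Step 2: order by name, or by last modified time (oldest first).
        # Breaking ties with the other attribute is an assumption.
        attr = (f.name, f.mtime) if alphabetically else (f.mtime, f.name)
        # Step 3: optionally group files by their connection's position.
        if ordered_connections:
            return (connection_order.index(f.connection),) + attr
        return attr
    return sorted(files, key=key)


files = [
    RemoteFile("folder_1", "File_2", 2),
    RemoteFile("folder_1", "File_3", 3),
    RemoteFile("folder_1", "File_4", 1),
    RemoteFile("folder_2", "File_1", 2),
    RemoteFile("folder_2", "File_5", 3),
    RemoteFile("folder_2", "File_6", 1),
]
order = processing_order(files, ["folder_2", "folder_1"],
                         alphabetically=True, ordered_connections=True)
print([f.name for f in order])
# ['File_1', 'File_5', 'File_6', 'File_2', 'File_3', 'File_4']
```

With both settings enabled and folder_2 listed before folder_1, every folder_2 file comes first, and names decide the order within each connection.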

Examples

Scenario  retrieve_ordered_files  process_files_alphabetically  process_ordered_connections
1         True*                   False                         False
2         True/False*             True                          False
3         True/False*             True                          True
4         False                   False                         True
5         True*                   False                         True

* File receiver connection supports this feature.


Scenario 1

Files in the data source for a connection are ordered both lexicographically and chronologically by last modified time.

Files in folder_1:

  • File_1 (time_1)

  • File_2 (time_2)

  • File_3 (time_3)

Order of processed files:

  1. File_1 (time_1)

  2. File_2 (time_2)

  3. File_3 (time_3)

The closest unit test for checking that the retrieve_ordered_files flag works is test_fetch_last_completed_file in tests/factorytx/receivers/file/test_state.py. Files should be downloaded and processed in order of last modified time, as demonstrated by test_ordered_by_mtime in tests/factorytx/receivers/file/test_receiver.py.
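Retrieving a sublist based on the last completed file can be modeled as follows. This is a minimal sketch under assumptions: list_after is a hypothetical helper, the listing is assumed to be sorted by name, and the slice stands in for whatever the data source does on its side (S3's ListObjectsV2 API, for instance, accepts a StartAfter key, though this section does not say whether FactoryTX uses it).

```python
from bisect import bisect_right
from typing import List, Optional


def list_after(sorted_names: List[str], last_completed: Optional[str]) -> List[str]:
    """Return only the names that sort after the last completed file."""
    if last_completed is None:
        return list(sorted_names)  # first run: every file is new
    return sorted_names[bisect_right(sorted_names, last_completed):]


listing = ["File_1", "File_2", "File_3"]
print(list_after(listing, "File_1"))  # ['File_2', 'File_3']

# A late arrival that sorts before the last completed file is never returned.
late = sorted(listing + ["File_0"])
print(list_after(late, "File_3"))  # []
```

In this model, a file whose name sorts before the last completed file is skipped, which suggests why the setting expects new files to sort after older ones.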


Scenario 2

Files in the data source for a connection are ordered lexicographically, but not ordered chronologically based on last modified time.

Files in folder_1:

  • File_2 (time_2)

  • File_3 (time_3)

  • File_4 (time_1)

Files in folder_2:

  • File_1 (time_2)

  • File_5 (time_3)

  • File_6 (time_1)

Order of processed files:

  1. File_1 (time_2)

  2. File_2 (time_2)

  3. File_3 (time_3)

  4. File_4 (time_1)

  5. File_5 (time_3)

  6. File_6 (time_1)

The unit test covering this scenario is test_ordered_by_filename in tests/factorytx/receivers/file/test_receiver.py.
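For comparison, the snippet below sorts the scenario 2 files both ways. It is an illustration only: the tuples and integer timestamps are stand-ins for the real file attributes, and the order of files that share a last modified time simply follows Python's stable sort, which may differ from what the receiver does.

```python
# (connection, file name, last modified time) for the scenario 2 files.
files = [
    ("folder_1", "File_2", 2),
    ("folder_1", "File_3", 3),
    ("folder_1", "File_4", 1),
    ("folder_2", "File_1", 2),
    ("folder_2", "File_5", 3),
    ("folder_2", "File_6", 1),
]

# process_files_alphabetically enabled: names decide, connections are ignored.
by_name = [name for _, name, _ in sorted(files, key=lambda f: f[1])]
print(by_name)
# ['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6']

# Disabled: last modified time decides, oldest first.
by_mtime = [name for _, name, _ in sorted(files, key=lambda f: f[2])]
print(by_mtime)
# ['File_4', 'File_6', 'File_2', 'File_1', 'File_3', 'File_5']
```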


Scenario 3

As in scenario 2, files in the data source for a connection are ordered lexicographically but not chronologically by last modified time. In this scenario, however, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

Files in folder_1:

  • File_2 (time_2)

  • File_3 (time_3)

  • File_4 (time_1)

Files in folder_2:

  • File_1 (time_2)

  • File_5 (time_3)

  • File_6 (time_1)

Order of processed files:

  1. File_1 (time_2)

  2. File_5 (time_3)

  3. File_6 (time_1)

  4. File_2 (time_2)

  5. File_3 (time_3)

  6. File_4 (time_1)

The unit test covering this scenario is test_ordered_by_filename_and_connection in tests/factorytx/receivers/file/test_receiver.py.


Scenario 4

Files in the data source are ordered chronologically by last modified time, but not ordered lexicographically. In this scenario, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

Files in folder_1:

  • File_2 (time_2)

  • File_3 (time_3)

  • File_4 (time_1)

Files in folder_2:

  • File_1 (time_2)

  • File_5 (time_3)

  • File_6 (time_1)

Order of processed files:

  1. File_6 (time_1)

  2. File_1 (time_2)

  3. File_5 (time_3)

  4. File_4 (time_1)

  5. File_2 (time_2)

  6. File_3 (time_3)

The unit test covering this scenario is test_ordered_by_mtime_and_connection in tests/factorytx/receivers/file/test_receiver.py.
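One way to reproduce this ordering is with two stable sorts: first by last modified time, then by connection position. The sketch below assumes integer timestamps and a hypothetical connection_rank mapping that places folder_2 first; it is not taken from the FactoryTX source.

```python
# (connection, file name, last modified time) for the scenario 4 files.
files = [
    ("folder_1", "File_2", 2),
    ("folder_1", "File_3", 3),
    ("folder_1", "File_4", 1),
    ("folder_2", "File_1", 2),
    ("folder_2", "File_5", 3),
    ("folder_2", "File_6", 1),
]

# Hypothetical: position of each connection in the configuration.
connection_rank = {"folder_2": 0, "folder_1": 1}

# Pass 1: oldest files first.
ordered = sorted(files, key=lambda f: f[2])
# Pass 2: group by connection. Python's sort is stable, so the
# chronological order from pass 1 is preserved within each group.
ordered = sorted(ordered, key=lambda f: connection_rank[f[0]])

print([name for _, name, _ in ordered])
# ['File_6', 'File_1', 'File_5', 'File_4', 'File_2', 'File_3']
```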


Scenario 5

As in scenario 1, files in the data source for a connection are ordered both lexicographically and chronologically by last modified time. In this scenario, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

Files in folder_1:

  • File_4 (time_1)

  • File_5 (time_2)

  • File_6 (time_3)

Files in folder_2:

  • File_1 (time_1)

  • File_2 (time_2)

  • File_3 (time_3)

Order of processed files:

  1. File_1 (time_1)

  2. File_2 (time_2)

  3. File_3 (time_3)

  4. File_4 (time_1)

  5. File_5 (time_2)

  6. File_6 (time_3)

Because the retrieve_ordered_files setting only affects how the list of available files is retrieved, not the order in which they are processed, the unit test covering this scenario is the same as the one for scenario 4: test_ordered_by_mtime_and_connection in tests/factorytx/receivers/file/test_receiver.py.
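Scenarios 1 and 5 rely on each connection's files sorting the same way by name and by last modified time. A quick way to check a listing for that property is sketched below; is_consistently_ordered is a hypothetical helper, not part of FactoryTX, and timestamps are modeled as integers.

```python
from typing import List, Tuple


def is_consistently_ordered(files: List[Tuple[str, int]]) -> bool:
    """True if sorting (name, mtime) pairs by name also sorts them by mtime."""
    mtimes = [mtime for _, mtime in sorted(files, key=lambda f: f[0])]
    return all(a <= b for a, b in zip(mtimes, mtimes[1:]))


# folder_1 from scenario 5: names and timestamps agree.
print(is_consistently_ordered([("File_4", 1), ("File_5", 2), ("File_6", 3)]))  # True

# folder_1 from scenario 2: File_4 is the oldest file but sorts last by name.
print(is_consistently_ordered([("File_2", 2), ("File_3", 3), ("File_4", 1)]))  # False
```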