Notes on ordered file processing
File receivers have configuration settings that dictate the order in which files are retrieved and processed. This section walks through the various combinations to clarify when and how to use each setting. Three primary settings can be toggled on and off:
retrieve_ordered_files
: A per-connection setting that can speed up the retrieval of files. See the setting's description for the conditions that files should meet to get better performance. Not all file receiver connections support this option; currently, only the AWS S3 receiver is officially supported.

process_files_alphabetically
: A per-receiver setting that determines the order in which files are copied into FactoryTX, parsed, and processed, based on specific file attributes: file name and last modified time.

process_ordered_connections
: A per-receiver setting that determines the order in which files are copied into FactoryTX, parsed, and processed, based on how connections are ordered in the FactoryTX configuration.
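
To make the idea behind retrieve_ordered_files more concrete, here is a minimal sketch of retrieving a sublist based on the last completed file. It is not FactoryTX's implementation: the function name `list_files_after` is hypothetical, and the sketch assumes that the data source returns file names in lexicographic order and that newer files always sort after older ones. Amazon S3's ListObjectsV2 request offers a `StartAfter` parameter with similar behavior; whether the FactoryTX S3 receiver relies on it is not stated here.

```python
from bisect import bisect_right
from typing import List, Optional


def list_files_after(sorted_names: List[str], last_completed: Optional[str]) -> List[str]:
    """Return only the names that sort after the last completed file.

    Hypothetical helper: `sorted_names` stands in for a data source that
    lists files in lexicographic order.
    """
    if last_completed is None:
        # Nothing has been completed yet, so every file is a candidate.
        return list(sorted_names)
    # Skip everything up to and including the last completed file.
    return sorted_names[bisect_right(sorted_names, last_completed):]


names = ["2021-01-01.csv", "2021-01-02.csv", "2021-01-03.csv"]
print(list_files_after(names, "2021-01-02.csv"))  # ['2021-01-03.csv']
```

The sketch also shows why the setting depends on how files are named: a file whose name sorts before the last completed file would never be returned.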
Flow of receiving files
When a file receiver looks for new files to ingest:
1. The file receiver connection retrieves a list of available files from the data source. The receiver saves this list along with each file’s attributes, such as file name, size, and last modified time. This step is where the retrieve_ordered_files setting is applied. Some file receiver connections (e.g. the AWS S3 receiver) can retrieve a sublist of available files from the data source based on the last completed file, and enabling the setting turns this feature on. The data source responds faster to a request for a sublist than to a request for all of the available files.
2. After saving the list of available files, the file receiver iterates through it, checks whether each file is new or recently modified, and adds such files to the download queue. This step is where the process_files_alphabetically setting is applied. When the setting is enabled, the receiver iterates through the list of available files in alphabetical order. In effect, files are downloaded and processed in lexicographic order. When the setting is disabled, the receiver iterates through the list of available files by last modified time, oldest first. In effect, files are downloaded and processed in chronological order based on their last modified time.
3. After determining which files are new or modified, the file receiver starts copying the files into FactoryTX and parsing them for transform and transmit components to process. This step is where the process_ordered_connections setting is applied. If the setting is enabled, the order of the file receiver connections in the FactoryTX configuration affects the order in which files are downloaded: files associated with one connection are downloaded and processed before moving on to files associated with the next connection in the list. If the setting is disabled, the order of connections is disregarded and files are downloaded in the order of the queue filled in step 2.
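
Step 2 of the flow above can be sketched as follows. This is an illustration under stated assumptions, not FactoryTX code: `FileEntry` and `build_download_queue` are hypothetical names, and the sketch assumes that a file counts as modified when its size or last modified time differs from what was recorded when it was last completed.

```python
from typing import Dict, List, NamedTuple, Tuple


class FileEntry(NamedTuple):
    name: str
    size: int
    mtime: int  # last modified time; larger means newer


def build_download_queue(
    available: List[FileEntry],
    completed: Dict[str, Tuple[int, int]],
    process_files_alphabetically: bool,
) -> List[str]:
    """Return the names of new or modified files in the order they are queued.

    `completed` maps a file name to the (size, mtime) recorded for it
    (an assumed way of tracking state).
    """
    if process_files_alphabetically:
        ordered = sorted(available, key=lambda f: f.name)
    else:
        ordered = sorted(available, key=lambda f: f.mtime)  # oldest first

    queue = []
    for entry in ordered:
        # Unseen files are new; files whose size or mtime changed are modified.
        if completed.get(entry.name) != (entry.size, entry.mtime):
            queue.append(entry.name)
    return queue


available = [FileEntry("b.csv", 10, 3), FileEntry("a.csv", 5, 5), FileEntry("c.csv", 7, 1)]
completed = {"c.csv": (7, 1)}  # c.csv is unchanged, so it is skipped
print(build_download_queue(available, completed, True))   # ['a.csv', 'b.csv']
print(build_download_queue(available, completed, False))  # ['b.csv', 'a.csv']
```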
Examples
Scenario | retrieve_ordered_files | process_files_alphabetically | process_ordered_connections |
---|---|---|---|
1 | True* | False | False |
2 | True/False* | True | False |
3 | True/False* | True | True |
4 | False | False | True |
5 | True* | False | True |

\* File receiver connection supports this feature.

- Scenario 1

  Files in the data source for a connection are ordered both lexicographically and chronologically by last modified time.

  Files in folder_1:

  - File_1 (time_1)
  - File_2 (time_2)
  - File_3 (time_3)

  Order of processed files:

  1. File_1 (time_1)
  2. File_2 (time_2)
  3. File_3 (time_3)

  The closest unit test for checking that the retrieve_ordered_files flag works is test_fetch_last_completed_file in tests/factorytx/receivers/file/test_state.py. Files should be downloaded and processed based on last modified time, as demonstrated by test_ordered_by_mtime in tests/factorytx/receivers/file/test_receiver.py.

- Scenario 2

  Files in the data source for a connection are ordered lexicographically, but not ordered chronologically based on last modified time.

  Files in folder_1:

  - File_2 (time_2)
  - File_3 (time_3)
  - File_4 (time_1)

  Files in folder_2:

  - File_1 (time_2)
  - File_5 (time_3)
  - File_6 (time_1)

  Order of processed files:

  1. File_1 (time_2)
  2. File_2 (time_2)
  3. File_3 (time_3)
  4. File_4 (time_1)
  5. File_5 (time_3)
  6. File_6 (time_1)

  The unit test covering this scenario is test_ordered_by_filename in tests/factorytx/receivers/file/test_receiver.py.

- Scenario 3

  Very similar to scenario 2, files in the data source for a connection are ordered lexicographically, but not ordered chronologically based on last modified time. However, in this scenario, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

  Files in folder_1:

  - File_2 (time_2)
  - File_3 (time_3)
  - File_4 (time_1)

  Files in folder_2:

  - File_1 (time_2)
  - File_5 (time_3)
  - File_6 (time_1)

  Order of processed files:

  1. File_1 (time_2)
  2. File_5 (time_3)
  3. File_6 (time_1)
  4. File_2 (time_2)
  5. File_3 (time_3)
  6. File_4 (time_1)

  The unit test covering this scenario is test_ordered_by_filename_and_connection in tests/factorytx/receivers/file/test_receiver.py.

- Scenario 4

  Files in the data source are ordered chronologically by last modified time, but not ordered lexicographically. In this scenario, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

  Files in folder_1:

  - File_2 (time_2)
  - File_3 (time_3)
  - File_4 (time_1)

  Files in folder_2:

  - File_1 (time_2)
  - File_5 (time_3)
  - File_6 (time_1)

  Order of processed files:

  1. File_6 (time_1)
  2. File_1 (time_2)
  3. File_5 (time_3)
  4. File_4 (time_1)
  5. File_2 (time_2)
  6. File_3 (time_3)

  The unit test covering this scenario is test_ordered_by_mtime_and_connection in tests/factorytx/receivers/file/test_receiver.py.

- Scenario 5

  Very similar to scenario 1, files in the data source for a connection are ordered both lexicographically and chronologically by last modified time. In this scenario, we want to process all the files from one connection (folder_2) before processing files from the next connection (folder_1).

  Files in folder_1:

  - File_4 (time_1)
  - File_5 (time_2)
  - File_6 (time_3)

  Files in folder_2:

  - File_1 (time_1)
  - File_2 (time_2)
  - File_3 (time_3)

  Order of processed files:

  1. File_1 (time_1)
  2. File_2 (time_2)
  3. File_3 (time_3)
  4. File_4 (time_1)
  5. File_5 (time_2)
  6. File_6 (time_3)

  Because the retrieve_ordered_files setting applies only when retrieving the list of available files, the unit test covering this scenario is the same as the one for scenario 4: test_ordered_by_mtime_and_connection in tests/factorytx/receivers/file/test_receiver.py.
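
The processing orders in scenarios 2 to 4 can be reproduced with a small model of how process_files_alphabetically and process_ordered_connections interact. This is a sketch for building intuition rather than the receiver's actual code: `RemoteFile` and `processing_order` are hypothetical names, time_1 to time_3 are represented as the integers 1 to 3, and the connection order (folder_2 before folder_1) is assumed from the scenario descriptions.

```python
from typing import List, NamedTuple


class RemoteFile(NamedTuple):
    connection: str  # connection the file belongs to
    name: str
    mtime: int       # last modified time; larger means newer


def processing_order(
    files: List[RemoteFile],
    connection_order: List[str],
    process_files_alphabetically: bool,
    process_ordered_connections: bool,
) -> List[str]:
    """Return file names in the order this model would process them."""

    def sort_key(f: RemoteFile):
        # Within a group, order by name or by last modified time.
        inner = f.name if process_files_alphabetically else f.mtime
        # Group by configured connection position only when requested.
        group = connection_order.index(f.connection) if process_ordered_connections else 0
        return (group, inner)

    return [f.name for f in sorted(files, key=sort_key)]


files = [
    RemoteFile("folder_1", "File_2", 2),
    RemoteFile("folder_1", "File_3", 3),
    RemoteFile("folder_1", "File_4", 1),
    RemoteFile("folder_2", "File_1", 2),
    RemoteFile("folder_2", "File_5", 3),
    RemoteFile("folder_2", "File_6", 1),
]
connection_order = ["folder_2", "folder_1"]

# Scenario 3: alphabetical within each connection, connections in configured order.
print(processing_order(files, connection_order, True, True))
# ['File_1', 'File_5', 'File_6', 'File_2', 'File_3', 'File_4']
```

Switching the two boolean arguments yields the orders listed for scenarios 2 and 4 with the same input data.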