Improving testing data management for "self-test"

jcfr · December 8, 2018, 7:48am

We recently reduced the time required to run Python self-tests by a factor of 2 to 10, depending on the case; see here for more details.

I suggest leveraging and improving the SampleData logic to:

cache downloaded test data in a persistent directory that is not cleared before each daily build starts.
ensure up-to-date data is always downloaded and be resilient to incomplete file downloads.

Plan of action

The plan is the following:

Update SampleDataLogic and SampleDataSource to support the Scene file type. Currently the expectation seems to be that only files are downloaded.
Update all tests to use SampleDataSource and SampleDataLogic.
Improve the SampleData download logic to accept checksums. A file will be re-downloaded if the checksum of the copy in the cache does not match the expected one.
- This will help with testing: currently, after aborting a test while it is downloading (basically the first thing that happens when running a test), an invalid file may be left in the cache.
Add a new property named cacheDirectory. If this property is set and a checksum is associated with a file:
- The file will be downloaded into the cacheDirectory, renamed after the provided checksum, and copied into the cache directory associated with the current scene.
- If an environment variable named SLICER_SAMPLEDATA_CACHE_DIRECTORY is set, it will be used to initialize this property.
- By default, this new environment variable will be initialized with the value already set for the ExternalData_OBJECT_STORES environment variable. That variable is passed to Slicer and specifies where data downloaded using the CMake ExternalData module is stored. That way, the cache used by regular CTest tests and the one used by the self-tests will share the same location, saving time when re-running tests every day.
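
The checksum and cacheDirectory steps in this plan could look roughly like the sketch below. It is not the SampleData implementation: the function names (sha256_of, resolve_cache_directory, fetch_with_checksum) are hypothetical, SHA-256 is an assumed choice of algorithm, and treating ExternalData_OBJECT_STORES as a ";"-separated list whose first entry is used is also an assumption.

```python
import hashlib
import os
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_cache_directory(default):
    """Pick the cache directory: the dedicated env. variable first, then the
    ExternalData object store, then the given default."""
    for name in ("SLICER_SAMPLEDATA_CACHE_DIRECTORY", "ExternalData_OBJECT_STORES"):
        value = os.environ.get(name)
        if value:
            # Assumption: the value may be a ";"-separated list; use the first entry.
            return Path(value.split(";")[0])
    return Path(default)


def fetch_with_checksum(url, file_name, checksum, cache_dir, scene_dir):
    """Download url into cache_dir (stored under its checksum) unless a valid
    copy is already there, then copy it into scene_dir as file_name."""
    cache_dir, scene_dir = Path(cache_dir), Path(scene_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    scene_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / checksum

    # A missing, truncated, or outdated file fails the check and is re-downloaded.
    if not (cached.exists() and sha256_of(cached) == checksum):
        partial = cached.with_name(cached.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        if sha256_of(partial) != checksum:
            partial.unlink()
            raise ValueError(f"checksum mismatch for {url}")
        # Rename only after verification so the cache never holds a partial file.
        os.replace(partial, cached)

    target = scene_dir / file_name
    shutil.copyfile(cached, target)
    return target
```

Because the cached file is named after its checksum, an updated data file gets a new cache entry instead of silently reusing a stale one, and an aborted download only ever leaves a ".part" file behind.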

What is ExternalData?

Read https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/ or see the corresponding reference documentation at https://cmake.org/cmake/help/latest/module/ExternalData.html

What is a self-test?

In a nutshell, a self-test has the following important features:

Tests are available as part of the binary distributions of Slicer, so users can confirm correct behavior on their systems.
The same tests are run as part of the nightly test process and submitted to the Slicer dashboard.
Developers can efficiently develop the tests by reloading Python scripts without needing to exit Slicer.

You may read more details on the wiki

lassoan · December 8, 2018, 8:49pm

These would be great improvements.

It would be nice if we could improve the existing remote data management infrastructure (vtkCacheManager, vtkDataIOManager, vtkDataTransfer, vtkHTTPHandler, vtkMRMLRemoteIOLogic) instead of building a new one. If we find that the current remote data management infrastructure is no longer useful or relevant, then we should probably remove it completely.

@pieper Do you remember what the main remote data management use cases were, and whether they are still relevant? It seems that Slicer could download data from remote servers using URLs stored in the scene, but I have never seen a scene like that.

pieper · December 9, 2018, 8:13pm

Yes, that was the goal and it did generally work but it didn’t turn out to be used (maybe we just never pushed it). The use cases were to allow slicer to natively reference image archives in MRML. I guess in the end people end up managing download tasks explicitly with server-specific interfaces (we have TCIA Browser, XNAT interface, DICOM Query/Retrieve) but we tend to use MRML only for local files.

+1 for the idea of unified caching infrastructure with checksums.

lassoan · December 10, 2018, 12:24am

Should that be a high-level feature implemented in SampleData/Python or a low-level feature implemented at the MRML level in C++?

I tend to prefer the former option (remove networking from MRML and add it at higher level), as I don’t think network communication could be part of MRML (https, various authentication protocols, etc. would be too complicated to implement; simple anonymous public data download over http would be too limited). We could keep cache management (finding cached files, cleaning up old files, etc) in C++ and add checksum support.

pieper · December 10, 2018, 1:47pm

Good question - I definitely agree we should keep the networking details out of MRML. Personally I like Jc’s focus on caching SampleData for SelfTests since it’s a concrete use case to optimize. General Slicer use, like managing a locally cached subset from large case archives, is a different problem.

lassoan · December 10, 2018, 5:38pm

Currently, SampleData uses the same cache folder as vtkCacheManager and there is already some interference (vtkCacheManager deletes data downloaded by SampleData when cache limit is exceeded).

To allow SampleData module to work reliably, we would need to update the cache manager’s strategy of how to delete things and probably also add support for checksum computation. The rest could be implemented at higher level, in SampleData.
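
One possible deletion strategy, sketched here only to make the idea concrete: remove the oldest files first when the size limit is exceeded, but never touch entries that a higher-level module has marked as protected. This is not how vtkCacheManager works today; prune_cache and its protected parameter are hypothetical names.

```python
from pathlib import Path


def prune_cache(cache_dir, max_bytes, protected=()):
    """Delete the oldest files until the cache fits within max_bytes.

    Files whose names are listed in `protected` are never deleted.
    Returns the list of removed paths.
    """
    protected = set(protected)
    files = [p for p in Path(cache_dir).iterdir() if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    removed = []
    # Visit files from oldest to newest modification time.
    for path in sorted(files, key=lambda p: p.stat().st_mtime):
        if total <= max_bytes:
            break
        if path.name in protected:
            continue
        total -= path.stat().st_size
        path.unlink()
        removed.append(path)
    return removed
```

If the protected files alone exceed the limit, the cache stays over budget, which is where notifying the user would make sense.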

If we touch the cache manager, it would be a good opportunity to remove remote data support from MRML, as it would simplify things. I will add a note to the Slicer5 roadmap.

jcfr · December 14, 2018, 8:19pm

“Fun” facts:

On the Windows factory, there were 1.3 GB (over 120k files) of Slicer-*.log and Slicer-tmp*.log files.
The “Slicer4minutes” test is failing because the file slicer4minute.mrb already exists in the tmp folder but is the wrong one (it was updated a few days ago). Moving to a checksum-based approach will avoid this.
There is a “Cache” panel in the Slicer settings with a maximum cache size, but the user is not notified when the cache grows beyond the set limit.
Application tests (e.g., self-tests in Python) are run sequentially because (1) temporary test folders are not unique per Slicer session and would clash, and (2) for historical reasons, tests involving rendering could not run in parallel when executed on a VM.

To move forward, the plan would be:

make sure the cache folder persists between machine restarts (by default, the cache folder will be outside of the tmp folder; currently it is /tmp/Slicer-<username>/RemoteIO)
before files from the cache are used, they will be copied into a temporary location. This applies to SampleData downloads, datastore downloads, tests, …
each Slicer start will be associated with a unique temporary folder (e.g., Slicer-NNN or Slicer-test-NNN), and only the last X will be kept.
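
The per-session folder idea could be sketched as follows, using only the Python standard library. The function name create_session_dir, the "Slicer-" prefix default, and the keep parameter (standing in for "X") are illustrative assumptions, not existing Slicer API.

```python
import shutil
import tempfile
from pathlib import Path


def create_session_dir(root, prefix="Slicer-", keep=5):
    """Create a unique per-session folder under root and keep only the
    `keep` most recent folders (including the new one) with that prefix."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # mkdtemp guarantees a unique name, so parallel sessions cannot clash.
    session = Path(tempfile.mkdtemp(prefix=prefix, dir=root))

    # Older session folders, sorted from oldest to newest.
    others = sorted(
        (p for p in root.iterdir()
         if p.is_dir() and p.name.startswith(prefix) and p.name != session.name),
        key=lambda p: p.stat().st_mtime,
    )
    excess = len(others) - (keep - 1)
    for old in others[:max(excess, 0)]:
        shutil.rmtree(old, ignore_errors=True)
    return session
```

With unique folders per session, the first reason given above for running application tests sequentially would no longer apply.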

lassoan · December 14, 2018, 8:26pm

We should be able to load data directly from cache. We don’t even know what files we need to copy to a temporary location (we often need multiple files to be able to load data, e.g., for nhdr+raw files, MRML scenes, DICOM data).
