Improving testing data management for "self-test"

We recently improved the time required to run python self-tests from 2x to 10x depending on the case, see here for more details.

I suggest to leverage and improve the SampleData logic to:

  • cache downloaded test data in a persistent directory that wouldn’t be cleared before each daily build starts.
  • ensure up-to-date data are always downloaded and be resilient to incomplete file download

Plan of action

The plan is the following:

  1. Update SampleDataLogic and SampleDataSource to support Scene file type. Currently expectation seems to be that only files are downloaded.
  2. Update all tests to use SampleDataSource, SampleDataLogic
  3. Improve SampleData download logic to accept checksums. Re-download will be re-done if the checksum of the file into the cache does not match the expected one.
    • This will help with testing, currently after aborting a test while it’s downloading (basically the first things happening when running a test), an invalid file may exist in the cache.
  4. Add a new property named cacheDirectory, if this property is set and a checksum is associated with a file:
    • the file will be downloaded into the cacheDirectory, renamed after the provided checksum and copied into the cache directory associated with the current scene
    • Also, in case an environment variable named SLICER_SAMPLEDATA_CACHE_DIRECTORY is set, it will be used to initialize this property.
    • By default, this new env. variable will be initialized with the same value already set to ExternalData_OBJECT_STORES env. variable. This variable is specified to Slicer and used to initialize where data downloaded using CMake ExternalData module are stored. That way, the cache used by regular CTest test and the one for the self-test will be downloaded in the same location saving time when re-running tests every day.

What is ExternalData ?

Read at https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/ or look at the corresponding reference documentation https://cmake.org/cmake/help/latest/module/ExternalData.html

What is a self-test ?

In a nutshell, a self-test has the following features:
Important features include:

  • Tests are available as part of the binary distributions of slicer, so users can confirm correct behavior on their systems
  • The same tests are run as part of the nightly test process and submitted to the slicer dashboard.
  • Developers can efficiently develop the tests by reloading python scripts without needing to exit slicer.

You may read more details on the wiki

These would be great improvements.

It would be nice if we could improve the existing remote data management infrastructure (vtkCacheManager, vtkDataIOManager, vtkDataTransfer, vtkHTTPHandler, tkMRMLRemoteIOLogic) instead of building a new infrastructure. If we find that the current remote data management infrastructure is not useful or relevant anymore then probably we should remove it completely.

@pieper Do you remember what were the main remote data management use cases and if they are still relevant? It seems that Slicer could download data from remote servers using URLs stored in the scene, but I have never seen a scene like that.

1 Like

Yes, that was the goal and it did generally work but it didn’t turn out to be used (maybe we just never pushed it). The use cases were to allow slicer to natively reference image archives in MRML. I guess in the end people end up managing download tasks explicitly with server-specific interfaces (we have TCIA Browser, XNAT interface, DICOM Query/Retrieve) but we tend to use MRML only for local files.

+1 for the idea of unified caching infrastructure with checksums.

Should that be high-level feature implemented in SampleData/Python or low-level feature implemented at MRML level in C++?

I tend to prefer the former option (remove networking from MRML and add it at higher level), as I don’t think network communication could be part of MRML (https, various authentication protocols, etc. would be too complicated to implement; simple anonymous public data download over http would be too limited). We could keep cache management (finding cached files, cleaning up old files, etc) in C++ and add checksum support.

Good question - I definitely agree we should keep the networking details out of MRML. Personally I like Jc’s focus on caching SampleData for SelfTests since it’s a concrete use case to optimize. General Slicer use, like managing a locally cached subset from large case archives is a different problem.

Currently, SampleData uses the same cache folder as vtkCacheManager and there is already some interference (vtkCacheManager deletes data downloaded by SampleData when cache limit is exceeded).

To allow SampleData module to work reliably, we would need to update the cache manager’s strategy of how to delete things and probably also add support for checksum computation. The rest could be implemented at higher level, in SampleData.

If we touch the cache manager, it would be a good opportunity to remove remote data support from MRML, as it would simplify things. I add a note to Slicer5 roadmap.

“Fun” facts:

  • On the windows factory, there was 1.3GB (+120k files) of Slicer-*.log and Slicer-tmp*.log files
  • “Slicer4minutes” test is failing because the file slicer4minute.mrb already exist in the tmp folder but is the wrong one (it was updated few days ago). Moving to a checksum based approach will avoid this.
  • There is a “Cache” panel in the Slicer settings with the maximum size cache but user is not notified when the cache is growing beyond the set limit.
  • Application tests (e.g self-test in python) are run sequentially because (1) temporary test folder are not unique per Slicer session and would clash and (2) for historical reason test involving rendering couldn’t run in parallel when executed on VM

To move forward, the plan would be:

  • make sure the Cache folder persist between machine restart (by default the cache folder will be outside of the tmp folder. Currently it is /tmp/Slicer-<username>/RemoteIO)
  • before using file from the cache, they will be copied into a temporary location. This applies to SampleData download, datastore download, tests, …
  • each Slicer start would be associated with a unique temporary folder (e.g Slicer-NNN or Slicer-test-NNN) and only the last X one would be kept.

We should be able to load data directly from cache. We don’t even know what files we need to copy to a temporary location (we often need multiple files to be able to load data, e.g., for nhdr+raw files, MRML scenes, DICOM data).