Random number generator and reproducibility issues

We are running into an issue where, when we run the same dataset a couple of times back to back, we get slightly different results. The differences are subtle, but combined they are enough to create extra variation and affect the results.

So how does one go about reproducibility in Slicer? Normally we would do things like specifying a specific seed for the RNG, but if we do that, does it affect operations throughout Slicer, or only within our module? Are there some examples we can take a look at?

@chz31 @smrolfe

This is not an application-level feature; it is up to the algorithm to expose an interface for specifying a random seed. Algorithms should not use any global shared random number generator but an object that the algorithm owns. In VTK you would use an object of this class, in ITK you would use this class, etc.
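For a Python scripted module, the same idea can be sketched as follows. This is only an illustration, assuming NumPy's `default_rng`; the class `SamplingLogic` and its method `sample_indices` are hypothetical names, not part of Slicer.

```python
import numpy as np


class SamplingLogic:
    """Hypothetical module logic that owns its random number generator."""

    def __init__(self, seed=None):
        # seed=None draws fresh entropy from the OS; an integer gives a repeatable stream.
        self.rng = np.random.default_rng(seed)

    def sample_indices(self, n_points, n_samples):
        # Draw from the generator owned by this object, never from the np.random.* globals.
        return self.rng.choice(n_points, size=n_samples, replace=False)


# Two logic objects created with the same seed produce the same samples.
first = SamplingLogic(seed=42).sample_indices(1000, 5)
second = SamplingLogic(seed=42).sample_indices(1000, 5)
print(np.array_equal(first, second))
```

Exposing `seed` as a constructor argument lets a test pin the stream while normal use stays unseeded.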

Of course, most algorithms still won’t produce exactly the same results on different computers (different CPU, C runtime, etc. give different results for the same floating-point operations) or on each run (due to multi-threaded implementation). In theory, you could achieve 100% reproducibility of the results, but since that requires turning off most optimizations, running single-threaded, and building everything from source on all platforms, it takes a lot of extra effort. What is even worse is that the resulting algorithm implementation would not be suitable for end users, as it would be so much slower than the approximate implementation.
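A small illustration of why the order of floating-point operations matters: addition is not associative, so a multi-threaded sum that combines partial results in a different order can differ in the last bits. The variable names below are only for illustration.

```python
import math

# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False: the results differ in the last bit
print(abs(left_to_right - right_to_left))

# Comparing with a tolerance is usually more meaningful than exact equality.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))  # True
```

This is why regression tests on numerical output typically compare within a tolerance instead of requiring bit-identical results.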

I think putting a lot of effort into 100% reproducibility of a tiny part of a workflow (running some processing algorithm on some data) is actually harmful, because it takes time away from the real goal: reproducibility of the entire workflow. The entire workflow includes imaging, specifying additional user inputs, processing, visualization of results, etc. This requires open-source code, open data, automatic testing, documentation, training, tutorials, etc., which may not sound as exciting but is essential for the overall advancement of a field.


Just to clarify, if I do
np.random.seed(1)
will that affect the whole Slicer session or only my specific module?

Ideally, you should not use the legacy interface to seed random numbers.

    def seed(self, seed=None):
        """
        seed(self, seed=None)
        Reseed a legacy MT19937 BitGenerator
        Notes
        -----
        This is a convenience, legacy function.
        The best practice is to **not** reseed a BitGenerator, rather to
        recreate a new one. This method is here for legacy reasons.
        This example demonstrates best practice.

        >>> from numpy.random import MT19937
        >>> from numpy.random import RandomState, SeedSequence
        >>> rs = RandomState(MT19937(SeedSequence(123456789)))
        # Later, you want to restart the stream
        >>> rs = RandomState(MT19937(SeedSequence(987654321)))
        """
[...]


Also worth noting that seeding the generator using numpy.random.seed impacts other modules using the legacy interface.

So it should not be done in module logic and should be reserved for testing only.
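A minimal sketch of the difference, assuming NumPy is available in the Slicer Python environment. The two functions are hypothetical stand-ins for code in two different modules.

```python
import numpy as np


def other_module_draw():
    # Stand-in for code elsewhere that uses the legacy global interface.
    return np.random.random()


def my_module_draw(seed):
    # Stand-in for module logic that creates and owns its generator.
    rng = np.random.default_rng(seed)
    return rng.random()


# Reseeding the global state resets the stream seen by every legacy user.
np.random.seed(1)
a = other_module_draw()
np.random.seed(1)
b = other_module_draw()
print(a == b)  # True: the "other module" was affected by our seed call

# A locally owned generator does not depend on the global seed.
local_1 = my_module_draw(7)
np.random.seed(999)
local_2 = my_module_draw(7)
print(local_1 == local_2)  # True: unaffected by np.random.seed
```

So `np.random.seed(1)` affects everything in the Python process that draws from the legacy global generator, while a generator object created inside the module stays isolated.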

For some more background, see python - Why using numpy.random.seed is not a good practice? - Stack Overflow
