Should we start collecting software usage data?

re: VLC

Here is what appears while opening VLC for the first time:

[image: screenshot of the VLC first-run dialog]

I’ve been working on the wording of the pop-up where the user can opt in. I think it needs to be clear and easy to understand. Here is the draft:

Short version:

Opt-in for anonymous data collection in Slicer software. We collect non-personal data on usage frequency, hours, and UI button interactions to improve the user experience. Your privacy is protected; no personal data is collected or transmitted over the network. Select your preference below:

(•) Do not collect any data.
( ) Allow anonymous data collection.

[Accept]

Long Version:

Dear Slicer User,

At Slicer, we strive to enhance your experience and improve our software continually. To achieve this, we rely on anonymous software usage data to understand how you interact with our platform and where we can focus our efforts effectively.

The data we collect is strictly limited to non-personal information, ensuring your privacy and confidentiality are paramount. We do not gather any data that could identify you personally, nor do we track your individual installations.

What Data is Collected?

  1. Number of Runs: We collect data on the frequency of software usage to gauge overall engagement.
  2. Usage Hours: Understanding the duration of your sessions helps us optimize performance and usability.
  3. UI Button Interactions: Data regarding the usage of UI buttons provides insights into feature popularity and usability.

Rest assured, all collected data remains anonymized and is stored locally on your device. No information is transmitted over the network without your explicit consent.

To review the implementation details of our data collection process, please refer to the code available here.

Please select your preference below:

(•) Do not collect any data.
( ) Allow anonymous data collection.

[Accept]

Thank you for helping us improve Slicer!

I like the short wording. In addition to designing this dialog, it would be great if you could also make a mock-up of the usage data display. How we present the data we’ve collected will be important for convincing people that it’s okay for them to share it.

@BerDom.Ing missed the most important part: he will work on this topic for a few months (as part of fulfilling requirements for his degree).

I like the short version, too. It is somewhat repetitive (“anonymous data collection”, “collect non-personal data”, “no personal data is collected”, “allow anonymous data collection”), so we could simplify things a bit.

We also need to agree on what data we would like to collect.

@BerDom.Ing could you clarify a bit what you mean?

  • Number of runs: How many times Slicer is started? How many times the user switched to a module? Or how many times some feature of a module was used? Or all these?
  • Usage hours: Time elapsed between starting and stopping Slicer? Time elapsed between entering and exiting a module? Should we extract idle time (when the user does not interact with Slicer)? The User Statistics module in Sandbox determines idle time and collects timing data - so it could be useful if we want to exclude idle time.
  • UI button interactions: Most things are UI button interactions. Could you clarify?

Question for everyone: What other information should we consider collecting? (any ideas are welcome and we can sort them out later)

I’m thinking about potentially collecting information about the computer (operating system, available memory, CPU model, GPU, screen resolution) and the data (image modality, size), as this information could be useful when we make design decisions.

For reference, telemetry in other software:

After reading about the Audacity scandal, I realized it’s crucial to opt-in for data collection and ensure anonymity. Also, allowing time for the community to embrace the new system, rather than feeling it’s imposed upon them, is essential. We should consider conducting a survey among Slicer users to gauge their concerns about telemetry usage, explain procedures, and emphasize not rushing the technology. Empowering users, providing a dedicated folder for collected data, and a link to the telemetry code repository could help. I’ve noticed that users express concerns about software safety when telemetry implementation is discussed.

Number of runs: Tracking how many times Slicer is started. I believe it would also be beneficial to count how often users switch modules, although implementing this might require some time and a deep understanding of Slicer’s source code.

Usage hours: Measuring the time elapsed between starting and stopping Slicer.

I aim to implement the User Statistics module mentioned earlier. Line 25 of the module’s description reads: “This module measures user statistics and stores them in a table. Some of the statistics measured include the active module, active segment editor effect, selected segment, duration, application status (active, wait, idle), etc. Tables from different scenes can be merged into a single table.”

UI button interactions: Each interaction should trigger an event handler. It might be feasible to increase a counter in a table for each invoked event of this type, providing insights into UI redesign or creating shortcuts for frequently used interactions.
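A minimal sketch of the counter idea, assuming each event handler can call a small recording function. The function name and the interaction names are made up for illustration and are not part of Slicer:

```python
from collections import Counter

# Hypothetical counter table: maps an interaction name to how often it fired.
interaction_counts = Counter()


def record_interaction(name):
    """Increase the counter for one UI interaction (called from an event handler)."""
    interaction_counts[name] += 1


# Simulated event handler calls; the names are invented for this example.
for name in ["LoadDataButton", "SaveButton", "LoadDataButton"]:
    record_interaction(name)

# Most frequently used interactions first.
print(interaction_counts.most_common())
```

Sorting by frequency is what would point at candidates for shortcuts or a more prominent place in the UI.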

I agree that collecting information about the userā€™s computer could prove valuable.

I’ve been working on finishing the Slicer extension tutorial, but I keep encountering errors. I think implementing computer information collection into the module will be the first step. I’ll try to accomplish this using only Qt and VTK to avoid installing extra libraries. I’ve researched that Qt provides a class called ‘QOpenGLContext’, which can create an OpenGL context, and VTK provides a class called ‘vtkOpenGLRenderWindow’, which can create a window with an OpenGL context. This can be used to retrieve graphics card information.

@BerDom.Ing would it be possible for you to attend the weekly Slicer developer meetings?
At each meeting you could give an update in a few minutes and we could discuss any questions that may arise.

Of course, I’ll be glad to join the meetings.

Today, in the weekly Slicer meeting, we discussed what to use for creating mock-ups, collecting error events, using JSON to store the data, and creating schemas for user-stored data. I believe it will be easy to migrate from JSON to an SQL schema.

Thanks for the summary. Could you please describe the full list of values that were considered for collection?

The values I considered are as follows:

Number of Slicer runs: numberOfSlicerRuns
Times a module is loaded: moduleName + number of load times
GPU Information
CPU Information
Operating System
Window Resolution: width X height
Idle elapsed time in hours: idleTime
Active elapsed time in hours: activeTime
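As a rough sketch, the values listed above could be held in one record and stored as JSON, as discussed in the meeting. The field names numberOfSlicerRuns, idleTime and activeTime come from the list; all other field names, the defaults, and the example module name are assumptions:

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class UsageRecord:
    numberOfSlicerRuns: int = 0
    # Assumed layout: module name -> number of times it was used.
    moduleActivations: dict = field(default_factory=dict)
    gpu: str = "unknown"
    cpu: str = "unknown"
    operatingSystem: str = "unknown"
    windowResolution: str = "unknown"  # e.g. "1920x1080"
    idleTime: float = 0.0    # hours
    activeTime: float = 0.0  # hours


record = UsageRecord(
    numberOfSlicerRuns=3,
    moduleActivations={"SegmentEditor": 2},
    activeTime=1.5,
)

# Serialize to JSON text and read it back into the same structure.
text = json.dumps(asdict(record), indent=2)
restored = UsageRecord(**json.loads(text))
print(text)
```

A flat record like this should also map fairly directly onto a table row if the storage later moves to SQL.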

If you are collecting the display information, maybe consider collecting “display scaling” parameters. I occasionally (and almost never consistently) find some strange case of text scaling issues…

I’ll add the scaling information to the collected values now that I know it might cause some issues.

Number of Slicer runs

It is not clear what a “run” means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.

Times a module is loaded

All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.

GPU Information
CPU Information
Operating System

Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.
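A small sketch of what enumerated values could look like, assuming the operating system name comes from Python's platform.system() and memory is reported in GB. The enumeration, the bin edges and the function names are assumptions chosen for illustration:

```python
import platform

# Hypothetical enumeration; anything not listed is reported as "other"
# so that a rare configuration cannot identify a single system.
KNOWN_OS = {"Windows": "windows", "Linux": "linux", "Darwin": "macos"}

# Assumed upper edges of the memory bins, in GB.
MEMORY_BINS_GB = [4, 8, 16, 32, 64]


def enumerate_os(system_name):
    """Map a raw OS name to one of a fixed set of values."""
    return KNOWN_OS.get(system_name, "other")


def bin_memory(memory_gb):
    """Report memory as a coarse bin instead of the exact amount."""
    for edge in MEMORY_BINS_GB:
        if memory_gb <= edge:
            return f"<={edge}GB"
    return f">{MEMORY_BINS_GB[-1]}GB"


print(enumerate_os(platform.system()), bin_memory(12))
```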

Window Resolution: width X height

Need to specify bins to make sure we don’t make systems that use rare resolutions identifiable. Also add scaling as Murat suggested.

We probably don’t need the resolution of all screens, just the one on which the Slicer main window is displayed, plus the total number of screens.

Idle elapsed time in hours: idleTime

I’m not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?

Active elapsed time in hours: activeTime

What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).


It would also be nice to count the number of uncaught C++ exceptions and application crashes.

It is important that extension developers can specify custom events that they want to count. We should limit the allowed event names to make sure that no information is leaked through them (e.g., the Extensions Catalog Entry JSON file could contain the list of event names that the extension can count, and we would not record or transmit anything else).
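A sketch of how such an allow-list could be enforced, assuming the list of permitted event names has already been read from the extension's catalog entry. The extension name, event names and function name are hypothetical:

```python
from collections import Counter

# Hypothetical allow-list, as it might be read from an extension's catalog entry.
ALLOWED_EVENTS = {
    "MyExtension": {"segmentationStarted", "exportFinished"},
}

event_counts = Counter()


def count_event(extension, event):
    """Count the event only if the extension declared it; return True if counted."""
    if event in ALLOWED_EVENTS.get(extension, set()):
        event_counts[(extension, event)] += 1
        return True
    # Undeclared names are dropped, so free-form strings never get recorded.
    return False


count_event("MyExtension", "segmentationStarted")
count_event("MyExtension", "patient: John Doe")   # not declared, ignored
count_event("UnknownExtension", "exportFinished")  # unknown extension, ignored
print(dict(event_counts))
```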

We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user. For example, it may not be desirable to be able to identify that a certain hospital uses Slicer for planning or guiding certain procedures. It would be nice to know for sure, but would violate the users’ privacy.

Ideally, we should not store IP addresses so that we don’t need to deal with strict data handling regulations. However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.

It is not clear what a “run” means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.

I was going to define a run as each time the application is opened, but I now think your approach is better. We should only count a run once Slicer has been running for more than 1 minute, so we know it wasn’t opened by mistake or opened several times at once by clicking the icon repeatedly.
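A minimal sketch of this definition of a run, assuming something calls update() periodically while the application is open. The class name and the injectable clock are assumptions made so the logic can be tried outside Slicer:

```python
import time

MIN_RUN_SECONDS = 60  # threshold suggested in this thread


class RunCounter:
    def __init__(self, runs=0, clock=time.monotonic):
        self.runs = runs          # number of runs counted in earlier sessions
        self._clock = clock
        self._start = clock()     # moment this session started
        self._counted = False

    def update(self):
        """Count the current session once it has lasted longer than the threshold."""
        if not self._counted and self._clock() - self._start > MIN_RUN_SECONDS:
            self.runs += 1
            self._counted = True
        return self.runs


# Example with a fake clock, so no real waiting is needed.
fake_now = [0.0]
counter = RunCounter(runs=5, clock=lambda: fake_now[0])
print(counter.update())  # session too short so far, still 5
fake_now[0] = 61.0
print(counter.update())  # session passed one minute, now 6
```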

All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.

Yes, I mean the number of times a module is activated; thanks for the correction. I agree that we should be careful about collecting this information. I find it valuable and not too intrusive, but I want to hear more opinions about it.

Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.

I will specify the GPU, CPU, and operating system information once I can retrieve it and test it on a couple of computers, so that I know what form the information takes; it could differ between vendors and operating systems. I will make sure that the available memory is included.

Need to specify bins to make sure we don’t make systems that use rare resolutions identifiable. Also add scaling as Murat suggested.

I’m not sure what you mean by specifying bins; a brief explanation would help me. Scaling will be added.

I’m not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?

Idle time is relevant for knowing how long Slicer was running in total: idle time plus active time equals the total time Slicer was running.
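To make the relation between the two values concrete, here is a sketch that splits elapsed time into active and idle, assuming the user counts as idle after 30 seconds without interaction. The threshold and the class are assumptions; this is not how the User Statistics module is implemented:

```python
IDLE_AFTER_SECONDS = 30  # assumed threshold for "no recent interaction"


class UsageTimer:
    def __init__(self):
        self.active = 0.0  # seconds with recent user interaction
        self.idle = 0.0    # seconds without recent user interaction
        self._last_tick = None
        self._last_interaction = None

    def interaction(self, now):
        """Call whenever the user interacts with the application."""
        self._last_interaction = now

    def tick(self, now):
        """Call periodically; attributes time since the previous tick."""
        if self._last_tick is not None:
            elapsed = now - self._last_tick
            recently_used = (
                self._last_interaction is not None
                and now - self._last_interaction <= IDLE_AFTER_SECONDS
            )
            if recently_used:
                self.active += elapsed
            else:
                self.idle += elapsed
        self._last_tick = now


timer = UsageTimer()
timer.tick(0)
timer.interaction(5)
for now in (10, 20, 50):
    timer.tick(now)
# Active plus idle always equals the total elapsed time between first and last tick.
print(timer.active, timer.idle, timer.active + timer.idle)
```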

What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).

I think a good resolution for saving the active time is 10 minutes.
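A sketch of saving at a fixed interval, assuming the statistics are a JSON-serializable dictionary and that something (for example a timer in the application) calls maybe_save() regularly. The file name, class and function names are assumptions:

```python
import json
import os
import tempfile

SAVE_INTERVAL_SECONDS = 10 * 60  # the 10-minute resolution proposed above


def save_statistics(stats, path):
    """Write to a temporary file first, then replace the target in one step,
    so a crash during writing cannot leave a half-written statistics file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(stats, f)
    os.replace(tmp_path, path)


class PeriodicSaver:
    def __init__(self, path):
        self.path = path
        self._last_save = None

    def maybe_save(self, stats, now):
        """Save if nothing was saved yet or the interval has elapsed."""
        if self._last_save is None or now - self._last_save >= SAVE_INTERVAL_SECONDS:
            save_statistics(stats, self.path)
            self._last_save = now
            return True
        return False
```

With this layout, at most one interval of usage time is lost if Slicer crashes between two saves.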

It would also be nice to count the number of uncaught C++ exceptions and application crashes.

I agree that it would be nice to count uncaught C++ exceptions and application crashes. I will surely need time, help, and guidance for that.

We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user.

I agree that we should not store IP addresses, but I think it would be okay to collect the location at a coarse granularity.

However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.

To detect and mitigate trivial data manipulation attempts, we could use Reed–Solomon error correction. This is another value to store, but it ensures that the other parts of the message are transmitted without error.

Instead of storing any individual resolution (there are many standard resolutions), you would specify bins, such as:

  • < 960×540
  • 960×540 … 1280×720
  • 1280×720 … 1920×1080
  • 1920×1080 … 2048×1080
  • …

Or maybe you would specify bins for horizontal resolution and aspect ratio.
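A sketch of the second variant (bins for horizontal resolution plus aspect ratio). The width edges are taken from the example bins above; the remaining edges are not specified in this thread, so everything above the last edge falls into one open-ended bin here. The list of aspect ratios and the function names are assumptions:

```python
from bisect import bisect_right

# Width edges from the example bins; further edges were left open in the thread.
WIDTH_EDGES = [960, 1280, 1920, 2048]

# Assumed set of common aspect ratios to snap to.
ASPECT_RATIOS = {"4:3": 4 / 3, "16:10": 16 / 10, "16:9": 16 / 9, "21:9": 21 / 9}


def width_bin(width):
    """Return the label of the horizontal-resolution bin that contains width."""
    i = bisect_right(WIDTH_EDGES, width)
    if i == 0:
        return f"<{WIDTH_EDGES[0]}"
    if i == len(WIDTH_EDGES):
        return f">={WIDTH_EDGES[-1]}"
    return f"{WIDTH_EDGES[i - 1]}-{WIDTH_EDGES[i]}"


def aspect_bin(width, height):
    """Return the name of the closest common aspect ratio."""
    actual = width / height
    return min(ASPECT_RATIOS, key=lambda name: abs(ASPECT_RATIOS[name] - actual))


print(width_bin(1366), aspect_bin(1366, 768))
```

Only the two bin labels would be stored, so an unusual resolution such as 1366×768 is reported the same way as its more common neighbours.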

You can use the same methods that are used for logging system information in the application log at startup.

I’m not worried about data corruption; my concern is that we should not make it extremely easy to manipulate statistics by submitting manipulated content to the server.

Yes, do not collect information about modules being used that are not publicly included in the Slicer extensions index. Developers who create Slicer custom applications for commercial purposes (not open-source) will not want their module names leaked to the open-source Slicer project. Developers may be creating new modules against their Slicer custom application, but may also be testing against a regular Slicer version, so the application name wouldn’t be a valid way to exclude module name logging.

With the Reed-Solomon algorithm, you obtain a value that correlates with all the previous information in the message. Every time the usage statistics are saved on disk, running the algorithm ensures that all previous parts of the message (the usage statistics) remain unchanged. If a change were to occur, when decoding the Reed-Solomon value against the message, it would indicate that the message has changed. This value could be stored in the same storage as the user statistics and made available to the user if they wish to view it. If users want to manipulate the information on the usage statistics after altering one part of the message, they would need to run the Reed-Solomon algorithm, which would be publicly available. Then, they could save the result in the control value stored with the usage statistics. It’s not bulletproof, but it’s not as trivial as simply changing the number in the available user statistics file.
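The behaviour described here can be illustrated with any publicly computable check value. The sketch below uses a SHA-256 digest from the Python standard library as a stand-in, not Reed-Solomon itself; the function name and the example statistics are assumptions:

```python
import hashlib
import json


def check_value(stats):
    """Stand-in control value: a digest over the serialized statistics."""
    payload = json.dumps(stats, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


stats = {"numberOfSlicerRuns": 12}
stored = check_value(stats)

# A naive edit of the file changes the data but not the stored control value,
# so the mismatch is detected.
stats["numberOfSlicerRuns"] = 9999
naive_edit_detected = check_value(stats) != stored

# Because the algorithm is public, the control value can simply be recomputed
# after the edit, and the manipulation is no longer detectable.
stored = check_value(stats)
recomputed_edit_detected = check_value(stats) != stored

print(naive_edit_detected, recomputed_edit_detected)
```

This shows both halves of the argument: a casual edit is caught, but anyone willing to rerun the public algorithm can produce a matching control value.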

Thank you for the clarification. Saving only the standard resolutions is indeed the best solution.

I will look into that. Thanks for the help.

Then we should enable modules to opt in to be logged.

I’ve been working on the mock-ups for presenting the information to the users. I don’t know what the scaling values are like, so I left them out for now.

In the weekly meeting, regarding the telemetry module, we discussed:

  • Asking developers more specific questions about the data that needs to be gathered.

  • Exploring methods to prevent the collection of fake data. We cannot use a hash because the code is open source. An anonymization approach could involve replacing IPs with geolocation data (opt-in is needed).

  • Recognizing the potential usefulness of telemetry for compiling a census of the compute hardware used for Slicer.

    My upcoming tasks include developing a Python data collector, setting up a server with a database, processing server logs, creating graphics, and implementing an endpoint to confirm successful receipt of data.
    I will need a function in Python like slicer.util.getHardwareInformation() that does the same as qSlicerApplication::logApplicationInformation().
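As a starting point, a rough sketch of what such a helper might return using only the Python standard library. The function below is hypothetical and does not exist in slicer.util; GPU, memory and display details are not available this way and would have to come from Qt, VTK, or the application itself:

```python
import os
import platform


def get_hardware_information():
    """Hypothetical helper: collect basic system information as a dictionary."""
    return {
        "operatingSystem": platform.system(),
        "operatingSystemRelease": platform.release(),
        "architecture": platform.machine(),
        # platform.processor() may return an empty string on some systems.
        "cpu": platform.processor() or "unknown",
        "logicalCpuCount": os.cpu_count(),
    }


print(get_hardware_information())
```

The raw strings returned here would still need to be mapped to enumerated values before being stored, for the privacy reasons discussed earlier in the thread.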