Should we start collecting software usage data?

mau_igna_06 · January 5, 2024, 10:32am

re: VLC

Here is what appears while opening VLC for the first time:

BerDom.Ing · February 26, 2024, 10:20pm

I’ve been working on the wording of the pop-up where the user can agree to opt in. I think it needs to be clear and easy to understand. The following text is the draft:

Short version:

Opt-in for anonymous data collection in Slicer software. We collect non-personal data on usage frequency, hours, and UI button interactions to improve the user experience. Your privacy is protected; no personal data is collected or transmitted over the network. Select your preference below:

(•) Do not collect any data.
( ) Allow anonymous data collection.

[Accept]

Long Version:

Dear Slicer User,

At Slicer, we strive to enhance your experience and improve our software continually. To achieve this, we rely on anonymous software usage data to understand how you interact with our platform and where we can focus our efforts effectively.

The data we collect is strictly limited to non-personal information, ensuring your privacy and confidentiality are paramount. We do not gather any data that could identify you personally, nor do we track your individual installations.

What Data is Collected?

Number of Runs: We collect data on the frequency of software usage to gauge overall engagement.
Usage Hours: Understanding the duration of your sessions helps us optimize performance and usability.
UI Button Interactions: Data regarding the usage of UI buttons provides insights into feature popularity and usability.

Rest assured, all collected data remains anonymized and is stored locally on your device. No information is transmitted over the network without your explicit consent.

To review the implementation details of our data collection process, please refer to the code available here.

Please select your preference below:

(•) Do not collect any data.
( ) Allow anonymous data collection.

[Accept]

Thank you for helping us improve Slicer!

pieper · February 27, 2024, 8:08pm

I like the short wording. In addition to designing this dialog, it would be great if you could also make a mock-up of the usage data display. How we present the data we’ve collected will be important for convincing people that it’s okay for them to share it.

lassoan · February 28, 2024, 4:33am

@BerDom.Ing missed the most important part: he will work on this topic for a few months (as part of fulfilling requirements for his degree).

I like the short version, too. It is somewhat repetitive (“anonymous data collection”, “collect non-personal data”, “no personal data is collected”, “allow anonymous data collection”), so we could simplify things a bit.

We also need to agree on what data we would like to collect.

@BerDom.Ing could you clarify a bit what you mean?

Number of runs: How many times Slicer is started? How many times the user switched to a module? Or how many times some feature of a module was used? Or all these?
Usage hours: Time elapsed between starting and stopping Slicer? Time elapsed between entering and exiting a module? Should we exclude idle time (when the user does not interact with Slicer)? The User Statistics module in Sandbox determines idle time and collects timing data, so it could be useful if we want to exclude idle time.
UI button interactions: Most things are UI button interactions. Could you clarify?

Question for everyone: What other information should we consider collecting? (any ideas are welcome and we can sort them out later)

I’m thinking about potentially collecting information about the computer (operating system, available memory, CPU model, GPU, screen resolution) and data (image modality, size), as this information could be useful when we make design decisions.

For reference, telemetry in other software:

Firefox
VLC
KDE
GNOME - includes telemetry variables and explanation why they are collected
MuseScore
Fedora workstation
The Audacity scandal

BerDom.Ing · March 19, 2024, 1:44pm

After reading about the Audacity scandal, I realized it’s crucial that data collection be opt-in and anonymous. It is also essential to give the community time to embrace the new system, rather than feeling it’s imposed upon them. We should consider conducting a survey among Slicer users to gauge their concerns about telemetry, explain the procedures, and make clear that we are not rushing the technology. Empowering users, providing a dedicated folder for collected data, and linking to the telemetry code repository could help. I’ve noticed that users express concerns about software safety when telemetry implementation is discussed.

Number of runs: Tracking how many times Slicer is started. I believe it would also be beneficial to count how often users switch modules, although implementing this might require some time and a deep understanding of Slicer’s source code.

Usage hours: Measuring the time elapsed between starting and stopping Slicer.

I aim to implement the User Statistics module mentioned earlier. Line 25 of the module’s description reads: “This module measures user statistics and stores them in a table. Some of the statistics measured include the active module, active segment editor effect, selected segment, duration, application status (active, wait, idle), etc. Tables from different scenes can be merged into a single table.”

UI button interactions: Each interaction should trigger an event handler. It might be feasible to increase a counter in a table for each invoked event of this type, providing insights into UI redesign or creating shortcuts for frequently used interactions.

I agree that collecting information about the user’s computer could prove valuable.

BerDom.Ing · March 26, 2024, 3:27pm

I’ve been working on finishing the Slicer extension tutorial, but I keep encountering errors. I think adding computer information collection to the module will be the first step. I’ll try to accomplish this using only Qt and VTK to avoid installing extra libraries. I’ve found that Qt provides a class called QOpenGLContext, which can create an OpenGL context, and VTK provides a class called vtkOpenGLRenderWindow, which can create a window with an OpenGL context. This can be used to retrieve graphics card information.

lassoan · March 26, 2024, 4:14pm

@BerDom.Ing would it be possible for you to attend the weekly Slicer developer meetings?
At each meeting you could give an update in a few minutes and we could discuss any questions that may arise.

BerDom.Ing · March 26, 2024, 5:41pm

Of course, I’ll be glad to join the meetings.

BerDom.Ing · April 9, 2024, 4:14pm

Today, in the weekly Slicer meeting, we discussed what to use for creating mock-ups, collecting error events, using JSON to store the data, and creating schemas for user-stored data. I believe it will be easy to migrate from JSON to an SQL schema.

lassoan · April 9, 2024, 4:29pm

Thanks for the summary. Could you please describe the full list of values that were considered for collection?

BerDom.Ing · April 9, 2024, 4:44pm

The values I considered are as follows:

Number of Slicer runs: numberOfSlicerRuns
Times a module is loaded: moduleName + number of load times
GPU Information
CPU Information
Operating System
Window Resolution: width X height
Idle elapsed time in hours: idleTime
Active elapsed time in hours: activeTime
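
As a rough illustration only, the values above could be stored as a JSON record along these lines. Only numberOfSlicerRuns, idleTime and activeTime are names from the list; every other key, and the module name used in the example, is a placeholder.

```python
import json


def make_usage_record():
    """Return an empty usage record (hypothetical layout, not an agreed schema)."""
    return {
        "numberOfSlicerRuns": 0,
        "moduleLoadCounts": {},  # placeholder key: module name -> count
        "gpu": None,             # placeholder key
        "cpu": None,             # placeholder key
        "operatingSystem": None, # placeholder key
        "windowResolution": {"width": None, "height": None},  # placeholder key
        "idleTime": 0.0,         # hours
        "activeTime": 0.0,       # hours
    }


record = make_usage_record()
record["numberOfSlicerRuns"] += 1
# "ExampleModule" is a made-up module name used only for this illustration.
counts = record["moduleLoadCounts"]
counts["ExampleModule"] = counts.get("ExampleModule", 0) + 1

# Round-trip through JSON text, as it would be written to and read from disk.
text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(text)
```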

muratmaga · April 9, 2024, 4:53pm

If you are collecting the display information, maybe consider collecting “display scaling” parameters. I occasionally (and almost never consistently) find some strange case of text scaling issues…

BerDom.Ing · April 9, 2024, 4:58pm

I’ll add the scaling information to the collected values now that I know it might cause some issues.

lassoan · April 9, 2024, 6:28pm

Number of Slicer runs

It is not clear what a “run” means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.

Times a module is loaded

All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.

GPU Information
CPU Information
Operating System

Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.

Window Resolution: width X height

Need to specify bins to make sure we don’t make systems identifiable that use rare resolutions. Also add scaling as Murat suggested.

We probably don’t need the resolution of all screens, just that of the screen on which the Slicer main window is displayed, plus the total number of screens.

Idle elapsed time in hours: idleTime

I’m not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?

Active elapsed time in hours: activeTime

What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).

It could also be nice to count the number of uncaught C++ exceptions and application crashes.

It would be important for extension developers to be able to specify custom events that they want to count. We should limit the allowed event names to make sure that no information is leaked through them (e.g., the Extensions Catalog Entry JSON file could contain the list of event names that the extension can count, and we would not record or transmit anything else).
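
A minimal sketch of that allowlist idea, assuming the permitted names have already been read from somewhere such as the catalog entry file. The class and the event names are made up for illustration.

```python
from collections import Counter


class EventCounter:
    """Counts only events whose names are on a pre-declared allowlist."""

    def __init__(self, allowed_event_names):
        self._allowed = frozenset(allowed_event_names)
        self._counts = Counter()

    def record(self, event_name):
        """Count an allowed event and return True; drop anything else and return False."""
        if event_name not in self._allowed:
            return False
        self._counts[event_name] += 1
        return True

    def counts(self):
        """Return the recorded counts as a plain dictionary."""
        return dict(self._counts)


# Hypothetical event names, as an extension might declare them.
counter = EventCounter(["segmentation_started", "segmentation_exported"])
counter.record("segmentation_started")
counter.record("segmentation_started")
counter.record("patient_name_viewed")  # not declared, so never stored
```

Because undeclared names are dropped before they are stored, an arbitrary string passed as an event name cannot end up in the recorded or transmitted data.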

We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user. For example, it may not be desirable to be able to identify that a certain hospital uses Slicer for planning or guiding certain procedures. It would be nice to know for sure, but would violate the users’ privacy.

Ideally, we should not store IP addresses so that we don’t need to deal with strict data handling regulations. However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.

BerDom.Ing · April 10, 2024, 7:14pm

It is not clear what a “run” means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.

I was going to define a run as each time the application is opened, but I now think your approach is better. We should count a run only once Slicer has been running for more than 1 minute, so we know that it wasn’t opened by mistake or launched several times by repeated clicks on the icon.
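
A small sketch of how that rule could look, assuming some periodic check calls it during the session. The class name and the injectable clock are illustrative and not part of Slicer.

```python
import time

MINIMUM_RUN_SECONDS = 60  # the 1-minute threshold discussed above


class RunTracker:
    """Decides whether the current session should be counted as a run."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._counted = False

    def should_count_run(self):
        """Return True exactly once, the first time the session has lasted long enough."""
        if self._counted:
            return False
        if self._clock() - self._start > MINIMUM_RUN_SECONDS:
            self._counted = True
            return True
        return False
```

The clock is passed in so that the rule can be tested without waiting a real minute.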

All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.

Yes, I mean the number of times a module is activated; thanks for the correction. I agree that we should be careful about what we collect. I find this information valuable and not too intrusive, but I want to hear more opinions about it.

Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.

I will specify the GPU, CPU, and operating system values once I can retrieve them and test on a couple of computers, so that I know what form the information takes; it could differ between vendors and operating systems. I will make sure that the available memory is included.

Need to specify bins to make sure we don’t make systems identifiable that use rare resolutions. Also add scaling as Murat suggested.

I’m not sure what you mean by specifying bins; a brief explanation would help me. Scaling will be added.

I’m not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?

Idle time is relevant for knowing how long Slicer was running in total: idle time plus active time equals the total time Slicer was running.

What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).

I think that 10 minutes is a good resolution for saving the active time.
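
A possible shape for the periodic save, as a sketch only. The file layout and function names are assumptions; the one real requirement taken from the discussion is that a crash should not lose or corrupt what was already saved, which is why the file is written to a temporary name and then swapped in.

```python
import json
import os
import tempfile

SAVE_INTERVAL_SECONDS = 10 * 60  # the 10-minute resolution suggested above


def save_statistics(path, statistics):
    """Write statistics to a temporary file, then atomically replace the target file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(statistics, f)
        os.replace(temp_path, path)
    except BaseException:
        # Clean up the partial temporary file; the previous target file is untouched.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def add_active_interval(statistics, seconds=SAVE_INTERVAL_SECONDS):
    """Add one elapsed interval to activeTime, which is kept in hours."""
    statistics["activeTime"] = statistics.get("activeTime", 0.0) + seconds / 3600.0
    return statistics
```

Inside Slicer, a Qt timer firing every SAVE_INTERVAL_SECONDS could call these two functions; that wiring is left out here.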

It could also be nice to count the number of uncaught C++ exceptions and application crashes.

I agree that it would be nice to count uncaught C++ exceptions and application crashes, but I will need time, help, and guidance to do it.

We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user.

I agree that we should not store IP addresses, but I think it would be okay to collect the location at a coarse granularity.

However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.

To detect and mitigate trivial data manipulation attempts we could use Reed–Solomon error correction. This is another value to store, but ensures that the other parts of the message are being transmitted without an error.

lassoan · April 10, 2024, 7:31pm

Instead of storing any individual resolution (there are many standard resolutions), you would specify bins, such as:

< 960×540
960×540 … 1280×720
1280×720 … 1920×1080
1920×1080 … 2048×1080
…

Or maybe you would specify bins for horizontal resolution and aspect ratio.
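
For illustration, binning by horizontal resolution alone could look like the sketch below. The edges are the widths from the list above; treating each lower edge as inclusive and leaving the last bin open-ended are assumptions, since the list is not complete.

```python
import bisect

# Horizontal bin edges taken from the widths listed above. The list continues
# ("…"), so the last bin here is simply open-ended.
WIDTH_EDGES = [960, 1280, 1920, 2048]


def width_bin(width):
    """Return a coarse label for a horizontal resolution given in pixels."""
    index = bisect.bisect_right(WIDTH_EDGES, width)
    if index == 0:
        return f"<{WIDTH_EDGES[0]}"
    if index == len(WIDTH_EDGES):
        return f">={WIDTH_EDGES[-1]}"
    # Lower edge inclusive, upper edge exclusive (an assumption).
    return f"{WIDTH_EDGES[index - 1]}-{WIDTH_EDGES[index] - 1}"
```

With these edges a 1366-pixel-wide and a 1600-pixel-wide screen both report "1280-1919", so an unusual width is no longer distinguishable from common ones in the same bin.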

You can use the same methods that are used for logging system information in the application log at startup.

I’m not worried about data corruption. My concern is that we should not make it extremely easy to manipulate statistics by submitting manipulated content to the server.

jamesobutler · April 10, 2024, 10:10pm

Yes, do not collect information about modules in use that are not publicly included in the Slicer extensions index. Developers who create Slicer custom applications for commercial purposes (not open-source) will not want their module names leaked to the open-source Slicer project. Developers may be creating new modules against their Slicer custom application, but may also be testing against a regular Slicer version, so the application name wouldn’t be a valid way to exclude module name logging.

BerDom.Ing · April 11, 2024, 3:19pm

With the Reed-Solomon algorithm, you obtain a value that correlates with all the previous information in the message. Every time the usage statistics are saved on disk, running the algorithm ensures that all previous parts of the message (the usage statistics) remain unchanged. If a change were to occur, when decoding the Reed-Solomon value against the message, it would indicate that the message has changed. This value could be stored in the same storage as the user statistics and made available to the user if they wish to view it. If users want to manipulate the information on the usage statistics after altering one part of the message, they would need to run the Reed-Solomon algorithm, which would be publicly available. Then, they could save the result in the control value stored with the usage statistics. It’s not bulletproof, but it’s not as trivial as simply changing the number in the available user statistics file.
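
To make the idea concrete, here is a minimal sketch of a control value stored next to the statistics. It uses CRC-32 from Python’s zlib module as a stand-in, because Reed-Solomon is not in the standard library; this is a different and simpler code, and it only detects changes rather than correcting them. The function names and the sample numbers are made up.

```python
import json
import zlib


def control_value(statistics):
    """Compute a CRC-32 over a canonical JSON encoding of the statistics."""
    payload = json.dumps(statistics, sort_keys=True, separators=(",", ":"))
    return zlib.crc32(payload.encode("utf-8"))


def is_unchanged(statistics, stored_value):
    """Return True if the statistics still match the stored control value."""
    return control_value(statistics) == stored_value


stats = {"numberOfSlicerRuns": 12, "activeTime": 3.5}
stored = control_value(stats)

# Someone edits the number in the file by hand: the mismatch is detected.
stats["numberOfSlicerRuns"] = 99
detected = not is_unchanged(stats, stored)

# If they also rerun the public algorithm, the edited data passes the check again.
stored = control_value(stats)
passes_again = is_unchanged(stats, stored)
```

The last two lines show the limitation already mentioned: anyone who recomputes the control value with the public algorithm gets a file that validates.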

Thank you for the clarification. Saving only the standard resolutions is indeed the best solution.

I will look into that. Thanks for the help.

Then we should enable modules to opt in to be logged.

BerDom.Ing · April 14, 2024, 6:18pm

I’ve been working on the mock-ups for presenting the information to the users. I don’t know what the scaling values are like, so I left them out for now.

BerDom.Ing · April 16, 2024, 3:48pm

In the weekly meeting, regarding the telemetry module, we discussed:

Asking developers more specific questions about the data that needs to be gathered.
Exploring methods to prevent the collection of fake data. We cannot use a hash because the code is open source. An anonymization approach could involve replacing IPs with geolocation data (opt-in is needed).
Recognizing the potential usefulness of telemetry for compiling a census of the compute hardware used for Slicer.

My upcoming tasks include developing a Python data collector, setting up a server with a database, processing server logs, creating graphics, and implementing an endpoint to confirm successful receipt of data.
I will need a function in Python, something like slicer.util.getHardwareInformation(), that does the same as qSlicerApplication::logApplicationInformation().
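
Until such a function exists, a first approximation can be sketched with the Python standard library alone. The function name and dictionary keys below are hypothetical, and this does not reproduce what qSlicerApplication::logApplicationInformation() logs: GPU and memory details in particular are not available this way and would need Qt, VTK, or platform-specific calls.

```python
import os
import platform


def get_hardware_information():
    """Collect coarse system information using only the standard library."""
    return {
        "operatingSystem": platform.system(),          # e.g. "Linux", "Windows", "Darwin"
        "operatingSystemRelease": platform.release(),
        "cpuArchitecture": platform.machine(),         # e.g. "x86_64", "arm64"
        "logicalCpuCount": os.cpu_count(),             # may be None if undetermined
    }


info = get_hardware_information()
```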
