re: VLC
Here is what appears while opening VLC for the first time:
I've been working on the wording of the pop-up where the user can opt in. I think it needs to be clear and easy to understand. The following text is the draft:
Short version:
Opt-in for anonymous data collection in Slicer software. We collect non-personal data on usage frequency, hours, and UI button interactions to improve the user experience. Your privacy is protected; no personal data is collected or transmitted over the network. Select your preference below:
(•) Do not collect any data.
( ) Allow anonymous data collection.
[Accept]
Long Version:
Dear Slicer User,
At Slicer, we strive to enhance your experience and improve our software continually. To achieve this, we rely on anonymous software usage data to understand how you interact with our platform and where we can focus our efforts effectively.
The data we collect is strictly limited to non-personal information, ensuring your privacy and confidentiality are paramount. We do not gather any data that could identify you personally, nor do we track your individual installations.
What Data is Collected?
Rest assured, all collected data remains anonymized and is stored locally on your device. No information is transmitted over the network without your explicit consent.
To review the implementation details of our data collection process, please refer to the code available here.
Please select your preference below:
(•) Do not collect any data.
( ) Allow anonymous data collection.
[Accept]
Thank you for helping us improve Slicer!
I like the short wording. In addition to designing this dialog, it would be great if you could also make a mock-up of the usage data display. How we present the data we've collected will be important for convincing people that it's okay for them to share it.
@BerDom.Ing missed the most important part: he will work on this topic for a few months (as part of fulfilling requirements for his degree).
I like the short version, too. It is somewhat repetitive ("anonymous data collection", "collect non-personal data", "no personal data is collected", "allow anonymous data collection"), so we could simplify things a bit.
We also need to agree on what data we would like to collect.
@BerDom.Ing could you clarify a bit what you mean?
Question for everyone: What other information should we consider collecting? (any ideas are welcome and we can sort them out later)
I'm thinking about potentially collecting information about the computer (operating system, available memory, CPU model, GPU, screen resolution) and the data (image modality, size), as this information could be useful when we make design decisions.
For reference, telemetry in other software:
After reading about the Audacity scandal, I realized it's crucial that data collection be opt-in and anonymous. It is also essential to allow time for the community to embrace the new system, rather than feeling it's imposed upon them. We should consider conducting a survey among Slicer users to gauge their concerns about telemetry usage, explain the procedures, and emphasize that the technology is not being rushed. Empowering users, providing a dedicated folder for collected data, and linking to the telemetry code repository could help. I've noticed that users express concerns about software safety when telemetry implementation is discussed.
Number of runs: Tracking how many times Slicer is started. I believe it would also be beneficial to count how often users switch modules, although implementing this might require some time and a deep understanding of Slicer's source code.
Usage hours: Measuring the time elapsed between starting and stopping Slicer.
I aim to implement the User Statistics module mentioned earlier. Line 25 of the module's description reads: "This module measures user statistics and stores them in a table. Some of the statistics measured include the active module, active segment editor effect, selected segment, duration, application status (active, wait, idle), etc. Tables from different scenes can be merged into a single table."
UI button interactions: Each interaction should trigger an event handler. It might be feasible to increase a counter in a table for each invoked event of this type, providing insights into UI redesign or creating shortcuts for frequently used interactions.
I agree that collecting information about the user's computer could prove valuable.
I've been working on finishing the Slicer extension tutorial, but I keep encountering errors. I think adding computer information collection to the module will be the first step. I'll try to do this using only Qt and VTK to avoid installing extra libraries. I've found that Qt provides a class called "QOpenGLContext," which can create an OpenGL context, and VTK provides a class called "vtkOpenGLRenderWindow," which can create a window with an OpenGL context. These can be used to retrieve graphics card information.
@BerDom.Ing would it be possible for you to attend the weekly Slicer developer meetings?
At each meeting you could give an update in a few minutes and we could discuss any questions that may arise.
Of course, I'll be glad to join the meetings.
Today, in the weekly Slicer meeting, we discussed what to use for creating mock-ups, collecting error events, using JSON to store the data, and creating schemas for user-stored data. I believe it will be easy to migrate from JSON to an SQL schema.
Thanks for the summary. Could you please describe the full list of values that were considered for collection?
The values I considered are as follows:
Number of Slicer runs: numberOfSlicerRuns
Times a module is loaded: moduleName + number of load times
GPU Information
CPU Information
Operating System
Window Resolution: width X height
Idle elapsed time in hours: idleTime
Active elapsed time in hours: activeTime
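As a rough illustration of how the values above could be stored as JSON, here is a minimal Python sketch. Only `numberOfSlicerRuns`, `idleTime`, and `activeTime` come from the list; the `moduleActivations` key, the `make_usage_record` helper, and the sample numbers are hypothetical and not part of Slicer.

```python
import json


def make_usage_record(runs, module_activations, idle_hours, active_hours):
    """Build one usage record as a plain dict (hypothetical layout)."""
    return {
        "numberOfSlicerRuns": runs,
        # Hypothetical key: module name -> number of activations.
        "moduleActivations": dict(module_activations),
        "idleTime": idle_hours,      # hours
        "activeTime": active_hours,  # hours
    }


# Made-up sample values, only to show the round trip through JSON.
record = make_usage_record(3, {"SegmentEditor": 5}, 1.5, 4.0)
text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(text)
```

Keeping the record a flat dict of numbers and short strings should also make a later move to an SQL table straightforward, since each key maps naturally to a column.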
If you are collecting the display information, maybe consider collecting "display scaling" parameters. I occasionally (and almost never consistently) find some strange cases of text scaling issues…
I'll add the scaling information to the collected values now that I know it might cause some issues.
Number of Slicer runs
It is not clear what a "run" means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.
Times a module is loaded
All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.
GPU Information
CPU Information
Operating System
Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.
Window Resolution: width X height
Need to specify bins to make sure we don't make systems identifiable that use rare resolutions. Also add scaling as Murat suggested.
We probably don't need the resolution of all screens, just the one on which the Slicer main window is displayed, plus the total number of screens.
Idle elapsed time in hours: idleTime
I'm not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?
Active elapsed time in hours: activeTime
What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).
It could also be nice to count the number of uncaught C++ exceptions and application crashes.
It is important that extension developers be able to specify custom events that they want to count. We should limit the allowed event names to make sure that no information is leaked through them (e.g., the Extensions Catalog entry JSON file could contain the list of event names that the extension can count, and we would not record or transmit anything else).
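The allow-list idea for custom events could be sketched roughly as follows. The `EventCounter` class and the event names are hypothetical; in practice the allowed names would come from wherever the extension declares them, such as its catalog entry.

```python
from collections import Counter


class EventCounter:
    """Counts only events whose names are on a fixed allow-list."""

    def __init__(self, allowed_events):
        self._allowed = frozenset(allowed_events)
        self._counts = Counter()

    def record(self, name):
        """Count the event if it is allowed; return whether it was counted."""
        if name not in self._allowed:
            return False  # unknown names are dropped, never stored
        self._counts[name] += 1
        return True

    def counts(self):
        return dict(self._counts)


# Hypothetical event names declared by an extension.
counter = EventCounter(["segmentationExported", "modelLoaded"])
counter.record("segmentationExported")
counter.record("patientNameEntered")  # not declared, so it is ignored
```

Because rejected names are never stored, an extension cannot leak free-form strings through event names, even by accident.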
We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user. For example, it may not be desirable to be able to identify that a certain hospital uses Slicer for planning or guiding certain procedures. It would be nice to know for sure, but would violate the users' privacy.
Ideally, we should not store IP addresses so that we don't need to deal with strict data handling regulations. However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.
It is not clear what a "run" means. We need to define it. For example, Slicer starts and runs for longer than 1 minute.
I was going to define a run as each time the application is opened, but I now think your approach is better. We should count a run only once Slicer has been running for more than 1 minute, so we know it wasn't opened by mistake or launched several times by repeated clicks on the icon.
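A minimal sketch of that definition, assuming session durations are available in seconds. The `count_runs` helper and the constant name are hypothetical.

```python
MIN_RUN_SECONDS = 60  # threshold proposed in this thread


def count_runs(session_durations_seconds, min_seconds=MIN_RUN_SECONDS):
    """Count sessions that lasted strictly longer than the threshold."""
    return sum(1 for d in session_durations_seconds if d > min_seconds)
```

With this rule, a session of exactly 60 seconds is not counted, while one of 61 seconds is.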
All modules are loaded when Slicer starts, so I guess you mean the number of times a module is activated (each time the user switches to that module). This would be very low-level information, and since module names may give away information (especially when the modules are not publicly released), we should carefully consider whether we really want to collect such information. Maybe modules could opt in to be logged.
Yes, I mean the number of times a module is activated; thanks for the correction. I agree that we should be careful about collecting this information. I find it valuable and not too intrusive, but I want to hear more opinions about it.
Need to specify enumerated values, because strings may identify some rare configurations. Make sure that available memory is included.
I will specify the GPU, CPU, and operating system values once I can retrieve them and test on a couple of computers, so I know what form the information takes. It could differ between vendors and operating systems. I will make sure that available memory is included.
Need to specify bins to make sure we don't make systems identifiable that use rare resolutions. Also add scaling as Murat suggested.
I'm not sure what you mean by specifying bins; a brief explanation would help me. Scaling will be added.
I'm not sure if this is relevant. Why do we need to know if people leave their Slicer running on their computer?
Idle time is relevant for knowing how long Slicer was running in total: idle time plus active time equals the total time Slicer was running.
What would be the sufficient resolution? 5 minutes? 10 minutes?
We would probably write the usage statistics to disk at this time interval (to make sure the information is captured even if Slicer crashes or is terminated).
I think 10 minutes is a good resolution for saving the active time.
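Saving at a fixed interval could look roughly like the sketch below. It assumes the 10-minute interval suggested above and writes through a temporary file so that a crash in the middle of a write does not leave a half-written statistics file. The `UsageRecorder` class and the file layout are hypothetical, and for simplicity the flush is triggered by accumulated active time rather than by a timer.

```python
import json
import os
import tempfile

FLUSH_INTERVAL_SECONDS = 10 * 60  # 10 minutes, as suggested in this thread


class UsageRecorder:
    """Accumulates active time and periodically saves it to a JSON file."""

    def __init__(self, path, flush_interval=FLUSH_INTERVAL_SECONDS):
        self.path = path
        self.flush_interval = flush_interval
        self.active_seconds = 0.0
        self._since_flush = 0.0

    def add_active_time(self, seconds):
        self.active_seconds += seconds
        self._since_flush += seconds
        if self._since_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # Write to a temporary file in the same folder, then swap it in.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"activeTime": self.active_seconds / 3600.0}, f)
        os.replace(tmp_path, self.path)
        self._since_flush = 0.0
```

At most one interval of active time is lost if Slicer crashes between two saves.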
It could also be nice to count the number of uncaught C++ exceptions and application crashes.
I agree that it would be nice to count uncaught C++ exceptions and application crashes, though I will surely need time, help, and guidance.
We also need to think about recording location of the user with some granularity. We could do this at server level from the IP address, but we would need to make sure that the location does not reveal too much information about a user.
I agree that we should not store IP addresses, but I think it would be okay to collect the location at a coarse granularity.
However, then it is not clear how we can detect and mitigate trivial data manipulation attempts.
To detect and mitigate trivial data manipulation attempts, we could use Reed-Solomon error correction. This is one more value to store, but it verifies that the other parts of the message were transmitted without error.
Instead of storing any individual resolution (there are many standard resolutions), you would specify bins, such as:
Or maybe you would specify bins for horizontal resolution and aspect ratio.
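Binning by horizontal resolution and aspect ratio could be sketched like this. The bin edges, the labels, and the list of aspect ratios are made-up examples, not values agreed on in this thread.

```python
import bisect

# Hypothetical bin edges for horizontal resolution, in pixels.
WIDTH_EDGES = [1280, 1920, 2560, 3840]
WIDTH_LABELS = ["<1280", "1280-1919", "1920-2559", "2560-3839", ">=3840"]

# Hypothetical list of common aspect ratios to snap to.
ASPECT_RATIOS = {"4:3": 4 / 3, "16:10": 16 / 10, "16:9": 16 / 9, "21:9": 21 / 9}


def bin_width(width):
    """Return the label of the bin that contains the given width."""
    return WIDTH_LABELS[bisect.bisect_right(WIDTH_EDGES, width)]


def nearest_aspect(width, height):
    """Return the name of the listed aspect ratio closest to width/height."""
    ratio = width / height
    return min(ASPECT_RATIOS, key=lambda name: abs(ASPECT_RATIOS[name] - ratio))
```

A rare resolution such as 1366x768 would then be reported only as `"1280-1919"` and `"16:9"`, which many other systems share.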
You can use the same methods that are used for logging system information in the application log at startup.
I'm not worried about data corruption, but that we should not make it extremely easy to manipulate statistics by submitting manipulated content to the server.
Yes, do not collect information about modules that are not publicly included in the Slicer extensions index. Developers who create custom Slicer applications for commercial purposes (not open source) will not want their module names leaked to the open-source Slicer project. Developers may be creating new modules against their custom Slicer application but may also be testing against a regular Slicer version, so the application name wouldn't be a valid way to exclude module name logging.
With the Reed-Solomon algorithm, you obtain a value that correlates with all the previous information in the message. Every time the usage statistics are saved on disk, running the algorithm ensures that all previous parts of the message (the usage statistics) remain unchanged. If a change were to occur, when decoding the Reed-Solomon value against the message, it would indicate that the message has changed. This value could be stored in the same storage as the user statistics and made available to the user if they wish to view it. If users want to manipulate the information on the usage statistics after altering one part of the message, they would need to run the Reed-Solomon algorithm, which would be publicly available. Then, they could save the result in the control value stored with the usage statistics. It's not bulletproof, but it's not as trivial as simply changing the number in the available user statistics file.
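The control-value idea can be illustrated with a short sketch. Note the substitution: it uses a SHA-256 digest from the Python standard library as a stand-in, not Reed-Solomon, because the point is only to show how a publicly computable check value behaves. The function names are hypothetical.

```python
import hashlib
import json


def check_value(stats):
    """Compute a control value over a canonical JSON form of the stats."""
    canonical = json.dumps(stats, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_consistent(stats, stored_value):
    """Return True if the stored control value matches the stats."""
    return check_value(stats) == stored_value


stats = {"numberOfSlicerRuns": 3, "activeTime": 4.0}
stored = check_value(stats)

# Editing a number without updating the control value is detected...
tampered = dict(stats, numberOfSlicerRuns=300)
detected = not is_consistent(tampered, stored)

# ...but anyone can recompute the value, since the algorithm is public.
forged = check_value(tampered)
accepted_after_forgery = is_consistent(tampered, forged)
```

This matches the description above: the check catches a naive edit of the statistics file, but it is not a defense against someone willing to rerun the public algorithm.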
Thank you for the clarification. Saving only the standard resolutions is indeed the best solution.
I will look into that. Thanks for the help.
Then we should enable modules to opt in to be logged.
I've been working on the mock-ups for presenting the information to the users. I don't know what the scaling values are like, so I left them out for now.
In the weekly meeting, regarding the telemetry module, we discussed:
Asking developers more specific questions about the data that needs to be gathered.
Exploring methods to prevent the collection of fake data. We cannot use a hash because the code is open source. An anonymization approach could involve replacing IPs with geolocation data (opt-in is needed).
Recognizing the potential usefulness of telemetry for compiling a census of the compute hardware used for Slicer.
My upcoming tasks include developing a Python data collector, setting up a server with a database, processing server logs, creating graphics, and implementing an endpoint to confirm successful receipt of data.
I will need a Python function like slicer.util.getHardwareInformation() that does the same as qSlicerApplication::logApplicationInformation().
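Until such a function exists, part of this information can be gathered with the Python standard library alone. The sketch below is only an approximation under that assumption: `get_hardware_information` and its keys are hypothetical names, it does not call into Slicer, and it does not cover GPU or memory details.

```python
import os
import platform


def get_hardware_information():
    """Collect basic system information using only the standard library."""
    return {
        "operatingSystem": platform.system(),  # e.g. "Linux" or "Windows"
        "osRelease": platform.release(),
        "architecture": platform.machine(),    # e.g. "x86_64"
        "processor": platform.processor(),     # may be empty on some systems
        "logicalCpuCount": os.cpu_count(),     # may be None if unknown
    }


info = get_hardware_information()
```

The raw strings returned here would still need to be mapped to enumerated values before being recorded, as discussed earlier in the thread.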