Should we use Git LFS to manage data?

You might consider git lfs (Large File Storage) for the large files. I
use it in the VTKExamples for some of the very large datasets.

Bill

Thanks for the suggestion Bill.

To manage the data, I think we will standardize on the approach already used by ITK. See https://itk.org/ITKExamples/Documentation/Contribute/UploadBinaryData.html

LFS may be lower maintenance and require fewer Kitware resources. Remember Midas?

I expect that GitHub has more resources and it is part of their core.

Also when you transition to GitHub, try to avoid any Kitware Slicer mods. Don’t fall into the Gerrit trap by adding local mods.

Do you know how git-lfs works with private and public forks? Also, can anybody contribute new data sets or update them using pull requests?

Remember Midas?

So far, it has been working reasonably well. The Midas instance currently used to handle the Slicer data and packages has more than 350k items stored in its database. (This is a side topic, but they have been working on a new iteration of this infrastructure, now called Slicer package manager, that is built on top of Girder; more on this in a different post.)

Also, thanks to the CMake ExternalData module, it is very easy to specify multiple data sources, a global cache, and granular downloads.

The multiple data source support provides redundancy in data hosting, so we do not rely on a single provider. For example, here are the data sources configured for ITK, with the option of easily providing your own list.

I expect that GitHub has more resources and it is part of their core.

Also, for the ITK releases, it is possible, for example, to serve the testing data from GitHub using the GitHub Pages feature.

Here is some more info I gathered while investigating LFS last year:

  • it is all or nothing: no incremental download

  • it does not support parallel download at clone time. That said, it is possible to work around this by setting an environment variable (or using the --skip-smudge option) to disable download, and then running git lfs pull explicitly.

  • it requires another tool to be installed. That said, it seems quite easy now by running

  • no central cache or way to share data between repos. (This has been an issue since 2015 and was added to the git-lfs roadmap in 2016, but it is still pending.)

it is all or nothing

Here is a comment from a colleague that could mitigate this:

For example, it is possible not to install git-lfs globally but rather do a local install after the repo is cloned if the data is needed at all.

This for example lets you have 1 repo with data present across several builds. The other builds point to this directory for their data rather than their own source dirs. I imagine you could use the trick of never installing git-lfs globally (i.e., nothing in ~/.gitconfig) to prevent a global smudge and instead ask git-lfs during the build to smudge only certain files.

In addition, there is support for adding a default match exclusion
(following .gitignore semantics) like this:

git config -f .lfsconfig lfs.fetchexclude '*'

and then use similar logic that is in the ExternalData module to do
manual git lfs fetch && git lfs checkout commands to fetch data
on-demand at build time.

question

It would be great to hear feedback from someone using it on a day-to-day basis.

@thewtex Since you have been through the process of implementing this for ITK, it would be great to hear your comments. If you had to do it again, what would you choose?

ITK manual references

When I considered using git-lfs for another project some time ago, one of the key concerns was that it was not possible to download the data directly from the GitHub web site (or any other web site). The only way to get to it was to install the git-lfs client, which I considered to be a huge deterrent for a non-tech-savvy user. See the issue and communication with GitHub support summarized here: https://github.com/isaacs/github/issues/712.

There were also other issues related to usability, and possibly bugs in the git-lfs implementation, that I encountered at the time, but the lack of download without extra software was the deciding factor against git-lfs for me.

CMake ExternalData has two important advantages:

  1. Simplicity
  2. Lack of a dependence on a single point of failure

git lfs is similar to git submodule: it is another set of commands that a developer must learn, and it requires learning and keeping another model and state associated with the source code repository tree in your head. If you already know git submodules or git lfs, then they are nice. However, they are both a big barrier to entry for new contributors, many of whom are just struggling to understand Git.

With ExternalData, a developer can just upload the binary in a simple web page interface, and download a file that contains its hash. In the past, this was more difficult because the Midas web interface was clunky and slow. Now, Girder is faster and the UI is much improved. And, @zachmullen is working on making it even better. Or, it would be possible to create a simple web app to do this…
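To illustrate the "file that contains its hash" mentioned above, here is a minimal local sketch. It assumes the ExternalData content-link convention of a small text file named after the data file plus the hash algorithm (for example image.nrrd.sha512). The function name write_content_link is hypothetical, and the sketch does not upload anything; the web page described above still handles that part.

```python
import hashlib
from pathlib import Path


def write_content_link(data_file: Path, algo: str = "sha512") -> Path:
    """Write '<name>.<algo>' next to data_file, containing only its hex digest.

    Hypothetical helper: this small text file is what would be committed to
    the source tree in place of the binary itself.
    """
    digest = hashlib.new(algo, data_file.read_bytes()).hexdigest()
    link = data_file.with_name(f"{data_file.name}.{algo}")
    link.write_text(digest + "\n")
    return link
```

The point of the design is that the repository only ever stores a few dozen bytes per data file, and the hash doubles as the lookup key on whichever server holds the real content.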

With ExternalData, multiple simple, redundant data stores are possible, including

  • A local cache
  • An archive that can be stored and distributed with a release
  • GitHub pages
  • A Midas server
  • A Girder server
  • An Apache server
  • An Azure blob store
  • Cloud provider X’s blob storage service

For ITK, we have used the first seven options throughout the project’s lifetime. If one of the resources is not available, the system tries a different resource. This means you are rarely, if ever, disabled when you are offline, have a poor connection, or the LFS server is offline.
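The fallback behavior described above can be sketched roughly as follows. This is not the ExternalData implementation, only an illustration of the idea in Python: check a local cache first, then try each source in turn and accept the first download whose hash matches. The %(algo) and %(hash) placeholders mimic the style of ExternalData URL templates; the function names and any URLs passed in are assumptions.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def _http_get(url: str) -> bytes:
    """Default downloader; can be swapped out (for example in tests)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_by_hash(
    digest: str,
    url_templates: Iterable[str],
    cache_dir: Path,
    algo: str = "sha512",
    download: Callable[[str], bytes] = _http_get,
) -> Path:
    """Return a verified local copy of the object identified by its hash."""
    cached = cache_dir / algo / digest
    if cached.is_file():
        return cached  # local cache hit: no network access needed
    for template in url_templates:
        url = template.replace("%(algo)", algo).replace("%(hash)", digest)
        try:
            data = download(url)
        except OSError:
            continue  # this source is unreachable; try the next one
        if hashlib.new(algo, data).hexdigest() != digest:
            continue  # wrong or corrupted content; try the next one
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)
        return cached
    raise RuntimeError(f"no source could provide {algo}:{digest}")
```

Because every object is addressed by its hash, any server that can return the right bytes is as good as any other, which is what makes adding or dropping a data store cheap.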

Also, CMake ExternalData requires no other dependencies if you are already using CMake.

I had no trouble installing a client. I think any approach will require
some setup. That said, my comments were only meant to start a
discussion. I am no expert on this subject. For my small project
https://lorensen.github.io/VTKExamples/site/ it works. A more
sophisticated approach is probably required for Slicer…

Bill

I was also looking at git-lfs recently, and this seems to have changed:

https://github.com/SlicerDMRI/DMRITestData/blob/master/Tractography/fiber_ply_export_test.vtk?raw=true

from:

So, this would allow git-lfs to be used for data management, but not necessarily for runtime usage – clients would still download from a hash.

However, my personal opinion on git-lfs is a bit low right now because it broke all git pushes when I enabled it on a repo, due to this bug (can’t use GitHub tokens with macOS credential helper). Uninstalling was easy and restored the ability to push, but figuring out that I needed to do so took some time. Credential helper support should be a relatively simple fix, so it is somewhat concerning that it has not been fixed promptly.

I like the idea of reducing dependence on bespoke projects, but given that (1) on the client side, we are just pulling from raw URLs no matter what and (2) Girder implements the S3 API (proprietary, but more-or-less a standard at this point), Girder is only a soft dependency. We could move the data to any storage that implements S3-style buckets.

(to that end, it would be good to eventually put the data URLs and hashes in separate files rather than inline in CMake files, but this is not a priority)

It’s nice to see that individual files can be downloaded, but the issue remains that when the whole repository is downloaded as ZIP, it only contains git-lfs pointers to the data.

I was looking into this again, and as of today at least, if I download the whole directory as ZIP, individual files are downloaded, and not just links.

Someone alerted me to this issue again, and I realized I was checking this using a repository that contains git-lfs managed content side by side with a large file stored directly in the repo. I downloaded the whole repo as ZIP, and concluded the behavior had changed because I saw that large file.

It remains the case that git-lfs managed content is NOT included when the repository containing that content is downloaded as ZIP.
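For anyone who wants to check a downloaded ZIP themselves, here is a small sketch that lists files that are still git-lfs pointers after extraction. It relies on the documented pointer format (a short text file whose first line is "version https://git-lfs.github.com/spec/v1", followed by oid and size lines); the function names are hypothetical.

```python
from pathlib import Path
from typing import List

# First line of every Git LFS pointer file, per the pointer file spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if path looks like a Git LFS pointer rather than real content."""
    # Pointer files are tiny; the size guard avoids reading large data files.
    if path.stat().st_size > 1024:
        return False
    return path.read_bytes().startswith(LFS_POINTER_PREFIX)


def find_lfs_pointers(root: Path) -> List[Path]:
    """List files under root (an extracted archive) that are only pointers."""
    return sorted(p for p in root.rglob("*") if p.is_file() and is_lfs_pointer(p))
```

Running find_lfs_pointers on the extracted directory should return an empty list only if every file holds real content; in the mixed repository described above, it would flag the LFS-managed files and skip the large file stored directly in Git.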

I apologize for my sloppiness.

I’ve tested git-lfs and have mixed results.

Good:

  • easy to set up, just install git-lfs and specify what files should be stored using git-lfs (you can specify folders and/or file extensions)
  • works nicely and transparently when used on systems where git-lfs is installed
  • git-lfs files show up on GitHub’s web interface as regular files (e.g., you see the actual file content instead of pointer information)

Bad:

  • GitHub’s web interface always uploads files as regular files
  • If someone who has not installed git-lfs commits files, those files will be committed as regular files
  • Regular files can be converted to git-lfs files (there is an officially supported script for that), but it rewrites git history
  • Users reported various esoteric issues that were hard to understand and fix (see 1, 2, 3), even when users were careful and experienced. It is scary to think about how wrong things can go when we accept pull requests from a larger community.

In addition to the “Bad” above, it should not be forgotten that git-lfs (if we use GitHub hosting, and if we don’t, the idea of consolidating things in one place goes away) requires the user to buy “data packs”. Included is only “1 GB of free storage and 1 GB a month of free bandwidth” (see https://help.github.com/articles/about-storage-and-bandwidth-usage/). Also, “Purchasing data packs for Git LFS is independent of any other paid plan on GitHub”, which to me means that the free entry level plan that any academic user can get does not give you any “data packs”. If we keep data on git-lfs, I imagine we would be checking it out quite frequently with nightly testing and users downloading sample data, so this fee alone might be a complete stopper.
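To put a rough number on the bandwidth concern, here is a back-of-the-envelope sketch. Only the 1 GB free monthly quota comes from the quote above; the data size and the number of downloads per day are made-up inputs for illustration, not measurements of Slicer's actual usage.

```python
def monthly_lfs_bandwidth_gb(data_gb: float, downloads_per_day: float, days: int = 30) -> float:
    """Estimate monthly transfer if every download pulls the full data set."""
    return data_gb * downloads_per_day * days


FREE_BANDWIDTH_GB = 1.0  # free monthly bandwidth quoted in the post above

# Hypothetical workload: 0.5 GB of test data fetched by 4 nightly builds.
usage = monthly_lfs_bandwidth_gb(data_gb=0.5, downloads_per_day=4)
print(f"{usage:.0f} GB/month vs {FREE_BANDWIDTH_GB:.0f} GB free")
```

Even with these modest assumed inputs, the estimate lands far above the free quota, and that is before counting users downloading sample data.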

I know this is an old thread, but has anything more been learned in this area?
Is there something out there at this point that sits in between a full-blown Girder implementation and the git-lfs approach? i.e., is there a “deploy a Girder instance” equivalent to “create a repo in GitHub” out there in the wild yet?

It seems like git-lfs has the lowest bar to entry from a developer standpoint for smaller projects?