Cleaning up search results: robots.txt exclusion for old documentation


(Isaiah Norton) #1

Slicer’s search results are currently not as helpful as they could be. For example, the top results for the following searches are almost all outdated versioned wiki pages (often 3.6 or very early 4.x series):

slicer editor
slicer image guidance
slicer navigation
slicer diffusion tutorial
slicer openigtlink

Considering that we are now de facto using nightly as a rolling release, I’d like to propose excluding older wiki documentation versions from indexing by using the following entry in the slicer.org robots.txt. (If that’s too extreme, we could also allow /4.6.)

Disallow: /wiki/Documentation/*
Allow: /wiki/Documentation/Nightly/*

Based on the MediaWiki and Google documentation, this should result in only the nightly documentation being indexed.
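To sanity-check how the two rules interact, here is a minimal Python sketch of the precedence rule as described in Google's robots.txt documentation: the longest matching pattern wins, and on a tie the Allow rule wins. This is a hand-rolled approximation, not Google's actual matcher, and the version paths and page names in the usage lines are only illustrative.

```python
import re

# The two rules proposed above, as (kind, pattern) pairs.
RULES = [
    ("disallow", "/wiki/Documentation/*"),
    ("allow", "/wiki/Documentation/Nightly/*"),
]


def _to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under `rules`.

    The longest matching pattern wins; on equal length, 'allow' wins.
    A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            is_allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# Illustrative paths (page names are assumptions, not checked against the wiki).
print(is_allowed("/wiki/Documentation/Nightly/Modules/Editor"))
print(is_allowed("/wiki/Documentation/4.6/Modules/Editor"))
print(is_allowed("/wiki/Main_Page"))
```

Under this reading, the Nightly pages stay crawlable because the Allow pattern is the longer match, while every other versioned path falls under the Disallow rule. Pages outside /wiki/Documentation/ are unaffected.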


It may also be useful to disallow crawling of the images folder, which would prevent old PDF versions from being indexed:

Disallow: /w/images/*

The upshot would be better control over visibility and versioning by eliminating direct PDF links in search results, but I’m not sure what the downsides would be, if any.


cc @freephile @mhalle


(Steve Pieper) #2

But aren’t the page ranks based on other pages that link to our pages? I’d guess that’s why some of the older links are still the highest hits.

Are we sure that blocking the old pages from crawling would make the new pages show up at the top of Google? Or would the links to the older pages just stop being considered?

Currently we have the banner on old pages that suggests people look at the newer pages – but maybe we should actually redirect to the newest instead with a banner option to go back to the older page.
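As a rough illustration of the redirect idea, here is a minimal Python sketch that maps a versioned documentation path to its Nightly counterpart. It assumes the URL layout /wiki/Documentation/<version>/<page>, and the function name `nightly_target` is hypothetical. It does not check that the Nightly page actually exists, which a real redirect would need to do.

```python
import re

# Assumed layout: /wiki/Documentation/<version>/<page>
_VERSIONED = re.compile(r"^/wiki/Documentation/(?P<version>[^/]+)/(?P<page>.+)$")


def nightly_target(path):
    """Return the Nightly equivalent of a versioned documentation path.

    Returns None when the path is not a versioned documentation page
    or already points at Nightly, so no redirect is needed.
    """
    match = _VERSIONED.match(path)
    if match is None or match.group("version") == "Nightly":
        return None
    return "/wiki/Documentation/Nightly/" + match.group("page")


# Illustrative path; the page name is an assumption.
print(nightly_target("/wiki/Documentation/4.6/Modules/Editor"))
```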


(Andras Lasso) #3

Instead of specific versions, it would probably be best to have only two versions of the documentation in the search results: “stable” and “latest”. But then it would be difficult to find documentation for specific software versions.

Overall, I’ve lost faith in the wiki as a place for the user guide (reference manual, detailed documentation of modules). I think documentation generated from the code, stored in the same repository as the code, and made available through ReadTheDocs and as a downloadable PDF for each specific version would be a more sustainable solution. You can have a look at what Segment Editor’s documentation looks like in ReadTheDocs now: http://slicer.readthedocs.io/en/latest/user_guide/module_segmenteditor.html.


(Andrey Fedorov) #4

Me too, a while ago.

So now that we have two different mechanisms for documentation, what are the advantages of continuing to use the wiki as the primary/recommended one?

Are instructions on populating content on ReadTheDocs available somewhere? Or is this just an experiment?


(Andras Lasso) #5

It’s still experimental, but the mechanism works very nicely; the content just isn’t there yet.

If you just want to make edits to an existing page, click the “Edit on GitHub” link at the top. Once you have finished your edits, GitHub will offer to create a pull request for you automatically; just accept that and you are good to go. Once the pull request is merged, the documentation is rebuilt automatically.

If you want to make significant edits (adding lots of pages, etc.), you can work on this branch locally, generate the documentation using Sphinx, and send a pull request with all your changes once you are done.


(Steve Pieper) #6

We still have the problem of SEO - searching for ‘slicer segmentation editor’ still takes you to the wiki which doesn’t (yet) link to readthedocs.


(Andras Lasso) #7

Yes, for now. I expect that once we start using readthedocs for real, it will move up in the page ranks. Google does find the readthedocs documentation pages of other projects.