This is a failed proposal. Consensus for its implementation was not established within a reasonable period of time. If you want to revive discussion, please use the talk page or initiate a thread at the village pump.
Search engines such as Google and Bing deliver search results by using computer programs called web crawlers to 'surf' the internet, looking for new pages to add to search indices and for updates to previously 'crawled' pages. These potentially intrusive programs are governed by a set of standards that allow website owners to control which pages the crawlers are allowed to visit, and which links they are allowed to follow to reach new pages. In the context of Wikipedia, this means that we have the ability to control which pages are accessible to web crawlers, and hence which pages are returned by search engines such as Google.
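The control standard referred to here is the Robots Exclusion Protocol, expressed through a robots.txt file served at the root of the site. As a minimal, hypothetical sketch (the paths are illustrative, not Wikipedia's actual rules):

```
# Rules for every crawler
User-agent: *
# Paths the crawler must not fetch; anything under these prefixes is skipped
Disallow: /wiki/Special:
Disallow: /trap/
```

A crawler that honours the protocol fetches this file first and never requests URLs under the disallowed prefixes.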
From its foundation, all of Wikipedia's content was made accessible to web crawlers and search engines. Robots.txt, the file that controls web crawler access, was used primarily to block individual web crawlers that were making excessively long or rapid crawls and hence draining system resources. This meant that in addition to all our encyclopedic content, enormous amounts of discussion, dispute, and drama were made available to external searches. This material is the focus of a considerable number of complaints to the OTRS service, and can often contain unwanted personal information about users, undesirably heated debates about article subjects, and other content that does nothing to enhance Wikipedia's reputation as a professional encyclopedia. In 2006 the German Wikipedia held a 'Meinungsbild' (roughly analogous to an RfC) and asked the developers to exclude all talk namespaces from web crawlers (see T6937), in an attempt to control some of this content.
Wikipedia's powerful presence as the internet's eighth most popular website gives all our pages very heavy weighting in search engine rankings; a Wikipedia page that matches the search term entered is almost guaranteed a place in the top ten results, regardless of the actual page content. While this is an extremely positive status for our articles and content, it is not always beneficial.
In June 2006, MediaWiki was enhanced to give developers the ability to exclude individual namespaces from being indexed by web crawlers. This functionality was extended in February 2008 to allow developers to set indexing policy on individual pages. Finally, in July 2008, users were given the ability to manually set indexing policies for individual pages using two magic words, __INDEX__ and __NOINDEX__; the developers can customise the pages in which these magic words function.
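When one of these magic words takes effect, what a crawler actually sees is a robots meta element in the rendered HTML. As a rough sketch of how a tool could detect that element (the sample HTML and the exact content value are assumptions for illustration; MediaWiki's actual output may differ):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> elements in a page."""
    def __init__(self):
        super().__init__()
        self.policies = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.policies.append(attr.get("content") or "")

def robots_policy(html):
    """Return the robots policies declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.policies

# Hypothetical rendering of a page carrying __NOINDEX__
sample = '<html><head><meta name="robots" content="noindex,nofollow"/></head><body></body></html>'
print(robots_policy(sample))  # ['noindex,nofollow']
```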
Until late 2008, the poor quality of Wikipedia's own internal search engine meant that editors relied upon Google to find material for internal purposes, such as past discussions, useful help pages, and other information. In October 2008, the internal search function was significantly improved, providing all the functionality already available through search engines such as Google, and also incorporating a number of features unique to Wikipedia, such as automatic identification of redirects and page sections, and more appropriate search rankings. This made internal search superior to external searches such as Google for finding internal content. In December 2008, new updates to the MediaWiki software enabled the insertion of inline search buttons to search through sets of subpages, such as the archives of talk pages or the Administrators' noticeboard.
This section needs expansion with details of any controversies that have arisen due to Google indexing of non-content pages.
As a result, the entirety of our editorial pages has been spidered (pushed onto search engines such as Google). For a smaller website this was not a big deal; for a "top 5-10 website" it is. Dialogue about Wikipedia users, including their internal actions as editors, is routinely a "top hit" for those individuals long after they stop editing, and pages outside mainspace and the well-patrolled parts of other namespaces may contain large amounts of unchecked, unverified user writing, which any user may place in a variety of namespaces. Unless it is significantly problematic and actively noticed, such material may remain unchecked and spidered as Wikipedia content for years.
Our visitors and readers look for encyclopedic content, not inward-facing discussions and disputes between users. Our readers come first. There is considerable content we want the public to find and see; that is the end product of the project.
The rest, including popular project pages such as AfD, all "talk" namespaces, dispute resolution pages, user pages, and the like, is of little benefit to the project if indexed by search engines. Many of these pages also raise considerable concerns about privacy and the ease of finding harmful material (user disputes and allegations) on Google, far outweighing any help they give the project. We don't need those publicized; they are internal (editorial-use) pages.
It is proposed that it is finally time to close the gap. Rather than NOINDEXing individual pages mostly ad hoc, I can see no strong continuing rationale for any "internal" page to be spidered at all, and I can see problems reduced by ending it. Use internal search to find such material, and switch off spidering of anything that is not genuinely of public note as our "output/product".
A prior discussion has taken place at Wikipedia:Village pump (policy)#NOINDEX of all non-content namespaces (Dec 2008 - Jan 2009). This proposal is being set up to formally see if consensus exists to request these changes, and to identify the technical means to do so.
Namespace | Default state | Override allowed?
---|---|---
Mainspace | Indexed | No
User: | Noindexed | Yes
Wikipedia: | Noindexed | Yes
File: | Indexed | Yes
MediaWiki: | Noindexed | No
Template: | Noindexed | Yes
Help: | Indexed | No
Category: | Indexed | Yes
Portal: | Indexed | Yes
All talk namespaces (Talk:, User talk:, File talk:, etc.) | Noindexed | No

Changes from the current setting are highlighted.
The proposed changes fall into two areas: technical, and procedural, as described below.
The Wikipedia:, MediaWiki: and Template: subject namespaces, and all talk namespaces, are set to be not indexed by default; that is, no pages in these namespaces will be found by web crawlers and hence will not appear in search engine rankings, although all pages will continue to be visible in Wikipedia's own internal search results.
In addition, the magic words __INDEX__ and __NOINDEX__ are disabled in the MediaWiki: and Help: subject namespaces, and in all talk namespaces. This has the effect of 'locking in' the default setting so that it cannot be changed on a per-page basis.
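As a sketch of what the developer-side change might look like, assuming MediaWiki's standard robot-policy configuration variables ($wgNamespaceRobotPolicies and $wgExemptFromUserRobotsControl); the exact values here are illustrative, not a tested configuration:

```php
# LocalSettings.php sketch (illustrative, not a tested configuration)
# Default the changed subject namespaces and talk namespaces to noindex
$wgNamespaceRobotPolicies = array(
    NS_PROJECT   => 'noindex',  # Wikipedia:
    NS_MEDIAWIKI => 'noindex',  # MediaWiki:
    NS_TEMPLATE  => 'noindex',  # Template:
    NS_TALK      => 'noindex',  # Talk: (likewise for the other talk namespaces)
);

# Namespaces where __INDEX__/__NOINDEX__ are ignored, locking in the default
$wgExemptFromUserRobotsControl = array( NS_MEDIAWIKI, NS_HELP, NS_TALK );
```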
The new indexing settings are shown in the table above.
With these changes, it becomes necessary to develop new guidelines to govern the use of the magic words __INDEX__ and __NOINDEX__ in those namespaces where they function.
Some content (non-encyclopedic material such as bug reports, internal project logos, etc.) may be noindexed on a consensus basis. A discussion of NOINDEXing non-free media is likely to take place separately from this proposal.
'Maintenance' categories will be manually NOINDEXed; all other categories (i.e. content categories) should not be overridden and shall remain indexed.
Slightly longer answer

Project space contains a wide range of material. Like userspace, it can include almost any user writing, provided it appears superficially to be about the project or of project interest: discussions; disputes; negative material on users; essays on the viewpoints of any editor; and considerable other unchecked material.

It also contains a significant amount of genuinely valuable material that is as much our "output/product" as any article: our policies, guidelines, explanations of processes, well-recognized stable pages on Wikipedia/Wikimedia, reference data, and so on. Project space is a mix of all of these. Some of it should be spidered (broadly, the valuable material above and anything else consensus endorses). A lot is unchecked, and new material may be added at any time. Since policies and guidelines can be indexed collectively via their respective templates, since the set of stable, valuable reference pages changes slowly, and since the number of other pages grows far faster and is unchecked, it is easier and more effective to default to NOINDEX and then index, as an exception, anything (or any group or category of pages) that consensus says is valuable.
Full answer

A page can be set not to be indexed in a number of ways. First, the web crawlers used by search engines check for a file called "robots.txt" at the root of a webserver, and use it to set global parameters for which paths on the site the crawler may access. Wikipedia's robots.txt file is viewable at http://en.wikipedia.org/robots.txt. Entries can be added to the file either by Wikimedia developers, or by en.wiki admins by editing MediaWiki:Robots.txt; entries added by the developers override those added by en.wiki admins. Secondly, meta HTML tags may be added to the header of individual pages to instruct web crawlers that visit the page to 'ignore' it. Several MediaWiki configuration settings allow these tags to be set on a wiki-wide, per-namespace, and per-page basis. Finally, wiki users can add a behavior switch (__INDEX__ or __NOINDEX__) to a page's wikimarkup to add the meta element manually; the namespaces in which these switches function can be configured by the developers. Note that meta HTML tags cannot override restrictions set in the robots.txt file: a page excluded by robots.txt will never be fetched, so any local override in its markup will never be noticed. Using these options, we can ask the developers to implement any permutation of default state and overrides for any namespace (using the MediaWiki configuration settings), and also to control indexing of individual pages (using the magic words).
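The robots.txt mechanism described above can be exercised with Python's standard urllib.robotparser module; the rules here are illustrative, not the live en.wikipedia file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules of the kind a Wikipedia robots.txt might contain
rules = [
    "User-agent: *",
    "Disallow: /wiki/Wikipedia:Articles_for_deletion/",
    "Disallow: /wiki/Special:",
]

rp = RobotFileParser()
rp.parse(rules)  # normally populated via rp.set_url(...) and rp.read()

# Article pages remain fetchable; disallowed prefixes are blocked
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Chess"))          # True
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Special:Random")) # False
```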