Wikipedia talk:Controlling search engine indexing
noindex on new, unpatrolled pages
I was asked in the help channel about a page being noindexed, and it turns out (according to Phabricator) that new, unpatrolled pages are noindexed by default. It might be worth putting on this page, but at the very least, I wanted to put it on the talk page so people might be able to find it. --MarkTraceur (talk) 14:02, 25 October 2016 (UTC)
Possible updates needed
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
- (copied from User_talk:Xaosflux)
- NOINDEX in mainspace
I was reading through the latest NPP RfC and related threads and noticed something I'm apparently confused about. I figure you'd probably know :) Much has been made of the possible dangers of "bad" new articles getting indexed by search engines. Speedy tags apply {{NOINDEX}}, but is it actually the case that new articles are not indexed until patrolled? I don't believe that was true last year (last time I had a reason to notice), but I may be behind. WP:NOINDEX seems to be out of date. I tried to test it by creating an article with my sock, but was lucky enough that it was patrolled quickly. (Yet the article still isn't indexed on google, and others I created today with my main account are... unfortunately I don't have enough articles stored up for any more sampling :) Opabinia regalis (talk) 06:59, 23 November 2016 (UTC)
- @Opabinia regalis: (no)indexing isn't always an exact science. Since October 2016 (cf. phab:T147544) new articles get an HTML meta tag, <meta name="robots" content="noindex,nofollow"/>, applied until they are patrolled (or until 90 days go by, I haven't tested that). This tag is also applied if the article contains a deletion tag such as {{speedy}}. Note, this is similar, but not identical, to the behavior that __NOINDEX__ uses. There is a configuration parameter, $wgExemptFromUserRobotsControl, that prevents the INDEX and NOINDEX magic words from overriding the namespace default (which for Namespace:0 (Article) is set to INDEX). This is to prevent vandals from NOINDEX'ing random pages out of search. The noindex meta tag is merely a request to web crawlers - Google generally honors these - but some search engines may not. Finally, being available for indexing doesn't require or "push" a notice to all of the search providers of the world - it is up to them to fetch and index a page - sometimes this is fast, sometimes it takes a long time. Hope this helps. — xaosflux Talk 14:06, 23 November 2016 (UTC)
- Yeah, I know it's not a push notification that pops up immediately; it's just very noticeable that I created three articles yesterday on very similar topics, and the two created by this account were indexed immediately, but the one that needed "patrolling" is still absent. Hardly statistically significant, but I hadn't given it much thought because most of my articles are on very obscure topics and they usually pop up in google searches for the title near-instantly. I suppose we'll have to at least update the boilerplate for autopatrolled - the conventional wisdom is that the user right doesn't benefit the holder, but does benefit others by saving them some work; that's clearly no longer true if we assume that people create new articles because they want others to find and read them.
- Anyway, thanks for the phab link, that's what I was looking for. Opabinia regalis (talk) 20:34, 23 November 2016 (UTC)
- Opabinia regalis I can't find the documentation - but I hear that google does follow our new pages feed - but that non-autopatrolled page (autopatrol is included in your sysop group) would have been skipped - so now it would have to wait to get spider-indexed. I'm assuming you are referring to New Jersey polyomavirus. I pulled the source on it, and it is not (now) flagged for noindex. I made a minor edit on it, which may help kickstart indexing on an external site. Please note, none of this behavior has changed due to removing the patrol behavior from autoconfirmed users - non-autopatrolled editors would still have needed someone else to mark their page as patrolled. This indexing behavior is likely different due to the October software update. — xaosflux Talk 20:49, 23 November 2016 (UTC)
- Opabinia regalis FYI - I used google webmaster tools to submit a request to crawl that page now - and now it is the #2 search result: google-result-here. — xaosflux Talk 20:53, 23 November 2016 (UTC)
- p.s. google has massive caches - I got it to show me that result, but when reloading it's not up - their index will take a little time to replicate. — xaosflux Talk 20:57, 23 November 2016 (UTC)
- Ahhh, if google is directly following the new pages feed then that would make sense. Thanks, I see it now! (The other two articles are MW polyomavirus and STL polyomavirus, which were autopatrolled and indexed right away.) Opabinia regalis (talk) 21:26, 23 November 2016 (UTC)
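For anyone who wants to repeat the "pulled the source" check described in this thread, here is a minimal Python sketch that looks for the robots meta tag in a page's HTML. The class and function names are made up for illustration, and the sketch only inspects an HTML string that you pass in; fetching the page is left out, and it does not look at anything other than the meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of MediaWiki.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags (<meta ... />) here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_noindexed(html):
    """True if the page asks crawlers not to index it via the meta tag."""
    return "noindex" in robots_directives(html)


# The tag quoted in the discussion, wrapped in a minimal page.
sample = (
    "<html><head>"
    '<meta name="robots" content="noindex,nofollow"/>'
    "</head><body></body></html>"
)
print(sorted(robots_directives(sample)))  # ['nofollow', 'noindex']
print(is_noindexed(sample))               # True
```

A page without the tag simply yields an empty set, which matches what was reported for the patrolled article (no noindex flag in the source).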
DuckDuckGo Zero-click box showing content of no-index pages
FYI: [1]. Tigraan (Click here to contact me) 17:50, 4 January 2017 (UTC)
Indexing in user space
I have raised a question here about permitting particular pages in user space to be indexed: Noyster (talk), 20:50, 28 February 2017 (UTC)
On userpages
Suppose we were to put in https://en.wikipedia.org/robots.txt
Disallow: /wiki/User:
Would that serve as a deterrent to people's putting their resumes, etc. in userspace? St. claires fire (talk) 22:29, 12 April 2017 (UTC)
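One way to see what such a rule would and would not match is Python's standard urllib.robotparser. This is only a sketch: the "User-agent: *" line is an assumption (the proposal above gives only the Disallow line), the page titles are hypothetical examples, and real crawlers may match paths somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The proposed rule, with an assumed "User-agent: *" line so that it
# forms a complete robots.txt record.
PROPOSED_RULES = """\
User-agent: *
Disallow: /wiki/User:
"""


def build_parser(rules_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def may_crawl(parser, url, agent="*"):
    """True if the given user agent is allowed to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(PROPOSED_RULES)

# Hypothetical page titles, used only to exercise the rule.
for url in (
    "https://en.wikipedia.org/wiki/User:Example",
    "https://en.wikipedia.org/wiki/User:Example/Resume",
    "https://en.wikipedia.org/wiki/User_talk:Example",
    "https://en.wikipedia.org/wiki/Main_Page",
):
    print(url, may_crawl(parser, url))
```

Under this parser the rule acts as a path prefix, so user pages and their subpages are disallowed while User_talk: pages and articles are still allowed.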