Wikipedia:Proposed tools/Cvcheck


Problem

As far as I know, there are currently only two WP tools available to check articles for text copied from the Web. Both have limitations.

User:CorenSearchBot runs as a background task on newly created articles. A particular article can also be run through it by adding its name to a queue; the bot states that it will process queued articles when it has a free moment. Its major limitation is that, because it is an automated task, it can't search Google or Google Books.

User:The Earwig's tool [1] is manually invoked. It searches Google, but not Google Books, and so would not have caught the material that caused the recent flap [2]. (I don't know whether CSbot would have caught it either.) Its author is a student who has said they won't have time to improve its algorithm. It doesn't create permanent output (I realize that might pose a maintenance problem). I'm not sure, but from looking at the code [3] I think that once it finds a match, it adds that URL to an exclusion list. If so, the person trying to clean up the article would have to keep manually comparing the rest of the website against the article; it would be much more efficient to see every match, as in the sketch below.
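
To illustrate the "report every match" behavior, here is a minimal Python sketch (Python because Earwig's tool is written in it). The search_web function is a hypothetical placeholder for whatever search backend the tool would use, not a real API.

    def find_all_matches(sentences, search_web):
        """Return every (sentence, url) pair found, rather than dropping
        a URL from the results after its first hit."""
        matches = []
        for sentence in sentences:
            for url in search_web(sentence):
                # Deliberately no exclusion list: the same URL may match
                # many sentences, and the cleaner should see all of them.
                matches.append((sentence, url))
        return matches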

Requirements

Check article sentences to see whether they were copied verbatim, or close to verbatim, from websites (excluding known WP mirrors and public-domain sources) and from books in Google Books. For each match, output the article section title, the matching sentence or a good-sized sentence fragment, and the URL. Optional but useful: a second pass with checkboxes that would allow the user to exclude some of the matched websites, because even if the usual WP mirrors are automatically excluded, one often sees random sites that have scraped WP. A rough sketch of the output record and second pass follows.
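
As a concrete illustration of these requirements, here is a rough Python sketch. Everything in it — the Match record, the KNOWN_MIRRORS set, both pass functions — is hypothetical, written under the assumption that an earlier step has already produced candidate matches.

    from collections import namedtuple
    from urllib.parse import urlparse

    # One output row per match: article section title, the matching
    # sentence (or a good-sized fragment), and the URL it was found at.
    Match = namedtuple("Match", ["section", "fragment", "url"])

    # Illustrative only; a real tool would maintain a proper mirror list.
    KNOWN_MIRRORS = {"answers.com", "thefreedictionary.com"}

    def first_pass(matches):
        """Automatically drop matches hosted on known WP mirrors."""
        return [m for m in matches
                if urlparse(m.url).netloc not in KNOWN_MIRRORS]

    def second_pass(matches, user_excluded_sites):
        """Re-filter with the sites the user ticked as WP scrapers."""
        return [m for m in matches
                if urlparse(m.url).netloc not in user_excluded_sites]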

While it's under development, and perhaps after it goes live too, dump its search strings out somewhere (a minimal sketch follows); we could then examine why it didn't find a match where we would have expected one and think of ways to further improve the algorithm. Novickas (talk) 15:23, 5 November 2010 (UTC)
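
A minimal sketch of such a query dump, assuming a plain tab-separated text log; the file name and format are invented for illustration.

    def dump_search_strings(article_title, search_strings,
                            path="cvcheck_queries.log"):
        """Append each query sent to the search backend to a log file,
        so that missed matches can be diagnosed later."""
        with open(path, "a", encoding="utf-8") as log:
            for query in search_strings:
                log.write(f"{article_title}\t{query}\n")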

Interface design

Console-based, like Earwig's tool.
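
For concreteness, here is a sketch of what the console interface might accept, using Python's argparse; every flag name is invented for illustration, not settled design.

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(
            prog="cvcheck",
            description="Check an article for text copied from the web "
                        "or from Google Books.")
        parser.add_argument("title", help="article title to check")
        parser.add_argument("--exclude", action="append", default=[],
                            help="site to exclude from results (repeatable)")
        parser.add_argument("--dump-queries", action="store_true",
                            help="also log the generated search strings")
        return parser.parse_args()

An invocation might then look like: cvcheck "Article title" --exclude scraper.example.com --dump-queries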

List of interested developers

High-level architecture

(to be filled in by developers; what components will the tool have, and how will they interact?)

Implementation details

(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)

Progress

(as the tool is developed, describe here how far along it is and what problems are being encountered)