This page is currently inactive and is retained for historical reference. Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.
Bug 275 (since fixed) caused the accidental duplication of entire sections of some articles. This page is an attempt to locate and fix all instances of this problem.
A script was run on an offline copy of the database. First, it isolated all pages with duplicate headers. Then, it sliced each remaining page into three-word "chains" or "triplets" and counted how many of these chains appeared more than once. The percentage of repeated chains is reported for each article. A high percentage is a good indication that duplication has occurred.
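The two steps described above can be sketched roughly as follows. This is not the original script, only a minimal illustration under stated assumptions: the function names `has_duplicate_headers` and `triplet_stats` are hypothetical, headers are assumed to use `== Title ==` wikitext syntax and are compared case-insensitively, triplets are taken as overlapping windows of whitespace-separated words, and every occurrence of a triplet that appears more than once is counted as repeated. The original script may have made different choices on any of these points.

```python
import re
from collections import Counter

# Assumption: section headers look like "== Title ==" (any number of "=").
HEADER_RE = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$", re.MULTILINE)


def has_duplicate_headers(wikitext):
    """Return True if any header title occurs more than once on the page."""
    titles = [m.group(2).lower() for m in HEADER_RE.finditer(wikitext)]
    return len(titles) != len(set(titles))


def triplet_stats(text):
    """Return (repeated, total, percent) for three-word chains in text.

    Assumption: chains are overlapping windows of three consecutive words,
    and each occurrence of a chain seen more than once counts as repeated.
    """
    words = text.split()
    triplets = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(triplets)
    repeated = sum(1 for t in triplets if counts[t] > 1)
    total = len(triplets)
    percent = 100.0 * repeated / total if total else 0.0
    return repeated, total, percent
```

A page would be passed to `triplet_stats` only if `has_duplicate_headers` returns true for it, mirroring the order of the steps described above.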
This list was produced from the June 26, 2005 database dump, so many such instances have probably already been fixed. (You can use the history feature to check whether duplication actually did occur.) The following pages still need to be checked. We're not sure what a good percentage cutoff is, so start at the top and work your way down. Please strike through fixed pages and underline false positives, so we can determine whether the detection algorithm is working well and when we should stop checking. We've also included a section sorted by absolute number of triplets repeated, in case there are long pages with small duplications. Thanks for your help!
Note that the listing for under 300 triplets is still available if it is needed, though if there are a high number of false positives, it may be more efficient to wait for a new database dump. --Beland 06:04, 1 August 2005 (UTC)
It might be a useful metric to figure out what the length of the longest duplicated section is, though this will clearly be more computationally complex. Also, small differences in duplicated sections may appear over time. Another red flag might be if a repeated section starts with a header, though again, this may be corrupted over time. --Beland 06:09, 1 August 2005 (UTC)
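One way the "longest duplicated section" metric suggested above might be computed is sketched below. This is only an illustration, not something that was run for this list: the function name `longest_repeated_run` is hypothetical, and the approach (a quadratic-time dynamic programme over words that finds the longest run occurring twice without overlapping itself) requires exact matches, so it would miss duplicated sections that have drifted apart through later edits.

```python
def longest_repeated_run(words):
    """Return the longest run of words that occurs twice without overlap.

    Runs in O(n^2) time and O(n) memory for a page of n words.
    """
    n = len(words)
    best_len, best_end = 0, 0
    prev = [0] * (n + 1)  # match lengths for the previous value of i
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if words[i - 1] == words[j - 1]:
                # Extend the match ending at (i-1, j-1); cap the length at
                # j - i so the two copies never overlap each other.
                cur[j] = min(prev[j - 1] + 1, j - i)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return words[best_end - best_len:best_end]
```

For a page's text, `len(longest_repeated_run(text.split()))` would give the length in words of the longest exactly duplicated passage.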
User:Seth_Ilys/NCGA_Template 27% repeated - 278 out of 997 triplets (deliberate duplication of text for Senate and House)
Textgestaltung 27% repeated - 201 out of 727 triplets (copy of the German Wikipedia's text formatting page; lots of duplicate content, but intentional)
Boze_Pravde 25% repeated - 230 out of 910 triplets (two versions of one song, slightly different lyrics)
Bethel,_Connecticut 25% repeated - 234 out of 917 triplets (similar but not identical demographics section from a slightly different article, not sure why it's there)