As a history buff I've been visiting Wikipedia for years. Before long I developed an interest in the
Timeline-articles, especially the
year pages (like
1492) and the
Days of the year pages (short: DOY pages) like
December 24.
Days of the year pages all contain a
Births- and
Deaths-section, listing links to people's bio pages in chronological order. I noticed, however, that (pre-)medieval entries (before 1500 AD) were vastly underrepresented in DOY-pages. It seemed like no one was born or died on a specific date
before the 16th century:
I also noticed that, compared to the DOY-pages, the Year pages listed far more persons in their Births and Deaths sections, many of them stating the exact date of their birth/demise.
A quick scan confirmed this: a lot of persons with a
biography stated in a year page are not present in the corresponding Days of the year page.
The reason for this is probably that it is less obvious for biographers to add an entry to the corresponding DOY-page than to the Year-page
[1]. Once an entry is omitted by its author, chances are small that someone else will add the link to a DOY page manually, since doing so is horrendously tedious.
Towards December 2016 I got an idea: shouldn't it be possible to (semi-)automate
cross-referencing these timeline types? An added advantage would be that, whenever a missing link was spotted, the text to insert into the DOY page could be generated from the entry in the Year page (whose format differs only slightly from that of the DOY page).
A potentially big issue, however, is that the text within the Births and Deaths sections is unstructured. Luckily I noticed that the level of standardization within the pages and sections is quite high: I ran an automated check on the years 500 BC to 1550 AD and only had to change a few dozen pages, fixing missing sections or re-applying template standardization. I was quite astonished to encounter so few text structure errors, given that editing wiki pages is open to anybody. Unfortunately this proved not to be the case for the actual content (see
results and
statistics).
In the weeks that followed I created and improved a
VBA-powered
MS Excel-
application that implemented the envisioned functionality.
When clicking the button 'Check year' in the Excel-file, a specific Year page is analysed and used as a starting point to look for missing entries in matching DOY pages.
Per section (Births and Deaths) the general algorithm looks like this:
After a specific year is processed the results may look like this:
The results are shown in the sheet per section; Births on the left and Deaths on the right.
Each section lists all persons for whom the Year page states an exact date and for whom a bio exists.
Per person the following information is displayed:
When the application has determined that the date in the Year page and the date in the bio page are identical, the "Text to add to section" cell gets a green background colour. When a person is already listed in the matching DOY page, 'Name exists?' is TRUE and the text to add is '-'. Otherwise the text to copy into the matching DOY page is generated from the entry in the Year page.
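The generation of the DOY text from a Year page entry can be sketched roughly as follows. This is a minimal Python illustration, not the actual VBA code: the function name `year_entry_to_doy_entry`, the regular expression, and the assumed entry layouts (`* [[Month day]] – description` in a Year page, `* [[year]] – description` in a DOY page) are my own assumptions, and "Jane Example" is a made-up person.

```python
import re

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}

# Assumed Year-page layout: "* [[March 27]] – rest of the entry".
# The wiki-link brackets around the date are optional; an en dash or a
# plain hyphen is accepted as the separator.
ENTRY_RE = re.compile(
    r"^\*\s*(?:\[\[)?(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2})(?:\]\])?"
    r"\s*[–-]\s*(?P<rest>.+)$"
)


def year_entry_to_doy_entry(line, year):
    """Return (doy_page_title, doy_entry), or None if no exact date is found."""
    match = ENTRY_RE.match(line.strip())
    if match is None or match.group("month") not in MONTHS:
        return None
    page_title = f"{match.group('month')} {int(match.group('day'))}"
    # Assumed DOY layout: the year replaces the date at the start of the entry.
    doy_entry = f"* [[{year}]] – {match.group('rest')}"
    return page_title, doy_entry


# Hypothetical entry, for illustration only.
print(year_entry_to_doy_entry(
    "* [[March 27]] – [[Jane Example]], hypothetical scholar (d. 1559)", 1492))
```

Entries without an exact date (for example a line that starts directly with the person's name) return `None` and would simply be skipped.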
Quite a few notable links are missing: in the Births section of Year page 1492, 21 persons with exact birth dates are listed, but only 8 of them are present in the matching DOY pages
[2].
Another thing that stands out is the number of erroneous entries: not all "Text to add" cells are green. If a discrepancy is found between the date stated in the Year page and the one in the matching biography, the cell is given a red background colour instead.
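The cell logic described above (green or red, and '-' for persons that are already listed) could be expressed like this. It is a hedged Python sketch rather than the VBA implementation: the `Candidate` class, its field names and the `classify` function are hypothetical, and treating a bio without a detectable date as a discrepancy is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Candidate:
    name: str
    year_page_date: str       # date as stated in the Year page, e.g. "March 27"
    bio_date: Optional[str]   # date found in the bio; None if none was found
    listed_in_doy: bool       # the 'Name exists?' column
    generated_text: str       # entry text generated from the Year page


def classify(candidate: Candidate) -> Tuple[str, str]:
    """Return (text_to_add, cell_colour) for one result row."""
    # Persons already present in the DOY page need no new entry.
    text = "-" if candidate.listed_in_doy else candidate.generated_text
    # Assumption: a missing bio date counts as a discrepancy (red).
    same_date = candidate.bio_date == candidate.year_page_date
    return text, ("green" if same_date else "red")


# The Adam Ries case from the 1492 Births section; listed_in_doy=False and
# the generated text are placeholders for illustration.
ries = Candidate("Adam Ries", "March 27", "January 17", False,
                 "* [[1492]] – [[Adam Ries]]")
print(classify(ries))
```

A red row only signals that something needs a human look; it does not say which of the two pages is wrong.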
As it turned out, quite a few types of errors existed that needed to be fixed before I could add missing entries:
If such an error is detected, further investigation is required: what is the source of the error, the Year page or the bio? Or do I need to tweak my VBA code?
For instance: take a look at Births; person:
Adam Ries. The
Year page states March 27 as the date of birth, whereas the matching
bio states January 17. Further investigation must make clear which page needs which correction. As a consequence I had to correct numerous Year and bio pages (which I don't mind, being a
Wiki Gnome).
After addressing the errors of a specific year I could finally do what was the initial goal of the project: insert links to bios that are missing in
WP:DAYS. Per year the generated text of the entries is copied manually from Excel to the correct location within the section of the DOY page.
So far (today is 26 June 2017) I checked all the years between 500 BC and 1625 AD this way. I added a few thousand entries during the process, in some cases adding 10+ (pre-)medieval entries to a DOY.
However, not every person should be added to a DOY page: entries are required to be
'sufficiently globally notable'. All my insertions are swiftly validated by the Wiki-community, especially by
Rms125a@hotmail.com. Again, thanks for all your hard work!
Going through the centuries I noted a significant increase in data from the 14th century onwards. Until 1350 every missing link to a referenced bio was added. From 1350 onwards the missing entries detected by the Excel application were subject to more scrutiny. Apparently there's a lot of crap being added. Apart from notability I decided to also look at the size of the bio involved as an indication for possible insertion into a DOY page. I am fully aware of all the ongoing discussions around entry notability. However, because of the sheer number of missing entries I had to come up with some criterion to quickly sift through the missing links found. I found that article size is a good indication of notability. The following table shows, per century, the minimum number of characters (based on the raw content returned by the HTTP request) an article usually needs to be eligible for insertion:
Century | Minimum number of characters |
---|---|
Before 14th (<1300) | 0 |
14th (1300-1399) | 2000 |
15th (1400-1499) | 5000 |
16th and later (1500 onwards) | 8000 |
Keep in mind that these limits are quite flexible. For instance, some articles on medieval poets quote entire poems, which inflates the character count and makes article size a much weaker indicator of relevance and notability. Other bios contain all kinds of invisible content, such as an infobox with many empty properties and/or numerous wiki categories, with the same result.
On the other hand I learned that, judging by article size, the minimum number of characters needed to compose a relevant wiki bio, regardless of the period it covers, is around 8,000.
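As a first-pass filter, the size thresholds from the table can be written down as a small lookup. This is only a Python sketch with hypothetical function names; it assumes that BC years are passed as negative numbers and that the year 1500 itself falls in the last row. Given how flexible the limits are, the result is a hint for manual review, not a verdict on notability.

```python
def min_chars_for_year(year):
    """Minimum raw article size (in characters) for a person from this year."""
    if year < 1300:
        return 0       # before the 14th century: every referenced bio qualifies
    if year < 1400:
        return 2000    # 14th century
    if year < 1500:
        return 5000    # 15th century
    return 8000        # assumption: 1500 and later


def passes_size_check(year, article_size):
    """True if the article is large enough to be considered for a DOY page."""
    return article_size >= min_chars_for_year(year)


print(passes_size_check(1350, 2500))   # 14th century, above the 2000 limit
print(passes_size_check(1492, 4200))   # 15th century, below the 5000 limit
```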
Of course there are some other criteria for a bio to be notable or not:[ongoing]
In the end bio size doesn't seem to matter in order to be admitted to a DOY-page. Just take a look at this table (more summaries can be found in the archive at the top).
Agreed, 'milestone' is a terrible term concocted by management ;). Anyway:
The charts below, one per section, clearly show the progress that has been made since the start of this endeavour [8]. Also note that many existing erroneous links were removed from the DOY pages during the period December 2016 – June 2017.
During this project I also compiled some other statistics. The following excerpt of the output is self-explanatory. Pay special attention to the number/fraction of discrepancies per processed year. Luckily the fraction of erroneous entries dropped sharply after 1600. I never found out why.