"I realised that deaths lists lend themselves to citation generation, perhaps uniquely so."
During my private project Chaining back the Years I noticed that a lot of articles listing the deceased per month ('dpms') lacked references. These references should state the date (and cause) of death of those listed (the 'entries').
In early 2020 I realised that deaths lists lend themselves to citation generation, perhaps uniquely so. I was already aware of the excellent Archive API of The New York Times, so I experimented with some code that would ultimately become the WikipediaReferences application. This article describes its evolution and general algorithms.
The .NET Core 3.1 application consists of three projects:
To avoid scaring you, the reader, away, I will not dwell too much on the technical details of the solution, or on why the console application ended up with far too many responsibilities; I will focus on its functions instead.
The most used functionality of the app is the 'Print month of death' option. When selected, it first resolves which dpm should be handled (see screenshot). It then generates a file containing wikitext in which NYT citations have been added to the corresponding entries. In the file, existing references are also sometimes replaced by NYT refs (more on that later). When the file contents are pasted into a dpm, the article is updated accordingly.
But how was this accomplished?
In the end, two sources of data need to be connected: the NYTimes data (more specifically, the obituaries) on the one hand, and the Wikipedia biographies ('bios') on the other. I could only generate a reference if I could find an NYT obituary stating the date of death of someone who has a bio on Wikipedia.
Several steps were necessary to accomplish this:
The initial step was to populate a database with NYT obituary data. [1] The Archive API returns an array of all NYT articles for a given month. [2] The response is in JSON and can be quite large (~20 MB). This is structured data. A typical article (or 'Doc') contains the following properties and metadata:
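The retrieval step can be sketched as follows. This is not the app's actual code; the class and method names are mine, though the URL scheme matches the Archive API's documented month endpoint:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical sketch of fetching one month of NYT articles from the Archive API.
// The class and method names are assumptions, not the app's real code.
public static class ArchiveClient
{
    private static readonly HttpClient Http = new HttpClient();

    // e.g. year 1997, month 9 -> https://api.nytimes.com/svc/archive/v1/1997/9.json?api-key=...
    public static string BuildMonthUrl(int year, int month, string apiKey) =>
        $"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apiKey}";

    // Returns the raw JSON payload (~20 MB for a busy month).
    public static Task<string> GetMonthJsonAsync(int year, int month, string apiKey) =>
        Http.GetStringAsync(BuildMonthUrl(year, month, apiKey));
}
```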
The first task was to retrieve the obituaries from this huge array of Docs. [3] Only entries with a corresponding biography page on the English Wikipedia are listed, so before the obituary data could be added I had to check that the deceased was on Wikipedia. I would retrieve the name of the obituary's subject from the Doc's metadata and perform the check. [4] The issue was that the NYT subject's name and the Wikipedia article name could differ, and it would be a shame to miss a lot of citations because of that. So, depending on the name, the application would check up to four name variations. For instance, the name James D. Hardy Jr. would be checked using these variations:
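Generating such variations could look like this. This is a sketch; the helper and the exact trimming rules are my assumptions, not the app's real logic:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: generate up to four Wikipedia lookup variations of an
// NYT subject name such as "James D. Hardy Jr.". The exact rules are assumed.
public static class NameVariations
{
    public static List<string> Get(string nytName)
    {
        var variations = new List<string> { nytName };

        // Without the trailing suffix dot: "James D. Hardy Jr"
        if (nytName.EndsWith("."))
            variations.Add(nytName.TrimEnd('.'));

        // Without the "Jr."/"Sr." suffix: "James D. Hardy"
        foreach (var suffix in new[] { " Jr.", " Jr", " Sr.", " Sr" })
            if (nytName.EndsWith(suffix))
                variations.Add(nytName.Substring(0, nytName.Length - suffix.Length));

        // Without middle initials: "James Hardy Jr."
        var parts = nytName.Split(' ').Where(p => !(p.Length == 2 && p[1] == '.')).ToArray();
        variations.Add(string.Join(" ", parts));

        return variations.Distinct().ToList();
    }
}
```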
Sometimes resolving an NYT subject would lead to a disambiguation page like this one. Luckily these pages are highly standardized in format, which enabled the software to look for the entry on the disambiguation page whose stated year of death matched the month I was processing. So if I was adding January 1995 to the references database, then George Price would be resolved as George Price (cartoonist).
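That year-matching trick could be sketched like this; the '(1901–1995)' life-span format is an assumption about how disambiguation entries are written, and the class and method names are mine:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch: pick the disambiguation entry whose stated year of
// death matches the month being processed (e.g. January 1995 -> 1995).
public static class Disambiguation
{
    // Matches a life span such as "(1901–1995)" or "(1901-1995)".
    private static readonly Regex LifeSpan = new Regex(@"\((\d{4})[–-](\d{4})\)");

    public static string ResolveByDeathYear(IEnumerable<string> entries, int deathYear)
    {
        return entries.FirstOrDefault(e =>
        {
            var m = LifeSpan.Match(e);
            return m.Success && int.Parse(m.Groups[2].Value) == deathYear;
        });
    }
}
```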
During this initial phase I also discovered some bugs in the Archive API itself, which I reported to the NYTimes staff and which have since been fixed. On July 8 the software finally produced something I could work with.
The trickiest bit, however, was resolving the actual death date stated in the obituary. One would expect this fact to be present in the Doc's metadata, but alas. Thankfully, in most cases the properties lead_paragraph and abstract contained enough information to determine the person's date of demise. The easiest case, of course, was when the death date was stated in the first paragraph:
"Ann Dunnigan, an actress and translator, died on Sept. 5 at her home in Manhattan. She was 87."
[5]
Other examples:
However, in most cases the date of passing had to be deduced from the obituary, using the publication date as a reference point. There were two flavors:
1. Day names mentioned in the lead:
After I implemented the first version of the deduction algorithm I still couldn't resolve quite a few dates; see 'bucket section' 31/12/9999 in September 1997. Analysis of the bucket results pointed to the remaining indicators:
I also had to figure out which words in combination with date indicators would yield the best results. After much experimenting I came up with this regular expression:
Regex regex = new Regex(" (?:died|dead|killed) .{0,60}" + dayIndicator);
By 11 July 2020 I could resolve most dates. I drew the line at trying to crack these:
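The deduction itself amounts to walking back from the publication date to the most recent occurrence of the weekday named in the lead. A minimal sketch (the class and method names are mine, not the app's):

```csharp
using System;

// Hypothetical sketch: deduce a death date from a weekday named in the lead
// ("... died Saturday ...") by walking back from the publication date to the
// most recent occurrence of that weekday.
public static class DeathDateDeduction
{
    public static DateTime FromDayName(DateTime publicationDate, DayOfWeek dayName)
    {
        // The obituary is published after the death, so start one day back.
        var date = publicationDate.AddDays(-1);
        while (date.DayOfWeek != dayName)
            date = date.AddDays(-1);
        return date;
    }
}
```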
With our obituary data securely stored in the database, we can now use it to compile citations when generating the wikicode for a dpm. The general process can be divided into these tasks:
After the dpm had been parsed into entries as software objects the tool would validate its data:
For each day of the processed month, the app checks whether NYT obituary data exists for the existing entries. If so, the reference wikitext is generated. It will be used as a citation in the following two situations.
Note: if NYT obituary data is found for a bio which is not present in the dpm AND the app evaluates it as sufficiently notable, [8] then a warning is displayed urging the user (me) to consider adding the bio as an entry to the dpm.
Code excerpt:
private void EvaluateDeathsPerMonthArticle(int year, int monthId, IEnumerable<Entry> entries)
{
    UI.Console.WriteLine("Evaluating the entries...");
    var references = GetReferencesPermonth(year, monthId);

    for (int day = 1; day <= DateTime.DaysInMonth(year, monthId); day++)
    {
        UI.Console.WriteLine($"\r\nChecking nyt ref. date {new DateTime(year, monthId, day).ToShortDateString()}");
        IEnumerable<Reference> referencesPerDay = references.Where(r => r.DeathDate.Day == day);

        foreach (var reference in referencesPerDay)
            HandleReference(reference, entries);
    }
}
In the early stages the output generated by the app was quite crude. When the wikitext was pasted into my NYT references page it looked like this. The listed items still needed to be matched manually and the references pasted by hand into the processing page. This situation persisted until 4 October 2020.
By that time I had become so fed up with this manual work that I spent time automating the wikicode generation. By November 14 the work had progressed considerably. I could now copy the wikitext output by the app to a text file and paste the text into the dpm. After saving the edits, the generated citations would be added. This also signalled the start of Round 2.
In the course of this endeavour I had been adding numerous citations to entries manually. And although the Wikipedia:RefToolbar is a great help, I got increasingly frustrated with tediously filling in its fields. Since sportspeople die too, I had been using the sub-sites of Sports Reference extensively as citation sources. I had already noticed that the layout and HTML of those sites were very similar. Another time-saving idea popped into my head: why not extend the application so that it could also generate references other than NYTimes ones?
I delved into website data extraction and settled on the Html Agility Pack as the weapon of choice. After some experimenting I was able to grab the citation data of the following sites by entering only the person's id in the console app:
As you can see in the image, the generated wikitext is displayed in green in the console. The text could now be copied and pasted as a citation. It worked so well that in time I added three other sources for references:
Since the data was already present, I also added an option to generate specific NYTimes references.
Code excerpt regarding Olympedia.org:
public void GenerateOlympediaReference()
{
    string url = GetReferenceUrl("http://www.olympedia.org/athletes/", "Olympedia Id: (f.i.: 73711)");
    var rootNode = GetHtmlDocRootNode(url);

    var table = rootNode.Descendants(0).First(n => n.HasClass("biodata"))
        .Descendants("tr")
        .Select(tr =>
        {
            var key = tr.Elements("th").Select(td => td.InnerText).First();
            var value = tr.Elements("td").Select(td => td.InnerText).First();
            return new KeyValuePair<string, string>(key, value);
        })
        .ToList();

    string usedName = table.First(kvp => kvp.Key == "Used name").Value;

    var reference = GenerateWebReference($"Olympedia – {usedName}", url, "olympedia.org",
        DateTime.Today, DateTime.MinValue, publisher: "[[OlyMADMen]]");

    UI.Console.WriteLine(ConsoleColor.Green, reference);
}
Although I still have to create refs manually, this final piece of functionality expedited citation generation to a level that was acceptable to me.
return articleDocs.Where(d => d.type_of_material.Contains("Obituary")).AsEnumerable().OrderBy(d => d.pub_date);