Computing desk | ||
---|---|---|
< December 17 | << Nov | December | Jan >> | December 19 > |
Welcome to the Wikipedia Computing Reference Desk Archives |
---|
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages. |
Given an ancient scanned book, for example De re anatomica libri XV, physical leaf number 10 corresponds to printed page number 3. Thus https://archive.org/details/BIUSante_08734/page/n10 ("n10" is the physical leaf number) shows printed page #3. If we run OCR on the book and build a table mapping physical leaf numbers to printed page numbers, we get the following for leaves 0 through 27:
Page[0].ppagei = 0
Page[1].ppagei = 0
Page[2].ppagei = 0
Page[3].ppagei = 0
Page[4].ppagei = 0
Page[5].ppagei = 0
Page[6].ppagei = 0
Page[7].ppagei = 0
Page[8].ppagei = 0
Page[9].ppagei = 0
Page[10].ppagei = 0
Page[11].ppagei = 4
Page[12].ppagei = 5
Page[13].ppagei = 0
Page[14].ppagei = 7
Page[15].ppagei = 8
Page[16].ppagei = 5
Page[17].ppagei = 0
Page[18].ppagei = 0
Page[19].ppagei = 0
Page[20].ppagei = 13
Page[21].ppagei = 14
Page[22].ppagei = 0
Page[23].ppagei = 16
Page[24].ppagei = 17
Page[25].ppagei = 0
Page[26].ppagei = 0
Page[27].ppagei = 20
Due to OCR errors, some of the printed page numbers can't be determined (" = 0") and some are wrong ("Page[16].ppagei = 5"). Is there a suggested method or algorithm for discovering runs of sequential numbers, and from those runs filling in the blank or incorrect pages? This is a general question for many scanned books, not just this example. -- Green C 16:24, 18 December 2019 (UTC)
I expect they are sequential, and Page[16] = 5 is an OCR error where a "9" was read as a "5". 173.228.123.190 ( talk) 11:29, 21 December 2019 (UTC)
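If the pages really are sequential, one possible way to act on that is to let every successfully read page number "vote" for an offset (leaf number minus printed page number) and then rebuild the whole table from the winning offset. The following is only a minimal sketch under that assumption, not an established algorithm from this thread: the function name `infer_printed_pages` is made up for illustration, and it assumes a single constant offset for the whole book, which will not hold for books with unnumbered plates or separately numbered front matter (those would need the same voting applied per run of leaves).

```python
from collections import Counter

def infer_printed_pages(ocr_pages):
    """Fill in blank (0) or misread page numbers, assuming one constant offset.

    ocr_pages[leaf] is the printed page number OCR reported for that leaf,
    or 0 when none could be read. (Hypothetical helper, for illustration.)
    """
    # Each readable leaf votes for offset = leaf index - printed page number.
    votes = Counter(leaf - page for leaf, page in enumerate(ocr_pages) if page > 0)
    if not votes:
        return list(ocr_pages)  # nothing readable, nothing to infer
    offset, _ = votes.most_common(1)[0]
    # Rebuild every leaf from the winning offset; leaves that would fall
    # before printed page 1 stay 0 (unnumbered).
    return [max(leaf - offset, 0) for leaf in range(len(ocr_pages))]

# The OCR table from the question, leaves 0 through 27.
ocr = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 5, 0, 7, 8, 5,
       0, 0, 0, 13, 14, 0, 16, 17, 0, 0, 20]
fixed = infer_printed_pages(ocr)
print(fixed[10], fixed[16], fixed[27])
```

With the table above, nine of the ten readable entries agree on an offset of 7, so the outlier at leaf 16 is overruled and becomes 9, and leaf 10 comes out as printed page 3, matching the example in the question. Note that the sketch also numbers leaves 8 and 9 as pages 1 and 2, which may or may not be right for a given book.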