This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
To-do list for Search engine indexing:
The goal is to provide an authoritative resource on the architecture, behavior, major processes and challenges in search engine indexing. This should be described for the general audience of the web. Please refrain from adding commercial references to support a NPOV.
This comes from http://blog.searchenginewatch.com/blog/041111-084221, so I have not included it in the article yet, as I do not want to do anything illegal, and I am not sure this is the best reference. The goal is to show the number of pages indexed, at least at some point in time, to help get a feel for the scale. Understanding real-world index sizes is important for appreciating the technological challenge and the rationale behind the intense research into compression, forms of indexing, and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)
Search Engine | Reported Size | Page Depth |
Google | 8.1 billion | 101K |
MSN | 5.0 billion | 150K |
Yahoo | 4.2 billion (estimate) | 500K |
Ask Jeeves | 2.5 billion | 101K+ |
Wikipedia's search option is powered by a search engine that involves an indexing process. The search engine technology is a program called Lucene (wiki), which is built into MediaWiki, the supporting software that Wikipedia uses.
Some of these same concepts are employed by Wikipedia, and you can learn more about them by researching MediaWiki and the architecture of Lucene.
Wikipedia articles are stored in a database, which serves as the corpus. For indexing, primarily only the wikitext part of the article is considered; the other metadata is mostly ignored. The wikitext is periodically indexed by the Lucene indexer, and this index is used when you click the search button on Wikipedia. From the wiki help page "Article does not appear when searching": the database for searching is updated approximately every 4-6 weeks, so changes do not appear there right away. Josh Froelich 02:34, 19 December 2006 (UTC)
I am considering adding an example using Wikipedia itself to illustrate the innards of search engine indexing. For the Wikipedia Lucene example, looking at the SVN source code on MediaWiki's website, we can see that:
Josh Froelich 20:21, 7 December 2006 (UTC)
Thinking about adding a future reading section or incorporating these sources into the article. Josh Froelich 22:38, 20 December 2006 (UTC)
Some search engines identify sentence and paragraph boundaries in natural language and record, for each token, its containing sentence and its position within that sentence, referred to as the sentence offset. This is a natural language processing task. Some search engines also attempt to automatically identify the part of speech of each token within its containing sentence. Stemming may be performed at this point, which essentially involves stripping the affixes or suffixes of the token.

Many search engines store the position of the token within the overall document. One accepted definition of token position is the count of characters preceding the occurrence of the word (citation needed). Position is sometimes referred to as the offset. Because the first token in a document begins at the first character, its position is 0, whereas a second word beginning 5 characters later would have a recorded position of 5. An alternative definition of token position is the count of preceding tokens (citation needed): the first token is at position 0, the second token at position 1, and so on. Suppose a document consists solely of the sentence "It's a dog eat dog world". The token array (a list-like data structure) might look like the following:
Token (Word) | Position |
It's | 0 |
a | 5 |
dog | 7 |
eat | 11 |
dog | 15 |
world | 19 |
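The character-offset tokenization above can be sketched in Python. This is a hypothetical illustration of the technique, not the implementation of any particular engine:

```python
import re

def tokenize_with_offsets(text):
    """Split text into word tokens, recording each token's position as the
    count of characters preceding it (a 0-based character offset)."""
    # \S+ treats any run of non-whitespace characters as one token,
    # so "It's" stays a single token.
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

tokenize_with_offsets("It's a dog eat dog world")
# [("It's", 0), ('a', 5), ('dog', 7), ('eat', 11), ('dog', 15), ('world', 19)]
```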
Storing the token 'dog' repeatedly is a redundancy, and storing a full list like this for every document in a large corpus is not feasible. Instead, many search engine designs incorporate a token hash, which assigns a unique identifying value to each word. This word hash is a separately managed data structure commonly referred to as a lexicon. The following table displays what the lexicon looks like:
Word Id | Word |
1 | a |
2 | dog |
3 | eat |
4 | it's |
5 | world |
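Building such a lexicon can be sketched as follows, assuming lower-cased tokens and ids assigned in alphabetical order to match the table above:

```python
def build_lexicon(tokens):
    """Assign each distinct (lower-cased) word a unique integer id.
    Ids follow alphabetical order here to match the table above."""
    words = sorted({token.lower() for token in tokens})
    return {word: word_id for word_id, word in enumerate(words, start=1)}

build_lexicon(["It's", "a", "dog", "eat", "dog", "world"])
# {'a': 1, 'dog': 2, 'eat': 3, "it's": 4, 'world': 5}
```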
After hashing the words, the algorithm stores a list of each word's hash key (the id identifying the word) and its position (and potentially other characteristics, such as the part of speech). This listing consumes less memory given the smaller key size.
Word Id | Position |
4 | 0 |
1 | 5 |
2 | 7 |
3 | 11 |
2 | 15 |
5 | 19 |
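One way to produce this id/position listing is to combine a tokenizer with a lexicon. This is an illustrative sketch using 0-based character offsets and lower-cased tokens:

```python
import re

def forward_listing(text):
    """Return a list of (word_id, character_offset) pairs for the text."""
    tokens = [(m.group().lower(), m.start()) for m in re.finditer(r"\S+", text)]
    # Build the lexicon on the fly; a real engine would manage it separately.
    words = sorted({word for word, _ in tokens})
    lexicon = {word: word_id for word_id, word in enumerate(words, start=1)}
    return [(lexicon[word], pos) for word, pos in tokens]

forward_listing("It's a dog eat dog world")
# [(4, 0), (1, 5), (2, 7), (3, 11), (2, 15), (5, 19)]
```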
In the above table, word id 2 appears twice. This is a redundancy, which means there is room for optimization, so the list is conflated (aggregated) by id. The position of each token occurrence, commonly referred to as a 'hit' (citation needed), is stored as a list under the word's key. The result might look something like the following:
Token Id | Positions |
1 | 1,20,500,etc |
2 | 10,45,3445,etc |
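Conflating the listing by id can be sketched like this; each word id maps to the list of positions ('hits') at which it occurs:

```python
from collections import defaultdict

def invert(listing):
    """Group (word_id, position) pairs into one postings list per word id."""
    postings = defaultdict(list)
    for word_id, position in listing:
        postings[word_id].append(position)
    return dict(postings)

invert([(4, 0), (1, 5), (2, 7), (3, 11), (2, 15), (5, 19)])
# {4: [0], 1: [5], 2: [7, 15], 3: [11], 5: [19]}
```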
In addition to grouping words by their exact characters ('dog' matches 'dog' exactly), some search engines also group 'similar' words, that is, words which share something in common, such as grammatical form or meaning. Using linguistic analysis, stemming identifies the root lexeme of each word; if two words share the same lexeme, they are grouped together during conflation.
A stop list is a list of words (or numbers or other symbols), typically for a specific language, that should not be indexed. Alternatively, a stop list is the list of words that the search engine user cannot search with (and which, as such, need not always be indexed). At this point in the process, or earlier during initial tokenization or hashing (or simply by checking a property of the word in the lexicon), words may be ignored and removed so that they are never stored in the later data structures of the indexing process, such as the forward index.
Stop lists are also referred to as black lists, or ignore lists. Entries in the stop list are typically referred to as stop words or stopwords.
Stop lists typically contain function words, sometimes referred to as noise (see information theory, Claude Shannon). Common English stop words include 'the', 'a', 'an', 'and', 'of', 'to', and 'in'.
These words carry little semantic value for search. Removing them is common practice because their value ratio is low: the disk space and memory they consume in the index data structures, and the time spent processing them, outweigh how often users actually need to search for them. Removal reduces the disk size of the later data structures, which in turn improves processing speed.
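A stop-word filter is straightforward to sketch; the stop list below is a small illustrative sample, not a definitive list:

```python
# Illustrative sample stop list; real engines use larger, language-specific lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list before they reach the index."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

remove_stop_words(["The", "quick", "fox", "and", "the", "dog"])
# ['quick', 'fox', 'dog']
```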
A start list is a list of words or symbols that should be indexed. Some indexing processes only index words from a start list. Others provide a weighting method, where start words are treated differently, so that search results containing these words appear higher in the search results.
Lexicography is the practice of compiling dictionaries. Search engine designers may compile dictionaries, or one or more thesauri, and incorporate these into the indexing process to deliberately introduce human bias. For example, words found to be synonyms in a thesaurus may be indexed as a single entry.
In practice, document authors may use lexically distinct symbols to express a single concept. Since search engines pursue quality results, and a user enters the words that he or she considers representative of the concept, one goal of incorporating a dictionary is to improve the engine's accuracy in matching documents even when the literal words differ.
Certain search engines do not incorporate dictionaries, or involve them at a later point as part of query expansion.
The class of search engines and natural language processing software that employ latent semantic analysis use the term-document matrix data structure. This is similar in nature to the forward and inverted indices described above. The term-document matrix is a table in which every row is a document, every column is a word, and each intersection (data cell) contains a frequency count (or sometimes other information). The data structure looks like the following:
Document | the | dog |
Document 1 | 5 | 34243 |
Document 2 | 2 | 0 (or null) |
Queries are performed by locating the column for the word and obtaining the set of rows with non-null cells in that column.
Note that the term-document matrix is a sparse matrix, as not every document in the corpus contains every token. The forward and inverted indices are alternative representations of this same sparse matrix.
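A sparsely stored term-document matrix (keeping only non-zero cells) might be sketched as below, assuming simple whitespace tokenization; the query helper returns the documents whose row has a non-zero cell in the word's column:

```python
from collections import Counter

def term_document_matrix(documents):
    """Return {doc_id: {word: frequency}}, omitting zero-count cells."""
    return {doc_id: Counter(text.lower().split())
            for doc_id, text in documents.items()}

def query(matrix, word):
    """Return the ids of documents containing the word at least once."""
    return [doc_id for doc_id, row in matrix.items() if row.get(word)]

matrix = term_document_matrix({1: "the dog ate the bone", 2: "the cat"})
query(matrix, "dog")  # [1]
```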
This article would flow better if the sections were placed in the general order in which they are executed to create the inverted index (the final step). So something like:
With the smaller sections in between some of these. This would also involve changing the lead to include a statement of the goal of indexing: "Indexing aims to produce a map from unique keywords or terms to the list of documents containing that keyword. This data structure is referred to as an inverted keyword index."
Thoughts from anyone?
Drmadskills ( talk) 18:16, 21 February 2008 (UTC)
This article seems too biased to Internet web search engines. I came here to find information on how full-text indexes work in general (hash tables?) and found lots of stuff about... identifying languages? meta tags? Web 3.0?? —Preceding unsigned comment added by 200.68.94.105 ( talk) 19:06, 22 August 2008 (UTC)
I could not agree more: there should be a separate article that describes the universal concept of a search index, providing examples of search index types and their applications, such as DBMS, text, and image search engines, not necessarily an Internet search engine! The article should contain a link to an article that describes (Internet) full-text search engines. Itman ( talk) 19:53, 24 February 2009 (UTC)
One of the previous subject authors wrote: "Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science." The word "informatics" is linked to another page titled Information technology. Note: Another page titled Informatics (academic field) exists. The Informatics (academic field) page specifies "Not to be confused with Information technology". This discussion requires a SME to identify 1. whether Index design incorporates informatics or information technology and 2. link the appropriate page to the designated word. araffals 17:11, 22 January 2013 (UTC) — Preceding unsigned comment added by Araffals ( talk • contribs)
This article's lead section gives "search engine optimisation indexing" as a synonym for search engine indexing. Is this a mistake? Jarble ( talk) 03:10, 29 September 2020 (UTC)