A DIRECT APPROACH TO INFORMATION RETRIEVAL
Master's Thesis, University of London, 1975
Submitted by Kyung-Youn Park
Supervised by B. C. Brookes
University College London
"What we should do, I suggest, is to give up the idea of ultimate sources of knowledge, and admit that all knowledge is human; that it is mixed with our errors, our prejudices, our dreams, and our hopes; that all we can do is to grope for truth even though it be beyond our reach. We may admit that our groping is often inspired, but we must be on our guard against the belief, however deeply felt, that our inspiration carries any authority, divine or otherwise. If we thus admit that there is no authority beyond the reach of criticism to be found within the whole province of our knowledge, however far it may have penetrated into the unknown, then we can retain, without danger, the idea that truth is beyond human authority. And we must retain it. For without this idea there can be no objective standards of inquiry; no criticism of our conjectures; no groping for the unknown; no quest for knowledge."
"In science men have learned consciously to subordinate themselves to a common purpose without losing the individuality of their achievements. Each one knows that his work depends on that of his predecessors and colleagues, and that it can only reach its fruition through the work of his successors."
"The modern World Encyclopaedia should consist of relations, extracts, quotations, very carefully assembled with the approval of outstanding authorities in each subject, carefully collated and edited and critically presented. It would not be a miscellany, but a concentration, a clarification and a synthesis."
In this study I am concerned with the file organization of scientific literature with a view to discovering useful information efficiently: in large part, the problem of information retrieval. Information retrieval now seems to imply something more than a mechanistic and technical problem, something that gradually resolves into the complexity of human communication, understanding and knowledge. Similar views have recently been expressed by Mitroff et al [1] and by Brookes [2] in a wider context. "As we may think," or look back, our initial hope for information retrieval has faded in spite of the tremendous development of computer techniques and other advances made over the past thirty years. This frustration was anticipated as early as 1948 by Wiener [3]. Still we are not sure whether we could restore that hope in the near future, particularly along the same line of thought.
As to scientific information* in the wide sense, the following fundamental questions may be raised:
Obviously, information retrieval is most closely related to the last question. But I feel that the other two questions should also be taken into consideration when we intend to discuss information retrieval carefully. I selected the prefatory statements by Popper [4], by Bernal [5], and by Wells [6] as the most thought-provoking with respect to these three fundamental questions. These statements represent the standpoint I have taken in approaching the problems of information retrieval.
In the following chapters, I first discuss some fundamental considerations for information retrieval. I shall delimit the retrieval problems mainly by way of Fairthorne's insightful contention [7]. I shall then attempt to understand the problems in the light of communication and information, notions which appear to remain almost undefined. For this purpose I attend to Cherry's critical view of human communication [8] and to Ogden and Richards' classic theory of interpretation [9]. In short, I am seeking a solution to the problems of information retrieval by questioning what influences those who communicate and obtain information.
Eventually, I propose a way of file organization as most essential for information retrieval. The proposal is only crude at this stage. In fact, the discussion of fundamental considerations is intended to clarify, and to some extent justify, an idea that will require further elaboration and application. The main feature of the proposal is to use in retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents. Such extracts seem to provide concise but significant clues for discriminating the cited documents. Even the most concise clues should be regarded as significant when they are coherent in their proper environments or contexts.
The overall view of main retrieval events may be represented schematically as shown in Figure 1. It may be said here that:
Figure 1. Schematic View of Information Retrieval Events.
Information retrieval is a complex type of communication between the system and the user. The schematic diagram in Figure 1 roughly shows the situation. Admittedly, the diagram is too simple and crude for explaining information retrieval meaningfully. It will be expanded in Chapter 5. Meanwhile, it may suffice to show how to approach retrieval problems.
What we want to know ultimately is the relationship between the system and the user, which is represented in Figure 1 by the solid arrows and characterized by prediction and discrimination of documents. Also, we can consider many other relationships in the diagram; for example, those represented by the dotted arrows and the broken arrows. Here we can reasonably assert that all knowledge of these relationships should concentrate on explicating the relationship of utmost importance between the system and the user.
On the other hand, information retrieval may be possible with little or no attention to knowledge of the relationship between the system and the user. That is to say, we can contain the system and the user in a black box*, perform information retrieval, and improve the performance successively by feedback control. Combination of the solid arrows and the dotted arrows makes a closed cycle for feedback control. The black box has two input terminals, Ein and Din, which are input to the system and the user, respectively. It also has two output terminals: one for the user to give Eout in search of, and then in response to, Din, and the other for the system to retrieve Dout in response to Ein. This principle is illustrated in Figure 2, where Po represents the given initial condition or a set of performance factors of the black box.
Figure 2. Feedback Control of Information Retrieval.
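The feedback principle of Figure 2 can be sketched in code, under the simplifying assumption that the black box's performance factors Po reduce to a single matching threshold. All names here (the term-overlap scoring, the control rule, the judge function standing in for the user) are hypothetical illustrations, not part of the thesis.

```python
def retrieve(documents, query_terms, threshold):
    """Dout: documents whose term overlap with the request Ein meets the threshold."""
    return [d for d in documents
            if len(set(d["terms"]) & set(query_terms)) >= threshold]

def feedback_loop(documents, query_terms, judge, p0=1, rounds=3):
    """Successively adjust the initial condition Po by the user's feedback.

    `judge` plays the user's role: it marks each retrieved document Din
    as relevant or not (Eout in response to Din)."""
    threshold = p0
    for _ in range(rounds):
        retrieved = retrieve(documents, query_terms, threshold)
        if not retrieved:
            break
        relevant = [d for d in retrieved if judge(d)]
        precision = len(relevant) / len(retrieved)
        # crude control rule: too much noise, so tighten; otherwise stop
        if precision < 0.5:
            threshold += 1
        else:
            break
    return threshold, retrieve(documents, query_terms, threshold)
```

Note that such a loop improves performance without ever explaining why the user judged as he did; it treats the system-user relationship as opaque, which is exactly the limitation discussed below.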
Whether or not it is possible and practicable, this principle would almost certainly not tell us much, in any meaningful way, about the relationship between the system and the user. In other words, it may not necessarily be suitable for explicating the relationship. Even if suitable, it can explain the relationship only indirectly, i.e., through inferences from a great deal of valid and consistent evidence.
The approach that has been overwhelmingly used in the field of information retrieval is very similar to this principle. The main difference is that the initial condition Po is varied in many ways in order to discover which initial condition will give the optimum performance of the system. This approach is not really intended to reveal the relationship between the system and the user. The other, direct approach will be attempted in this study.
The user needs information. Even if he is seeking for it in a document, he may be little conscious of the physical form of document. For he may understand the notion of information in much the same way as a housewife in the marketplace does.
The user can easily and quite properly speak of pertinent, relevant, useful, or valuable information, expressing somewhat different shades of meaning. All these similar qualifiers, however, seem to be more or less redundant, for the user would appreciate information as such only when it is relevant to his specific purpose. Furthermore, it seems certain that these qualifiers are used only ambiguously.
Also, it is hard to say that the user, having found extremely valuable information in the library, must have in mind any economic sense at all. Of course, anyone would be perfectly right to understand information essentially as something "bought, sold, stored, treated, exchanged and consumed in economic terms," if there exists that sort of information. Another way of understanding [8] is such that:
Obviously, this quotation represents our common understanding. However, we know the fact that, by understanding or defining something in a particular way, one specifies in effect one's readiness or intention to communicate with other people as to the thing defined. And those people are expected or invited to share the same understanding. Unfortunately, some people would not or could not agree with the suggested definition, however authoritative, because it is absurd, unnecessary, out of their concern, or for some other reasons. In this case, communication is likely to become a conflicting argument apart from what is to be communicated; actually the breakdown of communication.
Communication between the retrieval system and the user, as illustrated in Figure 1, is of secondary importance to the user. It is only intermediary, or necessary for another communication of primary importance, that is, communication of information with the author of a document. Therefore a misconception of information may damage both communications.
From the psychological point of view, Stevens [10] attempted to generalize communication by defining it as "the discriminatory response of an organism to a stimulus." This definition was criticized by Cherry [8] on the grounds that communication is essentially the relationship established between stimuli and responses. It does not follow however that the definition is wrong. It simply focuses on the communication event at the receiving end; for example, the discriminatory response of the user to a given document.
Information seeking presupposes the satisfaction of information needs. The user will be really satisfied only when he finds information relevant to his need. Naturally, he discriminates what he has received in the light of his own need and criteria: that is, relevance judgment. Nothing can stop him from being subjective and tough in the judgment. He may even make his relevance criteria absolute. It is now well known that relevance judgments are complicated, depending not only on subject matter but also on many other things [11]. It is noteworthy that the real judge is the user. If a panel of judges were to take his place, it might serve some practical purposes, but only as a makeshift. We shall reserve the pure notion of relevance for those systems that aim to provide relevant information. [8]
Many studies have shown that informal communication is very popular among scientists, especially among those who are eminent. This phenomenon is quite understandable. However, it is a mistake to conclude that informal communication is superior to formal communication. At the least, informal communication pays little attention to the "social responsibility," or morality if you like, which was emphasized by Bernal [5]. It should be said that each represents a different machinery of communication, neither degrading the other. Presumably, it is all the past experience of an eminent scientist in formal communication (e.g., through books, journals, lectures, libraries, and so on) that has enabled him to short-circuit to the essence alone, say, a few words of suggestion. This short-circuiting may be applied to formal communication as well.
Among many others, Jahoda, et al [12] observed that 66% of faculty members interviewed in one university maintained personal indexes, that 42% of them regarded preparation of indexes as too time-consuming, and that 32% complained of inconsistency in indexing. No doubt, a large portion of scientists spend much time in preparing their own systems, i.e., personal indexes or the like, and they are themselves suffering from so-called retrieval problems. Therefore, anything that can help them solve the problems and improve their own systems would be duly appreciated.
On the other hand, naive and primitive as they may be, personal indexes must be worth careful study in order to learn how scientists extend their retrieval facilities toward systems. An important suggestion to all kinds of retrieval systems may be found there.
From the user's point of view, any retrieval system may be regarded as an extension of his information-seeking facilities. The user can satisfy his information need to some extent for himself, instead of delegating retrieval to what may be called outside systems as opposed to personal means. In this respect, it is questionable whether outside systems, however elaborate, can promise better satisfaction than personal means which are familiar to the user.
When the user delegates retrieval to the system, there must be some agreement, although tacit, between both parties about the way in which the system is designed to act on behalf of the user. A set of constraints is characteristic of any system whatsoever. Beyond these constraints, the retrieval system is not expected to answer and the user is not normally allowed to ask. However, these obvious constraints seem to be often disguised and overlooked. (Note that at the moment we are talking about current systems regardless of their future developments.) This aspect has been convincingly discussed by Fairthorne in many of his writings.
Then, what is the agreement? The straightforward answer is that the user should agree that the system works on the basis of subject similarity rather than relevance. The distinction between these two apparently similar notions should be made clear.
If any two readers are compared as to what each of them recognizes from the same document, the two will in general differ from each other. For everyone tends to interpret subjectively what is written or said. These individual interpretations may be superimposed for many readers, in order to separate the densely overlapping, and thus explicit, meaning from the relatively subjective and more or less implicit meanings.
We can fairly reasonably say that in interpreting a document, the indexer tends to behave in a unanimous way; in principle he can discern from a document most of the explicit meaning. The implicit meaning that he may discern in addition would be negligible considering the large number of potential users. As opposed to the indexer, the user tends to interpret in a subjective way; he may find information not in the explicit meaning but elsewhere. Various users may differ greatly from each other in finding information, according to their past experiences and present states of mind.
Not attempting to be precise, let us associate the explicit, unanimous interpretation with subject similarity. And let us associate the subjective, individual interpretation with relevance. Perhaps we cannot discuss similarity without unanimity or commonality of interpretation; nor relevance without subjectivity or individuality of interpretation. Fairthorne [7] distinguishes them as extensional aboutness and intensional aboutness. We shall return to this distinction later.
It may be said that subject similarity is a necessary condition for relevance, or equivalently that relevance is a sufficient condition for subject similarity. In this respect, a system which operates, even ideally, on the basis of subject similarity, or in a unanimous way, is liable to two types of error: it may miss relevant documents and it may retrieve non-relevant documents. It cannot decide relevance by any means. Thus it can only predict relevance, however ideal it may be in recognizing subject similarity.
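The two types of error can be made concrete with a toy illustration in set notation. The document sets below are invented for the example: one set is what an ideal similarity-based system retrieves, the other is what the user actually judges relevant.

```python
similar = {"d1", "d2", "d3", "d4"}    # subject-similar to the request
relevant = {"d2", "d4", "d5"}         # judged relevant by the user

missed = relevant - similar           # relevant documents not retrieved
false_drops = similar - relevant      # retrieved documents not relevant

print(sorted(missed))       # ['d5']
print(sorted(false_drops))  # ['d1', 'd3']
```

However well the similarity class is recognized, it does not coincide with the relevance class, so both kinds of error remain possible.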
Strictly speaking, relevance is a priori from the system's point of view. It is true in the sense that relevance criteria, however precisely stated by the user, cannot wholly be accepted due to the system constraints, so the accepted part may not be sufficient in the end. It is all the more so because relevance criteria, however readily accepted by the system, cannot affect indexing retrospectively. To borrow Fairthorne's contention [7]:
We have still another reason to believe in the a priori characteristic of relevance. A great deal of experimental as well as operational experience in retrieval has been accumulated at least over the past twenty years. Retrieval languages and devices must have been greatly improved. Nevertheless, how much has been learned about how the user judges a relevant document* as such? Obviously not much.
If so understood, relevance must have been overemphasized in evaluating retrieval systems. In particular, comparison of different systems might have been unfair to some, for it is not certain whether or not subject similarity runs parallel with relevance. How retrieval systems are to be evaluated and compared should be made clear first: in terms of subject similarity, of relevance, or of both. An evaluation based solely on subject similarity would not tell much about how to give more satisfaction to the user; one based solely on relevance would not tell much about how to keep the agreement with the user better. Collation of both evaluations will be necessary to learn about the relationship between the system and the user.
A group of documents can be said to be similar to each other when they have in common a set of identical properties A; they are similar with respect to the shared properties A. In general, each document in a similarity group has some other (different) properties B in addition to A. Therefore, the content C of a document may be represented:

C = A * B
This equation may apply somewhat analogously to the document surrogate, too.
Because of the repetitive nature of the shared properties A, a group of similar documents is characterized by semantic redundancy, even if not by textual redundancy. This characteristic will be transferred, somewhat analogously, to the corresponding document surrogates. That is to say, the identical properties A are repeated not only in similar documents but also in their surrogates. This repetition or redundancy in a group of similar surrogates appears to be inevitable, because without it there would be no grouping of similar documents or surrogates. But from the point of view of file organization it is not quite so. For one thing, the idea of inverted files is worth remembering in this connection; however, this idea is likely to introduce another kind of redundancy, namely repetition of the name of a surrogate that belongs to many similarity groups, e.g., under many index terms.
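The trade-off just noted can be sketched with invented surrogates: an inverted file removes the need to repeat the shared properties A in every surrogate consulted, but instead repeats the name of each surrogate under every index term it belongs to.

```python
from collections import defaultdict

# hypothetical surrogates, each named and reduced to a set of index terms
surrogates = {
    "doc1": ["retrieval", "indexing", "relevance"],
    "doc2": ["retrieval", "indexing"],
    "doc3": ["retrieval", "relevance"],
}

# build the inverted file: term -> list of surrogate names
inverted = defaultdict(list)
for name, terms in surrogates.items():
    for term in terms:
        inverted[term].append(name)  # each name recurs under several terms

print(dict(inverted))
```

The semantic redundancy of repeating the terms in each surrogate has been exchanged for the nominal redundancy of repeating "doc1", "doc2", "doc3" across the postings.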
An abstract file as a retrieval tool is no exception to such redundancy. The comparative efficiency of abstracts in retrieval is still controversial. The low efficiency of abstracts, if true, may stem from difficulties in formalization and in machine processing. However, formalization does not really matter so much in human processing. And we can reasonably assert that abstracts contain much greater "semantic information" than other kinds of surrogates such as titles, sets of index terms, and classification codes. Therefore, without considering the time consumed, the human searching of abstracts should perform better than that of other surrogates in judging similarity, at least in principle.
Suppose that abstracting processes are formalized to such an extent that the above equation holds well. Then it will be possible to exclude the identical part A from all but one of the similar abstracts, letting them refer to A in the retained abstract. Alternatively, we can list all the similar abstracts in one of them. By doing so, we need not search for them one by one but as a group, whenever a search request falls upon A.
Once the existence of A is accepted, a model abstract of the identical part A may be desired for all the documents that have A in common. A collection of such models will look like a classification scheme. This can be applied to individual abstracts. Then each abstract may consist of the prescriptive code for A and the descriptive text for B, the different part. (This approach may be adapted, in parallel, to combining a hierarchical classification system with a descriptive indexing system.) In practice, the prescriptive code may or may not be substituted for the text corresponding to A in an abstract. What is implied in this idea is not merely to reduce the textual or semantic redundancy involved in a group of similar abstracts.
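The factoring proposed above can be sketched as follows. Each abstract in a similarity group keeps only its different part B as free text, plus a prescriptive code standing for the shared part A; the class codes and texts below are invented for illustration, not drawn from any real abstracting service.

```python
# one model abstract per shared part A (a miniature classification scheme)
model_abstracts = {
    "A1": "Studies feedback control of document retrieval systems.",
}

# individual abstracts: prescriptive code for A plus descriptive text for B
abstracts = [
    {"code": "A1", "text": "Applies the idea to chemical titles."},
    {"code": "A1", "text": "Tests the idea on engineering abstracts."},
]

def expand(abstract):
    """Reconstitute a full abstract: the shared part A followed by the part B."""
    return model_abstracts[abstract["code"]] + " " + abstract["text"]

print(expand(abstracts[0]))
```

A search request falling upon A need only match the code "A1" once to reach the whole similarity group, rather than matching the repeated text of A in every abstract.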
In general, document surrogates include errors of various kinds. Let us take for example just one kind of error: inconsistency in surrogation. Many inconsistencies can hardly be said to be errors in the strict sense, for the surrogates are fairly correct individually. The cause of these inconsistencies may be attributed to difficulty in, or lack of, formalization.
In this respect, abstracting systems, particularly those based on author abstracts, seem hopeless to control. However, this is not the whole point. By default, the failure caused by inconsistency is left to be repeated each time the abstracts are searched. Certainly, this failure can be prevented or reduced by careful examination and grouping of similar abstracts, prior to a series of searches.
This prior grouping process implies retrieval which ensures high recall even at the cost of low precision. One thing that matters here is the manageable number of abstracts to be examined as to their similarity. The greater the number, the more preventive work there is to be done. What makes matters worse is the possible multiplicity of similarity groups which an abstract belongs to at the same time. We may not even make certain which groups will be more significant or more likely to be requested by the user. This situation will eventually demand enormous efforts. Our ideal to rule out inconsistencies may require prohibitive efforts.
We all know something about abstracts and extracts, not being pretentious. However, this general kind of knowledge may not suffice for critical discussion of their characteristics, merits, snags, and so on. An abstract was defined as an abbreviated, accurate representation of a document; and an extract as consisting of one or more portions of a document selected to represent the whole. Were they defined with accuracy? Were the definitions intended for making clearer how to make abstracts and extracts? Are there any really working standards for making them?
Any document surrogate, of however small and biased a content, may be justified, because it is not the document itself but a representation, description or prescription. Sometimes it is mistakenly assumed that the content of a surrogate is the same as the content of the corresponding document, or that the equation C = A * B holds equally in both cases. Distinguishing between intensional aboutness and extensional aboutness, Fairthorne [7] says that:
Even with the great flexibility and elasticity of language, it seems almost impossible to make an abstract of about two hundred words exactly analogous to the content C of the corresponding document. In other words, selection and bias are more or less unavoidable in abstracting. If the paraphrasing of a selection is considered to be semantically superficial, then the difference between an abstract and an extract will be somewhat marginal. Both are biased selections or parts of the content C.
Roughly speaking, an abstract is intended to balance its selection uniformly over C, aiming at inductive information effects. An extract, by contrast, is intended to spot its selection (perhaps the conclusive part) eccentrically within C, aiming at immediate rather than inductive information effects. Yet no formal procedures, beyond conventions of a vague nature, are available as to what to select.
Considering the power of meta-language and its use in retrieval, Goffman, et al [13] notice that an abstract is given in meta-language whereas an extract in object-language. They further notice that many abstracts, being written in "trivial" meta-language, should more accurately be called extracts.
Selection or part of a document, whether balancing or spotting, should assume that it can do without the rest or context. In other words, it should be an independent unit of discourse. Truly, abstracts, extracts, titles, even index terms, all these tell us something on their own account. Fairthorne [7] paraphrases Bohnert's notion of data as:
This phrase seems to be worth careful scrutiny. Perhaps, we can raise several questions such as:
We shall discuss these and other questions in the next chapter. Meanwhile, Belzer [14] calculates "the entropies of the various surrogates of error-free information," by assigning one bit of information to a full document. For five different types of surrogates - citation, abstract, first paragraph, last paragraph, first and last paragraph - he observes the 2 x 2 contingency of:
By showing the calculation result as in Table 1, and by calling attention to the fact that, of these surrogates, only abstracts require extensive professional effort to produce, he in effect revives the superiority of extracts over abstracts. Comparison of a document with its surrogates is also interesting.
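The kind of calculation Belzer describes can be sketched as follows: the entropy, in bits, of a surrogate treated as a binary predictor of relevance, computed from a 2 x 2 contingency table. The counts below are invented for illustration and are not Belzer's data.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# hypothetical 2 x 2 contingency:
# rows = surrogate indicates relevance (yes/no), columns = document relevant (yes/no)
table = [[40, 10],
         [5, 45]]
total = sum(sum(row) for row in table)
joint = [cell / total for row in table for cell in row]

print(round(entropy(joint), 3))
```

With these invented counts the joint entropy comes to roughly 1.6 bits, against the single bit Belzer assigns to the full document; the interest lies in comparing such figures across the five surrogate types.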
We are free to think of anything. At one moment we think of the trees in the garden; at another, of tomorrow's weather; at still another, of a past experience. These things that we think of, whether existent or non-existent, substantial or imaginary, true or false, may be said to be fairly stable in contrast to our free thoughts. The trees in the garden must be there dropping leaves, even while we stop thinking of them. Thus we can freely organize or map our thoughts upon this stable background.
We invent symbols* with which to express or map our thoughts on things. In isolation we are free to invent and use any symbols to express our thoughts. In a society where we should associate, cooperate, share, or communicate with other people [8], we have two options. That is, either to dictate our own invention or to obey the social rule. Small babies would begin with dictations, e.g., crying out instinctively to express their desire. However, they cannot merely dictate. That would be admittedly too limiting. Far more required of them and far more convenient in most cases is to conform to the social rule. They learn to practice it gradually but never completely. Even when grown up, dictations are necessary. As a matter of fact, the two options are normally intermingled, but eventually separable.
The social rule on how to use symbols has evolved from among people, somewhat loosely and still changing. Truly, people have been inventing, elaborating, and using more and more convenient symbols for more efficient communication. The speaker expects the listener to receive the symbol that is substantially imparted or communicated. However, the speaker's real expectation is that the listener would share the same thought and direct to the same thing that the speaker has in mind. Largely, this expectation is met when the speaker conforms to the well-established social rule, e.g., grammar and dictionary. Then the majority understand him unanimously; communication is clear-cut. In this case the meaning of such a symbol may be said to be denotative, explicit, or extensional. Refer to a dictionary for these words (Table 2).
On the other hand, communication does not appear as simple as that. Indeed, communication as a whole is a loose affair. Ambiguous and misleading expressions, misunderstandings and different interpretations, and so on. Supposing that someone intends to convey a definite thought or story with the following word string [8]:
which looks almost nonsensical as a whole. Then, what will happen to us listeners? We have a dictionary, but we cannot simply sum up the meanings of the individual words. That "a whole is more than the sum of its parts" is too plain a saying. There seems to be no grammar to which the speaker might have conformed. He merely suggests rather than tells the story, which in other words is implied or implicit in the word string, i.e., the symbol. From this awkward symbol we can guess the story with varying accuracy, if we are ready to take risks. In this case, the meaning of such a symbol may be said to be connotative, implicit, or intensional. Again, refer to a dictionary for these words (Table 2).
When we communicate our thoughts about things, we use signs. That is to say, three factors - a thought, a thing, and a sign - are essentially involved in any communication event, either speaking or listening. Ogden and Richards [9] place the three factors at the corners of a triangle, where the relations between the factors are represented by the sides, as shown in Figure 3. They recognize that there are causal, direct relations between a sign and a thought, and between a thought and a thing. But, they insist, the relation between a sign and a thing is merely "imputed," as opposed to the causal relations; it holds only indirectly, round the two sides of the triangle. They further insist that it is because of this imputed relation that most language problems arise. Signs are instruments subjected to thinking or interpretation; they can be related to things only through thinking, or more specifically through interpretation.
Things and experiences are also interpreted; they are treated as signs. Thus, through all our life, we interpret signs in the widest sense, with few exceptions. Then, what happens when we interpret signs? Ogden and Richards [9] generalize the process of sign interpretations as follows:
For example, a dog, on hearing the dinner bell, interprets the bell sounds as a sign and runs into the dining room. He can do so owing to past experience in which clumps of events - bells, savours, longings for food, etc. - have recurred "nearly uniformly." Such a clump of events may be called an external context. And the mental events occurring in the dog, which link the present bell sound with the past experience of bells-savours-longings, may be called a psychological context. To define more precisely:
Contexts occur more or less uniformly; that is to say, the constitutive characters of a context recur with some uncertainty, or with a probability. It follows that a context is said to be determinative with respect to one character when the characters are closely related. By taking very general constitutive characters and uniting relations, we have contexts of high probability; we can increase the probability of a context by adding suitable members. Thus we react to the recurring part of a context in the same way as we reacted to the whole context. Experience recurs in contexts which recur more or less uniformly, and interpretation is only possible within these recurring contexts.
Finally, Ogden and Richards [9] attempt to narrow down their implications by applying the context theory of interpretation to the use of words at different levels, from the simple recognition of sounds as words to the critical interpretation of words.
Ogden and Richards [9] do not specify contexts in the triangle in Figure 3. Cherry [8] modifies the diagram as shown in Figure 4a. We shall further modify it as shown in Figure 4b, and say that the triangle is surrounded by the external context and contains the psychological context inside. Still, the diagram only represents either speaking or listening. Thus we shall develop the diagram further in the following.
A unit communication, including both speaking and listening, may be represented by the diagram shown in Figure 5. The arrows and the corresponding words conveniently represent the functional flow in a unit communication. Thus we shall say that:
If there arises no physical distortion between the two signs, then the sign in speaking and the sign in listening will be the same, or coincide. If the listener's thought is directed to the same thing that initiated the speaker's thought, then we shall have an ideal unit communication as shown in Figure 6.
We may develop the diagram further in order to represent communication situations which are more complex than a unit communication. And we shall normally approximate individual units of communication to ideal units as shown in Figure 6.
Let us take for example the password game. The questioner, thinking of WATCHWORD, gives a symbol 'watchword' to the intermediary, who in turn gives another symbol 'password' to the answerer. Before and after the translation from 'watchword' into 'password,' the intermediary's thoughts I and I' should be different, such that I corresponds to WATCHWORD and I' to PASSWORD. Therefore, the answerer's thought should be directed first to PASSWORD, and then to WATCHWORD, which is the correct answer. The answerer must make a guess that reverses the translation. This password game is illustrated in Figure 7. Communication between the questioner and the intermediary makes an ideal unit, and that between the intermediary and the answerer makes another. These two ideal units are separated by a communication gap which must be overcome by the answerer's guesswork. As a corollary, the complexity of the communications involved in information retrieval may be shown as the diagram in Figure 8.
The idea proposed in this chapter is to use in information retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents (see Figure 9). It is only exploratory within the scope of this study. It can be justified on the grounds that the citing and the cited documents are coherent with each other, that extracts provide concise clues for discriminating these documents, and that even concise clues are interpreted meaningfully in the given contexts. Although widely practiced among information users, the idea has not yet, as far as I know, been formally studied in view of efficient file organization. Therefore, the implications of the idea might go farther in the future than can be expected now, and require more exploration. In this respect, what is immediately required is some rationale behind the idea. While all the preceding discussions are relevant to this rationale, the following are intended to support the idea specifically.
Now it is almost certain that subject coverage or specialization can hardly be defined consistently and objectively. At best we can say that two documents are similar with respect to something, based on the evidence that we recognize from the documents. Still, the totality of evidence would not ensure similarity; it gives us no more than a degree of belief.
In most cases, two documents similar with respect to something are indexed or abstracted individually. In this sense they are related to each other only indirectly, or with some uncertainty. Indexing inconsistency, mainly caused by individual variation even in the case of fairly adequate assignment of index terms, is now well known. This will significantly degrade retrieval as a grouping process of similar documents. Therefore, it will be desirable to use the direct evidence of similarity established between two or more documents.
We can quite reasonably say that the citations, by which I mean both the citing and the cited articles inclusively, are similar at a certain level of abstraction, especially in highly specialized fields of science. Therefore we can trace back and forth between the citations in order to find similar articles. This is the principle of citation indexing applied by Garfield [16]. However, the serious objection to citation indexing is that it entails too much risk, relying heavily on the mere fact that X cites Y. Tracing back and forth tends to diverge tremendously. The solution required for this technique would be to exclude noise sources and provide all the citations with subject indicators more powerful than titles. Lipetz [17] attempted to improve the selectivity of citations by providing "context indicators" rather than "subject indicators." His approach seems plausible, but demands much intellectual effort. After all, the usefulness of direct evidence has not yet been significantly warranted by citation indexing.
In this respect, the far more elaborate method of bibliographic coupling, developed by Kessler [18], shares the same fate as citation indexing. It is noticed [19] that "citation tracing is a pervasive information-seeking mode." What should be further noticed is that backward tracing is much more pervasive, and that any intellectual tracing is initiated by discerning some meaningful evidence rather than the "mere fact."
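The mechanics of citation tracing and bibliographic coupling discussed above can be sketched in a few lines of modern code. The citation graph below is entirely hypothetical and serves only to make the two operations concrete; this is an illustrative sketch, not part of the original study.

```python
# A toy illustration of citation tracing and bibliographic coupling.
# The citation graph is hypothetical: each key cites the articles listed.
cites = {
    "X": ["Y", "Z"],
    "W": ["Y"],
    "V": ["Z", "Y"],
}

# Invert the graph so we can also trace forward (who cites a given article).
cited_by = {}
for src, refs in cites.items():
    for ref in refs:
        cited_by.setdefault(ref, []).append(src)

def traced(article):
    """One step of backward tracing (the references) plus forward tracing
    (the citing articles), as in a citation index."""
    return set(cites.get(article, [])) | set(cited_by.get(article, []))

def coupled(a, b):
    """Bibliographic coupling strength: the number of shared references."""
    return len(set(cites.get(a, [])) & set(cites.get(b, [])))

print(sorted(traced("Y")))   # articles linked to Y by citation
print(coupled("X", "V"))     # X and V share references Y and Z -> 2
```

Note how quickly `traced` can diverge when applied repeatedly: each step multiplies the candidate set, which is exactly the objection raised above against relying on the "mere fact" of citation.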
On the other hand, it is questionable whether indexes and abstracts are the only means of retrieval as an extension of information-seeking facilities. Books, reviews, monographs, and journal articles: all these are likely to lead our information needs to other sources of information. Almost all scientific articles cite, describe, analyse, and group a number of other articles. Thus, the reader of the citing articles can, perhaps very easily, discriminate the cited articles as to their subjects, crucial points, logical relationships, and so on. By doing so, the reader is in effect retrieving relevant articles with the aid of expertise.
Vickery [20] emphasizes the importance of review articles and the like as an efficient, selective "means to discover what they must read amid the vast mass of available documents," pointing out that "the traditional means of discovery of the pertinent literature are inadequate." Nevertheless, the traditional means may best serve to give access to the more selective means. That is to say, the strategy of discovery may best be divided between the two different kinds of means.
A similar strategy was considered by Goffman, et al [13] by introducing meta-linguistic terms to indexing. However, their approach appears passive in that it is simply intended to divide a file in order to economize searches. A more active approach is therefore desired for selective discovery in terms of quality rather than quantity.
On the other hand, Goffman, et al [13] regret that many abstracts written in "trivial" meta-language are much closer to object-linguistic "extracts," and that many reviewers write abstracts instead of reviews of the state of the art. They seem to favor meta-linguistic abstracts over object-linguistic "extracts." Ironically, one of the authors recently showed that extracts are better than abstracts in terms of calculated entropies as well as intellectual effort [14]. The power of meta-language, which they properly recognized, thus remains inconclusive, awaiting further observation.
On the whole, most of the traditional means, such as subject indexes, abstracts, and extracts, seem paralysed in the face of efficient file organization. The obsolescence of scientific literature [21] is now widely known. Brookes [22] was interested in the obsolescence involved in a cumulative file. Unfortunately, his interest has not yet been worked out. Certainly a large, cumulative file would accumulate archival value, but at the cost of retrieval devaluation. Thus systematic file organization AND maintenance should be taken as most essential in view of information retrieval.
Recently, Blaxter and Blaxter [23] report an interesting observation on the needs and habits of scientific authors and readers in three research institutes. They show that the information needs of individual working scientists are met by a very small number of primary journals, and that the cited references appended to primary articles or review articles are used in most literature searches. More precisely:
If this were to be the general pattern of literature searches by working scientists, and if information retrieval is to meet ultimately the information needs of individual scientists, file organization should be considered in the light of the above observation.
Figure 9 shows the first paragraph extracted from an article* (hereafter called the sample article) in a recent issue of Physical Review. The extract has eight references (Refs. 1-8) not merely cited and described, but also criticized and collated. With respect to the cited references, the extract is meta-linguistic and of a review kind. Similar extracts can be made from other parts of the sample article wherever each cites one or more references (Figure 13). By extract is meant hereafter an extract of this kind, as opposed to a common, object-linguistic extract.
From the extract in Figure 9, a subject index to Refs. 1-8 may be derived as illustrated in Figure 10. The complete subject index to Refs. 1-37 of the sample article is shown in Figure 11. The actual indexing is done on the work sheet illustrated in Figure 12, the indexer selecting index terms while scanning the source document. There is therefore no need actually to produce the extracts.
Figure 13 illustrates a provisional compilation for the sample article where:
Similar compilations for a number of source documents may be serially accumulated into a file. Being combined with the subject index and the author index, this file may be used as personal or other means for information retrieval. Convenience of the file will remain a technical problem.
The use of the file in retrieval is much the same as that of reviews and textbooks, which can lead the reader to various sources of information. As mentioned previously, the extracts under consideration are in fact of a review kind. External and psychological contexts are involved in reading reviews. In extending to other sources of information, the reader can benefit from the expertise provided by reviews of source documents. Certainly he would not make instantaneous, mechanistic YES-NO decisions based on simple criteria. On the contrary, his decisions will be carefully thought out.
Selection of one source document by using the subject index is relatively less important, since it is mainly intended to lead to retrieval of as many cited references as possible. Therefore the usefulness of the file will depend on the coherence of citations, i.e., the coherence of the cited references with each other as well as with the source document. And extracts should be made as short as possible, so long as they do not significantly degrade the maximum coherence obtainable from the full text. Here, coherence may be defined:
coherence = (number of citations retrieved as relevant) / (number of citations examined)
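The coherence measure just defined is a simple ratio, and can be computed directly; the figures used below are hypothetical, chosen only to illustrate the calculation.

```python
# A direct computation of the coherence measure defined above.
def coherence(retrieved_relevant, examined):
    """Fraction of the citations examined that are retrieved as relevant."""
    if examined == 0:
        return 0.0  # no citations examined: coherence is undefined; use 0
    return retrieved_relevant / examined

# E.g., if 6 of the 8 citations of an extract prove relevant:
print(coherence(6, 8))  # 0.75
```

A coherence of 1.0 would mean every citation examined through the extract proved relevant, which is the limiting case the text suggests shortening extracts should not significantly degrade.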
From the extract Ext a in Figure 13, the reader may notice that all the cited references (Refs 1-8) are about INDEPENDENT PARTICLE MODEL, which presumably represents the significant aspect in common. Much subject content behind this representation may be covered by the abstract of the source document. Thus, given the context by the abstract, the reader can to some extent do without the individual abstracts of the cited articles. Similarly, the reader can benefit from other contexts which are exchanged between the cited references. How much he can benefit from these external contexts will depend on his psychological context.
Extracts should be made primarily in one or more sentences. Description in sentences is one of the advantages of extracts over description in keywords. However, some extracts are nonsensical, mostly redundant, or require modification. In these cases it would be better either:
In short, the length and the coherence of extracts should be balanced. Extraction of keywords or phrases, similar to subject indexing, may suffice in many cases.
Perhaps the simplest file organization would be to mark extracts directly on the source document and to derive the subject index from them. In a sophisticated environment, e.g., visual display and keyboard manipulation of constituent files, the following organization may be convenient.
Figure 14 illustrates an entry to the subject index, and Figure 15 illustrates ways of access from the subject index to the citation index and to the extract file.
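The access path just described, from the subject index through the citation index to the extract file, can be sketched as three linked tables. All identifiers and contents below are hypothetical stand-ins for the entries illustrated in Figures 14 and 15; this is a sketch of the organization, not a reconstruction of the actual figures.

```python
# Hypothetical constituent files for the organization described above.
# Access runs: subject index -> citation index -> extract file.

subject_index = {
    # index term -> citation identifiers
    "INDEPENDENT PARTICLE MODEL": ["c1", "c2"],
}

citation_index = {
    # citation id -> bibliographic datum and the extracts that cite it
    "c1": {"ref": "Ref. 1 of the sample article", "extracts": ["e1"]},
    "c2": {"ref": "Ref. 2 of the sample article", "extracts": ["e1"]},
}

extract_file = {
    # extract id -> the meta-linguistic extract text
    "e1": "Extract Ext a, citing Refs. 1-8 of the sample article.",
}

def extracts_for(term):
    """Follow a subject-index entry through to the extract texts."""
    texts = []
    for cid in subject_index.get(term, []):
        for eid in citation_index[cid]["extracts"]:
            if extract_file[eid] not in texts:  # avoid duplicates
                texts.append(extract_file[eid])
    return texts

print(extracts_for("INDEPENDENT PARTICLE MODEL"))
```

In a display-and-keyboard environment of the kind the text envisages, each step of this lookup would correspond to one screen of the constituent file.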
I think, as many others may do, [26] that in his World Encyclopedia, [27] H. G. Wells [28] proposed in effect an ideal of file organization for information retrieval. [29] Refer again to the prefatory statement made by him. The crucial point here is to select and collate carefully, and to present critically. So far this study has attempted to move toward his ideal. [30]
Take the phrase "World Encyclopedia." This somewhat tricky wording seems to invite some misunderstanding. Clearly, it means to put away miscellany and synthesize the essence only, rather than to bring everything together. In general, words freed from their proper contexts, whether literary or external or psychological, are mischievous and easily bring in misinterpretations. Incidentally, Wells himself experienced such a mischief done by a professional journalist. Hayakawa [24] says that: [31]
By saying "ignoring," however, he would not ignore the possibility of dispensing with part of the whole context. Given the environment, or given the wider context, part of the context is determinative in interpretation.
Bruza, P.D., Song, D., Wong, K.F. (2000). Aboutness from a Commonsense Perspective. Journal of the American Society for Information Science and Technology (JASIST), 51(12), 1090-1105.
Maron (1977) tackled aboutness by relating it to a probability of satisfaction. Three types of aboutness were characterized: S-about, O-about and R-about. S-about (i.e. subjective about) is a relationship between a document and the resulting inner experience of the user. O-about (i.e. objective about) is a relationship between a document and a set of index terms. More specifically, a document D is about a term set T if user X employs T to search for D. R-about purports to be a generalization of O-about to a specific user community (i.e., a class of users). Let I be an index term and D be a document; then the degree to which D is R-about I is the ratio between the number of users satisfied with D when using I and the number of users satisfied by D. Using this as a point of departure, Maron further constructs a probabilistic model of R-aboutness. The advantage of this is that it leads to an operational definition of aboutness which can then be tested experimentally. However, once the step has been made into the probabilistic framework, it becomes difficult to study properties of aboutness, e.g. how does R-about behave under conjunction? The underlying problem relates to the fact that probabilistic independence lacks properties with respect to conjunction and disjunction. In other words, one's hands are largely tied when trying to express qualitative properties of aboutness within a probabilistic setting. (For this reason Dubois et al. (1997) developed a qualitative framework for relevance using possibility theory.)
Maron, M.E. (1977). On Indexing, Retrieval and the Meaning of About. Journal of the American Society for Information Science, 28 (1): 38-43.
Dubois, D., Farinas del Cerro, L., Herzig, A., & Prade, H. (1997). Qualitative Relevance and Independence: A Roadmap. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 62-67, 1997.
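Maron's R-about ratio lends itself to an operational sketch. The satisfaction records below are invented for illustration (the tuples record which index term a user searched with and whether document D satisfied them); the function simply computes the ratio of users satisfied via the term to all users satisfied by D.

```python
# A sketch of Maron's R-about ratio for a single document D.
# Hypothetical records: (user, index term used, satisfied by D?).
records = [
    ("u1", "indexing", True),
    ("u2", "indexing", True),
    ("u3", "retrieval", True),
    ("u4", "indexing", False),
]

def r_about(records, term):
    """Users satisfied by D who searched via `term`, divided by
    all users satisfied by D."""
    satisfied = [r for r in records if r[2]]
    via_term = [r for r in satisfied if r[1] == term]
    return len(via_term) / len(satisfied) if satisfied else 0.0

print(round(r_about(records, "indexing"), 3))  # 2 of 3 satisfied users
```

This makes the operational appeal of the definition visible, and also its limitation: the ratio says nothing, by itself, about how R-aboutness for "indexing" and "retrieval" would combine under conjunction.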
David Bohm has frequently referred to meaning, particularly when talking about his recent experiments with dialogue groups in which "a free flow of meaning" is encouraged. This whole question of meaning, and what we mean by it, is clearly of importance and, in particular, the question "What do you mean by language?"
C.K. Ogden and I. A. Richards's classic The Meaning of Meaning8 provides a useful introduction to such questions. Following Ogden and Richards, the work of Ludwig Wittgenstein made a particularly significant contribution to the notion of meaning in linguistics.9 According to his dictum: Don't look for the meaning, look for the use. Essentially this can be interpreted as saying that meaning is a generalization that doesn't correspond to anything actually available in language behavior. What we actually rely upon are individual uses which are themselves interrelated according to a pattern of family resemblances. In this sense words could no more be said to "possess" an intrinsic meaning independent of their use than, in Bohr's view, an electron could be said to "possess" an intrinsic position or spin.
When Martin Gardner retired from writing his Mathematical Games column for Scientific American magazine, Hofstadter succeeded him with a column entitled Metamagical Themas (an anagram of "Mathematical Games").
Hofstadter invented the concept of Reviews of This Book, a book containing nothing but cross-referenced reviews of itself. He introduces the idea in Metamagical Themas:
Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law".
CiteSeer, in the past known as ResearchIndex, is a public specialty search engine and digital library that was created by researchers Dr. Steve Lawrence, Kurt Bollacker and Dr. Lee Giles while they were at the NEC Research Institute (now NEC Labs), Princeton, NJ, USA. CiteSeer crawls for and harvests academic scientific documents and uses autonomous citation indexing to permit querying by citation or by document. [...]
CiteSeer's goal is to improve the dissemination and access of academic and scientific literature. As a non-profit service that can be freely used by anyone, it is an example of the democratization of scientific knowledge and the Open Access movement that is revolutionizing academic and scientific publishing and scientific literature access. [...]
Google began as a research project in early 1996 by Larry Page and Sergey Brin, two Ph.D. students at Stanford who developed the hypothesis that a search engine based on analysis of the relationships between Web sites would produce better results than the basic techniques then in use. It was originally nicknamed BackRub because the system checked backlinks to estimate a site's importance. (A small search engine called RankDex was already exploring a similar strategy.)
Convinced that the pages with the most links to them from other highly relevant Web pages must be the most relevant ones, Page and Brin decided to test their thesis as part of their studies, and laid the foundation for their search engine. [...]
cf. Terry Winograd
The idea was worked out in more detailed form by Cerf's networking research group at Stanford in the 1973–74 period, resulting in the first TCP specification (Request for Comments 675). (The early networking work at Xerox PARC, which produced the PARC Universal Packet protocol suite, much of which was contemporaneous, was also a significant technical influence; people moved between the two.)
DARPA then contracted with BBN Technologies, Stanford University, and University College London to develop operational versions of the protocol on different hardware platforms. Four versions were developed: TCP v1, TCP v2, a split into TCP v3 and IP v3 in the spring of 1978, and then stability with TCP/IP v4, the standard protocol still in use on the Internet today.
In 1975, a two-network TCP/IP communications test was performed between Stanford and University College London (UCL). In November, 1977, a three-network TCP/IP test was conducted between the U.S., UK, and Norway. Between 1978 and 1983, several other TCP/IP prototypes were developed at multiple research centres. A full switchover to TCP/IP on the ARPANET took place January 1, 1983.
— "TCP/IP," Wikipedia.
The Advanced Research Projects Agency was renamed the Defense Advanced Research Projects Agency (DARPA) in 1972.
A fundamental pioneer in the call for a global network, J.C.R. Licklider, articulated the ideas in his January 1960 paper, Man-Computer Symbiosis.
- "A network of such [computers], connected to one another by wide-band communication lines" which provided "the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions. "—J.C.R. Licklider.
— "Three terminals and an ARPA," in: "Internet history," Wikipedia
The World Wide Web has evolved into a universe of information at our fingertips. But this was not an idea born with the Internet. This lecture recounts earlier attempts to disseminate information that influenced the Web, such as the French Encyclopédistes in the 18th century, H. G. Wells' World Brain in the 1930s, and Vannevar Bush's Memex in the 1940s.
— Editorial comment
There are quite a number of published histories of the Internet and the World Wide Web. Typically these histories portray the Internet as a revolutionary development of the late 20th century—perhaps with distant roots that date back to the early 1960s. In one sense this is an apt characterization. The Internet is absolutely a creation of the computer age.
But we should not think of the Internet just as a revolutionary development. It is also an evolutionary development in the dissemination of information. In that sense the Internet is simply the latest chapter of a history that can be traced back to the Library of Alexandria or the printing press of William Caxton.
In this lecture I will not be going back to the Library of Alexandria or the printing press of William Caxton. Instead I will focus on the contributions of three individuals who envisioned something very like the Internet and the World Wide Web, long before the Internet became a technical possibility.
These three individuals each set an agenda. They put forward a vision of what the dissemination of information might become, when the world had developed the technology and was willing to pay for it. Since the World Wide Web became established in 1991, thousands of inventors and entrepreneurs have changed the way in which many of us conduct our daily lives. Today, most of the colonists of the Web are unaware of their debt to the past. I think Sir Isaac Newton put it best: “If [they] have seen further, it is by standing on the shoulders of giants.” This lecture is about three of those giants: H.G. Wells, Vannevar Bush, and J.C.R. Licklider.
— Introduction
Around 1937, Wells perceived that the world was drifting into war. He believed this was because of the sheer ignorance of ordinary people, that allowed them to be duped into voting for fascist governments. He believed that the World Brain could be a force in conquering this ignorance and he set about trying to raise the half-a-million pounds a year that he estimated would be needed to run the project. He lectured and wrote articles which were later published as a book called the World Brain (1938). He made an American lecture tour, hoping it would raise interest in his grand project. One lecture, in New York, was broadcast and relayed across the nation. He dined with President Roosevelt, and if Wells raised the issue of the World Brain with him — which seems more than likely — it did not have the effect of loosening American purse-strings. Sadly, Wells never succeeded in establishing his program before World War II broke out, and then of course such a cultural project would have been unthinkable in the exigencies of war.
— H. G. Wells and the World Brain
The rapid growth of the Internet in the 1990s was primarily due to the World Wide Web. The Web Browser made using the Internet easy for ordinary people, and also worth doing and worth investing in. The World Wide Web was invented by Sir Tim Berners-Lee working in the CERN European particle physics laboratory in Geneva, in 1991. As Berners-Lee put it himself, the World Wide Web was “the marriage of hypertext and the Internet.” The ideas were in the air. He just put the pieces together. And in so doing, he set in train a chain of events that have changed the world.
— Conclusion
The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.
— The first web page, CERN 1991 [24]
The idea of hypertext is neither a great invention nor even an invention. To say it is great is to say information is great or computing is great. Neither is great. But its use may be either great or evil or better than nothing. It's up to you: consider pornographic uses, for example. Put otherwise, an atom is barely great, while an atomic bomb is surely great.
Speech is strictly linear, and text less so, but still largely linear. Yet you need not read linearly from opening to ending. Reading is nonlinear from time to time, if not in general. This is so for reference works in particular. It is simply a programmer's job to help readers read nonlinearly as usually needed. But a programmer may be lucky enough to be the first who gets the job done. All congratulations on her chance of being the first programmer, rather than the inventor, surely in case she is not worthy of being called the inventor.
Relevance is the basic underlying notion of IR by choice. [...] IR, as formulated, is based on [...] relevance. Whose relevance? Users! In reality, wanting or not, users and IR are not 'separable.' This brings us to the necessity for a broader context for IR provided by information science. Computer science provides the infrastructure. Information science the context. [emphasis original]
— Saracevic (1997). "Users lost: Reflections on the past, future, and limits of information science." SIGIR Forum, 31 (2) 16-27. [1]
Some thinkers attack the problem of free will by distinguishing different notions of freedom or meanings of the word 'free'. In one sense we are free -- free enough for concepts of morality and responsibility to come into play. In another sense we are not free, and all that happens now is determined by what has happened earlier. According to this 'soft determinism', as William James called it, determinism is supposed to express a true doctrine in one sense of the words, and a false doctrine in another. Plenty of philosophers have argued that the problem about free will arises from what Hobbes called the 'inconstancy' of language. The same word, they say, is inconstant -- it can have several meanings. Even philosophers who argue for a simple determinism have to show that in their arguments the word 'free' is used with a constant sense, leading up to the conclusion that we are not free.
— Ian Hacking (1975) Why Does Language Matter to Philosophy? (p. 4-5)
Berlin did not assert that determinism was untrue, but rather that to accept it required a radical transformation of the language and concepts we use to think about human life -- especially a rejection of the idea of individual moral responsibility. To praise or blame individuals, to hold them responsible, is to assume that they have some control over their actions, and could have chosen differently. If individuals are wholly determined by unalterable forces, it makes no more sense to praise or blame them for their actions than it would to blame someone for being ill, or praise someone for obeying the laws of gravity. Indeed, Berlin suggested that acceptance of determinism -- that is, the complete abandonment of the concept of human free will -- would lead to the collapse of all meaningful rational activity as we know it.
— "Isaiah Berlin," on: "Free Will and Determinism," in: Stanford Encyclopedia of Philosophy
Local realism is the combination of the principle of locality with the "realistic" assumption that all objects must objectively have pre-existing values for any possible measurement before these measurements are made. Einstein liked to say that the moon is "out there" even when no one is observing it.
McGinn's aim is two-fold: to undermine both descriptive and causal theories of reference, and to argue for his preferred, 'contextual' theory of reference. McGinn is moved to this position by emphasizing indexicals—which he takes to be the primary referential devices—rather than proper names. Linguistic reference, for McGinn, is a conventional activity governed by rules that prescribe the spatio-temporal conditions of correct use; the semantic referent of a speaker's term is given by combining its linguistic meaning with the spatio-temporal context in which the speaker is located. McGinn concludes his defence of this theory by demonstrating the plausibility of its implications for such topics as abstract objects, self-reference, attribution, the language of thought hypothesis, truth, and the reducibility of reference.
— (Abstract) Colin McGinn (2002) "The Mechanism of Reference" (in) Knowledge and Reality, pp. 197-223.
Context is a term that has come into more and more frequent use in the last thirty or forty years in a number of disciplines--among them, anthropology, archaeology, art history, geography, intellectual history, law, linguistics, literary criticism, philosophy, politics, psychology, sociology, and theology. A trawl through the on-line catalogue of the Cambridge University Library in 1999 produced references to 1,453 books published since 1978 with the word context in the title (and 377 more with contexts in the plural). There have been good reasons for this development. The attempt to place ideas, utterances, texts, and other artifacts "in context" has led to many insights.
— Peter Burke (2002) "Context in Context." Common Knowledge, 8(1): 152-177.
Note: He was a visiting professor at UCL.

Today the digital library community spends some effort on scanning, compression, and OCR; tomorrow it will have to focus almost exclusively on selection, searching, and quality assessment. Input will not matter as much as relevant choice. Missing information won't be on the tip of your tongue; it will be somewhere in your files. Or, perhaps, it will be in somebody else's files. With all of everyone's work online, we will have the opportunity first glimpsed by H. G. Wells (and a bit later and more concretely by Vannevar Bush) to let everyone use everyone else's intellectual effort. We could build a real 'World Encyclopedia' with a true 'planetary memory for all mankind' as Wells wrote in 1938 [Wells 1938]. He talked of "knitting all the intellectual workers of the world through a common interest"; we could do it. The challenge for librarians and computer scientists is to let us find the information we want in other people's work; and the challenge for the lawyers and economists is to arrange the payment structures so that we are encouraged to use the work of others rather than re-create it.
— Michael Lesk (1997) How Much Information Is There In the World? (Excerpt from Conclusion)
Note: He was also a visiting professor at UCL.

World Brain or Global Brain proponents tend to extrapolate quite extravagantly the capabilities and implications of emerging technology. For Wells it was microfilm. Today it is the infinitely more sophisticated Internet and World Wide Web which have enmeshed our globe in a fantastically intricate and diffused communications infrastructure. By means of this technology as World or Global Brain proponents imagine it taking shape, the effective deployment of the entire universe of knowledge will become possible. But this begs unresolved questions about the relative value of the individual and the state, about the nature of individual and social benefits and how they are best to be allocated, about what constitutes freedom and how it might be appropriately constrained. It flies in the face of the intransigent reality that what constitutes the ever-expanding store of human knowledge is almost incalculably massive in scale, is largely viewpoint-dependent, is fragmented, complex, ceaselessly in dispute and always under revision.
— W. Boyd Rayward (1999) "H. G. Wells's Idea of a World Brain: A Critical Re-Assessment." JASIS, 50: 557-579.
As to scientific information* in the wide sense, the following fundamental questions may be raised:
Obviously, information retrieval is most closely related to the last question. But I feel that the other two questions should also be taken into consideration if we intend to discuss information retrieval carefully. I selected the prefatory statements by Popper [4], by Bernal [5], and by Wells [6] as the most thought-provoking with respect to these three fundamental questions. These statements also represent the standpoint I have taken in approaching the problems of information retrieval.
In the following chapters, I first discuss some fundamental considerations for information retrieval. I shall delimit the retrieval problems mainly by following Fairthorne's insightful contention [7]. Further, I shall attempt to understand the problems in the light of communication and information, notions which remain almost undefined. For this purpose I attend to Cherry's critical view of human communication [8] and to Ogden and Richards' classic theory of interpretation [9]. In short, I seek a solution to the problems of information retrieval by questioning what influences those who communicate and obtain information.
Eventually, I propose a way of file organization as most essential for information retrieval. The proposal is still crude at this stage. In fact, the discussion of fundamental considerations is intended to clarify and, to some extent, justify an idea which will require further elaboration and application. The main feature of the proposal is to use in retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents. Such extracts seem to provide concise but significant clues for discriminating the cited documents. Even the most concise clues should be regarded as significant when they are coherent in their proper environments or contexts.
The overall view of main retrieval events may be represented schematically as shown in Figure 1. It may be said here that:
Figure 1. Schematic View of Information Retrieval Events.
Information retrieval is a complex type of communication between the system and the user. The schematic diagram in Figure 1 roughly shows the situation. Admittedly, the diagram is too simple and crude for explaining information retrieval meaningfully. It will be expanded in Chapter 5. Meanwhile, it may suffice to show how to approach retrieval problems.
What we want to know ultimately is the relationship between the system and the user, which is represented in Figure 1 by the solid arrows and characterized by prediction and discrimination of documents. Also, we can consider many other relationships in the diagram; for example, those represented by the dotted arrows and the broken arrows. Here we can reasonably assert that all knowledge of these relationships should concentrate on explicating the relationship of utmost importance between the system and the user.
On the other hand, information retrieval may be possible with little or no attention to knowledge of the relationship between the system and the user. That is to say, we can contain the system and the user in a black box*, perform information retrieval, and improve the performance successively by feedback control. Combination of the solid arrows and the dotted arrows makes a closed cycle for feedback control. The black box has two input terminals, Ein and Din, which are input to the system and the user, respectively. It also has two output terminals: one for the user to give Eout in search of, and then in response to, Din, and the other for the system to retrieve Dout in response to Ein. This principle is illustrated in Figure 2, where Po represents the given initial condition or a set of performance factors of the black box.
Figure 2. Feedback Control of Information Retrieval.
Whether or not it is possible and practicable, this principle would almost certainly tell little of substance about the relationship between the system and the user. In other words, it may not necessarily be suitable for explicating the relationship. Even if suitable, it can explain the relationship only indirectly, i.e., through inferences from a great deal of valid and consistent evidence.
The approach that has been used overwhelmingly in the field of information retrieval is very similar to this principle. The main difference is to vary the initial condition Po in many ways in order to discover which initial condition gives the optimum performance of the system. This approach is not really intended to explicate the relationship between the system and the user. The other, direct, approach will be attempted in this study.
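The feedback principle of Figure 2 can be made concrete with a small simulation. The following Python sketch is my own illustration, not part of the thesis: the scoring rule and document set are hypothetical, and the performance condition Po is reduced to a single term-overlap threshold.

```python
# A toy black box in the spirit of Figure 2 (my own sketch; the names
# Ein, Din, Eout, Dout and Po follow the text, the scoring rule is
# hypothetical). The system retrieves documents whose term overlap with
# the query reaches the threshold Po; user judgments feed back to tune Po.

def retrieve(ein, documents, po):
    """System side: Dout = documents sharing at least Po terms with Ein."""
    q = set(ein.split())
    return [i for i, d in enumerate(documents) if len(q & set(d.split())) >= po]

def judge(dout, relevant):
    """User side: crude satisfaction = hits minus false drops."""
    return len(set(dout) & relevant) - len(set(dout) - relevant)

def feedback_search(ein, documents, relevant, max_po=3):
    """Closed feedback cycle: try settings Po = 1..max_po, keep the best."""
    best_po, best_score = 1, None
    for po in range(1, max_po + 1):
        score = judge(retrieve(ein, documents, po), relevant)
        if best_score is None or score > best_score:
            best_po, best_score = po, score
    return best_po

docs = ["information retrieval system design",
        "garden trees in autumn",
        "retrieval of scientific information"]
best = feedback_search("scientific information retrieval", docs, relevant={2})
```

Note that the loop improves performance without ever explicating why a document satisfies the user; it only observes that it does, which is precisely the limitation discussed above.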
The user needs information. Even when he is seeking it in a document, he may be little conscious of the physical form of the document. For he may understand the notion of information in much the same way as a housewife in the marketplace does.
The user can easily and quite properly speak of pertinent, relevant, useful, or valuable information, expressing somewhat different shades of meaning. All these similar qualifiers, however, seem to be more or less redundant, for the user would appreciate information as such only when it is relevant to his specific purpose. Furthermore, it seems certain that these qualifiers are used only loosely and ambiguously.
Also, it is hard to say that the user, having found extremely valuable information in the library, must have in mind any economic sense at all. Of course, anyone would be perfectly right to understand information essentially as something "bought, sold, stored, treated, exchanged and consumed in economic terms," if there exists that sort of information. Another way of understanding [8] is such that:
Obviously, this quotation represents our common understanding. However, we know that, by understanding or defining something in a particular way, one specifies in effect one's readiness or intention to communicate with other people about the thing defined. And those people are expected or invited to share the same understanding. Unfortunately, some people would not or could not agree with the suggested definition, however authoritative, because it is absurd, unnecessary, outside their concern, or for some other reason. In this case, communication is likely to degenerate into a conflicting argument apart from what is to be communicated; in effect, a breakdown of communication.
Communication between the retrieval system and the user as illustrated in Figure 1, is of secondary importance to the user. It is only intermediary or necessary for another communication of primary importance, that is, communication of information with the author of a document. Therefore misconception of information may damage both communications.
From the psychological point of view, Stevens [10] attempted to generalize communication by defining it as "the discriminatory response of an organism to a stimulus." This definition was criticized by Cherry [8] on the grounds that communication is essentially the relationship established between stimuli and responses. It does not follow, however, that the definition is wrong. It simply focuses on the communication event at the receiving end; for example, the discriminatory response of the user to a given document.
Information seeking presupposes satisfaction of information needs. The user will be really satisfied only when he finds information relevant to his need. Naturally, he discriminates what he has received in the light of his need and criteria: namely, relevance judgment. Nothing can stop him from being subjective and exacting in the judgment. He may even totalize his relevance criteria. It is now well known that relevance judgments are complicated, depending not only on subject matter but also on many other things [11]. It is noteworthy that the real judge is the user. If a panel of judges were to take his place, it might serve some practical purposes, but only as a makeshift. We shall reserve the pure notion of relevance for those systems that aim to provide relevant information. [8]
Many studies have shown that informal communication is very popular among scientists, especially among those who are eminent. This phenomenon is quite understandable. However, it is a mistake to conclude that informal communication is superior to formal communication. At the least, informal communication pays little attention to the "social responsibility", or morality if you like, which was emphasized by Bernal [5]. It should rather be said that each represents a different machinery of communication, neither degrading the other. Presumably, all the past experience of an eminent scientist in formal communication (e.g., through books, journals, lectures, libraries, and so on) must have enabled him to short-circuit to the essence, say, a few words of suggestion. This short-circuiting may also be applied to formal communication.
Among many others, Jahoda, et al [12] observed that 66% of faculty members interviewed in one university maintained personal indexes, that 42% of them regarded preparation of indexes as too time-consuming, and that 32% complained of inconsistency in indexing. No doubt, a large portion of scientists spend much time in preparing their own systems, i.e., personal indexes or the like, and they are themselves suffering from so-called retrieval problems. Therefore, anything that can help them solve the problems and improve their own systems would be duly appreciated.
On the other hand, naive and primitive as they may be, personal indexes must be worth careful study in order to learn how scientists extend their retrieval facilities toward systems. An important suggestion to all kinds of retrieval systems may be found there.
From the user's point of view, any retrieval system may be regarded as an extension of his information-seeking facilities. The user can satisfy his information need to some extent for himself, instead of delegating retrieval to what may be called outside systems as opposed to personal means. In this respect, it is questionable whether outside systems, however elaborate, can promise better satisfaction than personal means which are familiar to the user.
When the user delegates retrieval to the system, there must be some agreement, although tacit, between the two parties about the way in which the system is designed to act on behalf of the user. A set of constraints is characteristic of any system whatsoever. Beyond these constraints the retrieval system is not expected to answer, and the user is not normally allowed to ask. However, these obvious constraints seem often to be disguised and overlooked. (Note that at the moment we are talking about current systems regardless of their future developments.) This aspect has been convincingly discussed by Fairthorne in many of his writings.
Then, what is the agreement? The straightforward answer is that the user should agree that the system works on the basis of subject similarity rather than relevance. The distinction between these two apparently similar notions should be made clear.
If any two readers are compared as to what each of them recognizes from the same document, the two will in general differ from each other. For everyone tends to interpret subjectively what is written or said. These individual interpretations may be superimposed for many readers, in order to separate the densely overlapping, and thus explicit, meaning from the relatively subjective and more or less implicit meanings.
We can fairly reasonably say that in interpreting a document, the indexer tends to behave in a unanimous way; in principle he can discern from a document most of the explicit meaning. The implicit meaning that he may discern in addition would be negligible considering the large number of potential users. As opposed to the indexer, the user tends to interpret in a subjective way; he may find no information in the explicit meaning but find it elsewhere. Various users may differ greatly from each other in finding information, according to their past experiences and present states of mind.
Not attempting to be precise, let us associate the explicit, unanimous interpretation with subject similarity. And let us associate the subjective, individual interpretation with relevance. Perhaps we cannot discuss similarity without unanimity or commonality of interpretation; nor relevance without subjectivity or individuality of interpretation. Fairthorne [7] distinguishes them as extensional aboutness and intensional aboutness. We shall return to this distinction later.
It may be said that subject similarity is a necessary condition for relevance, and equally that relevance is a sufficient condition for subject similarity. In this respect, the system which operates, even ideally, on the basis of subject similarity, or in a unanimous way, is liable to two types of error: to miss relevant documents and to retrieve non-relevant documents. It cannot set relevance aside by any means. Thus it can only predict relevance, however ideal it may be in recognizing subject similarity.
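The two types of error can be stated in the now-standard measures of recall and precision. The sketch below is my own illustration (the thesis does not give these formulas here): missing relevant documents lowers recall, while retrieving non-relevant ones lowers precision.

```python
def recall_precision(retrieved, relevant):
    """Recall: fraction of the relevant documents that were retrieved.
    Precision: fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 1.0
    precision = hits / len(retrieved) if retrieved else 1.0
    return recall, precision

# A system judging subject similarity alone retrieves documents 1-3,
# while the user judges 2-4 relevant: document 1 is a false drop
# (hurting precision) and document 4 is a miss (hurting recall).
r, p = recall_precision({1, 2, 3}, {2, 3, 4})
```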
Strictly speaking, relevance is a priori from the system's point of view. It is true in the sense that relevance criteria, however precisely stated by the user, cannot wholly be accepted due to the system constraints, so the accepted part may not be sufficient in the end. It is all the more so because relevance criteria, however readily accepted by the system, cannot affect indexing retrospectively. To borrow Fairthorne's contention [7]:
We have still another reason to believe in the a priori characteristic of relevance. A great deal of experimental as well as operational experience in retrieval has been accumulated at least over the past twenty years. Retrieval languages and devices must have been greatly improved. Nevertheless, how much has been learned about how the user judges a relevant document* as such? Obviously not much.
If so understood, relevance must have been overemphasized in evaluating retrieval systems. In particular, comparison of different systems might have been unfair to some; for it is not certain whether or not subject similarity goes in parallel with relevance. How retrieval systems are to be evaluated and compared should first be made clear: in terms of subject similarity, of relevance, or of both. Evaluation based solely on subject similarity would not tell much about how to give more satisfaction to the user; evaluation based solely on relevance would not tell much about how to keep the agreement with the user better. Collation of both evaluations will be necessary to learn about the relationship between the system and the user.
A group of documents can be said to be similar to each other when they have in common a set of identical properties A; they are similar with respect to the shared properties A. In general, each document in a similarity group has some other (different) properties B in addition to A. Therefore, the content C of a document may be represented:

C = A * B
This equation may apply somewhat analogously to the document surrogate, too.
Because of the repetitive nature of the shared properties A, a group of similar documents are characterized by semantic redundancy, even if not by textual redundancy. This characteristic will be transferred somewhat analogously to the corresponding document surrogates. That is to say, the identical properties A are repeated not only in similar documents but also in their surrogates. This repetition or redundancy in a group of similar surrogates appears to be inevitable, because there would be no grouping of similar documents or surrogates without that. But it is not quite so from the point of view of file organization. For one thing, the idea of inverted files may be worth remembering in this connection; however, this idea is likely to raise another kind of redundancy, that is, repetition of the name of the surrogate which belongs to many similarities, e.g., index terms.
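The two redundancies contrasted here can be illustrated with a toy file (my own sketch, with invented documents and properties): a direct file repeats the shared properties A in every similar surrogate, while an inverted file removes that repetition at the cost of repeating each surrogate's name under every property it possesses.

```python
# Hypothetical illustration of the two redundancies: a direct file
# repeats the shared properties A in every similar surrogate, while an
# inverted file repeats each surrogate's name under every property.

direct_file = {                      # surrogate -> its properties
    "doc1": {"retrieval", "files"},  # A = {"retrieval"}, shared by all
    "doc2": {"retrieval", "abstracts"},
    "doc3": {"retrieval", "citations"},
}

def invert(direct):
    """Build the inverted file: property -> names of surrogates having it."""
    inverted = {}
    for doc, props in direct.items():
        for p in props:
            inverted.setdefault(p, set()).add(doc)
    return inverted

inverted_file = invert(direct_file)
# The shared property "retrieval" now appears once as an entry, but the
# document names are repeated under every property they possess.
```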
An abstract file as a retrieval tool is no exception to such redundancy. The comparative efficiency of abstracts in retrieval is still controversial. The low efficiency of abstracts, if true, may stem from difficulties in formalization and in machine processing. However, formalization does not really matter so much in human processing. And we can reasonably assert that abstracts contain much greater "semantic information" than other kinds of surrogates such as titles, sets of index terms, and classification codes. Therefore, without considering the time consumed, the human searching of abstracts should perform better than that of other surrogates in judging similarity, at least in principle.
Suppose that abstracting processes are formalized to such an extent that the above equation holds well. Then it will be possible to exclude the identical part A from all but one of the similar abstracts, allowing them to refer to A in the retained abstract. Alternatively, we can list all the similar abstracts in one of them. By doing so, we need not search for them one by one but by groups, whenever the search requests fall upon A.
Once the existence of A is accepted, a model abstract of the identical part A may be desired for all the documents that have A in common. A collection of such models will look like a classification scheme. This can be applied to individual abstracts. Then each abstract may consist of the prescriptive code for A and the descriptive text for B, the different part. (This procedure may be adapted, in parallel, to combining a hierarchical classification system with a descriptive indexing system.) In practice, the prescriptive code may or may not be substituted for the text corresponding to A in an abstract. What is implied in this idea is not merely to reduce the textual or semantic redundancy involved in a group of similar abstracts. [4]
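The factoring of each abstract into a prescriptive code for A and a descriptive text for B might be sketched as follows. This is my own hypothetical illustration of the idea; the codes and texts are invented.

```python
# Hypothetical sketch: store the shared part A once as a coded "model
# abstract", and keep per document only the code plus the differing text B.

model_abstracts = {  # prescriptive code -> shared text A
    "IR.1": "Surveys file organization for document retrieval.",
}

abstracts = [  # each abstract = (code for A, descriptive text for B)
    ("IR.1", "Emphasizes inverted files."),
    ("IR.1", "Emphasizes citation indexes."),
]

def expand(entry, models):
    """Reconstitute the full abstract: A (looked up by code) plus B."""
    code, b_text = entry
    return models[code] + " " + b_text

full = [expand(e, model_abstracts) for e in abstracts]

# A request falling upon A (code "IR.1") now finds the whole similarity
# group at once, instead of searching the abstracts one by one.
group = [i for i, (code, _) in enumerate(abstracts) if code == "IR.1"]
```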
In general, document surrogates include errors of various kinds. Let us take for example just one kind of error: inconsistency in surrogation. Many inconsistencies can hardly be said to be errors in the strict sense, for the surrogates are fairly correct individually. These inconsistencies may be attributed to difficulty in, or lack of, formalization.
In this respect, abstracting systems, particularly those based on author abstracts, seem hopeless to control. However, this is not the whole point. The default is to leave the failure caused by inconsistency to be repeated each time the abstracts are searched. Certainly, this failure can be prevented or reduced by careful examination and grouping of similar abstracts prior to a series of searches.
This prior grouping process implies retrieval which ensures high recall even at the cost of low precision. One thing that matters here is the manageable number of abstracts to be examined as to their similarity. The greater the number, the more preventive work there is to be done. What makes matters worse is the possible multiplicity of similarity groups which an abstract belongs to at the same time. We may not even make certain which groups will be more significant or more likely to be requested by the user. This situation will eventually demand enormous efforts. Our ideal to rule out inconsistencies may require prohibitive efforts.
Without being pretentious, we all know something about abstracts and extracts. However, this general kind of knowledge may not suffice for critical discussion of their characteristics, merits, snags, and so on. An abstract has been defined as an abbreviated, accurate representation of a document; and an extract as consisting of one or more portions of a document selected to represent the whole. Were they defined with accuracy? Were the definitions intended to make clearer how to prepare abstracts and extracts? Are there any really working standards for making them?
Any document surrogate, of however small and biased a content, may be justified, because it is not the document itself but a representation, description or prescription. Sometimes it is mistakenly supposed that the content of a surrogate is the same as the content of the corresponding document; or that the equation C = A * B holds equally in both cases. Distinguishing between intensional aboutness and extensional aboutness, Fairthorne [7] says that:
Fairthorne's intensional aboutness; the author's subjective/implicit meaning.
Dumais's latent semantic indexing. [5] [6]
Stark refers to extensional aboutness. [7]
Cheti refers to intensional aboutness, and to the above quote. [8]
Hawthorne's intensional aboutness; Hawthorne's fluidity of meaning. [9]
The author's flexibility and elasticity, just below.
Even with the great flexibility and elasticity of language, it seems almost impossible to make an abstract of about two hundred words exactly analogous to the content C of the corresponding document. In other words, selection and bias are more or less unavoidable in abstracting. If the paraphrasing of a selection is considered to be semantically superficial, then the difference between an abstract and an extract will be somewhat marginal. Both are biased selections or parts of the content C.
Roughly speaking, an abstract is more intended to balance selection uniformly over C, aiming at inductive information effects. By contrast, an extract is more intended to spot selection (perhaps the conclusive part) eccentrically within C, aiming at immediate rather than inductive information effects. Yet no formal procedures, beyond conventions of a vague nature, are available as to what to select.
Considering the power of meta-language and its use in retrieval, Goffman, et al [13] note that an abstract is given in meta-language whereas an extract is given in object-language. They further note that many abstracts, being written in "trivial" meta-language, should more accurately be called extracts.
A selection or part of a document, whether balancing or spotting, assumes that it can do without the rest, that is, its context. In other words, it should be an independent unit of discourse. Truly, abstracts, extracts, titles, even index terms, all tell us something on their own account. Fairthorne [7] paraphrases Bohnert's notion of data as:
This phrase seems to be worth careful scrutiny. Perhaps, we can raise several questions such as:
We shall discuss these and other questions in the next chapter. Meanwhile, Belzer [14] calculates "the entropies of the various surrogates of error-free information," by assigning one bit of information to a full document. For five different types of surrogates - citation, abstract, first paragraph, last paragraph, first and last paragraph - he observes the 2 x 2 contingency of:
By showing the calculated result in Table 1, and by calling attention to the fact that, of these surrogates, only the production of abstracts requires extensive professional effort, he in effect revives the superiority of extracts over abstracts. Comparison of a document with its surrogates is also interesting.
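Belzer's style of calculation can be outlined as follows. The contingency figures below are hypothetical, not Belzer's own: taking the full document as carrying exactly one bit, the uncertainty left after consulting a surrogate is the conditional entropy of the document judgment given the surrogate judgment.

```python
from math import log2

def entropy(ps):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

def conditional_entropy(joint):
    """H(document | surrogate) from a 2 x 2 joint table where
    joint[s][d] = P(surrogate judgment s, document judgment d)."""
    h = 0.0
    for row in joint:
        ps = sum(row)  # P(surrogate judgment = s)
        if ps > 0:
            h += ps * entropy([p / ps for p in row])
    return h

# Hypothetical contingency: the surrogate agrees with the full document
# 90% of the time; the full document alone carries exactly one bit.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
doc_bits = entropy([0.5, 0.5])          # 1.0 bit, as Belzer assigns
left_over = conditional_entropy(joint)  # uncertainty remaining after the surrogate
```

An error-free surrogate would leave no uncertainty at all; a useless one would leave the full bit.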
We are free to think of anything. [9] [10] At one moment we think of the trees in the garden; at another, of tomorrow's weather; at still another, of past experience. These things that we think of, whether existent or non-existent, substantial or imaginary, true or false, may be said to be fairly stable in contrast to our free thoughts. The trees in the garden must be there dropping leaves, even while we stop thinking of them. [11] Thus we can freely organize or map our thoughts upon this stable background. [12]
We invent symbols* with which to express or map our thoughts on things. In isolation we are free to invent and use any symbols to express our thoughts. In a society where we must associate, cooperate, share, or communicate with other people [8], we have two options: either to dictate our own invention or to obey the social rule. Small babies begin with dictation, e.g., crying out instinctively to express their desires. However, they cannot merely dictate; that would admittedly be too limiting. Far more is required of them, and far more convenient in most cases is to conform to the social rule. They learn to practice it gradually but never completely. Even when they are grown up, dictation remains necessary at times. As a matter of fact, the two options are normally intermingled, but eventually separable.
The social rule on how to use symbols has evolved among people, somewhat loosely and still changing. Truly, people have been inventing, elaborating, and using more and more convenient symbols for more efficient communication. The speaker expects the listener to receive the symbol that is substantially imparted or communicated. However, the speaker's real expectation is that the listener will share the same thought and direct it to the same thing that the speaker has in mind. Largely, this expectation is met when the speaker conforms to the well-established social rule, e.g., grammar and dictionary. Then the majority understand him unanimously; communication is clear-cut. In this case the meaning of such a symbol may be said to be denotative, explicit, or extensional. Refer to a dictionary for these words (Table 2).
On the other hand, communication does not always appear as simple as that. Indeed, communication as a whole is a loose affair: ambiguous and misleading expressions, misunderstandings, different interpretations, and so on. Suppose that someone intends to convey a definite thought or story with the following word string [8]:
which looks almost nonsensical as a whole. Then, what will happen to us listeners? We have a dictionary, but we cannot simply sum up the meanings of the individual words. That "a whole is more than the sum of its parts" is too plain a saying. There seems to be no grammar to which the speaker might have conformed. He merely suggests rather than tells the story, which in other words is implied or implicit in the word string, i.e., the symbol. From this awkward symbol we can guess the story with varying accuracy, if we are ready to take risks. In this case, the meaning of such a symbol may be said to be connotative, implicit, or intensional. [13] [14] [15] [16] [17] [18] Again, refer to a dictionary for these words (Table 2).
When we communicate our thoughts about things, we use signs. That is to say, three factors - a thought, a thing, and a sign - are essentially involved in any communication event, whether speaking or listening. [19] [20] [21] Ogden and Richards [9] place the three factors at the corners of a triangle, where the relations between these factors are represented by the sides, as shown in Figure 3. They recognize that there are causal, direct relations between a sign and a thought, and between a thought and a thing. But, they insist, the relation between a sign and a thing is merely "imputed" as opposed to the causal relations; it holds only indirectly, round the two sides of the triangle. They further insist that it is because of this imputed relation that most of the problems of language arise. Signs are instruments subjected to thinking or interpretation; they can be related to things only through thinking, or more specifically through interpretation.
Things and experiences are also interpreted; they are treated as signs. Thus, through all our life, we interpret signs in the widest sense, with few exceptions. Then, what happens when we interpret signs? Ogden and Richards [9] generalize the process of sign interpretations as follows:
For example, a dog, on hearing the dinner bell, interprets the bell sounds as a sign and runs into the dining room. He can do so owing to past experience in which clumps of events - bells, savours, longings for food, etc. - have recurred "nearly uniformly." Such a clump of events may be called an external context. And the mental events occurring in the dog, which link the present bell sound with the past experience of bells-savours-longings, may be called a psychological context. To define more precisely:
Contexts recur more or less uniformly; that is to say, the constitutive characters of a context recur with uncertainty, or with a probability. It follows that the context is said to be determinative with respect to one character when the characters are closely related. By taking very general constitutive characters and uniting relations, we obtain contexts of high probability; we can increase the probability of a context by adding suitable members. Thus we react to the recurring part of a context in the same way as we did to the whole context. Experience recurs in contexts which recur more or less uniformly, and interpretation is possible only in these recurring contexts.
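The probabilistic flavour of this account lends itself to a toy illustration (mine, not Ogden and Richards'): if past experience is recorded as "clumps" of constitutive characters, the probability that a further character accompanies an observed partial context can be estimated from co-occurrence, and adding suitable members to the observed part raises it.

```python
# Toy model of recurring contexts: each past experience is a set of
# constitutive characters (events). The probability that character c
# accompanies an observed partial context is estimated by co-occurrence.

past_contexts = [
    {"bell", "savour", "longing"},
    {"bell", "savour", "longing"},
    {"bell", "postman"},
    {"savour", "kitchen"},
]

def recurrence_probability(observed, character, history):
    """Estimate P(character | observed part of the context) from history."""
    matching = [c for c in history if observed <= c]
    if not matching:
        return 0.0
    return sum(character in c for c in matching) / len(matching)

p1 = recurrence_probability({"bell"}, "savour", past_contexts)             # 2/3
p2 = recurrence_probability({"bell", "longing"}, "savour", past_contexts)  # 1.0
# Adding the member "longing" to the observed part raises the probability:
# the context has become more determinative with respect to "savour".
```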
Finally, Ogden and Richards [9] attempt to narrow down their implications by applying the context theory of interpretation [23] [24] [25] to the use of words at different levels: from simple recognition of sounds as words to critical interpretation of words.
Ogden and Richards [9] do not specify contexts in the triangle in Figure 3. Cherry [8] modifies the diagram as shown in Figure 4a. We shall further modify it as shown in Figure 4b, and say that the triangle is surrounded by the external context and contains the psychological context inside. Still, the diagram only represents either speaking or listening. Thus we shall develop the diagram further in the following.
A unit communication, including both speaking and listening, may be represented by the diagram shown in Figure 5. The arrows and the corresponding words conveniently represent the functional flow in a unit communication. Thus we shall say that:
If no physical distortion arises between the two signs, then the sign in speaking and the sign in listening will be the same, or will come together. If the listener's thought is directed to the same thing that initiated the speaker's thought, then we have an ideal unit communication, as shown in Figure 6.
We may now develop the diagram further in order to represent communication situations more complex than a unit communication. We shall normally approximate individual units of communication to ideal units as shown in Figure 6.
Let us take for example the password game. The questioner, thinking of WATCHWORD, gives a symbol 'watchword' to the intermediary, who in turn gives another symbol 'password' to the answerer. Before and after the translation from 'watchword' into 'password,' the intermediary's thoughts I and I' must differ, such that I corresponds to WATCHWORD and I' to PASSWORD. Therefore, the answerer's thought should be directed first to PASSWORD, and then to WATCHWORD, which is the correct answer. The answerer must make a guess that reverses the translation. This password game is illustrated in Figure 7. Communication between the questioner and the intermediary makes one ideal unit, and that between the intermediary and the answerer makes another. These two ideal units are separated by a communication gap which must be bridged by the answerer's guesswork. As a corollary, the complexity of the communications involved in information retrieval may be shown as in the diagram in Figure 8.
The idea proposed in this chapter is to use in information retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents (see Figure 9). It is only exploratory within the scope of this study. It can be justified on the grounds that the citing and the cited documents are coherent with each other, that extracts provide concise clues for discriminating these documents, and that even concise clues are interpreted meaningfully in the given contexts. Although widely practiced among information users, the idea has, as far as I know, not yet been formally studied in view of efficient file organization. The implications of the idea might therefore reach farther in the future than can be expected now, and will require more exploration. In this respect, what is immediately required is some rationale behind the idea. While all the preceding discussions are relevant to this rationale, the following are intended to support the idea focally.
Now it is almost certain that subject coverage or specialization can hardly be defined consistently and objectively. At best we can say that two documents are similar with respect to something, based on the evidence that we recognize from the documents. Still, the totality of such evidence would not ensure similarity; it gives us no more than a degree of belief.
In most cases, two documents similar with respect to something are indexed or abstracted individually. In this sense they are related to each other only indirectly, or with some uncertainty. Indexing inconsistency, caused mainly by individual variation even when index terms are assigned fairly adequately, is now well known. It significantly degrades retrieval regarded as a process of grouping similar documents. It is therefore desirable to use the direct evidence of similarity established between two or more documents.
We can quite reasonably say that the citations, by which I mean both the citing and the cited articles inclusively, are similar at a certain level of abstraction, especially in highly specialized fields of science. Therefore we can trace back and forth between the citations in order to find similar articles. This is the principle of citation indexing applied by Garfield [16]. However, the serious objection to citation indexing is that it entails too much risk, relying as it does on the mere fact that X cites Y. Tracing back and forth tends to diverge tremendously. What this technique requires is to exclude noise sources and to provide all citations with subject indicators more powerful than titles. Lipetz [17] attempted to improve the selectivity of citations by providing "context indicators" rather than "subject indicators." His approach seems plausible, but demands much intellectual effort. In the end, the usefulness of direct evidence has not yet been significantly warranted by citation indexing.
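The back-and-forth tracing on which citation indexing relies can be sketched as traversal of a citation graph, and the rapid divergence complained of here shows up even in a toy example. All article names below are invented for illustration:

```python
# A toy citation graph: article -> list of articles it cites.
# All article names are invented for illustration.
cites = {
    "X": ["Y", "Z"],
    "W": ["Y"],
    "V": ["X"],
    "Y": [],
    "Z": ["Y"],
}

def cited_by(article):
    """Backward tracing: which articles cite this one? (the citation-index direction)"""
    return sorted(a for a, refs in cites.items() if article in refs)

def trace(article, depth):
    """Forward-and-backward tracing to a given depth, showing how quickly it diverges."""
    found, frontier = set(), {article}
    for _ in range(depth):
        nxt = set()
        for a in frontier:
            nxt.update(cites.get(a, []))   # forward: what a cites
            nxt.update(cited_by(a))        # backward: what cites a
        frontier = nxt - found - {article}
        found |= frontier
    return sorted(found)

print(cited_by("Y"))   # articles citing Y directly
print(trace("Y", 2))   # everything reached within two tracing steps
```

Even in this five-article graph, two tracing steps from Y reach every other article, which illustrates the "mere fact" objection: without subject indicators, nothing discriminates among the articles so reached.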
In this respect, the far more elaborate method of bibliographic coupling, developed by Kessler [18], shares the same fate as citation indexing. It has been noticed [19] that "citation tracing is [a] pervasive information-seeking mode." What should further be noticed is that backward tracing is much more pervasive, and that any intellectual tracing is initiated by discerning some meaningful evidence rather than the "mere fact."
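Kessler's bibliographic coupling measures the similarity of two articles by the number of references their reference lists share. A minimal sketch, with invented article names and reference lists:

```python
# Bibliographic coupling (after Kessler): two articles are coupled to the
# degree that their reference lists overlap. All data here are invented.
refs = {
    "A": {"R1", "R2", "R3"},
    "B": {"R2", "R3", "R4"},
    "C": {"R5"},
}

def coupling_strength(a, b):
    """Number of references shared by articles a and b."""
    return len(refs[a] & refs[b])

print(coupling_strength("A", "B"))  # 2 shared references
print(coupling_strength("A", "C"))  # 0: no coupling at all
```

Like citation indexing, the measure rests on the "mere fact" of citation: a high coupling strength says nothing about why the shared references were cited.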
On the other hand, it is questionable whether indexes and abstracts are the only means of retrieval as an extension of information-seeking facilities. Books, reviews, monographs, and journal articles: all of these are likely to lead our information needs on to other sources of information. Almost every scientific article cites, describes, analyses, and groups a number of other articles. Thus the reader of the citing article can, perhaps very easily, discriminate among the cited articles as to their subjects, crucial points, logical relationships, and so on. In doing so, the reader is in effect retrieving relevant articles with the aid of expertise.
Vickery [20] emphasizes the importance of review articles and the like as an efficient, selective "means to discover what they must read amid the vast mass of available documents," pointing out that "the traditional means of discovery of the pertinent literature are inadequate." Nevertheless, the traditional means may better serve to give access to the more selective means. That is to say, the strategy of discovery may best be divided between the two kinds of means.
A similar strategy was considered by Goffman, et al [13] by introducing meta-linguistic terms to indexing. However, their approach appears passive in that it is simply intended to divide a file in order to economize searches. A more active approach is therefore desired for selective discovery in terms of quality rather than quantity.
On the other hand, Goffman, et al [13] regret that many abstracts written in "trivial" meta-language are much closer to object-linguistic "extracts," and that many reviewers write abstracts instead of the state of the art. They seem to favor meta-linguistic abstracts over object-linguistic "extracts." Ironically, one of the authors has recently shown that extracts are better than abstracts in terms of calculated entropies as well as intellectual effort [14]. The power of meta-language, which they rightly recognized, thus remains inconclusive and awaits further observation.
On the whole, most of the traditional means, such as subject indexes, abstracts, and extracts, seem paralysed in the face of efficient file organization. The obsolescence of scientific literature [21] is now widely known. Brookes [22] was interested in the obsolescence involved in a cumulative file; unfortunately, his interest has not yet been worked out. A large, cumulative file certainly accumulates archival value, but at the cost of retrieval devaluation. Thus systematic file organization AND maintenance should be taken as most essential in view of information retrieval.
Recently, Blaxter and Blaxter [23] have reported an interesting observation on the needs and habits of scientific authors and readers in three research institutes. They show that the information needs of individual working scientists are met by a very small number of primary journals, and that the cited references appended to primary or review articles are used in most literature searches. More precisely:
If this were to be the general pattern of literature searches by working scientists, and if information retrieval is to meet ultimately the information needs of individual scientists, file organization should be considered in the light of the above observation.
Figure 9 shows the first paragraph extracted from an article* (hereafter called the sample article) in a recent issue of Physical Review. The extract has eight references (Refs. 1-8), not merely cited and described but also criticized and collated. With respect to the cited references, the extract is meta-linguistic and of a review kind. Similar extracts can be made from other parts of the sample article wherever it cites one or more references (Figure 13). By "extract" is meant hereafter an extract of this kind, as opposed to a common, object-linguistic extract.
From the extract in Figure 9, a subject index to Refs. 1-8 may be derived as illustrated in Figure 10. The complete subject index to Refs. 1-37 of the sample article is shown in Figure 11. The actual indexing is done on a work sheet, as illustrated in Figure 12, the indexer selecting index terms while scanning the source document. There is therefore no need actually to produce the extracts themselves.
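The work-sheet procedure amounts to building an inverted index from selected index terms to the cited-reference numbers recorded against them. A minimal sketch; apart from INDEPENDENT PARTICLE MODEL and Refs. 1-8, which come from the sample article, the terms and reference numbers are invented:

```python
# Work-sheet style indexing: for each extract, the indexer records index
# terms against the reference numbers cited there. Except for the first
# entry, terms and reference numbers are invented for illustration.
worksheet = [
    ("INDEPENDENT PARTICLE MODEL", [1, 2, 3, 4, 5, 6, 7, 8]),
    ("NUCLEAR SHELL STRUCTURE",    [3, 4]),
    ("PAIRING CORRELATIONS",       [9, 10]),
]

def build_subject_index(entries):
    """Invert the work sheet: index term -> sorted cited-reference numbers."""
    index = {}
    for term, ref_nums in entries:
        index.setdefault(term, set()).update(ref_nums)
    return {term: sorted(nums) for term, nums in index.items()}

subject_index = build_subject_index(worksheet)
print(subject_index["NUCLEAR SHELL STRUCTURE"])  # [3, 4]
```

The inversion is mechanical once the work sheet exists, which is the point of the procedure: the intellectual effort goes into term selection, not file construction.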
Figure 13 illustrates a provisional compilation for the sample article where:
Similar compilations for a number of source documents may be accumulated serially into a file. Combined with the subject index and the author index, this file may serve as a personal or other means of information retrieval. The convenience of the file remains a technical problem.
The use of the file in retrieval is much the same as that of reviews and textbooks, which can lead the reader to various sources of information. As mentioned previously, the extracts under consideration are in fact of a review kind. External and psychological contexts are involved in reading reviews. In extending to other sources of information, the reader can benefit from the expertise provided by reviews of source documents. Certainly he would not make instantaneous, mechanistic YES-NO decisions based on simple criteria. On the contrary, his decisions will be carefully thought out.
Selection of one source document by means of the subject index is relatively less important, since it is mainly intended to lead to the retrieval of as many cited references as possible. The usefulness of the file will therefore depend on the coherence of citations, i.e., the coherence of the cited references with each other as well as with the source document. Extracts should be made as short as possible, so long as they do not significantly degrade the maximum coherence obtainable from the full text. Here, coherence may be defined:
coherence = (number of citations retrieved as relevant) / (number of citations examined)
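The ratio is trivial to compute; the sketch below merely makes the definition concrete, and the counts in the example are invented:

```python
def coherence(retrieved_relevant, examined):
    """Coherence = citations retrieved as relevant / citations examined."""
    if examined == 0:
        raise ValueError("no citations examined")
    return retrieved_relevant / examined

# Invented example: suppose 6 of 8 examined citations are judged relevant.
print(coherence(6, 8))  # 0.75
```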
From the extract Ext a in Figure 13, the reader may notice that all the cited references (Refs. 1-8) are about the INDEPENDENT PARTICLE MODEL, which presumably represents their significant aspect in common. Much of the subject content behind this representation may be covered by the abstract of the source document. Thus, given the context supplied by that abstract, the reader can to some extent do without the individual abstracts of the cited articles. Similarly, the reader can benefit from other contexts exchanged among the cited references. How much he can benefit from these external contexts will depend on his psychological context.
Extracts should be made primarily in one or more sentences; description in sentences is one of the advantages of extracts over description in keywords. However, some extracts are nonsensical or largely redundant, or require modification. In these cases it would be better either:
In short, the length and the coherence of extracts should be balanced. Extraction of keywords or phrases, similar to subject indexing, may suffice in many cases.
Perhaps the simplest file organization would be to mark extracts directly on the source document and to derive the subject index from them. In a more sophisticated environment, e.g., with visual display and keyboard manipulation of the constituent files, the following organization may be convenient.
Figure 14 illustrates an entry to the subject index, and Figure 15 illustrates ways of access from the subject index to the citation index and to the extract file.
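The access paths of Figures 14 and 15 can be modelled as three linked tables: the subject index points into the citation index, which in turn points into the extract file. This is only a hypothetical sketch; the identifiers D1 and Ext_a, and all contents, are invented:

```python
# Three constituent files, linked by document and extract identifiers.
# All identifiers and contents are invented for illustration.
subject_index = {"INDEPENDENT PARTICLE MODEL": ["D1"]}          # term -> source documents
citation_index = {"D1": {"Ext_a": [1, 2, 3, 4, 5, 6, 7, 8]}}    # document -> extracts -> cited refs
extract_file = {("D1", "Ext_a"): "Refs. 1-8 concern the independent particle model ..."}

def lookup(term):
    """Follow the access path: subject index -> citation index -> extract file."""
    results = []
    for doc in subject_index.get(term, []):
        for ext, ref_nums in citation_index[doc].items():
            results.append((doc, ext, ref_nums, extract_file[(doc, ext)]))
    return results

for doc, ext, ref_nums, text in lookup("INDEPENDENT PARTICLE MODEL"):
    print(doc, ext, ref_nums)
```

The point of the separation is that each constituent file can be maintained on its own: extracts can be shortened or replaced without rebuilding the subject index, and vice versa.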
I think, as many others may, that in his World Encyclopedia H. G. Wells proposed in effect an ideal of file organization for information retrieval. Refer again to the prefatory statement made by him. The crucial point is to select and collate carefully, and to present critically. So far this study has attempted to move toward his ideal.
Take the phrase "World Encyclopedia." This somewhat tricky wording seems liable to misunderstanding. Clearly, the aim is to put away miscellany and to synthesize the essence only, rather than to bring everything together. In general, words freed from their proper contexts, whether literary, external, or psychological, are mischievous and easily invite misinterpretation. Incidentally, Wells himself experienced such mischief at the hands of a professional journalist. Hayakawa [24] says that:
By saying "ignoring," however, he would not deny the possibility of dispensing with part of the whole context. Given the environment, or the wider context, part of the context is determinative in interpretation.
Bruza, P.D., Song, D., Wong, K.F. (2000) Aboutness from a Commonsense Perspective. Journal of the American Society for Information Science and Technology (JASIST), 51(12), 1090-1105.
Maron (1977) tackled aboutness by relating it to a probability of satisfaction. Three types of aboutness were characterized: S-about, O-about and R-about. S-about (i.e. subjective about) is a relationship between a document and the resulting inner experience of the user. O-about (i.e. objective about) is a relationship between a document and a set of index terms. More specifically, a document D is about a term set T if user X employs T to search for D. R-about purports to be a generalization of O-about to a specific user community (i.e., a class of users). Let I be an index term and D a document; then the degree to which D is R-about I is the ratio between the number of users satisfied with D when using I and the number of users satisfied by D. Using this as a point of departure, Maron further constructs a probabilistic model of R-aboutness. The advantage of this is that it leads to an operational definition of aboutness which can then be tested experimentally. However, once the step has been made into the probabilistic framework, it becomes difficult to study properties of aboutness, e.g. how does R-about behave under conjunction? The underlying problem relates to the fact that probabilistic independence lacks properties with respect to conjunction and disjunction. In other words, one's hands are largely tied when trying to express qualitative properties of aboutness within a probabilistic setting. (For this reason Dubois et al. (1997) developed a qualitative framework for relevance using possibility theory.)
Maron, M.E. (1977). On Indexing, Retrieval and the Meaning of About. Journal of the American Society for Information Science, 28 (1): 38-43.
Dubois, D., Farinas del Cerro, L., Herzig, A., & Prade, H. (1997). Qualitative Relevance and Independence: A Roadmap. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 62-67, 1997.
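Maron's R-about degree, as paraphrased above, is a simple ratio taken over a user community. The sketch below merely computes that ratio; the satisfaction counts are invented:

```python
def r_about(satisfied_with_d_using_i, satisfied_by_d):
    """Maron-style R-about degree: the fraction of all users ever satisfied
    by document D who were satisfied when searching with index term I."""
    if satisfied_by_d == 0:
        raise ValueError("no users satisfied by D")
    return satisfied_with_d_using_i / satisfied_by_d

# Invented counts: 30 of the 40 users ever satisfied by D found it via term I.
print(r_about(30, 40))  # 0.75
```

The operational appeal is visible even here: both counts are observable in principle, so the definition can be tested, which is exactly the advantage (and, per the criticism above, the limitation) of moving into a probabilistic setting.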
David Bohm has frequently referred to meaning, particularly when talking about his recent experiments with dialogue groups in which "a free flow of meaning" is encouraged. This whole question of meaning, and what we mean by it, is clearly of importance, as is, in particular, the question "What do you mean by language?"
C.K. Ogden and I. A. Richards's classic The Meaning of Meaning [8] provides a useful introduction to such questions. Following Ogden and Richards, the work of Ludwig Wittgenstein made a particularly significant contribution to the notion of meaning in linguistics. [9] According to his dictum: don't look for the meaning, look for the use. Essentially this can be interpreted as saying that meaning is a generalization that does not correspond to anything actually available in language behavior. What we actually rely upon are individual uses, which are themselves interrelated according to a pattern of family resemblances. In this sense words could no more be said to "possess" an intrinsic meaning independent of their use than, in Bohr's view, an electron could be said to "possess" an intrinsic position or spin.
When Martin Gardner retired from writing his Mathematical Games column for Scientific American magazine, Hofstadter succeeded him with a column entitled Metamagical Themas (an anagram of "Mathematical Games").
Hofstadter invented the concept of Reviews of This Book, a book containing nothing but cross-referenced reviews of itself. He introduces the idea in Metamagical Themas:
Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law".
CiteSeer, in the past known as ResearchIndex, is a public specialty search engine and digital library that was created by researchers Dr. Steve Lawrence, Kurt Bollacker and Dr. Lee Giles while they were at the NEC Research Institute (now NEC Labs), Princeton, NJ, USA. CiteSeer crawls for and harvests academic scientific documents and uses autonomous citation indexing to permit querying by citation or by document. [...]
CiteSeer's goal is to improve the dissemination and access of academic and scientific literature. As a non-profit service that can be freely used by anyone, it is an example of the democratization of scientific knowledge and the Open Access movement that is revolutionizing academic and scientific publishing and scientific literature access. [...]
Google began as a research project in early 1996 by Larry Page and Sergey Brin, two Ph.D. students at Stanford who developed the hypothesis that a search engine based on analysis of the relationships between Web sites would produce better results than the basic techniques then in use. It was originally nicknamed BackRub because the system checked backlinks to estimate a site's importance. (A small search engine called RankDex was already exploring a similar strategy.)
Convinced that the pages with the most links to them from other highly relevant Web pages must be the most relevant ones, Page and Brin decided to test their thesis as part of their studies, and laid the foundation for their search engine. [...]
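Page and Brin's intuition, that a page is important when important pages link to it, is commonly formalized as the PageRank power iteration. The sketch below is a generic textbook formulation, not Google's actual implementation, and the three-page link graph is invented:

```python
# Power-iteration sketch of the "important pages link to important pages"
# idea behind BackRub/PageRank. The link graph is invented, and this is a
# textbook formulation rather than Google's actual implementation.
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}
pages = sorted(links)
d = 0.85  # damping factor: probability of following a link vs. jumping at random

# Start from a uniform distribution and iterate to a fixed point.
rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)  # p passes rank to each page it links to
    rank = new

for p in pages:
    print(p, round(rank[p], 3))
```

Here P1, which is linked to by both other pages, ends up with the highest rank: recursively counting backlinks weighted by the linker's own rank is the whole of the hypothesis described above.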
cf. Terry Winograd
The idea was worked out in more detailed form by Cerf's networking research group at Stanford in the 1973–74 period, resulting in the first TCP specification (Request for Comments 675). (The early networking work at Xerox PARC, which produced the PARC Universal Packet protocol suite, much of which was contemporaneous, was also a significant technical influence; people moved between the two.)
DARPA then contracted with BBN Technologies, Stanford University, and University College London to develop operational versions of the protocol on different hardware platforms. Four versions were developed: TCP v1, TCP v2, a split into TCP v3 and IP v3 in the spring of 1978, and then stability with TCP/IP v4, the standard protocol still in use on the Internet today.
In 1975, a two-network TCP/IP communications test was performed between Stanford and University College London (UCL). In November, 1977, a three-network TCP/IP test was conducted between the U.S., UK, and Norway. Between 1978 and 1983, several other TCP/IP prototypes were developed at multiple research centres. A full switchover to TCP/IP on the ARPANET took place January 1, 1983.
— "TCP/IP," Wikipedia.
The Advanced Research Projects Agency was renamed the Defense Advanced Research Projects Agency (DARPA) in 1972.
A fundamental pioneer in the call for a global network, J.C.R. Licklider, articulated the ideas in his January 1960 paper, Man-Computer Symbiosis.
- "A network of such [computers], connected to one another by wide-band communication lines" which provided "the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions. "—J.C.R. Licklider.
— "Three terminals and an ARPA," in: "Internet history," Wikipedia
The World Wide Web has evolved into a universe of information at our finger tips. But this was not an idea born with the Internet. This lecture recounts earlier attempts to disseminate information that influenced the Web - such as the French Encyclopédists in the 18th century, H. G. Wells' World Brain in the 1930s, and Vannevar Bush's Memex in the 1940s.
— Editorial comment
There are quite a number of published histories of the Internet and the World Wide Web. Typically these histories portray the Internet as a revolutionary development of the late 20th century—perhaps with distant roots that date back to the early 1960s. In one sense this is an apt characterization. The Internet is absolutely a creation of the computer age.
But we should not think of the Internet just as a revolutionary development. It is also an evolutionary development in the dissemination of information. In that sense the Internet is simply the latest chapter of a history that can be traced back to the Library of Alexandria or the printing press of William Caxton.
In this lecture I will not be going back to the Library of Alexandria or the printing press of William Caxton. Instead I will focus on the contributions of three individuals who envisioned something very like the Internet and the World Wide Web, long before the Internet became a technical possibility.
These three individuals each set an agenda. They put forward a vision of what the dissemination of information might become, when the world had developed the technology and was willing to pay for it. Since the World Wide Web became established in 1991, thousands of inventors and entrepreneurs have changed the way in which many of us conduct our daily lives. Today, most of the colonists of the Web are unaware of their debt to the past. I think Sir Isaac Newton put it best: "If [they] have seen further, it is by standing on the shoulders of giants." This lecture is about three of those giants: H.G. Wells, Vannevar Bush, and J.C.R. Licklider.
— Introduction
Around 1937, Wells perceived that the world was drifting into war. He believed this was because of the sheer ignorance of ordinary people, which allowed them to be duped into voting for fascist governments. He believed that the World Brain could be a force in conquering this ignorance, and he set about trying to raise the half-a-million pounds a year that he estimated would be needed to run the project. He lectured and wrote articles which were later published as a book called the World Brain (1938). He made an American lecture tour, hoping it would raise interest in his grand project. One lecture, in New York, was broadcast and relayed across the nation. He dined with President Roosevelt, and if Wells raised the issue of the World Brain with him — which seems more than likely — it did not have the effect of loosening American purse-strings. Sadly, Wells never succeeded in establishing his program before World War II broke out, and then of course such a cultural project would have been unthinkable in the exigencies of war.
— H. G. Wells and the World Brain
The rapid growth of the Internet in the 1990s was primarily due to the World Wide Web. The Web Browser made using the Internet easy for ordinary people, and also worth doing and worth investing in. The World Wide Web was invented by Sir Tim Berners-Lee working in the CERN European particle physics laboratory in Geneva, in 1991. As Berners-Lee put it himself, the World Wide Web was “the marriage of hypertext and the Internet.” The ideas were in the air. He just put the pieces together. And in so doing, he set in train a chain of events that have changed the world.
— Conclusion
The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.
— The first web page, CERN 1991 [24]
The idea of hypertext is neither a great invention nor even an invention. To say it is great is to say information is great or computing is great. Neither is great. But its use may be great or evil or better than nothing; that is up to you. Consider pornographic uses, for example. Put otherwise, an atom is barely great, while an atomic bomb is surely great.
Speech is strictly linear, and text, though less so, is still linear indeed. But you need not read linearly from opening to ending. Reading is nonlinear from time to time, if not in general, and for reference works in particular. It is simply a programmer's job to help us read nonlinearly, as is so often needed. But a programmer may be lucky enough to be the first to get the job done. All congratulations on her chance of being the first programmer rather than an inventor, certainly in the case that she does not merit being called an inventor.
Relevance is the basic underlying notion of IR by choice. [...] IR, as formulated, is based on [...] relevance. Whose relevance? Users! In reality, wanting or not, users and IR are not 'separable.' This brings us to the necessity for a broader context for IR provided by information science. Computer science provides the infrastructure. Information science the context. [emphasis original]
— Saracevic (1997). "Users lost: Reflections on the past, future, and limits of information science." SIGIR Forum, 31 (2) 16-27. [1]
Some thinkers attack the problem of free will by distinguishing different notions of freedom, or meanings of the word 'free'. In one sense we are free -- free enough for concepts of morality and responsibility to come into play. In another sense we are not free, and all that happens now is determined by what has happened earlier. According to this 'soft determinism', as William James called it, determinism is supposed to express a true doctrine in one sense of the words, and a false doctrine in another. Plenty of philosophers have argued that the problem about free will arises from what Hobbes called the 'inconstancy' of language. The same word, they say, is inconstant -- it can have several meanings. Even philosophers who argue for a simple determinism have to show that in their arguments the word 'free' is used with a constant sense, leading up to the conclusion that we are not free.
— Ian Hacking (1975) Why Does Language Matter to Philosophy? (p. 4-5)
Berlin did not assert that determinism was untrue, but rather that to accept it required a radical transformation of the language and concepts we use to think about human life -- especially a rejection of the idea of individual moral responsibility. To praise or blame individuals, to hold them responsible, is to assume that they have some control over their actions, and could have chosen differently. If individuals are wholly determined by unalterable forces, it makes no more sense to praise or blame them for their actions than it would to blame someone for being ill, or praise someone for obeying the laws of gravity. Indeed, Berlin suggested that acceptance of determinism -- that is, the complete abandonment of the concept of human free will -- would lead to the collapse of all meaningful rational activity as we know it.
— "Isaiah Berlin," on: "Free Will and Determinism," in: Stanford Encyclopedia of Philosophy
Local realism is the combination of the principle of locality with the "realistic" assumption that all objects must objectively have pre-existing values for any possible measurement before these measurements are made. Einstein liked to say that the moon is "out there" even when no one is observing it.
McGinn's aim is two-fold: to undermine both descriptive and causal theories of reference, and to argue for his preferred, 'contextual' theory of reference. McGinn is moved to this position by emphasizing indexicals—which he takes to be the primary referential devices—rather than proper names. Linguistic reference, for McGinn, is a conventional activity governed by rules that prescribe the spatio-temporal conditions of correct use; the semantic referent of a speaker's term is given by combining its linguistic meaning with the spatio-temporal context in which the speaker is located. McGinn concludes his defence of this theory by demonstrating the plausibility of its implications for such topics as abstract objects, self-reference, attribution, the language of thought hypothesis, truth, and the reducibility of reference.
— (Abstract) Colin McGinn (2002) "The Mechanism of Reference" (in) Knowledge and Reality, pp. 197-223.
Context is a term that has come into more and more frequent use in the last thirty or forty years in a number of disciplines--among them, anthropology, archaeology, art history, geography, intellectual history, law, linguistics, literary criticism, philosophy, politics, psychology, sociology, and theology. A trawl through the on-line catalogue of the Cambridge University Library in 1999 produced references to 1,453 books published since 1978 with the word context in the title (and 377 more with contexts in the plural). There have been good reasons for this development. The attempt to place ideas, utterances, texts, and other artifacts "in context" has led to many insights.
— Peter Burke (2002) "Context in Context." Common Knowledge, 8(1): 152-177.
Note: He was a visiting professor at UCL.
Today the digital library community spends some effort on scanning, compression, and OCR; tomorrow it will have to focus almost exclusively on selection, searching, and quality assessment. Input will not matter as much as relevant choice. Missing information won't be on the tip of your tongue; it will be somewhere in your files. Or, perhaps, it will be in somebody else's files. With all of everyone's work online, we will have the opportunity first glimpsed by H. G. Wells (and a bit later and more concretely by Vannevar Bush) to let everyone use everyone else's intellectual effort. We could build a real 'World Encyclopedia' with a true 'planetary memory for all mankind' as Wells wrote in 1938 [Wells 1938]. He talked of "knitting all the intellectual workers of the world through a common interest"; we could do it. The challenge for librarians and computer scientists is to let us find the information we want in other people's work; and the challenge for the lawyers and economists is to arrange the payment structures so that we are encouraged to use the work of others rather than re-create it.
— Michael Lesk (1997) How Much Information Is There In the World? (Excerpt from Conclusion)
Note: He was also a visiting professor at UCL.
World Brain or Global Brain proponents tend to extrapolate quite extravagantly the capabilities and implications of emerging technology. For Wells it was microfilm. Today it is the infinitely more sophisticated Internet and World Wide Web which have enmeshed our globe in a fantastically intricate and diffused communications infrastructure. By means of this technology as World or Global Brain proponents imagine it taking shape, the effective deployment of the entire universe of knowledge will become possible. But this begs unresolved questions about the relative value of the individual and the state, about the nature of individual and social benefits and how they are best to be allocated, about what constitutes freedom and how it might be appropriately constrained. It flies in the face of the intransigent reality that what constitutes the ever-expanding store of human knowledge is almost incalculably massive in scale, is largely viewpoint-dependent, is fragmented, complex, ceaselessly in dispute and always under revision.
— W. Boyd Rayward (1999) "H. G. Wells's Idea of a World Brain: A Critical Re-Assessment." JASIS, 50: 557-579.