From Wikipedia, the free encyclopedia

Extraction from natural language sources

The largest portion of the information contained in business documents, about 80% [1], is encoded in natural language and is therefore unstructured. Because unstructured data is poorly suited to knowledge extraction, more complex methods are required, and these generally yield worse results than methods for structured data. The potential to acquire extracted knowledge on a massive scale, however, should compensate for the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information in which the data is given in an unstructured fashion as plain text. The text may additionally be embedded in a markup document (e.g. an HTML document), because most systems remove the markup elements automatically.

Traditional Information Extraction (IE)

Traditional Information Extraction [2] is a natural language processing technology that extracts information from (typically) natural language texts and structures it in a suitable manner. The kinds of information to be identified must be specified in a model before the process begins, which is why the whole process of Traditional Information Extraction is domain-dependent. IE is split into the following five subtasks.

  • Named Entity Recognition (NER)
  • Coreference Resolution (CO)
  • Template Element Construction (TE)
  • Template Relation Construction (TR)
  • Template Scenario Production (ST)

The task of Named Entity Recognition is to recognize and categorize all named entities contained in a text, that is, to assign each named entity to a predefined category. This is done by applying grammar-based methods or statistical models.
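As a rough illustration of the grammar-based variant, the following minimal sketch combines a gazetteer lookup with one hand-written rule. The gazetteer entries, the category names and the person rule are assumptions made up for this example; they are not taken from any particular IE system.

```python
import re

# Hypothetical gazetteer: surface form -> category. A real system would use
# far larger lists or a trained statistical model.
GAZETTEER = {
    "IBM Europe": "ORGANIZATION",
    "IBM": "ORGANIZATION",
    "Berlin": "LOCATION",
}

# Illustrative grammar rule: a title followed by a capitalized word is a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")


def recognize_entities(text):
    """Return (start, end, surface form, category) tuples, sorted by position."""
    found = []
    # Try the longest gazetteer entries first so "IBM Europe" wins over "IBM".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _, _ in found)
            if not overlaps:
                found.append((m.start(), m.end(), name, GAZETTEER[name]))
    for m in PERSON_RULE.finditer(text):
        found.append((m.start(), m.end(), m.group(), "PERSON"))
    return sorted(found)


entities = recognize_entities("Dr. Smith joined IBM Europe in Berlin.")
for start, end, surface, category in entities:
    print(start, end, surface, category)
```

Matching longer gazetteer entries first is one simple way to avoid tagging "IBM" inside "IBM Europe" a second time; statistical models handle such ambiguity differently.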

Coreference Resolution identifies equivalent entities within a text that were recognized by NER. There are two relevant kinds of equivalence relationship. The first relates two differently represented entities (e.g. IBM Europe and IBM); the second relates an entity to its anaphoric references (e.g. it and IBM). Coreference Resolution should recognize both kinds.
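Both kinds of equivalence can be sketched with two deliberately naive heuristics: a name counts as a variant of another if one is a token prefix of the other, and a pronoun is attached to the most recent entity mention. The function name, the input format and the "PRONOUN" category are assumptions for illustration; real coreference resolvers use much richer evidence.

```python
def resolve_coreferences(mentions):
    """Group mentions into coreference chains.

    `mentions` is a list of (surface form, category) pairs in text order.
    Pronouns carry the hypothetical category "PRONOUN".
    """
    chains = []        # each chain is a list of surface forms
    last_chain = None  # chain of the most recent non-pronoun mention
    for surface, category in mentions:
        if category == "PRONOUN":
            # Naive anaphora heuristic: attach to the most recent entity.
            if last_chain is not None:
                last_chain.append(surface)
            continue
        tokens = surface.split()
        for chain in chains:
            head = chain[0].split()
            # Name-variant heuristic: one name is a token prefix of the other.
            shorter, longer = sorted((tokens, head), key=len)
            if longer[:len(shorter)] == shorter:
                chain.append(surface)
                last_chain = chain
                break
        else:
            # No existing chain matched, so start a new one.
            last_chain = [surface]
            chains.append(last_chain)
    return chains


chains = resolve_coreferences([
    ("IBM Europe", "ORGANIZATION"),
    ("Berlin", "LOCATION"),
    ("IBM", "ORGANIZATION"),
    ("it", "PRONOUN"),
])
print(chains)
```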

During Template Element Construction the IE system identifies descriptive properties of the entities recognized by NER and CO. These properties correspond to ordinary qualities such as red or big.

Template Relation Construction identifies the relations that exist between the template elements. These relations can be of several kinds, such as works-for or located-in, with the restriction that both domain and range correspond to entities.

In Template Scenario Production, events described in the text are identified and structured with respect to the entities recognized by NER and CO and the relations identified by TR.

Ontology-Based Information Extraction (OBIE)

Ontology-Based Information Extraction [1] is a subfield of Information Extraction in which at least one ontology is used to guide the process of extracting information from natural language text. The OBIE system uses methods of Traditional Information Extraction to identify concepts, instances and relations of the given ontologies in the text; after the process, these are structured into an ontology. Thus, the input ontologies constitute the model of the information to be extracted.

Ontology Learning (OL)

Ontology Learning [3] is the semi-automatic extraction of whole ontologies from natural language text. It can therefore be used to support ontology engineering. It is split into the following seven subtasks, not all of which need to be supported by every OL system.

  • Domain Terminology Extraction
  • Concept Discovery
  • Concept Hierarchy Derivation
  • Learning of non-taxonomic relations
  • Rule Discovery
  • Ontology Population
  • Concept Hierarchy Extension

In Domain Terminology Extraction, domain-specific terms are extracted, which the subsequent Concept Discovery step uses to derive concepts. Relevant terms can be determined, e.g., by calculating TF/IDF values or by applying the C-value / NC-value method. The resulting list of terms has to be filtered by a domain expert. Subsequently, similarly to Coreference Resolution in IE, the OL system determines synonyms, because they share the same meaning and therefore correspond to the same concept. The most common methods for this are clustering and the application of statistical similarity measures.
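The TF/IDF calculation can be sketched in a few lines. The version below uses relative term frequency and the plain logarithmic inverse document frequency, which is only one of several common weighting variants; the function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(domain_docs):
    """Score every term per document; `domain_docs` is a list of token lists."""
    n_docs = len(domain_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in domain_docs for term in set(doc))
    scores = []
    for doc in domain_docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


docs = [
    "the ontology describes the concept hierarchy".split(),
    "the parser reads the text".split(),
    "the ontology is learned from the text".split(),
]
scores = tf_idf(docs)
# Terms sorted by descending score for the first document.
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

A term that occurs in every document, such as "the" here, receives a score of zero, which is why the measure helps to separate domain-specific terms from general vocabulary.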

In Concept Discovery, terms are grouped into meaning-bearing units, which correspond to an abstraction of the world and therefore to concepts. The grouped terms are the domain-specific terms and their synonyms identified in Domain Terminology Extraction.

In Concept Hierarchy Derivation the OL system tries to arrange the extracted concepts in a taxonomic structure. This is mostly achieved with unsupervised hierarchical clustering methods. Because the result of such methods is often noisy, a supervision step, e.g. evaluation by the user, is integrated. A further method for deriving a concept hierarchy is the use of patterns that indicate a sub- or supersumption relationship. Patterns like “X, that is a Y” or “X is a Y” indicate that X is a subclass of Y. Such patterns can be analyzed efficiently, but they occur too infrequently to yield enough sub- or supersumption relationships. Instead, bootstrapping methods have been developed that learn these patterns automatically and therefore ensure higher coverage.
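The pattern-based method can be sketched with two regular expressions of the "X is a Y" kind. The exact expressions below are assumptions for this example; they match single words only, whereas real systems typically work on part-of-speech tagged noun phrases.

```python
import re

# Two illustrative surface patterns; real systems use many more.
PATTERNS = [
    re.compile(r"\b(?:an?|the) (\w+) is an? (\w+)", re.IGNORECASE),
    re.compile(r"\b(\w+), (?:which|that) is an? (\w+)", re.IGNORECASE),
]


def extract_subclass_pairs(text):
    """Return a set of (subclass, superclass) candidates found in `text`."""
    pairs = set()
    for pattern in PATTERNS:
        for sub, sup in pattern.findall(text):
            pairs.add((sub.lower(), sup.lower()))
    return pairs


pairs = extract_subclass_pairs(
    "A sedan is a car. The truck, which is a vehicle, was red."
)
print(sorted(pairs))
```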

In the Learning of non-taxonomic relations, relationships are extracted that do not express any sub- or supersumption, e.g. works-for or located-in. There are two common approaches to this subtask. The first is based on the extraction of anonymous associations, which are named appropriately in a second step. The second extracts verbs, which indicate a relationship between the entities represented by the surrounding words. The result of both approaches has to be evaluated by an ontologist.
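The verb-based approach can be approximated without a part-of-speech tagger by looking for known verb phrases that stand directly between two known entities. This is a simplification: the fixed verb list below replaces genuine verb extraction, and both the list and the function name are hypothetical.

```python
# Hypothetical mapping from verb phrases to relation names.
RELATION_VERBS = {"works for": "works-for", "is located in": "located-in"}


def extract_relations(sentence, entities):
    """Return (subject, relation, object) triples where a known verb phrase
    stands directly between two known entities."""
    triples = []
    for phrase, relation in RELATION_VERBS.items():
        for subj in entities:
            for obj in entities:
                if subj != obj and f"{subj} {phrase} {obj}" in sentence:
                    triples.append((subj, relation, obj))
    return triples


triples = extract_relations(
    "Alice works for IBM and IBM is located in Armonk.",
    ["Alice", "IBM", "Armonk"],
)
print(triples)
```

The candidate triples would still need review by an ontologist, as the text notes.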

In Rule Discovery [4], axioms (formal descriptions of concepts) are generated for the extracted concepts. This can be achieved, e.g., by analyzing the syntactic structure of a natural language definition and applying transformation rules to the resulting dependency tree. The result of this process is a list of axioms, which are afterwards combined into a concept description. This description has to be evaluated by an ontologist.

In Ontology Population the ontology is augmented with instances of concepts and properties. Instances of concepts are added using methods based on the matching of lexico-syntactic patterns. Instances of properties are added by applying bootstrapping methods, which collect relation tuples.
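One well-known family of lexico-syntactic patterns is the "such as" construction. The sketch below, under the assumption that instances are capitalized single words, collects instance names for a concept label; the regular expression and the function name are illustrative only.

```python
import re

# Pattern of the form "<concept> such as <Inst1>, <Inst2> and <Inst3>".
SUCH_AS = re.compile(
    r"(\w+) such as ((?:[A-Z]\w+(?:, | and | or )?)+)"
)


def extract_instances(text):
    """Map each concept label to the set of instance names listed after it."""
    instances = {}
    for concept, listing in SUCH_AS.findall(text):
        names = re.split(r", | and | or ", listing)
        # Drop empty strings left over from a trailing separator.
        instances.setdefault(concept.lower(), set()).update(n for n in names if n)
    return instances


instances = extract_instances(
    "He visited cities such as Berlin, Paris and Rome last year."
)
print(instances)
```

The concept label is kept in the form found in the text ("cities"); mapping it to an ontology concept such as "City" would require an additional normalization step.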

In Concept Hierarchy Extension the OL system tries to extend the taxonomic structure of an existing ontology with further concepts. This can be done in a supervised manner with a trained classifier or in an unsupervised manner by applying similarity measures.
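The unsupervised variant can be sketched with cosine similarity over context-word counts: a new concept is compared with every existing concept, and the nearest one is proposed as the attachment point. Representing concepts by bags of context words, as well as the names used below, are assumptions for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_concept(new_context, existing_contexts):
    """Return the existing concept whose context is most similar."""
    return max(
        existing_contexts,
        key=lambda name: cosine(new_context, existing_contexts[name]),
    )


existing = {
    "vehicle": Counter("drive road wheel engine road".split()),
    "animal": Counter("eat sleep fur tail eat".split()),
}
parent = nearest_concept(Counter("engine wheel road cargo".split()), existing)
print(parent)
```

Whether the new concept becomes a child or a sibling of the nearest concept is a design decision that this sketch leaves open.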

Semantic Annotation (SA)

In Semantic Annotation [5], natural language text is augmented with metadata (often represented in RDFa) that should make the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, e.g., concepts from ontologies. This also yields the knowledge of which meaning of a term was intended in the processed context. Semi-automatic Semantic Annotation can be split into the following two subtasks.

  • Terminology Extraction
  • Entity Linking

In Terminology Extraction, lexical terms are extracted from the text. For this purpose a tokenizer first determines the word boundaries and resolves abbreviations. Afterwards, terms in the text that correspond to a concept are extracted with the help of a domain-specific lexicon so that they can be linked in Entity Linking.
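The two steps, tokenization with abbreviation handling and lexicon lookup, can be sketched as follows. The abbreviation table, the domain lexicon and the longest-match strategy are assumptions chosen for this example.

```python
import re

# Hypothetical resources; a real system would load these from files.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}
DOMAIN_LEXICON = {"ontology", "entity linking", "tokenizer"}


def tokenize(text):
    """Expand known abbreviations, then split into lowercase word tokens."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return re.findall(r"\w+", text.lower())


def extract_terms(text, max_len=2):
    """Return lexicon terms of up to `max_len` tokens, in text order."""
    tokens = tokenize(text)
    terms = []
    i = 0
    while i < len(tokens):
        # Prefer the longest lexicon match starting at position i.
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DOMAIN_LEXICON:
                terms.append(candidate)
                i += n
                break
        else:
            i += 1
    return terms


terms = extract_terms("Entity linking needs an ontology, e.g. a small one.")
print(terms)
```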

In Entity Linking [6], a link is established between the lexical terms extracted from the source text and the concepts of an ontology. For this, candidate concepts corresponding to the several meanings of a term are detected with the help of a lexicon. Finally, the context of the term is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
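Candidate detection and context-based disambiguation can be sketched with a small lexicon in which every candidate concept carries a set of typical context words. The lexicon contents, the "ex:" identifiers and the word-overlap score are assumptions for illustration; production systems use considerably more elaborate scoring.

```python
# Hypothetical lexicon: term -> candidate concepts, each with context words.
LEXICON = {
    "jaguar": {
        "ex:Jaguar_Animal": {"cat", "jungle", "prey", "species"},
        "ex:Jaguar_Cars": {"car", "engine", "drive", "manufacturer"},
    },
}


def link_entity(term, context_words):
    """Pick the candidate concept sharing the most words with the context."""
    candidates = LEXICON.get(term.lower(), {})
    if not candidates:
        return None
    context = {w.lower() for w in context_words}
    return max(candidates, key=lambda c: len(candidates[c] & context))


sentence = "The jaguar has a powerful engine and is fun to drive".split()
concept = link_entity("jaguar", sentence)
print(concept)
```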

Tools

The following criteria can be used to categorize tools that extract knowledge from natural language text.

Source: Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)?
Access Paradigm: Can the tool query the data source, or does it require a whole dump for the extraction process?
Data Synchronization: Is the result of the extraction process synchronized with the source?
Uses Output Ontology: Does the tool link the result with an ontology?
Mapping Automation: How automated is the extraction process (manual, semi-automatic or automatic)?
Requires Ontology: Does the tool need an ontology for the extraction?
Uses GUI: Does the tool offer a graphical user interface?
Approach: Which approach (IE, OBIE, OL or SA) is used by the tool?
Extracted Entities: Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool?
Applied Techniques: Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)?
Output Model: Which model is used to represent the result of the tool (e.g. RDF or OWL)?
Supported Domains: Which domains are supported (e.g. economy or biology)?
Supported Languages: Which languages can be processed (e.g. English or German)?

The following table characterizes some tools for Knowledge Extraction from natural language sources.

Name Source Access Paradigm Data Synchronization Uses Output Ontology Mapping Automation Requires Ontology Uses GUI Approach Extracted Entities Applied Techniques Output Model Supported Domains Supported Languages
AeroText [7] yes IE named entities, relationships, events multilingual
AlchemyAPI [8] plain text, HTML automatic yes SA multilingual
ANNIE [9] plain text dump yes yes IE finite state algorithms multilingual
ASIUM [10] plain text dump semi-automatic yes OL concepts, concept hierarchy NLP, clustering
Attensity Exhaustive Extraction [11] automatic IE named entities, relationships, events NLP
DBpedia Spotlight [12] plain text SPARQL automatic yes yes SA annotation to each word, annotation to non-stopwords domain-independent english
iDocument [13] HTML, PDF, DOC SPARQL yes yes OBIE instances, property values NLP personal, business
NetOwl Extractor [14] yes IE named entities, relationships, events NLP multiple domains multilingual
OntoGen [15] semi-automatic yes OL concepts, concept hierarchy, non-taxonomic relations, instances NLP, machine learning, clustering
OntoLearn [16] plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent english
OntoLearn Reloaded plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent english
OntoSyphon [17] HTML, PDF, DOC dump, search engine queries no yes automatic yes no OBIE concepts, relations, instances NLP, statistical methods RDF domain-independent english
ontoX [18] dump yes OBIE instances, datatype property values
PoolParty Extractor [19] plain text, HTML, DOC, ODT dump no yes automatic yes yes OBIE named entities, concepts, relations, concepts that categorize the text, enrichments NLP, machine learning, statistical methods RDF, OWL domain-independent english, german, spanish, french
SCOOBIE plain text, HTML dump no yes automatic no no OBIE instances, property values, RDFS types NLP, machine learning RDF, RDFa domain-independent english, german
SemTag [20] [21] automatic yes SA machine learning database record
smart FIX plain text, HTML, PDF, DOC, e-Mail dump yes no automatic no yes OBIE named entities NLP, machine learning proprietary domain-independent english, german, french, dutch, polish
Text2Onto [22] plain text, HTML, PDF dump semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, instances, axioms NLP, machine learning POM multilingual
Text-To-Onto [23] plain text, HTML, PDF, PostScript dump semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations NLP, machine learning, clustering, statistical methods german
The Wiki Machine [24] plain text, HTML, PDF, DOC dump no yes automatic yes yes SA annotation to proper and common nouns machine learning RDFa domain-independent english, german, spanish, french, portuguese, italian, russian
ThingFinder [25] IE named entities, relationships, events multilingual

References

  1. ^ a b Wimalasuriya, Daya C.; Dou, Dejing (2010). "Ontology-based information extraction: An introduction and a survey of current approaches", Journal of Information Science, 36(3), p. 306 - 323, http://ix.cs.uoregon.edu/~dou/research/papers/jis09.pdf (retrieved: 18.06.2012).
  2. ^ Cunningham, Hamish (2005). "Information Extraction, Automatic", Encyclopedia of Language and Linguistics, 2, p. 665 - 677, http://gate.ac.uk/sale/ell2/ie/main.pdf (retrieved: 18.06.2012).
  3. ^ Cimiano, Philipp; Völker, Johanna; Studer, Rudi (2006). "Ontologies on Demand? - A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text", Information, Wissenschaft und Praxis, 57, p. 315 - 320, http://people.aifb.kit.edu/pci/Publications/iwp06.pdf (retrieved: 18.06.2012).
  4. ^ Völker, Johanna; Hitzler, Pascal; Cimiano, Philipp (2007). "Acquisition of OWL DL Axioms from Lexical Resources", Proceedings of the 4th European conference on The Semantic Web, p. 670 - 685, http://smartweb.dfki.de/Vortraege/lexo_2007.pdf (retrieved: 18.06.2012).
  5. ^ Erdmann, M.; Maedche, Alexander; Schnurr, H.-P.; Staab, Steffen (2000). "From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools", Proceedings of the COLING, http://www.ida.liu.se/ext/epa/cis/2001/002/paper.pdf (retrieved: 18.06.2012).
  6. ^ Rao, Delip; McNamee, Paul; Dredze, Mark (2011). "Entity Linking: Finding Extracted Entities in a Knowledge Base", Multi-source, Multi-lingual Information Extraction and Summarization, http://www.cs.jhu.edu/~delip/entity-linking.pdf (retrieved: 18.06.2012).
  7. ^ Rocket Software, Inc. (2012). "technology for extracting intelligence from text", http://www.rocketsoftware.com/products/aerotext (retrieved: 18.06.2012).
  8. ^ Orchestr8 (2012). "AlchemyAPI Overview", http://www.alchemyapi.com/api (retrieved: 18.06.2012).
  9. ^ The University of Sheffield (2011). "ANNIE: a Nearly-New Information Extraction System", http://gate.ac.uk/sale/tao/splitch6.html#chap:annie (retrieved: 18.06.2012).
  10. ^ ILP Network of Excellence. "ASIUM (LRI)", http://www-ai.ijs.si/~ilpnet2/systems/asium.html (retrieved: 18.06.2012).
  11. ^ Attensity (2012). "Exhaustive Extraction", http://www.attensity.com/products/technology/semantic-server/exhaustive-extraction/ (retrieved: 18.06.2012).
  12. ^ Mendes, Pablo N.; Jakob, Max; Garcia-Sílva, Andrés; Bizer, Christian (2011). "DBpedia Spotlight: Shedding Light on the Web of Documents", Proceedings of the 7th International Conference on Semantic Systems, p. 1 - 8, http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-GarciaSilva-Bizer-DBpediaSpotlight-ISEM2011.pdf (retrieved: 18.06.2012).
  13. ^ Adrian, Benjamin; Maus, Heiko; Dengel, Andreas (2009). "iDocument: Using Ontologies for Extracting Information from Text", http://www.dfki.uni-kl.de/~maus/dok/AdrianMausDengel09.pdf (retrieved: 18.06.2012).
  14. ^ SRA International, Inc. (2012). "NetOwl Extractor", http://www.sra.com/netowl/entity-extraction/ (retrieved: 18.06.2012).
  15. ^ Fortuna, Blaz; Grobelnik, Marko; Mladenic, Dunja (2007). "OntoGen: Semi-automatic Ontology Editor", Proceedings of the 2007 conference on Human interface, Part 2, p. 309 - 318, http://analytics.ijs.si/~blazf/papers/OntoGen2_HCII2007.pdf (retrieved: 18.06.2012).
  16. ^ Missikoff, Michele; Navigli, Roberto; Velardi, Paola (2002). "Integrated Approach to Web Ontology Learning and Engineering", Computer, 35(11), p. 60 - 63, http://wwwusers.di.uniroma1.it/~velardi/IEEE_C.pdf (retrieved: 18.06.2012).
  17. ^ McDowell, Luke K.; Cafarella, Michael (2006). "Ontology-driven Information Extraction with OntoSyphon", Proceedings of the 5th international conference on The Semantic Web, p. 428 - 444, http://turing.cs.washington.edu/papers/iswc2006McDowell-final.pdf (retrieved: 18.06.2012).
  18. ^ Yildiz, Burcu; Miksch, Silvia (2007). "ontoX - A Method for Ontology-Driven Information Extraction", Proceedings of the 2007 international conference on Computational science and its applications, 3, p. 660 - 673, http://publik.tuwien.ac.at/files/pub-inf_4769.pdf (retrieved: 18.06.2012).
  19. ^ semanticweb.org (2011). "PoolParty Extractor", http://semanticweb.org/wiki/PoolParty_Extractor (retrieved: 18.06.2012).
  20. ^ Dill, Stephen; Eiron, Nadav; Gibson, David; Gruhl, Daniel; Guha, R.; Jhingran, Anant; Kanungo, Tapas; Rajagopalan, Sridhar; Tomkins, Andrew; Tomlin, John A.; Zien, Jason Y. (2003). "SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation", Proceedings of the 12th international conference on World Wide Web, p. 178 - 186, http://www2003.org/cdrom/papers/refereed/p831/p831-dill.html (retrieved: 18.06.2012).
  21. ^ Uren, Victoria; Cimiano, Philipp; Iria, José; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio (2006). "Semantic annotation for knowledge management: Requirements and a survey of the state of the art", Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), p. 14 - 28, http://staffwww.dcs.shef.ac.uk/people/J.Iria/iria_jws06.pdf (retrieved: 18.06.2012).
  22. ^ Cimiano, Philipp; Völker, Johanna (2005). "Text2Onto - A Framework for Ontology Learning and Data-Driven Change Discovery", Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, 3513, p. 227 - 238, http://www.cimiano.de/Publications/2005/nldb05/nldb05.pdf (retrieved: 18.06.2012).
  23. ^ Maedche, Alexander; Volz, Raphael (2001). "The Ontology Extraction & Maintenance Framework Text-To-Onto", Proceedings of the IEEE International Conference on Data Mining, http://users.csc.calpoly.edu/~fkurfess/Events/DM-KM-01/Volz.pdf (retrieved: 18.06.2012).
  24. ^ Machine Linking. "We connect to the Linked Open Data cloud", http://thewikimachine.fbk.eu/html/index.html (retrieved: 18.06.2012).
  25. ^ Inxight Federal Systems (2008). "Inxight ThingFinder and ThingFinder Professional", http://inxightfedsys.com/products/sdks/tf/ (retrieved: 18.06.2012).
From Wikipedia, the free encyclopedia

Extraction from natural language sources

The biggest portion of information contained in business documents, even about 80% [1], is encoded in natural language and therefore unstructured. Because unstructured data are rather badly suited to extract knowledge from it, it is necessary to apply more complex methods, which nevertheless generally supply worse results, than it would be possible for structured data. The massive acquisition of extracted knowledge should compensate the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information, where the data are given in an unstructured fashion as plain text. But the text can be additionally embedded in a markup document (e. g. HTML document), because the most of the systems remove the markup elements automatically.

Traditional Information Extraction (IE)

The Traditional Information Extraction [2] is a technology of natural language processing, which extracts information from typically natural language texts and structures these in a suitable manner. The kinds of information to be identified must be specified in a model before beginning the process, which is why the whole process of Traditional Information Extraction is domain dependent. The IE is split in the following five subtasks.

The task of Named Entity Recognition is to recognize and to categorize all named entities contained in a text (assignment of a named entity to a predefined category). This works by application of grammar based methods or statistical models.

The Coreference Resolution identifies equivalent, by NER recognized, entities within a text. There are two relevant kinds of equivalence relationships. The first one relates to the relationship between two different represented entities (e. g. IBM Europe and IBM) and the second one to the relationship between an entity and their anaphoric references (e. g. it and IBM). Both kinds should be recognized by the Coreference Resolution.

At the Template Element Construction the IE system identifies descriptive properties of entities, recognized by NER and CO. These properties correspond to ordinary qualities like red or big.

The Template Relation Construction identifies relations, which exist between the template elements. These relations can be of several kinds, such as works-for or located-in, with the restriction, that both domain and range correspond to entities.

In the Template Scenario Production events, which are described in the text, will be identified and structured with respect to the entities, recognized by NER and CO and relations, identified by TR.

Ontology-Based Information Extraction (OBIE)

The Ontology-Based Information Extraction [1] is a subfield of Information Extraction, with which at least one ontology is used to guide the process of information extraction from natural language text. Though, the OBIE system uses methods of Traditional Information Extraction to identify concepts, instances and relations of the used ontologies in the text, which will be structured to an ontology after the process. Thus, the input ontologies constitute the model of information to be extracted.

Ontology Learning (OL)

With Ontology Learning [3] whole ontologies from natural language text are semi-automatic extracted. According to this it can be applied supportingly at ontology engineering. It is split in the following seven subtasks, which must not be supported by all OL systems.

  • Domain Terminology Extraction
  • Concept Discovery
  • Concept Hierarchy Derivation
  • Learning of non-taxonomic relations
  • Rule Discovery
  • Ontology Population
  • Concept Hierarchy Extension

At the Domain Terminology Extraction domain-specific terms are extracted, which are used in the following Concept Discovery to derive concepts. Relevant terms can be determined e. g. by calculation of the TF/IDF values or by application of the C-value / NC-value method. The resulted list of terms has to be filtered by a domain expert. Subsequent, similarly to Coreference Resolution in IE, the OL system determines synonyms, because they share the same meaning and therefore correspond to the same concept. The most common methods therefor are clustering and the application of statistical similarity measures.

In the Concept Discovery terms are grouped to meaning bearing units, which correspond to an abstraction of the world and therefore to concepts. The grouped terms are these domain-specific terms and their synonyms, which were identified in the Domain Terminology Extraction.

In the Concept Hierarchy Derivation the OL system tries to arrange the extracted concepts in a taxonomic structure. This is mostly achieved by unsupervised hierarchical clustering methods. Because the result of such methods is often noisy, a supervision, e. g. by evaluation by the user, is integrated. A further method for the derivation of a concept hierarchy exists in the usage of several patterns, which should indicate a sub- or supersumption relationship. Pattern like “X, what is a Y” or “X is a Y” indicate, that X is a subclass of Y. Such pattern can be analyzed efficiently, but they occur too infrequent, to extract enough sub- or supersumption relationships. Instead bootstrapping methods are developed, which learn these patterns automatically and therefore ensure a higher coverage.

At the Learning of non-taxonomic relations relationships are extracted, which don´t express any sub- or supersumption. Such relationships are e. g. works-for or located-in. There are two common approaches to solve this subtask. The first one bases upon the extraction of anonymous associations, which are named appropriate in a second step. The second approach extracts verbs, which indicate a relationship between the entities, represented by the surrounding words. But the result of both approaches has to be evaluated by an ontologist.

In the Rule Discovery [4] axioms (formal description of concepts) are generated for the extracted concepts. This can be achieved, e. g., by analyzing the syntactic structure of a natural language definition and the application of transformation rules on the resulting dependency tree. The result of this process is a list of axioms, which is afterward comprehended to a concept description. This one has to be evaluated by an ontologist.

At the Ontology Population the ontology is augmented with instances of concepts and properties. For the augmentation with instances of concepts methods, which are based on the matching of lexico-syntactic patterns, are used. Instances of properties are added by application of bootstrapping methods, which collect relationtuples.

In the Concept Hierarchy Extension the OL system tries to extend the taxonomic structure of an existing ontology with further concepts. This can be realized supervised by an trained classifier or unsupervised by the application of similarity measures.

Semantic Annotation (SA)

At the Semantic Annotation [5] of natural language text this one is augmented with metadata (often represented in RDFa), which should make the semantics of contained terms machine-understandable. At this process, which is generally semi-automatic, knowledge is extracted in the sense, that a link between lexical terms and e. g. concepts from ontologies is established. Thus, the knowledge is also won, which meaning of a term in the processed context was intended. The semi-automatic Semantic Annotation can be split in the following two subtasks.

  • Terminology Extraction
  • Entity Linking

At the Terminology Extraction lexical terms from the text are extracted. For this purpose a tokenizer determines at first the word boundaries and solves abbreviations. Afterward terms from the text, which correspond to a concept, are extracted with the help of a domain-specific lexicon to link these at Entity Linking.

At Entity Linking [6] a link between the extracted lexical terms from the source text and the concepts from an ontology is established. For this, candidate-concepts are detected appropriate to the several meanings of a term with the help of a lexicon. Closing, the context of the terms is analyzed, to determine the most appropriate disambiguation, to assign the term to the correct concept.

Tools

The following criteria can be used to categorize tools, which extract knowledge from natural language text.

Source Which input formats can be processed by the tool (e. g. plain text, HTML or PDF)?
Access Paradigm Can the tool query the data source or uses it a whole dump for the extraction process?
Data Synchronization Is the result of the extraction process synchronized with the source?
Uses Output Ontology Does the tool link the result with an ontology?
Mapping Automation How automated is the extraction process (manual, semi-automtic or automatic)?
Requires Ontology Does the tool need an ontology for the extraction?
Uses GUI Does the tool offer a graphical user interface?
Approach Which approach (IE, OBIE, OL or SA) is used by the tool?
Extracted Entities Which types of entities (e. g. named entities, concepts or relationships) can be extracted by the tool?
Applied Techniques Which techniques are applied (e. g. NLP, statistical methods, clustering or machine learning)?
Output Model Which model is used to represent the result of the tool (e. g. RDF or OWL)?
Supported Domains Which domains are supported (e. g. economy or biology)?
Supported Languages Which languages can be processed (e. g. english or german)?

The following table characterizes some tools for Knowledge Extraction from natural language sources.

Name Source Access Paradigm Data Synchronization Uses Output Ontology Mapping Automation Requires Ontology Uses GUI Approach Extracted Entities Applied Techniques Output Model Supported Domains Supported Languages
AeroText [7] yes IE named entities, relationships, events multilingual
AlchemyAPI [8] plain text, HTML automatic yes SA multilingual
ANNIE [9] plain text dump yes yes IE finite state algorithms multilingual
ASIUM [10] plain text dump semi-automatic yes OL concepts, concept hierarchy NLP, clustering
Attensity Exhaustive Extraction [11] automatic IE named entities, relationships, events NLP
DBpedia Spotlight [12] plain text SPARQL automatic yes yes SA annotation to each word, annotation to non-stopwords domain-independent english
iDocument [13] HTML, PDF, DOC SPARQL yes yes OBIE instances, property values NLP personal, business
NetOwl Extractor [14] yes IE named entities, relationships, events NLP multiple domains multilingual
OntoGen [15] semi-automatic yes OL concepts, concept hierarchy, non-taxonomic relations, instances NLP, machine learning, clustering
OntoLearn [16] plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent english
OntoLearn Reloaded plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent english
OntoSyphon [17] HTML, PDF, DOC dump, search engine queries no yes automatic yes no OBIE concepts, relations, instances NLP, statistical methods RDF domain-independent english
ontoX [18] dump yes OBIE instances, datatype property values
PoolParty Extractor [19] plain text, HTML, DOC, ODT dump no yes automatic yes yes OBIE named entities, concepts, relations, concepts that categorize the text, enrichments NLP, machine learning, statistical methods RDF, OWL domain-independent english, german, spanish, french
SCOOBIE plain text, HTML dump no yes automatic no no OBIE instances, property values, RDFS types NLP, machine learning RDF, RDFa domain-independent english, german
SemTag [20] [21] automatic yes SA machine learning database record
smart FIX plain text, HTML, PDF, DOC, e-Mail dump yes no automatic no yes OBIE named entities NLP, machine learning proprietary domain-independent english, german, french, dutch, polish
Text2Onto [22] plain text, HTML, PDF dump semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, instances, axioms NLP, machine learning POM multilingual
Text-To-Onto [23] plain text, HTML, PDF, PostScript dump semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations NLP, machine learning, clustering, statistical methods german
The Wiki Machine [24] plain text, HTML, PDF, DOC dump no yes automatic yes yes SA annotation to proper and common nouns machine learning RDFa domain-independent english, german, spanish, french, portuguese, italian, russian
ThingFinder [25] IE named entities, relationships, events multilingual

References

  1. ^ a b Wimalasuriya, Daya C.; Dou, Dejing (2010). "Ontology-based information extraction: An introduction and a survey of current approaches", Journal of Information Science, 36(3), p. 306 - 323, http://ix.cs.uoregon.edu/~dou/research/papers/jis09.pdf (retrieved: 18.06.2012).
  2. ^ Cunningham, Hamish (2005). "Information Extraction, Automatic", Encyclopedia of Language and Linguistics, 2, p. 665 - 677, http://gate.ac.uk/sale/ell2/ie/main.pdf (retrieved: 18.06.2012).
  3. ^ Cimiano, Philipp; Völker, Johanna; Studer, Rudi (2006). "Ontologies on Demand? - A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text", Information, Wissenschaft und Praxis, 57, p. 315 - 320, http://people.aifb.kit.edu/pci/Publications/iwp06.pdf (retrieved: 18.06.2012).
  4. ^ Völker, Johanna; Hitzler, Pascal; Cimiano, Philipp (2007). "Acquisition of OWL DL Axioms from Lexical Resources", Proceedings of the 4th European conference on The Semantic Web, p. 670 - 685, http://smartweb.dfki.de/Vortraege/lexo_2007.pdf (retrieved: 18.06.2012).
  5. ^ Erdmann, M.; Maedche, Alexander; Schnurr, H.-P.; Staab, Steffen (2000). "From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools", Proceedings of the COLING, http://www.ida.liu.se/ext/epa/cis/2001/002/paper.pdf (retrieved: 18.06.2012).
  6. ^ Rao, Delip; McNamee, Paul; Dredze, Mark (2011). "Entity Linking: Finding Extracted Entities in a Knowledge Base", Multi-source, Multi-lingual Information Extraction and Summarization, http://www.cs.jhu.edu/~delip/entity-linking.pdf (retrieved: 18.06.2012).
  7. ^ Rocket Software, Inc. (2012). "technology for extracting intelligence from text", http://www.rocketsoftware.com/products/aerotext (retrieved: 18.06.2012).
  8. ^ Orchestr8 (2012). "AlchemyAPI Overview", http://www.alchemyapi.com/api (retrieved: 18.06.2012).
  9. ^ The University of Sheffield (2011). "ANNIE: a Nearly-New Information Extraction System", http://gate.ac.uk/sale/tao/splitch6.html#chap:annie (retrieved: 18.06.2012).
  10. ^ ILP Network of Excellence. "ASIUM (LRI)", http://www-ai.ijs.si/~ilpnet2/systems/asium.html (retrieved: 18.06.2012).
  11. ^ Attensity (2012). "Exhaustive Extraction", http://www.attensity.com/products/technology/semantic-server/exhaustive-extraction/ (retrieved: 18.06.2012).
  12. ^ Mendes, Pablo N.; Jakob, Max; García-Silva, Andrés; Bizer, Christian (2011). "DBpedia Spotlight: Shedding Light on the Web of Documents", Proceedings of the 7th International Conference on Semantic Systems, p. 1 - 8, http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-GarciaSilva-Bizer-DBpediaSpotlight-ISEM2011.pdf (retrieved: 18.06.2012).
  13. ^ Adrian, Benjamin; Maus, Heiko; Dengel, Andreas (2009). "iDocument: Using Ontologies for Extracting Information from Text", http://www.dfki.uni-kl.de/~maus/dok/AdrianMausDengel09.pdf (retrieved: 18.06.2012).
  14. ^ SRA International, Inc. (2012). "NetOwl Extractor", http://www.sra.com/netowl/entity-extraction/ (retrieved: 18.06.2012).
  15. ^ Fortuna, Blaz; Grobelnik, Marko; Mladenic, Dunja (2007). "OntoGen: Semi-automatic Ontology Editor", Proceedings of the 2007 conference on Human interface, Part 2, p. 309 - 318, http://analytics.ijs.si/~blazf/papers/OntoGen2_HCII2007.pdf (retrieved: 18.06.2012).
  16. ^ Missikoff, Michele; Navigli, Roberto; Velardi, Paola (2002). "Integrated Approach to Web Ontology Learning and Engineering", Computer, 35(11), p. 60 - 63, http://wwwusers.di.uniroma1.it/~velardi/IEEE_C.pdf (retrieved: 18.06.2012).
  17. ^ McDowell, Luke K.; Cafarella, Michael (2006). "Ontology-driven Information Extraction with OntoSyphon", Proceedings of the 5th international conference on The Semantic Web, p. 428 - 444, http://turing.cs.washington.edu/papers/iswc2006McDowell-final.pdf (retrieved: 18.06.2012).
  18. ^ Yildiz, Burcu; Miksch, Silvia (2007). "ontoX - A Method for Ontology-Driven Information Extraction", Proceedings of the 2007 international conference on Computational science and its applications, 3, p. 660 - 673, http://publik.tuwien.ac.at/files/pub-inf_4769.pdf (retrieved: 18.06.2012).
  19. ^ semanticweb.org (2011). "PoolParty Extractor", http://semanticweb.org/wiki/PoolParty_Extractor (retrieved: 18.06.2012).
  20. ^ Dill, Stephen; Eiron, Nadav; Gibson, David; Gruhl, Daniel; Guha, R.; Jhingran, Anant; Kanungo, Tapas; Rajagopalan, Sridhar; Tomkins, Andrew; Tomlin, John A.; Zien, Jason Y. (2003). "SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation", Proceedings of the 12th international conference on World Wide Web, p. 178 - 186, http://www2003.org/cdrom/papers/refereed/p831/p831-dill.html (retrieved: 18.06.2012).
  21. ^ Uren, Victoria; Cimiano, Philipp; Iria, José; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio (2006). "Semantic annotation for knowledge management: Requirements and a survey of the state of the art", Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), p. 14 - 28, http://staffwww.dcs.shef.ac.uk/people/J.Iria/iria_jws06.pdf (retrieved: 18.06.2012).
  22. ^ Cimiano, Philipp; Völker, Johanna (2005). "Text2Onto - A Framework for Ontology Learning and Data-Driven Change Discovery", Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, 3513, p. 227 - 238, http://www.cimiano.de/Publications/2005/nldb05/nldb05.pdf (retrieved: 18.06.2012).
  23. ^ Maedche, Alexander; Volz, Raphael (2001). "The Ontology Extraction & Maintenance Framework Text-To-Onto", Proceedings of the IEEE International Conference on Data Mining, http://users.csc.calpoly.edu/~fkurfess/Events/DM-KM-01/Volz.pdf (retrieved: 18.06.2012).
  24. ^ Machine Linking. "We connect to the Linked Open Data cloud", http://thewikimachine.fbk.eu/html/index.html (retrieved: 18.06.2012).
  25. ^ Inxight Federal Systems (2008). "Inxight ThingFinder and ThingFinder Professional", http://inxightfedsys.com/products/sdks/tf/ (retrieved: 18.06.2012).
