In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, and has since become a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.
Linguistic Linked Open Data describes the publication of data for linguistics and natural language processing using the following principles: [1]
The primary benefits of LLOD have been identified as: [2]
The home of the LLOD cloud diagram is linguistic-lod.org. [3]
Aside from gathering metadata and generating the LLOD cloud diagram, the LLOD community is driving the development of community standards with respect to vocabularies, metadata and best practice recommendations.
According to the state-of-the-art overview by Cimiano et al. (2020), [4] these include:
As of mid-2020, most of these community standards were being actively developed. Particularly problematic is the existence of multiple incompatible standards for linguistic annotations, and in early 2020, the W3C Community Group Linked Data for Language Technology began to work towards a consolidation of these (and other) vocabularies for linguistic annotations on the web. [15]
The LLOD cloud diagram has been developed and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (since 2014 Open Knowledge), an open and interdisciplinary group of experts in language resources.
The OWLG organizes community events, coordinates LLOD developments, and facilitates interdisciplinary communication among LLOD contributors and users.
Several W3C Business and Community Groups focus on specialized aspects of LLOD:
LLOD development is driven forward by and documented in a series of international workshops, datathons, and associated publications. Among others, these include
Linguistic Linked Open Data is applied to address a number of scientific research problems:
Linguistic Linked Open Data is closely related to the development of
Uses and development of LLOD have been the subject of several large-scale research projects, including
As of October 2018, the 10 most frequently linked resources in the LLOD diagram were (in order of the number of linked datasets):
There are a number of recurring discussions regarding different aspects of the term and its applicability to particular types of resources. [32]
Aside from resources used in and created for linguistic research, the LLOD cloud diagram also includes ontologies, terminologies and general knowledge bases whose development was not originally driven by interest in language sciences or language technology, e.g., DBpedia. As a criterion for inclusion into the LLOD diagram, the OWLG requires "linguistic relevance": "[A] dataset is linguistically relevant if it provides or describes language data that can be used for the purpose of linguistic research or natural language processing." [33] This includes linguistic resources in a strict sense ("condition 1": an annotated or otherwise structured resource created for application in language sciences or language technology, as demonstrated, for example, by a scientific publication at a linguistics-related journal or conference), but also resources "that can be used for annotating, enriching, retrieving or classifying language resources ... [if their relevance] can be verified by the existence of links between a resource (whose linguistic relevance is to be confirmed) and resources fulfilling condition (1)" ("condition 2"). [34]
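The two conditions can be read as a simple rule over a graph of datasets and their links. The following Python sketch illustrates that reading only; it is not the procedure the OWLG uses. The `Dataset` class, the `is_language_resource` flag (standing in for condition 1), and the example dataset names are all hypothetical, and links are treated as undirected because the quoted wording speaks of links "between" resources.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical metadata record for one dataset in the cloud."""
    name: str
    is_language_resource: bool          # assumption: stands in for condition 1
    links_to: set = field(default_factory=set)  # names of linked datasets


def relevant_datasets(datasets):
    """Return names of datasets that satisfy condition 1 or condition 2."""
    known = {d.name for d in datasets}
    cond1 = {d.name for d in datasets if d.is_language_resource}
    cond2 = set()
    for d in datasets:
        for target in d.links_to:
            # A link in either direction to a condition 1 resource counts.
            if d.name in cond1 and target not in cond1:
                cond2.add(target)
            elif target in cond1 and d.name not in cond1:
                cond2.add(d.name)
    # Ignore link targets that have no metadata record of their own.
    return (cond1 | cond2) & known


example = [
    Dataset("annotated-corpus", True, {"general-kb"}),
    Dataset("general-kb", False),
    Dataset("unlinked-data", False),
]
print(sorted(relevant_datasets(example)))
```

In this toy input, the general knowledge base qualifies only through its link to the annotated corpus, while the unlinked dataset does not qualify at all.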
A related issue is the classification of linguistically relevant datasets (or language resources in general). The OWLG developed the following classification for the LLOD cloud diagram: [35]
Note that in this classification, term bases might be slightly different in that they do not provide grammatical information. However, since they formalize semantic knowledge, they are of immanent relevance for natural language processing tasks such as named entity recognition or anaphora resolution.
LLOD is defined in relation to Linked Open Data, and LLOD resources (data) should thus conform to licenses in accordance with the Open Definition. [36] For generating the LLOD cloud diagram (and the LOD diagram), however, this does not seem to be enforced yet, so that the technical criterion is availability over the web and a metadata entry. In the OWLG, it has been repeatedly discussed whether non-commercial (academic) resources could be included, with a general consensus of admitting them for the moment (2015) but subsequently enforcing stricter requirements along with the growth of the LLOD cloud. As of January 2018, no agreement had been reached on when this move would happen. [37] As of January 2020, machine-readable license metadata was available for 86 LLOD resources; of these, 82 adopted open licenses and 4 adopted non-commercial licenses. [38]
In a broader sense, the term LLOD technology (infrastructures, tools, vocabularies) can also be used to refer to the technology independently of whether open resources are actually involved, e.g., in the name of the EU project Pret-a-LLOD, which features several commercial business cases. [39] This is justified for applications that consume (rather than provide) open data, and also when linked data technology and the adoption of other LLOD conventions (esp., the use of RDF vocabularies developed in the context of LLOD) are applied in order to facilitate the seamless integration of LLOD resources (open resources).
The abbreviation "LLOD" can be used to refer to either LLOD technology (use of Linked Data and LLOD vocabularies, independent of the legal status of the data being processed) or LLOD resources (open data). For disambiguation, the terms "LLOD resources" and "LLOD technology" can be used. To emphasize application or applicability to non-open resources, "LLD" (Linguistic Linked Data) has also been used. [40] A possible compromise is the acronym "LL(O)D" for the technology. As of June 2020, a "Licensed Linguistic Linked Data" cloud that contains non-open resources did not exist. [38]
The definition of Linked Data requires the application of RDF or related standards. This includes the W3C recommendations SPARQL, Turtle, JSON-LD, RDF/XML, RDFa, etc. In language technology and the language sciences, however, other formalisms are currently more popular, and the inclusion of such data into the LLOD cloud diagram has been occasionally requested. [32] For several such formats, W3C-standardized wrapping mechanisms exist (e.g., for XML, CSV or relational databases, see Knowledge extraction#Extraction from structured sources to RDF), and such data can be integrated under the condition that the corresponding mapping is provided along with the source data.
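To illustrate what a mapping from tabular data to RDF amounts to, the following Python sketch converts a small CSV lexicon into N-Triples by hand. It does not implement any of the W3C wrapping mechanisms mentioned above. The column names (`id`, `lemma`, `lang`, `pos`), the `http://example.org/` namespace, and the `partOfSpeech` property are assumptions made for this example, and only `rdfs:label` is an existing vocabulary term.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/lexicon/"             # hypothetical namespace
POS = "http://example.org/vocab/partOfSpeech"    # hypothetical property
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(value):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(text):
    """Map each CSV row to two triples: a language-tagged label and a POS value."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "<" + BASE + quote(row["id"]) + ">"
        lemma = escape_literal(row["lemma"])
        pos = escape_literal(row["pos"])
        lang = row["lang"]
        triples.append(f'{subject} <{RDFS_LABEL}> "{lemma}"@{lang} .')
        triples.append(f'{subject} <{POS}> "{pos}" .')
    return "\n".join(triples)


sample = "id,lemma,lang,pos\ne1,house,en,noun\n"
print(csv_to_ntriples(sample))
```

The mapping itself (which column becomes the subject IRI, which property each column is attached to) is the information that would have to be published alongside the source data for it to be integrated.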
A 2022 review paper is:
An exhaustive description of the state of the art on LLOD is provided by
The concept of a Linguistic Linked Open Data cloud was originally introduced by
The first book on the topic is
According to Cimiano et al. (2020), [41] other seminal publications since then include
Developments from 2015 to 2019 are summarized in the collected volume by