WIKIPEDIA TALK:REQUESTS FOR COMMENT INFOBOX TEMPLATE COHERENCE

RFC Discussion

Please give feedback here and add votes (support, oppose etc.) here. SebastianHellmann ( talk) 14:59, 15 November 2009 (UTC) reply

Umm, this is the purpose of microformats. I don't see any need for further discussion here. Chris Cunningham (not at work) - talk 02:14, 16 November 2009 (UTC) reply

How do you create a global scheme with microformats? What properties do you use, for example, to describe medical drugs or language families? As far as I know, microformats have a limited set of standardized datatypes. SebastianHellmann ( talk) 06:46, 16 November 2009 (UTC) reply

I just checked the microformats page. You are right, microformats have a similar purpose, but the proposed approach goes further. There are about 10 specified microformats and ten more drafts, which do not cover a lot of domains, but only the most common and simplest ones (which is the intended goal of microformats). The basic idea of adding metadata is the same. Furthermore, they are also complementary. With the help of the annotations for coherence, more microformats can be created and embedded once they are stable. Let's say there were microformats for everything, then this might (not clear) be achieved with microformats. But who is going to create them? Who is going to embed them? Who is to decide what the correct microformat is for each content type (as I mentioned, e.g. medical drugs, etc.)? SebastianHellmann ( talk) 10:02, 16 November 2009 (UTC) reply

The lack of existence for a given microformat is not reason to cook up our own standard which does the same thing. If there are common use cases (such as your "medical drugs" example) which could be covered by a microformat-type markup then we should just go with that. It's lightweight, requires very little effort to implement and can easily be pushed out to the wider microformats community if it's successful. Chris Cunningham (not at work) - talk 09:05, 16 November 2009 (UTC) reply

The standard we are referring to is the Semantic Web, including RDF and OWL. The use case is the data contained in Wikipedia, not one domain, but every domain (possibly also other language versions). It is preparing the deployment of Semantic MediaWiki and will be used by DBpedia right away. Is there a generic microformat that potentially covers everything, from almamater to contra-indication to oscar nominations? SebastianHellmann ( talk) 09:19, 16 November 2009 (UTC) reply

This might be a stupid question, as I don't know much about either Wikipedia templates or the Semantic Web, but could we use RDFa? -- Chris Johnson ( talk) 13:12, 16 November 2009 (UTC) reply

There are two problems we need to tackle. The first one is the technical implementation and yes RDFa might be absolutely fine for this, also. It is the same standard. The second problem is the consensus about the meaning and interpretation of properties. As you see above each instance of Musical Artist has an occupation property. But if we go to the Infobox Prime Minister, then there is a property office. Now, you could say that in this case office is like occupation and use the same property occupation for this, but then other people might disagree. So there can really be a lot of disagreement about it until it gets stable. (As a rough estimate: the 49k attributes currently used in templates may be mapped to approximately 2k properties). So RDFa would be a sound way of including the information after the Ontology engineering has happened. Otherwise it will possibly result in a lot of template changes (which should also only be done by experts) SebastianHellmann ( talk) 15:47, 16 November 2009 (UTC) reply

I included some more examples above. Consider the trivial problem, that both infoboxes for musical artists and marching bands have a member property, but in the first it is a list of members and in the second case a count of members SebastianHellmann ( talk) 09:47, 16 November 2009 (UTC) reply
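(Illustrative note: the member conflict described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the annotation syntax proposed in the RFC. The template names, the target property names bandMember and numberOfMembers, and the parse hints "list" and "integer" are all hypothetical.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    prop: str        # hypothetical shared property name
    parse_hint: str  # hypothetical hint: "list" or "integer"


# Hypothetical annotations: the same attribute name "member" maps to a
# different property and a different value type depending on the template.
ANNOTATIONS = {
    ("Infobox musical artist", "member"): Mapping("bandMember", "list"),
    ("Infobox marching band", "member"): Mapping("numberOfMembers", "integer"),
}


def interpret(template, attribute, raw):
    """Return (property, parsed value) for one raw infobox value."""
    mapping = ANNOTATIONS[(template, attribute)]
    if mapping.parse_hint == "list":
        value = [part.strip() for part in raw.split(",") if part.strip()]
    elif mapping.parse_hint == "integer":
        value = int(raw.strip())
    else:
        raise ValueError("unknown parse hint: " + mapping.parse_hint)
    return mapping.prop, value
```

Without the per-template annotation, an extractor would have to guess whether "member" holds names or a number.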

As Sebastian mentioned, RDF and OWL are not our own standards (we were not involved in their creation) and not the same as microformats. They both offer their own benefits and the existence of one of them does not remove the need for the other. Furthermore, RDF and OWL are widely used and established. You suggest creating new microformats for each kind of information. This is clearly not feasible, since a meaningful microformat needs community and tool support. It is also very unlikely that the microformat community wants to achieve this. Quoting [1]: "Microformats are not [...] Infinitely extensible and open-ended [...] A panacea for all taxonomies, ontologies, and other such abstractions [...] Defining the whole world, or even just boiling the ocean". Please note that our approach is also lightweight and could be running within days (depending on what will finally be decided), but we want consensus on the usage of the above described templates first. Jens Lehmann ( talk) 09:53, 16 November 2009 (UTC) reply

This seems like a fairly easy fix to be honest. You're right in that infoboxes are often inconsistent, but this doesn't require revamping the entire system... – Juliancolton | ^Talk 17:50, 16 November 2009 (UTC) reply

The goal and reasoning behind the suggested approach is not to revamp the entire system. One of the primary motivations is to be minimally invasive in the sense of not having a negative impact on the way Wikipedia works. The approach only requires adding the templates to the infobox documentation, i.e. the doc subpages of infobox templates (many of which we prepared already). It is a fairly small change with significant benefits. Jens Lehmann ( talk) 22:32, 16 November 2009 (UTC) reply

Infoboxes (infoboxen?) are definitely inconsistent, and some have way too many redundant parameters (take {{ Infobox school}} for example...). Microformats would help in some areas, but others are just a matter of deleting redundant infoboxes, and getting rid of redundant parameters in current infoboxes.-- Unionhawk ^Talk ^E-mail ^Review 18:58, 16 November 2009 (UTC) reply

I don't think requiring microformats would work; however, I think having types associated with fields would be a very good idea and would help with checking the encyclopaedia. For instance, a date type would look different in different language versions; having a date format check would mean all dates in infoboxes could be checked for reasonableness and translated into a standard form easily for database use. All sorts of things like people's names, place names, numbers of storeys in buildings, whether being still alive is reasonable given date of birth, professions, colours etc. could all be checked, but practically all would need language-specific modules associated with them. Dmcq ( talk) 20:53, 16 November 2009 (UTC) reply

Yes, that is what the proposal is about. We had an iterative strategy in mind. First the templates are created here in Wikipedia. DBpedia uses these templates to extract cleaner data (maybe also other extraction approaches like Freebase). They do not take up much space and do not interfere with anything (like template definitions). Wikipedians would have the direct possibility to influence the data in DBpedia and other extraction approaches. Since DBpedia pretty much contains the data included in Wikipedia, just in a structured format, we could spot inconsistencies in Wikipedia with some tool support and users could fix them. So in the long run DBpedia and Wikipedia will both improve. So in a nutshell: the templates take up a small line in the doc pages, interested Wikipedians who use the extracted data gain influence, and tools can be created that help editors. SebastianHellmann ( talk) 12:12, 17 November 2009 (UTC) reply

Any chance we can get the CENT wording changed? It currently reads like a plank on the five year plan. Protonk ( talk) 22:32, 17 November 2009 (UTC) reply

Can someone help me please

I find it very hard to believe that I have understood what the perceived problem and the proposed solution are:

Problem: Infoboxes are not standardised. Basically the same information is present in different templates, but using different parameters. Sometimes the same parameter is used for substantially different information in different infoboxes. And sometimes the same information is encoded in different ways.
Disadvantage for DBpedia: It's harder to extract information.
Disadvantage for Wikipedia: The more active editors get confused and always have to look up the documentation when they introduce an infobox into an article.
Proposed solution: Build a layer of abstraction over the mess, Wikipedia-side, which requires individual coding for the infobox templates.

I really hope that I misunderstood something. Hans Adler 22:58, 17 November 2009 (UTC) reply

In a way I have to echo parts of that. Yes, I can see some degree of standardization being needed. This is something that is seen with "general to specific" infoboxes where there is a generic, bare bones base 'box that can, and should, be replaced with a more specialized one by changing the template title and adding parameters but not changing parameters. But...

With DBpedia I'm not 100% sure that we should be adding a layer of rigidity/bureaucracy to satisfy a 3rd party (Or have I missed something stating that DBpedia is part of Wikipedia or that they will be a contributor and not just extracting/using the data editors provide?).

And with the non-English Wiki template in the examples... at a minimum there are two or three problems there: 1) Different material being acceptable for inclusion in different languages - do we use their criteria, force ours on them, or go for a "lowest common denominator"?; 2) Some of the other Wikis are using embedded images in the infobox header, something there has been limited discussion on here; and 3) language of use for accessibility. Bluntly I'd expect a template on the Portuguese Wiki to either be in Portuguese or be moving towards it. So a term in English is not necessarily going to be identical in Portuguese.

- J Greb ( talk) 23:37, 17 November 2009 (UTC) reply

The point about "satisfaction of a 3rd party" was already made in the previous discussion. The conclusion was to create this RFC with non-DBpedia-specific templates. Looking at the examples, you can see that they do not mention DBpedia and only map to RDF/OWL (W3C standards). It is likely that other extractors will use them as well, i.e. there is no commitment to DBpedia. (Of course DBpedia examples are used as demonstrators otherwise some people may not see the point in the RFC.) Apart from that, the changes should not be done for the satisfaction of any specific party, but to increase the value of the information contributed by Wikipedians.

Can you give a specific example for one of the three problems you mention? I do not really see why e.g. a template in Portuguese in a Portuguese wiki is a problem in this approach or how it relates to material being acceptable or not. -- Jens Lehmann ( talk) 08:26, 19 November 2009 (UTC) reply

I take the lack of response to my description of the problem and proposed solution as confirmation that the description was correct. Therefore I strongly reject the proposed solution as (1) addressing the problem in a place where it is inefficient to do so, and (2) addressing only the third-party concerns, unlike the natural solution, that would directly address the Wikipedia concerns as well. Wiki text is not ossified source code that shouldn't be touched so as not to break things. The way we address lack of uniformity here is by building a consensus for uniformisation and then executing it (by refactoring the wiki text). This only breaks down in exceptional cases such as the one that led to the WP:ENGVAR compromise. I am confident that this is not such an exceptional case. Hans Adler 09:25, 19 November 2009 (UTC) reply

My initial reaction is suspicion. While I can see theoretical value in removing anomalies, it does seem like we are building a steam hammer to crack an egg. Yes, I use infoboxes to assist in the task of writing articles, but before I can comment sensibly it seems I have to leave my articles and research something called microformats, which quite frankly have little relevance in my 18th century Oldham. Yes, I have added some wonderful parameters to an infobox template that will be used in four or five months' time. They were named to be consistent with parameters in a related list-building template, so that when enough data has been gained in the list item it can be launched into an article with an infobox, i.e. zero keystrokes to copy data across. I am suspicious that this compliance thing would rename parameters and thus break the link. Quite simple: if that happens, the infobox goes and I'll code up something that looks like an infobox. One definition of a computer geek is someone who sits down and spends a fortnight coding an automatic procedure for a task that would have taken 10 minutes manually. Isn't there a warning template that says "This template employs intricate coding; don't try to edit it unless you know what you are doing"? I need to be convinced that the proposed bot will have the intelligence to recognise when it is out of its depth, but the more I think of it the more worried I become. -- ClemRutter ( talk) 09:47, 19 November 2009 (UTC) reply

As we designed this solution, we had two other alternatives. We have already created a mapping like the one the proposed template annotations would produce. It is held in a database, which means that we, for our purposes, can very well access it with SQL. Now, we thought about opening this database to more users, and considered the following alternatives. One is to create a GUI for the database: not so much work for us, and we would have been able to check consistency, etc. Another is to install our own MediaWiki, load the template annotations we already have into it, lock the template definitions and only allow registered users to edit it. Both approaches allow us to control exactly what is going on: we can block users, control parse hints, etc. Nevertheless, we think that Wikipedians should have full control, which we give away in this proposal. We are not admins here and our vote does not count much. If somebody wants to change the templates or the parse hints, we would have to adjust our code to Wikipedia, exactly as any other extraction approach.

We really think that metadata should be kept in and provided by Wikipedia directly. It would be nice if MediaWiki had that feature or if Semantic MediaWiki were deployed, but neither is the case and it is unclear whether it will happen in the near future. Scalability is important here. Our basic design principle was to create a lightweight process of Ontology Engineering. The question here is how much technical know-how you need to contribute to the Ontology and the property scheme. The domain knowledge, i.e. English literature, Linguistics, Medicine, lies with people whose skills normally end at using infoboxes in an article. Only technical experts can edit template definitions.

Templates are normally used for representational purposes and this will probably not change and also shouldn't change as it should stay easy to create them. Now the annotations just add another dimension to them, i.e. structured information. A much better way as Hans Adler proposed would of course be to curate the structured information and then only adjust the representation, but this would be a massive change and would require immense resources. It is much easier on the other hand to just put some extra templates somewhere. Contribution to these is easy and meta data can accumulate and it does not interfere with anything. I guess, there should not be a guideline for Wikipedians to adjust the templates to the mapping annotations, but the information accumulated in the mapping annotations can be useful in several ways, i.e. Data Extraction (once again, we made this proposal to generalize the approach, otherwise we would keep it locked), maybe create RDFa out of them and embed it in the page. If there is a functionality of MediaWiki that allows to include the metadata in the software, there would already be a semantic mapping and an ontology and it can be reused, converted and the mapping annotations deleted. SebastianHellmann ( talk) 14:53, 19 November 2009 (UTC) reply

I like the proposal to standardise the names of common parameters, but I don't like the proposal outlined above regarding template mapping which seems incredibly involved and convoluted and seems to be taking what should be simple tasks away from the comfort zone of all editors. Yes, there is some sort of issue here, but I think the proposed solution loses the baby along with the bath water. Hiding T 11:01, 19 November 2009 (UTC) reply

Is there something specific that you don't like? We were actually investigating the matter for quite a while (more than a year) and this RFC represents the easiest, simplest and most flexible solution we could come up with. It was also discussed with quite a few people, e.g. from the MediaWiki developer community. The infobox annotation templates are pretty concise: you just need a few lines to establish a mapping. Please also bear in mind that not every author has to interact with the template annotations and in many cases (e.g. when attributes are unique) they are not even required at all. Also, the proposed approach can later be supported by a MediaWiki extension, to make the mapping generation even easier.-- Soeren1611 ( talk) 14:44, 19 November 2009 (UTC) reply

Looks like I have misunderstood something. What I'm then not clear on is whether you propose to first standardise the common parameters and then do this mapping. That is, if I understand you and this mapping won't affect the way I use an infobox template in an article or build an infobox template? Hiding T 15:02, 19 November 2009 (UTC) reply

I have practically the opposite reaction. I like the idea of templates being able to state the expected format of parameters or values expected in them, and give a mapping for template and parameter names so they can be correlated between Wikipedias and for databases. However, I wouldn't want a forced standardization of names or parameters. I guess there would be standardization eventually because of better communication and I'm quite happy with such unforced standardization. My main problem is the hows of the template name and parameter name mappings if only Portuguese is used in the Portuguese Wikipedia and only English in the English one. Dmcq ( talk) 12:13, 19 November 2009 (UTC) reply

I'm not sure I understand the languages point, but I'd like to see us standardise something like birth_name, birthname and birth name. Standardising those seems intuitive to me. Hiding T 12:48, 19 November 2009 (UTC) reply

My problem is with the Portuguese 'relatesToClass = Musician' and it referring to a template in English with names like 'birthdate' in it. I can't see a decent way round that problem and it would be referring outside of the Portuguese Wikipedia to somewhere else. I suppose that information wouldn't actually be used in the Portuguese Wikipedia, only when extracting data using wherever that class is stored, but where it is stored is a bit of a problem. Dmcq ( talk) 13:19, 19 November 2009 (UTC) reply

In fact if we simply put classes into wikipedia commons as text that would be a way to make the job independent of dbpedia and acceptable to the different language versions. The relatesToClass then would simply refer to a file in commons which would look a bit similar to a template but could deal with some variation, for instance one wikipedia might like a list for one parameter and another might just want a number or leave it out. Dmcq ( talk) 13:40, 19 November 2009 (UTC) reply

Ah, I may have misunderstood a little. I hadn't realised the proposal was looking at standardising across all language Wikipedias. I don't think I could support that; each Wikipedia should be free to do as they wish, I should think. And wouldn't this need far wider input if that were the case? Hiding T 14:15, 19 November 2009 (UTC) reply

Of course each Wikipedia should do as they wish. This discussion here is only about the English Wikipedia. However, if Wikipedians from other language chapters want to include the mechanism as well, then this is a very easy thing to do. We are not asking permission to create such templates on Wikipedia language editions other than the English one. I will edit the RFC to make this clearer. (In my opinion, it is an advantage that multi-language support is not an afterthought.) -- Jens Lehmann ( talk) 14:25, 19 November 2009 (UTC) reply

Hello Dmcq. What you are describing is what the RFC says: relatesToClass refers to a page in MetaWiki (not Commons - we discussed about the appropriate place some time ago and we already have permission from MetaWiki; we were also allowed to use a bot there to create the initial version). This way, the approach is language independent and the templates on each Wikipedia language edition are kept simple (they only contain the pointer to MetaWiki and a parse hint). -- Jens Lehmann ( talk) 14:35, 19 November 2009 (UTC) reply

Hello Hans and sorry for not replying earlier. I understand your point and why you are against the proposal. Usually, if you build e.g. a piece of software and find out that it is broken in some way, you just fix it directly. Usually, you should not create an additional layer of abstraction to make it look better from the outside. (This is the essence of your thoughts as I understand them. If I'm wrong, let me know.) It is a bit different here: First, the approach is only a simple mapping, which does not interfere with any existing functionality. Secondly, modifying all template definitions is not easy: Most template definitions are edit protected and require consensus on changing them, so you would have a hard time justifying all the changes. There is no process to align infobox definitions, no hierarchies of infobox templates etc. While our approach works within days, fixing everything will take forever. (My personal opinion: Without any software or process like this proposal supporting it, it will never happen.) If you take your thoughts further, you will probably realise that even fixing the infoboxes is not the appropriate thing to do, but you would have to place proper knowledge representation, most likely RDF, at the core of MediaWiki infoboxes. There have been tremendous efforts in this direction for years, which will hopefully succeed eventually. What we present here, is a simple intermediate step, which helps to get it right. It allows to show people some benefits of semantics without the need for dramatic changes in Wikipedia. If I didn't understand you correctly, please ask further questions. -- Jens Lehmann ( talk) 14:19, 19 November 2009 (UTC) reply

I don't actually think it would be that hard to get Wikipedia wide consensus to standardise infobox parameters like birth_name, birthname and birth name, and I reject the idea that it would take forever to do. I doubt it would take more than a week for a bot to run through. We've handled harder stuff than this before. Hiding T 14:35, 19 November 2009 (UTC) reply

I'm more for doing changes to parameter names on an ongoing basis as desired; after all, the same sort of thing can be done in the database, it's either effort in one place or in another. The main thing I'd like to see is the templates with their semantic information becoming a standard feature so it is easier in the future to extract and check data and compare it with other sources. Once you have a template doing something useful for people on Wikipedia, like automatically doing simple checks on the data, the rest can follow easily. It does have to be set up whilst considering databases and other Wikipedias though, or it will be much less useful than it might be. Wouldn't it be nice to be told you'd entered 12th Nob 2009 instead of Nov? Dmcq ( talk) 14:55, 19 November 2009 (UTC) reply
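(Illustrative note: a check of the "12th Nob 2009" kind could look roughly like the Python sketch below. It assumes one English day-month-year format with three-letter month abbreviations; the function name and the accepted format are assumptions made for illustration, and other language editions would need their own month tables.)

```python
import re
from datetime import date

# English three-letter month abbreviations (an assumption of this sketch).
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

DATE_PATTERN = re.compile(r"\s*(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]{3})\s+(\d{4})\s*")


def parse_infobox_date(raw):
    """Return a date for values like '12th Nov 2009', or None if the
    value does not look like a valid date in that format."""
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None  # unknown month abbreviation, e.g. "Nob"
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # impossible calendar date, e.g. 31 Feb
```

A tool built on such a check could flag the value for the editor instead of silently storing it.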

Totally. But wouldn't it also be nice to be able to enter birthdate and not have to go look at the docs when it doesn't work in the particular infobox you are using? Hiding T 15:04, 19 November 2009 (UTC) reply

Writing such a bot might indeed be possible (although I think it's much harder than you assume if you think e.g. about how cumbersome parsing of template rendering can be). However, for the bot to work it has to know how to map attributes. This is exactly the purpose of this RFC: to establish a mechanism for mapping attributes. At a later stage this can be used to materialize the mapping by physically renaming template attributes as you suggest. -- Soeren1611 ( talk) 14:57, 19 November 2009 (UTC) reply

Um, I could set up an AWB run now to convert all instances of birth_name to birthname (chosen at random) in articles which transclude "Infobox Foo". It's not hard at all. Hiding T 15:06, 19 November 2009 (UTC) reply
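(Illustrative note: the kind of replacement described above can be sketched in Python as follows. This is not how AWB works internally; it is a plain regular-expression sketch with a hypothetical function name. It assumes parameters are written as "| name =" and it renames every such parameter on a page that transcludes the template, so nested or unrelated templates on the same page would also be touched.)

```python
import re


def rename_parameter(wikitext, template, old, new):
    """Rename '| old =' to '| new =' on pages that transclude the template."""
    transcludes = re.search(
        r"\{\{\s*" + re.escape(template) + r"\b", wikitext, flags=re.IGNORECASE
    )
    if transcludes is None:
        return wikitext  # page does not use the template, leave it alone
    pattern = re.compile(r"(\|\s*)" + re.escape(old) + r"(\s*=)")
    # A function replacement avoids backslash surprises in the new name.
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), wikitext)
```

For example, calling rename_parameter(text, "Infobox Foo", "birth_name", "birthname") leaves a neighbouring birth_place parameter untouched, because only the exact parameter name followed by "=" is matched.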

I totally agree that this is easy. This wiki has a huge number of very active editors who would help. We don't do perfect solutions that do everything at once with no human help. (That might be hard, I admit.) We would do it separately for relatively homogeneous classes of infoboxes, and an occasional mistake by the conversion script would be no more dramatic than an average vandalism edit. Hans Adler 15:26, 19 November 2009 (UTC) reply

Hm, here let us contribute our mapping so the replacement is easier. I dumped it from our database: Preview (first 1000) total as tar.gz (15198). The first row is the property we mapped to (still in camelCase, all lowercase would be better), the second is the property as it occurs in the template, the third row is the template name (cave: it is all lower case, we have a script that corrects that, I will let it run over it, soon). We have a lot of data, which we can give to support cleaning Wikipedia, so if you think about cleaning everything, let us help, tell us what you would like to have. SebastianHellmann ( talk) 16:59, 19 November 2009 (UTC) reply

Soeren, that's not at all how Wikipedia works. We don't plan, we act. You can plan as much as you want for a year if you wish, advertising the fact in the most populated locations until you are getting on everybody's nerves. But you can only fully carry out the result if there is a consensus to do so, and you will not find out before you implement it. That's because most stakeholders won't notice before they see your edits on their watchlist because you touch an article they are interested in. Of course due diligence requires that you make an effort to gauge consensus before large scale changes. But it can still happen that you only learn that there isn't in fact a consensus after you have started the implementation. That's why we only do incremental development here. The waterfall method is even more ineffective here than in software development. Hans Adler 15:39, 19 November 2009 (UTC) reply

I referred to the time required to align all infoboxes (not a particular property). We estimate that 49k attributes need to be mapped to 2k properties, e.g. there are more than 10 used spellings of birth place and it might be hard to detect those. (Even if those are fixed, you still do not know that e.g. the Canadian senator infobox template represents persons or what unit for an attribute is used in a particular template.) Do you think this can be done easily and how would you do it? -- Jens Lehmann ( talk) 15:48, 19 November 2009 (UTC) reply

I don't quite follow you. It'll be relatively easy to do just by looking at each infobox and working forwards from there. There's a project underway that's standardising banner templates, they're nearly finished now and I think they've been rolling through for about a year, so if you've spent a year on this already... What you find on Wikipedia is that once you harness the power of the community things happen incredibly swiftly. If we had a consensus on what to change, it could be coded into AWB and it would happen relatively quickly as part of every AWB user's editing patterns, or a bot could do it. I'm not going to challenge your data about 49k attributes and 2k properties, but that really doesn't sound right to me. It might be you have a specific goal that Wikipedians don't, and you're mapping attributes to properties that we on Wikipedia wouldn't. I'd probably just engage my brain to know whether/how the Canadian senator infobox template represents persons or what unit for an attribute is used in a particular template. I find that easiest. Hiding T 18:11, 19 November 2009 (UTC) reply

What we mean is: currently there are about 49111 different infobox parameters. Now, let's say you rename birth_place, birthPlace and placeOfBirth to birthplace, you will only have 49108 different infobox attributes. We estimate that if you put everything together that belongs together (which has the same meaning), then there will only be 2000 different infobox attributes left. If you want to find all Musicians from Canada, then you can check all articles for an infobox attribute birthplace for Canada, where the infobox template is annotated as class Musician. SebastianHellmann ( talk) 19:34, 19 November 2009 (UTC) reply
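(Illustrative note: the counting argument and the "Musicians from Canada" lookup above can be sketched in Python. Everything here is an assumption made for illustration: the mapping table, the class annotation, the function names and the three sample articles are invented, and a real extractor would read them from the annotations and from Wikipedia itself.)

```python
# Hypothetical mapping from spelling variants to one shared property name.
PROPERTY_MAP = {
    "birth_place": "birthplace",
    "birthPlace": "birthplace",
    "placeOfBirth": "birthplace",
}

# Hypothetical class annotation per infobox template.
TEMPLATE_CLASS = {"Infobox musical artist": "Musician"}


def normalise(attributes):
    """Rename known spelling variants; unknown attributes pass through."""
    return {PROPERTY_MAP.get(name, name): value for name, value in attributes.items()}


def distinct_after_mapping(attribute_names):
    """Count distinct attribute names once variants are merged."""
    return len({PROPERTY_MAP.get(name, name) for name in attribute_names})


def find_articles(articles, wanted_class, prop, needle):
    """Return titles whose template has the wanted class and whose
    normalised property value contains the needle (case-insensitive)."""
    hits = []
    for title, template, attributes in articles:
        if TEMPLATE_CLASS.get(template) != wanted_class:
            continue
        value = normalise(attributes).get(prop, "")
        if needle.lower() in value.lower():
            hits.append(title)
    return hits


# Invented sample data, for illustration only.
SAMPLE = [
    ("Artist A", "Infobox musical artist", {"birth_place": "Toronto, Canada"}),
    ("Artist B", "Infobox musical artist", {"placeOfBirth": "Leipzig, Germany"}),
    ("Politician C", "Infobox Prime Minister", {"birthPlace": "Ottawa, Canada"}),
]
```

Merging three variants into an already existing birthplace removes three names from the total, which is the 49111 to 49108 step in the comment above. The lookup only returns "Artist A" for class Musician and "Canada", because the third article uses a template that is not annotated as Musician.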

What I am disputing is the idea that everyone would agree that the 49000 can be condensed to 2000. I'd also dispute your method for finding all musicians from Canada, since not all articles have infoboxes. I'd likely mine the categories, but that wouldn't be perfect either. I really think the thing that doesn't work at the minute is that there are no standard terms for infobox parameters that could quite easily have them, and people in this debate concede that this is a problem. I think the place to start is talking about standardising the most popular infobox parameters. From tiny acorns... Hiding T 20:52, 19 November 2009 (UTC) reply

This is actually a proposal where and how to discuss standardization. With birthplace it's probably obvious, but with other attributes it's not so clear. We just propose to put the above table at the doc page to have a place to talk about standardization and gather information required for cleaning the infoboxes someday. BTW there are 620 musicians from Canada, infoboxes do not cover everything, but it really helps to assign classes. SebastianHellmann ( talk) 07:58, 20 November 2009 (UTC) reply

Um... there are way more than 620, there are 896 stubs alone. I get what you are trying to do; what I am trying to explain is the best way to engage the Wikipedia community on the issue. I won't lose any sleep if you fail to do it, because the cost to me is low. But the way to do it is to propose standardising the easy fields. The rest will follow. Hiding T 18:36, 20 November 2009 (UTC) reply

Hm, it seems we still misunderstand each other. So you start standardizing the 20 most common attributes, maybe with a couple of tens of thousands of edits in templates and articles, what then? Do more? How would you reach convergence? At a certain point somebody might think an attribute should have a different name and start changing it back, because he thinks that populationCensus is probably different from populationTotal or populationEstimate. If you project that into the future, when would you reach the more specialised domains like Soccer or Medicine? How many edits will it require in total until most of the infoboxes are cleaned up? Categories are great if you have fixed dimensions and only one set of articles you look at. Look at this here: All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants. I agree that the easy fields are easy. What about the rest? SebastianHellmann ( talk) 14:58, 21 November 2009 (UTC) reply

It's almost definite that someone will change things back. Welcome to Wikipedia. Look, as a Wikipedian with a lot of experience as to how these things go, I'd ask you just to accept that it will happen. It works in a similar way to the way an encyclopedia was built after Jimmy Wales and Larry Sanger decided to use wiki software to facilitate the writing of an encyclopedia anyone can edit. Everyone said it couldn't happen, yet here you are asking for a way to mine the data that shouldn't exist. Don't worry about the number of edits that need to be made, don't worry about the things that can go wrong. If you engage the community, you will quickly discover how trivial those concerns are. And the way to engage the community is to keep it simple and fix a problem everyone can identify with. The community will engage, and some of them will really engage and drive the standardisation on to your desired end. It really is that simple, if you can adopt the right approach. Hiding T 17:32, 23 November 2009 (UTC) reply
One should add that "all" refers to those things available in the knowledge base. -- Jens Lehmann ( talk) 18:40, 21 November 2009 (UTC) reply

What do you mean by standardising attributes/fields? Do you mean aligning the spelling of an attribute (rename birth_place, born, placeOfBirth to birthPlace etc.) or the integration of the mapping in the template definition? Would you go for point 1 or 2 mentioned in the Alternatives section of the RFC above? -- Jens Lehmann ( talk) 18:40, 21 November 2009 (UTC) reply

(Response to Jens Lehmann) Well, I totally disagree about the part starting "If you take your thoughts further". "Proper knowledge representation" such as RDF isn't any more appropriate in Wikipedia (mind you, I am talking about the wiki including most of the templates, not about the underlying software) than CORBA, DOM or similar systems would be for Linux or BSD kernels. These things don't fit into the overall culture. "Wiki" is Hawaiian for "quick", and that's very close to the KISS principle. You can't introduce frameworks with a relatively steep learning curve directly into such an environment. You must hide them. Hans Adler 15:22, 19 November 2009 (UTC) reply

I wrote (and meant) " MediaWiki", which is the software underlying Wikipedia, whereas you wrote "I am talking about the wiki including most of the templates, not about the underlying software". That might be the reason for disagreement. (See also: Semantic_MediaWiki) -- Jens Lehmann ( talk) 16:02, 19 November 2009 (UTC) reply

Ah, sorry, "MediaWiki infoboxes" didn't make much sense to me since we implement infoboxes as MediaWiki templates, so I just ignored it. I withdraw that part of my comment. Hans Adler 17:01, 19 November 2009 (UTC) reply

You're right. "infoboxes" didn't make too much sense in that phrase. I can also understand your other points (after all, constructive criticism may help the discussion), but my impression is that the above proposal is not an "unintelligent" way to do it. There is duplicate code in the sense that the mapping templates contain a fraction of the infobox attributes (those which should be mapped to the vocabulary on MetaWiki), but I'm not sure whether this is as bad as you think. Often, the doc pages contain syntax + example of a template, which is also duplicate information. So any change to a template usually implies modifying its documentation as well. So, I do not see a very high risk of annotation rot (it could also be checked automatically). Another point is that it is a pretty hard requirement on editors to modify template definitions to enable this kind of mapping, in particular since this might require complex syntactical constructs. Maybe we should invite a template expert into the discussion. -- Jens Lehmann ( talk) 19:26, 19 November 2009 (UTC) reply

(continued) I fail to see how this kind of thing serves any purpose that benefits Wikipedia in any way. Wikipedia could benefit if such information were integrated intelligently in the templates themselves and formed an additional incentive for uniformisation of parameter names. But what you did there leads to duplicate code and invites annotation rot. It looks as if based on a philosophy of minimising the cooperation with Wikipedia editors. Instead, to do these things properly you need the support of one or more of our template experts. You can find them in the editing histories of the more difficult templates. I am sure they will find a way to create self-documenting templates that "know" the "types" of their parameters and can have an optional outer layer that can present the parameter name as playername to the user. Test it with a few minor templates, get a separate local consensus for applying it to each of the most frequently used templates, and then get a global consensus for applying it basically everywhere.

As a practical matter, since yours seems to be mainly a German project, you might want to consider developing it at de.wikipedia first, taking into account what you learned here. I guess they are more open to such things, and once it works over there you can roll it out here. Hans Adler 15:22, 19 November 2009 (UTC) reply

We have a lot of data which can help improve Wikipedia (English only), such as these template property mappings (Preview (first 1000), total as tar.gz (15198)). We also have a synchronized version of Wikipedia in RDF. We can provide lists and data that could tell you exactly where inconsistencies are in Wikipedia infoboxes (like languages that are not links, but plain text in infobox_language, see here). How should we make that data useful, then? What process should there be to decide which datatype to put in which field, for example? Please tell us your opinion. -- Preceding unsigned comment added by SebastianHellmann ( talk • contribs)

Sorry, but I have no idea how we could use this information. It doesn't strike me as particularly helpful, although I might simply be lacking in imagination or relevant experience. Hans Adler 17:40, 19 November 2009 (UTC) reply

If editors annotate templates with this extra information then they make the information more accessible and easier to check. There would be no extra effort on the part of people using the templates. The facilities wouldn't affect the body of templates, only the parameters would be annotated. For Wikipedia it would enable the parameters to be checked and errors found automatically - a person could write a template expecting a date, and uses could be checked by a bot, or directly if it is simple enough, without anyone else having to do anything special. Robots could also be used to check corresponding infoboxes in linked pages on other sites, though that would probably be better left to DBpedia. Overall it would advance the aim of making information freely available with little effort on the part of most editors - and it would automate some of the bot efforts. I can see very little wrong with it and a great deal of gain. Dmcq ( talk) 18:46, 19 November 2009 (UTC) reply
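As a rough illustration of the bot-side check described above, here is a minimal Python sketch. It assumes a parameter has been annotated as a date and that the bot already has the raw parameter values per article; the list of accepted formats and the function names are hypothetical, not part of the proposal.

```python
from datetime import datetime

# Hypothetical set of accepted formats; a real annotation would define its own.
ACCEPTED_DATE_FORMATS = ("%b %d, %Y", "%B %d, %Y", "%d %B %Y", "%Y-%m-%d")


def is_valid_date(value: str) -> bool:
    """Return True if the value parses under any accepted format."""
    text = value.strip()
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def find_bad_dates(uses: dict[str, str]) -> list[str]:
    """Given article title -> raw date parameter value, list titles that fail."""
    return [title for title, value in uses.items() if not is_valid_date(value)]
```

A bot would only report the titles returned by find_bad_dates for a human to review; nothing here blocks or rewrites an edit.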

Flexibility question

I've noticed an example in the discussion that implies that the proposed schema would flag/warn an editor if the content entered at a parameter does not fit with an "acceptable" set (the example was that it would trigger if an editor entered "Nob 19, 2009" instead of Nov in a date field).

Am I reading this right?

Also, does this mean that the schema will force output?

The reason I'm asking...

Limited input options - Using birthdate/deathdate in biography 'boxes as an example. There are at least two ways to generate the date that shows up in the "Born" field. The most common are to either type out the entire date or to use a specific template that converts numbers into formatted text along with an age. If the schema is looking for only the "day month, year" format (and it would have to be 9 or so variations of that) it would hang on the use of the templates. Aside from dates there is also a question of what this would do to links in the text entered at a parameter.

Limited output - Right now there are 'boxes that take information entered into them as plain text and convert it to a specific link. If the schema limits what is shown in the 'box, this could be seen as damaging to those boxes.

- J Greb ( talk) 23:34, 19 November 2009 (UTC) reply

This isn't part of the current proposal, but rather what could be done with the information after the proposal has been implemented. And even then it wouldn't have to be implemented for all parameters (so the output of some special parameters wouldn't be limited). It would limit input options for some parameters, but that's the point of this checking, and if some template is used commonly, I'm sure it would be among the allowed variants. Svick ( talk) 01:17, 20 November 2009 (UTC) reply

The problem of template use in parameters could be dealt with by expanding parameters before checking them, so only the visible text was checked. I would have thought any checking would best be done for starters by a bot patrolling things rather than when editing; that would have the least impact till everything was working nice and cleanly. Dmcq ( talk) 01:38, 20 November 2009 (UTC) reply
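To make the "expand first, then check" idea concrete, here is a small Python sketch. It handles only a simplified call of the form {{birth date|YYYY|M|D}} and renders it as ISO date text; the regular expression, the function name and the choice of ISO output are assumptions for illustration, and a real bot would presumably rely on MediaWiki's own template expansion instead.

```python
import re
from datetime import date

# Matches a simplified {{birth date|YYYY|M|D}} call. Real template syntax
# allows more parameters, which this sketch deliberately ignores.
BIRTH_DATE_CALL = re.compile(
    r"\{\{\s*birth date\s*\|\s*(\d{4})\s*\|\s*(\d{1,2})\s*\|\s*(\d{1,2})\s*\}\}",
    re.IGNORECASE,
)


def expand_birth_date(value: str) -> str:
    """Replace each simplified birth-date call with ISO date text.

    Raises ValueError if the numbers do not form a real calendar date.
    """
    def render(match: re.Match) -> str:
        year, month, day = (int(group) for group in match.groups())
        return date(year, month, day).isoformat()

    return BIRTH_DATE_CALL.sub(render, value)
```

Text that contains no such call passes through unchanged, so the same checker can then run on typed-out dates and on expanded template calls alike.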

Things are never easy. It struck me that expanding the templates has a little problem if they are ever used to give different data according to user preferences. If I wanted as many dates as possible to have the day of the week output as well as the date, then one really couldn't expect a parameter expecting a date to cope with getting a day of the week as well. Not sure how one would get round a problem like this - having an internal format annotation added to the expanded template? Dmcq ( talk) 13:15, 20 November 2009 (UTC) reply

~~=== More suggested reading for DBpedia project members ===~~

Wikipedia templates severe limitations - solve this problem first

One of the reasons that there is duplication of templates with variances in setup, etc., is because of the quite limited functions allowed in templates. Unlike almost any programming macro language anywhere else, there is no capability to do something as simple as store a value in a variable, nor is there any looping capability. There is no way to test for a global variable - something that could be very useful to Wikipedia (example, a global variable to say whether a page is part of an archive or it is live, which could be tested by a template on the page). What happens is in order for a template to handle any level of complexity, you end up with either many transclusions of other templates within a template, or repetition of code.. in essence, what should be a simple process in most programming languages becomes a huge, complex template on Wikipedia. To create "duplicates" of templates with minor variations is a necessity in an environment where there are such constraints in programming capabilities that coding a template to cover every situation creates a template that is too difficult to use. Creating similar templates with minor variations allows the templates to remain comprehensible to the editors that try to use them.

There have been efforts to change this, but so far nothing has come out of it. I don't think adding another layer on top of this mess is a good move for Wikipedia.

Here is some reading on these limitations/problems of templates on Wikipedia [2] [3] [4] stmrlbs| talk 04:24, 22 November 2009 (UTC) reply

Thanks for your post. As mentioned above, I think it is good to have a neutral template expert joining the discussion. Just to clarify: Are you against alternative 2 in the alternatives section above or against the RFC itself? Your line of argumentation is that templates would become a mess if we include more in them (given that they allow only very simple structures). I agree with you on this, which is why I dislike alternative 2. In case you are opposing the RFC itself, one should consider that the templates are quite simple and that they do not completely duplicate infobox attributes (only those which should be mapped to the vocabulary on MetaWiki - see the drawbacks section above). -- Jens Lehmann ( talk) 07:59, 22 November 2009 (UTC) reply
As a minor sidenote, I think that discussion titles like "Suggested Reading for DBpedia project members" imply that we know less than most readers here about the subject. However, some DBpedia members know templates very well and are aware of their limitations, while other non-DBpedians participating in the discussion might be less knowledgeable in the area. I would appreciate it a lot if the discussion titles would refer to their content (e.g. "Limitations of Templates" in this case). -- Jens Lehmann ( talk) 07:59, 22 November 2009 (UTC) reply

I think the first title of this kind was justified since it addressed the social side of Wikipedia, and the communication problems that have led to deletion of your project's first efforts seem to indicate a general unfamiliarity with that. In my opinion you need at least one person who has been fairly active in many areas of Wikipedia (preferably the English one) for about half a year, and who is similarly familiar with your project. Hans Adler 08:44, 22 November 2009 (UTC) reply

Our edits were legit respective WP:BOLD and WP:IAR. We thought it was a good step and an iterative, easy improvement (after your doctrine We don't plan, we act). We were met with hostility, because of our logo, which really seemed like spam (which we did remove now). Basically, now it is just one line in the doc subpage to collect information, which is useful for any extraction approach. Wikipedia is for everyone and it should be easy to contribute without first reading hundreds of pages and becoming a standing member for years WP:NOTBUREAUCRACY. If we consider the minimum goal of this proposal, i.e. collect information about content in template infoboxes, what is so bad about that (except taking up a line of space)? We just want to create a resource that can be exploited either by Semantic Web people like us, other extraction approaches or maybe by a script (not a built in template function) written to help Wikipedia editors. This is a strict Wiki process: Start something, other people see it, if they like it or have benefits, they will join and contribute. Now here, the discussion is drifting towards a very large scale such as improving underlying software, extending template functionalities and cleaning all infoboxes, where it should only be about a table in doc subpage. This drift in the topic is of course our fault, too, as we wanted to show the large scale benefits. I agree with you that it might not be the perfect method for this global kind of achievement. But what about starting out small, i.e. collect annotation information with a better-than-nothing mentality? A completely independent layer has the great advantage, that it does not break anything. SebastianHellmann ( talk) 13:08, 22 November 2009 (UTC) reply

You can't learn about the workings of Wikipedia merely by reading policies and guidelines. These are merely incomplete and often faulty attempts to write down actual practice, and that's why they are being revised all the time. The main reason for WP:BOLD is that we want to do small changes efficiently. This doesn't apply in your case, and the guideline warns about the problem in at least one other, marginally related, case. [5] Also, applying WP:IAR correctly is a very difficult art that even some very experienced editors don't master. See WP:EXCEPTIONS for some context that may help you understand why it can't be used for large-scale changes. All the time people are being dragged to WP:ANI and get into difficulties for bold moves such as creating English stubs for all articles on German politicians that exist on the German Wikipedia. Before you have seen these things you don't know enough to do large scale changes on your own. As I said, you really need someone who feels like a member of both projects to translate and make sure you avoid misunderstandings. Hans Adler 13:26, 22 November 2009 (UTC) reply

There were discussions (and support) with Daniel Kinzler (active since 2004 in the German Wikipedia, MetaWiki, Commons; software developed paid by Wikimedia Germany) and Brion Vibber (former CTO of the Wikimedia foundation and lead developer of MediaWiki). But isn't it also one aim of an RFC to point out issues? Can't we learn ourselves about the potential problems while applying the approach? (It's not like we are not willing to understand how Wikipedia works, although in my opinion Wikipedia is very diverse, so everything you learn can only serve as a guide, but not a rule. Common sense should prevail, at least in the long term.) Realistically speaking, what are the alternatives apart from abandoning the RFC? Shall we hire someone for covering the social aspect of adding templates to doc subpages of infobox templates? Or wait for someone to jump in and fill the gap? -- Jens Lehmann ( talk) 10:35, 23 November 2009 (UTC) reply

Perhaps I am too pessimistic, but I don't think the RFC is going to have any useful result. That's because most of what you are saying sounds very abstract to me. I simply don't understand where you are coming from. And I guess it's the same for the other participants. We can't come to any informed decisions here because we don't have a common language. We need an interpreter. Perhaps you can get Daniel Kinzler to restart the RFC. The English and German Wikipedias are quite different beasts (the German Wikipedia having an unwritten principle: "Don't mention the English Wikipedia") and judging from Daniel's account it seems editing is a low priority for him, so there is no guarantee it would be successful. But it's one thing that might be worth trying. Or Brion Vibber, although like Daniel he has a WP edit count that's as low as mine, and I started only 2 years ago. The reason I once suggested you try it on the German Wikipedia first is that there you are more likely to find an experienced volunteer who you can meet in real life and show what your software does etc. Such a user could then translate what they have seen in a language that the community understands. (I wasn't trying to tell you to get lost and bother a sister project, although I now realise it probably sounded as if I did. Sorry for that.) Hans Adler 12:20, 23 November 2009 (UTC) reply

Hans Adler has much more experience with Wikipedia policies than I do and I have no problem deferring to his knowledge of Wikipedia's way of doing things. I did not intend to hijack his title/response, and have changed the title of my section after reading the responses. I did want to note that templating language was written with the intent that it was for text presentation only, and therefore should not be a programming language. I think that this "guideline" has created a lot of problems down the road, with Wikipedia growing so rapidly and needing more programming capability in templates to be able to create more consistency in so many variations of presentation in so many different types of articles. Hans said earlier that it sounded to him as if the proposed solution was to "Build a layer of abstraction over the mess, Wikipedia-side, which requires individual coding for the infobox templates." I agree with this assessment. To be able to database certain types of Wikipedia information is a nice concept, but to alter Wikipedia in order to accomplish this is not feasible and will probably not be enforceable long term. As Sebastian said, this is a place where people freely edit, and are you going to guard these templates when editors change these new parameters? Make another new rule for templates? There is no guarantee these new parameters will remain as you put them in. When you have to add another artificial layer to input to derive information from that input, it just never works long term - unless you have a small controlled input base. Wikipedia is not a small controlled input base. What I would suggest is for this group of computer scientists to help the Wikipedia volunteers to find a solution for the problem of finding another programming macro language that will fit the constraints of Wikipedia (security problems, etc. [6]). Then there will be a natural movement to convert existing heavily used templates to better address Wikipedia needs, and with that conversion, standards for uniformity of parameters will be a more natural progression at that point. Help fix the basic underlying problem, don't just add another artificial layer on top. stmrlbs| talk 20:36, 22 November 2009 (UTC) reply

What do you mean by guard these templates? One small risk is that editors make syntax errors, which can be detected and corrected. Did you mean the template definition in the Template: namespace? Well, software has to be slightly adapted, changes might not occur so often and there could be new versions; of course, we would welcome that. We just want to contribute our initial mapping, then it is intended that it should be freely changed to include new mappings and improve the old ones. So we do not intend to lock anything in this respect. The aim is to map template attributes to OWL properties (basically reach convergence, so that attributes with the same meaning and value are mapped to the same string). It is a process to create a Semantic Web vocabulary. Normally, the core of such a vocabulary tends to become stable very fast. It is also easy to reach a good coverage. Tool assistance can be easily created with the help of a User Script (see Greasemonkey). Increasing expressivity of the template language is not a solution imho, as only experts can join in modelling the vocabulary and it is also a security issue as you pointed out. Two easy extensions would be to create RDFa based on the template mappings (which is an easy script, maybe half a week in developing time) or to anchor the mapping table in the software (also half a week). Semantic Web data is spreading very far: Madonna@BBC. Note the suggested edit links of the BBC back to Musicbrainz and Wikipedia (in small font size). This is already good practice and is being adopted by many sites. So there is a need to act and provide guidelines and tool assistance. These guidelines should not be strictly enforced, but rather tools should make suggestions, based on the annotations. This will help. SebastianHellmann ( talk) 08:16, 23 November 2009 (UTC) reply

"Guard" is perhaps not the right term, but you would have to make changes in your inner templates if a property was added or changed - right? To keep the mapping accurate and up-to-date. So, you would have to watch all the infoboxes for any changes that might affect your mapping. My question is why can't you do this on your end? Keep information on each infobox and what the different parameters mean and what they map to in your system, and build your database on that? Interpret the properties on your end, instead of adding information that probably most people will not understand or use in Wikipedia? stmrlbs| talk 04:17, 25 November 2009 (UTC) reply

We currently do have our own internal database with many mappings, otherwise the demos would not work (for instance the Wikipedia facet based browser). The goal of the RFC is to move away from that. An internal database has some disadvantages: Long term maintenance of mappings is probably difficult, when they are kept up-to-date in a closed separate system. Wikipedians have no control over how the data is extracted, so they cannot influence how external tools use it. The approach is a relatively light weight way to have semantic support in Wikipedia. Also, as Dmcq mentioned, it should also be the case that the mappings do have some immediate use for Wikipedians. (I also disagree that they are very difficult to understand for technical users, which are able to edit infobox template definitions.) I added a paragraph to the Alternatives section of the RFC with this information. -- Jens Lehmann ( talk) 06:41, 25 November 2009 (UTC) reply

Jens, when I look at the examples, the original templates are missing. When I look at the history, I see they were deleted. If you wish to demo how this works, I would suggest setting up the DBpedia template(s?) in your User space, and calling them from there in your examples. That way, they won't be deleted until you are through using them. This is what I do when I've made changes to templates and want to test the changes without affecting anyone. Also, imo, you need to explain this more clearly - "Wikipedians have no control over how the data is extracted, so they cannot influence how external tools use it." - How are your changes going to help Wikipedia control what data is extracted? stmrlbs| talk 01:35, 28 November 2009 (UTC) reply

We have been creating templates in our user spaces for many months before we started to add them to Wikipedia. We also own installations of MediaWiki. Our initial effort was certainly not the best way to do it - we had a DBpedia logo in there and the rendered template took a lot of space. If you need a template definition for the template we propose, we can add it. (I don't know how this helps you though, since the RFC page already contains examples and a mockup showing what it looks like when rendered.)

"How are your changes going to help Wikipedia control what data is extracted?" It doesn't help Wikipedia to control "what" data is extracted, but "how" it is extracted. For instance, the template provides information that birth_place and born both stand for "birthplace" (in more general terms it provides semantics to infoboxes), so it is quite obvious to me that this controls how data is extracted by DBpedia and others who will likely follow. If you then look at tools building on this, e.g. here, information from different Wikipedia infoboxes is mixed (all mapped to the class Person or a subclass of it) with different attributes (born, birth_place, ...). So the template allows you to control how the data is extracted and presented in such tools. The RFC seems to be badly written and extremely hard to understand. Maybe you can point me/us to particular sentences/parts which are confusing/unclear. We could extend the RFC to be much longer and more detailed, but then most people won't read it. (To be honest, my impression is that some people seem to stop reading before the "Alternatives" section.) -- Jens Lehmann ( talk) 07:57, 29 November 2009 (UTC) reply
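The birth_place / born example can be sketched in a few lines of Python. This is only an illustration of what an extraction tool might do with such a mapping: the table contents, the name PERSON_MAPPING and the function normalize are hypothetical, and the actual annotation syntax on the doc subpage is not shown here.

```python
# Hypothetical mapping table for one infobox template; names are illustrative.
PERSON_MAPPING = {
    "birth_place": "birthPlace",
    "born": "birthPlace",
    "placeOfBirth": "birthPlace",
}


def normalize(infobox: dict[str, str], mapping: dict[str, str]) -> dict[str, str]:
    """Rename mapped attributes to the shared vocabulary; keep the rest as-is.

    If two source attributes map to the same target, the later one wins.
    """
    return {mapping.get(key, key): value for key, value in infobox.items()}
```

Because the mapping lives outside the infobox itself, the attribute names that editors type in articles stay untouched; only the extracted view is unified.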

I vote for the alternative of having the interface definition on the same page as the template definition. It seems silly to me to separate them; it is just asking for trouble with changes being out of step. The idea of guarding is very bad if anyone was seriously proposing that. The Wikipedia editors are the ones that will have to be shown the benefits and convinced to write interface definitions for the templates. Dmcq ( talk) 15:04, 23 November 2009 (UTC) reply

longer than "long term" goal and very short range steps

Writing as a librarian with a moderate amount of Information science background, I certainly would be a strong supporter of using structured data as much as possible, and even generating complete article content from them. Though I'm not an expert in what is fashionably termed ontology, it's basically structure and standardization, and making relationships explicit. Ultimately, I think we should go very far in this direction--so far that people could even choose at display-time the format and structure and size of articles: that we would have a true data base of information for articles and ways of assembling it. This may not be possible as a gradual change in the present Wikipedia--it may instead be the basis for the sort of encyclopedia that will supersede Wikipedia. At present, DBpedia and other projects are talking only about the sort of simple numerical or allied data that can be easily handled that way, but I see it as a basis for the organization, including both the writing and the presentation, of what we think of as the free text of articles and abstracts. I note that in the field familiar to me, the writing of biomedical research articles, there's a good deal of this being incorporated: indeed, PubMed Central's versions of articles are constructed on the fly. I would approach this by first working with the article types that can be totally structured in this way. Eventually, we will get to ways to deal with such amorphous subjects as historical and political events. Some of this might be developed more rapidly than one might at first think, as for types of biographies. It would at the least require the rewriting of all existing articles, which would in my mind be a very good thing indeed.

I am not really convinced that any of the current formulations of this--whether SMW or RDF--are really optimal, or what will be the ultimate generally used formulation. Much of the discussion of ontologies is so general as to encompass a wide range of purposes besides that of producing something resembling an encyclopedia, and we may eventually develop our own. But I am quite aware of how my own profession's attempt at this ended in the total mess of MARC formats--which, like Wikipedia, will probably need to be worked out from scratch all over again. Whether we will want to incorporate subschemes is another unclear question.

The immediate practical goal, though, would be first standardization of terminologies for data elements in infoboxes--and I am not convinced that a formal ontology-based model will be the most productive at the start--and second the use of microformats for the data within them. This is the first step towards any rational continuation and expansion.

I see a discussion above about other language Wikipedias. I think this in fact might be the immediate focus, not just an additional product. Using the precedent of the multilanguage descriptors in Commons, if the relationships are properly worked out, they should be appropriate for all languages, and we might do well to think of working this way from the very start. A date of birth in one language Wikipedia is immediately usable in another, and the microformat code can deal with the way it is expressed. The database behind the individual Wikipedias can be a single one.

I hope I am not expressing things in too naïve and informal a way for the specialists. DGG ( talk ) 04:30, 22 November 2009 (UTC) reply

I'm basically in favour of making the data more easily available and easier to check and working together with other languages. However, any goal in Wikipedia has to be achieved by an evolutionary strategy and in line with the skills and desires of its editors. That has a few consequences as far as I can see. One has to have a long term goal and an idea how to get there, otherwise the evolutionary strategy will lead to weird constructions and slow development. You get things like the eye, which works but has a blind spot which a lot of other strange mechanisms deal with. The evolutionary nature means that each step in the path must show appreciable benefits such that the costs will be endured. And the costs must be fairly small. It can't be implemented without using Wikipedia editors, as there are so many templates and new ones are continually being made up - they are the ones that have to be shown a benefit. -- Preceding unsigned comment added by Dmcq ( talk • contribs) 14:19, 22 November 2009 (UTC) reply

Can we just simplify all this

This is an RFC, but it is ending up as a conversation. I still don't understand what is being proposed. Is the proposal to break something, or just to plonk an extra bit of meaningless text on the doc page of each template, which can be safely ignored? I have tried to follow the argument. I have followed the explanatory links to a page that is hat-tagged as having multiple issues, and then, looking to a page to explain a standard, found the line that this can refer to two nearly compatible standards! I am sure it is all very clever, and the intention is noble, but what it is, I couldn't tell you. Could we simplify the sentences, bringing them back to a SUBJECT verb OBJECT format, removing all the subsidiary clauses and modifiers, so other editors can join the debate? -- ClemRutter ( talk) 13:25, 23 November 2009 (UTC) reply

It doesn't break anything and the additions to the doc subpage of a template definition can safely be ignored. In order to clean up the mess you mentioned, I moved the discussion part of the RFC to the discussion page and removed unnecessary parts of the article. The "Semantic Web" article you are referring to is tagged as having several issues, because of a recent heated discussion (which I do not want to go into). I am not sure about the "two nearly compatible standards". Can you help me out by providing a link? Feel free to ask about other confusing/complex sentences such that I can fix them (or edit the RFC directly if you like). -- Jens Lehmann ( talk) 16:20, 23 November 2009 (UTC) reply

I think you are going about this in the wrong manner. There is absolutely no need to change infobox attribute names, as a template interface can handle any mapping. There's no need to go around saying everything in Wikipedia needs to be standardized; that will just get up people's noses. Having a template interface will tend to get people to use standardized names over time, and that's quite good enough.

The idea of having a separate subpage of a template definition is, I feel, also wrong-headed. The template interface should be up front with the template definition. Otherwise it won't be seen and updated as necessary.

Standardizing the names in infoboxes is not something I see as conferring great value to editors in wikipedia. Skulking around putting in subpages or being bold and changing all parameters in sight are two opposite ends of the spectrum from cooperative editing. I acknowledge all this work furthers the aim of wikipedia but there needs to be something editors can see in wikipedia that is of advantage to them without going off to some database. Automated type checking of uses of templates is one clear advantage that people writing templates would be able to see fairly easily, it doesn't have to be wonderful just find any instances of date which don't parse well so they can check them. Then later on database aspects can be brought in to check that place names really are places for instance. Dmcq ( talk) 12:54, 24 November 2009 (UTC) reply
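To illustrate the kind of automated check described above, here is a minimal sketch in Python. It assumes template uses have already been extracted into (article title, parameter dictionary) pairs; the parameter names, the list of accepted formats and the function names are hypothetical and only serve the example. It does not reflect any existing Wikipedia or DBpedia tool.

```python
from datetime import datetime

# Hypothetical list of accepted date formats; a real checker would need more.
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%Y")


def parses_as_date(value: str) -> bool:
    """Return True if the value matches one of the accepted formats."""
    text = value.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def find_suspect_dates(uses, date_params):
    """Yield (title, parameter, value) for non-empty values that do not parse.

    `uses` is an iterable of (article_title, {parameter: value}) pairs.
    `date_params` names the parameters that are expected to hold dates.
    """
    for title, params in uses:
        for name in date_params:
            value = params.get(name)
            if value and not parses_as_date(value):
                yield (title, name, value)
```

The check only reports values for a human to look at; it does not try to correct them, which matches the modest goal of finding instances that "don't parse well".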

I agree with what you write (was it intended as a reply to my post?) except that I think the doc subpages could also be a good place. They often already exist for popular templates, are freely editable (infobox template definitions are sometimes blocked or semi-blocked), and the doc subpages are usually transcluded in the template definition page, so they are visible when looking at a template. The mappings are, in some sense, documenting the template, because they add meaning to its attributes. That said, I'm also happy with adding them directly to the template definition, although I fear a bit that there could be a lot of additional discussion with admins to temporarily unblock the templates. So an initial version, which could be done in a day, could take quite long instead. -- Jens Lehmann ( talk) 06:51, 25 November 2009 (UTC) reply

A summary: The big questions

I was reading this debate, and I needed to order my thoughts. I think others might like this list as well, so I'm just adding it to the pile.

Do we want to be able to do more with the infobox data ?
- I think it is pretty clear that everyone would really like to do more with this data
Implementation methods ?
1. Microformats
  - The standard's domain is too limited, and it is stated that it will never encompass all infobox options
2. Semantic MediaWiki
  - Problematic due to load issues
  - Will probably require significant engineering work before it can be deployed in Wikipedia. Likely requiring significant monetary investment.
3. RDFa
  - Requires changes to all infobox templates.
  - RDFa is also not valid HTML
  - Not allowed by MediaWiki atm. Though I heard that someone was working on making them valid output for the software.
  - Potentially disruptive to deploy
4. dbPedia parameter mapping
  - Either on meta or on all template doc pages
  - Easier to change, simpler to deploy
  - Not a "clean" approach
  - Easier to get out-of-sync
  - Potentially disruptive to deploy
5. Full Semantic support in the core software
  - Highly desirable
  - Cleanest solution
  - Most flexible
  - Most work/expensive
What is our timetable ?
- I think our timetable is "within a year". This already excludes most of the implementations and basically only leaves one of the dbPedia solutions
Do we think that any progress is better than standing still and waiting for magic ?
- Definitely.
So what is the best process to do the dbPedia mapping
1. On dbPedia
  - Less easy to share with other parties. We might risk showing a bias to 1 party
  - Fewer contributors to keep the infobox options in sync
2. On Meta
  - More potential contributors than on dbPedia.
  - Still requires people to navigate away from en.wp when they change an infobox
3. On the wikis themselves
  - Biggest pool of editors
  - Simplest to keep in sync
  - Probably more prone to vandalism

So after all this, I personally am left with 3 ideas.

1. Keep everything as is, but add links from the infobox documentation to the maintenance spot for that template's mapping in dbPedia.
2. Put all mappings on a separate Wikipedia website, with a dedicated interface for maintaining the mappings. It should be interwiki aware. Mark infoboxes with a template plus a link leading to their entry in the registry on this mapping website. Basically, this is the current dbPedia approach, but with the Wikipedia user database. We would separate dbPedia into an RDF mapping database under the auspices of WMF and the dbPedia web interface (or fully usurp it, of course).
3. The current proposal of putting the mapping on enwp, but either use dedicated and protected subpages, or make a new noinclude section in the template page itself.

In my opinion, those are our best options, and I'd prefer number 2. — TheDJ ( talk • contribs) 15:58, 23 November 2009 (UTC) reply

I don't follow all of the above but I think the consensus is that each wiki should be free to adopt their own approach. I also don't see why the timetable has to be a year nor why it isn't achievable to do it through standardising templates and then letting other people mine the data from there. Perhaps someone can clue me in "in layman's terms". Hiding T 17:36, 23 November 2009 (UTC) reply

I have exactly the same problem. We have been told (I believe) that DBpedia gets the data from reading our wiki pages. That it needs to know for each parameter of each infobox of what type it is (e.g. date, and more precisely date of birth; or colour as text; or colour in RGB form). That this knowledge is currently hosted by DBpedia. That it is hard to maintain because Wikipedia's infoboxes are so inconsistent with each other.

Now the inconsistency of the infoboxes is a bug, not a feature. Let's standardise the parameters with an eye to DBpedia's needs. (I.e. in a few cases we will need two parameters that mean almost the same thing, only one of which may be used, and that display in exactly the same way. Because the information is fundamentally different and so DBpedia handles it differently.) Let's write a central documentation page with the most common standard parameters. That's going to simplify editing for those of us who regularly deal with many different infoboxes. Let's even make it as compatible as possible with the citation templates.

Why is that not the cleanest solution for DBpedia? Hans Adler 18:48, 23 November 2009 (UTC) reply

Yes, we extract the information from Wikipedia articles. We recently started obtaining the information from a live update stream, i.e. when a Wikipedia article changes, the DBpedia framework is notified, parses the wiki markup and updates the RDF output. DBpedia does not have hard requirements when it comes to dates, length measurements etc. Rather, we apply heuristics to parse values correctly. In many cases the values can be parsed correctly, but sometimes they are a bit cluttered (for instance, a date is sometimes a link to a wiki page and sometimes not; sometimes references are added to the date, or markup is added which only modifies the layout but not the actual content). Standardising value formats for certain infobox attributes (you call them parameters) would be very welcome. However, at least from a DBpedia perspective, we do not need to force a specific value format for an infobox attribute; it would usually be sufficient to say that the value can have one of several accepted formats.

The core of the proposal is, however, to align different spellings of an attribute which actually mean the same thing, and (in rarer cases) to disambiguate between attributes which are spelled in the same way but mean different things. Your idea of having a central documentation page for attributes is not far from what we propose. The difference is that we suggest having one documentation page per property on MetaWiki and that we add more formal information as well. Here is a draft page for birthPlace, which contains information about the property (infobox attributes are mapped to RDF properties). It is a draft, quite ugly, and we may remove the "DBpedia" in the URL and elsewhere (we stopped those efforts until consensus emerges here), so hopefully you can ignore all that for the moment.

The intention is to document the property there. One can discuss it there, add information about all attributes which are mapped to this property (this can be done automatically), add information about what values should look like, pointers to discussion about this etc. This offers more benefits (for extractors like DBpedia and for achieving consensus) compared to "only" fixing the spelling of infobox attributes, although this would be a good thing to do anyway (and the mapping templates make it easy to see in which infobox templates corrections should/may be made). Are those explanations clear/helpful? Maybe the RFC can be updated to provide clearer explanations (and I may be too professionally blinkered to do it, although I'm happy to help). -- Jens Lehmann ( talk) 09:02, 24 November 2009 (UTC) reply
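The clutter mentioned above (dates that are sometimes wiki links, attached references, layout-only markup) can be illustrated with a small Python sketch. This is not DBpedia's actual extraction code; the regular expressions and the function name `clean_value` are assumptions made for the example, and they cover only a few simple cases of wiki markup.

```python
import re

# Footnotes: self-closing <ref ... /> or paired <ref ...>...</ref>.
REF_RE = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Wiki links: [[Target]] or [[Target|shown text]]; keep only the shown text.
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Bold/italic quote runs ('' or ''') only change layout.
QUOTE_RE = re.compile(r"'{2,}")


def clean_value(raw: str) -> str:
    """Strip references, link brackets and bold/italic markup from a value."""
    text = REF_RE.sub("", raw)
    text = LINK_RE.sub(r"\1", text)
    text = QUOTE_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

After such a cleaning step, a value could be tested against a list of accepted formats rather than one enforced format, in line with the point that several accepted formats are usually sufficient.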

So there would be a central database of template or infobox types in wikimedia along with standardized parameters names and their types which the individual wikipedias can refer to. Databases can then use that together with any template interface in an individual wikipedia to try and interpret the parameters of a particular template. I guess the database editors will patrol wikimedia fixing up problems and sometimes looking at template interfaces in individual wikipedias if they don't seem to be delivering what they said they would. I think I'm happy with that if that's the idea. Looking at that birthplace I must admit "rdfs:subPropertyOf place or location" sounds all wrong so the terminology isn't very obvious, subClassOf would be more what I'd have thought so it looks like they will need specialist skills there which would be out of place in any of the wikipedias. Dmcq ( talk) 13:21, 24 November 2009 (UTC) reply

The "rdfs:subPropertyOf place or location" was wrong and I removed it. It could still be that mostly people with some Semantic Web knowledge will edit such definitions. On the other hand, one can explain things like "label", "domain", "range", "sub property", "equivalent property" in one or two sentences, so it is not hard to understand what they mean. (The template on MetaWiki needs to be improved of course.) -- Jens Lehmann ( talk) 15:27, 24 November 2009 (UTC) reply
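As a rough illustration of the terms "label", "domain", "range", "sub property" and "equivalent property", a property description could be modelled as a small record. The sketch below is in Python; the class, its field names and the example values ("Person", "Place", "placeOfBirth") are hypothetical and are not taken from the MetaWiki draft page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropertyDefinition:
    name: str                              # identifier, e.g. "birthPlace"
    label: str                             # human-readable name
    domain_type: str                       # kind of thing the property describes
    range_type: str                        # kind of value the property takes
    sub_property_of: Optional[str] = None  # a more general property, if any
    equivalent_properties: List[str] = field(default_factory=list)

    def describe(self) -> str:
        """Explain the property in one plain sentence."""
        return f"{self.label}: relates a {self.domain_type} to a {self.range_type}"


# Example values are assumptions chosen for illustration only.
birth_place = PropertyDefinition(
    name="birthPlace",
    label="birth place",
    domain_type="Person",
    range_type="Place",
    equivalent_properties=["placeOfBirth"],
)
```

Read this way, the domain says which kind of article the attribute belongs to and the range says which kind of value is expected, which is roughly the one-sentence explanation suggested above.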

Proposal to standardise common infobox terms

I'd like to float the idea that we standardise common infobox terms and look at guiding editors on those standard terms. So we standardise birth_place and birthplace etc. to one term. Hiding T 17:38, 23 November 2009 (UTC) reply

What exactly is the problem here that needs to be solved? I'm not sure I understand this issue. Is this 1) a problem for new users who are trying to use an infobox and get confused when they put "birthplace" and it doesn't work because the template wants "birth_place", or, 2) is this an issue for an outside user, dbPedia, who wants to mine Wiki data and wants Wiki to change so they can do it easier? I don't really care about number 2; they're the ones who decided to mine Wikipedia, so it's up to them to make their site comply with Wikipedia. If this is a number 1 issue, I wonder how true it is. Where is this new user getting that template? I have seen new users copy and paste templates from other articles like it and then use that template, which means that the template parameters are fine. Or new users will go to the template page and copy/paste it, which also means the template parameters are fine. The only problem I see is if the user is changing a template to a more specific one (like changing template:infobox person to template:infobox politician). And even then, I really don't see what the problem is, because any serious template change requires the introduction of enough new parameters to make one parameter name change essentially irrelevant.

So what exactly is the problem that needs to be addressed? Why do things need to be absolutely standardized in templates when the parameters can be mapped together "behind the scenes" so that birthplace = birth_place = placeofbirth? I don't see why all the templates on wikipedia have to change, which would require editing every article, say, with a biographical template in the case of birthplace vs. birth_place. Do it behind the scenes. Unless this is some real big emergency, I don't honestly see the point. Kolindigo ( talk) 22:07, 26 November 2009 (UTC) reply
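The "behind the scenes" mapping described above (birthplace = birth_place = placeofbirth) can be sketched as a simple alias table. The following Python example is a minimal illustration; the table contents, the canonical name `birthPlace` and the function name are assumptions, and a real mapping would probably be kept per template.

```python
# Hypothetical alias table: each spelling maps to one canonical property name.
ALIASES = {
    "birthplace": "birthPlace",
    "birth_place": "birthPlace",
    "placeofbirth": "birthPlace",
}


def normalise(params: dict) -> dict:
    """Rename known aliases to their canonical name; keep unknown names as-is."""
    result = {}
    for name, value in params.items():
        key = ALIASES.get(name.strip().lower(), name)
        result[key] = value
    return result
```

With such a table, articles keep whatever spelling their infobox uses, and only the tool reading the data applies the mapping, so no article needs to be edited.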

This is an issue for experienced Wikipedians who can't remember which infobox uses which field; if we standardised, it would simplify things. Can't see why that would be a problem. Perhaps you could explain why? Hiding T 19:18, 29 November 2009 (UTC) reply

The RFC is about making it easier to reuse information in Wikipedia infoboxes (both internally and externally; see demonstration section on what can be achieved) and it suggests to do this by mapping infoboxes and their attributes to descriptions on MetaWiki. This means that not every article needs to be edited. A standardisation of common infobox attributes can still take place after the mappings are created.

I should add that the RFC is not specific to any particular extractor (like DBpedia) and relies on standards, so it can be used by other extractors and internal Wikipedia tools as well. I'm not particularly in favour of your attitude of not caring about anything happening "outside" (we do not really view ourselves as being outside). If Wikipedia can be made even more useful by enabling queries over it, improving browsing and searching etc., it is well worth allowing this simple addition of a few lines to each template definition (it doesn't break anything). -- Jens Lehmann ( talk) 07:10, 27 November 2009 (UTC) reply

As a technically ignorant editor who's never created an infobox and isn't clear about what's being proposed, let me say what I see as a possible problem with standardizing some parameters or characteristics. Sometimes parallel or analogous lines in information boxes refer to things that are slightly but significantly different, so editors should be (as I hope they would be) free to create the lines that fit the topic of the articles for which the information box is designed. To give a couple of examples off the top of my head: the American League and National League in American professional baseball have different structures, purposes, and relations with their minor leagues, from those of other leagues in other nations and other sports. The structures of local government in the United States and United Kingdom may look logical from the outside, but are full of all kinds of oddities and anomalies that would fit poorly into infoboxes that are good for other nations: e.g. the relation between the City of London and Greater London, the non-executive Five Boroughs in New York City which overlap counties of New York State, and the overlapping counties and districts of Northern Ireland. Political parties, coalitions and election systems also vary greatly not only from country to country, but within countries, so that standard U.S. election information boxes fit poorly for elections in New York, infoboxes have to make adjustments for the CDU-CSU alliance in Germany, and even Communist Parties have an enormous range of purposes and structures, as of course do trade unions and churches. The lines in these information boxes just aren't going to match each other very well all the time, but they have to be edited to fit the underlying subject matter. Sometimes this means phrasing the name of a line very carefully and idiosyncratically, so that the editors who fill those boxes are entering comparable items.
The problems this variation would cause for databases (especially query-oriented ones) are a reflection of differences, anomalies and incongruities in the real world as much as those in editors' styles and habits. —— Shakescene ( talk) 06:53, 27 November 2009 (UTC) reply

You're right. In a number of cases, the complex real world is hard to fit into infoboxes. Our experience in the DBpedia project has been that the Wikipedia infoboxes are still a very useful source of information for querying purposes. The proposed mappings would essentially cover the cases, where it is quite clear that two attributes in different infoboxes mean the same thing. (This alone already provides much added value for querying and browsing.) -- Jens Lehmann ( talk) 07:37, 27 November 2009 (UTC) reply

Microformats/Coins

I'm a bit confused by this discussion and whether the microformats solution has been explored. However, there is good support for some microformats, especially COinS, in the citation templates. This is a relatively easy thing to implement for most templates, exposes the data for those interested and has little impact for everybody else. Those interested just need to keep the templates on their watchlists. -- Salix ( talk): 20:03, 27 November 2009 (UTC) reply

See the top of this discussion and the alternatives section of the RFC page. -- Jens Lehmann ( talk) 07:58, 29 November 2009 (UTC) reply

A Really Well-Thought-Out Standardized Userbox/Infobox Template

Please see this page: [7] for the template. Developed by Super Sam, this template has 11 parameters that are optionally inclusive and default when excluded. Parameters are: Border color, size, and pattern; Background colors; Directionality; Text for each section, text color, and text size. Please see this page: [8] for information on how to implement this template, as well as many examples and 3 variants. I wish you the best of luck deciding in this matter. -- Homfrog^Talk 02:55, 5 December 2009 (UTC) reply

Conclusion

The discussion has been going on for a while now and we would like to come to a conclusion. Please give your "support" or "oppose" vote here with a short explanation (which should not turn into a discussion). We adapted the RFC during the above discussion (provision of a running example, description of suggested alternatives, demos updated), so please read it before casting your vote. Please note that the vote should not depend on the page where the mapping is stored (doc subpage, along with the template, extra mapping subpage). You can leave a comment about what you consider the best place to store them in your vote. The vote should also not be about whether or not infobox attributes should be standardised (e.g. different spellings of "birth place" unified) afterwards as suggested by Hiding. This is made easier through the mappings, but it is not strictly necessary to do it and can still be discussed after the mappings are in place (if accepted). Note that a later migration from the proposed RFC to other solutions (e.g. Semantic MediaWiki, solutions proposed by TheDJ) is definitely possible. Jens Lehmann, SebastianHellmann ( talk) 09:12, 1 December 2009 (UTC) reply

Just a note to say that we don't use voting to decide things on Wikipedia. I've retitled the section. Whether or not a motion is carried forward is decided by the consensus as determined by weight of argument, rather than head count. Chris Cunningham (not at work) - talk 12:26, 2 December 2009 (UTC) reply

Support

Support: I believe the 'Include the mapping directly in the infobox template definition' is best. I think having parse hints for parameters and access and checks via databases would help the aims of wikipedia while causing very little in the way of problems. Dmcq ( talk) 00:00, 2 December 2009 (UTC) reply

Oppose

Strong Oppose... despite it not being a vote. Overly complex, ugly, and almost hilariously over-engineered solution to a problem that would require significant effort for marginal benefit - something it shares in common with almost everything else associated with the "semantic web". For me, there's a right way and a wrong way of going about solving this issue, and this is definitely the wrong way. The right way is the simple "obvious" solution - make templates consistent (both in terms of names and values) and to put them in categories. This fixes the underlying problem of inconsistent templates instead of building a dirty hack to work around it, improves both machine and human readability, and is more likely to catch-on and actually be sustainable in the long-term, instead of the idea of creating meta-templates as a paper-thin abstraction layer solely for bots (or, more honestly, the idea of creating a meta-template solely to slightly ease implementation of dbpedia). Yes, effort and co-operation would be required in the community to standardise infoboxes, possibly including a bot to fix broken infoboxes, but it'd be a singular one-off effort for a significant real benefit to the project. - Halo ( talk) 09:56, 6 December 2009 (UTC) reply

While I'm at it, I'd also like to criticise the proposal for being almost uniquely unreadable, and this is from someone who actually understands the issues involved - I bet it's completely impenetrable to others who aren't quite so geeky. - Halo ( talk) 10:11, 6 December 2009 (UTC) reply

It does sound like it would be better to rewrite the proposal to make it more understandable. What's said here seems to totally miss the point of the proposal and misunderstand how it would be implemented, who would do any work, or what the benefits would be, in fact the whole kaboodle. Having consistent names in templates is just a 'would be nice' item but not very important. The 'dirty hack' as you call it is the whole point and is what gives the benefits, besides being much less work than standardizing the names. The current templates are a great big dirty hack and one advantage of this is it would give a way of putting a bit of structure on the parameter side of them. As to your feelings about being able to extract structured data from Wikipedia you're entitled to your opinion but lots of other people think differently. The point is would you be affected by the work to support the people who want semantic data out? As far as I can see there is nothing but benefits even for people who'll never go anywhere near a database. Dmcq ( talk) 13:08, 6 December 2009 (UTC) reply
