WIKIPEDIA TALK:MANUAL OF STYLE MATHEMATICS ARCHIVE 4

Precomposed vs. ASCII Roman numerals

Initial discussion

It's been my understanding for about 3 years now that English Wikipedia prefers ASCII Roman numerals to precomposed characters, for ease of typing, more consistent search results, less confusing copy-and-paste, and broader compatibility with fonts typically used by English speakers. In the 2020-11-01 database dump, I see for example 1,412,537 instances of "III" but only 288 instances of "Ⅲ". Since we don't use vertical text, this preference seems to align with what I found in the Unicode standard (quoting from Unicode 7.0.0, Chapter 22, p. 754):

Roman Numerals. For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters. However, the uppercase and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded in the Number Forms block (U+2150..U+218F) for compatibility with East Asian standards. Unlike sequences of Latin letters, these symbols remain upright in vertical layout.

I've removed perhaps a few hundred of these, and the first objection I've gotten came recently from Struthious Bandersnatch, who is unpersuaded by this reasoning, and says that the language currently in Wikipedia:Manual of Style/Mathematics#Special symbols means that precomposed Roman numeral characters are preferred, which he intends to add in articles he edits. We've brought the discussion here to see if there is in fact a consensus on this issue and to document it. I would propose adding something like this to that MOS section:

For Roman numerals, ASCII letters should be used instead of precomposed Unicode characters. For example, VI, not Ⅵ.

What do you think? -- Beland ( talk) 03:16, 22 November 2020 (UTC) reply

Oh, I forgot to mention that MOS:ORDINAL writes "II" in ASCII, so if we develop a consensus against the ASCII representation, that would also need to be changed. -- Beland ( talk) 03:23, 22 November 2020 (UTC) reply

(To be clear, because I'm not sure it's entirely clear from what Beland says above: my position is not that it should be mandated to use Unicode Roman numerals, but that doing so should be a valid style variation for purposes of MOS:STYLERET.)

For my part, I'll start by saying that if the community should arrive at the conclusion that no styling variation can be allowed on this issue, I am entirely willing to abide by that. I'd actually say this is one of my stronger styling preferences, and §Special symbols seems straightforward and compatible here, but at the same time styling itself is not very high on my overall list of priorities. (Though I think having an extensive MOS is a good thing—I'm just usually content to simply follow it, in article space, rather than debate it.)

We probably should link to Numerals in Unicode § Roman numerals
Using the term "pre-composed" instead of "numeric" or something of that sort seems like a somewhat biased way to frame this; the Unicode Consortium document introduces its section on numerals with,
Many characters in the Unicode Standard are used to represent numbers or numeric expressions. Some characters are used exclusively in a numeric context; other characters can be used both as letters and numerically, depending on context. The notational systems for numbers are equally varied. They range from the familiar decimal notation to non-decimal systems, such as Roman numerals.
The section Beland cites above does not actually say why it is that for most purposes, it is preferable to use letters to represent Roman numerals; it seems to me that the reason could simply be that it's easier to type. Apart from the rotation behavior in vertical writing systems, it also doesn't say what other minority purposes there are. (I'm noticing now that our article claims Unicode Roman numerals are "for compatibility only", but cites this to the preceding version of this document, which as far as I can see does not actually say this either.)
Using numeric characters, Roman numerals are machine readable as numbers; this information is lost upon converting them to what the MOS calls similar-looking ASCII or punctuation symbols. Beland did not seem to think this was particularly notable in our previous discussion, but I'd be curious to see numbers on how consistently it can be done since humans can't do it reliably, particularly given the famous (moderately famous? okay, maybe just famous to me) case of the Indian news anchor who was fired after reading the name of the president of China, Xi Jinping, as "Eleven Jinping" on the day he had arrived on a diplomatic visit.
I've looked through all of the skins and to me, at least, although the glyphs look slightly different than the equivalent letters, usually with slightly different spacing, they look better to me in all skins. Which, along with machine readability, is why I've got this preference.

-- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 05:16, 22 November 2020 (UTC) reply

I'm actually one of the downstream consumers of English Wikipedia content, and I wouldn't say that the precomposed version of Roman numerals is better for machine readability. I've already got a spell check system up and running, and I have a partially constructed syntax checker that tries to parse English sentences to see if there are any grammar mistakes. Unless we decide that the ASCII representation is disallowed and change the million-plus instances of it, any artificial intelligence system is going to have to cope with the ASCII representation. In any mixed regime, there will always be strings like the pronoun "I" and abbreviations like "VI" for the Virgin Islands that, as isolated words, are ambiguous as to whether they are Roman numerals. (The Chinese name "Xi" at least has a lowercase "i" to distinguish it from 11, so that's not actually a good example of where a text-eating program should be confused.)

English Wikipedia generally doesn't use Roman numerals for representing numbers, say in math equations, or even in running prose talking about Arrondissements of Paris. The vast majority of instances of Roman numerals are as part of a regnal name, and they are easy to parse as part of a proper noun based on capitalization alone. At that level, it's actually easier that these look like capitalized words rather than numbers. They are also often seen in chemistry notation, where something like "iron(I)" is completely unambiguous even in ASCII. Things that look like numbers based on character-level interpretation also often aren't. For example, with "how's your 9 to 5" and "watch your six", it's inappropriate to apply numerical processing and other solutions are needed to recognize idioms and whatnot.

I've also built text-to-speech systems, and here interpretation matters more. For example, Lee (Korean surname) is sometimes Romanized as "I", so if I see "David I", should I pronounce that like "David Yee" or "David the first"? If there is a mix of ASCII and precomposed styles in my input, I can't trust that seeing an ASCII "I" means I should say "Yee". In fact, given the relative frequency of occurrence, it's much more likely that "David the first" is correct because this is a reference to a famous monarch with this name. Having a lookup table (like the titles of all Wikipedia articles, something I actually do use) would be a better way to solve this problem than looking at character markup, or even better would be to look at the semantic context and do named entity recognition. Which is something any TTS system has to do anyway to get reliable pronunciation (for example for "Stein" as "st-ee-n" vs. "st-eye-n").

In practice, precomposed Roman numeral characters are just not going to be present in English content from non-Wikipedia sources. Industry standard machine learning systems, which I might use as a starting point for my NLP code, are going to be trained on real-world data, where humans type in the ASCII representation. As a programmer I think I'm better off normalizing the precomposed version to the ASCII form in a pre-processing step. That would simplify all the downstream code that has to handle Roman numerals, if there's one canonical form. One of the reasons I started working on eliminating the precomposed forms is that they clog up spell check error reports. I could do the normalization to ASCII in my pre-processing code, but I figure I'm making life easier for the next spell checker (not to mention future editors and readers) by eliminating them in the wikitext.
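As a rough illustration of the kind of pre-processing step described above, here is a minimal Python sketch. It is not Beland's actual code: the function name `decompose_roman_numerals` is hypothetical, and the choice to touch only U+2160..U+217F (the upper- and lowercase Roman numeral characters with compatibility decompositions) rather than applying compatibility normalization to the whole text is an assumption made here so that other characters, such as superscripts, are left alone.

```python
import unicodedata

# Assumed range: U+2160..U+217F, the Roman numeral characters in the
# Number Forms block that have compatibility decompositions to Latin letters.
ROMAN_FIRST, ROMAN_LAST = 0x2160, 0x217F


def decompose_roman_numerals(text: str) -> str:
    """Replace precomposed Roman numeral characters with Latin letters.

    Hypothetical helper for illustration; every other character is
    passed through unchanged.
    """
    pieces = []
    for ch in text:
        if ROMAN_FIRST <= ord(ch) <= ROMAN_LAST:
            # NFKD applies the character's compatibility decomposition,
            # e.g. U+2167 becomes the four letters V, I, I, I.
            pieces.append(unicodedata.normalize("NFKD", ch))
        else:
            pieces.append(ch)
    return "".join(pieces)


print(decompose_roman_numerals("Henry Ⅷ"))
print(decompose_roman_numerals("x² and chapter ⅳ"))
```

Whether such a step belongs in each consumer's pipeline or in the wikitext itself is the question under discussion; the sketch only shows that the mapping itself is mechanical.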

In short, a mix of styles is less convenient for parsing than consistently using all one or the other, and given the monumental effort it would take to switch styles, I think the easiest thing to do from a machine readability perspective is to more consistently use the convention advocated by the Unicode standard and de facto style used pretty much everywhere.

BTW, I don't see why "precomposed" is "biased"; "Precomposed Roman numerals" is what Wikipedia itself calls these in Unicode compatibility characters. As far as I can tell, "composition", "decomposition", and "precomposed" are standard technical terms used when discussing a single character that can also be represented as a multi-character sequence. -- Beland ( talk) 07:05, 22 November 2020 (UTC) reply

Support deprecating precomposed numerals and converting to ASCII on the basis of simplified support for accessibility tools described by Beland above. Accessibility is important. I would have guessed the precomposed ones were better for that, but that would have been only a guess and it's contradicted by the experiences described above. For what it's worth the two variations look identical on my screen; I would guess (another guess) that this is because the browser converts the precomposed ones to ASCII internally, so there is no actual benefit to precomposition for people who are just reading Wikipedia in browsers. — David Eppstein ( talk) 08:22, 22 November 2020 (UTC) reply

So I am also a professional software developer, although Beland would appear to have much more experience with NLP than do I. (Though that does not appear any more relevant or authoritative than my own professional experience here.)

What you seem to be saying, Beland, is that any software code for handling Roman numerals in text is already going to be a mess of complicated rules that have to deal with a range of scenarios, contextual clues, and edge cases. So it is puzzling to me why one additional rule that converts Unicode Roman numerals to Plane 0, Row 00 Latin Unicode characters (ASCII-compatible when UTF-8 encoded, of course, but not actually ASCII-encoded for a while now) would result in a salient difference in how easy it is to do anything, or would be less convenient in any substantive way; and quite frankly it just seems lazy to me on all fronts, search engines and spell-checking and otherwise, to simply normalize the data ahead of time rather than adding one rule and preserving information in the source.

It's like going and color-quantizing a bunch of source images because you don't want to bother to write code handling multiple bit depths. You realize that you're implicitly arguing that it would be better for NLP and TTS systems to not contain a simple rule like that, and hence be unable to handle Unicode Roman numerals?

On the matter of terminology—you, specifically, are the one claiming that Unicode Roman numerals merely represent pre-composed versions of Latin letter combinations. Note that even in the Wikipedia article you've linked to about compatibility characters, there's a “^{citation needed}” tag on this claim—and in fact the article says,

...in certain academic circles the use of Roman numerals as distinct from Latin letters that share the same glyphs would be no different from the use of Cuneiform numerals or ancient Greek numerals. Collapsing the Roman numeral characters to Latin letter characters eliminates a semantic distinction.

If you seriously do not understand why it's biased to make that claim in a policy thread I haven't commented in yet, and also after I've explicitly said that the Unicode Consortium documents you're linking to do not say that—without any counterargument or presenting better sourcing for your claim, and to use this wording which favors your desired conclusion in framing the question itself when introducing a policy discussion on a talk page, I kind of wonder whether you're a good candidate for making such proposals. Perhaps you should ask a neutral third party to make the proposal in this sort of situation.

David Eppstein, it's great that it looks indistinguishable to you, but as I said, to me, with my system's combination of browsers and built-in fonts,^[†] it does look different, and better. Hence the benefit is better styling, from my point of view at least, and what I'm saying is that these arguments simply aren't overwhelming enough to justify eradicating my styling preferences from Wikipedia.

Now if there was an accessibility benefit, I'd find that a persuasive argument. But web accessibility is something I know a fair bit about—in fact, I've worked on Section 508 compliance issues in web content management systems in the U.S. since the last century—and no one has actually presented any evidence here that converting Unicode numbers to a bunch of undifferentiated Latin letters provides improved accessibility.

† Fonts with open source, Debian-compatible Linux licensing, so there's no excuse for it not to similarly look better in any OS—that's on OS vendors. And it furthermore actually means that Wikipedia could probably use embedded web fonts to improve the display in all browsers, on all platforms, but I'm not advocating for that. Yet.

-- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 04:34, 27 November 2020 (UTC) reply

Yes, not all search engines put in the effort to handle all the many thousands of Unicode characters properly. You may think this is "lazy", but that doesn't mean they'll invest developer time in it. Yes, NLP systems are going to need to make a complicated set of rules; my point in saying that is to dispel the idea that encoding Roman numerals as precomposed Unicode characters some of the time will allow a naive system to handle them properly. If that was done all of the time, that distinction would be useful. "I kind of wonder whether you're a good candidate for making such proposals" felt like a personal attack; I'd appreciate it if we kept the discussion to the merits of the encodings. -- Beland ( talk) 05:07, 27 November 2020 (UTC) reply

WP:NPA concerns statements about personal behavior that lack evidence; it's you who are saying you don't understand the bias inherent in your own words, then linking to mainspace articles where the claim you're making is marked “^{citation needed}”. No need to discuss it anymore if you will stop portraying your position in this policy discussion—the position not simply that I have my preferences, and you have your preferences which are better, but that your preferences are so superlatively correct as to exclude my preferences as even being an acceptable option or variation—as a mere unbiased reflection of what Unicode Consortium documents say, or what orthodox programming practice would dictate.

If the developers of a(n unnamed) search engine don't invest developer time in something as simple as an equivalence like this, or don't even invest time in thinking about how their product should handle this situation, then they've done a poor job of making a search engine. I mean, they've basically made a search engine that doesn't handle Unicode. I'd be inclined to audit for Y2K bugs and UNIX epoch problems too. It's not a problem for Wikipedia to solve by constraining our styling decisions.

And I'm sorry but you simply have not demonstrated that a naïve system is going to be unable to handle Unicode Roman numerals properly. At all. The reason why a naïve system could handle them, and do so virtually effortlessly, is because as I've said (and the article you linked to said, which I quoted above) these Unicode numbers and their Unicode Latin letter equivalents are not simply fungible—the number code points contain additional information, which is being removed when they're converted to letters, which is one of the things I'm objecting to. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 11:18, 27 November 2020 (UTC) reply
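For readers who want to see what "additional information" in the code points refers to, the following Python sketch queries the Unicode character database through the standard `unicodedata` module. The helper name `describe` is hypothetical. The sketch only shows the per-character properties; it does not settle how useful they are in a corpus where most Roman numerals are typed as Latin letters, and a multi-character numeral would still need its own parsing.

```python
import unicodedata


def describe(ch: str):
    """Return (general category, numeric value or None) for one character.

    Hypothetical helper for illustration only.
    """
    return unicodedata.category(ch), unicodedata.numeric(ch, None)


# U+2167 ROMAN NUMERAL EIGHT is a letter-number (Nl) with numeric value 8,
# while LATIN CAPITAL LETTER V is an uppercase letter with no numeric value.
for ch in ("Ⅷ", "V", "8"):
    print(repr(ch), describe(ch))
```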

My proposal that the decomposed style is the only one Wikipedia should allow is not based on an arbitrary personal preference, but what I see used in common practice, and most importantly on usability for human editors. Your interpretation of what the Unicode standard says seems very strained to me; I found the section I quoted to be a pretty clear endorsement of the de facto convention of not using the precomposed variants in English text. The generally accepted robustness principle indicates to me that "fixing" this problem should happen at both ends of the producer-consumer relationship; that Wikipedia should clean up its wikicode to follow the standard convention, and search engines should properly handle web pages that don't follow the standard convention. Following this principle means that both "poorly written" (which in this case would apparently include Google) and "well-written" search engines work properly. Yes, I'm well aware the precomposed versions of these characters have more of a one-to-one mapping to a certain semantic meaning than the decomposed versions. Unfortunately, this otherwise nice semantic hint isn't helpful because the precomposed versions occur less than 1% of the time in English Wikipedia wikitext (and in English text on the web in general). A naive system that only uses the character encoding to semantically interpret Roman numerals is going to fail over 99% of the time on that particular task. Normally I'd also object to the loss of information due to an encoding change, but this isn't an object in a computer program, it's human-readable prose. More important than the debate over hypothetical NLP systems is the negative effects the precomposed characters have for humans who are manipulating the text, the vast majority of whom don't even know that the precomposed variants exist. I doubt that any style guide anywhere explicitly recommends using the precomposed Roman numerals in English prose, but I'm open to being proven wrong.
-- Beland ( talk) 18:42, 27 November 2020 (UTC) reply

To add the requested link, how about:

For Roman numerals, ASCII letters should be used instead of precomposed Unicode characters. For example, VI, not Ⅵ. (See Numerals in Unicode § Roman numerals.)

? -- Beland ( talk) 23:11, 27 November 2020 (UTC) reply

No; I was not requesting that the policy be changed and include that link—I presented it as context for editors to evaluate this styling discussion. My position is that the policy is just fine as-is, which is why I showed up on your UTP and quoted it to you in the first place.

A naive system that only uses the character encoding to semantically interpret Roman numerals is going to fail over 99% of the time on that particular task.

Oh, come on. This is not even a good faith argument, after you've proposed that the easiest thing to do from a machine readability perspective as a “naïve” system is removing numeric information from source data and then using a possibly-infinite number of ad hoc rules for translating letters into Roman numerals. A naïve NLP system built in a 1930s teletype machine (video on YouTube), mechanically, would also mostly fail btw.

But “easiest” and “most effective” are different criteria; I specifically asked why one rule, or even a vanishingly-small handful of rules for that matter, for processing characters already encoded as numbers would make for a material difference in ease of implementation (...implementation of all of these products which aren't part of Wikipedia, for which you have yet to explain why we would need to conform our styling rules to their vendors' needs, if indeed your claim about ease of implementation is even valid, though it seems trivially untrue to me)—why this would result in a salient difference in how easy it is to do anything. This is a blatant case of moving the goalposts, and a rather rhetorically clumsy one at that.

...both "poorly written" (which in this case would apparently include Google) and "well-written" search engines...

You have simply repeated a previous, uncited claim you made on your user talk page, which I challenged then, without any actual citation here either. As with so many other assertions, you haven't demonstrated that any behavior of the Google Search Engine is the result of not investing developer time (on the part of Google, of all companies—not exactly resource-poor when it comes to developer time for their flagship product) or failing to even invest time in thinking about how their product should handle this situation.

The definition of the robustness principle you link to reads,

Be conservative in what you do, be liberal in what you accept from others

...so just how exactly does that even remotely describe your approach here? (Or paraphrase into a heuristic that fixing a problem should happen at both ends?—that's pretty much the diametric opposite of the concept.) How is destroying information in your source data to permit an unexplained, supposedly-“easiest” implementation of these various automated systems interacting with Wikipedia content which don't care about style at all anyways, “conservative”?

The introduction to the Unicode Consortium document's section on numerals says, again,

The notational systems for numbers are equally varied. They range from the familiar decimal notation to non-decimal systems, such as Roman numerals.

...which is what's clear: Unicode Roman numerals are a notation system for numbers, and no matter how many times you call them pre-composed versions of letters, merely for compatibility, or if you insert the term into the talk page section header here (or link to a Wikipedia article about encoding compatibility which marks the same claim with “^{citation needed}”...) that doesn't change anything. What's “strained” is claiming that a single sentence which is verging on a footnote (which still explicitly says that these are Forms of Numbers, anyways) overrides and excludes the definition of these code points in the document's own introductory section about numerals and overrides any autonomy Wikipedia would have for determining its own styling choices.

And it's also strained to act like you aren't obligated by Wikipedia practices to present a policy change proposal from an NPOV, when you're putting yourself forward as a superlative authority on common practice of styling decisions for Roman numerals; and tbh acting that way gainsays the authority you've arrogated to yourself. (Obviously, your judgment about these kinds of styling decisions wasn't authoritative for the type foundries that designed the fonts installed on my computer, who intended for these glyphs to be used and put developer time into it.) -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 07:06, 28 November 2020 (UTC) reply

And another thing:

I doubt that any style guide anywhere explicitly recommends using the precomposed Roman numerals in English prose, but I'm open to being proven wrong.

I explicitly said at your talk page,

I am, of course, not advocating for using these characters as letters, like in “triⅵal” or something, but exclusively numerically.

(Edit:) So another obvious, clumsy rhetorical gambit, this time a straw man. I'm at the point where I think I can say that not only did you not make any effort to present this policy change proposal neutrally, you are intentionally attempting to misrepresent my position here. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 07:19, 28 November 2020 (UTC) reply

In the interest of AGF, I'll assume that above you were simply presenting a one-sided, tendentious version of a statement like, "Non-Wikipedia guides don't usually specify technical details like choice of Unicode code points for ligatures or representation of visually similar glyphs" ...which, despite still appearing to be intended as a straw man, a weakened version of my position easily disproved, is simply a trivially untrue statement in a discussion of a style guide which literally says what the first quote above does; and hence this statement does not misrepresent my position in that interpretation, unlike some of the other above statements, such as the one I characterized as not being in good faith. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 07:49, 28 November 2020 (UTC) reply

I think it's useful to consider also the closely-analogous case of precomposed superscript digits, which again are just notation for numbers, but which this MOS forbids from mathematics articles for the reason that when they are rendered the same as proper superscripts they are useless and when their rendering subtly disagrees with other superscripts (which you have to use anyway because not all exponents can be made from the precomposed ones) that rendering difference is unwanted. Is there any reason we shouldn't treat this form of Unicode cruft that inappropriately mixes semantics (what a sequence of letters means) with syntax (what sequence of letters is to be rendered) differently than that other form? — David Eppstein ( talk) 07:28, 28 November 2020 (UTC) reply

But, superscript digits really are just notation for numbers—the semantic meaning is the same between the Wikicode/HTML, for example, “⁴” and “<sup>4</sup>”. That is not the case with Roman numerals, however—the semantic meaning of the number code point is different from the semantic meaning of a sequence of letter code points (or a single code point... note that for many of the Roman numerals it actually doesn't even make sense to call them “precomposed” because they aren't visually equivalent to multiple letters, but just one.) -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 07:49, 28 November 2020 (UTC) reply

Superscript digits can mean exponents. They can mean footnote markers. They can mean an index for a tensor. They can mean lots of different things. In the same way sequences of letters formatted as roman numerals can mean lots of things: the number of an event, a page number, the position of a hand on a clock, etc. It is wrongheaded to think that we can have a single different character for every possible number that could appear as a roman numeral. That's not how numerals work in any non-primitive numeral system. — David Eppstein ( talk) 08:00, 28 November 2020 (UTC) reply

David Eppstein: Again, the straw-est of straw men: It is wrongheaded to think that we can have a single different character for every possible number that could appear as a roman numeral. You could go tell that to whoever is arguing that, wherever they are, because they aren't participating in this discussion. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:27, 30 November 2020 (UTC) reply

@ Struthious Bandersnatch: You may well be actively avoiding the question of whether all numbers can be represented as precomposed Roman numeral unicodes, as you say, but it is an important question to face rather than to avoid. The point is that unless they are, there will always be instances where numbers that can be characters mix with numbers that cannot. Either this will create visible differences between these two classes of numbers (demonstrating that we should avoid the precomposed ones to prevent these inconsistencies) or the browsers will create the identical appearance in all cases (demonstrating that the precomposed characters are pointless to use because we can create the same effect with much easier editability using ASCII). So which is it, pointless or actively to be avoided? — David Eppstein ( talk) 21:58, 30 November 2020 (UTC) reply

David Eppstein: I'm not even going to bother naming which fallacy you're trying to use this time. (*Munches on a dicot.*) Speaking of avoiding questions. "unicodes"? (You're a computer science professor?)

As with so many other things... how is this terrible no-good non-easy editability issue not a problem with §Special symbols in general, rather than just Roman numerals? -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 22:45, 30 November 2020 (UTC) reply

Let me guess, you're a strict constructivist and you think this is the fallacy of the excluded middle? Also, if you're going to persuade other editors to your point of view by attempting to insult them over the informality of their vocabulary choices, "unicodes" is a bizarre choice for that; Google shows over 600 scholarly publications using that word. — David Eppstein ( talk) 23:34, 30 November 2020 (UTC) reply

False dichotomy. Pointing out that you are a prolific user of logical fallacies here, and that you and Beland are in no way disciplined about your conformation to accurate use of terminology in the course of demanding a new MOS rule enforcing mandatory compliance with dictated usage of single individual characters (who's the strict constructivist, again?) is simply the substantial truth and in no way a violation of NPA. You can smash a mirror if what you see makes you angry but you can't force me to refrain from describing your rhetorical behavior accurately on a Wikipedia talk page. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 19:02, 1 December 2020 (UTC) reply

@ Struthious Bandersnatch: Thanks for clarification on the link.

Yes, we all agree that the precomposed characters are a dedicated way to represent numbers in Roman notation. I think what we disagree about is whether we should use them because of their semantic properties. I interpret the Unicode standard as advising against it (I know you don't) but sure, in practice people might do so anyway. We can look around and gather evidence as to whether this is common practice.

When determining whether a style should be used in Wikipedia and endorsed by the MOS, we often look to external style guides, and to the actual practices of well-respected publications in the field and to the sources we cite. While you will find style guides have opinions about whether to use curly or straight quotation marks and whether to use "ae" or "æ", I don't know of any that recommend the use of precomposed Roman numerals. If it was important to do so for SEO purposes or machine readability or any other reason, I would expect at least some newspapers or academic journals to document this preference for their authors. Or if you think this level of detail is beyond the scope of most style guides, at the very least we'd be able to find such characters in the body text of notable publications, or even just mathematical publications, if we are going to limit the scope of this rule to mathematical articles. Since I believe that common practice is the opposite (to use ASCII characters for all Roman numerals), finding precomposed characters in a style guide or relevant publication would be evidence that my assumption based on my personal experience and Wikipedia's database is wrong. I was cordially inviting you to present such evidence, which would prompt me to rethink my assumptions. If every source we use only has ASCII Roman numerals, that would validate the claim that this is a de facto standard, even if it's not self-evident why it is so.

When I say "in English prose", I'm thinking about situations like "Elizabeth II is the current Queen of England", and "Ferric chloride is also known as iron(III) chloride", not the "vi" in "trivial", which is obviously not a number even to a naive system.

The fact that common fonts are capable of displaying precomposed Roman numerals doesn't seem persuasive that they should be used for any particular purpose, any more than the fact that these fonts display curly quotes means that Wikipedia should use curly quote style.

I think you've confused two different systems I was hypothesizing. What I mean by "naive system" is one that relies on the character encoding to differentiate between Roman numerals and pronouns like "I". Any system that can actually do this differentiation can't be naive, but what I propose is that because it must be intelligent enough to know the difference in cases of pure ASCII encoding, it doesn't need the hint provided by the precomposed Roman numerals. And in fact I expect most such non-naive systems would perform better if they decomposed such characters early in processing. Yes, this counterintuitively improves performance by destroying information, but that happens with machine learning and search engines sometimes.

By analogy, in English NLP systems, it's common to destroy the information encoded in capitalization by lowercasing all inputs, for pretty much the same reason: lack of consistency. Capitalization also happens at the beginning of sentences, so it's not a reliable signal. Notice, for example, that Google gives an identical page of results on Shakespeare's play whether you search for "hamlet" or "Hamlet", even though when parsed as properly encoded English, the former might mean Hamlet (place) but can't mean The Tragedy of Hamlet.

Outputting only the ASCII encoding is "conservative" because it is the most common encoding and thus the one that every consumer must handle. If we consider using the precomposed encoding in English prose to be an error, then consumers would be justified in not handling it properly. If we consider it to merely be a secondary legitimate encoding, then producers who use both encodings are certainly being more "liberal" in that they produce more complicated output, even if we say that both are technically allowed. Given that the secondary encoding is far less common, in practice some neglectful consumers simply fail to handle it. In other words, precomposed Roman numerals are exactly the sort of thing that falls into the "legal but obscure protocol features" that the robustness principle advises that producers avoid.

Yes, a non-naive system could tolerate both encodings by adding a tiny bit of preprocessing code, though it's very slightly less work not to bother. But if the precomposed characters don't help the naive system, and they don't help the non-naive systems, and they cause problems for some search engines and spell checkers, and they confuse most people, then what's the benefit we are getting from the precomposed encoding? If you think there is a benefit, perhaps a specific example would help explain.

You mentioned on my user talk page that you used the precomposed "Ⅷ" on Dictionary of American Naval Fighting Ships. Going to this page, I immediately encounter problems in Firefox looking for this part of the page. This article actually uses both encodings, so if I search for "VIII" (which I do first since these characters are on my keyboard) I only get the one instance of this number that uses the ASCII encoding. Firefox doesn't know that "Ⅷ" is the same sequence of glyphs, so I can't cycle through the instances that use the precomposed encoding. The vast majority of readers have no idea that precomposed Roman numerals exist, and of those that figure out that's what's happening, most will still have no idea how to input one. Even people who know about all these things will have difficulty searching with a page where there is a mix of encodings. To me, this is an unacceptable user experience and the best argument yet for why ASCII encoding should be mandatory.
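The find-in-page behavior described above can be reproduced with a plain substring count. This is only a sketch: `count_matches` is a hypothetical helper, the sample page text is invented for illustration, and nothing here claims to reflect how Firefox or Chrome implement their search.

```python
import unicodedata


def count_matches(page: str, query: str, normalize: bool = False) -> int:
    """Count non-overlapping occurrences of query in page.

    With normalize=True, both strings are first passed through NFKC,
    which is an assumed preprocessing step, not known browser behavior.
    """
    if normalize:
        page = unicodedata.normalize("NFKC", page)
        query = unicodedata.normalize("NFKC", query)
    return page.count(query)


# Invented sample mixing both encodings of the numeral eight.
page = "Volume VIII (1981); Volume \u2167, part 2; Volume \u2167, part 3"

print(count_matches(page, "VIII"))                  # literal code point comparison
print(count_matches(page, "VIII", normalize=True))  # compatibility-folded comparison
```

A literal comparison finds only the instance typed with Latin letters, while the folded comparison finds all three, which is the gap a reader typing "VIII" would run into on a page with mixed encodings.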

-- Beland ( talk) 09:53, 28 November 2020 (UTC) reply

I have the same experience in Chrome: the precomposed numerals in the table block me from finding them in a text search. — David Eppstein ( talk) 17:46, 28 November 2020 (UTC) reply

...yeah but the reason why it might be the best argument yet is that the arguments attempting to support your position so far have involved things like linking to a Wikipedia article where one of your central claims is marked "[citation needed]" and paraphrasing the robustness principle into its exact opposite.

Again, web browsers are products from vendors other than Wikipedia or the WMF. It's not our job to improve them or compensate for the failings of those vendors.

I notice that if I go to the Wikipedia article Letterlike Symbols in Firefox, searching for "h" does not find the Planck constant symbol and searching for "K" does not find the Kelvin symbol. Similarly, at Mathematical Alphanumeric Symbols none of the glyphs that are definitely representing Latin letters are found by searching for their Plane 0, Row 00 look-alikes.

So if this was your best argument, after all the above writing, color me unimpressed. You would need to demonstrate that it has anything to do with Roman numerals in particular rather than Roman numerals plus everything else §Special symbols also covers.

And in fact I expect most such non-naive systems would perform better if they decomposed such characters early in processing.

I'd invite you to provide a citation... I don't think I have to know much about NLP to say that unless a system is able to correctly distinguish Roman numeral collections of Latin letters from non-Roman-numeral ones 100% of the time, Unicode Roman numerals in source material can have only positive utility. But again, it doesn't matter, because this sort of system is not a Wikipedia or WMF product.

in English NLP systems, it's common to destroy the information encoded in capitalization by lowercasing all inputs [...] Notice, for example, that Google gives an identical page of results on Shakespeare's play whether you search for "hamlet" or "Hamlet"

The main reason for Google to be case-insensitive is that it halves the size of the index necessary (and combinatorically, the reduction is much greater than half.) Punctuation is ignored for similar reasons.

...then what's the benefit we are getting from the precomposed encoding? If you think there is a benefit, perhaps a specific example would help explain.

Uh, better aesthetic style? We're in a discussion in a talk page for the Wikipedia Manual of Style, remember? Not that you didn't know that was one of the benefits I'd already proposed, or that you've demonstrated other benefits (not to mention the existing damn rules in the style guide) aren't valid; this is all such empty rhetorical posturing.

I'll also point out, again, that it is completely absurd to refer to "Ⅴ" as a "pre-composed" version of "V"—you are hammering a square peg into a round hole here. But by all means, continue to demonstrate the inherent ridiculousness and prejudiced nature of your exhibition. I believe the word "wrongheaded" arose above, and it didn't apply to anything else in the conversation...

And as far as style guides other than the one you're proposing changing right now—proposing changing to reflect something you yourself are saying other style guides do not say—the AP Stylebook gives guidance to, for example, not use brackets because they supposedly can't be transmitted over news wires. Which, I'll bet anything, is based on some technical problem present in twentieth-century technology that tied back to nineteenth-century telegraph encoding practices. So if by some remote chance you actually go looking for evidence to cite, and in the even more unlikely event under a joint probability distribution that you actually find a style guide which says something about Unicode-specific character encoding practices rather than 19th-century telegraph stuff, also please bring evidence that the authors even remotely know what the hell they're talking about when it comes to this realm of technical topics.

The typography term for the difference between "ae" and "æ" is that the latter is called a ligature. And you're still using the term "ASCII encoding" too...

If you're going to switch from claiming you're trying ...to dispel the idea that encoding Roman numerals as precomposed Unicode characters some of the time will allow a naive system to handle them properly to saying that a naïve system is one that relies on the character encoding to differentiate between Roman numerals and pronouns like "I"—which would mean, under this new definition, that handling Unicode Roman numerals properly is the one thing a "naïve" system can actually do—and yet suggest that I'm the one who is confused? I'm just going to say it: in addition to neutral talk page proposals, Beland, you also seem to be out of your depth when it comes to the intersection of character encodings, typography, web styling, and UX and accessibility. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:27, 30 November 2020 (UTC) reply

If by "aesthetic style", you mean visual appearance (as in one of your two original reasons for opposing this change), I don't find anything wrong with the appearance of ASCII Roman numerals, and the other downsides of using non-ASCII characters seem way more important. Not to mention that whether or not the non-ASCII ones look different or better or worse or exactly the same or don't render at all depends on the specific font being loaded by the web browser.

As for terminology, there are downsides to any particular choice. Yes, the single-character Roman numerals aren't pre-composed, but "numeric characters" is also somewhat confusing because characters in both ranges semantically refer to numerals, as far as humans are concerned. I don't bother to say "characters inherited by Unicode from ASCII which have the same representation in UTF-8 but not in other Unicode encodings" because I expect you to know that's what I mean when I say "ASCII encoding" as shorthand. It's not for lack of knowledge, and for you to say things like that over and over again is irrelevant to the merits of the proposal, and quite frankly rude and uncivil. If you have a specific phrasing for the proposed MOS rule that you would prefer, I'm open to suggestions. We could refer to the character ranges by number, if you like.

Your second original reason for opposing this change is that the non-ASCII characters are more "machine readable". If you think the impact on an external NLP system "doesn't matter, because this sort of system is not a Wikipedia or WMF product", that implies that there's no point in making wikitext machine readable for external systems. What internal system, if any, would benefit from a "machine readable" encoding?

I don't think I or most editors would ever agree that the user experience on Wikipedia's own web site doesn't matter or doesn't justify Wikipedia making an effort to improve it. Accommodating the behaviors of the web browsers currently in use by the majority of site users is a major concern of every web site I've worked on professionally.

If we can't identify any style guides that recommend non-ASCII Roman numerals, and we can't find any reliable sources that actually use them, then I don't think there's any point in debating which "know what the hell they're talking about" if the verdict is unanimous. I'm curious how you got started using these characters yourself. Did you see them used in a publication that you respect, or did you find out about them in a computing context and decide to start using them in articles on your own?

Using the Unicode Kelvin character on Letterlike Symbols makes sense, as the character itself is under discussion. We should definitely continue to use the non-ASCII Roman numeral characters on Numerals in Unicode. But you'll notice that in most of the Kelvin article we use the ASCII "K", which is searchable. The Kelvin character is specifically mentioned in Kelvin#Unicode character, which cites Unicode 8.0's recommendation to use the letter K instead. It also says to use the letter Å instead of the Angstrom character and Ω (Omega) instead of the Ohm character. If there are other characters for which searchability would be improved by substitution, then I see that as a reason to proceed with substituting those as well rather than a reason to continue having a poor user experience. It would certainly make no sense to me to allow non-ASCII characters in non-math articles, so that a search for, say, "Queen Elizabeth II" on a history page might or might not work depending on the "style preferences" of that page's original author. -- Beland ( talk) 01:31, 1 December 2020 (UTC) reply
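One technical difference between the three letterlike signs mentioned above and the Roman numeral code points can be checked directly. The sketch below uses only Python's standard `unicodedata` module; the helper names `nfc` and `nfkc` are illustrative. The Kelvin, Angstrom and Ohm signs have canonical equivalents, so ordinary NFC normalization already replaces them with the letters, whereas the Roman numeral code points only fold to letters under the lossier compatibility form NFKC.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition: replaces only canonically equivalent characters."""
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    """Compatibility composition: also folds compatibility characters."""
    return unicodedata.normalize("NFKC", text)


samples = ["\u212a", "\u212b", "\u2126", "\u2167"]  # Kelvin, Angstrom, Ohm, Roman eight

for ch in samples:
    print(unicodedata.name(ch), "| NFC:", nfc(ch), "| NFKC:", nfkc(ch))
```

In other words, software that routinely applies NFC will already turn the Kelvin sign into "K", but will leave "Ⅷ" untouched unless it opts into NFKC.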

I was going to say Wikipedia:Manual of Style/Mathematics#Special symbols may need to be modified to follow the advice on these three characters. But it refers to List of mathematical symbols, Wikipedia:Mathematical symbols, List of mathematical symbols by subject, and Mathematical operators and symbols in Unicode. These three characters only appear on the fourth list, but according to the text there, they are specifically excluded from the "math" part of the Letterlike Symbols block. The Planck constant does not appear on any of these lists. This makes sense to me, as I would consider these scientific rather than mathematical symbols, and so I interpret that the "Special symbols" section already does not apply to them. Very interestingly, Roman numerals do not appear on any of those lists. Though they are clearly related to numbers, this may be an indication that the "Special symbols" section was never intended to apply to them, or that this hadn't been considered one way or the other. Precomposed fractions are also not included in any of these lists, and we know for sure from MOS:FRAC that these are not allowed in English Wikipedia. Roman numerals are also like fractions in that the vast majority appear in articles that are not about mathematics - for example history, anime, chess, and military articles.

-- Beland ( talk) 01:31, 1 December 2020 (UTC) reply

You're seriously straight-up claiming that I'm "confused", then trying to strike a pose that I'm rude and uncivil for pointing out that you're repeatedly mis-naming character encodings, and a variety of other things, in a discussion of character encoding standards to avoid highlighting the fact that all of the characters we're talking about are Unicode? Sorry, it doesn't make any sense whatsoever to say "ASCII" when you're talking about Unicode Basic Latin Block letters but then have no problem saying "Unicode" when describing Unicode numbers—funny that the opposite doesn't happen, describing the Basic Latin characters as Unicode and the Roman numeral code points as from the standard you kept claiming they're solely included for compatibility with.

And you haven't simply been using the term "ASCII" casually—you have used it in the text of your proposed mandatory-rule addition to the Manual of Style. Asking me to do your work for you and write the specific phrasing of your proposed addition accurately is not some sort of transactional favor where you then get to call me rude and uncivil for pointing out your sloppy use of terminology at the same time you're trying to install an MOS rule demanding that Wikipedia editors use individual characters to your exacting preferences and specifications.

And furthermore, on the subject of your concomitant inattention to detail, the phrase lack of knowledge, which I've supposedly been saying over and over again and even just the word "knowledge", appears only in your own comment above in this thread. But it is an accurate self-description; if indeed you are knowledgeable in the technical areas I list above you are failing to convey that by even using terminology correctly. All while insisting on your superlative personal insight into what this Wikipedia guideline should say, while demonstrating at best superficial familiarity with Wikipedia policies and guidelines in general.

What internal system, if any, would benefit from a "machine readable" encoding?

What internal system wouldn't, if you yourself have admitted that it takes only a few lines of code in any programming language to create a "naïve NLP system" (rather overkill terminology-wise IMO, but it works) that can handle Roman numerals properly encoded in Unicode, but requires a system with a potentially-infinite number of ad hoc rules to do the job with Basic Latin letters?
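For readers following the "few lines of code" point, here is a hedged sketch of both approaches. `numeral_value` and `parse_letters` are hypothetical names; the first relies on the numeric property that Python's `unicodedata.numeric` exposes for single characters, and the second is a textbook subtractive-notation parser for Latin letters, not anything proposed in this discussion.

```python
import unicodedata


def numeral_value(ch: str):
    """Return the Unicode numeric value of one character, or None if it has none."""
    return unicodedata.numeric(ch, None)


ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_letters(s: str) -> int:
    """Parse Latin letters as a Roman numeral using the subtractive rule.

    It cannot tell whether the letters were meant as a numeral at all.
    """
    total = 0
    for i, ch in enumerate(s):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4, IX = 9).
        if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


print(numeral_value("\u2167"))  # dedicated code point carries its own value
print(numeral_value("V"))       # a Latin letter has no numeric property
print(parse_letters("VIII"))
print(parse_letters("I"))       # the pronoun and the numeral are indistinguishable here
print(parse_letters("MIX"))     # an English word also parses as a number
```

The sketch cuts both ways: the dedicated code points are trivially readable as numbers, but only the values the standard encodes as single characters are covered, and letter sequences need context that no character-level rule supplies.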

I mean... this is the basic definition of the term "machine readable". I also don't get why you keep putting that phrase in quotes past the use–mention distinction... you're almost treating it like it's an unfamiliar term, or doesn't mean what our article machine-readable data says: Machine readable is not synonymous with digitally accessible. A digitally accessible document may be online, making it easier for humans to access via computers, but its content is much harder to extract, transform, and process via computer programming logic if it is not machine-readable.

I don't think I or most editors would ever agree that the user experience on Wikipedia's own web site doesn't matter or doesn't justify Wikipedia making an effort to improve it.

Then get the entire Wikipedia:Manual of Style/Mathematics § Special symbols section done away with. You can't cherry-pick Unicode Roman numerals to apply this usability quibble (which, I'll bet, no actual user has ever complained about anyways, not even to browser vendors, at least not with mathematical content) to: doing so is, as with so many other arguments made here, fallacious.

And I'd note that it's not just §Special symbols you need to work on changing if your concern about Ctrl+f browser page-specific searching is in any way whatsoever real instead of just more chaff thrown up in the process of trying to get your way: in Firefox if I search for "sinxdx" it doesn't find that sequence in §Using LaTeX markup. (Which of course makes sense, since Wikipedia currently uses a plugin which renders LaTeX to images.)

If we can't identify any style guides that recommend non-ASCII Roman numerals, and we can't find any reliable sources that actually use them

Tricksy Hobbitses. You're trying to translate the absence of style guides which recommend against the use of Unicode Roman numerals into a positive reason to add a MOS rule prohibiting them, which is a clumsy argument from silence (or more realistically an argument from ignorance because I doubt you've actually gone and looked at anywhere near all style guides to ascertain such an absence.) Are you guys playing logical fallacy bingo? Or going through the list of fallacies article and checking them off, or something?

And how would you know whether reliable sources, particularly printed reliable sources, use Roman numeral Unicode characters? Even if this were a valid argument in the first place (if our citation formatting has never had to conform to the intricate specifications of the many organizations making bux off of selling such things, why would our Unicode character encoding of Roman numerals need to conform to anyone else's not-even-explicitly-specified practice?) you have not exactly demonstrated yourself willing to put much effort into doing research.

I'm curious how you got started using these characters yourself. Did you see them used in a publication that you respect, or did you find out about them in a computing context and decide to start using them in articles on your own?

Well did you do a survey of publications before encoding Roman numerals the way you do it? Surely, if you can put the question to me, you're willing to answer it yourself.

As far as pages which don't currently follow §Special symbols for the Kelvin and Planck constant symbols, again, propose changing the whole thing if you think it's fundamentally invalid.

It would certainly make no sense to me to allow non-ASCII characters in non-math articles

For you to allow editors to use "non-ASCII" characters? As I said on your talk page, you do not have any such power to overturn Wikipedia policy or guidelines by personal fiat. And notice that, if you're using the standard editor, right below the main editing field are a dropdown and a bunch of buttons which allow anyone to insert "non-ASCII" characters, including for example "™" which could easily be reproduced with the HTML <sup>TM</sup>.

I don't find anything wrong with the appearance of ASCII Roman numerals

...right, and that's a valid styling point of view. No one is saying it isn't. What you have not even begun to do here is demonstrate that your styling viewpoint is so virtuous and superior that it must exclude all other styling points of view, to the degree that for Roman numerals, Wikipedia:Manual of Style/Mathematics § Special symbols should be changed to say the complete opposite of what it says now—to go from saying that the rule of thumb is, characters and character sequences with mathematical significance should be represented by Unicode code points which encode that mathematical significance specifically rather than visually similar glyphs, to saying that Roman numerals must mandatorily be only represented with Basic Latin Unicode code points.

Between acting like you don't know what the term "aesthetic style" would mean in a Manual of Style discussion where I've repeatedly brought up fonts and even type foundries, and all of the other see no evil, hear no evil, speak no evil behavior on display here, this is all taking on the aspect of King Canute shouting at the tides. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 19:02, 1 December 2020 (UTC) reply

As far as I know, Wikipedia doesn't have any internal NLP systems that are attempting to parse the numerical values of Roman numerals, and thus there are none that would benefit from making them "machine readable". I am using quotation marks there because I am quoting your words, which I neither endorse nor condemn. If you have no objection to the phrasing "ASCII letters should be used instead of precomposed Unicode characters", then that's what the RFC will propose. If you don't express a preference for a different phrasing now, please don't argue later that the phrasing is defective and thus the proposal should be abandoned entirely. -- Beland ( talk) 22:52, 1 December 2020 (UTC) reply

As far as I know, Wikipedia doesn't have any internal NLP systems that are attempting to parse the numerical values of Roman numerals, and thus there are none that would benefit from making them "machine readable".

...then, of course, you are actually rendering all of your own quibbles about benefits from the supposed “easiest” things for NLP systems to do invalid as well.

If you have no objection to the phrasing "ASCII letters should be used instead of precomposed Unicode characters", then that's what the RFC will propose. If you don't express a preference for a different phrasing now, please don't argue later that the phrasing is defective and thus the proposal should be abandoned entirely.

I of course wrote at great length above about why using the term “ASCII” for Unicode Basic Latin characters is inaccurate and inappropriate.

If you want to call the entire community down to look at an example of you trying to rewrite a Wikipedia guideline using terminology from before even 1969's RFC 20, that's your business. I'm sure I will have a delightful discussion, among other things, reminiscing about old character encoding times with my fellow neckbeards.

My preference for phrasing is the current guideline, as it stands, unchanged, as I have stated repeatedly. You, of all people, are in no position to try to place any prior restraints on what sorts of arguments I can make, when you simply ignore my requests to follow basic Wikipedia procedural guidelines if you don't feel like it. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 00:30, 8 December 2020 (UTC) reply

Yes, if we don't care about external NLP tools, all the musings about whether NLP tools would actually benefit from this are moot. So it seems we can either choose to care about external NLP systems (which I do, because I operate one, but you have argued we shouldn't), or choose not to care about external NLP systems and safely disregard your original argument that we should allow upper-range characters because "Using numeric characters, Roman numerals are machine readable as numbers".

Given an opportunity to improve the wording of a proposal, even one you disagree with, a later objection as to the quality of the wording would be an argument made in bad faith. I will assume you will not make an argument in bad faith. I'm going to amend the wording slightly to reflect the valid point that not all the characters in the upper range are precomposed, and add a bit more detail. The MOS already uses "ASCII" in the same way that I am, so I'm going to disregard the argument that I am using this term incorrectly even though I understand the argument. I expect more people are familiar with "ASCII" than the names of Unicode blocks, so I'm going to retain that terminology for ease of understanding and consistency with the rest of the MOS. (I'll post new wording in appropriate subsection below.) -- Beland ( talk) 00:50, 9 December 2020 (UTC) reply

...a later objection as to the quality of the wording would be an argument made in bad faith.—Right, so what you've got are current and past objections to the quality of your wording. Your concern that bad faith arguments not be made—evidently by repeating objections you're already aware of, which definitely has nothing whatsoever to do with the concept of "bad faith"—is touching. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:53, 9 December 2020 (UTC) reply

I've just encountered a second serious accessibility issue. I just ran the proposed MOS change through a text-to-speech system. One ASCII character sequence is read aloud as "vee eye", but the non-ASCII equivalent is read aloud as "letter two one seven five". That means if we don't want Roman numerals to be essentially gibberish to some people with visual impairments, we should stick with the ASCII characters. -- Beland ( talk) 04:13, 9 December 2020 (UTC) reply
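The digits in that spoken output match a hexadecimal code point, which can be looked up as sketched below. This assumes the text-to-speech tool was reading out the code point number, which the comment above does not confirm; `describe` and `speakable` are hypothetical helper names, and the NFKC fallback is one possible workaround rather than a known feature of any screen reader.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and formal Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def speakable(text: str) -> str:
    """Assumed fallback: fold compatibility characters to letters before speaking."""
    return unicodedata.normalize("NFKC", text)


ch = chr(0x2175)
print(describe(ch))   # which character "two one seven five" refers to
print(speakable(ch))  # the letter sequence a speech engine could read instead
```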

Again, none of this has anything specific to do with Roman numerals.

I guess you've never tried before, but attempt the same thing on any character in Mathematical Alphanumeric Symbols; or if the long names of those characters have been abbreviated in the specific tool you're using, go ahead and file a bug report to get Roman numeral code points fixed too if you genuinely care about usability.

To quote myself,

You can't cherry-pick Unicode Roman numerals to apply this usability quibble (which, I'll bet, no actual user has ever complained about anyways, not even to browser vendors, at least not with mathematical content) to: doing so is, as with so many other arguments made here, fallacious.

-- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:53, 9 December 2020 (UTC) reply

Just because other things are arguably broken doesn't mean that Roman numerals should also be arguably broken. There may also be other considerations for different characters that provide better reasons for using them. I think we should consider these characters in small groups before starting a discussion about making such a wholesale change. -- Beland ( talk) 19:44, 11 December 2020 (UTC) reply

Well, I don't. If you can't make your case in general, in a style guide which actually currently mandates representation of most complex mathematical formulae as embedded images that fundamentally have even worse usability issues you seem curiously uninterested in solving despite my pointing to active efforts to do so, then your arguments don't apply to Unicode Roman numerals alone either.

And as I've pointed out—I guess this is all new stuff for you, but as I said I've been doing this since the last century—a distinct difference from the issue of embedded images for more complicated formulae is that in the instance of Unicode, what's happening is that the tools themselves are choosing to not usably support Unicode by instead vocally reading out the six-word-plus formal name of every Mathematical Alphanumeric Symbol that's equivalent to a one-syllable Basic Latin character in English, or by Google not supporting the same Ctrl+f searches in Chrome for easily-typed visual equivalents of Roman numeral and Mathematical Alphanumeric Symbols that its search engine supports—the real issue here is these other products not usably supporting Unicode, not that Wikipedia needs to have styling policies mandating destruction of information to compensate for their shortcomings.

Usability isn't just a buzzword you can deploy without addressing the actual issues, it's a well-developed field at this point in the twenty-first century. (And, though progress in terms of implementation on many axes was somewhat behind where we are now—which is still not too great—even the last century's analysis of usability problems was not so bad.) -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 15:00, 12 December 2020 (UTC) reply

External practices

We've been discussing above whether or not any reputable style guides or reliable sources encode Roman numerals in non-ASCII characters. If anyone knows of any, please share! Personally, I don't recall ever having seen that in professional English publications, though it's not always obvious if you're not explicitly looking. Maybe half the very small number of instances of non-ASCII Roman numerals in English Wikipedia are actually in Japanese text. One way to approach this is just to come up with a list of reliable sources and do a site search to find a page with a Roman numeral. I've checked a few, though obviously the more that are checked the more reliable the sampling is. Feel free to round out the below with more sources you find reliable or if you can find a style guide that even mentions this issue that would be illuminating. -- Beland ( talk) 04:21, 2 December 2020 (UTC) reply

The New York Times - ASCII - [1]
The Washington Post - ASCII - [2]
Fox News - ASCII - [3]
Nature - ASCII - [4]
MIT Technology Review - ASCII - [5]
Ars Technica - ASCII - [6]
Al Jazeera English - ASCII - [7]
The Australian - ASCII - [8]
The Atlantic - ASCII - [9]
Bloomberg News - ASCII - [10]
CNET - ASCII - [11]
The Christian Science Monitor - ASCII - [12]
The Economist - ASCII - [13]
engadget - ASCII - [14]
Financial Times - ASCII - [15]
Forbes - ASCII - [16]
TheGuardian.com - ASCII - [17]
The Hindu - ASCII - [18]
The Intercept - ASCII - [19]
Mother Jones - ASCII - [20]
National Geographic - ASCII - [21]
NBC News - ASCII - [22]
New Scientist - ASCII - [23]
ProPublica - ASCII - [24]
Quartz - ASCII - [25]
Reason - ASCII - [26]
Scientific American - ASCII - [27]
Science-Based Medicine - ASCII - [28]
Slate - ASCII - [29]
Southern Poverty Law Center - ASCII - [30]
Snopes - ASCII - [31]
USA Today - [32]
The Verge - ASCII - [33]
Vox - ASCII - [34]
Wired - ASCII - [35]
ZDNet - ASCII - [36]
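Anyone wanting to repeat this kind of spot check mechanically could scan page text for the dedicated code points, roughly as sketched here. The function name, the sample sentence, and the decision to cover only U+2160 through U+217F (the uppercase and lowercase numerals in the Number Forms block) are assumptions made for illustration; fetching the page text is left out, and a clean result for one page says nothing about the rest of a site.

```python
# Assumed scope: the uppercase and lowercase Roman numerals in Number Forms.
ROMAN_NUMERAL_RANGE = range(0x2160, 0x2180)  # U+2160..U+217F inclusive


def find_roman_code_points(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs for every dedicated Roman numeral code point."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) in ROMAN_NUMERAL_RANGE]


# Invented sample: one numeral typed with letters, one with dedicated code points.
sample = "Elizabeth II opened the session; Louis \u2169\u2163 did not."

for index, ch in find_roman_code_points(sample):
    print(index, f"U+{ord(ch):04X}")
```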

To quote myself from above,

...if our citation formatting has never had to conform to the intricate specifications of the many organizations making bux off of selling [style guides], why would our Unicode character encoding of Roman numerals need to conform to anyone else's not-even-explicitly-specified practice?

Slapping together a list of web sites that supposedly don't use Unicode Roman numeral notation does not make your point, either. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 00:30, 8 December 2020 (UTC) reply

You needn't say "supposedly"; either they do or they don't, and you can verify that for yourself by following the links above and poking around on other pages on those sites if you like. I am responding to your claim that our lack of having identified any sources that use non-ASCII Roman numerals may simply be from lack of effort in researching that question. What other sites do has relevance because that is what both readers and authors of NLP systems will be expecting and familiar with. Wikipedia style does not have to follow the same practices as any other organization. In practice, though, we do tend to pick and choose from general-audience style guides and the practices of cited sources, and many editors who participate in MOS discussions find such evidence persuasive. If you personally don't, that's fine. -- Beland ( talk) 00:07, 9 December 2020 (UTC) reply

The very fact that you equate "poking around" with conclusively determining that a web site does not use Unicode Roman numerals in any instance, yet try to push back at any skepticism on my part, is about the right speed for the level of attention to detail you've shown during this discussion. So yeah, "supposedly".

You're doing the not-taking-responsibility-for-your-own-words thing again. You said If we can't identify any style guides that recommend non-ASCII Roman numerals, and we can't find any reliable sources that actually use them... What I said was,

And how would you know whether reliable sources, particularly printed reliable sources, use Roman numeral Unicode characters? Even if this were a valid argument in the first place (if our citation formatting has never had to conform to the intricate specifications of the many organizations making bux off of selling such things, why would our Unicode character encoding of Roman numerals need to conform to anyone else's not-even-explicitly-specified practice?) you have not exactly demonstrated yourself willing to put much effort into doing research.

You certainly have not proved that the printed New York Times does not use Unicode Roman numerals at any stage of its typesetting process, nor the web site either. And even if you could, it still wouldn't matter, because "bunch of undocumented practices which may have nothing to do with styling" does not equal "Beland gets what they want in disagreements over Wikipedia styling guidelines, which specify things to a much lower technical level than anyone else does anyways".

If you're worried that the appearance of properly-numeric-notation-encoded Roman numerals will be some sort of unexpected shock to the reader, I've responded to that genre of quibble already, but it looks like you have an argument with our friend David in that case:

...the two variations look identical on my screen; I would guess (another guess) that this is because the browser converts the precomposed ones to ASCII internally, so there is no actual benefit to precomposition for people who are just reading Wikipedia in browsers...

...we do tend to pick and choose from general-audience style guides and the practices of cited sources, and many editors who participate in MOS discussions find such evidence persuasive. If you personally don't, that's fine.—Of course, not only is this not much of an actual tendency of ours—again, citation formatting—you have not presented any evidence whatsoever from style guides, nor that your handwavy pointing to some web pages demonstrates anything to do with styling practices. Much less any persuasive evidence. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:53, 9 December 2020 (UTC) reply

I don't see how print sources are relevant to this question. Is there a particular visual appearance in the printed New York Times that you would like to emulate on Wikipedia? If you are unpersuaded that the practices of other organizations actually matter to this question, then I'm not going to spend more time increasing your level of confidence in these empirical findings. -- Beland ( talk) 19:35, 11 December 2020 (UTC) reply

I don't see how print sources are relevant to this question.—because print sources are implementing aesthetic styling of Roman numerals, as would be embedded images or hoary old Flash .swfs in "explainers" on the ancient static HTML NYT pages from the last century that are still kicking around if you follow the right links.

What I would like for Wikipedia is, of course, the Manual of Style to be followed—it is correct, as I have said again and again, that I don't think we should be following the supposed interpolated unwritten styling rules of other organizations—I think we should be following the Wikipedia Manual of Style. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 15:00, 12 December 2020 (UTC) reply

Your response is about visual appearance, but your complaint is about character encoding, that I "have not proved that the printed New York Times does not use Unicode Roman numerals at any stage of its typesetting process". There's no way for print readers to be able to tell what character encoding was used; visual appearance can be altered by choice of font and other typographic effects having nothing to do with character encoding. Surely it's irrelevant to readers if the encoding was changed in some intermediate version of the typesetting process that isn't even the finished product. I have not claimed that the use or non-use of an encoding in print intermediates or finished products should be persuasive. The searchability and copy-and-paste concerns I think are important for Wikipedia only apply to web text, and likewise other organizations face the same issues on their web sites but not in print media. I'm not advocating for any particular visual appearance, so evidence of how Roman numerals appear in print is not relevant to any argument I'm making. If we were to do a survey of print sources, neglecting the argument to ignore all external practices, then checking to see which visual appearance they have is a question we can more usefully answer, in support of your argument that a particular appearance is desirable. Though I don't see anywhere you actually specified what you're looking for in a finished web page or Wikipedia print version? Serif font? Narrow spacing? Only takes up one em width? -- Beland ( talk) 02:36, 18 December 2020 (UTC) reply

"Your response is about visual appearance, but your complaint is about character encoding." I don't have a "complaint"; I support the Manual of Style as it currently reads and have asked you to comply with it. You are complaining that all editors are not mandatorily required to follow your styling preferences, and wish the MOS to be changed to require that.

Anyone reading the above conversation can easily see that from my very first sentence I've characterized the use of Roman numeral Unicode code points as a valid style variation and directly answered your repeated WP:HUH? questions about what benefits I could possibly see by specifying aesthetic benefits, among others.

"Surely it's irrelevant to readers if the encoding was changed in some intermediate version of the typesetting process that isn't even the finished product." But, what, readers do care about which Unicode code points are used to represent Roman numerals in these Unicode web pages?

"I'm not advocating for any particular visual appearance, so evidence of how Roman numerals appear in print is not relevant to any argument I'm making." Then how exactly are Roman numerals encoded from collections of Basic Latin Unicode code points "what [readers] will be expecting and familiar with"? You are making so many simultaneously-contradictory arguments, even within the same sub-threads here; even you seem to be having difficulty keeping track of them.

Though I don't see anywhere you actually specified what you're looking for in a finished web page or Wikipedia print version?—what the MOS says. For the bazillionth time. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 14:46, 21 December 2020 (UTC) reply

I was referring to your "complaint" about the argument I was making. When I said you haven't specified what you're looking for, I was referring to visual appearance, which is an issue that the MOS currently does not address for Roman numerals. What readers are expecting and are familiar with is ASCII characters that they can type and search, and multi-letter Roman numerals rather than single-character numerals they can't type. Those attributes are what are relevant to usability. I don't think readers have any particular expectation of visual appearance for Roman numerals (especially if we're only talking about kerning differences) though it may be jarring if they do not match the surrounding font. -- Beland ( talk) 19:07, 7 January 2021 (UTC) reply
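For anyone following the searchability point above, here is a minimal sketch of why the two encodings behave differently under an exact-match search. It assumes a search that compares code points directly; the function names and sample strings are hypothetical, and real browsers and search engines may or may not apply this kind of folding themselves.

```python
import unicodedata


def naive_find(page_text: str, query: str) -> bool:
    """Exact code-point substring match, with no normalization."""
    return query in page_text


def folded_find(page_text: str, query: str) -> bool:
    """Substring match after NFKC compatibility normalization of both sides."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s)
    return fold(query) in fold(page_text)


# Hypothetical sample text: the same name written with each encoding.
basic_latin_page = "Henry VIII"
precomposed_page = "Henry \u2167"  # U+2167 ROMAN NUMERAL EIGHT

print(naive_find(basic_latin_page, "VIII"))   # typed query matches the Basic Latin text
print(naive_find(precomposed_page, "VIII"))   # typed query misses the single-character numeral
print(folded_find(precomposed_page, "VIII"))  # matches once both sides are NFKC-folded
```

Whether a given search tool behaves like `naive_find` or like `folded_find` is something that has to be checked case by case, which is the empirical question being argued over here.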

Discussion next steps

So just to be perfectly clear here, in the same way I object that principles like NPOV and, say, just about everything in Wikipedia:How to contribute to Wikipedia guidance § General recommendations weren't followed in proposing this change and aren't being followed in this discussion, if this talk page thread is closed while skipping steps in Wikipedia:Closing discussions I am not just going to go along with it. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:27, 30 November 2020 (UTC) reply
- Well, WP:NPOV applies to article text, not style questions, and certainly not to individual editors. Everyone, including you, is allowed to have personal opinions on questions of style. Sometimes agreement on style questions involves evaluating the reasons for those opinions, and sometimes it just comes down to whether A or B looks prettier to more people. Either way, expressing style opinions is a necessary and healthy part of the process, as long as the conversation is civil and approached in good faith. WP:GUIDANCE is an essay that doesn't even necessarily have community consensus. As for closing this discussion, after a week we have two editors in favor of the proposal and one opposed, and many many words written in support of both sides. I'm pretty sure this is an unimportant issue to most editors, and I doubt the support:oppose ratio will change materially if more are consulted, so I would be inclined to simply ask everyone for informal acceptance and move on.
  If by the above you are asking for a formal third-party closure, I think it would help the closer a lot if both sides pulled together a brief summary of the most salient arguments and counter-arguments. The discussion has certainly been helpful for me in discovering new arguments and refining them, but the giant walls of text are just too much for most volunteers to read (though of course anyone who wants to still can).
  We could also do a month-long formal RFC, in which case we'd need a neutral summary (which I'd be happy to let you draft). A summary of arguments for and against would again be very helpful, so we could stop writing missives back and forth and let other editors comment. -- Beland ( talk) 02:11, 1 December 2020 (UTC) reply
  - So btw, if you really care about accessibility, per MOS:INDENTMIX (part of WP:ACCESSIBILITY) you are not supposed to mix bulleted and non-bulleted list styles together. I've changed the style for your above comment to match the MOS, out of the assumption this was unintentional, but of course feel free to revert it if you find my change objectionable.
    I take it the above remarks, and titling a new heading here with the term "next steps", is an open declaration you simply have no intention to go back and follow any previous steps in Wikipedia:How to contribute to Wikipedia guidance § General recommendations such as
    Leave room for flexibility (or: Avoid instruction creep). [...]
    Don't be prescriptive. Devolve responsibility. [...]
    or
    Consult widely – make a special effort to engage potential critics of the new guideline, engage them and get them to help find the middle ground early.
    "Good faith"... you see that as compatible with not presenting the discussion of a policy change from a neutral point of view? Or compatible with your talk of whether you're going to allow editors to apply this consensus Wikipedia guideline to use non-ASCII characters in non-math articles, a guideline which was initially proposed in 2005?
    Amazing how quickly you're able to flip from, "My wild guesses about a supposedly universal utilitarian un-thought-out internet-wide practice equate to an iron-clad ineluctable undocumented Wikipedia styling rule from which there must be no variation!" to "It's just a thirteen year old essay, made from even older pages that used to be in the Help: namespace, which could totally merely be a coincidence instead of representative of standard Wikipedia practices and procedures, so it doesn't count!" But sure, regale me with how your creative interpretations apply to the more concise procedural policy WP:PGCHANGE.
    As far as "we have two editors in favor of the proposal and one opposed", surely an editor with such overweening faith in your own insight into WP:P&G knows what I'm going to say in response, right? Wikipedia:Consensus, "Consensus on Wikipedia does not mean unanimity (which is ideal but not always achievable), nor is it the result of a vote" (my emphasis), its explanatory supplement Wikipedia:Polling is not a substitute for discussion § Policy and guidelines, and Wikipedia:What Wikipedia is not § Wikipedia is not a democracy.
    Also, another important bit of policy from Wikipedia:Consensus § In talk pages:
    The quality of an argument is more important than whether it represents a minority or a majority view. The arguments "I just don't like it" and "I just like it" usually carry no weight whatsoever.
    Let's not forget that your self-identified "best argument" above for your de novo addition of a mandatory styling rule to the MOS turned out to be something not even specific to Unicode Roman numerals at all.
    If you still seriously want to proceed further here, and don't want to revise your previous ah, approach to achieving consensus, at all, sure, I'll write a summary. What length should we aim for? (standard third-party word count tool linked on noticeboards, for convenience) As far as an RfC, you're welcome to go ahead with that if you want (while notifying me and following procedures, of course), but you're the one who wants to change the existing guideline.
    And as a final note, in case you're actually sincere about any of the things you're saying regarding wanting Wikipedia to work better: I'm observing that if I go to the Mozilla MathML demo page "Proving the Pythagorean theorem" in either Firefox or Chrome, I can do a Ctrl+f search for "a 2 + 2 a b + b 2" and see it matched within the rendered formula, unlike with the present LaTeX image-based rendering on Wikipedia.
    So you could put your money where your mouth is, as it were, and work on some issues surrounding implementation of MathML—here's the first archived VP discussion of implementation status, from 2018, that came up.
    Or you could continue your pursuit of installing a new mandatory guideline measure requiring the destruction of information in Wikipedia articles, and continue taking up my time trying to defend the established and rather more reasonable guideline. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 19:02, 1 December 2020 (UTC) reply
    - When I was talking about "allow non-ASCII characters in non-math articles" I was specifically referring to Roman numerals; sorry if that wasn't clear. I'm actually an advocate of using non-ASCII Unicode characters in most situations where there's not an ASCII alternative. It sounds like your preference is to run an RFC. In that case, to answer your question, the example summaries on WP:RFCBRIEF all seem to be a single sentence. -- Beland ( talk) 23:11, 1 December 2020 (UTC) reply
      - "It sounds like your preference is to run an RFC." You are firmly in WP:IDIDNTHEARTHAT territory at this point. My repeatedly stated preference, at your talk page and here, is for this established and rather more reasonable guideline to remain as it is. It's unprofessional, undignified, and further clumsy rhetoric to persist in pretending that your desire to change this guideline and arrogation that you won't allow editors to use Unicode Roman numerals amongst Unicode Basic Latin characters is somehow my wish. Be your own person and take responsibility for your own actions.
        And on the same theme, WP:RFCBRIEF / WP:RFCNEUTRAL—shortcuts pointing to the same section Wikipedia:Requests for comment § Statement should be neutral and brief—apply to you as the editor making the request for a comment from the community, not me. And I certainly see no reason to be any more neutral in any responding comment I might make than you have been above. Also, lest you try to act as if it's unfair, I will point out the specific arguments you've made here, on your user talk page, and your general rhetorical behavior here as well if I choose.
        And if you do not follow policies, guidelines, and procedures both to the letter and in spirit, or even just don't follow orthodox practice, or again try to make up extra rules and claim you're merely following them so as to put your thumb on the scale, or any of the other rhetorical crap you've been pulling in this talk page section, I will point those things out as stridently as I choose. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 00:30, 8 December 2020 (UTC) reply
        I agree that this appears to be in WP:IDIDNTHEARTHAT territory and suffering from inflamed rhetoric. I disagree about the identity of the editor who is not hearing things and perpetrating inflamed rhetoric. Yes, you have repeatedly and at great length stated your preferences. I don't see a lot of others coming to your support. I think an actual RFC could be helpful as a way of attracting a wider group of editors and making the actual level of support for these characters more clear. — David Eppstein ( talk) 01:30, 8 December 2020 (UTC) reply
        Speaking of rhetoric—who was accusing me of actively avoiding the question above, but seems to have petered out on responding to questions about their own use of fallacies? Sure, let's have an RfC if that's what you guys want. By the rules, though. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:53, 9 December 2020 (UTC) reply
        
        I'm not sure what on WP:PGCHANGE you think has been violated. That procedural policy says it's allowed to just go ahead and edit policy pages if someone thinks it's a good idea. I didn't do that and am following the "talk first" approach which is also described there. Which is essentially what you requested on my user talk page. And which is the only thing that makes sense to me given there's a dispute over what markup styles are desirable. -- Beland ( talk) 05:02, 9 December 2020 (UTC) reply

Well, given the majority of editors in the discussion so far want to adopt the change, "leave things as they are" is not an option unless more editors weigh in to support that. It sounded like you were volunteering to write the neutral summary for the RFC, but it wasn't entirely clear, which is why I repeated back to you my understanding. Since it now seems you are declining to do so, and there is majority support for an RFC, here is my draft, which you can inspect for neutrality:

Should markup for Roman numerals be restricted by the Manual of Style to Latin letters only (ASCII characters like "VII") and exclude characters in the U+21XX range (like "Ⅶ")?

As mentioned above, here's an improved version of the proposed addition to the MOS:

For Roman numerals, ASCII Latin letters should be used instead of the equivalent Unicode characters in the U+21XX range. For example, L and VI, not Ⅼ, and not precomposed characters like Ⅵ. (The only exception is when discussing the Unicode characters themselves.)

Before the RFC starts, feel free to propose any tweaks that would make you happier in the event that this is adopted. (It's unfortunately usually difficult to get RFC participants to come back and give a second opinion on an amended option.) -- Beland ( talk) 00:55, 9 December 2020 (UTC) reply

WP:PGCHANGE says,

because policies and guidelines are sensitive and complex, users should take care over any edits, to be sure they are faithfully reflecting the community's view

Silly me to think that, after all of this discussion, you might be able to think of some view you weren't faithfully reflecting. I keep over-estimating you.

Nice attempt to declare that there are no options other than what you want, I guess, but the Gish gallop is for unstructured debate about things like creationism, not written, change-managed, behavioral-P&G-governed Wikipedia policy discussion.

So, me characterizing your ah, requests, related to rewording your desired mandatory rule changes to this guideline as Asking me to do your work for you and write the "specific phrasing" of your proposed addition accurately, or my response to your inquiry about "formal third-party closure" of this thread, which I explicitly separated from my remarks "As far as an RfC..." are things you heard as "volunteering to write the neutral summary" that "wasn't entirely clear", eh? Right.

I'm definitely not objecting to an RfC at all, just insisting that policies, guidelines, and procedures be followed. WP:RFCBRIEF / WP:RFCNEUTRAL isn't an excuse to take a tabula rasa approach to the RfC, as though we haven't had the above discussion; as it says,

If you have lots to say on the issue, give and sign a brief statement in the initial description and publish the page, then edit the page again and place additional comments below your first statement and timestamp. If you feel that you cannot describe the issue neutrally, you may either ask someone else to write the question or summary, or simply do your best and leave a note asking others to improve it. It may be helpful to discuss your planned RfC question on the talk page before starting the RfC, to see whether other editors have ideas for making it clearer or more concise.

Your RfC should explicitly state that you wish to overrule MOS:STYLERET in these cases, excluding all other styling variations like plain Unicode and I'm assuming things like MathML character entities when MathML is implemented ( see here for example), or if not, you should say so. To faithfully reflect the views you are aware of, at least mention our differing opinions of better styling, the instances we investigated where popular search engines do and do not handle them properly, machine readability, the usability issues you brought up and my responses to them, and the absence of any external style guides speaking to the matter either of us have been able to find.

As I've said, for clarity, and because you are specifically talking about character encoding and not the string comparison algorithm of RFC 20 or some W3C documents, I think that the term "Basic Latin" linked to the article Basic Latin (Unicode block) should be used in place of ASCII, in any sentence addressing character encoding such as this—particularly a sentence that's going to appear in Wikipedia P&G, where we use technical terminology carefully. The wording proposed to be included in the guideline itself should also emphasize that it's really, actually trying to mandate sequences of Latin letters in lieu of specific numerical notation encoding, since this is proposed to follow the §Special symbols rule of thumb saying that mathematical versions of symbols should be used when glyphs are similar. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 21:53, 9 December 2020 (UTC) reply

WP:PGCHANGE is talking about edits to policy and guideline pages, not talk pages. I haven't made any such edits on the Roman numeral question. We're having this very long discussion because I'm carefully trying to build consensus before making such an edit. I will update the draft, but honestly first I need to take a break because the level of snark and personal hurtfulness in your comments is upsetting. -- Beland ( talk) 20:02, 11 December 2020 (UTC) reply

Uh, okay, so if the WP:PGCHANGE subsection of Wikipedia:Policies and guidelines—of which I will point out you literally just said, I [...] am following the "talk first" approach which is also described there (the alternative being boldly making an edit one expects to be unchallenged, which in this case I'd simply have reverted anyways and you knew this after the discussion on your user talk page: it is not some virtuous thing to refrain from starting an edit war on a policy guideline page, it's pretty much just minimal expected proper editor conduct), and which also reads, Because Wikipedia practice exists in the community through consensus...—does not govern talk page discussions seeking consensus to change the text of a guideline, what policy or guideline does govern such discussions?

When it's a matter of rules that would restrict the behavior and Wikipedia editing practices of other people it seems like you can't wait to conjure them out of thin air and grasp at straws for a way to impose your own will through them—but when it comes to any rules which would apply to your own behavior, it's WP:HUH? -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 15:00, 12 December 2020 (UTC) reply

There are lots of policies and guidelines to consider when having talk page discussions. When you above quoted "because policies and guidelines are sensitive and complex, users should take care over any edits, to be sure they are faithfully reflecting the community's view" and said "Silly me to think that, after all of this discussion, you might be able to think of some view you weren't faithfully reflecting." I read that as accusing me of violating the quoted policy. As the quoted policy refers to Wikipedia namespace edits and not talk page edits, as I said, I haven't made any in that namespace on this question. If this is an accurate understanding, I'd appreciate it if you withdraw the apparently false accusation. If not, I'd appreciate it if you'd clarify what you think I've done wrong.

The only thing I can trace all this disgruntlement back to is your original complaint that the opening of this discussion above wasn't neutral. I don't find any policy or guideline that requires it to be. Certainly WP:RFCNEUTRAL requires the top part of RFCs to be neutral, but I was not starting an RFC, just an informal talk page discussion. The idea that there's a general practice that all discussions should start out neutral is false. Most informal discussions start with someone complaining or asking a pointed question or saying something needs to be changed. For processes like RFCs and Wikipedia:Third opinion, the only prose that's required to be neutral is that which is seen on pages other than the talk page where the discussion is happening. For this discussion, as with all informal discussions, there's no off-page summary. Even many formal non-RFC discussions start out with a persuasive rationale. See for example WP:RM#CM, which explicitly points out that page move nominations need not be neutral, in contrast to RFCs. No matter the perspective of the person who starts a discussion, if it's controversial someone with an opposing view will soon contribute. Other people have the opportunity to reframe the question or say it's the wrong question to be asking in the first place, and propose a different question be asked. Everyone gets to read the whole discussion and consider the points that all the participants are making, and in the end it really shouldn't matter who spoke first, if editors are judging the outcome on the merits of the points made. -- Beland ( talk) 02:02, 18 December 2020 (UTC) reply

"As the quoted policy refers to Wikipedia namespace edits and not talk page edits, as I said..." What you said was, "I [...] am following the 'talk first' approach which is also described there." You have not at any point attempted to faithfully reflect the community's views: you haven't even faithfully reflected what the guideline currently says.

WP:RM#CM literally says,

Unlike other request processes on Wikipedia, such as Requests for comment, nominations need not be neutral. Make your point as best you can; use evidence (such as Google Ngrams and pageview statistics) and refer to applicable policies and guidelines, especially our article titling policy and the guideline on disambiguation and primary topics. [...] Requesters should feel free to notify any other Wikiproject or noticeboard that might be interested in the move request, as long as this notification is neutral.

(My emphasis.) You're seriously trying to suggest that, while the very non-P&G page you quote explicitly says that other procedures are expected to be neutral for mainspace pages, and that even comments mentioning the existence of requested mainspace article move discussions must be neutral, you can just say whatever you want in a proposal to change P&G, even though the governing policy WP:PGCHANGE explicitly refers to "faithfully reflecting the community's view"?

I don't believe, in all the years I've worked on Wikipedia, that I've ever brought up the Wikipedia:Wikilawyering essay in a discussion. But this would appear to be an appropriate point to do so.

Don't try to misrepresent what I've said as being that PGCHANGE proposals can't be persuasive, because that's clearly not what I've said—I pointed you to an entire essay on the subject of changing P&G Wikipedia:How to contribute to Wikipedia guidance § General recommendations written by other editors. You offhandedly dismissed it as an essay that doesn't even necessarily have community consensus but you can't claim I haven't thoroughly and specifically justified my statements about P&G and how this process is supposed to work.

Yes, it really shouldn't matter who spoke first... it shouldn't, IF everyone is participating in good faith, weighing arguments in good faith, and neutrally, faithfully seeking to reflect the community's view and arrive at consensus. But you have explicitly chosen not to do that in this discussion and I'm not just going to assume you'll follow P&G in subsequent discussions. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 14:46, 21 December 2020 (UTC) reply

Well, your own comments on this talk page are not "faithfully reflecting the community's view", as none of the other participating editors have agreed with your overall position, and they are clearly not neutral. I don't think that means you have violated WP:PGCHANGE, does it? -- Beland ( talk) 19:28, 7 January 2021 (UTC) reply

Draft arguments

Neutral summary:

Should markup for Roman numerals be restricted by the Manual of Style to Basic Latin (ASCII) letters only (like "VII") and exclude characters in the U+21XX range (like "Ⅶ")? -- 04:16, 21 December 2020 (UTC)

This RFC proposes adding the following to the end of Wikipedia:Manual of Style/Mathematics#Special symbols:

For Roman numerals, Basic Latin (ASCII) letters should be used instead of the equivalent Unicode characters in the U+21XX range. For example, L and VI, not Ⅼ, and not precomposed characters like Ⅵ. (The only exception is when discussing the Unicode characters themselves.)

Related style guidelines:

It is disputed whether the general preference for non-ASCII characters at Wikipedia:Manual of Style/Mathematics#Special symbols currently applies to Roman numerals. This proposal would make it clear it does not apply. If this proposal fails, we could decide to affirm that the non-ASCII encoding is preferred (which would imply changing millions of instances), or that either encoding is acceptable, referencing MOS:STYLERET to limit the circumstances in which any given instance could be changed from one encoding to the other.
MOS:ORDINAL and MOS:SMALLCAPS currently write Roman numerals in ASCII characters. (Whether this implies other encodings are not allowed is disputed.)

-- Beland ( talk) 04:16, 21 December 2020 (UTC) reply
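As an illustration only, the substitution the draft wording describes could be sketched like this. The sketch assumes that "the U+21XX range" means U+2160 through U+217F, the Number Forms code points that have compatibility decompositions to Latin letters; the function name and sample sentence are hypothetical, and this is not an existing bot or script.

```python
import re
import unicodedata

# Assumption: only U+2160..U+217F are targeted. The archaic numerals at
# U+2180 and above have no Latin-letter decomposition and are left alone.
ROMAN_NUMERAL_FORMS = re.compile("[\u2160-\u217F]")


def to_basic_latin(text: str) -> str:
    """Replace each Number Forms Roman numeral with its Latin-letter sequence."""
    return ROMAN_NUMERAL_FORMS.sub(
        lambda m: unicodedata.normalize("NFKD", m.group()), text
    )


# Hypothetical sample: U+2165 (six), U+2173 (small four), U+216F (one thousand).
sample = "Chapter \u2165, page \u2173, year \u216F"
print(to_basic_latin(sample))
```

Restricting the regular expression to that one range, instead of normalizing the whole string, keeps unrelated characters such as ½ or superscripts untouched. A real cleanup would also need to skip passages that discuss the Unicode characters themselves, per the exception in the draft.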

The following arguments in favor are mostly summarized from the above discussion and were written by Beland with suggestions from other editors... (moved to #Roman Numerals RFC)

I'm assuming you would like to write the arguments against, Struthious Bandersnatch? Anything to add or change, David Eppstein?

I did not add anything explicitly about MathML. Based on the page you linked to, SB, MathML appears to be using the "Unicode characters in the U+21XX range", which are already mentioned. -- Beland ( talk) 04:16, 21 December 2020 (UTC) reply

Looks ok to me. I think MathML is a red herring; almost all mathematics formulas do not use Roman numerals, almost all instances of Roman numerals are outside of mathematics formulas, and almost all people who read Wikipedia do not use browsers with working support for MathML and will not be affected by MathML handling (except in the way they always have been affected, by the fact that the delusional hope of eventual MathML has blocked the use of other systems like MathJax and KaTeX that work now and are better than what Wikipedia uses). If the Wikimedia developers for some reason were to decide to include Roman-numeral unicodes in the input or output representations of mathematics formula processing, it is not something that this RFC can affect. So I agree with not including anything about that. — David Eppstein ( talk) 07:04, 21 December 2020 (UTC) reply

Some comments.

The linking to certain sources is a WP:Common style fallacy and is accordingly also irrelevant to the discussion. It is sufficient to say "we don't know if any style guides address this issue" (which is itself a mistruth; the Unicode consortium's recommendation can be seen as a/the pertinent style guide).
Screen readers reading numbers off is likely their way of interpreting the "numbers in box" effect a sighted person would see when the browser/OS has no font with the glyphs in question. If in this list, it should be as a corollary to the same point for a visual browser, currently #8.
MOS:MARKUP might be something to indicate, and perhaps WP:ACCESS. I would probably add one of the more oft-requested parallels here, which is curly-quotes. Many of the items above are similar to the reasons for our rejection of those marks, though the reasons go undocumented in the MOS itself. This precomposition is also similar to our rejection of precomposed fractions at WP:MOSMATH#Fractions.

-- Izno ( talk) 07:27, 21 December 2020 (UTC) reply

I've edited the above arguments to reflect these comments; feel free to tweak. I think the "common style fallacy" comment is simply an argument against one of the points above, so I'll start an "arguments against" section below and put that there. -- Beland ( talk) 16:30, 21 December 2020 (UTC) reply

@ Beland: To "do your best" to "describe the issue neutrally" as WP:WPRFC says:

An RfC on this topic should explicitly state that this new rule will override the current Wikipedia:Manual of Style/Mathematics § Special symbols guidance that As a rule of thumb, specific mathematical symbols shall be used, not similar-looking ASCII or punctuation symbols, even if corresponding glyphs are indistinguishable. It should do this in the initial summary.
It should not have the "If it fails" clause, at least not phrased as above; it should clearly indicate that the status quo ante, under which either encoding is acceptable and MOS:STYLERET [prohibits] changing any given instance from one to the other, reflects the current guidelines in force.
The bit about MOS:ORDINAL and MOS:SMALLCAPS is misleading—if something like this is included, it should explicitly state that the editor styling of the text of those guidelines at some point in the past chose to do so with Unicode Basic Latin characters, not that the guidelines themselves mandate a choice of Unicode code points for writing Roman numerals.
The paragraph beginning with Using a web browser should instead begin with, "Like most preferred styling approaches in MOS:MATHS, using a web browser..."
The paragraph beginning with Some screenreaders pronounce should instead begin with, "Some screenreaders pronounce most code points referred to by Wikipedia:Manual of Style/Mathematics § Special symbols in an overly verbose way which makes for poor accessibility, and this is true of Roman numeral code points..."
"The Unicode standard says not to use the characters we would be prohibiting"—this whole list item is pretty plainly not doing your best to present what the Unicode standard says on the subject neutrally, given the entire above discussion.
"These non-ASCII characters are much more difficult to type and edit."—this is no more true of Unicode Roman numerals than anything else in Wikipedia:Manual of Style/Mathematics § Special symbols, or anything in the "Insert" section of the basic MediaWiki editor. This is not doing your best to present opinions on the subject neutrally.
In the subsequent list item, referring to the use of Unicode Roman numerals as a "mess" is obviously not doing your best to present opinions on the subject neutrally.
The claim that you've demonstrated 36 web sites don't make use of Unicode Roman numerals is untrue, as Izno points out the implication that this would be relevant is WP:Common style fallacy, and this is not doing your best to present opinions on the subject neutrally.
In the list item mentioning that "non-ASCII characters might not render" it should be pointed out that in non-Unicode-compatible systems, most characters and stylings preferred by MOS:MATHS will not render, as is the case with other impairments—the preferred embedded-image typesetting of formulae of course does not render in most terminals or Windows notepad.exe mentioned as having shortcomings.
As you decided to bring up fonts, to do your best to present opinions on the subject neutrally you should point out that any concerns related to differing presentations could be solved with open-source web fonts.
Ignoring that actual Wikipedia P&G is currently MOS:STYLERET, to instead claim that your own preferences are "Wikipedia's de facto preference" is untrue, and stating "it would take an enormous amount of work to convert all instances to the non-ASCII version" as though anyone has proposed this, when in fact my very first sentence in this talk page explicitly states that I am not proposing this, is attempting to create a false dilemma and is not doing your best to present opinions on the subject neutrally.
[NLP systems encountering Unicode Roman numerals] appearing rarely or some of the time probably results in worse performance than not appearing at all: as I've said repeatedly above, [citation needed]. Repeating this again and again with no evidence does not make it true, and of course doing so for the 𝑛th time, or offering your own third-party system which you won't have designed to support all Unicode characters if you're opposed to that as some sort of generalization "we" have "seen", is not doing your best to present opinions on the subject neutrally.
And of course, one of my basic arguments—that Unicode Roman numeral code points encode different information than collections of, or individual, Basic Latin characters, and that therefore removing them is actually removing information from Wikipedia articles—isn't even mentioned in this formulation of an RfC, which also is not doing your best to present opinions on the subject neutrally.

As far as what I might do, I'll wait to see how you carry out your responsibilities under Wikipedia P&G as the editor requesting a comment from the community (noting that you yourself linked to a Wikipedia: namespace page above giving RfCs as an example of a Wikipedia request process which needs to be neutral) before I decide how I will comment myself.

@ David Eppstein: Funny how transient concerns about accessibility are when they don't support your desired mandated styling rules; it becomes a "delusional hope", as I see you surreptitiously added to your comment. (MathML remedies several accessibility concerns which apply to most preferred styling in MOS:MATHS, but which have been brought up as objections to Unicode Roman numerals alone here.)

[A]lmost all people who read Wikipedia do not use browsers with working support for MathML—this isn't true. Firefox/Gecko supports MathML and so does WebKit. With Edge switching over to be (WebKit-derived) Blink-based last year— Microsoft Edge § Anaheim (2019–present)—all major browsers now contain at least the code to support MathML. Even for browsers with MathML support turned off or older browsers lacking support, extremely mature javascript polyfills/shims like MathJax are available to enable rendering.

So the path to MathML is pretty firm; it's only "delusional" if one assigns no importance to accessibility and other benefits. MathJax has a variety of accessibility measures but the specific concern you and Beland voiced about Ctrl+f doesn't seem to work, or works differently, from native MathML, in my cursory testing. (Which would appear to affirm that progress towards native MathML support in browsers and in Wikipedia will be optimal for that particular aspect of accessibility.)

@ Izno: MOS:MARKUP is an interesting intersection, though again something which would bear on everything in Wikipedia:Manual of Style/Mathematics § Special symbols rather than specifically having to do with Roman numerals. I could see creating some sort of templates like those in Category:Logic symbol templates and preferring their use for Mathematical Alphanumeric Symbols and Roman numerals, in lieu of or in combination with the Insert dropdown of the MediaWiki basic editor. -- ‿Ꞅtruthious 𝔹andersnatch ͡ |℡| 14:46, 21 December 2020 (UTC) reply

@ Struthious Bandersnatch: The neutral statement ends at the timestamp "04:16, 21 December 2020". The list below that is clearly marked "arguments in favor" and are not intended to be neutral; they are the opinions of myself and concurring editors. It is intended to be followed by a list of "arguments against", which I have not drafted because I did not want to put words in your mouth. Your points 4 through 14 are criticisms of the arguments in favor. Feel free to add those as points in the "arguments against" list (started below based on Izno's comments), if you think they are important enough to bring to the attention of RFC participants. For point 3, personally I interpret the ASCII-only nature of MOS:ORDINAL and MOS:SMALLCAPS as evidence only ASCII Roman numerals are desirable. For the sake of neutrality, I have noted that this implication is disputed. For point 2, I dispute your claim that the MOS already allows non-ASCII Roman numerals. My understanding for several years has been that MOS:MARKUP and MOS:ORDINAL indicate a preference for ASCII, recent investigation implies to me that Roman numerals are out of scope for Wikipedia:Manual of Style/Mathematics#Special symbols, and MOS:STYLERET does not apply because non-ASCII Roman numerals are essentially an error, and encodings should only be changed in one direction. However, if this RFC fails, then I would acknowledge that consensus is against my interpretation. For point 1, I dispute that Roman numerals are within the scope of the "Special symbols" section as currently written. Personally, my thinking is that arguing over what the rules were is a waste of time, given when this RFC is done we'll have perfect clarity and may well have simply changed the rules. If you want to add "this is already allowed by the MOS" to the list of arguments against the proposal, along the lines of points 1 and 2, then RFC participants can decide if they agree, but I don't endorse those as clearly established interpretations. 
(If we had agreed on the status quo ante, we wouldn't have been making conflicting edits in the first place.) We could also add those points to the neutral summary but mark them as disputed, if you prefer. -- Beland ( talk) 17:03, 21 December 2020 (UTC) reply

@ Struthious Bandersnatch: On your point 12, I did not say that anyone was proposing changing all the instances of Roman numerals to the non-ASCII encoding. However, if you believe that Wikipedia:Manual of Style/Mathematics#Special symbols applies to Roman numerals, that implies that the non-ASCII encoding is currently mandatory because it says "shall", not "may". If you want MOS:STYLERET to apply, I think someone would need to propose a wording change to specifically say that Roman numerals have multiple acceptable styles. I have clarified the RFC language to explain this a bit better. -- Beland ( talk) 03:43, 8 January 2021 (UTC) reply

@ Beland: I'm new to this discussion tonight but am utterly convinced after skimming the discussion and reading these arguments. I'd simply suggest spelling out NLP on its first usage in argument 3. Nice job making this case, and keeping your patience. Retswerb ( talk) 10:04, 5 January 2021 (UTC) reply

Done, and thanks! -- Beland ( talk)

@ Struthious Bandersnatch: Happy New Year! We haven't heard from you on this topic in a while. I was hoping that you would write a summary of your arguments in your own words, as you originally promised, because they are best presented by someone who actually believes in them. Non-response can't be a veto in favor of the status quo, so the RFC will proceed either way. Rather than running the RFC without a summary of arguments against, I have drafted my own summary of your arguments below. Feel free to throw it away completely and express your ideas in your own way, or tweak it if you think some is worth keeping. If there's no response in a week or so, I'll go ahead with the RFC. -- Beland ( talk) 19:56, 7 January 2021 (UTC) reply

(moved to #Roman Numerals RFC)

Generalization

This discussion is a new instance of many similar past discussions about non-ASCII Unicode characters and symbols. Examples that I remember include ellipses ( $...$ ), radical sign ( $\sqrt$ ), blackboard bold ( $ℝ$ ), function composition ( $\circ$ ), integer exponents ( $x^2$ ), fractions ( $1⁄2$ ), but the list is certainly not complete. From all these discussions, I arrive at the following suggestion for the manual of style:

The use of non-ASCII Unicode characters and symbols is discouraged unless there is no convenient equivalent in plain text or LaTeX (in mathematical formulas), or when talking about them. (Typo and grammar fixed as suggested below. D.Lazard ( talk) 10:14, 10 January 2021 (UTC)) reply

The rationale for this is

This is the conclusion of almost all discussions that resulted in a consensus.
The rendering of Unicode symbols strongly depends on the surrounding fonts, the browser used and its configuration. For example, the rendering is often different inside and outside {{ math}}.
Unicode is a standard for font design, not for writing style. So Wikipedia editors are not supposed to know Unicode. Moreover, in mathematics, the standard for symbols is LaTeX, not Unicode.
Many mathematical symbols are associated with rules for spaces around them, and the rules change with the semantics of the symbol. For example, the spaces around $|$ are not the same depending on whether it is a bracket (absolute value), a separator (set builder notation) or an operator (divisibility). With Unicode, an editor must know these rules well in order to apply them manually, while this is done automatically by LaTeX, if the correct macro is used.

For the present discussion, I can add that the semantics of a Roman numeral is based on the fact that it is a sequence of digits represented by Latin letters. The combined Unicode symbols destroy this semantics. D.Lazard ( talk) 14:12, 9 January 2021 (UTC) reply

@ D.Lazard: I think there may be a typo in the proposed green text above? Should it say that "non-ASCII Unicode characters" are discouraged unless there is no convenient ASCII or LaTeX equivalent? MOS:ELLIPSIS favors ASCII, MOS:RADICAL favors LaTeX, MOS:BBB favors ASCII or LaTeX, MOS:UNITS favors font effects on ASCII characters, MOS:FRAC favors LaTeX or font effects on ASCII characters, and I'm not sure where the discussion on function composition took place. -- Beland ( talk) 01:53, 10 January 2021 (UTC) reply

To editor Beland: Thanks. Of course. I have edited my suggestion.

Thanks also for the links. One may also add MOS:' and MOS:CURLY which favor ASCII.

About function composition, I have found two sections in Talk:Function composition that discuss the right Unicode character to be used, and recommend U+2218. But they do not discuss the use of LaTeX instead of Unicode, nor take into account that, with Safari, the rendering of this Unicode character is so small that it is difficult to distinguish it from a dot (at least on my laptop). So, in my opinion, my suggestion must apply also to this case. D.Lazard ( talk) 10:14, 10 January 2021 (UTC) reply

@ D.Lazard: and other interested editors...

So favoring ASCII characters over non-ASCII would mean:

ASCII Roman numerals would be preferred as proposed above.
U+002D - HYPHEN-MINUS would be preferred over U+2212 − MINUS SIGN and &minus;. That would be a change to Wikipedia:Manual of Style/Mathematics#Minus sign.
U+002A * ASTERISK (&ast; where needed due to wiki markup) would be preferred over U+204E ⁎ LOW ASTERISK and U+2217 ∗ ASTERISK OPERATOR (&lowast;).
U+003A : COLON and U+003D = EQUALS SIGN would be preferred over U+2236 ∶ RATIO, U+2254 ≔ COLON EQUALS, and U+2255 ≕ EQUALS COLON.
U+007E ~ TILDE would be preferred over U+223C ∼ TILDE OPERATOR and U+223D ∽ REVERSED TILDE

I would support all of those preferences.

Wikipedia:Manual of Style/Mathematics#Multiplication sign prefers U+00D7 × MULTIPLICATION SIGN or &times; (and &sdot; where appropriate). I'd lean away from changing × to the ASCII letter "x", just because it's typographically distinct and there are a very large number of instances. If there is consensus in favor of keeping ×, it would be good to note it explicitly as an exception. I would support changing U+2715 ✕ MULTIPLICATION X, U+2A09 ⨉ N-ARY TIMES OPERATOR, and U+2A2F ⨯ VECTOR OR CROSS PRODUCT to U+00D7 × MULTIPLICATION SIGN for the same reasons as we prefer ASCII characters, like find-in-page consistency. Where these characters do appear, they are usually not used "correctly" according to how the Unicode standard defines the semantics. Even though U+00D7 is in a slightly higher character range, it's much more widely used than the others, and is more easily accessed because it is on the special character list in every Wikipedia edit window (for desktop browsers).

I would also support converting all instances of "&times;" to "×" since the difference with "x" is pretty obvious, and we almost always already do this anyway.

For the record, in the December 20, 2020, database dump, I see:

1,452,304 instances of U+00D7 × MULTIPLICATION SIGN
17,159 instances of &times;
123 instances of U+2715 ✕ MULTIPLICATION X
About 20 instances of U+2A09 ⨉ N-ARY TIMES OPERATOR, in contexts where the multiplication sign is more appropriate.
33 instances of U+2A2F ⨯ VECTOR OR CROSS PRODUCT
563,214 instances of U+2212 − MINUS SIGN
59,897 instances of &minus;
48 instances of U+204E ⁎ LOW ASTERISK
2,093 instances of U+2217 ∗ ASTERISK OPERATOR
No instances of &lowast; (other than discussing the character itself)
A bunch of non-math uses of U+2055 ⁕ FLOWER PUNCTUATION MARK
451 instances of U+2236 ∶ RATIO
34 instances of U+2254 ≔ COLON EQUALS
No instances of U+2255 ≕ EQUALS COLON (other than discussing the character itself)
1,097 instances of U+223C ∼ TILDE OPERATOR
A handful of U+223D ∽ REVERSED TILDE
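
As an illustration only: a tally like the one above could be approximated by scanning text for a watch list of code points. The sketch below is not the script used on the database dump (that script is not shown in this discussion); the `WATCHED` list, the `tally` helper, and the sample string are all hypothetical.

```python
from collections import Counter
import unicodedata

# Hypothetical watch list: a few of the code points discussed above.
WATCHED = ["\u00D7", "\u2715", "\u2212", "\u2217", "\u223C"]

def tally(text, watched=WATCHED):
    """Count occurrences of each watched code point in text."""
    counts = Counter(ch for ch in text if ch in watched)
    # Label each entry the way the list above does: "U+XXXX NAME".
    return {f"U+{ord(ch):04X} {unicodedata.name(ch)}": counts[ch] for ch in watched}

# Made-up sample text, not taken from any article.
sample = "3 \u00D7 4 \u2212 2 \u00D7 1"
report = tally(sample)
for label, count in report.items():
    print(f"{count} instances of {label}")
```

Counting HTML entities such as &times; would need a separate substring count, since they are stored as plain ASCII in the wikitext.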

Favoring LaTeX markup over non-ASCII Unicode characters is an interesting but much more complicated question which I would like to discuss sometime soon. I'm going to defer that for now, since the ASCII preference alone is pretty complicated. Given the very long discussion we've already had and the complicated arguments made, I'd like to proceed with the Roman numerals RFC as planned, to get an explicit consensus on that. Either after that or in parallel, I think we should discuss flipping Wikipedia:Manual of Style/Mathematics#Special symbols to prefer ASCII symbols, which as mentioned, would affect asterisk, colon, equals, tilde, and perhaps others. If there is no opposition on this talk page, would we want to just make the change, or would we want to do a formal RFC on that, given there must have been a pre-existing consensus to write the current rule? Would we want to make flipping Wikipedia:Manual of Style/Mathematics#Minus sign to prefer hyphen-minus a separate discussion? Lump it in with the rest? Maybe do a single RFC but ask editors if it should be kept as an exception? -- Beland ( talk) 20:40, 15 January 2021 (UTC) reply

I do not think you would find consensus to deprecate/change times and minus usage (especially the latter, given that distinguishing between straight horizontal lines is not in MOSMATH's sole authority). The more specialized times symbols might feasibly have consensus to be deprecated, but personally I do not think they should be. On the others I express no current opinion. -- Izno ( talk) 22:52, 15 January 2021 (UTC) reply

I, for one, would strongly oppose using hyphens in place of minus signs and asterisks in place of multiplication signs in mathematical formulas, and I strongly suspect that a large subset of WP:WPM regulars who care about mathematical typography would as well ( User:Michael Hardy, for instance). This is very different from the Roman numeral issue, where the special characters have no benefit in appearance. Hyphens look too different from minus signs to make an adequate substitute. And insisting on ASCII when Unicode supplies typography that is clearly and visibly better is very 1990s; it's an outdated attitude that is incompatible with long practice in MOS:DASH and elsewhere in the MOS. — David Eppstein ( talk) 00:28, 16 January 2021 (UTC) reply

OK, a couple of points:

That which is being called "LaTeX" here is of course obviously NOT LaTeX. I wonder if some people master the stripped-down TeX that is used here and think they've learned LaTeX. They're in for a shock if they are called upon to use actual LaTeX. Nor is it the same as (actual) TeX.
For inline use, the "LaTeX" used here sometimes (often) results in mismatches in fonts or character sizes. This is not a problem in a displayed, as opposed to inline, context.

Michael Hardy ( talk) 05:44, 19 January 2021 (UTC) reply

It is true that inline <math></math> produces small mismatches in font or character size, but many Unicode characters produce large mismatches: here are some of the most common mathematical symbols, displayed inside and outside {{ math}}:

\forall ∀
\exists ∃
\neq ≠
\in ∈
\infty ∞
\subseteq ⊆

On my screen (standard configuration of Safari on a MacBook Air), the two versions have very different sizes, and, when the sizes are similar, they have a different vertical alignment. So, Unicode has many more rendering problems than <math></math>.

I agree with some above comments about minus and multiplication sign. So I change my suggestion into:

The use of non-ASCII Unicode characters and symbols is discouraged unless there is no convenient equivalent in plain text or LaTeX (in mathematical formulas), or when talking about them. This does not apply to the non-mathematical use of these symbols and to symbols that are commonly used outside mathematics, such as the minus and the multiplication signs.

I am not completely happy with this formulation, because its application to Roman numerals is unclear. But I am pretty sure that, if a consensus is reached on the principle, a better formulation will be found. D.Lazard ( talk) 10:50, 19 January 2021 (UTC) reply

Roman Numerals RFC

The following discussion is an archived record of a request for comment. Please do not modify it. No further edits should be made to this discussion. A summary of the conclusions reached follows.

There's a strong consensus not to use non-ASCII renderings of Roman numerals ( non-admin closure) ( t · c) buidhe 21:03, 10 February 2021 (UTC) reply

RFC summary

Should markup for Roman numerals be restricted by the Manual of Style to Basic Latin (ASCII) letters only (like "VII") and exclude characters in the U+21XX range (like "Ⅶ")? -- 19:45, 26 January 2021 (UTC)

This RFC proposes adding the following to the end of Wikipedia:Manual of Style/Mathematics#Special symbols:

For Roman numerals, Basic Latin (ASCII) letters should be used instead of the equivalent Unicode characters in the U+21XX range. For example, L and VI, not Ⅼ, and not precomposed characters like Ⅵ. (The only exception is when discussing the Unicode characters themselves.)

Related style guidelines:

It is disputed whether the general preference for non-ASCII characters at Wikipedia:Manual of Style/Mathematics#Special symbols currently applies to Roman numerals. This proposal would make it clear it does not apply. If this proposal fails, we could decide to affirm that the non-ASCII encoding is preferred (which would imply changing millions of instances), or that either encoding is acceptable, referencing MOS:STYLERET to limit the circumstances in which any given instance could be changed from one encoding to the other.
MOS:ORDINAL and MOS:SMALLCAPS currently write Roman numerals in ASCII characters. (Whether this implies other encodings are not allowed is disputed.)

-- Beland ( talk) 19:45, 26 January 2021 (UTC) reply

Pre-RFC arguments summary

The following arguments in favor are mostly summarized from the above subsections and were written by Beland with suggestions from other editors.

Using a web browser (we tested Firefox and Chrome searching for "VIII" on [37]) to search an article for e.g. "III" will not turn up instances of "Ⅲ", and vice versa. The vast majority of readers won't know why, won't be able to work around the problem, and may not even notice that they are missing anything.
Some screenreaders pronounce non-ASCII characters like "Ⅵ" essentially unintelligibly as "letter two one seven five" but pronounce the ASCII sequence usefully like "vee eye". This thwarts the goals of WP:ACCESS. In some cases, the non-ASCII characters might not render for visual readers either, depending on what fonts the user has installed in their web browser, terminal, notepad, or whatever other programs they copy the text into.
The Unicode standard says not to use the characters we would be prohibiting. (English Wikipedia doesn't use vertical text, and there's no other applicable technical advantage to the non-ASCII characters.) We should follow the standard's recommendation to maximize interoperation with standards-compliant web browsers, word processors, natural language processing systems, training corpuses, etc. Quoting from Unicode 7.0.0, Chapter 22, p. 754:
Roman Numerals. For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters. However, the uppercase and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded in the Number Forms block (U+2150..U+218F) for compatibility with East Asian standards. Unlike sequences of Latin letters, these symbols remain upright in vertical layout.
These non-ASCII characters are much more difficult to type and edit.
Search engines do not give the same results for the ASCII vs. non-ASCII Roman numerals. (We tested "billy iii" vs. "billy Ⅲ" on Google and Duck Duck Go.) Arguably this is a bug, but being inconsistent with what users actually type could cause some articles not to show up correctly in search results.
The precomposed characters only go up to 12 ("Ⅻ"). This means that to write 13 and higher, we'd need to use either multiple characters in the U+21XX range, or revert to ASCII. It also means that there are three ways to write a number like uppercase 12: "Ⅻ" (single character), "ⅩⅠⅠ" (multiple character, non-ASCII), and "XII" (ASCII), making an even bigger mess in terms of in-article search and search engine compatibility, and possibly inconsistent visual appearance for low vs. high range numbers.
A sampling of 36 major reliable sources in the "External practices" section above finds (as of this writing) they all use ASCII Roman numerals. We don't know of any reputable style guides that address the issue (other than what the Unicode Consortium says).
Whether the non-ASCII characters render with serifs, and whether they look better, worse, or exactly the same as the ASCII characters is somewhat unpredictable, given that it depends on what fonts the reader has installed and on personal aesthetic preferences. Typically they render similarly enough that most readers won't notice the difference, making other considerations more important.
ASCII Roman numerals are already Wikipedia's de facto preference, and it would take an enormous amount of work to convert all instances to the non-ASCII version (if we wanted to only allow one variation and didn't choose ASCII). For example, in the 2020-11-01 database dump, there were 1,412,537 instances of "III" but only 288 instances of "Ⅲ" (with perhaps a hundred more systematically removed before this discussion began).
Responding to the argument that non-ASCII characters are more "machine readable" and carry more information about their numerical value: NLP systems might perform well if Roman numerals were non-ASCII nearly 100% of the time, but appearing rarely or some of the time probably results in worse performance than not appearing at all. With a mix of encodings, machine learning systems would need to learn more cases but have fewer examples for each. NLP systems currently must already handle the ASCII characters which are currently about 99.98% of instances. We've already seen rule-based systems (like Beland's spell checker) that correctly handle the ASCII versions (e.g. in the common case of regnal names like "Queen Elizabeth II") but get confused by non-ASCII Roman numerals.
ASCII encoding keeps markup simple, which is a goal of MOS:MARKUP.
There is precedent in the MOS preference for ASCII quotation marks and apostrophes ( MOS:STRAIGHT) and against precomposed fractions ( MOS:FRAC).
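
The find-in-page and "three ways to write 12" points above can be reproduced outside a browser. The following is a minimal sketch using Python's standard `unicodedata` module; it assumes nothing about how any particular browser or search engine matches text, and the sample sentence and helper names (`naive_find`, `folded_find`) are made up for illustration.

```python
import unicodedata

ascii_three = "III"            # three Basic Latin letters
precomposed_three = "\u2162"   # Ⅲ ROMAN NUMERAL THREE, a single code point

# Hypothetical sample text: the first numeral is the precomposed Ⅷ (U+2167).
text = "Henry \u2167 succeeded Henry VII."

def naive_find(haystack, needle):
    """Plain code-point matching: no normalization is applied."""
    return needle in haystack

def folded_find(haystack, needle):
    """Match after NFKC compatibility normalization of both strings."""
    fold = lambda s: unicodedata.normalize("NFKC", s)
    return fold(needle) in fold(haystack)

print(naive_find(text, "VIII"))   # False: "Ⅷ" is not the letters V, I, I, I
print(folded_find(text, "VIII"))  # True: NFKC folds "Ⅷ" to "VIII"

# The three encodings of uppercase 12 named above collapse to one string
# only after normalization.
twelve_variants = ["\u216B", "\u2169\u2160\u2160", "XII"]
normalized = {unicodedata.normalize("NFKC", v) for v in twelve_variants}
print(normalized)                 # {'XII'}
```

A tool that matches raw code points behaves like `naive_find`; whether a given browser or search engine applies anything like the folding step is exactly what the tests described above were probing.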

The following arguments against were written by Beland (who does not endorse them) as a summary of points made by Struthious Bandersnatch (who has not commented on this phrasing).

Most of the external sources cited in #External practices (e.g. news sources) are not written in an encyclopedic register, and so Wikipedia:Common-style fallacy argues they should be ignored for purposes of determining Wikipedia's MOS.
More generally, Wikipedia should always ignore the style guides and practices of other publications and only follow its own style guide.
Any problems with web browsers, search engines, text-to-speech engines, natural language processing, and other programs are deficiencies in those programs, which Wikipedia should not attempt to fix by changing its content.
The Unicode standard doesn't say why it is usually preferable to use ASCII letters for Roman numerals, and doesn't say what other exceptions there are to that advice.
The non-ASCII characters are unambiguous and machine readable as numeric, unlike letters. For example, "I" can be a pronoun, and "VI" can mean "Virgin Islands". Converting to letters destroys this unambiguous information.
The non-ASCII characters look better.
Typing difficulties, find-in-page, copy-paste, and text-to-speech problems also affect all the other mathematical symbols that have similar-looking ASCII characters, like plus and minus, as well as characters in <math>...</math> markup. (And many other non-mathematical symbols in common use on Wikipedia.) We can't make a rule for Roman numerals only; we would have to change the "rule of thumb" to favor ASCII characters for all math symbols. Find-in-page problems for <math>...</math> markup could be fixed more generally with MathML improvements.
We should apply MOS:STYLERET and allow either style as acceptable.
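
For readers weighing the "machine readable as numeric" argument above, this is a minimal sketch of what that information looks like in practice, using Python's standard `unicodedata` module. It only shows that the Unicode Character Database assigns a numeric value to the Number Forms characters and none to the Latin letters; it takes no position on whether that matters for Wikipedia. The `numeric_value` helper is a name invented for this example.

```python
import unicodedata

def numeric_value(ch):
    """Return the Unicode numeric value of one character, or None if it has none."""
    return unicodedata.numeric(ch, None)

six_precomposed = numeric_value("\u2165")  # Ⅵ ROMAN NUMERAL SIX
letter_v = numeric_value("V")              # Basic Latin capital V

print(six_precomposed)                 # 6.0
print(letter_v)                        # None
print(unicodedata.category("\u2165"))  # Nl (letter number)
print(unicodedata.category("V"))       # Lu (uppercase letter)
```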

RFC discussion

Support as the proposer, for the in-favor reasons summarized above. -- Beland ( talk) 19:45, 26 January 2021 (UTC) reply
Support per extensive previous discussion. — David Eppstein ( talk) 19:47, 26 January 2021 (UTC) reply
Support per above discussion, and per the following: Wikipedia is written by humans to be read by humans, not computers. So, the arguments based on semantics (hard-coded distinction between Roman numerals and the corresponding Latin letters) are totally irrelevant. Moreover, editors and readers of Wikipedia are not supposed to be experts in typography. So anything that makes things clearer for software at the cost of being confusing for humans must be avoided. For humans, Roman numerals are sequences of some Latin letters (and have been introduced historically as such). So, changing this can only be confusing. D.Lazard ( talk) 21:01, 26 January 2021 (UTC) reply
Support per the arguments in favor, with emphasis on points 1, 2, and 5. Wikipedia is an encyclopedia, and should be as accessible as possible to as many people as possible. Searchability and accessibility are both key. warmly, ezlev. talk 22:01, 26 January 2021 (UTC) reply
Support per the stated arguments above and the previous discussion. This is a clear win for accessibility and editability. Retswerb ( talk) 07:17, 31 January 2021 (UTC) reply
Support. Any of points 1–3 (searching, screen readers, Unicode standard recommendation) on its own would be enough to convince me. Together they make a powerful case for using ASCII instead of those other Unicode characters. The opposing arguments are weak, and I note that at least on my browser the ASCII characters look identical or virtually identical to the U+21XX characters. (I use the default skin, like the vast majority of Wikipedia readers.) — Granger ( talk · contribs) 19:29, 2 February 2021 (UTC) reply

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Accessibility of precomposed fraction characters

Wikipedia:Manual of Style/Mathematics#Fractions says that precomposed fractions like ½ cause accessibility problems. However, in the discussion at Wikipedia:Categories for discussion/Log/2021 March 3#Category:10¼ in gauge railways in England, Graham87, who uses a screenreader, says these characters do not cause problems. Is anyone aware of any specific accessibility problems caused by these characters, or should that claim be removed? I do know search engines don't always handle them well, and though that may impede access, that's not what we generally mean when we say "accessibility". -- Beland ( talk) 19:13, 10 September 2021 (UTC) reply

As I tried to imply at the CFD, the precomposed fraction characters that cause accessibility problems are those not in ISO/IEC 8859-1 (i.e. anything besides ¼, ½, and ¾). Graham 87 03:23, 11 September 2021 (UTC) reply
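The ISO/IEC 8859-1 distinction can be checked mechanically. The following is a minimal Python sketch, assuming Python's `latin-1` codec as a stand-in for ISO/IEC 8859-1. The helper name `in_latin1` and the sample list of fraction characters are illustrative choices, not taken from the discussion. Encodability only identifies which characters belong to that set; it does not by itself show how any particular screen reader behaves.

```python
import unicodedata

def in_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in ISO/IEC 8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# A sample of precomposed fraction characters (illustrative, not exhaustive).
fractions = "¼½¾⅓⅔⅛⅜⅝⅞"

for ch in fractions:
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch), in_latin1(ch))

latin1_fractions = [ch for ch in fractions if in_latin1(ch)]
```

Only the three characters named above end up in `latin1_fractions`.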

I don't expect most editors to realize that there's a distinction between these three and the other precomposed fraction characters (I certainly didn't), so rather than trying to clarify that distinction in MOS it seems easiest just to say that there are accessibility issues (as it did before Beland tagged it) without going into detail about which characters have those issues. — David Eppstein ( talk) 05:00, 11 September 2021 (UTC) reply

Well, consensus for railroad categories was to keep ½ etc. because those characters had no accessibility issues, and that also means there's no accessibility objection to keeping Ranma ½, which was favored on Talk:Ranma ½. If we don't note which characters are OK and which are not, editors may think the rationale for that choice was factually incorrect, and may argue over whether all or none of the characters are problematic, when neither is the case. -- Beland ( talk) 06:36, 11 September 2021 (UTC) reply

I tucked in a footnote to clarify without bloating the main text. -- Beland ( talk) 06:48, 11 September 2021 (UTC) reply

Zeroth

The word "zeroth" appears in over 500 articles. It is potentially unfamiliar to users outside the Anglophone countries. Some English speakers might need to pause or re-read the word to infer the intended pronunciation and hence the meaning when it is written as "zeroth".

Should the word be written as "zero'th", "zero-th" or "0th"; should the "th" be in superscript; or should a link to the Wiktionary page for 'zeroth' be added to clarify what is meant?

Sesquivalent ( talk) 19:01, 26 September 2021 (UTC) reply

ah, I see MOS:SUPERSCRIPT says to not use superscript for ordinals, which probably blocks that option. Sesquivalent ( talk) 19:11, 26 September 2021 (UTC) reply

Linking would be WP:OVERLINK. "Zeroth" is in my Merriam-Webster Unabridged, so I don't see a need to do anything. Someone using an encyclopedia can probably be trusted to look up an unfamiliar word. ;)

As regards spelling, except for "0th", I've never seen any other spelling in use. "Zero-th" never, we don't write "four-th", either. Same goes for "zero'th". And "0th" should be avoided, that's jargon. Paradoctor ( talk) 08:34, 4 October 2021 (UTC) reply

Blackboard bold for numbers

Tracked in Phabricator
Task T279805

Doing some cleanup work, I just discovered that LaTeX-based double-stroke blackboard bold doesn't work for numbers when using "mathbb". There is a workaround using "text" but it leaves a lot of space after the number. Conversion to regular bold is a possibility, but not where the notation itself is being explained. What's the preferred solution for a. discussing the notation itself, and b. when using the notation?

Markup examples (rendered form first, then the wikitext that produces it):

$...\mathbb {N} ...$ <math>...\mathbb{N}...</math>
$...\mathbb {1} ...$ <math>...\mathbb{1}...</math>
$...{\text{𝟙}}...$ <math>...\text{𝟙}...</math>
...𝟙... ...𝟙...
$...𝟙...$ {{math|...𝟙...}}
...1... ...'''1'''...

Articles currently affected:

-- Beland ( talk) 19:53, 10 January 2022 (UTC) reply

Related previous discussion: Wikipedia talk:WikiProject Mathematics/Archive/2021/Apr#Typesetting \mathbb{1} within Wikipedia articles. I think it's best to avoid using the blackboard bold 1 notation both because of the technical issue and because, at least in most contexts, other notations are more standard. When including it to mention the notation itself, what Heaviside step function currently does seems fine – basically, using the Unicode 𝟙 (and in that case, using Template:math consistent with the rest of the article). Adumbrativus ( talk) 01:11, 11 January 2022 (UTC) reply

@ Adumbrativus: Aha, good find. Pinging Salix alba since we seem to have discovered a situation (talking about the notation itself) where this is actually used. I'm assuming then for Kan extension we use 1 ({{math|'''1'''}}) and Quantum operation we use that and ${\mathbf {1}}$ (<math>\bold{1}</math>)? -- Beland ( talk) 01:29, 11 January 2022 (UTC) reply

Well, David Eppstein I don't know how to spell these out in words, so I just did the above for those articles and bare Unicode for Heaviside step function. Feel free to modify as appropriate. -- Beland ( talk) 06:52, 28 January 2022 (UTC) reply

I think maybe we just need to write this off as one of those standard TeX features like \strut that would be very useful if Wikipedia handled but that the Wikimedia developers will never add because getting mathematics to work has negative priority for them. As for what to do in parts of articles describing notation that looks like this: spell it out in words maybe instead of providing an image of the notation? That's what I ended up doing for a different notation, the half-box notation at the end of Factorial#History, when I couldn't find a way to get an adequate version of the notation itself into the article. — David Eppstein ( talk) 02:39, 11 January 2022 (UTC) reply

It seems to be an upstream bug with MathJax [38], there is a workaround with \unicode[STIXGeneral]{x1D7D9} etc. It might be possible to add non-standard macros for these if it's really required. -- Salix alba ( talk): 19:11, 11 January 2022 (UTC) reply
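For anyone wanting to check the argument used in that workaround, x1D7D9 is the hexadecimal code point of the double-struck digit one. A minimal Python sketch, using only the standard `unicodedata` module, confirms the character name and shows that the ten double-struck digits occupy consecutive code points starting at U+1D7D8. The variable names are illustrative.

```python
import unicodedata

one = chr(0x1D7D9)
print(unicodedata.name(one))               # official Unicode character name
print(unicodedata.normalize("NFKC", one))  # compatibility-folds to an ASCII digit

# The double-struck digits 0-9 occupy U+1D7D8 through U+1D7E1.
double_struck = [chr(0x1D7D8 + d) for d in range(10)]
folded = [unicodedata.normalize("NFKC", ch) for ch in double_struck]
```

The same pattern would give x1D7D8 for a double-struck zero, x1D7DA for a double-struck two, and so on.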

Explicitly forbid spaces before punctuation?

MOS:MATH#PUNC currently says: "Similarly, if the conventional punctuation rules would require a question mark, comma, semicolon, or other punctuation at that place, the formula must have that punctuation at the end." We also have MOS:PUNCTSPACE: "In normal text, never put a space before a comma, semicolon, colon, period/full stop, question mark, or exclamation mark". However, it might be unclear whether mathematical formulas are "normal text", :–) and unfortunately some people insert spaces (\, and even ~) before such punctuation marks. I think it would be useful to add to MOS:MATH#PUNC a short phrase against this practice (with a link to MOS:PUNCTSPACE). Any objections or better ideas? — Mikhail Ryazanov ( talk) 19:55, 30 October 2021 (UTC) reply

I disagree. Having punctuation right up against a math formula has potential to be confusing, because sometimes it could be interpreted as having mathematical rather than natural-language meaning. -- Trovatore ( talk) 20:12, 30 October 2021 (UTC) reply

For example? (The only case I could remember seeing is an exclamation mark, which can be confused with a factorial, but the use of exclamation marks in an encyclopedic text is not a good idea by itself, and correcting this by weird punctuation instead of rewording isn't great either. Another case, "Gee what a whale of a lot of calories" from "Mathmanship", is also inherently evil and in fact doesn't apply to WP, since our footnotes are always bracketed and hyperlinked.) — Mikhail Ryazanov ( talk) 21:53, 30 October 2021 (UTC) reply

Don't most math journals punctuate sentences in the usual English way, whether or not they end with mathematical notation? I'm not aware of a counterexample. And I can't recall seeing extra space inserted either. Mgnbar ( talk) 21:16, 30 October 2021 (UTC) reply

There's an example of this in the current version of 89 (number) in which an explicit space separates an ellipsis from a period in a displayed equation:

{\frac {1}{89}}=\sum _{n=1}^{\infty }{F(n)\times 10^{-(n+1)}}=0.011235955\dots \ .

This usage looks ok to me; is it to be forbidden now? — David Eppstein ( talk) 22:24, 30 October 2021 (UTC) reply

According to Ellipsis, all style guides agree that no extra space between ellipsis and following punctuation is needed. Do you also prefer "$\dots \ ,$" instead of "$\dots ,$"? — Mikhail Ryazanov ( talk) 22:45, 30 October 2021 (UTC) reply

With the comma it is unambiguous. With the ellipsis there is no difference between the dots and their spacing within the ellipsis and the period, so it is difficult to parse as an ellipsis followed by a period, would produce exactly the same ambiguous visual effect as a period followed by an ellipsis, and really just looks like an odd four-dot ellipsis. MOS:ELLIPSIS also says that terminal periods in textual uses of ellipsis are rarely important and can be omitted. But it cannot be omitted here, because the ellipsis does not have its usual textual meaning of terminating a quote before the end of a sentence. So I am not convinced that style guides for textual uses of ellipsis have put any thought into mathematical formatting or are relevant for mathematical formatting. — David Eppstein ( talk) 22:53, 30 October 2021 (UTC) reply

A period followed by an ellipsis without a space is not a valid sequence, so "...." is not ambiguous. However, if you can find a (mathematical) style guide recommending to insert a space in this situation (or at least a well-formatted publication using this convention), then this specific case perhaps can be explained as an exception. Nevertheless, I don't see any excuses to insert spaces before punctuation marks in other cases. — Mikhail Ryazanov ( talk) 23:35, 30 October 2021 (UTC) reply

To me it's not a question of whether the reader can disambiguate. I think it's a problem that they need to. It's just not "logically clean" to have punctuation intended for the natural-language discourse to be in a position where it could look like it's operating at a different level.

In a way this is similar to "logical quotation", which we have adopted uniformly in Wikipedia, notwithstanding it goes against a lot of (especially American) style guides. The quotes push the surrounding text onto the stack, and the commas and periods that belong to that level of the stack should stay with it.

The least intrusive way to do this is to simply omit terminal punctuation from displayed formulas. This is a style frequently used in slides. -- Trovatore ( talk) 17:37, 31 October 2021 (UTC) reply

It is logically clean for all sentences to end in punctuation, regardless of whether the final word in the sentence is spelled in mathematical notation (e.g. "three" vs. "3"). The convention that you seem to be proposing leads to other ambiguities, where sentence breaks are hard to detect, especially when the following sentence begins with a proper noun. I think this is why most math textbooks, journals, etc. (in my experience) do in fact punctuate consistently throughout. Mgnbar ( talk) 18:22, 31 October 2021 (UTC) reply

Well, the cleanest thing would be to put the punctuation at the start of the next line, making a clear break from the displayed formula. Unfortunately that is not a widely followed convention. Omitting terminal punctuation in displayed items (whether or not mathematical) actually is a reasonably attested convention, and quite a sensible and useful one IMO, but probably hasn't made it into a lot of style guides yet. -- Trovatore ( talk) 19:45, 31 October 2021 (UTC) reply

Treating mathematical expressions as normal text is absolutely "logical" and consistent, and this is what MOS:MATH#PUNC currently says to do. My only suggestion was to clarify that another consistent rule, MOS:PUNCTSPACE, also applies here. You apparently don't like this idea but still haven't provided any examples where this consistent treatment can be potentially confusing or misinterpreted. — Mikhail Ryazanov ( talk) 22:21, 31 October 2021 (UTC) reply

The fact that you choose not to believe the examples of confusing or misinterpretable punctuation that have been provided does not mean that no such examples have been provided. — David Eppstein ( talk) 00:15, 1 November 2021 (UTC) reply

It would be much more constructive to reply to what I wrote to you instead of incorrectly accusing me here... — Mikhail Ryazanov ( talk) 00:35, 1 November 2021 (UTC) reply

No, mathematical expressions are not normal text, and should not be treated as such. -- Trovatore ( talk) 01:35, 1 November 2021 (UTC) reply

^{citation needed} — Mikhail Ryazanov ( talk) 01:47, 1 November 2021 (UTC) reply

This should be "sky is blue" territory. Symbols and punctuation are used completely differently in English (including mathematical English) from in formal mathematical expressions. When a formal expression is used together with natural language, it creates a new scope, and it needs to be made clear to the reader when the scope ends. It's true that the reader can often use pragmatics to figure it out, but ideally, they shouldn't have to.

The most important function of a period in natural language is to end a scope, specifically the scope of a sentence. (The second most important function is to alert the reader that they don't have to look somewhere else, like the next page, for continuation). In a displayed environment, the end of the display ends the scope quite effectively.

On the other hand, periods have other meanings in formal expressions (and ordinarily do not end a scope). For instance, "1" might be used to mean the natural number 1, whereas "1." is the corresponding real number; this isn't a common distinction but it's not entirely contrived either (it would work in Fortran or Python, for example). -- Trovatore ( talk) 05:09, 1 November 2021 (UTC) reply

This looks like an original approach, but please provide references to reputable sources that share this opinion. Because, to the best of my knowledge, it's at least not mainstream. For example, here is what "Mathematics into Type" from the American Mathematical Society says (pp. 30–31):

Mathematics is written in sentences. Often the subject or the verb of the sentence is a mathematical symbol rather than a word. Copyediting, therefore, requires the ability to determine which part of speech is represented by the various symbols. In §3.2.1 there is a listing of mathematical symbols according to their grammatical function.
EXAMPLES:
$A=B+C.$
The example is a complete sentence with $A$ , $B$ , and $C$ acting as nouns, $+$ as a conjunction, and $=$ as the verb. This is, of course, a relatively simple example but the same principles apply to the more complicated situations.
Authors of mathematics almost invariably write in sentences but sometimes do not punctuate correctly. Although it is not universal practice to punctuate various sections of a display, it often adds to the clarity of the writing. For the most part in AMS publications, mathematical equations are punctuated, with the occasional exception of diagrams, matrices, and determinants. For example, when several separate equations are displayed, it is AMS practice to separate them by inserting a comma or other appropriate punctuation at the end of each line of the display.
When the mathematics in a paragraph is abundant, punctuation needs to be considered with more care than usual. A common mistake, for instance, is for an author to neglect to punctuate an equation that comes at the end of the typed line in a manuscript, even when the next line begins with a separate equation.
...
Specific suggestions are made in the sections below concerning spelling and punctuation. To help a copy editor maintain consistency in punctuation, several guidelines based on AMS practice are proposed; another publisher might well use different criteria. Rules of grammar are not cited because their use in writing mathematical research is no different from their use in other types of writing.
In general, the copy editor should make the manuscript correct if the grammar or punctuation is definitely wrong. In cases where there is more than one correct method, the copy editor sometimes must make a choice to maintain consistency.

(Check also p. 33 about inline expressions and pp. 37–41 about spacing.)

And "The Chicago Manual of Style Online":

12.5: Words versus mathematical symbols in text
In general, mathematical symbols may be used in text in lieu of words, and such statements as “ $x\geq 0$ ” should not be rewritten as “ $x$ is greater than or equal to zero.” Nonetheless, symbols should not be used as a shorthand for words if the result is awkward or ungrammatical. In the phrase
the vectors $r_{1},\dots ,r_{n},\neq 0$ ,
the condition “ $\neq 0$ ” is better expressed in words:
the nonzero vectors $r_{1},\dots ,r_{n}$
or
the vectors $r_{1},\dots ,r_{n}$ , all nonzero,
depending on the emphasis desired. Moreover, logical symbols should generally not appear in text:
$\exists$ a minimum value of the function $f$ on the interval $[a,b]$
should be replaced by
there exists a minimum value of the function $f$ on the interval $[a,b]$
or
the function $f$ has a minimum value on the interval $[a,b]$ .

12.18: Mathematical expressions and punctuation
Mathematical expressions, whether run in with the text or displayed on a separate line, are grammatically part of the text in which they appear. Thus, expressions must be edited not only for correct presentation of the mathematical characters but also for correct grammar in the sentence. For example, if several expressions appear in a single display, they should be separated by commas or semicolons. For example,
$x_{1}+x_{2}+x_{3}=3,$
$x_{1}x_{2}+x_{2}x_{3}+x_{3}x_{1}=6,$
$x_{1}x_{2}x_{3}=-1.$
Consecutive lines of a single multiline expression, however, should not be punctuated:
${\begin{aligned}(|a+b|)^{2}=(a+b)^{2}&=a^{2}+2ab+b^{2}\\&\leq a^{2}+2|a||b|+b^{2}\\&=|a|^{2}+2|a||b|+|b|^{2}\\&=(|a|+|b|)^{2}.\end{aligned}}$
Expressions must carry ending punctuation if they end a sentence. All ending punctuation and the commas and semicolons separating expressions should be aligned horizontally on the baseline, even when preceded by constructs such as subscripts, superscripts, or fractions.

Regarding that "'1.' is the corresponding real number" – this is also a questionable statement. Basically, the period in real numbers is a "decimal separator" and "separates the integer part from the fractional part of a number". That is, both parts must be present in order to separate them. Thus, for example, MOS:DECIMAL says that generally "numbers between −1 and +1 require a leading zero (0.02, not .02)", even though some have a habit of doing the opposite. Programming languages can use very strange notation, and we are not talking about them here at all (in any case, inline program code must be enclosed in <code> tags, which provide unambiguous rendering). — Mikhail Ryazanov ( talk) 20:07, 3 November 2021 (UTC) reply

The usage in the sample equation above looks OK to me. I'd probably do something like that if I were including an equation like that in a journal article. The ellipsis is part of the expression of the number, so it should be separated from the terminal punctuation just like a digit would be. XOR'easter ( talk) 01:59, 1 November 2021 (UTC) reply

I don't feel strongly about the example that David Eppstein posted above. The space is not necessary, but it doesn't hurt anything, and there is a long history of people inserting extra space into typeset mathematics to improve clarity. So I don't feel the need to forbid it in this style guide.

I do feel strongly about the side proposal that Trovatore has mentioned, for getting rid of punctuation altogether in some cases. Yes, I have seen some textbooks use that convention. But most mathematical writing does not, and for good reason. Mgnbar ( talk) 02:23, 1 November 2021 (UTC) reply

Sorry, to whom are you replying? Let's, please, adhere to the usual formatting convention to keep this discussion readable. — Mikhail Ryazanov ( talk) 02:41, 1 November 2021 (UTC) reply

What? Digits are not separated from the terminal punctuation. Neither in regular text nor in mathematical notation. — Mikhail Ryazanov ( talk) 02:42, 1 November 2021 (UTC) reply

Mikhail Ryazanov, I was responding to XOReaster, as my indentation indicated. But really I was giving a summary of my position, and how it relates to that of David Eppstein and Trovatore, which I named explicitly. Mgnbar ( talk) 03:05, 1 November 2021 (UTC) reply

Mikhail Ryazanov, I too did not understand XOReaster's remark about digits separated from punctuation. Mgnbar ( talk) 03:05, 1 November 2021 (UTC) reply

I assumed that what was meant was writing sentences like "Let $x=0$." as "Let $x=0\,$.", adding a small space before the period to make clear that it is a period and not a decimal point. I would not generally do that, myself. Although I frequently use syntax like "0." in Python programs to mean the floating point zero (as distinct from the integer zero) I think avoidance of confusion in Wikipedia writing means that we should instead use "0.0". Also like Mgnbar I feel quite a bit more strongly that sentences should end in periods even when they end in a displayed math formula, than I care about separating the formula from the period with a space. — David Eppstein ( talk) 06:37, 1 November 2021 (UTC) reply

I was referring to this example, not inline text:

{\frac {1}{89}}=\sum _{n=1}^{\infty }{F(n)\times 10^{-(n+1)}}=0.011235955\dots \ .

The number is "0.011235955...", with the "..." an essential part of its meaning. So, it makes sense to separate that ellipsis from the final period; the visual arrangement reflects the mathematical meaning. XOR'easter ( talk) 16:22, 1 November 2021 (UTC) reply

I don't understand this argument. What about the utterance "$48+5=53.$"? The "3" is essential to the meaning, and yet it abuts the period. What about the utterance "The enemy attacked my army."? The "y" is essential to the meaning, but it abuts the period. Mgnbar ( talk) 00:21, 2 November 2021 (UTC) reply

In display style, rather than inline math, I'd typeset that with a thin space as

48+5=53\,.

If the formula is inline, I believe standard TeX practice is to have the sentence-ending punctuation outside of the math delimiters (e.g., $48 + 5 = 53$.). The MOS's current position on that is something horribly complicated that the MOS itself doesn't even follow. XOR'easter ( talk) 11:00, 2 November 2021 (UTC) reply

$48 + 5 = 53$. is indeed how it's written in LaTeX, and this markup produces no extra space. I don't remember seeing any professionally typeset publication with extra spaces in display formulas either, so I don't know where you got the idea that it should be there. — Mikhail Ryazanov ( talk) 20:07, 3 November 2021 (UTC) reply
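For side-by-side comparison, the three forms being argued over can be put into one small standalone LaTeX file. This is only a sketch for comparing the rendered output: it assumes the plain article class and takes no position on which form the MOS should prefer.

```latex
\documentclass{article}
\begin{document}
% Inline: sentence-ending period placed outside the math delimiters.
Inline, the sum is $48 + 5 = 53$.

% Display: period set directly against the last digit.
Displayed without extra space, the sum is
\[ 48 + 5 = 53. \]

% Display: thin space (\,) inserted before the period.
Displayed with a thin space, the sum is
\[ 48 + 5 = 53\,. \]
\end{document}
```

Compiling this shows the difference between the last two forms directly, which may be easier to judge than the descriptions in prose.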

Some examples of mandatory extra space for typesetting math notation, followed by a crotchety rant

In statistics, a dot (period) is used as a shorthand notation for the sum over that whole subscript; e.g. $\;x.\,\equiv \,\sum _{\ell }x_{\ell }~,$ so when a variable is discussed in the text, it must be separated from any sentence period by a non-breaking space, larger than ordinary word spacing, to distinguish it from a summation on its index. So $\;x.\;$ is a single scalar number, the sum of the whole contents of the data vector, and when referring to the whole data vector at the end of a sentence standard math typography looks something like $\;x~.$ Otherwise confusion ensues.

In tensor calculus, a comma denotes a partial derivative, with respect to the variable that follows it, e.g. $\;v_{i},_{j}\equiv {\frac {\,\partial \,v_{i}\,}{\partial \,x_{j}}}~.$ In the same notation, a semicolon is also used – $\;U;_{j}\;$ – but for the moment, I've forgotten what the distinction is. (It might be a total derivative vs. a partial derivative.) In tensor notation, the variable subscript can come either before or after the comma or semicolon, depending on its meaning. Grammatical commas must be widely separated from variable names (even scalars) to distinguish them from derivative notation.

The grammatical punctuation en-dash (–) is visually identical to a minus sign (−) (the Unicode minus-sign, U+2212, is non-breaking, so they are typographically distinct even though they are supposed to look the same). Anyone can be confused by mixed dash and minus sign notation – $x$ as a variable named at the start of a dash-separated clause looks pretty much like a widely spaced instance of $-x$. And of course, it is good style to space minus signs wider than multiplication, when mixed with multiplication, to reflect the order of operations. For example $ax-b$ is just nasty, but $\;-\,b+a\;\!x\;=\;a\;\!x\,-\,b\;$ is okay, and wider spacing of + and − is default in LaTeX, so spaced-away minus signs are possible as a matter of course, and a spaced away leading minus sign can easily be accidentally created. (Leading minus signs typically should concatenate onto the following variable or bracket, with only a 'hair' space.)

And you can figure out on your own all the trouble that a prime mark $\;x'\;$ can get the reader into, when the notation has $\;x'\;$ and $\;x''\;,$ and a $'$ quoted phrase ends with $x'$ $'$ .

Even single letter variables can easily be confused: The worst in English is the variable $a$, which also happens to be a copiously used word. In Spanish math texts, Euler's constant $e$ is a similar stinker. The use of italics is not always adequate to distinguish variables from words, particularly in the loathsome sans-serif fonts (which have their uses, but mostly obfuscate math by scrambling identical-looking characters, like l, I, and | ).

The above five are examples of broadly used notations that come immediately to mind. There are others. Similar examples regularly come up in newly contrived notations, due to the limitations of symbols available in the publisher's font set. Conventional punctuation signs are occasionally all someone believes they can use. This particularly comes in notational carry-overs from the bad old days of ASCII and the slightly less awful "ANSI" 8 bit character sets. (An example of this would be the use of ":" for the parallel sum operator, once notated as $\;a\operatorname {:} b\;$ but in modern notation as $\;a\operatorname {\parallel } b~.$)

If you want to approach this issue with philosophy, then consider that mathematical notation is indeed a substitute for spoken-language words, and should always be read so, both in your head and out-loud. It does require appropriate grammatical punctuation. But there's a complication, since mathematical notation is not written out in words: It is written in an international shorthand, with brief symbols replacing almost all of the words, where every little mark you can possibly imagine means something. Dots, and dashes, and commas, and semicolons, and colons are all incorporated into the shorthand. Mostly the marks are not spoken language punctuation, but instead are part of the independent and distinct grammar of mathematical notation.

Modern mathematical notation is an artificial international language that from its very beginning has never been and is not now English. (Actually, most of the early creators of the notation wrote and spoke Latin, hence symbols based on Latin language names, like $\;\operatorname {sgn} \;,$ signum; $\;\sin \;,$ sinus; $\;{\sqrt {~^{\,^{\,}}}}\;,$ radix). Wikipedia policy rules for language punctuation can only be read as applying to the particular language they were thought out for. It is outright dumb to apply policy created for English text to math shorthand. It's a different language than English, with different grammar and punctuation that (usually) has different meaning (often each has several meanings).

A mathematician that speaks only Russian can write out a sequence of expressions in mathematical notation, sans Russian, which (if correct) mathematicians competent in that branch who speak no Russian can none the less understand and read out, each in their own language. (Although granted, having intermixed written text is extremely helpful, and longed for when absent.)

Sometimes the punctuation symbols used in math notation legitimately come at the beginning or the end of a mathematical expression, adjacent to words or spoken language punctuation. This actually ought to be even more common in Wikipedia articles, where generally obscure symbols like $\;\forall \;\exists \;$ etc. are deprecated by policy and equivalent common words in the article's language are preferred instead. Inevitably this will result in even more alternation between math notation and spoken language text.

When a punctuation mark used in both is caught between spoken language and shorthand notation, the reader has to easily see which side the mark belongs on: with the language or with the notation. For that reason, math notation and its notational marks must be cleanly separated from the grammatical punctuation that belongs with the article's written / spoken language, even though that punctuation may have been put in to show how the math notation should be expressed or interpreted, as spoken language, after the translation from shorthand. The convention is to do that with blank space, since extra space is already used as a kind of subtle punctuation, to clarify the shorthand. Extra blank space in math notation is mostly equivalent to the various punctuation marks for spoken language that approximately indicate very long (...), long (–), moderate (;), or short (,) pauses in speech.

As a rule of thumb, the extra space between notation and spoken language text should be slightly but distinctly more than a word space.

The default rendering of minimal LaTeX is absolutely not an authoritative guide: The typesetting syntax is deliberately designed to always insert minimal spacing. It is the writer's job to insert the needed extra space into the LaTeX code in order to separate terms and distinguish otherwise insufficiently separated factors, e.g. $\;\operatorname {cas} x\;$ (cosine-add-sine operator on $x$) vs. $\;casx\;$ (four variables, default spacing) vs. $\;c\;\!a\;\!s\;\!x\;$ (slightly expanded spacing).

As evidence of the TeX design, note the many ways to insert a little or a lot more space, like \ \, \; ~ \quad \qquad but only one way to subtract space: \! and then only a tiny amount. TeX was designed to assign aesthetic responsibility to the human typesetter. It's up to you to express yourself clearly with your notation; the math renderer will help, but only to a minimal degree.
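The spacing commands listed above can be compared in a small standalone LaTeX file. This sketch assumes the article class with the amsmath package loaded. The widths in the middle column are the nominal LaTeX defaults (18 mu to one em of the math font; the medium and thick spaces can also stretch), and the table layout is purely illustrative. With amsmath, `\negmedspace` and `\negthickspace` are also available as further negative spaces; whether Wikipedia's math renderer accepts them is not something this sketch assumes.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \operatorname, \negmedspace, \negthickspace
\begin{document}
\begin{tabular}{lll}
Command & Nominal width & Sample \\
\hline
\verb|\negthickspace| & $-5/18$ em       & $a\negthickspace b$ \\
\verb|\negmedspace|   & $-4/18$ em       & $a\negmedspace b$ \\
\verb|\!|             & $-3/18$ em       & $a\!b$ \\
(none)                & $0$              & $ab$ \\
\verb|\,|             & $3/18$ em        & $a\,b$ \\
\verb|\:|             & $4/18$ em        & $a\:b$ \\
\verb|\;|             & $5/18$ em        & $a\;b$ \\
\verb|\ |             & interword space  & $a\ b$ \\
\verb|~|              & unbreakable interword space & $a~b$ \\
\verb|\quad|          & $1$ em           & $a\quad b$ \\
\verb|\qquad|         & $2$ em           & $a\qquad b$ \\
\end{tabular}

% The operator-versus-variables contrast from the comment above:
Operator name: $\operatorname{cas} x$; four variables: $casx$;
four variables with thin spaces: $c\,a\,s\,x$.
\end{document}
```

Compiling this shows how small the differences between the thin, medium and thick spaces are at text size.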

So, in short, I say that there must be more-than-word-space separating all mathematical notation from any grammatical text or punctuation. It's standard practice in professionally typeset books. I in no way concur with, agree with, or find any reason to tolerate the contrary opinions declared above. So there.

astro-Tom-ical ( talk) 14:32, 4 March 2022 (UTC) reply

Inserted my understanding of actual practice for mixed HTML and LaTeX in articles

I replaced the text:

Formulas formatted without using TeX should use the same syntax throughout the article to maintain the same appearance.

with the text:

Formulas formatted without using TeX should use the same syntax throughout the article (or main section with no equivalent symbols or synonymous variables shared with other sections) to maintain the same appearance.

I may be mistaken, but as I understand it, the issue is to not have different symbols for the same thing (even using a different font) in the same article. If a whole section consistently uses unique variables (unique by both symbol and intended meaning) then there should be no objection.

My possibly mistaken understanding is that a symbol may not appear in a different font from one section to the next (e.g. $\;\mathbb {R} \;$ in one section, but R in another); likewise disallowed is a change of notation for the same or nearly the same object between two sections. So, for example, a spacecraft's velocity written $\;=\mathbb {V} \;$ in one place, but the same spacecraft's same velocity written $\;={\vec {u}}\;$ elsewhere in the same article, would be disallowed.

Possibly there would be a reasonable exception for literally quoting a line quod stet from a cited text that uses different notation, if it is embedded in a |quote= item in a <ref>, or a clearly delineated quote in a footnote, just as long as the formula is expressed in the article's notation where it is used in the article's own text.

Astro-Tom-ical ( talk) 11:46, 4 March 2022 (UTC) reply

I strongly disagree with this change. Formatting of mathematics should appear consistent over entire articles, not vary from one section to the next. — David Eppstein ( talk) 19:14, 12 March 2022 (UTC) reply

Notational conventions for spaces

I know that specific algebraic structures should be written upright (with operatorname) while unspecified algebraic structures should be written in italics, e.g. as in $\operatorname {GL} _{n}(K)$ ; my question is whether the same applies to other structures/mathematical objects, e.g. topological spaces/manifolds: should the n-sphere be denoted by $S^{n}$ or $\operatorname {S} ^{n}$ ? Joel Brennan ( talk) 18:30, 21 March 2022 (UTC) reply

I do not think that the rule you have described is, in fact, a rule: I could easily dig up dozens of papers using $GL(V)$ or similar. -- JBL ( talk) 18:19, 27 March 2022 (UTC) reply

In my experience, the n-sphere is usually $\mathbb {S} ^{n}$. But, as JBL says, not everyone uses the same notations. My guess (without researching the policy) is that each article should strive to be self-consistent, but consistency across all articles is too much to ask. Mgnbar ( talk) 21:00, 27 March 2022 (UTC) reply

I think it depends on what structure of the space you're using. As a topological space, and maybe also as a subset of a Euclidean space, I think the sphere is usually $S^{n}$, but as a space with a uniform Riemannian geometry, Euclidean and hyperbolic spaces are often $\mathbb {E} ^{n}$ and $\mathbb {H} ^{n}$ and so I would interpret $\mathbb {S} ^{n}$ as meaning the Riemannian geometry on a spherical space. — David Eppstein ( talk) 23:33, 27 March 2022 (UTC) reply

To refute your claims, I examined my bookshelf, only to find that it supported your claims. Most authors on my shelf use Sⁿ, although Bredon uses Sⁿ. (I think that I saw mostly $\mathbb {S} ^{n}$ in lectures, but now I'm straying far from reliable source territory.) Mgnbar ( talk) 20:56, 31 March 2022 (UTC) reply
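
To make the variants in this thread easier to compare, here is a minimal sketch that typesets them together. It assumes the amsmath and amssymb packages and uses only notations quoted by the participants above; it is an illustration, not a recommendation of any one form.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \operatorname
\usepackage{amssymb} % provides \mathbb
\begin{document}

% General linear group: upright operator name vs. plain italic letters.
$\operatorname{GL}_{n}(K)$ \quad $GL(V)$

% The n-sphere: italic, upright operator name, and blackboard bold.
$S^{n}$ \quad $\operatorname{S}^{n}$ \quad $\mathbb{S}^{n}$

% Blackboard bold for Euclidean and hyperbolic spaces, as in the reply above.
$\mathbb{E}^{n}$ \quad $\mathbb{H}^{n}$

\end{document}
```

Whichever form an article adopts, the point made in the replies still applies: use it consistently within that article.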

† Fonts with open source, Debian-compatible Linux licensing, so there's no excuse for it not to similarly look better in any OS—that's on OS vendors. And it furthermore actually means that Wikipedia could probably use embedded web fonts to improve the display in all browsers, on all platforms, but I'm not advocating for that. Yet.
