Hello! :)
As you know, lately I've been trying to write a script that makes Smallem autoupdate. To make that process easier, it would be good to have some anchors in its module results to allow for easier regexing. For example, in the results generated here: {{#invoke:Smallem|lang_lister|list=y}}, it would be good to have the list of language codes start and end with some specific strings. Given your experience, what would be the best approach here, balancing a good graphical solution with strings that are easy to match with a regex?
The same could be said about the list of regexes generated. That list, as you know, comes with an added problem: sometimes it generates this warning: skipped: got: 𐌲𐌿𐍄𐌹𐍃𐌺; from: got.wiki, which theoretically could be accompanied by other similar warnings in the future. Those warnings pollute the list of commands, and whatever anchors we choose should also take care of leaving the warnings out. - Klein Muçi ( talk) 22:49, 4 March 2021 (UTC)
begin_code_list
... end_code_list
and begin_regex_list
... end_regex_list
Maybe camel case in English: BeginCodeList
, etc.? Even though that would make it a bit harder to read, it's not like we'd spend too much time reading it anyway. Albanian strings were the first that came to mind, but we have everything in English there and I wanted to preserve that. -
Klein Muçi (talk) 00:24, 5 March 2021 (UTC)
Here's what I get as data at the moment:
CodeListBegin</p><div about="#mwt1">aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p about="#mwt1">CodeListEnd
This is coming from me getting the HTML file of this page and grepping between CodeListBegin and CodeListEnd. We want only the codes. I can, of course, not include the endpoints, but the problem remains with those markup fragments in between. Can they be "fixed" somehow? Or are they a necessary part of the module? Sadly, that kind of defeats the purpose of having those strings as delimiters. - Klein Muçi ( talk) 13:27, 5 March 2021 (UTC)
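If it helps, the stray tags between the delimiters don't have to block the extraction: you can grab everything between the anchors first and strip the markup afterwards. A rough Python sketch (the sample HTML is a shortened version of the fragment quoted above):

```python
import re

# Pull the code list out of the page HTML by anchoring on the
# CodeListBegin/CodeListEnd markers, then strip any HTML tags
# that MediaWiki inserts between them.
html = ('CodeListBegin</p><div about="#mwt1">aa, ab, ace, zu'
        '</div><p about="#mwt1">CodeListEnd')

m = re.search(r'CodeListBegin(.*?)CodeListEnd', html, re.DOTALL)
inner = m.group(1)
inner = re.sub(r'<[^>]+>', '', inner)           # drop the stray tags
codes = [c.strip() for c in inner.split(',')]   # split on commas
print(codes)  # ['aa', 'ab', 'ace', 'zu']
```

This keeps the delimiters useful even when MediaWiki wraps the list in markup.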
CodeListBegin</p><div>aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p>CodeListEnd
<div class="div-col columns column-width" style="column-width:30em">
<ul><li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),</li>
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),</li>
...
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),</li></ul>
</div>
<div>...</div>
tags so that it would be somewhat similar to the regex list which has always been wrapped with <div>...</div>
tags and which (apparently) Smallem has been able to read and use.
A |plain=yes parameter would instruct the module to render the lists without the 'prettifying' markup and whitespace that make the lists human-readable. The keywords would change to include a colon, just because humans will also read the lists in their machine-readable forms:
CodeListBegin:
... :CodeListEnd
RegexListBegin:
... :RegexListEnd
CodeListBegin:aa,ab,ace,ady,...zh-tw,zh-yue,zu:CodeListEnd
RegexListBegin:(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),...(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),:RegexListEnd
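For parsing, here is a hedged Python sketch of pulling the regex list out of the delimited span while dropping any "skipped: ..." warnings; it assumes each entry and each warning sits on its own line, which may not match the exact plain-mode output:

```python
import re

# Hypothetical sample: delimited span with a warning line mixed in.
text = ('RegexListBegin:(r"a", r"b"),\n'
        'skipped: got: X; from: got.wiki\n'
        '(r"c", r"d"),:RegexListEnd')

# Extract between the anchors, then keep only lines that look
# like regex tuples, filtering out "skipped: ..." warnings.
m = re.search(r'RegexListBegin:(.*?):RegexListEnd', text, re.DOTALL)
lines = m.group(1).splitlines()
regexes = [ln for ln in lines if ln.startswith('(r"')]
print(regexes)  # ['(r"a", r"b"),', '(r"c", r"d"),']
```

Filtering by what a valid entry looks like, rather than trying to enumerate every possible warning, should also survive new warning types.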
I was able to utilize the regexes and create a loop to eventually get the whole list. The loop works as intended but I'm running into a strange problem. I keep getting edit conflicts every once in a while even though no one is editing that page other than me with these requests. You need to send the latest revision ID of the page you're trying to edit with every request, and I've written a function to get and send it automatically. Sometimes it glitches, or it doesn't update fast enough, or... I don't really know, but it will appear as an edit conflict, basically with itself (just because it's trying to edit an older version of the page). This would happen even when sending requests manually (before creating the loop), but back then it wouldn't cause any harm because I could just re-send the request and it would get fixed. The problem now is that I send 5 codes per request, and if an edit conflict happens, those 5 codes are lost while the loop goes on with the next 5, which means that even 1 edit conflict corrupts the whole end result. I tried making Smallem sleep after every request (starting with 5 seconds per request) and more edits got through before running into conflicts, so I kept incrementing the sleeping time. Eventually I reached a point where incrementing it wouldn't help anymore (10 minutes per request) and the conflicts were rare but still persisted in what appears to be a random manner. You could get 5 requests to go through and then 1-2 conflicts in a row, for example. At this point, I don't really know what's causing them. Maybe some problem with Wikipedia's caches or job queues... Do you have any experience with this kind of problem? I've been reading the bot help pages here and the ones on MediaWiki about API requests but... - Klein Muçi ( talk) 10:18, 7 March 2021 (UTC)
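One common mitigation is to re-fetch the page's base timestamp right before each edit (the MediaWiki edit API accepts it as basetimestamp and reports conflicts with the error code editconflict) and to retry the same batch on a conflict instead of dropping it. A minimal Python sketch of just the retry part; do_edit is a hypothetical callable standing in for the actual API request:

```python
# Retry a batch on "editconflict" instead of losing it; any other
# API error is surfaced immediately. do_edit(batch) is assumed to
# return the decoded JSON response of a MediaWiki action=edit call.
def retry_on_conflict(do_edit, batch, tries=3):
    for attempt in range(tries):
        resp = do_edit(batch)
        if resp.get("error", {}).get("code") != "editconflict":
            return resp
    raise RuntimeError("batch still conflicting after %d tries" % tries)

# The edit request itself would carry a freshly fetched timestamp, e.g.:
# data = {"action": "edit", "title": title, "text": text,
#         "basetimestamp": latest_rev_timestamp, "token": token}
```

With this shape, a conflict costs one extra round trip instead of five lost codes.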
Thank you! You were right. It works even with 100 entries (150 entries reach the MediaWiki limit). Unfortunately, the problem persists. You need 4 iterations to finish the list and I can do only 1 or 2 iterations before it runs into an edit conflict and 100 entries are lost. :P I'll experiment more to try and solve it with good ol' trial and error. Thank you for your persistent help! :)) - Klein Muçi ( talk) 15:19, 7 March 2021 (UTC)
Yes, thank you! Now 1 technical question: Smallem autoupdates itself by first copying the codes from here, then using them in batches on another page and getting the results back (and removing the duplicates at the end). Now assume that the code list is changed by MediaWiki. Would the change be reflected in the HTML source of the page I mentioned above automatically after a while? Or would the page need to be edited somehow before reflecting the change? Because if it's the latter, I should teach Smallem to do a dummy edit before starting the process. (A dummy edit would solve that problem, no?) - Klein Muçi ( talk) 23:43, 11 March 2021 (UTC)
|list=
takes a number and uses that number to create lines of codes that are that long?|list=
is not a number, or the number is greater than the number of codes, it defaults to lists of 100 codes. When the value assigned to |list=
is a number less than or equal to the number of language codes, it creates lists of that many codes. Seems pretty good. - Klein Muçi ( talk) 15:22, 12 March 2021 (UTC)
No lines? XD I'm mostly kidding now. Thank you! :) Sometimes I look back and think that I'd probably still be adding regex lines by hand if you hadn't created that. I don't believe I'd be even close to finishing half of them by now. :P - Klein Muçi ( talk) 14:04, 13 March 2021 (UTC)
|list
parameter is empty or if you don't include it at all. You must give it some value for the list to be generated, and the value does nothing because the format is already fixed. Can it be made to behave like the parameter |plain
? -
Klein Muçi (talk) 16:13, 14 March 2021 (UTC)
|lang=
is missing or empty. When |list=
has a value,
sq:Moduli:Smallem renders the language list; when |list=
is empty or missing, Smallem renders the regex list according to the codes in |lang=
. I've tweaked it so that Smallem still emits an error message, but isn't so strident.|list=
to some value. At
sq:Përdoruesi:Trappist the monk/Livadhi personal there are two {{#invoke:Smallem|lang_lister|...}}
invokes. They both call the same function lang_lister()
. In the first one, |list=
is not set so lang_lister()
expects to find a list of language codes in |lang=
. That parameter is missing so lang_lister()
cannot proceed except to emit an error message. In the second invoke, |list=
is missing but |lang=
has a list of language code so lang_lister()
renders the regexes. |list=
is the parameter that decides what smallem will do. If you want 100-item lists from smallem, you must set |list=100
so that it knows that that is what you want. When you don't tell it to list codes, smallem will assume that you want to list regexes.
But we only added |list= to be able to affect the formatting of the generated codes some more. Shouldn't it work without it again and just default to some specified default state? Or, to ask more practically, what am I supposed to write in |list=
when generating the list of codes with plain=yes? -
Klein Muçi (talk) 20:13, 14 March 2021 (UTC)
|list=
. See
this diff. The old form was |list=yes
and it always shortstopped lang_lister()
. The new form uses a number but if you give it something that is not a number (like |list=yes
) it defaults to 100-item lists.|plain=yes
selects the machine readable output forms. When |plain=yes
and |list=<anything>
you get the machine readable language-codes list. When |plain=yes
and |list=
(empty or omitted), you get the machine readable regexes list.|lang=
code list before handing the list to the part of lang_lister()
that makes the regexes.
sq:Përdoruesi:Trappist the monk/Livadhi personal
So, my autoupdate script works fine overall, more or less. It still has some problems accessing the Wikimedia API, but I can't do much about those now because I can't find anything to help me in the documentation. It takes a lot of trial and error to reverse-engineer it, and hopefully with time I'll be able to solve those too. (I've been talking with someone about the API problems, but it seems like even that conversation is coming to an end without solving all of them. Do take a look if you have time.) Meanwhile, I have 1 problem I wanted to ask you about; maybe you can orient me a bit in the right direction. I don't get consistent results. Most of the time I get 41063 lines, but the number can vary a lot (40603, 40778, 40842, 41413, 40735, etc.). This is the list of regexes + 9 lines of commands. Now, the variation and lack of consistency are part of the API problems. (Sometimes I also get errors.) But I'd like to know more specifically what causes the variation between lines. Do you have any ideas how I might understand more about what's going on behind the curtains here? What changes between runs, what gets left behind or ignored? The exact number should be 42597 + 9 command lines. Are there some common debugging tactics I can use? Don't worry if you can't help much though. As you can see above, I haven't been able to get proper help even on MediaWiki so... :P :) - Klein Muçi ( talk) 12:15, 17 March 2021 (UTC)
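One debugging tactic for the varying totals: instead of comparing line counts, diff two runs as multisets so the exact differing lines surface. A small Python sketch (lists stand in for file contents here):

```python
from collections import Counter

# Compare two runs line by line, respecting duplicates: a line that
# appears twice in one run and once in the other shows up as a diff.
def diff_runs(lines_a, lines_b):
    ca, cb = Counter(lines_a), Counter(lines_b)
    only_a = list((ca - cb).elements())  # lines over-represented in run A
    only_b = list((cb - ca).elements())  # lines over-represented in run B
    return only_a, only_b

a = ["x", "y", "y", "z"]
b = ["x", "y", "z", "w"]
print(diff_runs(a, b))  # (['y'], ['w'])
```

Running this on the saved output of two consecutive runs would show whether the missing lines cluster around particular language codes.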
Regarding "manually generated list had some entries with empty characters": what do you mean by that? What does an
"[entry] with empty characters" look like? What should it look like?
In the aa regex list I found: Fe'Fe', Lamnso', Mi'kmaq, Mka'a, Nda'Nda', N’Ko (U+2019 – single comma quotation mark), and O'odham. Didn't find N'go. As for "What does an [entry] with empty characters look like?": by "empty" I meant invisible characters.
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2")
→ (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mikmaq(\s*[\|\}])", r"\1mic\2")
|language=Mi'kmaq
→ |language=mic
aa
regex generation):
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2"), ... <twelve regexes> ... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mka'a(\s*[\|\}])", r"\1bqz\2") ...
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mia(\s*[\|\}])", r"\1bqz\2") ...
aa
, get the results manually, concatenate the lists and keep only the unique lines, which I'll show here. -
Klein Muçi (talk) 00:38, 18 March 2021 (UTC)
So, both methods showed the same number of lines: script 858, manual 857 (the script adds 1 empty line at the end of the results, which is removed in the last concatenation). I then concatenated both lists, sorted them lexicographically, and kept only the lines that weren't duplicated (if a line had a symmetrical friend, they both got deleted), which were the lines above. I was wrong about the U+ code. It only shows the carriage-return error, which here, strangely enough, gets converted to that red dot (check the source code to see what I mean). So, as you can see, the lines above were the only ones that weren't fully symmetrical, and that's about it as far as the change goes. There are no totally new entries in one list that don't appear at all in the other. I believe the red dot is coming from the script list, judging by the N’Ko/Nko case, in which the manual entry gets positioned in second place. - Klein Muçi ( talk) 01:03, 18 March 2021 (UTC)
{{#language:es-formal|aa}}
→ Spanish (formal address){{#language:hu-formal|aa}}
→ Hungarian (formal address){{#language:nl-informal|aa}}
→ Dutch (informal address)aa
doesn't change anything
{{#language:es-formal|es}}
→ Spanish (formal address){{#language:es-formal}}
→ español (formal)nqo
or losing the U+1B36 Balinese vowel sign ulu in ban-bali
.{{#language:be-tarask}}
→ беларуская (тарашкевіца){{#language:be-x-old}}
→ беларуская (тарашкевіца){{#language:zh-cn}}
→ 中文(中国大陆){{#language:zh-tw}}
→ 中文(臺灣)
As expected, this is what I got now after redoing the same steps mentioned yesterday:
These are the comparisons of the first 10 codes. I thought there would be only some apostrophes missing but the final results surprised me. The Red Dot is back. I'll continue with the other comparisons and hopefully only post the unexpected results, if there are any. The length of the lists continues to be the same in both methods. - Klein Muçi ( talk) 11:46, 19 March 2021 (UTC)
he
:
{{#language:he|am}}
→ ዕብራይስጥ(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ইগ্<200c>বো(\s*[\|\}])", r"\1ig\2)),
Even though I'm still not finding any changes between lists, I found that some lines appear like this (this was only one of many) when I was on the codes be-tarask, be-x-old, bg, bh, bi, bjn, bm, bn, bo, bpy
. Usually I get that symbol (<200c>) whenever there's a "strange" character. I used to get the same thing with what we talked about above, until you removed those. What do you think is happening there? -
Klein Muçi (talk) 03:07, 20 March 2021 (UTC)
cho, chr, chy, ckb, co, cr, crh, cs, csb, cu
ig
in Bengali, the language name contains গ্ and বো (both of which are combinations of 2 and 3 codepoints). Without U+200C ZWNJ you get গ্বো but with it you get গ্বো. Your second example (gd
– from Central Kurdish ckb
) has two U+200C ZWNJ (the flipflop between right-to-left Arabic script and left-to-right Latin script is confusing). U+200C ZWNJ is not a codepoint that should be removed or replaced.<200c>
text with the U+200C ZWNJ codepoint in the regexes, the best that we can do is have
sq:Moduli:Smallem skip language names that contain U+200C ZWNJ. I have tweaked Moduli:Smallem to skip. See
sq:Përdoruesi:Trappist the monk/Livadhi personal which uses the same ten language codes that you listed above. Before the skip code was added, Moduli:Smallem returned 5250 regexes; after, 5245 (Bengali (bn
) and Bishnupriya (bpy
) both use Bengali script so for codes ig
, kpe
, hmn
, and lv
they produce the same regex.
gor, got, gsw, gu, gv, ha, hak, haw, he, hi
:<number-something>
part, but instead get some squares or just plain "space" (which is not space, per se, but more like invisible characters), and I've noticed that in those cases the bot works fine. I mean, I've done only 1 test some time ago. I must disclose, though, that in those cases, when you copy the line into a text editor, the squares and the spaces are converted to what the string really is, while the <number-something>
cases stay like that even when copied somewhere else, so maybe there is a difference. Again, sorry for the lack of terminology. -
Klein Muçi (talk) 23:04, 20 March 2021 (UTC)
kg, ki, kj, kk, kl, km, kn, ko, koi, kr
-
Klein Muçi (talk) 23:41, 20 March 2021 (UTC)
Regarding "plain 'space' (which is not space, per se, but more like, invisible characters)": we need to know about these characters because they might prevent a regex match if they are in the MediaWiki list but not in
|language=
as a human would write the parameter value.
Regarding "when you copy the line into a text editor, the squares and the spaces are converted to what the string [really is]": when I copy regexes that have U+200F rtl mark from Wikipedia's Moduli:Smallem rendering to Notepad++, the U+200F rtl mark is not converted to '<200f>'.
mr
, si
, and tcy
language sets. U+200D ZWJ is required to join certain codepoints together into a single character, so we can do nothing with them. There are five U+200B ZWSP, all from the Khmer language set. These are typographic indicators that can be used to indicate where a line break is permitted; I do not know if they are required or if humans writing these language names would include them.
Thank you for all the information! I have a clarification to make though: regexes that have U+xxxx in them DO NOT get converted when you change the working space. The regexes that get converted are those with squares or "plain space". That's what I was saying, and why I believe the U+xxxx cases behave differently from the square/question mark AND plain space/invisible character cases. At this point we're suggesting the same thing, apart from the plain space/invisible character cases, which I believe behave the same as the square/question mark cases; in other words, the tool (Git Bash in my case) doesn't have a font to display them properly. But I don't know why some cases get squares/question marks and some get plain spaces/invisible characters. There are A LOT of both of these cases and, as may be expected by now, they all happen with non-Latin characters. I'll try to paste some of them here from the K-codes mentioned above, which I was already working on, to give some examples. - Klein Muçi ( talk) 01:18, 21 March 2021 (UTC)
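One way to take the terminal font out of the picture is to list the invisible codepoints directly. A small Python sketch using the standard unicodedata module (Unicode category Cf covers ZWNJ, ZWJ, ZWSP, and the rtl mark); the sample string is the Bengali example from above:

```python
import unicodedata

# List every "format" codepoint (Unicode category Cf) in a string;
# these are exactly the invisible characters that render as squares,
# question marks, or nothing at all depending on the font.
def format_chars(s):
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s
            if unicodedata.category(c) == "Cf"]

print(format_chars("ইগ্\u200cবো"))  # [('U+200C', 'ZERO WIDTH NON-JOINER')]
```

Running this over each regex line would identify the offending codepoints regardless of whether Git Bash shows them as squares or as invisible space.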
For me, all the Korean characters above (it is Korean, no?) are shown as empty squares in Git Bash. Just 1 example.
For me, the empty squares in the regex above (squares with question marks, if you see the source code) are shown as just space/void. These cases are rarer than the empty-squares cases and, after searching through more than 30 codes, I was able to find only this one case again. But there are codes in which they appear more often.
So to make it clearer, there are 3 cases where the strings aren't rendered as they should be:
xal, xh, xmf, yi, yo, yue, za, zea, zh, zh-classical
but instead, accidentally, I put xal, xh, xmf, yi, yo, yue, za, zea, zh, zh-
or maybe even xal, xh, xmf, yi, yo, yue, za, zea, zh, zh
. The way I think it could work is to basically understand which codes are possible and, if it finds alphabetical characters that are not creating a code or are creating the same code twice, render an error accordingly. -
Klein Muçi (talk) 02:19, 21 March 2021 (UTC)
|lang=
that end with a hyphen. See
sq:Përdoruesi:Trappist the monk/Livadhi personal.
krc, ks, ksh, ku, kv, kw, ky, la, lad, lb. 4 examples:
xal, xh, xmf, yi, yo, yue, za, zea, zh, z
? Or is that not possible? -
Klein Muçi (talk) 15:37, 21 March 2021 (UTC)
"So where do the <200c> etc. text strings come from?" If that text string is created by you actually editing the regex, then we should not bother to skip those language names. If those strings are created by a machine, then, yes, language names with U+200B ZWSP and U+200D ZWJ should also be skipped.
{{#language:sq}}
→ shqip – 'to' language tag not specified so returns autonym{{#language:sq|sq}}
→ shqip – 'to' language tag specified as Albanian so returns Albanian language name in Albanian{{#language:sq|sq-}}
→ Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English{{#language:sq|sq-L}}
→ Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English{{#language:sq|s}}
→ shqip – malformed 'to' language tag is ignored as expected so returns Albanian autonym|lang=
must be found in MediaWiki's English list of language codes.
Hmm, I thought I had made that clear already... Those kinds of text strings, together with the other oddities I mentioned above (those 3 cases I explained), are all regexes that appear in my bash (shell). I use Git Bash (I used to use cmd.exe) to SSH to Toolforge, where I have an account. I've written a script (part of a bigger script aiming to deal with the whole update of Smallem's source code - the list of regexes, basically) that gets the results from this page using cURL and saves them in a file. When I see these results, the ones I've sent here appear exactly as I've sent them, with the <200c> etc. text strings.
Interesting results, the ones you showed. I'm curious whether other languages may manifest problems like these as well. I don't know who exactly "looks after" these codes. Are they all handled by MediaWiki developers not directly related to the languages they deal with? Or are there global volunteers helping with this somehow? I believe we've already talked about this in the past. Either way, I expect small wikis' languages to suffer more from inaccuracies like this. - Klein Muçi ( talk) 18:53, 21 March 2021 (UTC)
|list=<number>
and |plain=yes
really an error? |list=
has to have some sort of an assigned value else
sq:Moduli:Smallem attempts to make a regex list from the values assigned to |lang=
.lang_lister()
twice; once against the whole list of possible language codes (output suppressed – no error or skipped messages unless something is so wrong that it can't continue) and the second time against the list of language codes in |lang=
. I can imagine Moduli:Smallem overrunning its allocated time if we try to do that.
Hey man! I'm sorry that I kind of disappeared, especially when I was supposed to do basically the last step in testing, the reason we've been making all the adjustments above. I made some changes to my internet at home and that required some days to take effect. I made Smallem do a full autoupdate run. The total I get is 40635 lines. From that, remove the 17 lines which are other regexes and command lines, and you get 40618 regex lines. I believe the number is still off from the right total, no? Even after making the module skip some "problematic" entries. I'm just asking now because I haven't checked the new total since we made the changes. - Klein Muçi ( talk) 10:23, 30 March 2021 (UTC)
zu
) last code in the list of codes. Perhaps the results of the last code aren't being saved?zu
. Most of the language names begin with 'isi-' so I spotchecked several of them; all that I checked were zu
. I have looked at all of the language names at the top and bottom of the list and all of them are zu
except 'Patois' (jam
):
{{#language:jam|zu}}
→ Jamaican Creole English
Second list of discrepancies. Much to my amazement, the change between iterations wasn't that big. There were the same missing results except: (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Patois(\s*[\|\}])", r"\1jam\2"),
which in the second iteration wasn't missing. 405 results in total in the second run. I'll try it again because I believe the close numbers were just a coincidence. I'll post the results here using the same method when I have them. Looking forward to 5 or 10 tests. -
Klein Muçi (talk) 00:56, 31 March 2021 (UTC)
zu
code temporarily? I can then do 2-3 tests and see what happens. If we get no discrepancies in any of those tests, we will have successfully isolated the problem. My belief, though, is that the test I'm running now won't stay in that realm. :/ -
Klein Muçi (talk) 01:22, 31 March 2021 (UTC)
zu
from Module:Smallem. But this is really strange and inspiring at the same time. -
Klein Muçi (talk) 02:40, 31 March 2021 (UTC)
zu
, perhaps it would be better to invert the language code sort so that zu
is first and aa
is last. If the result is the same then that confirms your something-wrong-with-zu-handling hypothesis. If the result shows that all of the lost regexes are from aa
then that suggests that somewhere the last group of regexes isn't being correctly handled.
Test 1 with inversion complete. Things are starting to go into the unknown again. I got 41269 lines in total. Remove 17 "other" lines and you get 41252 lines of regexes. I'll soon post here the exact missing lines. - Klein Muçi ( talk) 01:20, 1 April 2021 (UTC)
aa
has only one language name written in Afar, its autonym. All other language codes fall back to some other language (likely English):
{{#language:aa|aa}}
→ Qafár af{{#language:de|aa}}
→ German{{#language:en|aa}}
→ English{{#language:fr|aa}}
→ French{{#language:sq|aa}}
→ Albanian
zu
and nothing wrong with aa
. Your aa
results tend to lend credence to the notion that the last group of regexes isn't being included in the final list of regexes (it's the automated list that is coming up short, right?)
Yes, that was my initial belief too, ever since you mentioned that all the problems arise with zu
, which was the last code. The reason for that lies in the way the script works. In general lines, it gets the list of codes from a wiki page and then transforms that concatenated one-line list into a one-code-per-line list (using the commas as delimiters). Then it starts a loop that takes 1 line from that list (so 1 code), creates an API request with it, saves the results (the list of regexes generated) in a file, and removes that 1 line (so 1 code) from the overall code list. Then it takes the next line (which is the next code, because the line before it just got deleted), creates an API request with it... and so on until it runs out of codes. Since I'm not good at scripting, I may have designed the loop badly, so that the results from the last API request fail to be saved, whatever the cause. So you arriving at the same conclusion is a good sign, because that would mean I just need to fix the loop and one of my 2 problems would be solved. The thing that confuses me, though, is that the automated list, which is the one coming up short, is not short by just 1 entry compared to the manual list. If we do a simple subtraction, the missing results are around 1000 lines, no? Or am I being misled by the logic I'm using? Either way, for us to be totally sure that this really is the problem we're having, I have to ask: how easy is it to tweak module Smallem to create the environment for this experiment: make it produce only 2 codes. One is en
and the other one a "popular" language that we know doesn't have any fallbacks. Maybe Italian (it
)? Then we can see what happens. If all the regexes from it
go missing, then we're sure that the problem is in the loop and I have to fix it. I don't know how easy or hard that is for you to do, though. If it is hard, maybe I can tweak my script to work with only those 2 codes, but I'm a bit reluctant to make further changes to it before being sure what the current problem is, for fear of introducing even more bugs. -
Klein Muçi (talk) 15:40, 1 April 2021 (UTC)
{{#invoke:Smallem|lang_lister|list=yes |plain=yes}}
to be CodeListBegin:en, it:CodeListEnd
. In that case,
sq:Përdoruesi:Trappist the monk/Livadhi personal.It
has Regex count: 859 though. Do you think fallbacks are to be blamed for the discrepancy? -
Klein Muçi (talk) 20:07, 1 April 2021 (UTC)
it
wouldn't have any fallbacks so the total number of regex lines from that code would also be the number of missing lines from the grand total (given that there would be only 2 codes and only en
would be used). And that would confirm that the problem is the loop not saving the results from the last code. But maybe the experiment above is enough proof nonetheless? -
Klein Muçi (talk) 20:16, 1 April 2021 (UTC)
en
so I guess... -
Klein Muçi (talk) 20:18, 1 April 2021 (UTC)
I was able to fix the loop and now I get 1481 lines every time, which is 1464 + 17, precisely how many lines there should be. I'm excited! Can you revert your hardcoding change to module:Smallem so I can do a full run with all the codes now and see what happens? - Klein Muçi ( talk) 22:53, 1 April 2021 (UTC)
zu
regex lines missing again. :/ I'll retry switching the module to the 2-codes/debug mode to be totally sure it works fine in that mode. :/ -
Klein Muçi (talk) 01:03, 2 April 2021 (UTC)
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Qafár\ af(\s*[\|\}])", r"\1aa\2"),
This one goes missing. Any idea what might be happening? Do you think it's still the same problem that we're seeing? (The results from the last code are being ignored?) These were the codes used: aa, ab, ace, ady, af, ak, als, alt, am, an, ang, ar, arc, ary, arz, as, ast, atj, av, avk
Keep in mind that it worked perfectly fine with aa, ab, ace, ady, af, ak, als, alt, am, an
. -
Klein Muçi (talk) 10:20, 2 April 2021 (UTC)
Meh, redid some more runs with the same codes. Sometimes I get that line missing, sometimes I don't. I'm disappointed that the inconsistencies between runs are back. I know they are related to non-printing characters somehow confusing my script, but I don't know how, because the problems started happening as soon as the module began skipping results for the first time. Is there anything special about that single line? I believe whatever is happening in this case is happening throughout the whole process, and the missing lines accumulate in the end to give the total that is 1k short. There must be another character somewhere (not necessarily in that line) that we also need to skip. I'm going to take a close look at the whole list of results and see if I find anything strange. - Klein Muçi ( talk) 10:40, 2 April 2021 (UTC)
aa
from the group and from the |lang=
parameter. The Qafár af regex is the only unique language name contributed by aa
(the autonym); all other language codes fall back to English, which is contributed by the other languages in the group. Perhaps your 'fix' to the tail end disrupted the head end?
|lang=
gets the same language codes that are listed in the debug lang_codes
(
line 104). What happens if you swap aa
with ab
?aa
from Module:Smallem (that's what you meant, no?) and tried a full run. I got no problems. 4500 results in total: 4483+17 "other" lines. What does this mean? I'm following blindly now. -
Klein Muçi (talk) 21:06, 2 April 2021 (UTC)
lang_codes = {'ab', 'aa', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug
Okay then. I'll do that now. And what's the hypothesis we're testing with that? - Klein Muçi ( talk) 21:35, 2 April 2021 (UTC)
Perhaps your 'fix' to the tail end disrupted the head end?
aa
, perhaps you did...lang_codes = {'aa', 'ab', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug
. I still get 4501 results now. The correct number. Without making any changes to the script. :/ Should we revert the debug mode now and try a full normal run? Even though this doesn't make much sense. We basically didn't change anything, no? :P -
Klein Muçi (talk) 23:22, 2 April 2021 (UTC)
Well... Basically, yes... I mean, my initial idea was to keep adding codes to the "old" debug mode gradually until I started having problems (which happened when I added the second group of 10 codes). Even then, the problems weren't consistent. They happened only half the time. Then you changed module:Smallem to introduce the "new" debug mode, which automatically takes care of the code and regex list. I tried a run with that with aa
missing. Then tried it with aa
in second place (after ab
) and then tried it with aa
in the beginning. All the tests brought the correct, expected results. Without doing any change to the script. Now I don't know what to do. :/ We either switch it to the normal form, or continue adding other codes in this mode and see what happens. The problem with the second option though is that you have changed module:Smallem to add the codes automatically to |lang=
. When more than 10 codes are used simultaneously, the script tends to malfunction in getting the generated regex lines and starts saving empty results. I don't know how it will behave if we add 10 more codes and basically are forced to do 30 codes simultaneously. -
Klein Muçi (
talk) 23:53, 2 April 2021 (UTC)
|lang=
from 20 to 30) and the end results after the script completed its run contained only those 17 "other" results, empty of any regex lines, as expected. :/ If we want to keep experimenting and add codes gradually by 10, we need to turn back to the old debug mode where codes weren't added automatically to |lang=
. I don't know what else to try now, really. Because everything seems to be working fine at the moment. :P In this mode with only those 20 codes, that is. -
Klein Muçi (
talk) 00:32, 3 April 2021 (UTC)|lang=
value to some subset of lang_codes
by writing a replacement for line 104:
args.lang = 'ab, ab, ace';
I have noticed that you have some interest in this article I created. I am on a quest to make as many Good Articles as I can this year. If you think this is a possibility, would you be interested in copy editing the article and nominating it GAN. You would get credit for a Good Article as the nominator and I would get credit as the creator of the article when it gets promoted. I will be glad to solve most of the issues the reviewer brings up.-- Doug Coldwell ( talk) 14:34, 6 April 2021 (UTC)
|last=
.Regarding: I don't know what to think about
developer discouraged
. Outside of the RFC, is there such a thing?
No, I literally just made it up. I'm not exactly wedded to the term, and I just wanted people to understand what the exact status of these parameters was (to replace the old definition of deprecated but also describe something where support won't ever necessarily be removed).
I was exactly thinking it was going to be used for maintenance category names (figured it would be something like Category:CS1 maint: nonhyphenated parameter
), but that's not really for me to weigh in on. –
MJL
‐Talk‐
☖ 18:14, 7 April 2021 (UTC)
How long have you been working, and this is the only way I know how to send a message haha, and thanks for being here, I saw you working on tons of articles before! Ilikememes128 ( talk) 14:23, 13 April 2021 (UTC)
Hello, Trappist,
All of a sudden, this template is showing a red link category but the most recent edit was by you in May 2020 so I'm not sure what caused things to change. Typically when red link categories appear, they are due to a recent edit by a new editor but that's not the case here. I don't like to edit templates, well, except userboxes that are causing problems, so I was hoping you could look this over and see what the problem is. Thank you in advance. Liz Read! Talk! 21:13, 13 April 2021 (UTC)
Hi, Trappist. I hate to bother you about this, but would you mind re-adding this code you added a while back (or a revised version of it) into the current version of the sandbox? I tried to add it back, but the local month names didn't work. Here are two testcases pages. – Srđan ( talk) 21:45, 12 April 2021 (UTC)
Category:CS1 maint: discouraged parameter has been nominated for deletion. A discussion is taking place to decide whether this proposal complies with the categorization guidelines. If you would like to participate in the discussion, you are invited to add your comments at the category's entry on the categories for discussion page. Thank you. Fram ( talk) 17:14, 16 April 2021 (UTC)
@
Trappist the monk:, hello from el.wiktionary. I am so grateful for your lessons for Lua's
a-b-c and
how to make data modules. It helped to make possible many modules for our small wiki,
like this one! My understanding of Lua is limited to "if bla then xxx end" but unfortunately we have no Luaists around anymore, so we have to proceed with whatever we can.
If you ever have time, could you help with an extra question (it is not urgent at all, and things function ok as they are now). Some modules are becoming SO big! I do not know how to extract a large part and place it in an outside module, or at a subpage. They are not data, they have lots of "ifs". For example
Thank you, and excuse my bothering you with such questions! Sarri.greek ( talk) 11:03, 26 April 2021 (UTC)
require ('Module:el-articles').articles (args) -- modifies args{}
is_set()
function so that tests like this:
if args['ακε'] ~= '' and args['ακε'] ~= nil then args['ακε'] = args['ακε'] else args['ακε'] = '' end
if not is_set (args['ακε']) then args['ακε'] = '' end
create_link()
, stem_color()
, etc. are used by multiple modules, you might consider creating a separate utilities module that other modules in this family of modules might use. "because elsewhere, a language link is involved" – What does that mean? There is no language link involved in
require ('Module:el-articles').articles (args)
and there would be no language link involved if you created el:wikt:Module:Utilities. Functions in Module:Utilities would be required into the module that needs them just as el:wikt:Module:tin is required into el:wikt:Module:el-nouns-decl and also into el:wikt:Module:el-articles. The advantage is that these utility functions live in a single place so when changes are necessary, those changes occur in only one place, not in many places.
is_set()
is a function that returns a boolean (true
or false
). Here is the example I used above:
if not is_set (args['ακε']) then args['ακε'] = '' end
args['ακε']
has a non-nil
value because the value assigned to a_klenstr
is a concatenation of args['ακε']
and \n
. In Lua, you can't concatenate a nil
type to a string
type. When |ακε=
is included without an assigned value in the template or module call, args['ακε']
gets an empty string value. When |ακε=
is omitted from the template or module call, args
does not have a key ['ακε']
so args['ακε']
returns nil
.is_set (args['ακε'])
to determine if args['ακε']
has an assigned value that is anything but blank. Blank means empty string or nil
. If args['ακε']
is empty string or nil
, is_set()
returns false
indicating that args['ακε']
is not set. In the snippet, not
inverts the value returned from is_set()
so a false
return becomes true
for the if ... then
test indicating that args['ακε']
is not set so args['ακε'] = ''
ensures that args['ακε']
is not nil
for the concatenation that makes a_klenstr
.tinti
was missing. I hacked a crude regex to search Module:el-article for text that looks like a function call. I think that the regex was [a-z] *\(
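The is_set() behaviour described above can be sketched in Python (a hypothetical analogue for illustration only; the actual helper is the Lua is_set() in the modules):

```python
def is_set(value):
    # Mirrors the Lua helper described above: a parameter is "set" only
    # when it is neither nil (None in this Python sketch) nor an empty string.
    return value is not None and value != ''

args = {}                         # |ακε= omitted: key absent, lookup gives None
if not is_set(args.get('ακε')):
    args['ακε'] = ''              # guarantee a string so concatenation works

result = args['ακε'] + '\n'       # safe: never concatenates None to a string
```

As in the Lua version, the guard converts both "missing" and "assigned but blank" into an empty string before any concatenation.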
.What is the process for making new citation templates? medRxiv preprints are usually put in the cite journal template, which triggers a citation error, or cite web, which seems clumsy. There is Template:Cite bioRxiv, so it may be beneficial for there to also be Template:Cite medRxiv. Velayinosu ( talk) 01:51, 3 May 2021 (UTC)
Hello Trappist the monk, the category is empty. Yipee !!! Lotje ( talk) 12:24, 16 May 2021 (UTC)
Hi! I noticed that you have made many edits to Module:Citation/CS1/Configuration so I hope you have a little time to help me/Danish Wikipedia.
Some time ago the English module was copied to Danish Wikipedia and modified whenever there was a problem. When InternetArchiveBot was introduced a local user made some edits and we started up the bot. Sadly the setup was not 100% correct and the user that used to fix it has not been active for a few months. The last comment was something about bad health so we do not know if the user can ever return. So I started to look at it.
I think it would perhaps be best to copy the modules from enwiki to dawiki and make the conversions from scratch. So I have copied the modules to da:Modul:Citation/CS1/sandkasse etc. (sandkasse = sandbox).
So my question for you is where I should make edits to make it localized?
I know I have to uncomment 3 lines like here: da:Special:Diff/10754534 and to make a lot of edits in da:Modul:Citation/CS1/Configuration/sandkasse (so far I only made a few adjustments related to dates like da:Special:Diff/10754497).
I know da:Modul:Citation/CS1/Date validation/sandkasse should be modified if we want to allow 'yMd'. I have not done this yet because I got errors right from the start: da:Speciel:PermanentLink/10754556. And the problem does not seem to be related to ymd.
Should I set language to 'da' somewhere? As far as I can tell the module should be able to figure out that it is da.wiki.
Do I need to change any of the other sub modules to make it work?
My plan is to document the changes needed on da.wiki so that more than one user knows how to set it up locally. -- MGA73 ( talk) 18:21, 30 April 2021 (UTC)
date_names['local']['long']
and date_names['local']['short']
are inverted for use by reformatter()
. It is not possible to invert ['December'] = 12, ['december'] = 12
to [12] = 'December', [12] = 'december'
and have access to both forms of the month name. If both 'December' and 'december' are needed, some sort of special code will have to be written to support that. If all that you need is the lowercase form, that is all that should be included in the date_names['local']['long']
and date_names['local']['short']
tables.patterns{}
; perhaps something like this:
-- day-initial: day. month year
['d.My'] = {'^([1-9]%d?)%. +(%D-) +((%d%d%d%d?)%a?)$', 'd', 'm', 'a', 'y'},
check_date()
; perhaps something like this:
elseif mw.ustring.match(date_string, patterns['d.My'][1]) then -- day-initial: day. month year
day, month, anchor_year, year = mw.ustring.match(date_string, patterns['d.My'][1]);
month = get_month_number (month);
if 0 == month then return false; end -- return false if month text isn't one of the twelve months
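For readers more familiar with PCRE-style regexes, the Lua pattern above can be translated into a Python regular expression (an illustrative translation, not part of the module):

```python
import re

# Illustrative Python translation of the Lua 'd.My' pattern
# '^([1-9]%d?)%. +(%D-) +((%d%d%d%d?)%a?)$':
# a 1-2 digit day, a dot, a month name (non-digits), and a 3-4 digit
# year with an optional trailing letter (the CS1 anchor-year disambiguator).
D_MY = re.compile(r'^([1-9]\d?)\. +(\D+?) +((\d{3,4})[A-Za-z]?)$')

m = D_MY.match('11. december 2021')
day, month, anchor_year, year = m.groups()
```

The four captures line up with the `day, month, anchor_year, year` assignment in the Lua snippet above.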
'd.My'
then I think it should perhaps be 'd.my'
because we use lowercase months. That makes me wonder if I should change all the codes with the "M" variants to lowercase or if the code can handle capital letters? ['dec.'] = 12
I guess that is a bad idea? It would be better to create a pattern for that like when we use a dot after day? So we should have a 'd.m.y'
? --
MGA73 (
talk) 08:27, 1 May 2021 (UTC)
'd.My'
the uppercase 'M' is intended to mean 'month-as-name' while lowercase 'm' is intended to mean 'month-as-digit' so I think that you should not change these. ['dec.'] = 12
is correct. Those patterns that look for month names are looking for anything that is not a digit ('My'
is an oddity). The patterns don't care if the month name has punctuation. The whole month-name capture is used to index into date_names['local']['short']
.is_valid_month_range_style()
in the en.wiki sandbox; see
Module:Citation/CS1/Date validation/sandbox#L-246. The original code there was quite old (December 2014) and, I think, precedes any real notion of i18n. You might want to replace your is_valid_month_range_style()
with the en.wiki/sandbox version. You can be our testbed.
Hello again! Great, then things are going the right way :-) I changed my sandbox with a copy from your sandbox (see da:Special:Diff/10755535). I tested it in my own sandbox like da:Speciel:PermanentLink/10755537. It looks like it can handle "maj" like it should. But not "Maj" and that is okay because Danish months use lowercase. I guess the reason it accepts "December" is because it also accepts English months. So it works? -- MGA73 ( talk) 17:26, 1 May 2021 (UTC)
is_valid_month_range_style()
because none of the dates are ranges. is_valid_month_range_style()
is used for these (copy them as you see them here and paste them into your sandbox):
*{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–december 2021}} – no error
*{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–december 2021}} – error
*{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–dec. 2021}} – no error
*{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–dec. 2021}} – error
11. december 2021
and 11. December 2021
but the auto-translation of the month names at least causes the module to render 'december' from December
. In your testcases, Maj
cannot be translated because that name is not an English month-name. We might tweak the auto-translation code so that it adds a maintenance category whenever a translation is made. One of your gnomes then can monitor that category and fix the dates in the wikisource.'d.My'
we talked about earlier. They use dot too in
no:,
nn:,
fi: and
fo: (perhaps other wikis too). So perhaps you could add the support in the code so that the wikis that need it can just uncomment it? --
MGA73 (
talk) 08:53, 2 May 2021 (UTC)
cfg.date_name_auto_xlate_enable
and cfg.date_digit_auto_xlate_enable
which you have set in ~/Configuration. I do not see this as "self detect" because a human has to enable/disable the auto-translation – the module knowing that it is on da.wiki is
"self detect".
If date_hyphen_to_dash()
does what you want it to do, then use it. I am not sure what "did not seem to work good with the new code" means to you. There are two hyphen/dash converters: one in ~/Date validation for dates and the other in the main module for
|pages=
, |issue=
, etc.Hello again! We tried again and da:Modul:Citation/CS1/Date_validation/sandkasse#L-1059 and forward seems to show us what we want now :-D
However it also changes some wrong dates to correct dates without showing us errors. Examples can be seen in da:Skabelon:Citation/testcases#Interval_med_datoer/dansk "(Forventet: OK)" means "Expected OK" and "(Forventet: fejl)" means "Expected error".
We use "-" without spaces around here:
We use " – " with spaces around here:
Perhaps you could have a look at the code and tell us if we could do something much smarter? For example I can't help think that the original code define different patterns for dates so we should perhaps be able to figure out how to use that. -- MGA73 ( talk) 13:19, 15 May 2021 (UTC)
{{Citation/sandkasse |...}}
?Hello again! I made
da:Special:Diff/10769630 because I suddenly realized that we have other variants than 'd.My'
that include a dot. We also have 'd.-d.My'
, 'd.M-d.My'
and 'd.My-d.My'
. It eliminated a lot of the problems we had :-) --
MGA73 (
talk) 17:14, 17 May 2021 (UTC)
Hello Trappist the monk I am a student who is new to Wikipedia. One of my courses consists in editing and updating the Kirindy Forest Wikipedia page, I saw that you edited the Kirindy Mitea National Park page and was wondering if you would be able to provide me with some feedback and help me improve it.
Any assistance would be greatly appreciated :) — Preceding unsigned comment added by Marie Salichon ( talk • contribs) 01:47, 22 May 2021 (UTC)
monkbot is bad! 141.91.210.58 ( talk) 06:45, 4 June 2021 (UTC)
Hello Ttm, I was curious to know if you think it would be possible to add the ability to watch a specific talk page thread, as opposed to the entire page itself. Perhaps as a feature, whereby adding threads to your watchlist? Or follow along by some other means, where maybe there's some kind of notification each time a post is added to the thread?
I'm asking you first, to see if pursuing this is even worthwhile. Is it impossible? Would you know if this has come up before? Otherwise, I suppose I would start at Phabricator and create a new feature request, then go from there? Any feedback would be appreciated. Cheers - wolf 02:39, 2 June 2021 (UTC)
You have written on nanvag rajputs from were you get it Brother Kunalpratapsingh35 ( talk) 16:07, 6 June 2021 (UTC)
I did preview. It's my opinion that bare urls do not reflect well. In some cases, the refill tool does a nice job of improving the ref. In some cases it produces a broken citation often because the underlying URL has a problem. It's my opinion that highlighting the problem, so the editors with SME can see them and fix them, is better than leaving them as bare URLs. It appears you think leaving bare URLs even when they are flawed is a better option. We disagree.-- S Philbrick (Talk) 16:24, 27 June 2021 (UTC)
[http://www.damligan.se/player.html?id=1517462 ''Player Bio'' damligan.se] {{webarchive|url=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |date=2010-08-13 }}
{{Cite web|url=http://www.damligan.se/player.html?id=1517462|archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462|deadurl=y|title=''Player Bio'' damligan.se|archivedate=August 13, 2010}}
{{
cite web}}
: Unknown parameter |deadurl=
ignored (|url-status=
suggested) (
help)[https://archive.is/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 ''Umeå player roster'' eurobasket.com]
{{Cite web|url=https://www.eurobasket.com/team.asp|archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1|deadurl=y|title=Udominate Basket Umea basketball - team details, stats, news, roster - EUROBASKET}}
{{
cite web}}
: |archive-url=
requires |archive-date=
(
help); Unknown parameter |deadurl=
ignored (|url-status=
suggested) (
help){{Cite web |url=http://www.damligan.se/player.html?id=1517462 |archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |title=Pamela Rosanio |website=Damligan |language=sv |archive-date=August 13, 2010}}
{{Cite web |url=https://www.eurobasket.com/team.asp |archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 |archive-date=2012-11-17 |title=Udominate Basket Umea basketball team |website=Eurobasket}}
I try - after a conversation with Graham87 - to avoid direct links to foreign Wikipedias. How can we do that in cite templates, such as Karl-Günther von Hase? -- Gerda Arendt ( talk) 20:23, 26 June 2021 (UTC)
conversation with Graham87.
{{
ill}}
is a notice box that has this image:
. That notice box tells editors that {{ill}}
is not to be used within cs1|2 templates. There are a couple of reasons for that. When one template (in this case {{
cite book}}
contains another template ({{ill}}
), the inner template ({{ill}}
) is rendered before the outer template ({{cite book}}
). When the page is processed by MediaWiki, the {{cite book}}
parameter value |editor2={{ill|Rolf Breitenstein|de}}
is processed first so that what {{cite book}}
gets is:
|editor2=[[Rolf Breitenstein]]<span class="noprint" style="font-size:85%; font-style: normal; "> [[[:de:Rolf Breitenstein|de]]]</span>
<span>...</span>
html tags is not the editor's name so does not belong in |editor2=
.{{ill}}
sets the [[[:de:Rolf Breitenstein|de]]]
to a font size of 85% of the current text size. That is ok when {{ill}}
is used in the body of an article because 85% of the normal '100%' article body font size yields the smaller text at 85% of the normal font size. But,
MOS:FONTSIZE instructs us to not set smaller font sizes inside infoboxen, navboxen, and reference sections. {{
refbegin}}
(used in
Karl-Günther von Hase) sets the references to 90% of normal. {{ill}}
then sets [[[:de:Rolf Breitenstein|de]]]
to 85% of 90% or 100 × 0.90 × 0.85 = 76.5% which is smaller than the allowed 85% font size. Here is a simple illustration of how that works (for clarity, you may need to adjust your browser's zoom setting):
<span style="font-size:90%;">[<span style="font-size:85%;">[]</span></span>
|editor2={{ill|Rolf Breitenstein|de}}
to |editor2=Rolf Breitenstein |editor-link2=:de:Rolf Breitenstein
.This conversation having apparently died, I am reverting Editor Gerda Arendt's partial revert of my edits.
— Trappist the monk ( talk) 13:17, 29 June 2021 (UTC)
Hello! :)
As you know, lately I've been trying to write a script capable of making Smallem autoupdate. To make that process easier, it would be good to have some anchors on its module results to allow for easier regexing. For example, in the results generated here: {{#invoke:Smallem|lang_lister|list=y}}, it would be good to have the list of language codes start and end with some specific strings. Given your experience, what would be the best approach here, considering both a good graphical solution and the easiest way to be regexed?
The same could be said about the list of regexes generated. That list, as you know, comes with an added problem that sometimes it generates this warning: skipped: got: 𐌲𐌿𐍄𐌹𐍃𐌺; from: got.wiki, which theoretically could come accompanied by other similar warnings in the future. Those warnings pollute the list of commands and whatever anchors we choose for them should also take care of leaving the warnings out. - Klein Muçi ( talk) 22:49, 4 March 2021 (UTC)
begin_code_list
... end_code_list
and begin_regex_list
... end_regex_list
.Maybe camel case in English: BeginCodeList
, etc.? That would make it a bit hard to read, but it's not like we'd spend too much time reading it anyway. Albanian strings were the first that came to my mind but we have everything in English there and I wanted to preserve that. -
Klein Muçi (
talk) 00:24, 5 March 2021 (UTC)
Here's what I get as data at the moment:
CodeListBegin</p><div about="#mwt1">aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p about="#mwt1">CodeListEnd
This comes from me getting the HTML file of this page and grepping between CodeListBegin and CodeListEnd. We want only the codes. I can, of course, not include the endpoints, but the problem remains with those command lines in-between. Can they be "fixed" somehow? Or are they a necessary part of the module? Sadly, that kind of defeats the purpose of having those strings as delimiters. - Klein Muçi ( talk) 13:27, 5 March 2021 (UTC)
CodeListBegin</p><div>aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p>CodeListEnd
<div class="div-col columns column-width" style="column-width:30em">
<ul><li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),</li>
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),</li>
...
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),</li></ul>
</div>
<div>...</div>
tags so that it would be somewhat similar to the regex list which has always been wrapped with <div>...</div>
tags and which (apparently) smallem has been able to read and use.|plain=yes
that would instruct the module to render the lists without the 'prettifying' markup and whitespace that make the lists human-readable. But, the keywords would change to include a colon just because humans will read the lists in their machine-readable forms:
CodeListBegin:
... :CodeListEnd
RegexListBegin:
... :RegexListEnd
CodeListBegin:aa,ab,ace,ady,...zh-tw,zh-yue,zu:CodeListEnd
RegexListBegin:(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),...(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),:RegexListEnd
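With anchors like these, the extraction step of the update script could look something like this Python sketch (the function name and the abbreviated sample text are made up for illustration):

```python
import re

# Abbreviated sample of a fetched page using the plain anchors
page = ('<p>CodeListBegin:aa,ab,ace:CodeListEnd</p>'
        '<p>RegexListBegin:skipped: got: X; from: got.wiki'
        '(r"a", r"b"),:RegexListEnd</p>')

def between(text, begin, end):
    # Non-greedy capture of everything between the two anchor keywords
    m = re.search(re.escape(begin) + r'(.*?)' + re.escape(end), text, re.S)
    return m.group(1) if m else ''

codes = between(page, 'CodeListBegin:', ':CodeListEnd').split(',')

raw = between(page, 'RegexListBegin:', ':RegexListEnd')
# Strip "skipped: ...; from: xx.wiki" warnings so only regex tuples remain
regexes = re.sub(r'skipped:[^(]*?\.wiki', '', raw)
```

Because the anchors are unique fixed strings with colons, the capture needs no knowledge of the surrounding HTML, and the warning filter keeps the "skipped" messages out of the command list.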
I was able to utilize the regexes and create a loop to be able to get the whole list eventually. The loop works as intended but I'm running into a strange problem. I keep getting edit conflicts every once in a while even though no one is editing that page other than me with these requests. You need to send the latest revision id of the page you're trying to edit with every request you send and I've written a function to get it and send it automatically. Sometimes it glitches, it doesn't update fast enough, or... I don't know really, but it will make it appear as running into an edit conflict, basically with itself (just because it's trying to edit an older version of the page). This would happen even when sending requests manually (before creating the loop) but then it wouldn't cause any harm because I could just re-send the request and it would get fixed. The problem now is that I send 5 codes per request and if the edit conflict happens, those 5 codes are lost and the loop goes on with the other 5 codes, which means that even 1 edit conflict would corrupt the whole end results. I tried making Smallem sleep after every request (started with 5 seconds per request) and more edits got through before running into conflicts, so I tried increasing the sleep time. Eventually I got to a point where increasing it wouldn't help anymore (reached 10 minutes per request) and the conflicts were rare but still persisted in a, what appears to be, random manner. You could get 5 requests to go through and then get 1-2 conflicts in a row, for example. At this point, I don't really know what's causing them. Maybe some problem in regard to Wikipedia's caches or job queues... Do you have any experience with this kind of problem? I've been reading the bot help pages here and the ones on MediaWiki in regard to API requests but... - Klein Muçi ( talk) 10:18, 7 March 2021 (UTC)
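One common cause of such phantom conflicts is stale conflict-detection fields in the edit request. The MediaWiki action=edit API compares basetimestamp (the timestamp of the revision you loaded) and starttimestamp (when you began editing); if a looping script reuses values from an earlier iteration, the API reports a conflict with yourself. A sketch of building the payload, with all concrete values being placeholders:

```python
def edit_payload(title, text, base_ts, start_ts, token):
    # basetimestamp/starttimestamp must be re-fetched before EVERY request
    # in the loop; reusing an earlier pair makes MediaWiki believe another
    # edit happened in between and raise an edit conflict.
    return {
        'action': 'edit',
        'format': 'json',
        'title': title,
        'text': text,
        'basetimestamp': base_ts,
        'starttimestamp': start_ts,
        'token': token,
    }

payload = edit_payload('User:Smallem/lista', '...',
                       '2021-03-07T10:00:00Z', '2021-03-07T10:00:05Z',
                       'TOKEN+\\')
```

This is only a sketch of the request shape; whether it resolves the conflicts above depends on what the script actually sends each iteration.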
Thank you! You were right. It works even with 100 entries (150 entries reach the MediaWiki limit). Unfortunately, the problem persists. You need 4 iterations to finish the list and I can do only 1 or 2 iterations before it runs into an edit conflict and 100 entries are lost. :P I'll experiment more to try and solve it with good ol' trial and error. Thank you for your persistent help! :)) - Klein Muçi ( talk) 15:19, 7 March 2021 (UTC)
Yes, thank you! Now 1 technical question: Smallem autoupdates itself by first copying the codes from here, then using them in batches in another page and getting the results back (and removing the duplicates at the end). Now assume that the code list is changed by MediaWiki. Would the change be reflected in the HTML source of that page I mentioned above automatically after a while? Or would it need to be edited somehow before reflecting the change? Because if it is the latter, I should teach Smallem to do a dummy edit before starting the process. (A dummy edit would solve that problem, no?) - Klein Muçi ( talk) 23:43, 11 March 2021 (UTC)
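Rather than a dummy edit, the MediaWiki API's action=purge is the standard way to force a page to re-render so that a later HTML fetch reflects the updated list. A minimal sketch of the request parameters (the endpoint and page title are placeholders):

```python
API = 'https://sq.wikipedia.org/w/api.php'   # assumed endpoint

def purge_params(title):
    # action=purge invalidates the cached rendering of the page, which
    # achieves what a dummy edit would, without cluttering the history.
    return {'action': 'purge', 'titles': title, 'format': 'json'}

params = purge_params('Moduli:Smallem')
# e.g. send with an authenticated POST: requests.post(API, data=params)
```

The purge happens server-side, so the script does not need edit rights on the page for this step.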
|list=
takes a number and uses that number to create lines of codes that are that long?|list=
is not a number or the number is greater than the number of codes, defaults to lists of 100 codes. When the value assigned to |list=
is a number less than or equal to the number of language codes, creates lists of number codes.Seems pretty good. - Klein Muçi ( talk) 15:22, 12 March 2021 (UTC)
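The batching rule just described amounts to simple list chunking; a Python sketch of the same logic (the function name is invented for illustration):

```python
def chunk_codes(codes, list_value):
    # Mirrors |list=<value>: if the value is not a usable number, or is
    # larger than the number of codes, fall back to groups of 100.
    try:
        size = int(list_value)
    except (TypeError, ValueError):
        size = 100
    if size < 1 or size > len(codes):
        size = 100
    return [codes[i:i + size] for i in range(0, len(codes), size)]
```

For example, `chunk_codes(codes, 2)` yields pairs, while `chunk_codes(codes, 'yes')` falls back to the 100-per-line default.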
No lines? XD I'm mostly kidding now. Thank you! :) Sometimes I look back and think that I'd probably still be adding regex lines by hand now if you hadn't created that. I don't believe I'd be even close to finishing half of those by now. :P - Klein Muçi ( talk) 14:04, 13 March 2021 (UTC)
|list
parameter empty or if you don't put it at all. You must give it a random value for the list to be generated. And the value does nothing because the format is already fixed. Can it be made to behave like the parameter |plain
? -
Klein Muçi (
talk) 16:13, 14 March 2021 (UTC)
|lang=
is missing or empty. When |list=
has a value,
sq:Moduli:Smallem renders the language list; when |list=
is empty or missing, Smallem renders the regex list according to the codes in |lang=
. I've tweaked it so that Smallem still emits an error message, but isn't so strident.|list=
to some value. At
sq:Përdoruesi:Trappist the monk/Livadhi personal there are two {{#invoke:Smallem|lang_lister|...}}
invokes. They both call the same function lang_lister()
. In the first one, |list=
is not set so lang_lister()
expects to find a list of language codes in |lang=
. That parameter is missing so lang_lister()
cannot proceed except to emit an error message. In the second invoke, |list=
is missing but |lang=
has a list of language codes so lang_lister()
renders the regexes. |list=
is the parameter that decides what smallem will do. If you want 100-item lists from smallem, you must set |list=100
so that it knows that that is what you want. When you don't tell it to list codes, smallem will assume that you want list regexes.|list=
. We only added that to be able to further affect the formatting of the generated codes. Shouldn't it work again without it and just default to a specified default state? Or, asking more practically, what am I supposed to write on
when generating the list of codes with plain=yes? -
Klein Muçi (
talk) 20:13, 14 March 2021 (UTC)
|list=
. See
this diff. The old form was |list=yes
and it always shortstopped lang_lister()
. The new form uses a number but if you give it something that is not a number (like |list=yes
) it defaults to 100-item lists.|plain=yes
selects the machine readable output forms. When |plain=yes
and |list=<anything>
you get the machine readable language-codes list. When |plain=yes
and |list=
(empty or omitted), you get the machine readable regexes list.|lang=
code list before handing the list to the part of lang_lister()
that makes the regexes.
sq:Përdoruesi:Trappist the monk/Livadhi personal
So, my autoupdate script works fine overall, more or less. It still has some problems with accessing the Wikimedia API but I can't do much about those now, because I can't find anything to help me in the documentation. It requires a lot of trial-erroring to reverse engineer it and hopefully with time I'll be able to solve those too. (I've been talking with someone about the API problems but it seems like even that conversation is coming to an end without solving all of the problems. Do take a look if you have time.) Meanwhile I had 1 problem I wanted to ask you about, maybe you can orientate me a bit in the right direction. I don't get consistent results. Most of the time I will be getting 41063 lines but the number can vary a lot (40603, 40778, 40842, 41413, 40735, etc). This is with the list of regexes + 9 lines of commands. Now the variation and the lack of consistency is part of the API problems. (Sometimes I also get errors.) But I'd like to be able to know more specifically about the variation between lines though. Do you have any ideas how I might understand more about what's going on behind the curtains in regard to this? What's changing between every run, what's getting left behind or ignored? The exact number should be 42597 + 9 command lines. Some common debugging tactics I can use. Don't worry if you can't help much though. As you can see above, I haven't been able to get help properly even on MediaWiki so... :P :) - Klein Muçi ( talk) 12:15, 17 March 2021 (UTC)
"manually generated list had some entries with empty characters" – what do you mean by that? What does an
[entry] with empty characters look like? What should it look like?
In the aa
regex list I found: Fe'Fe', Lamnso', Mi'kmaq, Mka'a, Nda'Nda', N’Ko (U+2019 – single comma quotation mark), and O'odham. Didn't find N'go. "What does an [entry] with 'empty' invisible characters look like? What should it look like?"
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2")
→ (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mikmaq(\s*[\|\}])", r"\1mic\2")
|language=Mi'kmaq
→ |language=mic
aa
regex generation):
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2"), ... <twelve regexes> ... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mka'a(\s*[\|\}])", r"\1bqz\2") ...
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mia(\s*[\|\}])", r"\1bqz\2") ...
aa
, get the results manually, concatenate the lists and keep only the unique lines, which I'll show here. -
Klein Muçi (
talk) 00:38, 18 March 2021 (UTC)
So, both methods showed the same number of lines. Script 858, manually 857 (the script adds 1 empty line at the end of the results, which is removed in the last concatenation). I then concatenated both lists, sorted them lexicographically and kept only the lines which weren't duplicated (if a line had a symmetrical friend, they both got deleted), which were the lines above. I was wrong with the U+code. It only shows the carriage return error which here, strangely enough, gets converted to that red dot (check the source code to see what I mean). So, as you can see, the lines above were the only ones that weren't fully symmetrical and that's about it in regard to the change. There are no totally new entries in one list which don't appear at all in the other list. I believe the red dot is coming from the script list judging by the N’Ko/Nko case in which the manual entry gets positioned in second place. - Klein Muçi ( talk) 01:03, 18 March 2021 (UTC)
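Invisible codepoints like a stray carriage return or U+2019 can be surfaced with a short Python helper (a hypothetical debugging aid, not part of the script):

```python
import unicodedata

def reveal(s):
    # List every control or non-ASCII codepoint in a string, so two
    # visually identical lines that refuse to match can be compared.
    return [('U+%04X' % ord(c), unicodedata.name(c, '<unnamed>'))
            for c in s if ord(c) < 32 or ord(c) > 126]

suspicious = reveal('N\u2019Ko\r')
```

Running both lists through such a helper pinpoints exactly which "red dot" characters differ without relying on how a terminal happens to render them.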
{{#language:es-formal|aa}}
→ Spanish (formal address)
{{#language:hu-formal|aa}}
→ Hungarian (formal address)
{{#language:nl-informal|aa}}
→ Dutch (informal address)
aa
doesn't change anything
{{#language:es-formal|es}}
→ Spanish (formal address)
{{#language:es-formal}}
→ español (formal)
nqo
or losing the U+1B36 Balinese vowel sign ulu in ban-bali
.
{{#language:be-tarask}}
→ беларуская (тарашкевіца)
{{#language:be-x-old}}
→ беларуская (тарашкевіца)
{{#language:zh-cn}}
→ 中文(中国大陆)
{{#language:zh-tw}}
→ 中文(臺灣)
As expected, this is what I got now after redoing the same steps mentioned yesterday:
These are the comparisons of the first 10 codes. I thought there would be only some apostrophes missing but the final results surprised me. The Red Dot is back. I'll continue with the other comparisons and hopefully only post the unexpected results, if there are any. The length of the lists continues to be the same in both methods. - Klein Muçi ( talk) 11:46, 19 March 2021 (UTC)
he
:
{{#language:he|am}}
→ ዕብራይስጥ(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ইগ্<200c>বো(\s*[\|\}])", r"\1ig\2)),
Even though I'm still not finding any changes between lists, I found some lines appear like this (this was only one of many) when I was on these codes be-tarask, be-x-old, bg, bh, bi, bjn, bm, bn, bo, bpy
. Usually I get that symbol (<200c>) whenever there's a "strange" character. I used to get the same with what we've talked above until you removed those. What do you think is happening there? -
Klein Muçi (
talk) 03:07, 20 March 2021 (UTC)
cho, chr, chy, ckb, co, cr, crh, cs, csb, cu
ig
in Bengali, the language name contains গ্ and বো (both of which are combinations of 2 and 3 codepoints). Without U+200C ZWNJ you get গ্বো but with it you get গ্বো. Your second example (gd
– from Central Kurdish ckb
) has two U+200C ZWNJ (the flipflop between right-to-left Arabic script and left-to-right Latin script is confusing). U+200C ZWNJ is not a codepoint that should be removed or replaced.<200c>
text with the U+200C ZWNJ codepoint in the regexes, the best that we can do is have
sq:Moduli:Smallem skip language names that contain U+200C ZWNJ. I have tweaked Moduli:Smallem to skip. See
sq:Përdoruesi:Trappist the monk/Livadhi personal which uses the same ten language codes that you listed above. Before the skip code was added, Moduli:Smallem returned 5250 regexes; after, 5245 (Bengali (bn
) and Bishnupriya (bpy
) both use Bengali script so for codes ig
, kpe
, hmn
, and lv
they produce the same regex).
gor, got, gsw, gu, gv, ha, hak, haw, he, hi
:<number-something>
part but instead get some squares or just plain "space" (which is not space, per se, but more like, invisible characters) and I've noticed that in those cases, the bot works fine. I mean, I've done only 1 test some time ago. I must disclose though that in those cases, when you copy the line into a text editor, the squares and the spaces are converted to what the string really is, while the <number-something>
cases stay like that even when copied to somewhere else, so maybe there is a difference. Again, sorry for the lack of terminology. -
Klein Muçi (
talk) 23:04, 20 March 2021 (UTC)
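One way to see exactly which invisible codepoints are hiding in a line (what the terminal shows as `<200c>` and the like) is to scan for Unicode "format" (Cf) characters; a minimal sketch, with a made-up sample string:

```python
import unicodedata

def find_format_chars(text):
    """Report invisible 'format' (Cf) codepoints such as
    U+200C ZWNJ or U+200F RLM, with position, U+ code, and name."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

sample = "ig\u200cbo"  # a string with an embedded zero width non-joiner
print(find_format_chars(sample))  # [(2, 'U+200C', 'ZERO WIDTH NON-JOINER')]
```

The same scan could be widened to other categories (e.g. Cc for control characters) if more oddities turn up.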
kg, ki, kj, kk, kl, km, kn, ko, koi, kr
-
Klein Muçi (
talk) 23:41, 20 March 2021 (UTC)plain "space" (which is not space, per se, but more like, invisible characters). We need to know about these characters because they might prevent a regex match if they are in the MediaWiki list but not in
|language=
as a human would write the parameter value. "when you copy the line into a text editor, the squares and the spaces are converted to what the string really is": When I copy regexes that have U+200F rtl mark from Wikipedia's Moduli:Smallem rendering to Notepad++, the U+200F rtl mark is not converted to '<200f>'.
mr
, si
, and tcy
language sets. U+200D ZWJ is required to join certain codepoints together into a single character so we can do nothing with them. There are five U+200B ZWSP, all from the Khmer language set. These are typographic indicators that can be used to indicate when a line break is permitted; I do not know if they are required or if humans writing these language names would include them. Thank you for all the information! I have a clarification to make though: Regexes that have U+xxxx on them DO NOT get converted when you change the working space. The regexes that get converted are those with squares or "plain space". That's what I was saying and why I believe the U+xxxx cases behave differently from the square/question mark AND plain space/invisible character cases. At this point we're suggesting the same thing, apart from the plain space/invisible character cases, which I believe behave the same as the square/question mark cases; in other words, the tool (Git Bash in my case) doesn't have a font to display them properly. But I don't know why some cases get squares/question marks and some get plain spaces/invisible characters. There are A LOT of both of these cases and, as may be expected by now, they all happen with non-Latin characters. I'll try to paste some of them here from the K-codes mentioned above that I was already working on, to give some examples. - Klein Muçi ( talk) 01:18, 21 March 2021 (UTC)
For me, all the Korean characters above (it is Korean, no?) are shown as empty squares in Git Bash. Just 1 example.
For me, the empty squares in the regex above (squares with question marks, if you see the source code) are shown as just space/void. These cases are rarer than the empty-squares cases and after searching through more than 30 codes, I was only able to find this case again. But there are codes in which they appear more often.
So to make it more clear, there are 3 cases where the strings aren't rendered as they should be:
xal, xh, xmf, yi, yo, yue, za, zea, zh, zh-classical
but instead, accidentally, I put xal, xh, xmf, yi, yo, yue, za, zea, zh, zh-
or maybe even xal, xh, xmf, yi, yo, yue, za, zea, zh, zh
. The way I think it could work is to basically understand which codes are possible and, if it finds alphabetical characters that are not creating a code or are creating the same code twice, render an error accordingly. -
Klein Muçi (
talk) 02:19, 21 March 2021 (UTC)
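The validation idea proposed above (flag entries that aren't known codes or that appear twice) could be sketched like this; the `known` set here is just a made-up subset of the real code list:

```python
def validate_codes(codes, known):
    """Flag entries that are not known language codes or that repeat.
    'known' is the full set of valid codes."""
    seen, errors = set(), []
    for code in codes:
        if code not in known:
            errors.append(f"unknown code: {code}")
        elif code in seen:
            errors.append(f"duplicate code: {code}")
        seen.add(code)
    return errors

known = {"xal", "xh", "xmf", "yi", "yo", "yue", "za", "zea", "zh", "zh-classical"}
print(validate_codes(["xal", "xh", "zh-", "xh"], known))
# ['unknown code: zh-', 'duplicate code: xh']
```

A truncated entry like `zh-` would be caught as unknown, and a re-sent code would be caught as a duplicate.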
|lang=
that end with a hyphen. See
sq:Përdoruesi:Trappist the monk/Livadhi personal.krc, ks, ksh, ku, kv, kw, ky, la, lad, lb
. 4 examples:xal, xh, xmf, yi, yo, yue, za, zea, zh, z
? Or is that not possible? -
Klein Muçi (
talk) 15:37, 21 March 2021 (UTC)
So where do the <200c> etc. text strings come from? If that text string is created by you actually editing the regex, then we should not bother to skip those language names. If those strings are created by a machine, then, yes, language names with U+200B ZWS and U+200D ZWJ should also be skipped.
{{#language:sq}}
→ shqip – 'to' language tag not specified so returns autonym{{#language:sq|sq}}
→ shqip – 'to' language tag specified as Albanian so returns Albanian language name in Albanian{{#language:sq|sq-}}
→ Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English{{#language:sq|sq-L}}
→ Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English{{#language:sq|s}}
→ shqip – malformed 'to' language tag is ignored as expected so returns Albanian autonym|lang=
must be found in MediaWiki's English list of language codes. Hmm, I thought I had made that clear already... Those kinds of text strings, together with other oddities that I mentioned above (those 3 cases I explained), are all regexes that appear in my bash (shell). I use Git Bash (used to use cmd.exe) to SSH to Toolforge where I have an account. I've written a script (part of a bigger script aiming to deal with the whole update of the source code of Smallem - the list of regexes basically) that gets the results from this page using cURL and saves them in a file. When I see these results, the ones I've sent here appear like I've sent them. With the <200c> etc. text strings.
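The grep-between-anchors step of such a script might look like this in Python; the HTML sample and the tag-stripping below are simplified stand-ins for the real page output, not the actual script:

```python
import re

def extract_code_list(html):
    """Pull the comma-separated language codes found between the
    CodeListBegin ... CodeListEnd anchors, ignoring any HTML tag remnants."""
    m = re.search(r"CodeListBegin(.*?)CodeListEnd", html, re.DOTALL)
    if not m:
        return []
    text = re.sub(r"<[^>]+>", "", m.group(1))  # strip intervening tags
    return [code.strip() for code in text.split(",") if code.strip()]

# Made-up sample resembling the fetched page fragment from the thread above.
html = 'CodeListBegin</p><div about="#mwt1">aa, ab, ace</div><p about="#mwt1">CodeListEnd'
print(extract_code_list(html))  # ['aa', 'ab', 'ace']
```

Stripping the tags inside the anchored span sidesteps the `<div>`/`<p>` wrappers that MediaWiki inserts between the delimiters.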
Interesting results, the ones you showed. I'm curious whether other languages may manifest problems like these as well. I don't know who exactly "looks after" these codes. Are they all dealt with by MediaWiki developers not related directly to the languages they deal with? Or are there global volunteers somehow helping in this subject? I believe we've already talked about this in the past. Either way, I expect small wikis' languages to suffer more from inaccuracies like this. - Klein Muçi ( talk) 18:53, 21 March 2021 (UTC)
|list=<number>
and |plain=yes
really an error? |list=
has to have some sort of an assigned value else
sq:Moduli:Smallem attempts to make a regex list from the values assigned to |lang=
.lang_lister()
twice; once against the whole list of possible language codes (output suppressed – no error or skipped messages unless something is so wrong that it can't continue) and the second time against the list of language codes in |lang=
. I can imagine Moduli:Smallem overrunning its allocated time if we try to do that. Hey man! I'm sorry that I kinda disappeared, especially when I was supposed to basically do the last step in testing, the reason why we've been doing all the aforementioned adjustments above. I made some changes to my internet at home and that required some days to take effect. I made Smallem make a full run of autoupdate. The total I get is 40635 lines. From that, remove 17 lines which are other regexes and command lines, and you get 40618 regex lines. I believe the number is still off from the right total, no? Even after making the module skip some "problematic" entries. I'm just asking now because I haven't checked the new total after we made the changes. - Klein Muçi ( talk) 10:23, 30 March 2021 (UTC)
zu
) last code in the list of codes. Perhaps the results of the last code aren't being saved?zu
. Most of the language names begin with 'isi-' so I spotchecked several of them; all that I checked were zu
. I have looked at all of the language names at the top and bottom of the list and all of them are zu
except 'Patois' (jam
):
{{#language:jam|zu}}
→ Jamaican Creole English
Second list of discrepancies. Much to my amazement, the change between iterations wasn't that big. There were the same missing results except: (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Patois(\s*[\|\}])", r"\1jam\2"),
which in the second iteration wasn't missing. 405 results in total in the second run. I'll try it again because I believe the close numbers were just a coincidence. I'll post the results here in the same method when I have them. Looking forward to 5 or 10 tests. -
Klein Muçi (
talk) 00:56, 31 March 2021 (UTC)
zu
code temporarily? I can then do 2-3 tests and see what happens. If we get no discrepancies in any of those tests, we will have successfully isolated the problem. My belief though is that the test that I'm running now won't stay in that realm. :/ -
Klein Muçi (
talk) 01:22, 31 March 2021 (UTC)
zu
from Module:Smallem. But this is really strange and inspiring at the same time. -
Klein Muçi (
talk) 02:40, 31 March 2021 (UTC)
zu
, perhaps it would be better to invert the language code sort so that zu
is first and aa
is last. If the result is the same then that confirms your something-wrong-with-zu-handling hypothesis. If the result shows that all of the lost regexes are from aa
then that suggests that somewhere the last group of regexes isn't being correctly handled. Test 1 with inversion complete. Things are starting to go into the unknown again. I got 41269 lines in total. Remove 17 "other" lines and you get 41252 lines of regexes. I'll soon post here the exact missing lines. - Klein Muçi ( talk) 01:20, 1 April 2021 (UTC)
aa
has only one language name written in Afar, its autonym. All other language codes fall back to some other language (likely English):
{{#language:aa|aa}}
→ Qafár af{{#language:de|aa}}
→ German{{#language:en|aa}}
→ English{{#language:fr|aa}}
→ French{{#language:sq|aa}}
→ Albanianzu
and nothing wrong with aa
. Your aa
results tend to lend credence to the notion that the last group of regexes isn't being included in the final list of regexes (it's the automated list that is coming up short, right?)Yes, that was my initial belief too, ever since you mentioned that all the problems arise with zu
, which was the last code. The reason for that lies in the way the script works. Speaking in general lines, it gets the list of codes from a wiki page and then transforms that concatenated one-line list into a one-code-per-line list (using the commas as delimiters). Then it starts a loop that gets 1 line from that list (so 1 code), creates an API request with it, saves the results (the list of regexes generated) in a file and removes that 1 line (so 1 code) from the overall code list. Then it gets 1 line from that list (which is the next code in line, because the line before it just got deleted), creates an API request with it... and so on until it runs out of codes. Since I'm not good at scripting, I may have designed the loop badly so that the results from the last API request fail to be saved, whatever that is. So, you arriving at the same conclusion is a good sign because that would mean that I just need to fix the loop and one of my 2 problems would be solved. The problem that confuses me though is that the automated list, which is the one that is coming up short, is not coming up short by just 1 entry compared to the manual list. If we do a simple subtraction, the results missing are around 1000 lines, no? Or am I being misled in the logic that I'm using? Either way, in order for us to be totally sure that this really is the problem that we're having, I have to ask: how easy is it to tweak module Smallem to create the environment for this experiment? Make it only produce 2 codes. One is en
and the other one being a "popular" language that we know that it doesn't have any fallbacks. Maybe Italian (it
)? Then we can see what happens. If all the regexes from it
go missing, then we're sure that the problem is in the loop and I have to fix that. I don't know how easy or hard it is for you to do that, though. If it is hard, maybe I can tweak my script to work with only those 2 codes, but I'm a bit reluctant to do further changes in it before being sure what the current problem is, for fear of introducing even more bugs. -
Klein Muçi (
talk) 15:40, 1 April 2021 (UTC)
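A minimal sketch of a loop that cannot drop the last code: iterate over the code list directly instead of deleting lines from it while looping. `fetch_regexes()` here is a hypothetical stand-in for the real API request, not the actual script:

```python
def fetch_regexes(code):
    """Hypothetical stand-in for the real API request for one language code."""
    return [f"regex-for-{code}"]

def run_all(codes, out_path):
    """Write every code's results to out_path. Iterating over the list
    directly (rather than popping lines until empty) means the final
    iteration is guaranteed to run and be written before the file closes."""
    with open(out_path, "w", encoding="utf-8") as out:
        for code in codes:
            for line in fetch_regexes(code):
                out.write(line + "\n")

run_all(["aa", "ab", "zu"], "regexes.txt")
```

Closing the file via the `with` block also rules out losing the tail of the output to an unflushed buffer, a common cause of a missing last batch.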
{{#invoke:Smallem|lang_lister|list=yes |plain=yes}}
to be CodeListBegin:en, it:CodeListEnd
. In that case,
sq:Përdoruesi:Trappist the monk/Livadhi personal.It
has Regex count: 859 though. Do you think fallbacks are to be blamed for the discrepancy? -
Klein Muçi (
talk) 20:07, 1 April 2021 (UTC)
it
wouldn't have any fallbacks so the total number of regex lines from that code would also be the number of missing lines from the grand total (given that there would be only 2 codes and only en
would be used). And that would confirm that the problem comes from the loop not saving the results from the last code. But maybe the experiment above is enough as proof nonetheless? -
Klein Muçi (
talk) 20:16, 1 April 2021 (UTC)
en
so I guess... -
Klein Muçi (
talk) 20:18, 1 April 2021 (UTC)I was able to fix the loop and now I get 1481 lines all the time, which is 1464 + 17, precisely how many lines there should be. I'm excited! Can you revert your change about the hardcoding to module:Smallem so I can do a full run with all the codes now to see what happens? - Klein Muçi ( talk) 22:53, 1 April 2021 (UTC)
zu
regex lines missing again. :/ I'll retry switching the module to 2 codes/debug mode to be totally sure it works fine in that mode. :/ -
Klein Muçi (
talk) 01:03, 2 April 2021 (UTC)
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Qafár\ af(\s*[\|\}])", r"\1aa\2"),
This one goes missing. Any idea what might be happening? Do you think it's still the same problem that we're seeing? (The results from the last code are being ignored?) These were the codes used: aa, ab, ace, ady, af, ak, als, alt, am, an, ang, ar, arc, ary, arz, as, ast, atj, av, avk
Keep in mind that it worked perfectly fine with aa, ab, ace, ady, af, ak, als, alt, am, an
. -
Klein Muçi (
talk) 10:20, 2 April 2021 (UTC)Meh, I redid some more runs with the same codes. Sometimes I get that line missing, sometimes I don't. I'm disappointed that we got back the inconsistencies between runs. I know they are related to non-printing characters somehow confusing my script but I don't know how, because the problems started happening as soon as the module started skipping results for the first time. Is there anything special about that single line? I believe whatever is happening in this case is happening throughout the whole process and the missing lines get accumulated in the end, to give the 1k-short total. There must be another character somewhere (not necessarily in that line) that we also need to skip. I'm gonna take a close look at the whole list of results, see if I find anything strange. - Klein Muçi ( talk) 10:40, 2 April 2021 (UTC)
aa
from the group and from the |lang=
parameter. The Qafár af regex is the only unique language name contributed by aa
(the autonym); all other language codes fall back to English, which is contributed by the other languages in the group. Perhaps your 'fix' to the tail end disrupted the head end? |lang=
gets the same language codes that are listed in the debug lang_codes
(
line 104). What happens if you swap aa
with ab
?aa
from Module:Smallem (that's what you meant, no?) and tried a full run. I got no problems. 4500 results in total: 4483+17 "other" lines. What does this mean? I'm following blindly now. -
Klein Muçi (
talk) 21:06, 2 April 2021 (UTC)
lang_codes = {'ab', 'aa', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug
Okay then. I'll do that now. And what's the hypothesis we're testing with that? - Klein Muçi ( talk) 21:35, 2 April 2021 (UTC)
Perhaps your 'fix' to the tail end disrupted the head end?
aa
, perhaps you did...lang_codes = {'aa', 'ab', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug
. I still get 4501 results now. The correct number. Without doing any changes to the script. :/ Should we revert the debug mode now and try a full normal run? Even though, this doesn't make much sense. We didn't do any change basically, no? :P -
Klein Muçi (
talk) 23:22, 2 April 2021 (UTC)
Well... Basically, yes... I mean, my initial idea was to keep adding codes into the "old" debug mode gradually until I started having problems (which happened when I added the second group of 10 codes). Even then, the problems weren't consistent. They happened only half of the times. Then you changed module:Smallem to introduce the "new" debug mode which automatically takes care for the code and regex list. I tried a run with that with aa
missing. Then tried it with aa
in second place (after ab
) and then tried it with aa
in the beginning. All the tests brought the correct, expected results. Without doing any change to the script. Now I don't know what to do. :/ We either switch it to the normal form, or continue adding other codes in this mode and see what happens. The problem with the second option though is that you have changed module:Smallem to add the codes automatically to |lang=
. When more than 10 codes are used simultaneously, the script tends to malfunction in getting the generated regex lines and starts saving empty results. I don't know how it will behave if we add 10 more codes and basically are forced to do 30 codes simultaneously. -
Klein Muçi (
talk) 23:53, 2 April 2021 (UTC)
|lang=
from 20 to 30) and the end results after the script completed its run contained only those 17 "other" lines, with no regex lines, as expected. :/ If we want to keep experimenting and add codes gradually by 10, we need to turn back to the old debug mode where codes weren't added automatically to |lang=
. I don't know what else to try now, really. Because everything seem to be working fine at the moment. :P In this mode with only those 20 codes that is. -
Klein Muçi (
talk) 00:32, 3 April 2021 (UTC)|lang=
value to some subset of lang_codes
by writing a replacement for line 104:
args.lang = 'ab, ab, ace';
I have noticed that you have some interest in this article I created. I am on a quest to make as many Good Articles as I can this year. If you think this is a possibility, would you be interested in copy editing the article and nominating it at GAN? You would get credit for a Good Article as the nominator and I would get credit as the creator of the article when it gets promoted. I will be glad to solve most of the issues the reviewer brings up.-- Doug Coldwell ( talk) 14:34, 6 April 2021 (UTC)
|last=
.Regarding: I don't know what to think about
developer discouraged
. Outside of the RFC, is there such a thing?
No, I literally just made it up. I'm not exactly wedded to the term, and I just wanted people to understand what the exact status of these parameters was (to replace the old definition of deprecated but also describe something where support won't ever necessarily be removed).
I was exactly thinking it was going to be used for maintenance category names (figured it would be something like Category:CS1 maint: nonhyphenated parameter
), but that's not really for me to weigh in on. –
MJL
‐Talk‐
☖ 18:14, 7 April 2021 (UTC)
How long have you been working, and this is the only way I know how to send a message haha, and thanks for being here, I saw you working on tons of articles before! Ilikememes128 ( talk) 14:23, 13 April 2021 (UTC) |
Hello, Trappist,
All of a sudden, this template is showing a red link category but the most recent edit was by you in May 2020 so I'm not sure what caused things to change. Typically when red link categories appear, they are due to a recent edit by a new editor but that's not the case here. I don't like to edit templates, well, except userboxes that are causing problems, so I was hoping you could look this over and see what the problem is. Thank you in advance. Liz Read! Talk! 21:13, 13 April 2021 (UTC)
Hi, Trappist. I hate to bother you about this, but would you mind re-adding this code you added a while back (or a revised version of it) into the current version of the sandbox? I tried to add it back, but the local month names didn't work. Here are two testcases pages. – Srđan ( talk) 21:45, 12 April 2021 (UTC)
Category:CS1 maint: discouraged parameter has been nominated for deletion. A discussion is taking place to decide whether this proposal complies with the categorization guidelines. If you would like to participate in the discussion, you are invited to add your comments at the category's entry on the categories for discussion page. Thank you. Fram ( talk) 17:14, 16 April 2021 (UTC)
@
Trappist the monk:, hello from el.wiktionary. I am so grateful for your lessons for Lua's
a-b-c and
how to make data modules. It helped to make possible many modules for our small wiki,
like this one! My understanding of Lua is limited to "if bla then xxx end" but unfortunately we have no Luaists around anymore, so we have to proceed with whatever we can.
If you ever have time, could you help with an extra question (it is not urgent at all, and things function ok as they are now). Some modules are becoming SO big! I do not know how to extract a large part and place it in an outside module, or at a subpage. They are not data, they have lots of "ifs". For example
Thank you, and excuse my bothering you with such questions! Sarri.greek ( talk) 11:03, 26 April 2021 (UTC)
require ('Module:el-articles').articles (args) -- modifies args{}
is_set()
function so that tests like this:
if args['ακε'] ~= '' and args['ακε'] ~= nil then args['ακε'] = args['ακε'] else args['ακε'] = '' end
if not is_set (args['ακε']) then args['ακε'] = '' end
create_link()
, stem_color()
, etc. are used by multiple modules, you might consider creating a separate utilities module that other modules in this family of modules might use. "because elsewhere, a language link is involved": What does that mean? There is no
language link involved in
require ('Module:el-articles').articles (args)
and there would be no language link involved if you created el:wikt:Module:Utilities. Functions in Module:Utilities would be required into the module that needs them just as el:wikt:Module:tin is required into el:wikt:Module:el-nouns-decl and also into el:wikt:Module:el-articles. The advantage is that these utility functions live in a single place so when changes are necessary, those changes occur in only one place, not in many places.
is_set()
is a function that returns a boolean (true
or false
). Here is the example I used above:
if not is_set (args['ακε']) then args['ακε'] = '' end
args['ακε']
has a non-nil
value because the value assigned to a_klenstr
is a concatenation of args['ακε']
and \n
. In Lua, you can't concatenate a nil
type to a string
type. When |ακε=
is included without an assigned value in the template or module call, args['ακε']
gets an empty string value. When |ακε=
is omitted from the template or module call, args
does not have a key ['ακε']
so args['ακε']
returns nil
.is_set (args['ακε'])
to determine if args['ακε']
has an assigned value that is anything but blank. Blank means empty string or nil
. If args['ακε']
is empty string or nil
, is_set()
returns false
indicating that args['ακε']
is not set. In the snippet, not
inverts the value returned from is_set()
so a false
return becomes true
for the if ... then
test indicating that args['ακε']
is not set so args['ακε'] = ''
ensures that args['ακε']
is not nil
for the concatenation that makes a_klenstr
.tinti
was missing. I hacked a crude regex to search Module:el-article for text that looks like a function call. I think that the regex was [a-z] *\(
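For illustration, the is_set() behavior described above can be mirrored in Python (a sketch only; the real helper is Lua, where a missing table key yields nil):

```python
def is_set(value):
    """Python analogue of the Lua is_set(): False when the value is
    None (Lua nil) or an empty string, True otherwise."""
    return value is not None and value != ''

args = {}
# equivalent of: if not is_set (args['ακε']) then args['ακε'] = '' end
if not is_set(args.get('ακε')):
    args['ακε'] = ''
# args['ακε'] is now always a string, so concatenation is safe
```

`dict.get()` plays the role of Lua's nil-returning table lookup for a missing key.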
.What is the process for making new citation templates? medRxiv preprints are usually put in the cite journal template, which triggers a citation error, or cite web, which seems clumsy. There is Template:Cite bioRxiv, so it may be beneficial for there to also be Template:Cite medRxiv. Velayinosu ( talk) 01:51, 3 May 2021 (UTC)
Hello Trappist the monk, the category is empty. Yipee !!! Lotje ( talk) 12:24, 16 May 2021 (UTC)
Hi! I noticed that you have made many edits to Module:Citation/CS1/Configuration so I hope you have a little time to help me/Danish Wikipedia.
Some time ago the English module was copied to Danish Wikipedia and modified whenever there was a problem. When InternetArchiveBot was introduced a local user made some edits and we started up the bot. Sadly the setup was not 100% correct and the user that used to fix it has not been active for a few months. The last comment was something about bad health so we do not know if the user can ever return. So I started to look at it.
I think it would perhaps be best to copy the modules from enwiki to dawiki and make the conversions from scratch. So I have copied the modules to da:Modul:Citation/CS1/sandkasse etc. (sandkasse = sandbox).
So my question for you is where I should make edits to make it localized?
I know I have to uncomment 3 lines like here: da:Special:Diff/10754534 and to make a lot of edits in da:Modul:Citation/CS1/Configuration/sandkasse (so far I only made a few adjustments related to dates like da:Special:Diff/10754497).
I know da:Modul:Citation/CS1/Date validation/sandkasse should be modified if we want to allow 'yMd'. I have not done this yet because I got errors right from the start: da:Speciel:PermanentLink/10754556. And the problem does not seem to be related to ymd.
Should I set language to 'da' somewhere? As far as I can tell the module should be able to figure out that it is da.wiki.
Do I need to change any of the other sub modules to make it work?
My plan is to document the changes needed on da.wiki so that more than one user knows how to set it up locally. -- MGA73 ( talk) 18:21, 30 April 2021 (UTC)
date_names['local']['long']
and date_names['local']['short']
are inverted for use by reformatter()
. It is not possible to invert ['December'] = 12, ['december'] = 12
to [12] = 'December', [12] = 'december'
and have access to both forms of the month name. If both 'December' and 'december' are needed, some sort of special code will have to be written to support that. If all that you need is the lowercase form, that is all that should be included in the date_names['local']['long']
and date_names['local']['short']
tables.patterns{}
; perhaps something like this:
-- day-initial: day. month year
['d.My'] = {'^([1-9]%d?)%. +(%D-) +((%d%d%d%d?)%a?)$', 'd', 'm', 'a', 'y'},
check_date()
; perhaps something like this:
elseif mw.ustring.match(date_string, patterns['d.My'][1]) then -- day-initial: day. month year
day, month, anchor_year, year = mw.ustring.match(date_string, patterns['d.My'][1]);
month = get_month_number (month);
if 0 == month then return false; end -- return false if month text isn't one of the twelve months
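For illustration, here is a rough Python equivalent of the day-initial Lua pattern sketched above, matching dates like "11. december 2021". This is an approximation for testing the idea, not the module's code (Lua's `%a` matches any letter, approximated here with `[a-z]`):

```python
import re

# Rough Python analogue of the Lua 'd.My' pattern:
# 1-2 digit day, a dot, a month name, then a 3-4 digit year
# optionally followed by an anchor letter.
D_MY = re.compile(r"^([1-9]\d?)\. +(\D+?) +((\d{3,4})[a-z]?)$")

m = D_MY.match("11. december 2021")
print(m.groups())  # ('11', 'december', '2021', '2021')
```

As in the Lua version, the month capture takes anything non-numeric, so validating the name against the month table is still a required second step.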
'd.My'
then I think it should perhaps be 'd.my'
because we use lowercase months. That makes me wonder if I should change all the codes with the "M" variants to lowercase or if the code can handle capital letters?'dec.' = 12
I guess that is a bad idea? It would be better to create a pattern for that like when we use a dot after day? So we should have a 'd.m.y'
? --
MGA73 (
talk) 08:27, 1 May 2021 (UTC)
'd.My'
the uppercase 'M' is intended to mean 'month-as-name' while lowercase 'm' is intended to mean 'month-as-digit' so I think that you should not change these. 'dec.' = 12
is correct. Those patterns that look for month names are looking for anything that is not a digit ('My'
is an oddity). The patterns don't care if the month name has punctuation. The whole month-name capture is used to index into date_names['local']['short']
.is_valid_month_range_style()
in the en.wiki sandbox; see
Module:Citation/CS1/Date validation/sandbox#L-246. The original code there was quite old (December 2014) and, I think, precedes any real notion of i18n. You might want to replace your is_valid_month_range_style()
with the en.wiki/sandbox version. You can be our testbed.Hello again! Great then things are going the right way :-) I changed my sandbox with a copy from your sandbok (see da:Special:Diff/10755535). I tested it in my own sandbox like da:Speciel:PermanentLink/10755537. It looks like it can handle "maj" like it should. But not "Maj" and that is okay because Danish months use lowercase. I guess the reason it accepts "December" is because it also accepts English months. So it works? -- MGA73 ( talk) 17:26, 1 May 2021 (UTC)
is_valid_month_range_style()
because none of the dates are ranges. is_valid_month_range_style()
is used for these (copy them as you see them here and paste them into your sandbox):
*{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–december 2021}}
– no error*{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–december 2021}}
– error*{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–dec. 2021}}
– no error*{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–dec. 2021}}
– error11. december 2021
and 11. December 2021
but the auto-translation of the month names at least causes the module to render 'december' from December
. In your testcases, Maj
cannot be translated because that name is not an English month-name. We might tweak the auto-translation code so that it adds a maintenance category whenever a translation is made. One of your gnomes then can monitor that category and fix the dates in the wikisource.'d.My'
we talked about earlier. They use dot too in
no:,
nn:,
fi: and
fo: (perhaps other wikis too). So perhaps you could add the support in the code so that the wikis that need it can just uncomment it? --
MGA73 (
talk) 08:53, 2 May 2021 (UTC)
cfg.date_name_auto_xlate_enable
and cfg.date_digit_auto_xlate_enable
which you have set in ~/Configuration. I do not see this as "self detect" because a human has to enable/disable the auto-translation – the module knowing that it is on da.wiki is
"self detect".
date_hyphen_to_dash()
does what you want it to do, then use it.did not not seem to work good with the new codemeans to you. There are two hyphen/dash converters one in ~/Date validation for dates and the other in the main module for
|pages=
, |issue=
, etc.Hello again! We tried again and da:Modul:Citation/CS1/Date_validation/sandkasse#L-1059 and forward seems to show us what we want now :-D
However, it also changes some wrong dates to correct dates without showing us errors. Examples can be seen in da:Skabelon:Citation/testcases#Interval_med_datoer/dansk ("(Forventet: OK)" means "Expected: OK" and "(Forventet: fejl)" means "Expected: error").
We use "-" without spaces around here:
We use " – " with spaces around here:
Perhaps you could have a look at the code and tell us if we could do something much smarter? For example, I can't help thinking that the original code defines different patterns for dates, so we should perhaps be able to figure out how to use that. -- MGA73 ( talk) 13:19, 15 May 2021 (UTC)
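On the hyphen/dash point, a toy version of what a converter in the spirit of date_hyphen_to_dash() does might look like this (Python rather than the module's Lua, and the pattern is a simplified stand-in of my own, not the module's actual pattern): a hyphen separating the two halves of a date range is rewritten as an en dash.

```python
import re

# Simplified sketch of a hyphen-to-dash converter for date ranges:
# a hyphen between two range halves becomes an en dash (U+2013).
# This pattern is illustrative only, not the CS1 module's own.

def date_hyphen_to_dash(date):
    # word character (optionally dot-terminated), hyphen, word character
    return re.sub(r'(\w\.?)-(\w)', r'\1–\2', date)

print(date_hyphen_to_dash('oktober-december 2021'))  # 'oktober–december 2021'
print(date_hyphen_to_dash('11.-12. maj 2021'))       # '11.–12. maj 2021'
```

A real converter also has to decide *when* to convert – e.g. leaving hyphens alone in ISO-style dates like 2021-05-15 – which is exactly why the module keeps separate converters for dates and for |pages=/|issue=.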
{{Citation/sandkasse |...}}?
Hello again! I made da:Special:Diff/10769630 because I suddenly realized that we have other variants than 'd.My' that include a dot. We also have 'd.-d.My', 'd.M-d.My' and 'd.My-d.My'. It eliminated a lot of the problems we had :-) -- MGA73 ( talk) 17:14, 17 May 2021 (UTC)
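If it helps to see those four variants side by side, here is one hedged way to express them as patterns (Python regexes of my own devising, where d = day with a trailing dot, M = month name, y = year; the real module uses its own Lua patterns and configuration):

```python
import re

# Illustrative patterns for the Danish dotted-date variants mentioned
# above: 'd.My', 'd.-d.My', 'd.M-d.My', 'd.My-d.My'. Not the module's
# actual patterns - a sketch only.

MONTH = r'(?:januar|februar|marts|april|maj|juni|juli|august|september|oktober|november|december)'
DAY = r'\d{1,2}\.'   # day number with the Danish trailing dot
YEAR = r'\d{4}'

PATTERNS = {
    'd.My':      rf'{DAY} {MONTH} {YEAR}',
    'd.-d.My':   rf'{DAY}–{DAY} {MONTH} {YEAR}',
    'd.M-d.My':  rf'{DAY} {MONTH}–{DAY} {MONTH} {YEAR}',
    'd.My-d.My': rf'{DAY} {MONTH} {YEAR}–{DAY} {MONTH} {YEAR}',
}

def date_format(date):
    """Return the name of the first matching format, else None."""
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, date):
            return name
    return None

print(date_format('11. december 2021'))              # 'd.My'
print(date_format('11.–12. december 2021'))          # 'd.-d.My'
print(date_format('30. november–1. december 2021'))  # 'd.M-d.My'
```

Defining each variant as its own pattern, as above, is roughly what the original en.wiki code does for its date formats, which is why reusing that structure for the Danish variants worked out.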
Hello Trappist the monk, I am a student who is new to Wikipedia. One of my courses consists in editing and updating the Kirindy Forest Wikipedia page. I saw that you edited the Kirindy Mitea National Park page and was wondering if you would be able to provide me with some feedback and help me improve it.
Any assistance would be greatly appreciated :) — Preceding unsigned comment added by Marie Salichon ( talk • contribs) 01:47, 22 May 2021 (UTC)
monkbot is bad! 141.91.210.58 ( talk) 06:45, 4 June 2021 (UTC)
Hello Ttm, I was curious to know if you think it would be possible to add the ability to watch a specific talk page thread, as opposed to the entire page. Perhaps as a feature whereby you could add individual threads to your watchlist? Or follow along by some other means, where maybe there's some kind of notification each time a post is added to the thread?
I'm asking you first, to see if pursuing this is even worthwhile. Is it impossible? Would you know if this has come up before? Otherwise, I suppose I would start at Phabricator and create a new feature request, then go from there? Any feedback would be appreciated. Cheers - wolf 02:39, 2 June 2021 (UTC)
You have written on Nanvag Rajputs; from where did you get that, brother? Kunalpratapsingh35 ( talk) 16:07, 6 June 2021 (UTC)
I did preview. It's my opinion that bare urls do not reflect well. In some cases, the refill tool does a nice job of improving the ref. In some cases it produces a broken citation often because the underlying URL has a problem. It's my opinion that highlighting the problem, so the editors with SME can see them and fix them, is better than leaving them as bare URLs. It appears you think leaving bare URLs even when they are flawed is a better option. We disagree.-- S Philbrick (Talk) 16:24, 27 June 2021 (UTC)
[http://www.damligan.se/player.html?id=1517462 ''Player Bio'' damligan.se] {{webarchive|url=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |date=2010-08-13 }}
{{Cite web|url=http://www.damligan.se/player.html?id=1517462|archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462|deadurl=y|title=''Player Bio'' damligan.se|archivedate=August 13, 2010}}
{{cite web}}: Unknown parameter |deadurl= ignored (|url-status= suggested) (help)
[https://archive.is/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 ''Umeå player roster'' eurobasket.com]
{{Cite web|url=https://www.eurobasket.com/team.asp|archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1|deadurl=y|title=Udominate Basket Umea basketball - team details, stats, news, roster - EUROBASKET}}
{{cite web}}: |archive-url= requires |archive-date= (help); Unknown parameter |deadurl= ignored (|url-status= suggested) (help)
{{Cite web |url=http://www.damligan.se/player.html?id=1517462 |archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |title=Pamela Rosanio |website=Damligan |language=sv |archive-date=August 13, 2010}}
{{Cite web |url=https://www.eurobasket.com/team.asp |archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 |archive-date=2012-11-17 |title=Udominate Basket Umea basketball team |website=Eurobasket}}
I try - after a conversation with Graham87 - to avoid direct links to foreign Wikipedias. How can we do that in cite templates, such as Karl-Günther von Hase? -- Gerda Arendt ( talk) 20:23, 26 June 2021 (UTC)
Re that conversation with Graham87: the documentation for {{ill}} carries a notice box telling editors that {{ill}} is not to be used within cs1|2 templates. There are a couple of reasons for that. When one template (in this case {{cite book}}) contains another template ({{ill}}), the inner template ({{ill}}) is rendered before the outer template ({{cite book}}). When the page is processed by MediaWiki, the {{cite book}} parameter value |editor2={{ill|Rolf Breitenstein|de}} is processed first, so what {{cite book}} gets is:
|editor2=[[Rolf Breitenstein]]<span class="noprint" style="font-size:85%; font-style: normal; "> [[[:de:Rolf Breitenstein|de]]]</span>
Everything between the <span>...</span> html tags is not the editor's name, so it does not belong in |editor2=.
{{ill}} sets the [[[:de:Rolf Breitenstein|de]]] to a font size of 85% of the current text size. That is ok when {{ill}} is used in the body of an article, because 85% of the normal '100%' article-body font size yields the smaller text at 85% of normal. But,
MOS:FONTSIZE instructs us to not set smaller font sizes inside infoboxen, navboxen, and reference sections. {{refbegin}} (used in Karl-Günther von Hase) sets the references to 90% of normal. {{ill}} then sets [[[:de:Rolf Breitenstein|de]]] to 85% of 90%, or 100 × 0.90 × 0.85 = 76.5%, which is smaller than the allowed 85% font size. Here is a simple illustration of how that works (for clarity, you may need to adjust your browser's zoom setting):
<span style="font-size:90%;">[<span style="font-size:85%;">[]</span></span>
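As an aside, the nested-percentage arithmetic is easy to verify directly; a trivial check (only the 90% and 85% figures come from the discussion above):

```python
# {{refbegin}} scales the reference section to 90% of normal, and
# {{ill}} scales its small link text to 85% of *that*, so the
# effective size is the product of the two factors.

refbegin = 0.90   # {{refbegin}} reference-section scale
ill = 0.85        # {{ill}} small-text scale
effective = 100 * refbegin * ill
print(f"{effective:.1f}%")  # 76.5%, below the 85% floor in MOS:FONTSIZE
```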
The remedy is to change |editor2={{ill|Rolf Breitenstein|de}} to |editor2=Rolf Breitenstein |editor-link2=:de:Rolf Breitenstein.
This conversation having apparently died, I am reverting Editor Gerda Arendt's partial revert of my edits.
— Trappist the monk ( talk) 13:17, 29 June 2021 (UTC)