[issue:254] Re: regexpatch: error on (u)pTeX with Japanese char

Hironobu Yamashita h.y.acetaminophen @ gmail.com
Thu Jan 17 21:46:32 JST 2019


Hi Bruno,

> Do you know why there are two separate
> primitives \catcode and \kcatcode?
> It seems that their functionality
> could have been combined into a single one.

I don't know the history behind pTeX's design, but I guess they
cannot be merged because they mean different things.

[1] Although both of these primitives take an argument
  which denotes a character code, the meaning is different.

* \catcode (allowed values: 0--15) applies to an individual
  character code.
  e.g. \catcode`A=13 affects only the character "41 (= A).
* \kcatcode (allowed values: 16--18 for pTeX, 15--19 for upTeX)
  applies to a range of character codes.  The ranges differ
  between pTeX and upTeX because of their internal encodings.
    - pTeX can handle JIS X 0208 characters.  These characters
      are arranged in a "ku"-"ten" (row-cell) table, so the
      ranges are based on "ku" ("区", row) in the JIS standard.
    - upTeX can handle Unicode characters, so the ranges are
      based on Unicode blocks.
  e.g. \kcatcode`あ=18 affects all Hiragana characters.

Please consider the following example:

  \showthe\kcatcode`あ % => 17
  \showthe\kcatcode`い % => 17
  \kcatcode`あ=18
  \showthe\kcatcode`あ % => 18
  \showthe\kcatcode`い % => 18

"あ" and "い" are Hiragana characters, which are located in
JIS 4-ku (for pTeX) and Unicode block 0x3040--0x309F (for upTeX).
The default \kcatcode for these ranges is 17, but if we change
that of "あ" to 18, all Hiragana characters get 18.
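
To make the per-character vs. per-range difference concrete, here
is a small Python toy model.  It is only a sketch and not how the
engines store these tables internally: the names block_of,
set_kcatcode and get_kcatcode are hypothetical, and only the
Hiragana block of upTeX (with the bounds quoted above) is modelled.

```python
# Toy model only: not how pTeX/upTeX store these tables internally.

# Per-character table, as with \catcode: one entry per character code.
catcode = {ord("A"): 11, ord("B"): 11}

# Per-range table, as with upTeX's \kcatcode: one entry per Unicode block.
# Only the Hiragana block is listed, keyed by its (first, last) code points.
HIRAGANA = (0x3040, 0x309F)
kcatcode_by_block = {HIRAGANA: 17}


def block_of(ch):
    """Return the block containing ch (hypothetical helper)."""
    code = ord(ch)
    for first, last in kcatcode_by_block:
        if first <= code <= last:
            return (first, last)
    raise KeyError(ch)


def set_kcatcode(ch, value):
    # Setting the value via one character rewrites its whole block's entry.
    kcatcode_by_block[block_of(ch)] = value


def get_kcatcode(ch):
    return kcatcode_by_block[block_of(ch)]


catcode[ord("A")] = 13   # like \catcode`A=13: "B" keeps its own entry
set_kcatcode("あ", 18)   # like \kcatcode`あ=18: "い" shares the entry
print(catcode[ord("B")], get_kcatcode("い"))  # prints: 11 18
```

In this model, assigning through "あ" changes what "い" reports
because both characters share a single table entry, whereas "A"
and "B" each have their own.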

[2] upTeX's \kcatcode has an additional meaning compared to pTeX's.

When (u)pTeX is given an input, JP characters and non-JP characters
are distinguished during tokenization.  The criterion for this
differs between pTeX and upTeX:

* With pTeX, the criterion is fixed:
    - char code 0--255: non-JP
    - others: JP
* With upTeX, the criterion can be changed freely, even inside
  a single document.  For this purpose, upTeX extended the
  \kcatcode primitive as follows:
    - if \kcatcode = 15 is set, non-JP (precisely "non-CJK")
    - if \kcatcode = {16,17,18,19} is set, JP (precisely "CJK")
  A simple example:

  % plain upTeX
  \showthe\kcatcode`^^c0 % => default: 15 (it only means non-JP)
  \showthe\catcode`^^c0  % => 12
  \kcatcode`^^c0=18      % (change this Unicode block to JP)
  \showthe\kcatcode`^^c0 % => 18
  \showthe\catcode`^^c0  % => returns 12, but not actually used.

After this distinction is made, a JP character is converted to a
JP character token that carries its \kcatcode; a non-JP character
is converted to a non-JP character token that carries its \catcode.
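
The decision described in [2] can be sketched as a Python toy
model.  Again, this is only an illustration under the rules stated
above and not the engines' actual code; the function names and the
tuple used to represent a token are hypothetical.

```python
# Toy model of the JP / non-JP decision; not the engines' actual code.

def is_jp_ptex(code):
    # pTeX: fixed rule. Character codes 0--255 are non-JP, others are JP.
    return code > 255


def is_jp_uptex(kcatcode):
    # upTeX: decided by the current \kcatcode of the character's range.
    # 15 means non-JP (non-CJK); 16--19 mean JP (CJK).
    return kcatcode in (16, 17, 18, 19)


def tokenize_uptex(code, catcode, kcatcode):
    # A JP token carries the \kcatcode; a non-JP token carries the \catcode.
    if is_jp_uptex(kcatcode):
        return ("JP", code, kcatcode)
    return ("non-JP", code, catcode)


# ^^c0 with the values from the example: \catcode 12, \kcatcode 15.
print(tokenize_uptex(0xC0, 12, 15))
# After \kcatcode`^^c0=18, the \catcode (still 12) is not consulted.
print(tokenize_uptex(0xC0, 12, 18))
```

The second call shows why \catcode`^^c0 still returns 12 in the
example but is "not actually used": once the range is JP, only the
\kcatcode ends up in the token.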


> I may be missing some reason why \catcode and \kcatcode are separate.
> Of course the documentation is hard to read for non-Japanese speakers.

upTeX is currently not well documented, but we hope to provide
some documentation in the future.  English manuals for both pTeX
and upTeX should exist as well.

Hironobu

