[issue:254] Re: regexpatch: error on (u)pTeX with Japanese char

Thu, 17 Jan 2019 21:46:32 JST

Hi Bruno,

> Do you know why there are two separate
> primitives \catcode and \kcatcode?
> It seems that their functionality
> could have been combined into a single one.

I don't know the history behind pTeX's design, but I guess they
cannot be merged because they have different meanings.

[1] Although both primitives take an argument that denotes
  a character code, the meaning of that argument is different.

* \catcode (allowed values: 0--15) applies to an individual
  character code.
  e.g. \catcode`A=13 affects only the character "41 (= A).
* \kcatcode (allowed values: 16--18 for pTeX, 15--19 for upTeX)
  applies to a range of character codes.  The ranges differ
  between pTeX and upTeX because of their internal encodings.
    - pTeX can handle JIS X 0208 characters.  These characters
      are arranged in a "ku"-"ten" (row-cell) table, so the
      ranges are based on "ku" ("区", row) in the JIS standard.
    - upTeX can handle Unicode characters, so the ranges are
      based on Unicode blocks.
  e.g. \kcatcode`あ=18 affects all Hiragana characters.

Please consider the following example:

  \showthe\kcatcode`あ % => 17
  \showthe\kcatcode`い % => 17
  \kcatcode`あ=18
  \showthe\kcatcode`あ % => 18
  \showthe\kcatcode`い % => 18

"あ" and "い" are Hiragana characters, which are located in
JIS 4-ku (for pTeX) and Unicode block 0x3040--0x309F (for upTeX).
The default \kcatcode for these ranges is 17, but if we change
that of "あ" to 18, all Hiragana characters get 18.
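
For contrast, here is a sketch of the same experiment with \catcode.
It assumes the plain TeX defaults, and "B" is just an arbitrary
second letter picked for illustration:

  \showthe\catcode`A % => 11
  \showthe\catcode`B % => 11
  \catcode`A=13
  \showthe\catcode`A % => 13
  \showthe\catcode`B % => 11 (unchanged)

Only "A" changes; "B" keeps its own \catcode, even though both
are Latin letters.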

[2] upTeX's \kcatcode has an additional meaning compared to pTeX's.

When (u)pTeX is given an input, JP characters and non-JP characters
are distinguished during tokenization.  The criterion used here
differs between pTeX and upTeX:

* With pTeX, the criterion is fixed:
    - char code 0--255: non-JP
    - others: JP
* With upTeX, the criterion can be changed flexibly, even inside
  a single document.  For this purpose, upTeX extended the
  \kcatcode primitive as follows:
    - if \kcatcode = 15 is set, non-JP (more precisely, "non-CJK")
    - if \kcatcode = {16,17,18,19} is set, JP (more precisely, "CJK")
  A simple example:

  % plain upTeX
  \showthe\kcatcode`^^c0 % => default: 15 (it only means non-JP)
  \showthe\catcode`^^c0  % => 12
  \kcatcode`^^c0=18      % (change this Unicode block to JP)
  \showthe\kcatcode`^^c0 % => 18
  \showthe\catcode`^^c0  % => returns 12, but not actually used.

After this classification, a JP character is converted to a JP
character token carrying its \kcatcode; a non-JP character is
converted to a non-JP character token carrying its \catcode.

> I may be missing some reason why \catcode and \kcatcode are separate.
> Of course the documentation is hard to read for non-Japanese speakers.

upTeX is currently not well documented, but we hope to provide
documentation in the future.  English manuals for both pTeX
and upTeX should also be written.

Hironobu