Mailing List CyrTeX-en@vsu.ru Message #203
From: Laurent Siebenmann <sieben@cristal.math.u-psud.fr>
Subject: Re: Russian/Polish/German...without switching
Date: Sun, 23 Jun 2002 03:33:53 +0100 (WEST)
To: <CyrTeX-en@vsu.ru>, <vvv@vsu.ru>


Hi Vladimir, Leif and others

More thoughts on directions for 16bit TeX (\simeq Omega).
Excuse my being a bit prolix.

As everybody is dependent on standard computers and these
seem to be drifting toward 16-bit fonts and text files,
the issue is not to provoke a change, but rather to
imagine how to derive some profit from a change that is
inevitable.  Yes, I do believe that 16 bit text will gain
and become universally accepted at some level that is
relevant to TeX. Bill Gates is implementing 16 bit text, and
I imagine he hopes it will become obligatory. I personally
hope it will just be an interesting option and a basis for
useful standards.

LaTeX users may well believe that the simplifications
offered by 16 bit text are quite unneeded. Programmers
too.  However computer OS designers who dare hope their
product will be sold with minimal adjustment worldwide
seem sure to opt for 16 bit text as a common design
feature. In a couple of decades a majority of computers
will be sold in the countries like India, China, Japan
where the need for 16 bits is clear.  TeX is a flea on the
back of the electronic publishing elephant; Bill Gates
goads the elephant, not the flea.

Some possible desiderata for multilingual TeX typing:

In typing TeX, all Cyrillic characters and punctuation should
be distinct and disjoint from Latin characters and punctuation.

The ASCII range should be reserved as the language of
TeX programming.

I'll not risk saying anything revolutionary about math; it can
certainly continue to be coded as ASCII.

In general, languages with noticeably distinct typographic
traditions should be disjointly encoded for typing.

Color or style should be used on screen to distinguish
languages (and ASCII).

As far as TeX (Omega) is concerned, the typescript should be
nothing more than a sequence of 16 bit characters.  

There should (ultimately) be a TeX exchange format for 16 bit
TeX typescripts.

Concerning difficulty:

 VV> in other words, your approach is practically unusable: it
 > is hard to implement, it is non-flexible (requires changing
 > if one more language needs to be added, or if new font
 > shapes need to be used), and gives unneeded complications to
 > what can be achieved with plain markup commands.

Today yes.  After all, 16 bit screen fonts with languages
disjointly stacked do not yet exist except in germinal form.
Nor decent physical typing comforts. However, although the
approach may seem difficult to you as seen from the tangled
situation we currently live with, I suspect it is intrinsically
the simplest way to go multilingual.


Concerning hyphenation:

 VV> some issues with hyphenation mentioned in my previous email
 > are due to the fact that english and russian have different
 > righthyphenmin values: 3 for english and 2 for russian. if
 > we use a combined russian-english patterns, we have to use
 > some setting like 2 for righthyphenmin (minimal number of
 > letters which are allowed to be cut off the end of the
 > hyphenated word). the ruhyphen package provides some
 > mechanism to make that work in such a way that english
 > words will be hyphenated with 3 letter limit, but that
 > works mostly, but not 100% equivalent to the case when the
 > languages are separated (and language-switching markup
 > commands are used).

The "hyphenmin" problem is real with current TeX so probably it
is also real with Omega (though I'm not sure). But as you
explain, the problem is by now largely solved where
Russian is concerned. Three more points:

  -- even where no special patterns have been made to solve
  this problem one may get good results by using value 3 for
  hyphenmin; just so long as no bad breaks crop up.

  -- Omega might someday introduce a 'character class' feature
  in its hyphenation mechanism to solve the hyphenmin problem
  with just a couple of patterns!

  -- In rare cases of need (as when bad line breaks crop up)
  one is free to change the hyphenmin values, ad hoc.
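
A rough sketch of the 'character class' idea, in Python rather
than in Omega itself. It assumes that a word's language can be read
off from the block its first character sits in, and that a pattern
matcher has already proposed candidate break positions. The names
HYPHENMIN, language_of and allowed_breaks are hypothetical, the
left minima are assumed, and the Russian block base is only the
example placement discussed in this thread; none of this is an
existing Omega feature.

```python
# Hypothetical per-language minima (left, right). The right values 2 and 3
# are the righthyphenmin values quoted above; the left values are assumed.
HYPHENMIN = {"russian": (2, 2), "english": (2, 3)}


def language_of(code):
    """Classify a 16-bit character code by the block it falls in."""
    if 0x0700 <= code <= 0x07FF:  # example Russian block, not a standard
        return "russian"
    return "english"  # assumption: everything else is treated as English


def allowed_breaks(word, candidates):
    """Keep the break positions that respect the word's own minima.

    word: list of character codes.
    candidates: positions i, meaning a break between word[i-1] and word[i].
    """
    left, right = HYPHENMIN[language_of(word[0])]
    return [i for i in candidates if left <= i <= len(word) - right]
```

With the same candidate list, a seven-letter English word loses
the break that would leave only two letters at the end, while a
seven-letter Russian word keeps it.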

Concerning punctuation:

 VV> also, there are some typographical issues (russian has
 > frenchspacing and some special typesetting of :;?! signs,
 > etc). so in general it is better to use language switching
 > commands, so that typographical features of all languages
 > will be activated at their full volume (if you don't switch
 > languages, the typographical rules are shared for all
 > languages).

This seems wrong.  On the contrary.  Do not some Russian
typographers want  "nonfrenchspacing" for Russian, and also
special national punctuation with national side-bearings
(different from English)? Today "nonfrenchspacing" is
inaccessible to them.  Even for French, disjoining ASCII from
French punctuation is very desirable today to avoid having
punctuation of 'active' category when TeX inputs the typing.

In the 16bit world, the Knuthian \sfcode mechanism (used
character-by-character) has enough flexibility today to impose
frenchspacing in a language-dependent way. Thus in the stacked
encoding scheme I am recommending, each language has its own
punctuation with tailor-made side-bearings, and french spacing
could even be turned on or off language by language.
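
To make the point concrete, here is a small Python model of a
per-character space-factor table in the spirit of \sfcode. It is
only a sketch: the code points for the Russian and French periods
are assumed slots in the example blocks, and the names SFCODE,
set_frenchspacing and sfcode are hypothetical. The values 3000 and
1000 are the ones plain TeX gives a period under \nonfrenchspacing
and \frenchspacing respectively.

```python
# Each language owns its own period, so each can carry its own space factor.
ASCII_PERIOD = 0x2E
RUSSIAN_PERIOD = 0x072E  # assumed slot in an example Russian block
FRENCH_PERIOD = 0x212E   # assumed slot in an example French block

# 3000 = wide sentence space (nonfrenchspacing), 1000 = frenchspacing.
SFCODE = {ASCII_PERIOD: 3000, RUSSIAN_PERIOD: 3000, FRENCH_PERIOD: 1000}


def set_frenchspacing(period, on):
    """Switch frenchspacing on or off for one language's period only."""
    SFCODE[period] = 1000 if on else 3000


def sfcode(code):
    """Space factor of a character; unlisted characters get 1000 here
    (this ignores the 999 that TeX assigns to uppercase letters)."""
    return SFCODE.get(code, 1000)
```

Calling set_frenchspacing(RUSSIAN_PERIOD, True) changes the
Russian period alone; the ASCII and French entries are untouched.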

 > but what encoding would these disjointed files use? it will
 > not be unicode (neiter utf-8 nor utf-16 or whatever),
 > because in unicode all latin letters are jointed.

Somewhere in unicode there are regions reserved for private
use; why not TUG use?  Standards are a good thing; let's have
enough of them! Russian might encode on "0700--"07FF and
French on "2100--"21FF (placement motivated by well known
international telephone prefixes).  Alternatively, the
placement of a chunk like T2A in such 16 bit encoding could
be left indeterminate and rather declared in a header. This
would avoid TUG agonizing over basically fatuous placement
decisions and it would also make the 16 bit system adequate
for any real manuscript without *ever* going to 32 bit TeX.
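
The header idea might be sketched as follows (in Python, purely
for illustration). The HEADER table and the functions to_stacked
and from_stacked are hypothetical names; the two block bases are
just the example placements given above, and a "slot" means a
position 0..255 in an 8-bit chunk such as T2A.

```python
# Hypothetical header: language chunk -> base of its 256-slot block.
# A different typescript could declare different bases in its own header.
HEADER = {"russian": 0x0700, "french": 0x2100}


def to_stacked(language, slot, header=HEADER):
    """Map an 8-bit slot of a language chunk to its 16-bit code."""
    if not 0 <= slot <= 0xFF:
        raise ValueError("slot must fit in 8 bits")
    return header[language] + slot


def from_stacked(code, header=HEADER):
    """Recover (language, slot); the ASCII range stays reserved for TeX."""
    if code < 0x80:
        return ("ascii", code)
    for language, base in header.items():
        if base <= code <= base + 0xFF:
            return (language, code - base)
    raise ValueError(f"code {code:#06x} is not declared in the header")
```

Since the bases live in the header and not in the functions,
moving a chunk elsewhere changes one table entry and nothing else.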

 VV> i'd like to emphasize that the encoding where latin letters
 > are separated for different languages (as well as cyrillic,
 > e.g. for Russian and Ukrainian) has to be non-standard and
 > custom (specific to the languages being used), because all
 > existing character encodings do not support the disjoint
 > latin alphabet for different languages.

Indeed.  Many, many, many languages (not just Russian, and
possibly a majority) would benefit typographically
from having their own optimal punctuation -- something that
only English really has now.

 > in your editor you have to use some keystrokes (or mouse
 > clicks) to switch from one "color" (language) to another.

I suggested using function keys to change language.  Then the
keyboard encoding will change (but not the 16 bit screen font
encoding); also the color changes (or maybe the style).
Most important is that the 16bit text file character sequence
tells all, though a header linking to preexisting codings is
probably wanted; that text file goes straight to 16bit TeX
for typesetting.

TeX implementors do not build text editors; they merely
adapt them.  I hope enough suitable editors will appear in
the normal course of progress.  On the Mac it seems
probable that they will appear in the next 3 to 10 years.
As Leif has emphasized, Nisus on Mac seems fertile ground
for this development. Why not MSWord and WordPerfect?

      Finally, let me answer your summarized complaints in a
summarized way:

 VV> home-made encoding

Not really, if assembled from standard pieces like T2A in
a way specified in a header.

 VV> (font coloring) you will use mouse/etc to "color" text

An intrinsically colored 16bit screen font might be best.
That seems foolproof.  Then one just changes keyboard
(physical or virtual) to access different parts of it.

 VV> special editor (so
 > not everybody will be able to use the TeX source
 > files written in that home-made encoding)

We have to wait for 16 bit screen fonts and capable editors.
Incidentally I expect (and hope) that virtual 16bit screen
fonts will be based on preexisting 8-bit screen fonts. MSWord
and Nisus and other classical word-processors are remarkably
close to that.  But I hope to see program editors enter the
fray (see BBEdit on the Mac).

 VV> special VF fonts

I hope a future Omega would automatically assemble them
from preexisting pieces which are national TeX efforts
like T2A encoded Cyrillic fonts.

 VV> mouse/etc to "color" text

No, just a function key to fetch the keyboard of the new
language; color (or style) changes should be part of the
screen font.

     Let me counter all this negative criticism by suggesting
that the scheme I am conjuring compares well with preexisting
competitors for any highly multilingual typescript -- meaning
one which cannot be typed and read using a single 8-bit
font.

     One competitor that comes to mind is a classical
wordprocessor using several text fonts and embedded (La)TeX
language switches.  One easily goes to TeX via 8-bit text.
But what about typescript porting and archiving? RTF is a
poor candidate and others are worse.

    Another competitor is perhaps one Vladimir is backing.
Namely the unicode variant UTF8 with (La)TeX language
switches. Real sixteen bit unicode requires Omega. But many
planes of unicode may be needed, i.e. this is not really 16bit...
but 32bit reencoded to 8bit.  Does that not become tangled
and slow? As always, typing and screen viewing is a bad
bottleneck. I would emphasize that while Linux/unix is UTF8
oriented, Bill Gates prefers unicode.

    Just to confuse matters I add that these 3 competitors
can be cross-bred to beget others...

    All three competitors seem worth discussing, and I hope
that will map the territory ahead for the day when the text
processing prerequisites do become available. Maybe even in
time for LaTeX 3?

Cheers

Laurent S.

PS.  On the (La)TeX output side, this development does not
concern raw fonts with real type1 or bitmapped glyphs
attached.  Anything new at the font level would be realized
using virtual fonts.

