Return-Path:
Received: from matups.math.u-psud.fr ([129.175.50.4] verified) by vsu.ru (CommuniGate Pro SMTP 3.5.9) with ESMTP id 945575; Sun, 23 Jun 2002 05:34:43 +0400
Received: from stats.math.u-psud.fr (beryl.math.u-psud.fr [129.175.54.194]) by matups.math.u-psud.fr (8.11.6/jtpda-5.3.3) with ESMTP id g5N1YeY19853; Sun, 23 Jun 2002 03:34:40 +0200 (MEST)
Received: (from sieben@localhost) by stats.math.u-psud.fr (8.11.6+Sun/8.11.6) id g5N2XrL23407; Sun, 23 Jun 2002 03:33:53 +0100 (WEST)
Date: Sun, 23 Jun 2002 03:33:53 +0100 (WEST)
From: Laurent Siebenmann
Message-Id: <200206230233.g5N2XrL23407@beryl.math.u-psud.fr>
To: CyrTeX-en@vsu.ru, vvv@vsu.ru
Subject: Re: Russian/Polish/German...without switching

Hi Vladimir, Leif and others,

More thoughts on directions for 16-bit TeX (\simeq Omega). Excuse my being a bit prolix.

As everybody is tributary to standard computers, and these seem to be drifting toward 16-bit fonts and text files, the issue is not to provoke a change, but rather to imagine how to derive some profit from a change that is inevitable. Yes, I do believe that 16-bit text will gain ground and become universally accepted at some level that is relevant to TeX. Bill Gates is implementing 16-bit text, and I imagine he hopes it will become obligatory. I personally hope it will just be an interesting option and a basis for useful standards.

LaTeX users may well believe that the simplifications offered by 16-bit text are quite unneeded. Programmers too. However, computer OS designers who dare hope their product will be sold with minimal adjustment worldwide seem sure to opt for 16-bit text as a common design feature. In a couple of decades a majority of computers will be sold in countries like India, China, and Japan, where the need for 16 bits is clear. TeX is a flea on the back of the electronic publishing elephant; Bill Gates goads the elephant, not the flea.
Some possible desiderata for multilingual TeX typing:

-- In typing TeX, all Cyrillic characters and punctuation should be distinct and disjoint from Latin characters and punctuation.
-- The ASCII range should be reserved as the language of TeX programming. I'll not risk saying anything revolutionary about math; it can certainly continue to be coded as ASCII.
-- In general, languages with noticeably distinct typographic traditions should be disjointly encoded for typing.
-- Color or style should be used on screen to distinguish languages (and ASCII).
-- As far as TeX (Omega) is concerned, the typescript should be nothing more than a sequence of 16-bit characters.
-- There should (ultimately) be a TeX exchange format for 16-bit TeX typescripts.

Concerning difficulty:

VV> in other words, your approach is practically unusable: it
> is hard to implement, it is non-flexible (requires changing
> if one more language needs to be added, or if new font
> shapes need to be used), and gives unneeded complications to
> what can be achieved with plain markup commands.

Today, yes. After all, 16-bit screen fonts with languages disjointly stacked do not yet exist except in germinal form. Nor do decent physical typing comforts. However, although the approach may seem difficult to you as seen from the tangled situation we currently live with, I suspect it is intrinsically the simplest way to go multilingual.

Concerning hyphenation:

VV> some issues with hyphenation mentioned in my previous email
> are due to the fact that english and russian have different
> righthyphenmin values: 3 for english and 2 for russian. if
> we use combined russian-english patterns, we have to use
> some setting like 2 for righthyphenmin (minimal number of
> letters which are allowed to be cut off the end of the
> hyphenated word).
VV> the ruhyphen package provides some
> mechanism to make that work in such a way that english
> words will be hyphenated with a 3-letter limit, but that
> works mostly, not 100% equivalent to the case when the
> languages are separated (and language-switching markup
> commands are used).

The "hyphenmin" problem is real with current TeX, so probably it is also real with Omega (though I'm not sure). But as you explain, the problem is by now largely solved where Russian is concerned. Three more points:

-- Even where no special patterns have been made to solve this problem, one may get good results by using the value 3 for hyphenmin, just so long as no bad breaks crop up.
-- Omega might someday introduce a 'character class' feature in its hyphenation mechanism to solve the hyphenmin problem with just a couple of patterns!
-- In rare cases of need (as when bad line breaks crop up) one is free to change the hyphenmin values ad hoc.

Concerning punctuation:

VV> also, there are some typographical issues (russian has
> frenchspacing and some special typesetting of :;?! signs,
> etc). so in general it is better to use language switching
> commands, so that typographical features of all languages
> will be activated at their full volume (if you don't switch
> languages, the typographical rules are shared for all
> languages).

This seems wrong. On the contrary: do not some Russian typographers want "nonfrenchspacing" for Russian, and also special national punctuation with national side-bearings (different from English)? Today "nonfrenchspacing" is inaccessible to them. Even for French, disjoining ASCII from French punctuation is very desirable today, to avoid having punctuation of 'active' category when TeX inputs the typing. In the 16-bit world, the Knuthian \sfcode mechanism (used character-by-character) has enough flexibility to impose frenchspacing in a language-dependent way.
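For contrast, here is a minimal sketch of how the two language-dependent settings discussed above (righthyphenmin and french spacing) are switched in today's 8-bit TeX. The wrapper macro names are hypothetical, and \l@english and \l@russian are assumed to be hyphenation-pattern registers allocated at format-building time, in the style of babel:

```latex
% Hypothetical per-language wrappers at the primitive level.
\def\English{%
  \language=\l@english  % assumed register for the English patterns
  \righthyphenmin=3     % at least 3 letters carried to the next line
  \nonfrenchspacing}    % extra space after sentence-ending punctuation
\def\Russian{%
  \language=\l@russian  % assumed register for the Russian patterns
  \righthyphenmin=2     % Russian tradition allows 2
  \frenchspacing}       % sets \sfcode of . ? ! : ; etc. to 1000
```

Note that \frenchspacing and \nonfrenchspacing are themselves just bundles of character-by-character \sfcode assignments, which is why the \sfcode mechanism is flexible enough to do this per language.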
Thus in the stacked encoding scheme I am recommending, each language has its own punctuation with tailor-made side-bearings, and french spacing could even be turned on or off language by language.

VV> but what encoding would these disjointed files use? it will
> not be unicode (neither utf-8 nor utf-16 or whatever),
> because in unicode all latin letters are jointed.

Somewhere in Unicode there are regions reserved for private use; why not TUG use? Standards are a good thing; let's have enough of them! Russian might encode on "0700--"07FF and French on "2100--"21FF (placement motivated by well-known international telephone prefixes). Alternatively, the placement of a chunk like T2A in such a 16-bit encoding could be left indeterminate and instead declared in a header. This would save TUG from agonizing over a basically fatuous placement decision, and it would also make the 16-bit system adequate for any real manuscript without *ever* going to 32-bit TeX.

VV> i'd like to emphasize that the encoding where latin letters
> are separated for different languages (as well as cyrillic,
> e.g. for Russian and Ukrainian) has to be non-standard and
> custom (specific to the languages being used), because all
> existing character encodings do not support the disjoint
> latin alphabet for different languages.

Indeed. Many, many, many languages (not just Russian, and possibly a majority) would benefit typographically from having their own optimal punctuation -- something that only English really has now.

VV> in your editor you have to use some keystrokes (or mouse
> clicks) to switch from one "color" (language) to another.

I suggested using function keys to change language. Then the keyboard encoding changes (but not the 16-bit screen font encoding); the color (or maybe the style) changes too. Most important is that the 16-bit text file's character sequence tells all (though a header linking to preexisting encodings is probably wanted); that text file goes straight to 16-bit TeX for typesetting.
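To make the stacked layout concrete, here is a sketch; the block placements are only the illustrative proposal above, not any existing standard, and the individual slot numbers are my own assumptions:

```latex
% Sketch of the proposed "stacked" 16-bit layout (illustrative only):
%   "0000--"007F : ASCII, reserved for TeX programming and math
%   "0700--"07FF : Russian letters and Russian punctuation
%                  (e.g. the T2A layout dropped in as a block)
%   "2100--"21FF : French letters and French punctuation
% A 16-bit character in such a scheme could be addressed directly:
\char"0710   % some slot in the Russian block (position hypothetical)
\char"2109   % some slot in the French block  (position hypothetical)
```

With the header variant, the "0700 and "2100 base addresses would not be fixed at all; the header would declare which standard chunk (T2A, T1, ...) sits at which base.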
TeX implementors do not build text editors; they merely adapt them. I hope enough suitable editors will appear in the normal course of progress. On the Mac it seems probable that they will appear in the next 3 to 10 years. As Leif has emphasized, Nisus on the Mac seems fertile ground for this development. Why not MSWord and WordPerfect?

Finally, let me answer your summarized complaints in a summarized way:

VV> home-made encoding

Not really, if assembled from standard pieces like T2A in a way specified in a header.

VV> (font coloring) you will use mouse/etc to "color" text

An intrinsically colored 16-bit screen font might be best. That seems foolproof. Then one just changes keyboard (physical or virtual) to access different parts of it.

VV> special editor (so
> not everybody will be able to use the TeX source
> files written in that home-made encoding)

We have to wait for 16-bit screen fonts and capable editors. Incidentally, I expect (and hope) that virtual 16-bit screen fonts will be based on preexisting 8-bit screen fonts. MSWord and Nisus and other classical word processors are remarkably close to that. But I hope to see program editors enter the fray (see BBEdit on the Mac).

VV> special VF fonts

I hope a future Omega would automatically assemble them from preexisting pieces, namely national TeX efforts like the T2A-encoded Cyrillic fonts.

VV> mouse/etc to "color" text

No, just a function key to fetch the keyboard of the new language; color (or style) changes should be part of the screen font.

Let me counter all this negative criticism by suggesting that the scheme I am conjuring up compares well with preexisting competitors for any highly multilingual typescript -- meaning one which cannot be typed and read using a single 8-bit font. One competitor that comes to mind is a classical word processor using several text fonts and embedded (La)TeX language switches. One easily goes to TeX via 8-bit text. But what about typescript porting and archiving?
RTF is a poor candidate, and others are worse.

Another competitor is perhaps the one Vladimir is backing, namely the Unicode variant UTF-8 with (La)TeX language switches. Real sixteen-bit Unicode requires Omega. But many planes of Unicode may be needed, i.e. this is not really 16-bit but 32-bit re-encoded to 8-bit. Does that not become tangled and slow? As always, typing and screen viewing is a bad bottleneck. I would emphasize that while Linux/Unix is UTF-8 oriented, Bill Gates prefers 16-bit Unicode.

Just to confuse matters, I add that these three competitors can be cross-bred to beget others... All three seem worth discussing, and I hope that discussion will map the territory ahead for the day when the text-processing prerequisites do become available. Maybe even in time for LaTeX3?

Cheers,

Laurent S.

PS. On the (La)TeX output side, this development does not concern raw fonts with real Type 1 or bitmapped glyphs attached. Anything new at the font level would be realized using virtual fonts.