Revised Tk Text Widget

Support of Hyphenation

Command-Line Name: -hyphens

Database Name: hyphens

Database Class: Hyphens

Specifies a boolean indicating whether hyphenation support is activated. If activated, then in state disabled or readonly the soft hyphen (Unicode code point U+00AD, also called SHY) is invisible unless it falls at the end of a line, where it is displayed as a hyphen character to mark the point at which a word was divided for line adjustment. If the widget is in state normal, the soft hyphen is displayed like a normal character (with code point U+2010); it is not used for hyphenation, but may be used for line wrapping, depending on the value of option -wrap.

HYPHENATION RULES

In traditional (pre-reform) German orthography, a “c” before the hyphenation point can change into a “k”: “Drucker” hyphenates into “Druk-ker”. This spelling change is called the ck rule.
In modern Dutch, an e-diaeresis after the hyphenation point can change into a simple “e”: “reëel” hyphenates into “re-eel”. This also applies to i-diaeresis, o-diaeresis, and u-diaeresis. This spelling change is called the trema rule.
In German, Norwegian, and Swedish, a triple consonant can change into a double consonant: Swedish “tugg-gummi” becomes “tuggummi” when not hyphenated, Norwegian "buss-sjåfør" becomes "bussjåfør", and German "Schiff-fahrt" becomes "Schiffahrt" (in pre-reform orthography). This spelling change is called the tripleconsonant rule. (Note that this is a reverse rule, applied when a word is not hyphenated at this point.)
In Dutch, a letter can disappear inside a doubled vowel: “opaatje” hyphenates into “opa-tje”. A double "e" changes to the corresponding vowel with acute accent, for example "cafeetje" becomes "café-tje". This spelling change is called the doublevowel rule.
Hungarian has an unusual hyphenation case which involves reinsertion of a root letter, as in the following example: "vissza" becomes "visz-sza". The special cases, occurring when the characters are in the middle of a word, are: "ccs" becomes "cs-cs", "ggy" becomes "gy-gy", "lly" becomes "ly-ly", "nny" becomes "ny-ny", "tty" becomes "ty-ty", "zzs" becomes "zs-zs", "ssz" becomes "sz-sz". This rule is named doubledigraph.
In Catalan, a geminated consonant can be split: the word "paral·lel" hyphenates into "paral-lel". The name of this rule is gemination.
In Polish the hyphen is repeated after the line break; for example, "tech-nik" becomes "tech- -nik". The name of this rule is repeathyphen.

Hyphenation rule RULE (which may result in spelling changes) is applied automatically in the following cases:

The hyphen is tagged with hyphenation rule RULE (tag option -hyphenrules includes RULE).
The widget option -hyphenrules contains rule RULE, and no tag attached to this hyphen overrules it.

Command-Line Name: -hyphenrules

Database Name: hyphenRules

Database Class: HyphenRules

Specifies a list of spelling change rules for hyphenated words, given by the identifiers of the rules. Only the rules given in this set will be used for spelling changes. By default the set of rules is empty, which means that all rules are allowed. The rules are important when command tk_textInsert (or tk_textReplace) is used with automatic spelling changes. An error will be thrown if one of the given rules is not defined. See section HYPHENATION RULES for the defined rules.

If also option -lang is specified for a particular soft hyphen, then only the rules belonging to this language will be applied to this soft hyphen. This supports simpler hyphenators, not knowing spelling changes.

This is how the hyphen rules are applied ("\+" denotes the soft hyphen character):

ck: If the soft hyphen is the right neighbor of character "c", and its own right neighbor is character "k", then the ck hyphenation rule is applied to this hyphen. Example: the German word "Druc\+ker" hyphenates into "Druk-ker". This rule belongs to the German language (de).
gemination: If the soft hyphen is the left neighbor of a geminated letter 'l' (or 'L'), then the gemination hyphenation rule is applied. Example: the Catalan word "para\+llel" hyphenates into "paral-lel". This rule belongs to the Catalan language (ca).
doublevowel: If the soft hyphen is the right neighbor of a doubled vowel (the same vowel twice in a row), then the doublevowel hyphenation rule is applied to this hyphen and one of the two vowels is dropped. If this vowel is an "e" then the remaining "e" is converted to e-acute. Example: the Dutch word "opaa\+tje" hyphenates into "opa-tje", and "cafee\+tje" becomes "café-tje". This rule belongs to the Dutch language (nl).
doubledigraph: In Hungarian the following spelling changes are applied with this rule: "c\+cs" becomes "cs-cs", "g\+gy" becomes "gy-gy", "l\+ly" becomes "ly-ly", "n\+ny" becomes "ny-ny", "t\+ty" becomes "ty-ty", "z\+zs" becomes "zs-zs", and "s\+sz" becomes "sz-sz".
repeathyphen: In Polish the hyphen is repeated after the line break; for example, "tech\+nik" becomes "tech- -nik".
trema: If the soft hyphen is the right neighbor of any vowel, and the left neighbor of a vowel with trema (diaeresis), then the trema hyphenation rule is applied to this hyphen. Example: the Dutch word "re\+ëel" hyphenates into "re-eel". This rule belongs to the Dutch language (nl).
tripleconsonant: If the soft hyphen is the right neighbor of a double consonant, and its own right neighbor is the same consonant, and this consonant is followed by a vowel (or by the letter "j" in Norwegian), then the tripleconsonant hyphenation rule is applied when *not* hyphenating at this point. Examples: the German word "Schiff\+fahrt" hyphenates into "Schiff-fahrt", but is written "Schiffahrt" when no hyphenation is done. The Swedish word "tugg\+gummi" becomes "tuggummi", and the Norwegian word "buss\+sjåfør" becomes "bussjåfør". This rule belongs to the languages German (de), Norwegian (no, nb, nn), and Swedish (sv).
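The rules above can be modelled on plain strings. The following is only a sketch in pure Tcl, not the widget's implementation: the procedure names splitAtShy, breakAt and joinUnbroken are hypothetical, only lowercase letters are handled, and each word is assumed to contain exactly one soft hyphen.

```tcl
# Sketch only: models the spelling changes on plain strings, not inside the widget.
# splitAtShy, breakAt and joinUnbroken are hypothetical helper names.
set SHY \u00AD   ;# soft hyphen

# Split a word at its first soft hyphen into the parts before and after it.
proc splitAtShy {word} {
    global SHY
    set pos [string first $SHY $word]
    if {$pos < 0} { error "no soft hyphen in \"$word\"" }
    return [list [string range $word 0 [expr {$pos - 1}]] \
                 [string range $word [expr {$pos + 1}] end]]
}

# Return the list "lineEnd nextLineStart" for a break at the soft hyphen.
proc breakAt {word rule} {
    lassign [splitAtShy $word] head tail
    switch -exact -- $rule {
        ck {
            # Druc|ker: the "c" before the break becomes "k"
            if {[string index $head end] eq "c" && [string index $tail 0] eq "k"} {
                set head [string replace $head end end k]
            }
        }
        trema {
            # re|eel with trema: the vowel after the break loses its trema
            set plain [string map [list \u00EB e \u00EF i \u00F6 o \u00FC u] \
                           [string index $tail 0]]
            set tail [string replace $tail 0 0 $plain]
        }
        doublevowel {
            # opaa|tje: drop one vowel; a remaining "e" becomes e-acute
            set v [string index $head end]
            if {[string match {[aeiou]} $v] && [string index $head end-1] eq $v} {
                set head [string range $head 0 end-1]
                if {$v eq "e"} { set head [string replace $head end end \u00E9] }
            }
        }
        gemination {
            # para|llel: one "l" moves in front of the break
            set l [string index $tail 0]
            if {[string tolower $l] eq "l" && [string index $tail 1] eq $l} {
                append head $l
                set tail [string range $tail 1 end]
            }
        }
        doubledigraph {
            # vis|sza: the digraph after the break is completed before it
            set dg [string range $tail 0 1]
            if {$dg in {cs gy ly ny ty zs sz}
                    && [string index $head end] eq [string index $dg 0]} {
                append head [string index $dg 1]
            }
        }
        repeathyphen {
            # tech|nik: the hyphen is repeated on the next line
            set tail -$tail
        }
        default { error "unknown hyphenation rule \"$rule\"" }
    }
    return [list $head- $tail]
}

# Return the unbroken spelling; tripleconsonant is the one reverse rule.
proc joinUnbroken {word {rule ""}} {
    lassign [splitAtShy $word] head tail
    set c [string index $tail 0]
    if {$rule eq "tripleconsonant"
            && ![string match {[aeiou]} $c]
            && [string range $head end-1 end] eq "$c$c"
            && [string match {[aeiouj]} [string index $tail 1]]} {
        set tail [string range $tail 1 end]
    }
    return $head$tail
}
```

For example, `breakAt "Druc\u00ADker" ck` returns the two fragments `Druk-` and `ker`, while `joinUnbroken "Schiff\u00ADfahrt" tripleconsonant` returns `Schiffahrt`.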

This option should also be available for tags:

-hyphenrules rules

Specifies a list of spelling change rules for hyphenated words, specified by the identifiers of the rules. Only the rules given in this set will be used for spelling changes. The rules are applied in the order given in this list. By default the set of rules is empty. An error will be thrown if one of the given rules is not defined.

This tag option overrules the global option -hyphenrules; this means that the global rules are applied only to hyphens without tagged rules.

For the definition of hyphenation rules see widget option -hyphenrules.

Support of hyphenation is a must in modern applications; even chess applications such as Scidb use hyphenation for displaying text. Hyphenation is also a precondition for supporting fully justified lines, which look ugly without it. The additional support of spelling changes is required in a multilingual environment. Note that the hyphenation support fulfills bug item 1096580fff.

Because the soft hyphen has code point U+00AD, it is conceivable that this special character has size 2, but I decided that it has size 1. This avoids many special cases in the current code (because any segment that is not a char segment has either size 0 or size 1), and it seems more natural.
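As an illustration of where a size of 2 could come from, the sketch below compares the character count and the UTF-8 byte count of a string containing one soft hyphen. Reading "size 2" as the UTF-8 byte length is an assumption made here for illustration; the text above does not say so explicitly.

```tcl
# Assumption: "size 2" refers to the UTF-8 encoding of U+00AD (two bytes).
set s "Sen\u00ADtence"

# Number of characters: the soft hyphen counts as one character.
set nChars [string length $s]

# Number of bytes in the UTF-8 encoding: the soft hyphen takes two bytes.
set nBytes [string length [encoding convertto utf-8 $s]]

puts "characters: $nChars, UTF-8 bytes: $nBytes"
```

The string has 9 characters but 10 bytes in UTF-8, so the two notions of size differ by exactly one per soft hyphen.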

For user convenience we will provide a very useful helper function, which especially supports the predefined spelling changes:

NAME

tk_textInsert - insert characters into a text widget regarding soft hyphen pattern.

SYNOPSIS

tk_textInsert pathName ?-hyphentags tagList? index chars ?tagList chars tagList ...?

DESCRIPTION

This procedure especially supports the predefined spelling changes (see section HYPHENATION RULES).

The procedure tk_textInsert inserts all of the chars arguments just before the character at index in the specified text widget; see command insert of the text widget for a detailed description. This procedure pre-parses the chars arguments before it calls command insert of pathName to execute the insertion. The parser recognizes the following escape sequences:

The sequence "\-" will be replaced by the soft hyphen character. Example: "Sen\\-tence" will hyphenate into "Sen-tence".
The sequence "\+" will be replaced by the soft hyphen character, and, provided that no tagged list of rules is given (in tagList), a special tag with all known hyphen rules (regarding option -lang) will be attached to this hyphen (this means that all known hyphenation rules will be applied to this hyphen – probably restricted by option -lang).
The sequence "\:RULE:", where RULE is the identifier of a defined hyphenation rule (see section HYPHENATION RULES), will be replaced with a soft hyphen, and a tag with the specified hyphenation rule RULE is applied to this soft hyphen character. An error will be thrown if the specified rule is not defined, or if the trailing colon is missing.
Any other escaped character (not "-", nor "+", nor ":") will be replaced by this character; this means that the former escape character will be removed.

The latter case is required to allow the escape character inside a string – it has to be doubled – otherwise an unambiguous identification is not possible. Example: the character string "\\-" will be interpreted as "\-" (a string), but the character string "\-" will be interpreted as soft hyphen character. (Please note that all the statements about the escape character are related to the list notation, inside the string notation every escape character has to be doubled.)

It is allowed to abbreviate all rules with a two-letter code:

:ck: is already an abbreviation
:dd: is :doubledigraph:
:dv: is :doublevowel:
:ge: is :gemination:
:tc: is :tripleconsonant:
:tr: is :trema:
:rh: is :repeathyphen:
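The escape sequences and abbreviations described above could be pre-parsed roughly as follows. This is a sketch in pure Tcl under stated assumptions, not the actual tk_textInsert code: the procedure name parseEscapes, the returned structure (the resulting text plus a list of "offset rule" pairs), and the marker word auto for "\+" are all hypothetical.

```tcl
# Sketch only: a possible pre-parser for the escape sequences.
# parseEscapes, its return format and the marker "auto" are hypothetical.
set SHY \u00AD
set RULES {ck doubledigraph doublevowel gemination repeathyphen trema tripleconsonant}
set ABBREV {dd doubledigraph dv doublevowel ge gemination
            tc tripleconsonant tr trema rh repeathyphen}

proc parseEscapes {chars} {
    global SHY RULES ABBREV
    set text ""
    set marks {}   ;# one entry per soft hyphen: offset in text, then rule
    set n [string length $chars]
    for {set i 0} {$i < $n} {incr i} {
        set c [string index $chars $i]
        if {$c ne "\\" || $i == $n - 1} {
            # ordinary character (or a lone backslash at the very end)
            append text $c
            continue
        }
        set e [string index $chars [incr i]]
        switch -exact -- $e {
            - {
                # plain soft hyphen, no rule attached
                lappend marks [list [string length $text] {}]
                append text $SHY
            }
            + {
                # soft hyphen that should receive all known rules
                lappend marks [list [string length $text] auto]
                append text $SHY
            }
            : {
                set close [string first : $chars [expr {$i + 1}]]
                if {$close < 0} { error "missing trailing colon" }
                set rule [string range $chars [expr {$i + 1}] [expr {$close - 1}]]
                if {[dict exists $ABBREV $rule]} { set rule [dict get $ABBREV $rule] }
                if {$rule ni $RULES} { error "unknown hyphenation rule \"$rule\"" }
                lappend marks [list [string length $text] $rule]
                append text $SHY
                set i $close
            }
            default {
                # any other escaped character stands for itself
                append text $e
            }
        }
    }
    return [list $text $marks]
}
```

With this sketch, `parseEscapes {Druc\:ck:ker}` yields the text "Druc", soft hyphen, "ker" together with one mark at offset 4 carrying the rule ck, and `parseEscapes {a\\-b}` yields the literal text `a\-b` without any soft hyphen.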

If option -hyphentags is specified then the associated tagList will additionally be associated with all soft hyphens in new text; combined with the inherited tags, or the specified tags. This is especially useful for hiding/showing soft hyphens (see tag option -elide).

The function tk_textInsert provides a convenient way to insert hyphenated text when specific spelling changes are involved. Examples:

tk_textInsert .t end "Nach dem Ein\\-le\\-gen des Pa\\-piers soll\\-ten
die Druc\\+ker\\-ein\\-stel\\-lun\\-gen über\\-prüft wer\\-den."

This text in German may be hyphenated and displayed in the following way:
```
Nach dem Einlegen des Papiers sollten die Druk-
kereinstellungen überprüft werden.
```
tk_textInsert .t end "Nach dem Ein\\-le\\-gen des Pa\\-piers soll\\-ten
die Druc\\:ck:ker\\-ein\\-stel\\-lun\\-gen über\\-prüft wer\\-den."

This gives the same result, but here the ck rule is explicitly specified.

tk_textInsert .t end "Nach dem Ein\\+le\\+gen des Pa\\+piers soll\\+ten
die Druc\\+ker\\+ein\\+stel\\+lun\\+gen über\\+prüft wer\\+den."

This will also give the same result, because only the ck rule will be detected for spelling changes.

tk_textInsert .t end "Nach dem Ein\\-le\\-gen des Pa\\-piers soll\\-ten
die Druc\\-ker\\-ein\\-stel\\-lun\\-gen über\\-prüft wer\\-den." ck
.t tag configure ck -hyphenrules {ck}

The same result as before, because the ck rule is applied explicitly (to all soft hyphens).

tk_textInsert .t end {Nach dem Ein\-le\-gen des Pa\-piers soll\-ten
die Druc\-ker\-ein\-stel\-lun\-gen über\-prüft wer\-den.} trema
.t tag configure trema -hyphenrules {trema}

In this case only the trema rule is applied to the text. Because the ck rule is missing, we cannot get the correct result here; this text may be (wrongly) hyphenated in the following way:
```
Nach dem Einlegen des Papiers sollten die Druc-
kereinstellungen überprüft werden.
```

The explicit specification of the hyphenation rule is the preferred method (second example), however it depends on the quality of the hyphenator which method will be used. For instance, the TeX hyphenator does not know about spelling changes, and here the method with the use of "\+" (for any hyphenation point like in the third example) may give fairly good results. Of course, the use of "\+" is (currently) only useful for the languages Catalan, Dutch, German, Hungarian, Norwegian, Polish, and Swedish.

With the usage of this function the hyphenator only has to insert the appropriate escape sequences, "\-" for a normal hyphenation, and "\+", or "\:RULE:", when a spelling change is needed at this point.

We should not forget the related helper function for replacing text:

NAME

tk_textReplace - replace a range of characters inside a text widget regarding soft hyphen pattern.

SYNOPSIS

tk_textReplace pathName ?-hyphentags tagList? index1 index2 chars ?tagList chars tagList ...?

DESCRIPTION

This procedure especially supports the predefined spelling changes (see section HYPHENATION RULES).

The procedure tk_textReplace replaces the range of characters between index1 and index2, see command replace for a detailed description. This procedure will pre-parse the chars arguments before it calls command replace of pathName for the execution of the replacement. The parser is recognizing the escape sequences "\-" , "\+", and "\:RULE:", see procedure tk_textInsert which works in a similar way.

In general the visible renditions of soft and hard hyphens are indistinguishable, so it is very convenient if the user has the option to set a different foreground color for soft hyphens.

Command-Line Name: -hyphencolor

Database Name: hyphenColor

Database Class: HyphenColor

Specifies the foreground color to use when displaying the soft hyphen character (U+00AD). An empty argument will force the use of the foreground color (set with -foreground; this is the default).

This requires an additional tag attribute:

-hyphencolor color: Specifies the foreground color to use when displaying the soft hyphen character (U+00AD) inside the tagged region. An empty argument will force the use of the foreground color (set with -foreground; this is the default).

The following script demonstrates the hyphenation with all defined spelling changes; it is also an example of the new justification mode full. (Please do not expect meaningful text, it's only for demonstration.)


Appendix

For a proper support of soft hyphens some other commands have to be extended a bit:

pathName count ?options? index1 index2

Counts the number of relevant things between the two indices. If index1 is after index2, the result will be a negative number (and this holds for each of the possible options). The actual items which are counted depend on the options given. The result is a list of integers, one for the result of each counting option given. Valid counting options are -chars, -displaychars, -displayhyphens, -displayindices, -displaylines, -displaytext, -hyphens, -indices, -lines, -text, -xpixels and -ypixels. The default value, if no option is specified, is -indices. There is an additional possible option -update which is a modifier. If given (and if the text widget is managed by a geometry manager), then all subsequent options ensure that any possible out of date information is recalculated. This currently only has any effect for the -ypixels count (which, if -update is not given, will use the text widget's current cached value for each line). The count options are interpreted as follows:

-chars: count all characters, whether elided or not, this also includes soft hyphens. Do not count embedded windows or images.
-displaychars: count all non-elided characters (see -chars).
-displayhyphens: count all non-elided soft hyphens.
-displayindices: count all non-elided characters, soft hyphens, windows and images (see -indices).
-displaylines: count all display lines (i.e. counting one for each time a line wraps) from the line of the first index up to, but not including the display line of the second index. Therefore if they are both on the same display line, zero will be returned. By definition displaylines are visible and therefore this only counts portions of actual visible lines.
-displaytext: count all non-elided characters, but discard soft hyphens (see -text).
-hyphens: count all soft hyphens, whether elided or not.
-indices: count all characters, soft hyphens, and embedded windows or images (i.e. everything which counts in text-widget index space), whether they are elided or not.
-lines: count all logical lines (irrespective of wrapping) from the line of the first index up to, but not including the line of the second index. Therefore if they are both on the same line, zero will be returned. Logical lines are counted whether they are currently visible (non-elided) or not.
-text: count all characters, whether elided or not, but discard soft hyphens.
-xpixels: count the number of horizontal pixels from the first pixel of the first index to (but not including) the first pixel of the second index. To count the total desired width of the text widget (assuming wrapping is not enabled, and the used font is ideally monospaced), first find the longest line and then use [.text count -xpixels "${line}.0" "${line}.0 lineend"].
-ypixels: count the number of vertical pixels from the first pixel of the first index to (but not including) the first pixel of the second index. If both indices are on the same display line, zero will be returned. To count the total number of vertical pixels in the text widget, use [.text count -ypixels begin end], and to ensure this is up to date, use [.text count -update -ypixels begin end].
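The relationship between -chars, -hyphens and -text can be illustrated on a plain string. This is only a model in pure Tcl, assuming content without elided ranges, embedded windows or images; it does not call the widget.

```tcl
# Sketch only: models three of the count options on a plain string.
set SHY \u00AD
set content "Druc${SHY}ker${SHY}ein${SHY}stel${SHY}lun${SHY}gen"

# Like -chars: every character, soft hyphens included.
set nChars [string length $content]

# Like -hyphens: only the soft hyphens.
set nHyphens [regexp -all -- $SHY $content]

# Like -text: every character except the soft hyphens.
set nText [string length [string map [list $SHY ""] $content]]

puts "chars=$nChars hyphens=$nHyphens text=$nText"
```

Here the word has 20 letters and 5 soft hyphens, so the model gives 25 for -chars, 5 for -hyphens and 20 for -text; by the definitions above, the -chars count is always the sum of the other two.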

In the description of -ypixels I've removed the following sentence:

This -update option is obsoleted by pathName sync, pathName pendingsync and <<WidgetViewSync>>.

See Severe Problems With "sync" Command for the reason.

pathName get ?switch? ?--? index1 ?index2 …?

Return a range of characters from the text. The return value will be all the characters in the text starting with the one whose index is index1 and ending just before the one whose index is index2 (the character at index2 will not be returned). If index2 is omitted then the single character at index1 is returned. If there are no characters in the specified range (e.g. index1 is past the end of the file or index2 is less than or equal to index1) then an empty string is returned. If the specified range contains embedded windows, no information about them is included in the returned string. If multiple index pairs are given, multiple ranges of text will be returned in a list. Invalid ranges will not be represented with empty strings in the list. The ranges are returned in the order passed to pathName get. The switch will be interpreted as follows:

-chars: all characters within the ranges will be returned, whether elided or not, this also includes soft hyphens. This is the default, if no switch is given.
-text: all characters, which are not soft hyphens, within the ranges will be returned, whether elided or not.
-displaychars: only those characters which are not elided will be returned, this also includes soft hyphens. This may have the effect that some of the returned ranges are empty strings.
-displaytext: only those characters, which are neither soft hyphens, nor elided, will be returned. This may have the effect that some of the returned ranges are empty strings.

This is fully compatible with older library versions.

pathName search ?switches? pattern index ?stopIndex?

…
-discardhyphens: Do not match soft hyphens; this means that soft hyphens inside the widget content are discarded while performing the search operation. If the text contains soft hyphens this option is often a must.
…

Probably the other way around (an option which includes the soft hyphens in the search) is more natural, but this would be incompatible with older library versions.
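Why discarding soft hyphens matters for searching can be shown on a plain string. The following is a sketch in pure Tcl, not the widget's search implementation: the procedure name findDiscardingHyphens is hypothetical, and only exact substring matching is modelled.

```tcl
# Sketch only: exact-string search that ignores soft hyphens in the content.
# findDiscardingHyphens is a hypothetical helper, not a widget command.
set SHY \u00AD

# Returns the offset in content where pattern starts, or -1 if not found.
proc findDiscardingHyphens {content pattern} {
    global SHY
    set stripped ""
    set origIndex {}   ;# offset in content for each character of stripped
    set i 0
    foreach c [split $content ""] {
        if {$c ne $SHY} {
            append stripped $c
            lappend origIndex $i
        }
        incr i
    }
    set k [string first $pattern $stripped]
    if {$k < 0} { return -1 }
    return [lindex $origIndex $k]
}

set content "Die Druc${SHY}ker${SHY}ein${SHY}stel${SHY}lun${SHY}gen"
puts [string first Druckerein $content]            ;# plain search: -1
puts [findDiscardingHyphens $content Druckerein]   ;# hyphens discarded: 4
```

A plain search for "Druckerein" fails because the soft hyphens interrupt the word, whereas the search that discards them finds the match and reports its offset in the original content.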

The command dump has also been adapted to the new hyphenation support; see Additional Switch for '-dump' for the changed documentation.