When encountering an unrecognized word, the spell checker engine tries to alter it in different ways and find the altered forms in its dictionaries. Standard alterations are: inserting a character (to correct an omission), suppressing a character (to correct a superfluous keystroke), swapping two adjacent characters, replacing one character by another (especially pairs of characters that are neighbors on a keyboard).
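The four standard alterations can be illustrated with a minimal Python sketch. It is not the XMLmind engine's implementation; the function names `single_edits` and `suggest`, and the restriction to lowercase ASCII letters, are assumptions made for the example.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string that is one standard alteration away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Inserting a character (corrects an omission).
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # Suppressing a character (corrects a superfluous keystroke).
    deletes = {a + b[1:] for a, b in splits if b}
    # Swapping two adjacent characters.
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replacing one character by another.
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (inserts | deletes | swaps | replaces) - {word}

def suggest(word, dictionary):
    """Keep only the altered forms that exist in the dictionary."""
    return sorted(single_edits(word) & dictionary)
```

With a dictionary such as `{"spell", "spelt", "repel"}`, `suggest("sepll", dictionary)` finds "spell" through the swap of two adjacent characters.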
The spell checker engine can also use specific knowledge of the language being checked. Some spell checker engines convert the word into a phonetic form and look up that form. This is powerful for seriously bad misspellings. The drawback is that writing phonetic conversion rules is tedious and delicate.
The method used by XMLmind Spell Checker is both simple and powerful: it consists of specifying groups of character sequences that are easily mistaken for one another. For example, if the sequences "ph" and "f" are declared to be easily mistaken for one another, the engine will quickly find the correct spelling for "elefant", because it will try to replace "f" with "ph".
Most often, such hints reflect phonetic similarity, but they can also deal with more specific cases (for example in French, people often write "ceuil" instead of "cueil" in words like "recueil", "accueil" etc.). Some spell checkers treat such frequent mistakes by using special catalogs, but the method implemented in XMLmind Spell Checker is more general and powerful.
In short, the hints files define two types of information:
Characters allowed in words and their properties (see below)
Proximity between particular sequences of characters (directive %mistake)
This directive specifies that some character sequences are likely to be confused. This is a hint for helping the spell-checking engine find smart suggestions.
Example: the following rule means that f, ff and ph sound the same and can be tried instead of each other. This rule can help to sort out 'giraphes' and 'elefants'...
%mistake f ff ph
Another example: in French, "ell" and "èl" sound similar, as do "au", "eau" and "ô"; this can be expressed by two rules like:
%mistake ell èl
%mistake au eau ô
With such rules, if one mistakenly writes "burau", the proper suggestion "bureau" will more readily appear at the top of the suggestion list.
A more common example: in most languages, some letters are sometimes doubled (for example: "spell") and sometimes not (for example: "repel"). To deal with such mistakes, the following rule could be given:
%mistake ll l
The %mistake directive can of course be applied to single characters:
%mistake a à â
%mistake e é è ê ë
Special case: keyboard proximity between characters:
An erroneous occurrence of a character can come from the keyboard layout: a finger slip can replace a key by its neighbor. It could be expressed by a %mistake rule for each pair of adjacent keys, but this would be tedious to write: the %kbline directive provides an easy way to define the keyboard proximity rules. (See below)
XMLmind Spell Checker requires a declaration for characters used in word lists. This helps to detect malformed words.
By default, the ASCII uppercase and lowercase letters, digits, hyphen, dot and apostrophe are declared as acceptable "word characters".
To declare supplementary characters, use the %chars directive. It takes one argument (i.e. no space inside) which is a string of characters to declare. For example:
%chars àâéèëêîôùû
The %chars directive declares that the characters may appear anywhere in a word.
Two other directives, %noninitial and %nonfinal, allow this to be refined. They define whether a character may appear at the first or the last position in a word. For example:
%noninitial '
%nonfinal '
means that the apostrophe may appear only inside a word, not at the beginning (%noninitial) or at the end (%nonfinal).
By default the hyphen, dot and apostrophe are non-initial and non-final.
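The effect of these three directives on word validation can be sketched as follows. This is an illustrative sketch, not the engine's code: the function `is_well_formed` and its parameters are hypothetical names, and the defaults simply mirror the defaults stated above.

```python
import string

# Default word characters as described above: ASCII letters, digits,
# hyphen, dot and apostrophe.
DEFAULT_CHARS = set(string.ascii_letters + string.digits + "-.'")

def is_well_formed(word, extra_chars="", noninitial="-.'", nonfinal="-.'"):
    """Check a word against declared characters and position constraints.

    extra_chars plays the role of %chars; noninitial and nonfinal play
    the role of the %noninitial and %nonfinal directives.
    """
    allowed = DEFAULT_CHARS | set(extra_chars)
    if not word or any(ch not in allowed for ch in word):
        return False
    return word[0] not in noninitial and word[-1] not in nonfinal
```

For instance, "l'école" is accepted only once "é" has been declared through `extra_chars`, and "'tis" is rejected because the apostrophe is non-initial.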
These directives are rarely used beyond the example above (namely in French and Italian).
The syntax is very simple:
%mistake[modifier] seq1 seq2 ... seqN
This means that each time one of these sequences is found in an unknown word, the spell-checking engine will attempt to replace it with one of the other sequences of the same rule and look up the newly formed hypothesis in the dictionary.
To put it more clearly, let's consider the rule %mistake f ff ph and assume that the word 'elefant' is encountered. The engine here will try to replace "f" by "ff" and "ph", generating and looking up in the dictionary "eleffant" and "elephant", and in principle will find the latter as a suggestion.
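The substitution step just described can be sketched in Python. This is a simplified illustration that replaces a single occurrence at a time; the function name `substitutions` is hypothetical and the real engine may proceed differently.

```python
def substitutions(word, group):
    """Yield each form obtained by replacing one occurrence of a sequence
    from `group` with another sequence of the same group."""
    seen = set()
    for src in group:
        start = word.find(src)
        while start != -1:
            for dst in group:
                if dst != src:
                    candidate = word[:start] + dst + word[start + len(src):]
                    if candidate != word and candidate not in seen:
                        seen.add(candidate)
                        yield candidate
            # Look for the next occurrence of the same sequence.
            start = word.find(src, start + 1)

# Usage: generate hypotheses, then keep those found in the dictionary.
dictionary = {"elephant", "giraffe"}
hypotheses = list(substitutions("elefant", ["f", "ff", "ph"]))
print(hypotheses)                                     # ['eleffant', 'elephant']
print([w for w in hypotheses if w in dictionary])     # ['elephant']
```

The number of hypotheses grows with the number of sequences in the group and with the number of matching positions in the word.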
The modifier is an indication of how likely the substitutions are. The possible forms are '-' (less likely), or '+' (more likely). Several modifiers can be combined. For example, in French we could have the following directives:
%mistake+ a â à
%mistake++ i î
%mistake- i y
It means that stumbling over grave or circumflex accents is quite likely, while confusing an 'i' with a 'y' is less likely.
Note: the %mistake-- likelihood is the default for any pair of letters. So it is generally useless to specify more than one '-'.
It is suggested to use this directive with moderation, as it can slow down the engine. In particular, directives with many sequences lead to a higher combinatorial complexity.
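One way to read a directive with its modifiers is sketched below. The function `parse_mistake` is a hypothetical name, and the numeric level (each '+' counting as +1, each '-' as -1) is an assumption made for illustration: the text above only says that modifiers can be combined, not how the engine scores them.

```python
def parse_mistake(line):
    """Split a %mistake line into (level, sequences).

    Assumed convention: level 0 for a plain %mistake, +1 per '+',
    -1 per '-'. Under this convention the default likelihood for any
    pair of letters (%mistake--) corresponds to level -2.
    """
    head, *sequences = line.split()
    if not head.startswith("%mistake"):
        raise ValueError("not a %mistake directive: " + line)
    modifier = head[len("%mistake"):]
    if modifier.strip("+-"):
        raise ValueError("unexpected modifier: " + modifier)
    level = modifier.count("+") - modifier.count("-")
    return level, sequences
```

For example, `parse_mistake("%mistake++ i î")` returns `(2, ["i", "î"])` under this convention.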
Special cases: characters ^ and $
These characters have a special meaning. When used in a sequence, they make the sequence match only when appearing respectively at the beginning or the end of a word. For example:
%mistake ^kn ^n
%mistake $ gh$ w$
The first rule states that at the beginning of a word "kn" can be mistaken for (sounds like) an "n". The second rule means that at the end of a word, "gh" or "w" can be forgotten or erroneously added ("$" alone means "nothing" or "silent").
It makes no sense to mix sequences with and without a "$" (resp. a "^"). However it is possible for a sequence to have both (whole word). This should be used with moderation.
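The anchored rules can be sketched as follows. This is an illustration under simplifying assumptions: the functions `parse_seq` and `anchored_variants` are hypothetical names, and the sketch only handles sequences that carry at least one anchor.

```python
def parse_seq(seq):
    """Return (anchored_at_start, text, anchored_at_end) for a sequence."""
    at_start = seq.startswith("^")
    at_end = seq.endswith("$")
    text = seq[1:] if at_start else seq
    text = text[:-1] if at_end else text
    return at_start, text, at_end

def anchored_variants(word, group):
    """Apply a rule made of anchored sequences to `word`."""
    variants = []
    for src in group:
        s_start, s_text, s_end = parse_seq(src)
        if not (s_start or s_end):
            continue  # unanchored sequences are outside this sketch
        if s_start and not word.startswith(s_text):
            continue
        if s_end and not word.endswith(s_text):
            continue
        if s_start and s_end and word != s_text:
            continue  # both anchors: the sequence must be the whole word
        for dst in group:
            if dst == src:
                continue
            _, d_text, _ = parse_seq(dst)
            if s_start:
                candidate = d_text + word[len(s_text):]
            else:
                candidate = word[:len(word) - len(s_text)] + d_text
            if candidate != word and candidate not in variants:
                variants.append(candidate)
    return variants
```

With the rules above, "nife" yields "knife", and "sno" yields "snogh" and "snow"; a dictionary lookup would then keep only the real words.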
The %kbline directive is in fact a kind of shortcut to replace many %mistake directives: the argument is a string of horizontally adjacent characters of a keyboard. The directive specifies that each character is "close to" its one or two neighbors.
For example here, "q" is close to "w", "w" to "e", "e" to "r" etc.
# English keyboard:
%kbline qwertyuiop
%kbline asdfghjkl
%kbline zxcvbnm
The likelihood defined is roughly equivalent to that of %mistake-. Modifiers can also be applied to %kbline. Thus %kbline+ is roughly equivalent to %mistake.
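The expansion of keyboard rows into neighbor pairs can be sketched in a few lines of Python. The function name `neighbors` is hypothetical; the sketch only shows which pairs of characters a set of rows implies, not how the engine stores or weights them.

```python
def neighbors(rows):
    """Map each character to the set of its horizontal neighbors."""
    table = {}
    for row in rows:
        # zip(row, row[1:]) walks over each pair of adjacent keys.
        for a, b in zip(row, row[1:]):
            table.setdefault(a, set()).add(b)
            table.setdefault(b, set()).add(a)
    return table

english = neighbors(["qwertyuiop", "asdfghjkl", "zxcvbnm"])
print(sorted(english["w"]))  # ['e', 'q']
```

A row of N characters implies N-1 pairs, which is why a single row stands in for many two-sequence %mistake rules.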
Another directive controls compound words: %compoundmin length
This directive means that compound words (without hyphens) are automatically allowed, provided that the length of each component is at least the length specified in the directive. This is meant for German and Nordic languages.
For example, in German, the directive %compoundmin 3 means that words like "aus" and "gehen" can be automatically composed into "ausgehen", but that "in" and "gehen" will not produce "ingehen" (because the length of "in" is less than 3).
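The length constraint can be sketched as a recursive split over the dictionary. This is an illustration only: the function `is_compound` is a hypothetical name, and allowing any number of components (two or more) is an assumption, since the text does not state how many components the engine accepts.

```python
def is_compound(word, dictionary, min_length):
    """True if `word` splits into two or more dictionary words,
    each at least `min_length` characters long."""
    def splits_from(start, parts):
        if start == len(word):
            return parts >= 2
        # Every component must be at least min_length characters long.
        for end in range(start + min_length, len(word) + 1):
            if word[start:end] in dictionary and splits_from(end, parts + 1):
                return True
        return False
    return splits_from(0, 0)

words = {"aus", "gehen", "in"}
print(is_compound("ausgehen", words, 3))  # True
print(is_compound("ingehen", words, 3))   # False
```

Lowering the minimum to 2 would make the same function accept "ingehen", which shows why the threshold matters.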