4. Using the builder

4.1. System Requirements

The builder requires a Java™ Runtime Environment (JRE) version 1.4 or later.

A powerful machine is recommended, because the compression algorithm used by the builder is very memory-intensive. You should have at least 128 MB of memory (or, at a pinch, of swap space) on your machine. The dictbuilder script sets the memory limit to a high value; in exceptional cases, you might have to raise it. As an indication, compiling our big French word list (800,000 words, 9.8 MB) takes 30 seconds and 87 MB of memory on a 1 GHz Pentium III.

4.2. Obtaining and installing the builder

The builder is no longer included in the SDK. It must be downloaded separately from www.xmlmind.com/xmleditor/dictbuilder.shtml.

In all cases, the builder is a command-line utility: a shell script named dictbuilder on Unix or Mac OS, and a batch file named dictbuilder.bat on Windows.

4.3. Command and options

General form of the command line:

dictbuilder ?options? word_list ... word_list ?-sub word_list ... word_list?

It is also possible to use a compiled dictionary as input. This is the way to create a new version of an existing dictionary if you do not possess the source word list.

General options:

-cs character_encoding

Encoding used by the word lists, the frequent word list and the hints file. This must be an encoding supported by the Java™ runtime.

Important

This option must be placed before the files it applies to.

-hints hints_file

Specifies the hints file.

Important

A hints file is almost always needed, because it specifies which characters may be used to form a word.

The hints files used to build XMLmind's en, fr, de, and es dictionaries are found here: en.hints, fr.hints, de.hints, es.hints. Note that the encoding of all these hints files is ISO-8859-1.

-freq word_list

List of frequent words.

-prefixes word_list

List of standard prefixes.

-sub word_list ... word_list

Every word list whose path follows this option is subtracted from the resulting dictionary instead of being merged into it. In other words, every word belonging to such a word list will be absent from the result. This option should be placed after the input word lists.

-o output_file

Specifies the compiled dictionary output file. The convention is to use a .cdi extension, but this is not required.
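
If a text word list is not in the same encoding as the other text files passed to the builder, one possible approach is to re-encode it beforehand. The sketch below is not part of dictbuilder: it assumes the standard iconv utility is installed, and the function name convert_word_list is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of dictbuilder: re-encode a text word list
# so that it matches the encoding given to the -cs option.
# Usage: convert_word_list FROM_ENCODING TO_ENCODING INPUT_FILE OUTPUT_FILE
convert_word_list() {
    # With GNU iconv, a character that cannot be represented in the target
    # encoding makes the command fail with a non-zero exit status.
    iconv -f "$1" -t "$2" "$3" > "$4"
}
```

For instance, `convert_word_list UTF-8 ISO-8859-1 extrawords_utf8.txt extrawords.txt` would produce a file that can be listed after `-cs ISO-8859-1` (the file names here are made up for illustration).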

Other options:

-verbose

Explain what is being done.

-dump out_word_list

After merging all the compiled and textual word lists specified on the command line, and after subtracting words if the -sub option is used, write the resulting word list to the specified text file. As always, the encoding of the generated text file is specified using the -cs option.
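
The "merge, then subtract" behavior described for -sub and -dump can be pictured with standard Unix tools. The following is only an illustration of the set semantics, under the assumption that each text word list holds one word per line; it says nothing about how the builder works internally, it cannot read .cdi files, and the function name merge_and_subtract is hypothetical.

```shell
#!/bin/sh
# Illustration only: mimics "merge, then subtract" on plain text files
# that are assumed to contain one word per line.
# Usage: merge_and_subtract SUBTRACT_LIST WORD_LIST...
merge_and_subtract() {
    sub_list=$1
    shift
    # Merge: sort all input lists together and drop duplicate words (-u).
    # Subtract: drop (-v) every line that is exactly (-x) equal to one of
    # the fixed strings (-F) read from the subtract list (-f).
    # LC_ALL=C makes both tools compare raw bytes, whatever the encoding.
    LC_ALL=C sort -u "$@" | LC_ALL=C grep -v -x -F -f "$sub_list"
}
```

Merging a list containing apple and banana with a list containing banana and cherry, then subtracting a list containing apple, leaves banana and cherry: the duplicate is kept once and the subtracted word is absent from the result.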

Example 1: Create compiled dictionary mylang.cdi from word lists mywords.txt and extrawords.txt. The encoding of all text files specified on the command line is ISO-8859-2. The hints file is mylang.hints. Frequent words are contained in frqw.txt. Standard prefixes are contained in myprefixes.txt.

dictbuilder -cs ISO-8859-2 -hints mylang.hints -freq frqw.txt -prefixes myprefixes.txt \
    mywords.txt extrawords.txt -o mylang.cdi

Example 2: Add words contained in added_words.txt to compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.

dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi added_words.txt -o new_de.cdi

Example 3: Subtract words contained in removed_words.txt from compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.

dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi \
    -sub removed_words.txt -o new_de.cdi

Example 4: Write all the words contained in compiled dictionary de.cdi to text file de.txt.

dictbuilder -verbose -cs ISO-8859-1 -hints de.hints de.cdi -dump de.txt