|1.0||22 November 2012||First version.||ShellShock|
|1.1||23 November 2012||Improvements suggested by tshering.||ShellShock|
The following instructions are for Windows and Unix. Be prepared –
for all but the smallest dictionaries, you will probably need some coding or scripting skills to convert your source
dictionary into a format from which the Kobo dictionary can be built.
- First identify a suitable source dictionary, preferably one that uses some sort of html or xhtml format for the definitions;
this will make your conversion work easier, as the Kobo dictionary definitions are in html, which means most if not all
of the formatting in the source html can be re-used in your Kobo dictionary. You can use plain text as your source, but then
you will have to add any formatting you want yourself. If you are looking for a good, free Windows text editor, then I
recommend Notepad++, especially as it has good support for regular expressions.
- You need to split the definitions in your source dictionary into html files, where the name of the file has the format
xx.html, xx being the first two letters of each word defined in that file. For example, the word
aardvark will go into a file called aa.html, whereas the word atom will go into a file called
at.html. Each of the html files must have the following format:
<?xml version="1.0" encoding="utf-8"?>
<html>
<w><p><a name="word"/>Definition of word. Most HTML tags are allowed.</p></w>
</html>
Although you can use most html tags in the definitions, links to resources outside the html file do not work. So anchor tags (a)
with hrefs to other html files do not work. This is a real pity, because it means you cannot link from a word definition to another
word in the dictionary.
Here is an example aa.html file:
<?xml version="1.0" encoding="utf-8"?>
<html>
<w><p><a name="aardvark"/><b>aardvark</b>. A mammal native to Africa.</p></w>
<w><p><a name="atom"/><b>atom</b>. A fundamental particle.</p></w>
<w><p><a name="atom"/><b>atom</b>. An extremely small amount.</p></w>
</html>
The example also shows that if a word has multiple definitions, then you should create a w tag for each definition.
Single letter words must go into an html file named xa.html, where x is the word. For example, the word I
should be defined in the file ia.html. Use lower case letters for your html file names, but the case for the word in the
definitions is not so important. So, in the ia.html file we might have:
<?xml version="1.0" encoding="utf-8"?>
<html>
<w><p><a name="I"/>First person pronoun.</p></w>
<w><p><a name="i"/>A mathematical symbol.</p></w>
</html>
Words where either the first or second character is not a letter should be defined in a file called 11.html,
for example the words 1a and o’clock should be defined in 11.html.
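The naming rules above can be sketched in a few lines of Python. This is only my reading of the rules, not code from the Kobo itself: the function name kobo_file_name is made up for illustration, and the use of str.isalpha() (which also counts accented and non-Latin letters as letters) is an assumption, as is sending a single non-letter character to 11.html.

```python
def kobo_file_name(word: str) -> str:
    """Return the name of the html file that should hold `word`.

    Hypothetical helper: it follows the rules described in the text, and
    assumes that "letter" means whatever str.isalpha() accepts.
    """
    word = word.strip().lower()  # file names are lower case
    if not word:
        raise ValueError("cannot name a file for an empty word")

    if len(word) == 1:
        # Single-letter words are padded with "a", e.g. "I" -> "ia".
        prefix = word + "a"
    else:
        prefix = word[:2]

    # If the first or second character is not a letter, use 11.html.
    if not (prefix[0].isalpha() and prefix[1].isalpha()):
        return "11.html"
    return prefix + ".html"
```

For example, kobo_file_name("aardvark") gives aa.html, kobo_file_name("I") gives ia.html, and kobo_file_name("1a") gives 11.html.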
I recommend that you use UTF-8 file encoding for the html files – this will give you a very wide range of characters to choose from,
including a lot of symbols (which can often be used to replace images in the source dictionary); the Kobo has good support for
UTF-8 in dictionaries (I have not yet found a character it will not display correctly).
Although it would be possible to create the html files manually using a text editor, this would require a lot of work,
especially if your source dictionary has a lot of entries. So if you have any sort of coding skills, now is the time to
use them! Also bear in mind that source dictionaries come in many different file formats, so there is no single solution for
converting them into the html format required for the Kobo.
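As a starting point for your own conversion script, here is a minimal Python sketch that builds the text of one html file from a list of (word, definition) pairs. The function name render_kobo_html is hypothetical, putting each w tag on its own line is my layout choice, and escaping the word inside the name attribute is a precaution I am assuming is harmless rather than something I have confirmed the Kobo needs.

```python
from html import escape


def render_kobo_html(entries):
    """Build the text of one prefix file.

    `entries` is an iterable of (word, definition_html) pairs. The
    definition is inserted as-is, so it may contain html tags.
    """
    parts = ['<?xml version="1.0" encoding="utf-8"?>', "<html>"]
    for word, definition_html in entries:
        # Assumption: escape quotes and ampersands in the anchor name so
        # the attribute stays well formed.
        anchor = escape(word, quote=True)
        parts.append(f'<w><p><a name="{anchor}"/>{definition_html}</p></w>')
    parts.append("</html>")
    # Use plain "\n" so the result has Unix line endings when written as bytes.
    return "\n".join(parts) + "\n"
```

A word with several definitions simply appears in several pairs, which gives one w tag per definition. Write the result with .encode("utf-8") to get a UTF-8 file.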
- Put all your split html files into a single directory. In the same directory, create an index.txt file, again in UTF-8 encoding
with Unix line endings. This file will contain the source index. For each unique definition in your html files (that is, for each w tag)
create a line in the index.txt file that contains the defined word on its own, for example:
a
aardvark
aargh
aback
Very important – line endings in the index.txt file must be in Unix format. That is, they must just be a line-feed
character (10), and not the normal carriage-return (13) + line-feed (10) used by Windows. This will keep the marisa-build tool happy.
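If you are generating index.txt from a script, the line-ending problem is easy to avoid by writing bytes rather than text. Below is a minimal Python sketch; write_index is a made-up name, and removing duplicates and sorting the words are my assumptions (they keep the file tidy and reproducible, but I have not checked whether marisa-build requires either).

```python
from pathlib import Path


def write_index(words, path="index.txt"):
    """Write one word per line, UTF-8 encoded, with Unix line endings."""
    # Assumption: each distinct word is listed once, in sorted order.
    unique = sorted(set(words))
    # Writing bytes stops Windows from turning "\n" into "\r\n".
    data = "".join(word + "\n" for word in unique).encode("utf-8")
    Path(path).write_bytes(data)
    return unique
```

Note that "I" and "i" count as different words here, so both are kept.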
- Now, compress each html file into gzip format, but keeping the same file name. So, when you compress aa.html the resulting
compressed file should also be called aa.html. On Windows, I use 7-Zip for the compression, with the following command:
7z a -tgzip "compressed\aa.html" "aa.html"
This puts the compressed aa.html file into a compressed sub-directory.
The equivalent Unix command is:
gzip -c aa.html > compressed/aa.html
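With hundreds of html files you will not want to type these commands one by one. Here is a minimal Python sketch that does the same job on either platform using only the standard library. The function name gzip_html_files and the default directory names are my own choices, mirroring the compressed sub-directory used above.

```python
import gzip
import shutil
from pathlib import Path


def gzip_html_files(source_dir=".", target_dir="compressed"):
    """Gzip every *.html file in source_dir into target_dir, same name."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    # glob("*.html") is not recursive, so files already inside a
    # "compressed" sub-directory are not picked up again.
    for html_path in sorted(source.glob("*.html")):
        out_path = target / html_path.name  # keep the name, no .gz suffix
        with open(html_path, "rb") as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```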
- Now get hold of marisa, which is used to convert your index.txt file into a fast and efficient
index. Marisa is only available in source code format from this link, but it does come with make files for Unix, and a Microsoft Visual Studio 2008 solution,
which is what I used to build marisa. You will find the marisa Windows binaries in the same location as these instructions. I am distributing the binaries
under the terms of the BSD license, as linked from the marisa home page.
- Run marisa-build like this to build the index (the command is the same for Unix and Windows):
marisa-build -owords index.txt
This creates a file called words, which contains your indexed words. To test the index, run marisa-lookup:
marisa-lookup words
At the marisa-lookup prompt (a blank line), type in one of your indexed words, and hit Enter. You should get a number > -1 displayed, which is the key for the
word in the index. If you get back -1 then the word is not indexed – check that you have used Unix line endings in your index.txt file!
- Now use zip compression to zip all your gzipped html files, and the words file, into a single file called dicthtml.zip, e.g.,
7z a -tzip dicthtml.zip *.html words
On Unix this is:
zip dicthtml.zip *.html words
This will create a dictionary file that will replace the English dictionary on the Kobo. If you want to replace a different language dictionary, then use
the appropriate suffix, e.g., dicthtml-de.zip for German, dicthtml-nl.zip for Dutch.
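If you would rather build the final archive from a script, the following Python sketch does roughly what the 7z and zip commands above do. It is a sketch under assumptions: build_dicthtml is a hypothetical name, I am assuming ordinary deflate compression is acceptable (it is the usual default for both tools), and I am assuming all files sit at the top level of the archive, as they do with the commands above.

```python
import zipfile
from pathlib import Path


def build_dicthtml(compressed_dir, words_file, output="dicthtml.zip"):
    """Zip the gzipped html files and the marisa index into one archive."""
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as archive:
        for html_path in sorted(Path(compressed_dir).glob("*.html")):
            # arcname drops the directory part, so aa.html is stored
            # at the top level of the archive.
            archive.write(html_path, arcname=html_path.name)
        # The index built by marisa-build must be stored as "words".
        archive.write(words_file, arcname="words")
    return output
```

To replace a different language dictionary, pass the matching file name as output, for example dicthtml-de.zip.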
- Copy the new dictionary file to the Kobo, into the directory .kobo/dict (back up the existing dictionary first!). You may need to open a book and then flip between dictionaries in order to
get your new dictionary’s index to be loaded and recognised by the Kobo (you only need to do this once after replacing the dictionary).
- Assuming everything went well, give yourself a big pat on the back for your hard work. And, if you used a public domain source dictionary,
then don’t forget to post your new Kobo dictionary at MobileRead so other forum members can benefit from your hard work.