Converting CHM files to text files

I find it extremely useful to keep a folder of ebooks in the form of PDFs and CHMs (compiled HTML). When I can’t remember an exact command, code snippet, or configuration parameter, these ebooks are an excellent first stop before heading out to Google and sifting through websites for the correct answer.

The ebooks are stored across different directories according to their primary subject matter: Linux/FreeBSD, Security, Programming, Networking, Database, VOIP, etc.

While it’s all well and good to have electronic reference books ready at a moment’s notice, I wanted to take it one step further and make these ebooks searchable.

Enter Google Desktop, which can be set to index files inside specific folders. However, after installing and configuring it, I noticed that the application doesn’t have deep indexing capability for PDFs and CHMs: it cannot search the text inside these files, so matches never show up in the Desktop search results.

As a workaround, PDFs and CHMs can be exported or converted to regular text files. Once the corresponding text files are dumped into the same directory as the PDFs and CHMs, Google Desktop has no problem indexing all the words inside them. A keyword search then turns up every ebook containing the search string.

Fortunately, the current incarnation of Adobe PDF Reader allows you to export a PDF to a text file, so that takes care of the PDF files.

CHM (compiled HTML) is a different story and isn’t as easily converted to a text file. A CHM is basically a single file composed of many HTML files that have been bundled together. Fortunately, it’s possible to convert a CHM into a single text file. Archmage can be used to decompile the CHM and break it back up into the original mess of HTML files. After the decompiling, lynx is run in a batch job to open each of these HTML files one by one and append the text output to a single text file.

Here’s the process and shell script to do the CHM to text conversion:

archmage ebook.chm
(an HTML directory is automatically created with all the HTML files)

cd  to the html directory

ls | sort -n > filelist
(this generates a file with a sorted list of all the files in the directory. Most of the time the files are numerically ordered, so sort -n quickly puts them in the right order)

Edit filelist to get the right order from beginning to end: it’s a good idea to have the TOC (Table of Contents), preface, and main files at the top, followed by the chapters/sections in their correct order. Remove any filenames that shouldn’t be processed into the final converted text file, such as the alphabetized index and filelist itself (the shell creates filelist before ls runs, so it shows up in its own listing). Finally, join all the filenames into a single line, separated by a space character (hint: the vi editor makes this very easy via the command ‘J’).
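If you’d rather not join the lines by hand in vi, the same step can be done from the shell. This is only a minimal sketch: join_filelist is a name I made up for illustration, and it assumes the edited filelist holds one filename per line with no spaces in the names (the for loop in the script makes the same assumption).

```shell
# join_filelist: print a one-name-per-line list as a single
# space-separated line. Hypothetical helper, not part of the original script.
join_filelist() {
    # paste -s joins all lines of the file; -d ' ' uses a space as the separator
    paste -s -d ' ' "$1"
}
```

Running `join_filelist filelist` prints the single line that is ready to be pasted.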

Copy and paste the single line into the script below and run it.

#!/bin/sh

# archmage can be used to decompile chm into html files
# First generate the list of files and order them to be processed correctly.
# ls | sort -n > filelist
# Edit filelist: remove unnecessary files, rearrange the order, then join all the names into one line.
# Then paste that line over the list of files in the for loop below.

for i in main.html toc.html part01.html part02.html
do
    lynx -dump "$i" >> final1.txt
done

#Do some post-processing to further clean up the file.

# Remove all lines containing "jared", "Previous Page", "Next Page", "Visible links", "Hidden links", and lines that read exactly "References"
grep -v -e '\(jared\)\|\(Next Page\)\|\(Previous Page\)\|\(^References$\)\|\(Visible links\)\|\(Hidden links\)' final1.txt > final2.txt

# Replace all [digits] with a space character
sed 's/\[[0-9]*\]/ /g' final2.txt > final3.txt
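The two clean-up passes can also be chained into one filter, which skips the intermediate file. This is a sketch rather than a replacement for the script above: clean_dump is a hypothetical name, and I’ve swapped the single alternation pattern for one -e option per pattern, which does the same filtering without relying on the \| extension.

```shell
# clean_dump: read a lynx dump on stdin, write the cleaned text to stdout.
# Hypothetical helper; the patterns are the ones used in the script above.
clean_dump() {
    # Drop navigation and link-list lines, then turn [digits] markers into a space
    grep -v -e 'jared' -e 'Next Page' -e 'Previous Page' \
            -e '^References$' -e 'Visible links' -e 'Hidden links' |
        sed 's/\[[0-9]*\]/ /g'
}
```

For example, `clean_dump < final1.txt > final3.txt` produces the same kind of output as the two separate steps.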

final3.txt will be the final sanitized text file, which can be dumped into the same directory as the original CHM file and indexed by Google Desktop. It’s also a good idea to look over the result and adjust the clean-up patterns to suit, since the leftover clutter will be different for each CHM file.
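To avoid pasting the list into the script each time, the loop could instead read the edited filelist directly, one filename per line (so the join step isn’t needed). A minimal sketch under those assumptions: dump_filelist and the DUMPER variable are names I invented for illustration, and DUMPER simply defaults to the same lynx -dump call used above.

```shell
# DUMPER is a hypothetical override hook; by default it is "lynx -dump".
DUMPER="${DUMPER:-lynx -dump}"

# dump_filelist LIST OUT: dump every file named in LIST (one name per line,
# in order) and collect the text in OUT.
dump_filelist() {
    list=$1
    out=$2
    : > "$out"                       # start from an empty output file
    while IFS= read -r f; do
        [ -n "$f" ] || continue      # skip blank lines
        # stdin comes from /dev/null so the dumper cannot swallow the list
        $DUMPER "$f" < /dev/null >> "$out"
    done < "$list"
}
```

Calling `dump_filelist filelist final1.txt` from inside the HTML directory would then stand in for the for loop. Because the output file is emptied first, re-running it doesn’t append a second copy of the book.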
