HTML2text

Introduction

HTML2text is a set of Tcl procedures for converting HTML to ASCII text. It differs from other converters I've seen in that it gives the user control of the rendering of the various tags.

It uses the HTML parser from Sun's Hippo HTML display library (by Stephen Uhler). A copy of this library is included in the HTML2text distribution.

Manifest

HTML2text
    Executable Tcl script which uses the library to convert from HTML to ASCII text
README
    Text version of the README file
README.html
    HTML version of the README file
html_library-0.3
    Directory containing version 0.3 of the Hippo library
html_text.tcl
    Tcl routines to render ASCII text
htmlsample.html
    A sample HTML file

Usage

From the shell command line

NOTE: Before using the program (on Unix), you'll need to change the first line of the HTML2text file to point to the location of a Tcl 7.4 (or newer) shell.
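
If you would rather not hard-code a path, one widely used Tcl idiom is to start the script under sh and let it find tclsh on your PATH. The sketch below shows that alternative header; it is not what the distributed HTML2text file contains, and it assumes a tclsh of a suitable version is on the PATH.

```tcl
#!/bin/sh
# The trailing backslash makes Tcl treat the next line as part of this comment \
exec tclsh "$0" ${1+"$@"}

# Tcl code starts here. sh never reads this far because exec replaced it.
puts "running under Tcl [info tclversion]"
```

With a fixed path instead, the first line is simply "#!" followed by the full path to your Tcl shell.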

To convert HTML to text, run the HTML2text utility. It works as a filter, reading HTML on standard input and writing text to standard output, or you can give an input file and, optionally, an output file as arguments.

All of the following could be used to generate the text version of the README:

        HTML2text < README.html > README
        HTML2text README.html > README
        HTML2text README.html README

From within a Tcl program

In order to do the conversion in a Tcl program, you need to read in the html_library.tcl file from the Hippo library and the html_text.tcl file. You can either do this by "source"ing them directly or by arranging for them to be autoloaded.
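
A minimal sketch of the direct "source" approach follows. The procedure name load_html2text and its two directory arguments are hypothetical, chosen for illustration; loading the parser before the renderer is also an assumption, not something the distribution documents.

```tcl
# Hypothetical helper: source the Hippo parser and the text renderer
# from the given directories. Returns 1 if both files were loaded,
# 0 if either one is missing or unreadable.
proc load_html2text {hippo_dir text_dir} {
    set files [list $hippo_dir/html_library.tcl $text_dir/html_text.tcl]

    # Check both files first so that nothing is half-loaded.
    foreach f $files {
        if {![file readable $f]} {
            return 0
        }
    }

    # Source at global level so the procedures and any global
    # variables the files define are visible everywhere.
    foreach f $files {
        uplevel #0 [list source $f]
    }
    return 1
}

# Example call, assuming the layout of the unpacked distribution:
#   load_html2text ./html_library-0.3 .
```

For autoloading, the usual Tcl mechanism is to build a tclIndex file for the library directory with auto_mkindex and then append that directory to the auto_path variable.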

You can then convert HTML to ASCII by calling the HTconvert_html command. For example:

        puts [HTconvert_html "<em>This text is emphasized</em>"]

would output the text in emphasized form. Using the defaults it would be:

        _This text is emphasized_
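
The same call can be wrapped to convert a whole file. This is only a sketch: the procedure name convert_file is hypothetical, and it assumes html_library.tcl and html_text.tcl have already been loaded so that HTconvert_html is defined.

```tcl
# Hypothetical helper: convert the HTML file at in_path and write the
# resulting text to out_path.
proc convert_file {in_path out_path} {
    # Read the whole HTML document into a string.
    set in [open $in_path r]
    set html [read $in]
    close $in

    # HTconvert_html takes HTML and returns the rendered text.
    set text [HTconvert_html $html]

    # puts appends a newline, as in the puts example above.
    set out [open $out_path w]
    puts $out $text
    close $out
}

# Example call:
#   convert_file README.html README
```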

Customization

See the contents of the HTML2text file for a couple of examples of changing the renderings.

Installation

To install the HTML2text program in a standard location for executables, first install the html_library-0.3/html_library.tcl and html_text.tcl files into a library directory. Then edit the HTML2text program so that it loads the two library files from that directory, and change its first line to point to tclsh if you haven't already done so. The HTML2text file can then be copied into place.