Appendix - A Free-format to HTML Translator.


This appendix describes fft2html, an Icon program that translates free-format text into HTML.

Free-format text is text that is formatted by hand using white space and other graphics characters (e.g., asterisks as bullets); the alternative to free-formatted text is text containing explicit formatting commands, such as those provided by HTML or Tex.

<fft2html.icn>= [D->]
define white_space ' \t'

record block(type, text)


procedure main() 

  local blocks

  blocks := []
  while put(blocks, read_block(&input))

  recognize_headers(blocks)
  recognize_pagenos(blocks)
  recognize_image(blocks)
  recognize_list_items(blocks)
  recognize_definition_list_items(blocks)
  blocks := recognize_block_paragraphs(blocks)
  recognize_paragraphs(blocks)
  blocks := recognize_definition_lists(blocks)
  blocks := recognize_lists(blocks)
  recognize_simple_tables(blocks)
  recognize_shopping_lists(blocks)
  blocks := recognize_ei_definitions(blocks)
  paragraph_block_paragraphs(blocks)

  to_html(blocks, &output)

  end

The first translation step is to group the input lines into blocks. A line is the sequence of input characters between two successive newline characters; implicitly, input always starts and ends with a newline. A blank line is a line containing no characters, or only white-space characters (i.e., space, vertical and horizontal tabs, form feeds, and so on). A block is the sequence of lines appearing between two consecutive blank lines; only blocks containing one or more lines are of interest in this problem, so implicitly ``block'' means ``non-empty block.''

Read and return a block; the lines in the block come from the input file i.

<fft2html.icn>+= [<-D->]
procedure read_block(i)

  local b, line

  while (line := (read(i) | fail)\1) & is_blank(line)
  b := [detab(line)]
  while (line := read(i)) & not(is_blank(line)) do put(b, detab(line))

  return block(&null, b)

  end 


A blank line contains either no characters or only white-space characters.

<fft2html.icn>+= [<-D->]
procedure is_blank(l)

  return ((*l = 0) | (l ? (tab(many(white_space)) & pos(0))))

  end

A text block b is a header if

  1. b has no type assigned and the text associated with b contains exactly one line, and
  2. The text line in b contains at least two words, the first of which is made up of only digits and periods, and
  3. There's at least one period in the first word, and
  4. There's at least one digit on either side of every period in the first word (that is, 1. and .5 are not header levels, but 1.0 is).
A header data structure contains the header tag (e.g., 1.0), the header's level (i.e., the number of periods in the header tag; note that header tags such as 1.0 have level 1), and the header text.
<fft2html.icn>+= [<-D->]
record header_block(level, tag, text)

procedure recognize_headers(blocks)

  local b, ln, lv, ok

  every b := !blocks do
    if /b.type & (*b.text = 1) then {
      ln := split_first_word(b.text[1])
      if (ln[1] ? (tab(many('0123456789.')) & pos(0))) & (*ln[2] > 1) then {
        lv := split_levels(ln[1])
        if *lv > 1 then
          ok := 1
          every if *(!lv) = 0 then ok := 0
          if ok = 1 then {
            b.type := header_block(*lv, ln[1], ln[2])
            if lv[-1] == "0" then b.type.level -:= 1
            }
        }
      }

  end

Split the input line l just after the first word of l; return the pieces in a two-element list, the second element of which may be empty if l contains only one word. White space before and after the first word is deleted in the output; all other white space is maintained.

<fft2html.icn>+= [<-D->]
procedure split_first_word(l)

  local pieces

  (l || " ") ? {
    tab(many(white_space))
    wd := tab(upto(white_space))
    tab(many(white_space))
    return [wd, (tab(0)[1:-1] | "")]
    }

  return pieces

  end

<fft2html.icn>+= [<-D->]
procedure split_levels(lvl)

  local l

  l := []

  (lvl || ".") ? while (put(l, tab(upto('.'))) & move(1))
  
  return l

  end

A text block b is a paragraph if

  1. b has no type assigned, and
  2. The second through last lines in the text associated with b have the same (more or less) indentation, and
  3. The first line in the text associated with b is indented eight (or so) spaces with respect to the rest of the lines.
This test is ambiguous for one-line paragraphs (for example, it accepts a single-line enumerated list item as a paragraph). There are a number of ways to deal with this, but the easiest is to recognize paragraphs after recognizing everything else.

A paragraph has no interesting features other than its text, so the paragraph data structure needs no fields. Unfortunately, Icon records need at least one field.

<fft2html.icn>+= [<-D->]

record paragraph_block(an_unused_field)

procedure recognize_paragraphs(blocks)

  local b, indents, i

  every b := !blocks do
    if /b.type then {
      indents := get_indentation(b.text)
      i := indents[1] - (if *indents > 1 then indents[2] else 0)
      if (7 <= i) & (i <= 9) then
        b.type := paragraph_block()
      }

  end

A block paragraph is a paragraph in which the first line is not indented. A block paragraph results when a regular paragraph is split by a list or a page break, for example.

A text block b is a block paragraph if

  1. b has no type assigned, and
  2. All the lines of text associated with b have no indentation.

If a block paragraph is separated from a paragraph or another block paragraph by a page number or an image, move the block paragraph over the page number or image and merge it with the other paragraph.

<fft2html.icn>+= [<-D->]
record block_paragraph_block(indentation)

procedure recognize_block_paragraphs(blocks)

  local b, indent, i, s 

  s := *blocks
  every i := s to 1 by -1 do
    if /blocks[i].type then
      if indent := uniform_indentation(blocks[i].text) then
        if (type(blocks[i - 1].type) == ("pageno_block" | "image_block")) then
          if ((type(blocks[i - 2].type) == "paragraph_block") &
              (indent = 0)) |
             ((type(blocks[i - 2].type) == "block_paragraph_block") &
              (indent = blocks[i - 2].type.indentation)) then {
            blocks[i - 2].text |||:= blocks[i].text
            blocks := blocks[1:i] ||| blocks[i+1:0]
            }
          else blocks[i].type := block_paragraph_block(indent)
        else blocks[i].type := block_paragraph_block(indent)

  return blocks;
  end

Create and return a list of integers i with the property that i[j] equals the number of space characters at the start of the line text[j]. This code looks at only space characters; other white-space characters --- tab, in particular --- are not taken into account by this code.

<fft2html.icn>+= [<-D->]
procedure get_indentation(text)

  local indents

  indents := []
  every (!text ? put(indents, (many(' ') - 1) | 0)\1)

  return indents

  end

Return a copy of text line l in which all tabs have been replaced by an appropriate number of spaces (each tab stop is assumed to be eight characters to the right of previous tab stop, starting at character position one on the left).

<fft2html>=
procedure detab(l)

  local i

  while i := upto('\t', l) do {
    l := (l[1 : i] | "") || repl(" ", 8 - ((i - 1) % 8)) || (l[i+1 : 0] | "")

  return l

  end

A text block b is a page number if

  1. b has no type assigned, and
  2. there's one line of text associated with b, and
  3. the text contains two words, the first of which begins with `M` and the second of which is a page number.
The page data structure holds the actual page number.
<fft2html.icn>+= [<-D->]

record pageno_block(pno)

procedure recognize_pagenos(blocks)

  local b, ln

  every b := !blocks do
    if /b.type & (*b.text = 1) then {
      ln := split_first_word(b.text[1])
      if (ln[1][1] == "M") & is_page_no(ln[2]) then
        b.type := pageno_block(ln[2])
      }

  end


A text block b is an image if

  1. b has no type assigned, and
  2. there's one line of text associated with b, and
  3. the text contains two words, the first of which is image and the second of which is a file name containing the image data.
The image_block data structure holds the name of the file containing the image.
<fft2html.icn>+= [<-D->]

record image_block(fname)

procedure recognize_image(blocks)

  local b, ln

  every b := !blocks do
    if /b.type & (*b.text = 1) then {
      ln := split_first_word(b.text[1])
      if (ln[1] == "image")then
        b.type := image_block(ln[2])
      }

  end


A string s is a page number if it's an integer possibly proceeded by a letter and a hyphen (e.g. B-19, an appendix page number).

<fft2html.icn>+= [<-D->]
procedure is_page_no(s)

 (*s > 0) | fail
 (s ? (tab(many(&digits)) & pos(0))) & return
 (s ? (tab(many('ABCDEFGHIJKLMNOPQRSTUVWXYZ')) & ="-" &
       tab(many(&digits)) & pos(0))) & return

  end

A text block b is a list item if

  1. b has no type assigned, and
  2. The second through last lines in the text associated with b have the same (more or less) indentation, and
  3. the first line in the text associated with b is out-dented with respect to the rest of the lines, and
  4. The first word of the first line is a list-entry tag, which is a single letter or digit followed by a period.
This test is ambiguous for one-line paragraphs (for example, it accepts a single-line enumerated list entry as a paragraph). There are a number of ways to deal with this, but the easiest is to recognize paragraphs after recognizing everything else.

A list entry has no interesting features other than its text, so the paragraph data structure needs no fields. Unfortunately, Icon records need at least one field.

<fft2html.icn>+= [<-D->]

record list_item_block(tag, indentation, text)

procedure recognize_list_items(blocks)

  local b, indents, ln

  every b := !blocks do
    if /b.type then {
      indents := get_indentation(b.text)
      if (*indents = 1) | (indents[1] < indents[2]) then {
        ln := split_first_word(b.text[1])
        if (ln[1] ? (tab(many(&digits ++ &letters)) & ="." & pos(0))) then {
          b.type := list_item_block(ln[1][1:-1], indents[1], copy(b.text))
          b.type.text[1] := ln[2]
          }
        }
      }

  end

A list is recognized by the sequence of its list items; generally, a consecutive sequence of list items belong to the same list.

<fft2html.icn>+= [<-D->]
record list_start_block(unused_field)
record list_end_block(unused_field)

procedure recognize_lists(blocks)

  return _recognize_lists(blocks, "list_item_block",
                          list_start_block(), list_end_block())

  end

*

<fft2html.icn>+= [<-D->]
procedure _recognize_lists(blocks, item_type, list_start, list_end)

  local i

  <migrate page numbers>

  return listify(blocks, item_type, list_start, list_end)

  end

Unfortunately, there are a number of things that can confuse the simple list-recognition algorithm described in the previous paragraphs. If a list is split across pages, the page number will interrupt the list-entry sequence. Because the idea of a page is flexible in hypertext, one way to deal with a list split across pages is to merge it back into a single list by migrating the page number to the end of the split list. Note this algorithm works only if the list is split across one page; otherwise, the migrating page number bumps into the following page number and the migration stops.

<migrate page numbers>= (U->)
every i := 1 to *blocks - 2 do
  if (type(blocks[i].type) == item_type) &
     (type(blocks[i + 1].type) == "pageno_block") &
     (type(blocks[i + 2].type) == item_type) then
    blocks[i + 1] :=: blocks[i + 2]    


Given the block list blocks, surround the sequences of list items of type itemt with the start- and end-list markers lists and liste respectively. Changes in list-item indentation are used to keep track of nested lists.

The stack pagenos keeps track of page numbers encountered amid the list-item sequences, as would occur when a list is spread over several pages. It is difficult to know what to do locally with the stacked-up page numbers; the best answer is to globally renumber. However, to get going, only the oldest page number in the stack is restored to blocks.

<fft2html.icn>+= [<-D->]
procedure listify(blocks, itemt, lists, liste)

  local newblks, indents, b, pagenos

  newblks := []
  indents := [0]
  pagenos := []

  every b := !blocks do {
    if type(b.type) == itemt then {
      if b.type.indentation > indents[1] then {
        put(newblks, block(copy(lists)))
        push(indents, b.type.indentation)
        }
      else if b.type.indentation < indents[1] then {
        put(newblks, block(copy(liste)))
        pop(indents)
        if *indents = 1 & *pagenos > 0 then {
          put(newblks, pagenos[-1])
          pagenos := []
          }
        }
      put(newblks, b)   
      }
    else if type(b.type) == "pageno_block" then
        if *indents = 1 then put(newblks, b)
        else push(pagenos, b)
    else {
      while *indents > 1 do {
        put(newblks, block(copy(liste)))
        pop(indents)
        }
      if *pagenos > 0 then {
        put(newblks, pagenos[-1])
        pagenos := []
        }
      put(newblks, b)
      }
    }

  every 2 to *indents do put(newblks, block(copy(liste)))
  if *pagenos > 0 then put(newblks, pagenos[-1])

  return newblks
  end 


A shopping list is a small, informal list, along the lines of

alpha
milk
walk the dog
A shopping list differs from a regular list; a shopping list has no bullets or tags, shopping-list items are one line and short, and shopping lists have no inter-item spacing.
<fft2html.icn>+= [<-D->]
record shopping_list_block(indentation)

procedure recognize_shopping_lists(blocks)

  every b := !blocks do
    if (type(b.type) == "block_paragraph_block") & (b.type.indentation > 0) &
       unfilled_text(b.text) & not(internal_indent(b.text)) then
      b.type := shopping_list_block(b.type.indentation)

  end # recognize_shopping_lists

Translate the structure in blocks into HTML, writing the result to out.

<fft2html.icn>+= [<-D->]
procedure to_html(blocks, out)

  write(out, "<html>")
  write(out, "<body>")

  every b := !blocks do {
    case type(b.type) of {
      "block_paragraph_block":
        if b.type.indentation = 0 then {
          write(out, "<p>")
          every write(out, !b.text)
          write(out, "</p>")
          }
        else {
          write(out, "<blockquote>")
          every write(out, !b.text)
          write(out, "</blockquote>")
          }

      "paragraph_block": {
        write(out, "<p>")
        every write(out, !b.text)
        write(out, "</p>")
        }

      "ei_definition_block": {
        write(out, "<blockquote><dl><dt>", b.text[1], "<dd>")
        every write(out, b.text[2 to 5], "<br>")
        write(out, "</dl></blockquote>")
        }

      "header_block":
        write(out, "<h", b.type.level, ">",
                    b.text[1],
                   "</h", b.type.level, ">")

      "pageno_block": {
        }

      "image_block":
        write(out, "<p align = center><img src = \"", b.type.fname, "\"></p>")

      "list_start_block":
        write(out, "<ol>")

      "list_item_block": {
        write(out, "<li>")
        every write(out, !b.type.text)
        }

      "list_end_block":
        write(out, "</ol>")

      "definition_list_start_block":
        write(out, "<blockquote><dl>")

      "definition_list_item_block": {
        write(out, "<dt>", b.type.tag, "<dd>")
        every write(out, !b.type.text)
        }

      "definition_list_end_block":
        write(out, "</dl></blockquote>")

      "shopping_list_block": {
        write(out, "<blockquote>")
        every write(out, !b.text, "<br>")
        write(out, "</blockquote>")
        }

      "simple_table_block":
        do_html_simple_table(out, b)

      default:
        write(&errout, "unknown block type \"", type(b.type), "\".")
      }
    }

  write(out, "</body>")
  write(out, "</html>")

  end

Generate HTML code for the simple table simtab; the HTML code is written to out.

<fft2html.icn>+= [<-D->]
procedure do_html_simple_table(out, simtab)

  local e, cols, left, right;

  cols := simtab.type.right_column_edges ||| [0]

  write(out, "<blockquote><table>")
  every e := !simtab.text do {
    write(out, "<tr>")
    left := 1    
    every right := !cols do {
      write(out, "<td align = left>", e[left:right], "</td>")
      left := right
      }
    write(out, "</tr>")
    }
  write(out, "</table></blockquote>")
  end

A text block b is a definition-list item if

  1. b has no type assigned, and
  2. The second through last lines in the text associated with b have the same (more or less) indentation, and
  3. the first line in the text associated with b is out-dented with respect to the rest of the lines
The indentation of the second through last lines is preserved by the first line; everything to the right of the indentation --- that is, the outdent --- is taken as the item tag.

The tags distinguish a list item from a definition-list item. A list-item tag is a single letter or digit, while a definition-list-item tag can contain several words (although it is assumed to be small enough to fit between the left edge of the page and the indentation amount). List items should be recognized before definition-list items.

This code also assumes that definition lists are not nested, and so assigns each item a constant indentation of ten.

<fft2html.icn>+= [<-D->]

record definition_list_item_block(tag, indentation, text)

procedure recognize_definition_list_items(blocks)

  local b, indents, ln

  every b := !blocks do
    if /b.type & (*b.text > 1) then {
      indents := get_indentation(b.text)
      if (indents[1] < indents[2]) & (*set(indents[2:0]) = 1) then {
        tag := b.text[1][1:indents[2]]
        b.type := definition_list_item_block(tag, 10, copy(b.text))
        b.type.text[1] := b.text[1][indents[2]:0]
        }
      }

  end

A list is recognized by the sequence of its list items; generally, a consecutive sequence of list items belong to the same list.

<fft2html.icn>+= [<-D->]
record definition_list_start_block(unused_field)
record definition_list_end_block(unused_field)

procedure recognize_definition_lists(blocks)

  return _recognize_lists(blocks, "definition_list_item_block",
                          definition_list_start_block(),
                          definition_list_end_block())

  end

A simple table is a block paragraph in which every line in the block paragraph has a space in the same set of two or more columns. To avoid trivial definitions, the block paragraph must have more than one line.

<fft2html.icn>+= [<-D->]
record simple_table_block(right_column_edges)

procedure recognize_simple_tables(blocks)

  local b, sp, common

  every b := !blocks do  {
    # write(&errout, type(b.type))
    if (type(b.type) == "block_paragraph_block") & (*b.text > 2) then {
      sp := space_profile(b.text)
      common := sp[1]
      every common := ol_intersect(common, !sp)
      # write(&errout, "*common = ", *common)
      if *common > 1 then {
        # write(&errout, "indent = ", b.type.indentation,
        #               ", common[1] = ", common[1])
        if b.type.indentation = common[1] - 1 then pop(common)
        b.type := simple_table_block(common)
        }
      }
    }

  end # recognize_simple_tables


A block of text is unfilled if a majority of the lines in the text are significantly shorter than the maximum line size (here assumed to be 80 characters).

``Significantly shorter'' has a technical meaning: if the first word in the line following line l could fit on line l without exceeding the maximum line size, then line l is significantly shorter than the maximum line size. The code below, however, cheats and assumes a line of less than seventy characters is significantly shorter.

<fft2html.icn>+= [<-D->]
procedure unfilled_text(text)

  local little, l

  little := 0
  every l := !text do if *l < 70 then little +:= 1
  return ((little > 0) & (*text/little < 2))

  end # unfilled_text


A block of text is uniformly indented if each line has more or less the same amount of indentation. More precisely, a strict majority of lines are indented i spaces and the remainder of the lines are indented i plus or minus one. Return the uniform indentation i if text is uniformly indented, or fail of text is not uniformly indented.

<fft2html.icn>+= [<-D->]
procedure uniform_indentation(text)

  local minor, i

  indents := frequencies(get_indentation(text))
  if *indents > 3 then fail

  minor := 0
  every i := 1 to *indents - 1 do minor +:= indents[i][2]
  if indents[-1][2] <= minor then fail

  every i := 1 to *indents - 1 do
    if (abs(indents[-1][1] - indents[i][1]) > 1) then fail

  return indents[-1][1]

  end


A block of text has an internal indentation if the second word of every line starts at the same column (give or take). If text has an internal indent, return it, otherwise fail.

<fft2html.icn>+= [<-D->]
procedure internal_indent(text)

  local ilist, f

  ilist := []
  every !text ? (tab(many(white_space))\1 & tab(upto(white_space))\1 &
                 tab(many(white_space))\1 & put(ilist, &pos))

  if *text = *ilist then {
    f := frequencies(ilist)
    if *f = 1 then return f[1][1]
    if (*f = 2) & (abs(f[1][1] - f[2][1]) < 2) then return f[2][1]
    }

  end # internal_indent


Given a list of integers ilist, return a list of pairs giving the frequency of occurrence in ilist. The first coordinate of a pair is an integer from ilist, the second coordinate is the number of times the integer occurred in ilist. The pairs are sorted in increasing order on frequency.

<fft2html.icn>+= [<-D->]
procedure frequencies(ilist)

  local t, i

  t := table(0)
  every i := !ilist do t[i] +:= 1

  return sort(t, 2)

  end

A block paragraph standing by

<fft2html.icn>+= [<-D->]
procedure paragraph_block_paragraphs(blocks)

  local b

  every b := !blocks do
    if type(b.type) == "block_paragraph_block" & (*b.text = 1) then {
      if b.type.indentation = 0 then 
        stop("unindented stand-alone block paragraph:\n  \"" ||
             b.text[1] || "\"")
      b.type := paragraph_block(b.type.indentation)
      }

  end

An external-interface definition is text of the form

<fft2html.icn>+= [<-D->]
record ei_definition_block(unused_field)

procedure recognize_ei_definitions(blocks)

  local new_blks, i, j, b, saved

  new_blks := []
  i := 1
  while i <= *blocks do {
    b := blocks[i]
    if (type(b.type) == "block_paragraph_block") & (*\(b.text) = 1) &
        (b.text[1] ? (tab(many(' '))\1 & ="External Interface: ")) then {
      put(new_blks, block(ei_definition_block(), [b.text[1]]));
      j := 1
      saved := []
      while j <= 4 do {
        b := blocks[i +:= 1]
        if type(b.type) == ("pageno_block" | "image_block") then put(saved, b)
        else {
          put(new_blks[-1].text, b.text[1])
          j +:= 1
          }
        }
      new_blks |||:= saved
      }
    else
      put(new_blks, b)
    i +:= 1
    }

  return new_blks
  end

A space profile for a block of text containing n lines is an n element list in which list element i corresponds to text line i. Each list element is itself a list of integers; each integer represents the position of the right-most end of a non-empty, maximal sequence of spaces within the corresponding text line. For example, j in list element i means text line i has a sequence of spaces that ends a column j; that is, column j contains a space and column j+1 contains a non-space character (note this assumes first that all white-space characters have either been eliminated or translated into spaces, and second that text lines don't end with spaces).

The integers within a list item are given in increasing order; that is, they follow the left-to-right order of the text line; the list items within the list follow the top-to-bottom order of the lines in the text.

<fft2html.icn>+= [<-D->]
procedure space_profile(text)

  local l, sp, i

  sp := []
  every l := !text do {
    put(sp, [])
    l ? repeat {
      tab(upto(' ') | &pos)
      i := many(' ') | break
      put(sp[-1], i)
      tab(i)
      }
    }

  return sp

  end # space_profile

Return the intersection of two increasingly ordered integer lists l1 and l2; the intersection is itself returned as an increasingly ordered integer list.

<fft2html.icn>+= [<-D]
procedure ol_intersect(l1, l2)

  local common, e1, e2, next_l1, next_l2

  common := []
  next_l1 := create !l1
  next_l2 := create !l2

  e1 := @next_l1 | (return common)
  e2 := @next_l2 | (return common)

  repeat
         if e1 < e2 then e1 := @next_l1 | (return common)
    else if e1 > e2 then e2 := @next_l2 | (return common)
    else {
      put(common, e1)
      e1 := @next_l1 | (return common)
      e2 := @next_l2 | (return common)
      }

  end