Sustainable Suburbia: Minimal Text Interpreter

The operation and main routines of a minimal text interpreter - Part 3

This post is merely a description of the first implementation of the text interpreter looking at the principal routines. It's so I can remember what I did in 6 months time.

Function

The minimal text interpreter is the first stage of enabling plain text to be converted into computer machine language. Charles H. Moore the inventor of Forth, found an efficient means of doing this in the 1960s whilst working with mainframe computers, so I have chosen to adhere reasonably close to his methods.

A word is a group of consecutive characters ending in whitespace. The first task for the interpreter is to step through the line of characters and identify the individual words. At the same time it can keep track of a word-length counter so that it can reformat the text into a more compact format.

Take the sentence "A word is a group of consecutive characters ending in whitespace"

It is a string of 64 characters, containing 11 words, and a minimum of (11-1) spaces. We first pick out the individual words and store them into a temporary processing buffer, we can also count the number of characters in the word and put it alongside

A 1
word 4
is 2
a 1
group 5
of 2
consecutive 12
characters 10
separated 9
by 2
whitespace 10

So we have 11 entries in our table, with their lengths. If we assume that we are going to restrict the word length to 16 characters - we could express the length as a single hex character

We now crop the words down to the first 3 characters. For those of less than 3 characters, we fill in with spaces:

A 1
wor4
is 2
a 1
gro5
of 2
conC (12)
chaA (10)
sep9
by 2
whiA (10)

So we have reduced our original 64 input characters to 11 x 4 - which is about a 30% reduction. Tests carried out with Forth showed that the first 3 characters and the length was an optimum way of differentiating between and storing words in the dictionary. For assembler applications where mnemonics tend to be only 3 or 4 characters it is pretty much an optimal first step of decoding the source code, prior to allocating the machine code op-code tokens.

Once the text scanner has reduced each input word to three characters and a length byte, it then become practical to use this shortened representation to perform a word match. All of the system keywords and user words can be stored compactly in this format, along with their jump addresses. If 8 bytes are allocated to the word entry in the look-up table, this allows a 16 bit jump address, a pointer to the table where the original unencoded word is stored and a 1 byte attribute - that can be used to tell the compiler something about the word to help at compilation time.

Consider a system that has a maximum of 256 user words and 256 keywords - each internally coded at 8 bytes per word. Then this would need 2K in Flash for the keywords - not a problem for most small micros, but the 2K user table would rapidly start to eat up the limited RAM resources. Fortunately most small user applications are unlikely to have anything like 256 User words, and so halving this, a 1K user dictionary space would be quite acceptable.

Immediate and Compilation Modes

The above text scanning function can be used in two distinct cases: immediate and compilation modes.

In immediate mode, a word typed into the input buffer will be executed immediately - provided of course that it already exists in the dictionary. If not, it will lead to an error message. Once found in the dictionary the jump address associated with that word is put onto the pc and the processor executes the code it finds at the jump address.

However in compilation mode, the user wants to create a new word definition, and uses a Forth method called the colon definition. As it name suggests a colon definition begins a new line with a colon : This tells the text interpreter, that the proceeding word is going to be new and the interpreter should enter compilation mode. A colon definition begins with a colon: then the name of the new_word, then the definition i.e. the code words associated with that function and finally ends with a semi-colon ;

The Forth word colon definition performs the same operation as a function in C - compiling the function and putting it into memory as a series of threaded calls to the various routines.

: new_word (put the definition here) ;

In C

new_word()
{
// put definition here
}

Once a new_word has been defined as above it can be executed immediately - just by typing its name. This is what gives Forth it's almost unique characteristics of being an interactive and extensible language. The functions are written and compiled and can be tested in isolation of one another - so that a large project may be built interactively in small blocks - testing each block as you go.

Currently only the basics have been implemented - by way of a proof of concept, and running on a 2K RAM Arduino. Later this will be ported to various ARM Cortex parts, the FPGA - softcore ZPUino and ultimately the J1 Forth processor.

There are probably many ways in which this could be implemented - some giving even more codespace and memory efficiency. As a rookie C programmer, I have stuck to really basic coding methods - that I understand. A more experienced programmer would probably find a neater solution using arrays, pointers and the strings library - but for the moment I have kept it simple.

The interpreter resides in a continuous while(1) loop and consists of the following routines:

txt_read

Reads the text from the UART into a 128 character buffer using u_getchar.
Checks that the character is printable - i.e. resides between space (32) and tilde ~ (127) in the ascii table and stores it in the buffer.
Keeps accepting text until it hits the buffer limit of 128 characters or breaks out of this if it sees a return or newline \r or \n character.

colon_check

This checks if the text starts with a colon, and so is going to be a new colon definition.
sets flag colon=1
calls the build_buffer function

word_scan

If the leading character is not a colon, this function determines that the word is either within the body of the definition, or it is for immediate execution. It calls build_buffer, but only builds the header to allow a word match. It should not add the word to the dictionary, if it gets a match and is already there.

build_buffer

This checks the first 3 characters of the word and puts them into a new header slot in the headers table.
It also calculates the word length by counting the characters as it stores them into the dictionary table, which it continues until it sees a terminating space character.
It increments the dictionary pointer ready for the next word

word_match

This compares the 4 characters of the header of the newly input word with all the headers in the header table.
If all 4 characters match then it drops out with a match_address (for the jump address look-up table) and sets a match flag match= 1.

header_list

This is a utility routine which prints out a list of all the headers in the order they are stored in the headers table.

dictionary_list

This is a utility routine which prints out a list of all the words in the dictionary in the order they were stored in the dictionary table.

txt_eval

This is the main character interpretation function which implements the SIMPL language core

is_a_word

Not yet implemented. Returns true if it finds a word and invokes build_buffer and word_match

is_a_num

Not yet implemented. Converts the ascii text to a signed integer and stores it in a parameter table.
Might possibly use ascii 0x80 (DEL) to signify to the header builder that the following bytes are a number. Will need a conversion routine to go between printable and internal storage formats.

UART Routines

These provide getchar and putchar support directly to the ATmega328 UART. Saves a huge amount of codespace compared to Serial.print etc

uart_init

Initialises the ATmega328 UART to the correct baudrate and format.

u_putchar

Waits until the Tx register is empty and then transmits the next character

u_getchar

Waits until a character is present in the UART receive register and returns with it

Printing Routines

Having banished Serial.print - I had to implement some really basic functions

printnum()

Sends a 16 bit integer to the UART for serial output

printlong()

Sends a 32 bit integer to the UART for serial output

Sustainable Suburbia

Thursday, November 12, 2015

Minimal Text Interpreter - Part 3

No comments: