c80dos.doc

	 >>>  Small-C Version 1-N Compiler Documentation  <<<

         NOTE:  C80DOS.EXE is the MSDOS compiled binary for running on
                a standard PC class machine which emits 8080 assembler
                that can then be assembled and loaded on the PC using
                lasm.cpm and load.cpm with the zrun.com CP/M emulator.
                The final output (or any of the intermediate output in
                8080 assembler or Intel HEX format) can then be ported
                to the CP/M machine by telecommunicating with any of
                a myriad of programs or by writing the disk directly
                using something like the Uniform.exe program or its
                equivalent.  Hopefully, in the near future, a Z80
                opcode version of the compiler as well as PC executable
                versions of lasm and load will be finished. (RDK)
                

	      Available in the  directory is a compiler  for  a
	 subset of the language C.  It consists of the two files C80.C
	 (compiler)  and  C80LIB.I80 (runtime library).  It is in source
	 form  and  is   free   to   anyone   wishing   to   use   it.
	 Characteristics of the compiler are as follows:

	      (1)  It  supports  a subset of the language C (see the
	 book "The C Programming Language", by  Brian  Kernighan  and
	 Dennis Ritchie).

	      (2) It is written in C itself.

	      (3) It is syntactically identical to the C on UNIX
	 (unlike some other small C compilers and interpreters).

	      (4) It produces as output a text file suitable for input
	 to an 8080 assembler.

	      (5) It is a stand-alone single-pass compiler (which means
	 it does its own syntax checking and parsing and produces no
	 intermediate files).

	      (6) It can compile itself.  This means any processor
	 supporting C can be used to develop this small C compiler for
	 any other processor.

	      The intention behind the writing of this compiler was to
	 bring  the  C  language to small computers.  It was developed
	 primarily on an 8080 system with  40  K  bytes  and a single
	 mini-floppy.   Consequently,  an  effort was made to keep the
	 compiler small in order to fit  within  limited  memory,  and
	 intermediate  files  were avoided in order to conserve floppy
	 space.


	 COMPILER SPECIFICATIONS

	      As of this writing, the compiler supports the following:

	 (1) Data type declarations can be:

	      - "char" (8 bits)
	      - "int"  (16 bits)
	      - (by placing an "*" before the variable name, a pointer
		can be formed to the respective	type of
		data element).

	 (2) Arrays:

	      - single dimension (vector) arrays can be
	        of type "char" or "int".

	 (3) Expressions:

	      - unary operators:
		 "-" (minus)
		 "*" (indirection)
		 "&" (address of)
		 "++" (increment, either prefix or postfix)
		 "--" (decrement, either prefix or postfix)
	      - binary operators:
		 "+" (addition)
		 "-" (subtraction)
		 "*" (multiplication)
		 "/" (division)
		 "%" (mod, i.e. remainder from division)
		 "|" (inclusive 'or')
		 "^" (exclusive 'or')
		 "&" (bitwise 'and')
		 "==" (test for equal)
		 "!=" (test for not equal)
		 "<"  (test for less than)
		 "<=" (test for less than or equal to)
		 ">"  (test for greater than)
		 ">=" (test for greater than or equal to)
		 "<<" (arithmetic left shift)
		 ">>" (arithmetic right shift)
	      - primaries:
		 -array[expression]
		 -function(arg1, arg2,...,argn)
		 -constant
			-decimal number
			-quoted	string ("sample	string")
			-primed	string ('a' or 'Z' or 'ab')
		 -local variable (or pointer)
		 -global (static) variable (or pointer)

	 (4) Program control:

		-if(expression)statement;
		-if(expression)	statement;
			else statement;
		-while (expression) statement;
		-break;
		-continue;
		-return;
		-return	expression;
		-; (null statement)
		-{statement; statement;	... statement;}
			 (compound statement)

	 (5) Pointers:

		-local and static pointers can contain the
		 address of "char" or "int" data elements.

	 (6) Compiler commands:

		- #define name string (pre-processor will replace
			name by	string throughout text.)
		- #include filename (allows program to include other
			files within this compilation.)
		- #asm (not supported by standard C)
			Allows all code	between	"#asm" and "#endasm"
			to be passed unchanged to the target
			assembler.  This command is actually a statement
			and may	appear in the context:
			"if (expression) #asm...#endasm	else..."

	 (7) Miscellaneous:

		-Expression evaluation maintains the same hierarchy
		 as standard C.

		-Function calls	are defined as
		 any primary followed by an open paren,	so legal forms
		 include:

			variable();
			array[expression]();
			constant();
			function()();

		-Pointer arithmetic takes into account the data
		 type of the destination (e.g. pointer++ will increment
		 by two if pointer was declared "int *pointer").

		-Pointer compares generate unsigned
		 compares (since addresses are not signed numbers).

		-Often used pieces of code
		 (i.e. storing the primary register indirect through the
		 top of	the stack) generate calls to library routines to
		 shorten the amount of code generated.

		-Generated code	is "pure" (i.e.	the code may be	placed
		 in Read Only Memory).	Code, literals,	and variables
		 are kept in separate sections of memory.

		-The generated code is re-entrant.  Every time a function
		 is called, its	local variables	refer to a new stack
		 frame.	 By way	of example, the	compiler uses
		 recursive-descent for most of its parsing, which relies
		 heavily on re-entrant (recursive) functions.
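
	      The two pointer rules above can be illustrated  with  a
	 minimal sketch.  It is written for a present-day host C
	 compiler, not for the small compiler, and the function names
	 are hypothetical.  On the host the step of an "int" pointer
	 is sizeof(int); under the small compiler's 16-bit "int" it
	 would be two.  The second half shows why a signed compare
	 gives the wrong answer for 16-bit addresses at or above 8000
	 hex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes by which "p++" advances an "int *": the size of one element. */
size_t int_pointer_step(void)
{
    int cells[2];
    int *p = cells;
    char *before = (char *)p;

    p++;                            /* scaled by the element size */
    assert(p == &cells[1]);         /* p now addresses the second element */
    return (size_t)((char *)p - before);
}

/* Read a 16-bit pattern as a two's-complement signed value. */
static int to_signed16(uint16_t bits)
{
    return bits >= 0x8000u ? (int)bits - 65536 : (int)bits;
}

/* Compare two 16-bit addresses as unsigned quantities (correct). */
int addr_less_unsigned(uint16_t a, uint16_t b)
{
    return a < b;
}

/* Compare the same bit patterns as signed values (wrong for addresses). */
int addr_less_signed(uint16_t a, uint16_t b)
{
    return to_signed16(a) < to_signed16(b);
}
```

	      With addresses 7FFF hex and 8000 hex, the unsigned
	 compare reports the first as lower, while the signed compare
	 treats 8000 hex as negative and reports the opposite.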


	 COMPILER RESTRICTIONS

	      Since recent stages of compiler check-out have been done
	 both on an 8080 system and on UNIX, language  syntax  appears
	 to  be identical (within the given subset) between this small
	 C compiler and the standard UNIX compiler.


	 Not supported yet are:

	 (1) Structures.
	 (2) Multi-dimensional arrays.
	 (3) Floating point, long integer, or unsigned data types.
	 (4) Function calls returning anything but "int".
	 (5) The unaries "!", "~", and "sizeof".
	 (6) The control operators "&&", "||", and "?:".
	 (7) The declaration specifiers "auto", "static", "extern",
		and "register".
	 (8) The statements "for", "switch", "case",
		and "default".
	 (9) The use of arguments within a "#define" command.
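
	      Several of the missing constructs can be rewritten using
	 only what is supported.  The following is a sketch under
	 assumptions: the function names are hypothetical, and the
	 functions use modern prototypes so that a host compiler can
	 check them (the small compiler itself may expect the older
	 declaration style).  The bodies use only "while", "if",
	 comparison, and "++".

```c
/* Count the decimal digits in a NUL-terminated string. */
int count_digits(char *s)
{
    int n;

    n = 0;
    while (*s) {                /* stands in for: for (; *s; s++) */
        if (*s >= '0') {        /* nested "if" stands in for "&&" */
            if (*s <= '9')
                n++;
        }
        s++;
    }
    return n;
}

/* Return 1 if the string is empty, 0 otherwise. */
int is_empty(char *s)
{
    return *s == 0;             /* stands in for: !*s */
}
```

	      For example, count_digits("a1b22") counts three digits,
	 and is_empty("") yields 1.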


	 Compiler restrictions include:

	      (1) Since it is a single-pass compiler, undefined  names
	 are not detected and are assumed to be function names not yet
	 defined.   If  this  assumption  is  incorrect, the undefined
	 reference will not  appear  until  the  compiled  program  is
	 assembled.

	      (2)  No  optimizing is done.  The code produced is sound
	 and capable  of  re-entrancy,  but  no  attempt  is  made  to
	 optimize  either  for  code  size or speed.  It was assumed a
	 post-processor optimizer  would  later  be  written  for  the
	 target machine.

	      (3)   Since   the   target   assembler   is  of  unknown
	 characteristics, no attempt is made to produce pseudo-ops  to
	 declare static variables as internal or external.

	      (4)  Constants  are not evaluated by the compiler.  That
	 is, the line of code:

			X = 1+2;

	 would generate code to add "1"  and  "2"  at  runtime.   The
	 results are correct, but unnecessary code is the penalty.
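
	      What evaluating constants at compile time would  involve
	 can be sketched as a small recursive-descent evaluator.  This
	 is not code from the compiler: the names are hypothetical, it
	 handles only decimal numbers, "+", "-", "*" and parentheses,
	 and it uses host C features ("for", "static", "const") that
	 lie outside the supported subset.

```c
#include <ctype.h>

static const char *cursor;      /* next unread character of the text */

static int parse_sum(void);

static void skip_spaces(void)
{
    while (*cursor == ' ')
        cursor++;
}

/* primary: decimal number | "(" sum ")" */
static int parse_primary(void)
{
    int value = 0;

    skip_spaces();
    if (*cursor == '(') {
        cursor++;
        value = parse_sum();    /* recursion: relies on re-entrant calls */
        skip_spaces();
        if (*cursor == ')')
            cursor++;
        return value;
    }
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* product: primary { "*" primary } */
static int parse_product(void)
{
    int value = parse_primary();

    for (;;) {
        skip_spaces();
        if (*cursor != '*')
            return value;
        cursor++;
        value *= parse_primary();
    }
}

/* sum: product { ("+" | "-") product } */
static int parse_sum(void)
{
    int value = parse_product();

    for (;;) {
        skip_spaces();
        if (*cursor == '+') {
            cursor++;
            value += parse_product();
        } else if (*cursor == '-') {
            cursor++;
            value -= parse_product();
        } else {
            return value;
        }
    }
}

/* Evaluate a constant expression held in "text". */
int fold_constant(const char *text)
{
    cursor = text;
    return parse_sum();
}
```

	      Given the text "1+2", fold_constant returns 3, so a
	 compiler using such a step could emit a single load of the
	 constant instead of an addition at runtime.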


	 ASSEMBLY LANGUAGE INTERFACE

	      Interfacing   to   assembly   language   is   relatively
	 straight-forward.  The "#asm ...  #endasm"  construct  allows
	 the  user  to  place assembly language code directly into the
	 control context.  Since it is considered by the  compiler  to
	 be a single statement, it may appear in such forms as:

			while(1) #asm ... #endasm

			or

			if (expression)	#asm...#endasm else...

	      Due  to  the  workings of the preprocessor which must be
	 suppressed in this construct, the pseudo-op  "#asm"  must  be
	 the  last  item  before the carriage return on the end of the
	 line (i.e.  the text between #asm  and  the  <CR>  is  thrown
	 away),  and  the "#endasm" pseudo-op must appear on a line by
	 itself (i.e.  everything after #endasm is also thrown  away).
	 Since  the  parser is completely free-format outside of these
	 exceptions, the expected format is as follows:

			if (expression)	#asm
			...
			...
			#endasm
			else statement;

	      Note a semicolon is not required after the #endasm since
	 the end of context is  obvious  to  the  compiler.   Assembly
	 language  code  within  the  "#asm  ...  #endasm" context has
	 access to all global symbols and functions by name.  It is up
	 to the programmer  to  know  the  data  type  of  the  symbol
	 (whether  "char"  or  "int"  implies  a byte access or a word
	 access).  Stack locals and  arguments  may  be  retrieved  by
	 offset   (see   STACK  FRAME).   External  assembly  language
	 routines invoked by  function  calls  from  the  c-code  have
	 access to all registers and do not have to restore them prior
	 to  exit.  They may push items on the stack as well, but must
	 pop them off before exit.  It is the  responsibility  of  the
	 calling  program  to  remove arguments from the stack after a
	 function call.  This must not be done by the function itself.
	 There is no limit to the number of  bytes  the  function  may
	 push  onto  the  stack,  providing  they are removed prior to
	 returning.   Since  parameters  are  passed  by  value,   the
	 parameters on the stack may be modified by the called program.



	 STACK FRAME

	      The stack is used extensively by the compiler.  Function
	 arguments  are  pushed onto the stack as they are encountered
	 between parentheses (note, this is opposite that of  standard
	 C,  which  means routines expressly retrieving arguments from
	 the stack rather than declaring them by  name  must  beware).
	 By the definition of the language, parameter passing is "call
	 by  value".  For example the following code would be produced
	 for the C statement:

		function(X, Y, z());

		LHLD X
		PUSH H
		LHLD Y
		PUSH H
		CALL z
		PUSH H
		CALL function
		POP B
		POP B
		POP B

	      Notice, the compiler cleans up the stack after the  call
	 using a simple algorithm to use the least number of bytes.

	      Local  variables  allocate  as  much  stack  space as is
	 needed, and are then assigned the current value of the  stack
	 pointer (after the allocation) as their address.

		int X;

	 would produce:

		PUSH B

	 which  merely  allocates  room  on the stack for 2 bytes (not
	 initialized to any value).  References to the local  variable
	 X  will  now  be  made  to the stack pointer + 0.  If another
	 declaration is made:

		char array[3];

	 the code would be:

		DCX SP
		PUSH B

	 Array[0] would  be  at  SP+0,  array[1]  would  be  at  SP+1,
	 array[2] would be at SP+2, and X would now be at SP+3.  Thus,
	 assembly  language  code using "#asm...#endasm" cannot access
	 local variables by name, but must know how  many  intervening
	 bytes  have  been  allocated  between  the declaration of the
	 variable and  its  use.   It  is  worth  pointing  out  local
	 declarations   allocate  only  as  much  stack  space  as  is
	 required, including an odd number of bytes, whereas  function
	 arguments  always  consist of two bytes apiece.  In the event
	 the argument was type "char" (8 bits), the  most  significant
	 byte  of  the  2-byte  value is a sign-extension of the lower
	 byte.
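
	      The offset bookkeeping described above can be modelled
	 with a short sketch.  All names here are hypothetical, and the
	 model assumes nothing else is on the stack at the point of
	 use (no temporaries pushed during expression evaluation).  The
	 two-byte return address is the one pushed by the 8080 CALL
	 instruction.

```c
#include <assert.h>

#define MAX_LOCALS 32
#define RETURN_ADDRESS_BYTES 2  /* pushed by the 8080 CALL instruction */
#define ARGUMENT_BYTES 2        /* every argument occupies two bytes */

static int local_size[MAX_LOCALS];  /* sizes in declaration order */
static int local_count = 0;

/* Forget all locals (start of a new function body). */
void reset_locals(void)
{
    local_count = 0;
}

/* Record a local of "size" bytes; returns its index for later queries. */
int declare_local(int size)
{
    assert(local_count < MAX_LOCALS);
    local_size[local_count] = size;
    return local_count++;
}

/* Offset from SP of a local: the bytes allocated after it was declared. */
int local_offset(int index)
{
    int offset = 0;
    int i;

    assert(index >= 0 && index < local_count);
    for (i = index + 1; i < local_count; i++)
        offset += local_size[i];
    return offset;
}

/* Offset from SP of argument "position" (0 = leftmost) out of "count"
   arguments, as seen inside the callee once all locals are allocated. */
int argument_offset(int position, int count)
{
    int locals = 0;
    int i;

    assert(position >= 0 && position < count);
    for (i = 0; i < local_count; i++)
        locals += local_size[i];
    return locals + RETURN_ADDRESS_BYTES
         + ARGUMENT_BYTES * (count - 1 - position);
}
```

	      Declaring a 2-byte X and then a 3-byte array reproduces
	 the offsets in the text: the array at SP+0 and X at SP+3.
	 Because arguments are pushed left to right, the rightmost
	 argument lies closest to the return address.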



	 OPERATING THE COMPILER

	      The small C compiler begins by asking  the  user  for  a
	 number  of options regarding the expected compilation.  Since
	 it was easier to ask questions than to pull arguments from  a
	 command  line  (which  is  in no way similar between the 8080
	 developmental  system  and  UNIX),  this  was  the  preferred
	 method.

	 The questions asked are as follows:

	      Do you want the c-text to appear?

	      This gives the  user  the  option  of  interleaving  the
	 source code into the output file.  Response is Y or N.  If Y,
	 a  semicolon  will  be placed at the start of each input line
	 (to force a comment to the  8080  assembler)  and  the  input
	 lines will be printed where appropriate.  If the answer is N,
	 only the generated 8080 code will be output.

	      Do you wish the globals to be defined?

	      This  question  is primarily a developmental aid between
	 machines.  If the  answer  is  Y,  all  static  symbols  will
	 allocate  storage  within the module being compiled.  This is
	 the normal method.  If N, no storage will be  allocated,  but
	 symbol  references  will  still  be  made  in the normal way.
	 Essentially, this question allows the user to specify all  or
	 none  of the static symbols external.  It is to be considered
	 a temporary measure.

	      Starting number for labels?

	      This  lets  the  user  supply  the  first  label  number
	 generated by the compiler for its internal labels (which will
	 typically  be  "ccXXXXX",  where  XXXXX  is  a decimal number
	 increasing with each label).  This option allows  modules  to
	 be compiled separately and later appended on the source level
	 without generating multi-defined labels.
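
	      A minimal sketch of such a label counter follows.  It is
	 an illustration only: the function names are hypothetical, and
	 printing the number without leading zeros is an assumption,
	 since the exact label format is not specified here.

```c
#include <stdio.h>

static int next_label_number = 0;

/* Set the first label number, as answered at the prompt. */
void set_starting_label(int start)
{
    next_label_number = start;
}

/* Write the next internal label name into "buffer" (at most "size"
   bytes including the terminator) and return the number it used. */
int make_label(char *buffer, size_t size)
{
    snprintf(buffer, size, "cc%d", next_label_number);
    return next_label_number++;
}
```

	      If one module is compiled starting at 0 and uses labels
	 up to some number, starting the next module above that number
	 keeps the two sets of labels distinct when the outputs are
	 appended.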

	      Output filename?

	      This question gets from the user the name of the file to
	 be created.  A null line sends output to the user's terminal.

	      Input filename?

	      This  question  gets  from  the  user  the name of the C
	 module to use as input.  The question will be  repeated  each
	 time  a  name  is  supplied,  allowing  the user to create an
	 output file consisting  of  many  separate  input  files  (it
	 behaves  as  if  the  user  had  appended  them  together and
	 submitted only the one file).  A null line response ends  the
	 compilation process.


	 COMPILING THE COMPILER

	      The  power  of  the  compiler  lies  in  the fact it can
	 compile itself.   This  allows  a  user  to  "bootstrap"  the
	 compiler onto a new machine without excessive recoding.

	      To compile the compiler under the UNIX operating system,
	 the appropriate command is:

	      % cc C80.c -lS

	 which will invoke the UNIX C-compiler and the UNIX linker  to
	 create  the  runnable file "a.out".  This file may be renamed
	 as needed and used.  No other files are needed.

	      In  order  to  create  a compiler for a new machine, the
	 user will need to compile the compiler into the  language  of
	 the  destination  processor.  The procedure currently used to
	 create the compiler for my 8080 system is as follows:

	 (1) Edit the file C80.c to modify two lines of code:

	 -change the line of code

		#include <stdio.h>
			to
		#define	NULL 0

	      (this is  done  since  the  "stdio.h"  I/O  header  file
	 contains  unparsable  lines  for  the small compiler, and the
	 line defining NULL is the only line of  "stdio.h"  needed  by
	 the compiler).

	 -change the line of code

		#define	eol 10
			to
		#define	eol 13

	      (this is done since my 8080 system uses <CR> for the end
	 of line character, and UNIX uses the "newline" character).


	 (2) Invoke the compiler (by typing "a.out" or whatever other
	     name it was given).

	 (3) Answer the questions by the compiler to use the file
	     C80.c as input and to produce the file C80.I80
	     as output.

	 (4) Append the files C80.I80 and C80LIB.I80 (the code for the
	     compiler and the code for the runtime library,
	     respectively).

	 (5) Assemble the combined file using some 8080 assembler.

	 (6) Execute the created run file.

	      Currently, the 8080  assembler  used  must  possess  the
	 abilities  to  handle symbol names unique to 8 characters and
	 to recognize lower-case symbol names  as  unique  from  their
	 upper-case  equivalent.  This is due to the fact the compiler
	 recognizes 8-character names and passes all  static  variable
	 and  function names intact to the assembler.  There are a few
	 symbol names within the compiler which are not  unique  until
	 the  7th  character and which have "upper-case twins".  These
	 discourage the use of  the  KL-10's  MACN80  since  it  folds
	 lower-case  to  upper case and does not recognize 8-character
	 names.  It may be used, however, if  the  user  is  aware  of
	 these  limitations  and  chooses  symbol  names  within these
	 restrictions.
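
	      Whether a pair of symbol names survives a given assembler
	 can be checked mechanically.  The sketch below is a
	 hypothetical helper, not part of the compiler; the number of
	 significant characters and the case-folding behaviour are
	 parameters to be set for the assembler in question.

```c
#include <ctype.h>

/* Return 1 if two symbol names would look identical to an assembler
   that keeps only the first "significant" characters and, when
   "fold_case" is nonzero, treats upper and lower case alike;
   return 0 otherwise. */
int names_collide(const char *a, const char *b, int significant,
                  int fold_case)
{
    int i;

    for (i = 0; i < significant; i++) {
        int ca = (unsigned char)a[i];
        int cb = (unsigned char)b[i];

        if (fold_case) {
            ca = toupper(ca);
            cb = toupper(cb);
        }
        if (ca != cb)
            return 0;           /* names differ within the kept prefix */
        if (ca == 0)
            return 1;           /* both names ended: identical */
    }
    return 1;                   /* identical over every kept character */
}
```

	      For instance, the made-up names "symbol1" and "symbol2"
	 are distinct with 8 significant characters but collide with
	 6, and "Outbyte" and "outbyte" collide only when case is
	 folded.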


	 THE FUTURE OF THE COMPILER

	      That part of the compiler which produces  code  for  the
	 8080  is  all  together in the final section of the compiler.
	 Routines used by the compiler to produce code are kept  short
	 and  are  commented.   Changing this compiler to produce code
	 for any other machine is a matter of changing only these  few
	 routines,  and  does  not  entail  digging around through the
	 internals of the program.   I  would  expect  the  change  to
	 another  machine  could be made in an afternoon providing the
	 target machine had the following attributes:

	      (1) A stack, preferably running backwards as  items
		   are pushed onto it.

	      (2)  Two  sixteen-bit registers.  In the 8080 these
		   are the HL register pair (the primary register
		   to the compiler) and the DE register pair (the
		   secondary register).

	      (3) An assembler (or cross-assembler).


	      Since  the  compiler is just now on its feet and subject
	 to feedback from users, it is expected many changes  will  be
	 made  to  it.   Already planned changes (in order of expected
	 addition) are:

	      (1)  Constants  will  be   pre-evaluated   by   the
		   compiler.   Something like x=1+2*3 will become
		   x=7 prior to generating any code.

	      (2) Structures will be added.  This is one  of  the
		   powers  of  C.   Its  omission has always been
		   considered temporary.

	      (3) Assignment operators (+=, &=,  etc.)   will  be
		   added.

	      (4)   Missing   unary   and  binary  operators  and
		   statements will be added.

	      (5) The expression parser will create  intermediate
		   tree-structures  of  the  expressions and will
		   walk through them before generating any  code.
		   This  will  allow  some  optimization and will
		   allow the function arguments to be  passed  on
		   the stack in the same sequence as UNIX.

	      (6)  A peep-hole optimizer will be added to improve
		   the generated code.

	      Many  of  these things represent a wish-list.  Time will
	 be spent only when it becomes available.  Any volunteer  help
	 in any of these areas would be appreciated.

	      Questions should be directed to Ron  Cain  here  at  SRI
	 either at extension 3860 or at CAIN@SRI-KL.

-== eof ==-