Implementation notes:



Specification as-implemented
-----------------------------

V1.1
----

SLM_* Models are now required to reference sections build strictly from
the "Diagonal_matrix' data type.  Previously, any data type was allowed,
provided there were no off-diagonal elements.  The ICM specification is
quite clear on this point.  The v1.0 parser was too lax in this respect.

The '[End ICM Section]' argument is deprecated and should not be used
in new models.  The v1.1 parser will not tolerate the argument if it is
present.  If you have v1.0 files that you must parse, either fix them,
or use the forcing flag to specify the rules you want to use.

The parser, by default, checks but does not honor the [ICM Ver] 
keyword in an ICM file.  It will always parse using v1.1 rules unless 
compelled to do otherwise with the forcing '-f' flag.  '-f' by itself will
cause the parser to parse using the rules specified by the input file
[ICM Ver] keyword.  '-f1.0' will force the parser to use the v1.0 rules
always.  '-f1.1' is recognized, but is the default anyway.  INFO level
(-v) messages are emitted regarding the actual choice of ruleset.

The parser implements the new [Frequency] dependent RLGC matrixes.  This is
a compatible extension to v1.0 formatted ICM files.

The parser implements the new optional 'Side' subparameter that may 
immediately follow the Model_pinmap and Model_nodemap subparameters.
'Side' is permitted anywhere these subparameters are.

The v1.1 parser no longer requires a terminating Model_pinmap on a
[Tree path Description].  This was an error of interpretation of
the v1.0 specification.  The v1.0 parser is unchanged in this regard.

SLM_* matrixes are now explicitly required to be of the 'diagonal' type.
It is disallowed to represent such matrixes in any other format.
Previously, the parser would tolerate this as long as the data itself
was diagonal in nature.

The parser has been significantly restructured to provide it's functionality
in a convenient callable form, and to support multiple rule sets.
It returns a data structure which faithfully represents the content of the 
input ICM document.  icmchk.c demonstrates how this is done.  The same 
data structure is returned for both v1.0 and v1.1 ICM files.

NOTE:  In order to make the callable interface available to tool vendors,
the necessary binary libraries have been relicensed to use the 'Lesser'
or 'Library' GPL.  Basically, you can link with libicm.so without fear 
of triggering the GPL contamination of your sources.  Of course, 
you may still bundle the icmchk1 tool as well.

This does not absolve you from the requirement to publish sources for any
changes you might make to the icm parser itself.  If you make changes to 
icmchk1 or it's library and distribute the resulting binaries, you are 
still required to make those changes available as source to all who might 
ask for them.

If that's not acceptable to you, get in touch with either me or the IBIS
committee and we'll see what we can work out with regard to alternative
licenses.  It won't be a cost free option, but it won't be absurd either.
It would be reasonable to presume an arrangement and a cost schedule very 
similar to that of the existing IBIS parser.


V1.0
----

The [Disclaimer] text block has a recommended maximum of no more than 96 lines.
This is neither checked nor enforced.


The [Redistribution Notes] text block has a recommended maximum of no more 
than 24 lines.  This is neither checked nor enforced.


The ICM Family Description] text block has a recommended maximum of no more
than 4 lines.  This is neither checked nor enforced.


Unless specifically restricted by the specificiation, as is the case with
a 'node name', user defined identifiers may use a sequence of any legal 
dark printing character, bounded by white space.  Thus, '*&^%$' is a 
legal name, and would be accepted as a 'model name'.  Why anyone would 
actually want to do this is quite beyond me, but by my reading of the 
specification it is not actually ruled out.


The parser requires a Model_pinmap at both ends of the main path in a 
[Tree Path Description].  It is not clear to me that this is required by 
the specification or not, but it seems to be.  The first Model_pinmap is
expressly required, the second is implied by the wording "A second 
Model_pinmap is used at the end of the path description to referene the
pin map used for the other end of the interconnect".  Unlike the first,
this wording does NOT say it is required, but seems to make a pretty clear
statement that it is there.


Fork/Endfork blocks are allowed to nest within enclosing Fork/Endfork
blocks in a [Tree Path Description].  There is no wording to prohibit
this.  The wording about Forks is that "The sections (zero or more) 
between the Fork and Endfork subparameters are connected together as in 
any other path description."  Since the main path description expressly 
allows the use of Forks, I conclude that Fork paths also allow the use 
of Forks.  The parser permits Fork/Endfork nesting to any depth, and
therefore the construction of a generalized binary tree of interconnected
paths.  Naturally, the parser's internal representation of this structure
is also a binary tree.


SLM_* Modes must use matrices that consist solely of diagonal elements.
The parser does not enforce this by limiting the representation of these 
matrices to the "Diagonal_matrix" format.  Instead, it permits any 
representation to be used, and then expressly tests the resulting matrix 
for diagonality.

The various technical notes concerning the values of RLGC matrices from
pages 54-59 are not tested.  These comments are either assertions on
the meaning of the data, or require action by a simulator to evaluate 
properly.

The [End ICM Section] requires a text string argument.  This is probably
a hold-over from earlier revisions of the specification, when this
value was required to match the Section name given on the [Begin ICM Section]
keyword.  Since it's in the spec, the parser does require that this
string be present, but will only warn if it does not match.


Technical
---------

Overall Structure
-----------------

The parser is implemented in three parts; a common lexical analyzer implemented
in F/LEX, and two parsers implemented in BISON/YACC.  The YACC parsers
each adhere solely to a particular version of the ICM specification.  The
correct parser is transparently selected at run time according to
command line arguments.  The default behavior is to parse all ICM input
documents using v1.1 rules.

There is a method to my madness here.  The point is to not clutter the v1.1 
parser with any unnecessary v1.0 legacy behaviors or errata, or break
the original rulesets while implementing new rulesets.  While the
virtue of this may not yet be entirely evident going from 1.0 to 1.1, I am 
thinking ahead to a point in time when there are major 2.x or 3.x releases.

Writing a single unified parser that faithfully supports multiple major 
versions is a problematic and highly fragile business.  Regressions are
inevitable, and become ever more difficult to deal with as each version 
accretes.  It is far cleaner to drive a separate parser for each version, 
and then unite them under the covers via an intelligent front end switch.

The upshot is that V1.0 ICM documents will continue to parse with EXACTLY 
the same result they always have, because they still transparently use the 
original v1.0 parser.  V1.1 features have no effect whatsoever on v1.0 
documents; they simply can't by the nature of the design.

Since this is a dual parser, two complete test suites are included.
One is suitable for v1.0, the other for v1.1.  The v1.0 test suite requires
the use of the parser's '-f1.0' command line flag to select the proper
rule set.


YACC/LEX
--------


YACC/LEX are not ideally suited to the ICM language, as the language
itself is far from regular.  Fundamentally, this results in the lexical
analyzer needing to know far more about the structure of the language than 
it might otherwise.  Ideally, the lexical analyzer should be stateless.

This one is not stateless;  it does retain a certain limited knowledge
of what it has recently seen, particularly  with regard to what the last
keyword was, and thus what it's legitimate subparameters are.  

It also possesses a number of token processing substates to decode several
kinds of values such as numbers, strings, and names of various types.
The lexical analyzer cannot, however, readily determine in all cases
when these substates are appropriate to the current context.  It depends
on the parser to inform it when a particular substate is appropriate.

There are a lot of shift/reduce and some reduce/reduce conflicts
in the YACC grammar.  Nearly all of them arose when an extensive series of
error recovery points were added to the grammar, and nearly all of them involve
multiple choices among error targets.  The actions that BISON takes by default
to resolve these are adequate, and typically results in the 'nearest' error
trap being activated.  I see no good way to get rid of these conflicts 
without sacrificing a great deal of error recovery capability.  Suggestions 
are welcome.


Hinting
-------

The lexical analyzer does not maintain extensive knowledge of the contextual 
state of the parse.  What it does retain it little more than 'the current 
keyword is X', 'the current subparameter is Y', and a hint regarding the
possible nature of the next token.  This limited information is entirely 
inadequate to parse complex structures such as tables, multi-line lists, 
and optional arguments.

The parser possess much richer contextual state, and is capable of
making these sorts of distinctions.  Based on it's context, the parser
may drive hints to the analyzer whenever the analyzer's default behaviour
is inadequate to a particular situation.  The hint results in the lexical 
analyzer pushing itself into a substate to try to find a token of the 
hinted type.  This may fail.  If so, the analyzer pops its state and 
reverts to it's default interpretation of the input stream.

Of course, this will result in a parse error when an unexpected token
is returned to the parser.  The parser has to be prepared to deal with
this condition.  But then, that was always it's task anyway.

For the most part, this works quite well, and severely limits the amount
of state that the lexical analyzer must retain.  Unfortunately, there
are some circumstances in which it's rather difficult to ensure that
correct hinting occurs to the lexical analyzer before the next token
is fetched.  

In particular, whenever the parser needs to obtain a 'look ahead' token 
to determine what state it should enter next, hinting becomes problematic
as the parser simply fetches the look-ahead token without regard to 
any hinting.  If you think about it, that is all it can really do.  There
is no way to hint if a choice must be made among several possibilities
based on what comes next.

This is suprisingly rare, though.  It's dealt with by coercing an
appropriate hint in advance of such a decision point.  Coercion is 
accomplished by embedding in-line hint 'action' directly within the 
ruleset at the appropriate points.  Fortunately there aren't very many 
places where this sort of ad-hoc hackery is necessary.

For the most part, all the necessary hints can be emitted automagically
by simply defining targets such as 'number', which emit a hint and then 
immediately fetch the required token.  There are only a handful of such 
rules that are needed; the rest of the ruleset simply uses them whenever
it's appropriate to do so.

If the parser and the lexical analyzer should ever disagree as to what type 
of token to expect next, the parser obviously wins.  The analyzer sets 
it's internal hint before returning to the parser; the parser establishes 
it's hint, if any, before calling the analyzer again, overriding
whatever the analyzer may have pre-established.

Certain tokens can never be overridden regardless of the presence of any
hints.  For example, [keywords] are always recognized as such regardless 
of what else may be going on.


regarding the parse ruleset
---------------------------

Trade offs have been made within the parser to maximize both the
clarity of error messages, and the ability to recover from syntax errors.

While it is possible to construct a ruleset that rigidly expresses the
allowed syntax of the ICM language, the result is a parser that will
usually give up on the first error, or worse, like a compiler, generate 
nonsense errors for every remaining line of the input file as it struggles
to find some match with it's ruleset.

As a result, the ruleset as implemented is primarily expressed as iterators 
that gather up sections of the ICM input file, and expressly check whether 
or not a particular element is legitimate within a given section.

For example, the first iterator breaks down the ICM input into it's header,
the models, and the sections.  The header is then broken down by another
iterator looking for the various legal elements of a header.  Finally
a validation routine tests the proposed header to ensure it is proper, that
everything is there that should be there, and that any necessary ordering
rules were observed while it was being assembled.  Similar tactics are
employed with the ICM models and sections.  

The result is that the parser can more easily resyncronize itself when a 
syntax error is detected, and it can recognize more of the input file 
even if some input must be discarded to do so.


String token values
-------------------

The lexical analyzer maintains a table of strings that it constructs on the
fly.  This table is stored in a balanced b-tree.  The routines for doing this 
are general purpose, but are easily capable of storing millions of strings
and can locate the stored copies extremely rapidly.

Only ONE UNIQUE COPY of any string is ever stored in the b-tree.  Attempts
to store a string redundantly simply result in the lookup and return of 
the pointer to the previously stored string.

Whenever a string value is returned by the analyzer, it actually returns the 
pointer of the stored string.  The parser may depend on the stability of 
this pointer;  it will never be deleted nor will it change.  Of course, 
the parser should neither free nor modify these strings in any way.  
They do not belong to it.

This allows the parser to treat string pointers as if they are symbolic IDs.
It is legitimate for the parser to compare two occurance of a string for
equality by simply checking if they both in fact possess the same pointer.  

Routines are provided for the parser to register it's own strings with the
analyzer as well so that they too may be used in this manner.  This is 
exploited in a number of places.

For example, the parser may check that a returned string token is 'Yes' by 
looking up 'Yes' in the analyzer's symbol table and then determining if
the pointer returned is in fact the same as the token in question.  
This type of comparison is not an error, although admittedly it does look
strange if you don't understand what's going on, as the first impulse is
to use strncmp() (which also works).

The result of a successful parse is the return of a (struct_s *icm) structure.
One structure is returned every time icm_parse() is called and an input
file reduced.  This struct is owned by the parser.  You must treat it as
READ ONLY.  When you are done with it, give it to the icm_delete_icm()
destructor, which will properly dispose of it.  DO NOT DELETE THE ICM
STRUCT ANY OTHER WAY.  If you do so, you will almost certainly corrupt the
string database and any other outstanding ICM structs that may be hanging
around.

The parser counts how many structs it has handed out, and how many are 
returned to it.  Only when ALL structs have been desctoyed will the parser 
advise the lexical analyzer that it is safe to delete it's string database.
