Welcome to lecture 3. We started with grammars in the last lecture and we will continue with them in some detail.
Let me just briefly summarize what we did in the last lecture. A grammar is a four-tuple consisting of a finite set of non-terminal symbols, or grammatical categories; a finite set of terminal symbols, which usually constitutes the vocabulary of the programming language; a finite collection of formation rules, or productions, which are really rules of replacement; and a start symbol, which signifies the grammatical category of a sentence of the language.
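To make the four-tuple concrete, here is a minimal Python sketch; this is an editor's illustration, not from the lecture, and the toy grammar S -> aS | a is an assumed example.

    from typing import NamedTuple

    class Grammar(NamedTuple):
        """A grammar as a four-tuple (N, T, P, S)."""
        nonterminals: frozenset   # N: the grammatical categories
        terminals: frozenset      # T: the vocabulary, disjoint from N
        productions: list         # P: pairs (lhs, rhs), the rules of replacement
        start: str                # S: the category of a full sentence

    # A toy instance: S -> aS | a over the single terminal 'a'.
    g = Grammar(
        nonterminals=frozenset({"S"}),
        terminals=frozenset({"a"}),
        productions=[("S", ["a", "S"]), ("S", ["a"])],
        start="S",
    )
    assert g.nonterminals.isdisjoint(g.terminals)  # the caution noted below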
We went through one simple grammar, for example, for the generation of Boolean expressions. We have various syntactic categories: a set of non-terminals whose grammatical categories consist essentially of expressions. The V stands for expressions, the C stands for complemented expressions, and there is a start symbol. The terminal set consists of the open and close parentheses, the connectives, and 'Not'. While we are dealing with grammars I will use the colour black for terminal symbols and, since the grammatical categories are a level of abstraction higher, light green for some and dark green for others.
This was the grammar we had. We also saw how a sentence, which is a string of terminal symbols, can be generated from this grammar by applying the production rules. In each of these steps I have circled in orange the non-terminal symbol which I am replacing, and there are several choices for replacing S. If you make different choices you will generate a large number of other sentences; since there is absolutely no restriction on how long you can keep rewriting S, this grammar can actually generate an infinite set of sentences.
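The exact productions were shown on the slide; assuming a plausible set of rules for Boolean expressions over two variables p and q (the production set below is an assumption, not the lecture's), a derivation can be sketched in Python as follows.

    import random

    # Hypothetical productions for the Boolean-expression grammar; the actual
    # rules were on the lecture slide, so this set is only an assumption.
    PRODUCTIONS = {
        "S": [["(", "S", "and", "S", ")"],
              ["(", "S", "or", "S", ")"],
              ["C"],
              ["V"]],
        "C": [["not", "S"]],
        "V": [["p"], ["q"]],
    }

    def derive(symbol="S", depth=0, max_depth=6):
        """Rewrite non-terminals until only terminal symbols remain."""
        if symbol not in PRODUCTIONS:            # a terminal: emit it as-is
            return [symbol]
        rules = PRODUCTIONS[symbol]
        # Past max_depth, prefer the last (shortest) rules so the derivation ends.
        rule = random.choice(rules if depth < max_depth else rules[-2:])
        out = []
        for sym in rule:
            out.extend(derive(sym, depth + 1, max_depth))
        return out

    print(" ".join(derive()))    # e.g. "( not p or q )"

Running it repeatedly produces different sentences, which is exactly the point: the finite rule set generates an unbounded variety of strings.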
As you can see, a grammar is a finitary representation of an infinite set. A large part of computer science, mathematics and logic really has to do with how to represent infinite objects in a finite manner, and this is one such example. Some warnings and cautions that you must keep in mind: the set of non-terminals and the set of terminal symbols should be disjoint, and the production set is really a binary relation from non-terminal symbols to strings of non-terminal and terminal symbols.
The replacement rule allows you to choose any non-terminal symbol and replace it by a string consisting of terminal and non-terminal symbols. There is a star here to denote the set of all possible finite strings that can be generated from the set (N ∪ T), that is, (N ∪ T)*.
Let me explain this in general. For any set A, A* is the set of all strings over A of finite length; in particular, the length could be 0. A string of length 0 is really nothing, so it is called the empty string, and we usually use the Greek letter epsilon (ε) to denote it. A+ is the set of all non-empty strings generated from this set. There is only one string of length 0, namely the empty string, so A+ = A* with ε removed.

So, the particular kind of grammar that we have been considering as an example is what is known as a context-free grammar.
When the production rules are such that on the left-hand side of the arrow you have a single non-terminal symbol and on the right-hand side some string of terminals and non-terminals, the grammar is called context-free, as opposed to context-sensitive. A context-sensitive grammar has production rules in which, given a certain string of terminals and non-terminal symbols, you are allowed to replace a non-terminal symbol by some other string of non-terminals and terminals only in a specified surrounding context. Let us take an example in the context-free grammar.
Take an arbitrary string arising in this example, in which we are replacing this S. The S appears in a context, and the rest of the string is that context. Similarly, this other S appears in its own context. We are calling this grammar context-free because we allow the replacement of S by a string of non-terminals and terminals regardless of what context the S appears in. We have a uniform rule, uniform in the sense that regardless of what context the non-terminal appears in, you are allowed to do the replacement. In a context-sensitive grammar you could, for example, specify that a certain non-terminal can be replaced by a string only if it appears in a particular kind of context. In particular, you might define that context to be the empty context.
In general, a context-sensitive grammar is really more general than a context-free grammar. You might look upon every context-free production as specifying a context which consists of the empty string. We will go into that a little later, but let us first consider some simpler grammars.

Let us look at languages. The language generated
by any grammar is the set of all possible sentences that may be generated from the start symbol. Note the qualification: if you take the start symbol located inside some other context and generate a string, that string may not be in the language at all, unless that context itself could have been generated from the start symbol.
This is what we would call a language; in general we would call it a language on the set of terminal symbols, and a language on a set of terminal symbols T is a possibly infinite set of strings from the terminal set. Any subset of T* is a language. For example, here are some trivial languages that you can define on any set T: you have the empty set of strings, which is the empty language; you have the language consisting of a single element, the empty string; you have T*, which is itself a language; and T+ is itself a language. In between you have a whole lot of other languages. You might regard T+ and T* as one extreme, and the empty language and the language containing only the empty string as the other extreme, with lots of subsets in between. The point is that a programming language is an infinite set of possible programs, and the problem is to define exactly which set a grammar generates. Just as we have classified grammars, we will also classify languages in a similar fashion.
First, let us go into a particular kind of grammar called the regular grammar. Suppose you take a grammar in which every production is of the form A -> aB, where the capital A and capital B denote non-terminal symbols and the small 'a' denotes a terminal symbol (which in fact should have been coloured black). If every production is of this form, we call the grammar a right-linear regular grammar. The first point to realize is that a right-linear regular grammar is also a context-free grammar; there is really no difference in kind. A context-free grammar simply also allows productions which do not have this restricted shape. Right-linear means that on the right-hand side you have just one terminal symbol and one non-terminal symbol, appearing in this order: the terminal symbol is followed by the non-terminal symbol.
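To make the shape concrete, here is a minimal Python sketch, an editor's illustration rather than the lecture's, that checks whether every production is of the form A -> aB or A -> a.

    def is_right_linear(productions, nonterminals, terminals):
        """Check that every production is A -> aB or A -> a:
        a single terminal, optionally followed by a single non-terminal."""
        for lhs, rhs in productions:
            if lhs not in nonterminals:
                return False                    # left side must be one non-terminal
            if len(rhs) == 1 and rhs[0] in terminals:
                continue                        # terminal-generation rule A -> a
            if (len(rhs) == 2 and rhs[0] in terminals
                    and rhs[1] in nonterminals):
                continue                        # A -> aB, terminal first
            return False
        return True

    # Rules like S -> 0S and S -> 0 pass; a left-linear rule S -> S0 fails.
    print(is_right_linear([("S", ["0", "S"]), ("S", ["0"])], {"S"}, {"0"}))  # True
    print(is_right_linear([("S", ["S", "0"])], {"S"}, {"0"}))                # False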
In a context-free grammar we do not have that restriction; for example, the rules we had for S are not right-linear at all. And of course you would also allow a production that generates just a terminal, A -> a. After all, you have to generate strings of the language from the terminal set: if you always had a non-terminal symbol appearing on the right-hand side, you would never be able to generate a full sentence of the language. A right-linear regular grammar is one in which
all the productions are of this form. Similarly, you might define a left-linear regular grammar as one in which all the productions are of the form A -> Ba, together with the terminal-generation rule A -> a. If you have designed some hardware using finite state machines, it turns out that such grammars can actually represent finite state machines, that is, machines without output. In fact, you can take the state transition diagram of the machine, treat each state as a non-terminal symbol, and treat the input symbol on each transition as a terminal symbol; a finite state machine then automatically defines a right-linear grammar.
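A minimal sketch of that construction, assuming transitions are given as a dictionary from (state, input symbol) to next state; the two-state machine below is a hypothetical example, not from the lecture.

    def fsm_to_right_linear(transitions, accepting):
        """Each state becomes a non-terminal; a transition P --a--> Q becomes
        the production P -> aQ, plus P -> a whenever Q is an accepting state."""
        productions = []
        for (state, symbol), target in transitions.items():
            productions.append((state, [symbol, target]))
            if target in accepting:
                productions.append((state, [symbol]))  # derivation may stop here
        return productions

    # A two-state machine accepting strings of a's ending in b.
    transitions = {("S", "a"): "S", ("S", "b"): "F"}
    for lhs, rhs in fsm_to_right_linear(transitions, accepting={"F"}):
        print(lhs, "->", " ".join(rhs))
    # S -> a S
    # S -> b F
    # S -> b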
Most finite state machines have a designated start state, which plays the role of the start symbol. Let us summarize what we have looked at. The general properties of grammars are that, firstly, every regular grammar, whether right-linear or left-linear, is also context-free, and every context-free grammar is also context-sensitive. In particular, all the productions of a context-free grammar can be considered in the context of empty strings on both sides of the non-terminal symbol.
In general, you can take any of these strings, say ¬(A), and read it as ε ¬ ε ( ε A ε ) ε: an empty string, then the 'Not' symbol, an empty string, then an open bracket, an empty string, then A, an empty string, and a close bracket. You can look upon every context-free production as appearing in a context which contains the empty string, with the empty string implicitly appearing everywhere between terminal symbols and non-terminal symbols. It is for that reason that every context-free grammar is also context-sensitive.
Our interest in grammars is ultimately in generating languages. Suppose you take any language that can be generated by a right-linear grammar; then it is also possible to define a left-linear grammar which will generate the same language. Similarly, if you were to take any language generated by a left-linear regular grammar, the same language can also be generated by a right-linear one.

Let us look at the set T*. For any set T of terminal symbols, T* is just the set of all strings obtained from the terminal symbols in T. On T* I can define a binary operation called catenation.
The effect of catenation is to take two strings and put them together. For simplicity, let us assume that the set T consists of just two symbols, a and b. Take a string in T*, say ababb, and take another string, say bab. The operation of catenation, which I will denote by just a dot, produces the string ababbbab: catenation simply juxtaposes the two strings. It is a binary operation on strings which puts the two strings together and gives you a new string. For example, the first string is of length 5, so it belongs to the set T^5; the second string is of length 3 and belongs to T^3; and the result has length 8 and belongs to T^8. Of course all these sets are subsets of T*, so catenation is really an operation from T* × T* -> T*: its functionality is to take two strings of finite length and juxtapose them.
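As an aside, the properties discussed next can be checked directly with Python's built-in strings, where catenation is simply the + operator; this is an editor's sketch, not part of the lecture.

    # Catenation on T* is ordinary string concatenation in Python.
    s, t = "ababb", "bab"

    print(s + t)                        # 'ababbbab': lengths 5 and 3 give length 8
    assert len(s + t) == len(s) + len(t)

    eps = ""                            # the empty string, the identity element
    assert s + eps == s and eps + s == s

    u = "ba"
    assert (s + t) + u == s + (t + u)   # associativity: T* is a monoid
    assert s + t != t + s               # but not commutative, so not abelian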
When you take a string and juxtapose the empty string to it, you get back the same string; if you take the empty string and juxtapose some other string to it, you get back the other string. That is, the empty string satisfies the conditions that for any s belonging to T*, sε = s and εs = s. Very often catenation is written as plain juxtaposition, getting rid of the dot symbol, but you could keep a dot in between. So one obvious property is that epsilon, ε, is in fact the identity element for catenation; it is like a zero for addition.
Secondly, catenation is associative, in the sense that if I take three strings s, t and u, then (s·t)·u = s·(t·u): I can catenate them in any grouping. The set T* under catenation, with the empty string, is really a monoid, because the operation is associative and has an identity element. However, it is not commutative, so it is not an abelian monoid; it is just a monoid.

Now, what do we mean by context-sensitivity? Let us take
a production of an arbitrary context-sensitive grammar. Let us keep the earlier grammar here, since we might need it. A production of a context-sensitive grammar specifies that if a non-terminal symbol A appears in some context, say aAb, then I can replace the non-terminal symbol by some other string, and the context is still preserved. Only the A is going to be replaced, so it is a conditional rewrite rule: a production such as aAb -> a xBCy b says that I can replace A by the string xBCy, padded with the same a and b, only if on both sides of the A the small 'a' and the small 'b' appear.
In that sense the production is context-sensitive. In the generation process there may be some large arbitrary string, with other symbols around the A. If A appears in a context which does not contain a small 'a' and a small 'b' on either side, then this rule cannot be applied to it. So, a context specifies a certain minimal shell within which that element should appear. In the case of our context-free grammar, we are implicitly specifying that the context is ε on both sides, which means that you do not care what appears on either side of S; then S may be replaced wherever it occurs. A context really specifies the smallest kernel around the symbol which must be satisfied: regardless of what the string in the generation process is, if it is of the form aAb, then regardless of what else occurs in the string this A is a candidate for replacement by this rule. Context-sensitive grammars are actually more general still, in the sense that the padding of 'a' and 'b' on the right-hand side is perhaps not even necessary; you might get rid of the small 'a' and small 'b' there too. Let us not worry about that, because the intuitive meaning of context-sensitivity is this: you specify some minimal padding around the non-terminal symbol which enables a rule to be applied, and in the case of a context-free grammar the minimal padding is nothing.
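A minimal sketch of such a conditional rewrite, assuming single-character symbols and a rule of the shape aAb -> a xBCy b; the function and its names are illustrative, not from the lecture.

    def apply_cs_rule(string, left, nonterm, replacement, right):
        """Apply the context-sensitive rule  left A right -> left replacement right:
        rewrite the non-terminal only where the required padding surrounds it."""
        pattern = left + nonterm + right
        i = string.find(pattern)
        if i == -1:
            return None                  # A never appears in the required context
        return string[:i] + left + replacement + right + string[i + len(pattern):]

    # Rule aAb -> a xBCy b: A may be rewritten only between 'a' and 'b'.
    print(apply_cs_rule("caAbc", "a", "A", "xBCy", "b"))   # 'caxBCybc'
    print(apply_cs_rule("cAc", "a", "A", "xBCy", "b"))     # None: wrong context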
All context-free rules can thus be thought of as rules in which the minimal padding required on both sides is the empty string, which means that it does not matter what the rest of the string in the generation process is: regardless of the rest of the string, you can apply the production. As far as programming languages are concerned, there are practical reasons why we do not take their context-sensitivity into account. It is much simpler to deal with a programming language as generated by a context-free grammar and to deal with the context-sensitive aspects later in the process of compilation. A typical context-sensitive feature, even in languages like PASCAL, is that no variable can appear in a statement unless there is a declaration of that variable. If you have a context-free grammar for PASCAL, it will fail to check such context-sensitive issues.
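A sketch of how a later, separate pass can enforce this declare-before-use rule once parsing is done; the representation of declarations and statements below is an assumption made for illustration.

    def check_declarations(declared, statements):
        """Report every variable used in a statement without a declaration."""
        errors = []
        for lineno, used_vars in statements:
            for v in used_vars:
                if v not in declared:
                    errors.append(f"line {lineno}: variable '{v}' not declared")
        return errors

    # Hypothetical program: x and y declared, but one statement also uses z.
    print(check_declarations(declared={"x", "y"},
                             statements=[(3, ["x", "y"]), (4, ["z"])]))
    # ["line 4: variable 'z' not declared"]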
Undeclared variables can appear in your program if you go purely by context-freeness. But there are no efficient algorithms to recognize or parse context-sensitive languages represented as context-sensitive grammars, whereas we do have efficient algorithms to recognize and parse context-free grammars. If you take context-sensitive grammars, you are not likely to get a linear algorithm; there are no linear algorithms available for parsing them. So people usually specify the language as a context-free grammar and later, as part of the semantics, specify its context-sensitive aspects as well. Many people in fact consider 'context-sensitive' to be synonymous with the semantics of the language, but that is not quite true. To summarize again: every regular grammar is context-free; every context-free grammar is context-sensitive; every language generated by a right-linear regular grammar can also be generated by a left-linear regular one; and every language generated by a left-linear one can also be generated by a right-linear one. Supposing you have a grammar which is regular but not necessarily right-linear or
left-linear; then some of the rules might be right-linear and some might be left-linear. Such grammars can also always be converted into either purely right-linear ones or purely left-linear ones. In fact, that conversion helps us to design
machines for recognizing languages of these grammars.
When we are looking at languages we would say that the language is regular if there
exists a regular grammar which generates it. Similarly a language is context-free if there
exists a context-free grammar which generates it and a language is context-sensitive if
there exists a context-sensitive grammar which generates it.
It is possible that for some language you have produced a context-free grammar, so the language is context-free, but it could still also be regular. Similarly, it is possible that you have written a context-sensitive grammar for a language which is actually context-free, in the sense that you might be able to come up with a grammar for it which is purely context-free.
Let us just look at a few small examples of regular languages.
Let us consider our Arabic numerals. We must remember that it is one matter to design a grammar and ask what language the grammar generates; it is a different matter to take an existing language and try to define a grammar for it. The numerals were known long before any grammar was defined for them. Actually, the numerals come from a tradition with a strong notion of grammar in natural language; Sanskrit grammar, for example, was always a neat and rigorous art form. That is perhaps one of the reasons that such a neat notation
for numbers was evolved. If you take the Arabic numerals, the terminal set is the set of all possible digits that you have; I am considering representation in decimal, which is 0-9. I require just one non-terminal symbol S, and I have just the following productions, one pair for each digit d in 0-9: S -> dS and S -> d.
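A minimal sketch of a recognizer that mirrors these productions directly; this is an editor's illustration, not from the lecture.

    # The right-linear numeral grammar S -> dS | d, for each digit d in 0..9:
    # consume one digit, then either stop or continue with S.
    DIGITS = set("0123456789")

    def is_numeral(s):
        """Recognize the language of the grammar S -> dS | d."""
        if not s or s[0] not in DIGITS:
            return False                 # every rule begins with a terminal digit
        if len(s) == 1:
            return True                  # the rule S -> d ends the derivation
        return is_numeral(s[1:])         # the rule S -> dS recurses on the rest

    print(is_numeral("1984"))   # True
    print(is_numeral(""))       # False: the grammar generates no empty numeral
    print(is_numeral("19a4"))   # False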
It is a nice and simple grammar, a right-linear grammar, and the corresponding left-linear grammar which is equivalent to it has, for each digit d, the productions S -> Sd and S -> d.

Contrast this with the Roman numerals. Firstly, it is not at all fully clear what
the terminal set is, in the sense that the Romans never considered numbers beyond a few tens of thousands. The Romans had a pattern, in the sense that they had symbols (I, V, X, L, C, D, M), and one may assume symbols also for 10,000; 50,000; 100,000; 500,000; but if you continue that pattern then you would require an infinite set of terminal symbols. If you did have an infinite set of terminal symbols, the condition for being a grammar is already violated. And even supposing you could have an infinite set of terminal symbols, the numerals were written in a very context-sensitive way.
For example, an I cannot precede a C, and an X cannot precede a D (500). It is very context-sensitive, in the sense that the Roman numerals are not as simple as this decimal grammar.

A regular language is implicitly a very simple object. It is easy to see what
language it generates, and very often it is easy to construct a grammar for it; and these two numeral grammars are really equivalent. Remember that our ultimate aim is to represent languages somehow in a finitary fashion. The criterion for saying that two grammars are equivalent is that they should have the same terminal set; they need not have the same non-terminal set, and they need not have the same production set, but the language they generate should be the same.

For the context-free Boolean-expression language that we previously gave, here is an equivalent context-free grammar. I have factored out the open parenthesis and the S and introduced a new non-terminal symbol called B, for which I have two rules. I have also gotten rid of some of the non-terminal symbols; in particular I got rid of C, because what it produced could be written out directly, though I could have kept C by choice.
So let us quickly go through that grammar: here was the original grammar, and here is the equivalent grammar. We have used the fact that there is a common occurrence of a left parenthesis followed by S, and we have factored that out. An equivalent way of writing this grammar is to use a new symbol: let us say S -> D, and let D produce this string. The elimination of C was just to make the grammar smaller, to reduce the number of non-terminals. It is important to reduce the number of non-terminals, because the parsing of the language really depends on how many non-terminal symbols there are.
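Since the exact rules were on the slide, here is the factoring transformation sketched on an assumed pair of rules of that shape; the rules below are stand-ins, not the lecture's.

    # Two assumed rules sharing the common prefix "( S" that gets factored out.
    BEFORE = {"S": [["(", "S", "and", "S", ")"],
                    ["(", "S", "or", "S", ")"]]}

    # After factoring: the shared "( S" is written once, and a new non-terminal
    # B carries the two alternative continuations.
    AFTER = {"S": [["(", "S", "B"]],
             "B": [["and", "S", ")"],
                   ["or", "S", ")"]]}

    # A parser that has read "(" and an S no longer has to guess between the
    # 'and' rule and the 'or' rule: the next token alone selects a rule for B.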
For the same language you might have a variety of grammars, and it is a matter of decision making to choose the right kind of grammar to generate that language. The criteria for choosing a grammar are: firstly, the grammar should not be so complicated that it is impossible to parse the language, and preferably it should be a context-free grammar; the number of non-terminals should be kept low; and, another important constraint, it should facilitate an easy explanation of the semantics of the language.
In fact, the left-linear and the right-linear numeral grammars exhibit that difference: they are both equivalent in terms of actual generation, but it is easier to specify semantics for the right-linear one than for the left-linear one, and certain parsing algorithms will actually choose the right-linear over the left-linear because there is an inherent constraint in those parsing algorithms. The constraint has to do with recursive calls: the recursive calls in the left-linear case could lead to an infinite recursion, whereas in the right-linear case they lead to a finite recursion based on a lookahead.
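A minimal sketch of that point for the numeral grammars (an editor's illustration): the right-linear rules translate into a recursive procedure that always consumes a symbol before recursing, while a naive transcription of the left-linear rules recurses without consuming anything.

    # Right-linear S -> dS | d: consume one digit, then use lookahead to decide.
    def parse_right(s, i=0):
        if i >= len(s) or not s[i].isdigit():
            return False
        if i + 1 == len(s):              # lookahead: nothing follows, use S -> d
            return True
        return parse_right(s, i + 1)     # S -> dS; the recursion shrinks the input

    # Left-linear S -> Sd | d, transcribed naively into recursive descent:
    # def parse_left(s, i=0):
    #     return parse_left(s, i) and ...   # recurses without consuming a symbol,
    #                                       # so the recursion never terminates

    print(parse_right("1984"))   # True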