Definite Clause Grammars

Much of the research done to date on building efficient natural language systems has concluded that logical inference is a powerful means by which language can be parsed and generated. Definite clause logic, the subset of logic on which Prolog is based, has proven highly useful for building parsers and generators. In fact, Prolog was developed by A. Colmerauer specifically to address natural language technology.

Although other logic-based techniques (often implemented in Prolog) have proven more useful for deep natural language processing, definite clause grammars remain a powerful technique for small grammars, for shallow processing, and for processing artificial grammars.

This chapter describes a Prolog grammar formalism called "definite clause grammars" and describes how that formalism is used with Prolog.

Understanding language structure

Definite clause grammar (DCG) notation is a syntactic sugar that extends ordinary Prolog syntax. DCG's are a particular class of grammars useful for building natural language systems. DCG's make efficient use of the full power of Prolog as a general purpose programming language. DCG's are useful in two specific ways: they serve as a description of a language and as a description of a process for analyzing that language.

In order to fully understand a language, whether a natural language or a programming language, you have to understand its grammar. The structure of a language relies on sequences of words ordered according to specific rules. These rules oversee tense, gender, voice, case, agreement, and sentence structure. These rules also apply to the descriptions, limitations, and qualifications of words and phrases.

Context-free grammars

There are many classes of grammars; each has its own notation. A familiar class of grammar is known as context-free grammar. Programming languages are often specified in context-free grammars, generally in the BNF (Backus-Naur form) notation.

The following example shows a simplistic context-free grammar in BNF that accepts the sentence "the musician plays the violin."

<sentence> ::= <noun_phrase> <verb_phrase>
<noun_phrase> ::= <determiner> <noun>
<verb_phrase> ::= <verb> <noun_phrase>
<determiner> ::= the
<noun> ::= musician | violin
<verb> ::= plays

In grammars, terminal symbols are a short hand for words found in sentences; nonterminal symbols are a shorthand for phrases of the language. The terminals for the example above include the following: the, musician, plays, and violin. The nonterminals include the following: sentence, noun-phrase, verb-phrase, determiner, noun, and verb.

Context-free grammars allow you to analyze a sentence (or string) generating a parse tree. The parse tree shows the phrase structure for a sentence and the hierarchical relationships between those phrases. A parser, on the other hand, is a program that builds parse trees.

Definite clause grammars

Based on the work of Colmerauer and Kowalski, a formalism known as definite clause grammar expresses context-free rules as logic statements. DCG's are more powerful as a class of grammars than context-free grammars are. The notation used for DCG's is syntactic sugar on top of Prolog.

A simple DCG rule can break a sentence down into a noun phrase and a verb phrase. This DCG rule looks like the following:

sentence --> noun_phrase, verb_phrase.

In order to read a rule in a definite clause grammar, you read the –> symbol to mean "can take the form" and the comma to mean "followed by". You can read the rule in the example above to mean "A sentence can take the form of a noun-phrase followed by a verb-phrase."

The definite clause grammar rule above is equivalent to the following Prolog clause:

sentence(S0, S) --> noun_phrase(S0, S1), verb_phrase(S1, S).

You can read the clause as "A sentence extends from S0 to S if there is a noun phrase from S0 to S1 and a verb phrase from S1 to S."

A DCG for a grammar that accepts the sentence "the musician plays the violin" looks like the following:

sentence --> noun_phrase, verb_phrase.

noun_phrase --> determiner, noun.

verb_phrase --> verb, noun_phrase.

determiner --> [the].

noun --> [musician].
noun --> [violin].

verb --> [plays].

DCG'S and natural language parsers

Many programming languages are not well suited for transforming grammars into natural language parsers and generators. Prolog, however, is a particularly convenient language for doing this.

Programs written in DCG notation are transformed into Prolog using the expand_term/2 predicate (described in detail later in this chapter). This transformation occurs automatically when a program written in DCG notation is compiled or consulted. Prolog translates each DCG rule into a clause that it can understand. The –> operator is transformed into a neck (:-). Prolog then translates nonterminals into predicates with two additional arguments. The first argument it adds is the list of words you passed into the nonterminal (referred to as input tokens). The final argument is the list left over when it has finished (referred to as output tokens).

Sequences of tokens are lists. The empty token is equivalent to the empty list. The empty list is written as the symbol [].

Whatever the first phrase (the noun phrase in the example below) leaves behind, the second phrase (the verb phrase) takes as its input. The whole phrase must equal the subphrases to the right of the main functor –> in order for the sentence to parse.

The Prolog equivalent to the definite clause grammar that accepts the sentence "the musician plays the violin" is:

sentence(S0,S) :- noun_phrase(S0,S1), verb_phrase(S1,S).

noun_phrase(S0,S) :- determiner(S0,S1), noun(S1,S).

verb_phrase(S0,S) :- verb(S0,S1), noun_phrase(S1,S).

determiner([the|S],S).

noun([musician|S],S).
noun([violin|S],S).

verb([plays|S],S).

DCG syntax

DCG syntax is similar to Prolog syntax. The following lists some specific rules when using DCG's:

Nonterminal symbols are written as any Prolog atom or structure that is not a list.
Terminals are written as any Prolog term. To distinguish terminals from nonterminals, a sequence of terminals is written as a Prolog list; an empty sequence is written as the empty list []. Sequences of terminals that are character lists are placed inside of double quotes.
The left hand side of a grammar rule contains only the nonterminals, optionally followed by a sequence of terminals written as a list.
The right hand side of a grammar rule may contain nonterminals and terminals.
Alternatives may be stated explicitly in the right hand side of a grammar rule, using the disjunction operator (;) or the vertical bar (|).
Extra tests are included in the right hand side of a grammar rule. These are enclosed in curly brackets ({ ... }). The DCG translation process does not change whatever is enclosed inside curly brackets.
The cut symbol may be included in the right hand side of a grammar rule. It is not necessary to enclose the cut inside the curly brackets.

Calling your DCG

Once you have written your DCG, you call it in Prolog with the predicate that performs the top-level processing in your grammar. For example, you can call the sentence predicate in the following way:

sentence([the,musician,plays,the,violin], []).

Adding additional arguments

You can add any number of arguments to nonterminals you are defining while you are using DCG's. You might want to add extra arguments, for example, to signal number agreement. Otherwise a sentence may parse, but it may not be grammatically correct. For example, both of the following examples are grammatically incorrect:

The musicians plays the violin.
The musician play the violin.

In the simple example below, you add the extra argument Number which may be instantiated to either the value singular or plural to ensure number agreement. You also add an argument for semantic type to nouns and noun phrases to differentiate a person from a thing. In this example, you want to differentiate between the noun "musician" (person) and the noun "violin" (thing) in order for the sentence to read "the musician plays the violin." If you had not made this distinction, the sentence "the violin plays the musician" could have parsed.

sentence -->
    noun_phrase(Number, person),
    verb_phrase(Number).

noun_phrase(Number, Type) -->
    determiner,     
    noun(Number, Type).

verb_phrase(Number) -->
    verb(Number),   
    noun_phrase(Number, thing).

determiner --> [the].

noun(singular, person) --> [musician].
noun(plural, person) --> [musicians].
noun(singular, thing) --> [violin].

verb(singular) --> [plays].
verb(plural) --> [play].

You can also add additional arguments to see a result of an evaluation that occurs while the parse is taking place. A good example is a simple grammar to parse and evaluate simple arithmetic expressions (involving numbers 0 through 9). In this example, "Z" is the additional argument.

expr(Z) --> term(X), [`+], expr(Y), {Z is X+Y}.
expr(Z) --> term(X), [`-], expr(Y), {Z is X-Y}.
expr(Z) --> term(Z).

term(Z) --> number(X), [`*], term(Y), {Z is X*Y}.
term(Z) --> number(X), [`/], term(Y), {Z is X/Y}.
term(Z) --> number(Z).


number(C) --> [`+], number(C).
number(C) --> [`-], number(X),{C is -X}.
number(X) --> [C], {`0 =< C, C =< `9, X is C - `0}.

The following example uses this grammar to perform a calculation:

?- expr(Z, "-2+3*5+1", []).
Z=14
yes

The following is the parse tree that is "pretty printed" for the sentence "the musician plays the violin."

sentence(   
 noun_phrase(     
  determiner(the),      
  noun(musician)
 ),
 verb_phrase(
  verb(plays),
  noun_phrase(
   determiner(the),
   noun(violin)
  )
 )
).

In order to return a parse tree, you add an extra argument to each predicate. This indicates that you want to see an entire parse tree built from the trees of the sub-phrases. If Prolog can find a noun-phrase followed by a verb-phrase, the two parse trees can be combined to form one larger parse tree that uses the functor sentence to make the tree for the entire sentence.

sentence(sentence(NP,VP) -->
    noun_phrase(NP, Number, person),
    verb_phrase(VP, Number).

noun_phrase(noun_phrase(Det, Noun), Number, Type) -->
    determiner(Det),
    noun(Noun, Number, Type).

verb_phrase(verb_phrase(Verb, NP), Number) -->
    verb(Verb, Number),   
    noun_phrase(NP, Number, thing).

determiner(determiner(the)) --> [the].

noun(noun(musician), singular, person) --> [musician].
noun(noun(musician), plural, person) --> [musicians].
noun(noun(violin), singular, thing) --> [violin].

verb(verb(play), singular) --> [plays].
verb(verb(play), plural) --> [play].

This of course is merely a toy example, but it provides a flavor of how DCGs may be used.

Transforming DCGs into Prolog

As mentioned earlier in this chapter, programs written in DCG notation are transformed into Prolog by the expand_term/2 predicate.

expand_term(+T1,-T2)

If the first term (T1) is a Prolog term, then Prolog returns the unchanged term in the value for T2. If T1 is a DCG term, Prolog returns the equivalent Prolog term in T2.

The expand_term/2 predicate is used automatically by the compiler, consult/1, and reconsult/1. It is also used by the interpreter at the top level. This predicate is not used by the read predicate.

A sample program using DCG'S

The following is a classic example used to illustrate the capabilities of definite clause grammars. This example is taken from Pereira and Warren, "Journal Of Artificial Intelligence" [1978]. This example uses arguments to build up structures that represent phrases.

Using the definite clause grammar formalism, you can write a grammar that accepts the following two sentences:

John loves Mary. Every man that lives loves a woman.

The program generates a term that expresses the logic of the accepted sentences. The program below includes the DCG to parse both of these sentences.

:- op(910,xfy,&).
:- op(920,xfy,=>).
:- op(930,xfy,:).

sentence(P) --> noun_phrase(X,P1,P), verb_phrase(X,P1).

noun_phrase(X,P1,P) -->
    determiner(X,P2,P1,P),
    noun(X,P3),
    rel_clause(X,P3,P2).
noun_phrase(X,P,P) --> name(X).

rel_clause(X,P1,P1&P2) --> [that],   verb_phrase(X,P2).
rel_clause(_,P,P) --> [].

verb_phrase(X,P) -->
    trans_verb(X,Y,P1),
    noun_phrase(Y,P1,P).
verb_phrase(X,P) -->
    intrans_verb(X,P).

trans_verb(X,Y,loves(X,Y)) --> [loves].

intrans_verb(X,lives(X)) --> [lives].

determiner(X,P1,P2, all(X):P1=>P2) -->     [every].
determiner(X,P1,P2, exists(X):P1&P2) -->   [a].

noun(X,man(X)) --> [man].
noun(X,woman(X)) --> [woman].

name(john) --> [john].
name(mary) --> [mary].

By supplying phrases to the sentence predicate, you can discover the logic connected with the phrases:

?- sentence(X,[john,loves,mary],[]).
X = loves(john,mary)
yes

?- sentence(X,[every,man,that,lives,loves,a,woman],[]).
X = all(A) : (man(A) & lives(A)) => exists(B) : woman(B) & loves(A,B)
yes

In this example DCG, the last argument in the phrase specifies the meaning of the phrase. The meaning of the phrase can also depend on other factors, provided by other arguments in the program. For example, the verb lives represents lives(X) where X stands for a person who lives. The verb lives does not mean anything outside of the context of someone who lives. In other words, neither the verb lives nor the value of X is of much use without the context in which both are used. As a result, the definition in this program says that for any X, when the verb is applied to X, the actual meaning is lives(X).