Dedicated to Jan von Plato on his 50th birthday.
Representation 2 Representation 3 Representation 4 Representation 5 Representation 6
In the study of languages, printed inflection tables still flourish. While the rules for computing forms of words are known by students and teachers of languages, they might be so only partially, and there is so much incertainty involved when selecting a rare form to be used in a critical publication that inflection tables serve as invaluable memory aids. There is a massive publishing business of inflection tables, such as the French Bescherelle series.
At the same time as inflection tables have a market among consumers, linguists are confident in formalizing and implementing the rules that could easily automatize the production of such tables and replace them by computer programs - even ones that are concise enough to be implementable in hand-held devices. The present paper exemplifies this line of work. We will describe a program that implements one representative of a notoriously complex family of inflectional systems: Italian verbs, a typical example of verb systems in Romance languages. The program will cover all the inflection tables in the volume Les verbes italiens of Collection Bescherelle.
The computer program that we present is usable in many more ways than a printed text. Examples of these ways are:
The main stream of computational morphology uses finite-state transducers (that is, finite-state automata over symbol pairs) to implement synthesis and analysis. Special-purpose languages have been developed to facilitate the production of transducers on a reasonably high level, including the language of XFST (Xerox Finite State Tool). One outcome of the present work is XFST code generated from the program. This code can be fed into the XFST compiler to produce a finite-state transducer. The question we pose is whether it more convenient to use a general-purpose language to generate XFST code than to write the code directly in the XFST language.
The closest point of comparison to this work is our own recent implementation of the French conjugation à la Bescherelle in the Grammatical Framework, GF. GF is a special-purpose language built upon a computational model of linguistic structures but inheriting means of expression from typed functional languages. One outcome of the present work is low-level GF code generated from the morphology program: the question we are interested in is analogous to the XFST question, namely, whether it would have been more appropriate to use plain Haskell to produce the GF code. We will also address the more general question whether GF could be implemented as an embedded language in Haskell.
The general conclusion we will draw is that, for the practical use of a morphology program, it does not matter so much how it is produced, but only that it is in an easily portable format. From a carefully chosen format, very little effort is needed to translate the morphology into other formats: GF, XFST, and LaTeX typesetting code for inflection tables were each produced by a Haskell program of approximately ten lines, which is independent of the size of the morphology.
From the theoretical point of view. however, the value of a morphological description comes from the generalizations that it is able to make. These are features of the source code no longer visible in the compiled program. In this respect, functional programming languages appear as superior to most other languages: whatever general rule you may find that governs your data, you will be able to express it by a first-class object of a functional language.
data Numero = Sg | Pl data Genere = Masc | Fem data Persona = P1 | P2 | P3define the parameter types of number, gender, and person. respectively.
type Nome = Numero -> StringThus the noun vino could be defined
vino :: Nome
vino n = case n of
Sg -> "vino"
Pl -> "vini"
This definition, however, does not capture the generalization
traditionally expressed by declensions: vino
belongs to a large class of nouns all of which are inflected in
a similar way. It is better that we introduce a function
expressing this declension,
nom2 :: String -> Nome
nom2 vin n = case n of
Sg -> vin ++ "o"
Pl -> vin ++ "i"
and define vino in terms of this function:
vino = nom2 "vin"
Italian has a small number of noun declensions, the most frequent ones of which are
nom1, nom2, nom3 :: String -> Nome
nom1 ran n = case n of
Sg -> ran ++ "a"
Pl -> ran ++ "e"
nom2 vin n = case n of
Sg -> vin ++ "o"
Pl -> vin ++ "i"
nom3 sal n = case n of
Sg -> sal ++ "e"
Pl -> sal ++ "i"
Besides number, nouns also have a gender, but in a different way than number: the gender of a noun does not produce different forms of the noun, but is something inherently belonging to the noun. Thus, for instance, vino is masculine and mano is feminine. Inherent parameters could be formalized in Haskell by using tuples:
type Nome' = (Numero -> String, Genere) vino' = (nom2 "vin", Masc)But we will postpone the formalization of inherent features, since it is not needed in the inflection of verbs.
type Aggettivo = Genere -> NomeThe main adjective declensions are then defined easily by using the already defined noun declensions:
agg1, agg3 :: String -> Aggettivo
agg1 ner g = case g of
Masc -> nom2 ner
Fem -> nom1 ner
agg3 verd g = case g of
_ -> nom3 verd
To avoid generating spurious verb forms, we define a hierarchical parameter type that only covers those forms that really exist:
data VForm =
Inf -- infinitive
| Indic Tempo Numero Persona -- indicative
| Pass Numero Persona -- remote past
| Fut Numero Persona -- future
| Cong Tempo Numero Persona -- conjunctive
| Cond Numero Persona -- conditional
| Imp PersonaI -- imperative
| Ger -- gerund
| Part TempoP Genere Numero -- participle
data Tempo =
Pres -- present
| Imperf -- imperfect
data TempoP =
PresP -- present participle
| PassP -- past participle
data PersonaI = -- person-number combinations for imperative
IPs2 | IPs3 | IPp1 | IPp2 | IPp3
In this hierarchy, we follow closely the structure and terminology
of the Italian Bescherelle. What should be noticed, in particular,
is that Pass and Fut are not included in Tempo,
so that we can share Tempo between the indicative and the conjunctive.
Given the type VForm expressing the hierarchy of verb parameters, we define verbs as functions from this type to strings:
type Verbo = VForm -> StringThus we are ready to define our conjugations, i.e. inflection tables for verbs: they are functions from strings to Verbo,
type Coniugazione = String -> Verbo
v30chiudere :: Coniugazionewhose application instance
v30chiudere "chiu"produces the forms of the verb chiudere can equally well be applied in
v30chiudere "illu"to produce the forms of the verb illudere. In other words, the one who extends the lexicon by adding a new verb will only have to choose the conjugation and give one string, the stem to which the conjugation is to be applied. The Italian Bescherelle has an index of 8000 verbs, where each verb is equipped with a reference to the corresponding table.
The names of conjugations have the following structure: the letter v followed by the number of a table in Bescherelle followed by the infinitive of the verb that Bescherelle uses in the table.
riandare = v14andare "ri" `except` [(Indic Pres Sg P3,"rivà")]where the function except is defined in the obvious way:
except :: Eq a => (a -> b) -> [(a,b)] -> a -> b except f es p = maybe (f p) id $ lookup p esThe question of course arises what is a conjugation of its own and what is an exception. We have simply followed Bescherelle's decisions here.
Sometimes exceptions come in groups, and they are conveniently gathered in list comprehensions. An example is the frequently deviant present participle:
exceptPart finent =
[(Part PresP g n, agg3 finent g n) | g <- values, n <- values]
The hierarchic parameter type VForm has 57 parameter values, none of which is spurious. The table created by applying a verb to each of the 57 parameter values looks almost exactly the same as a table in the Italian Bescherelle (the most important exception being that we do not show compound verb forms, which are easily derived from the past participle forms and the conjugations of auxiliary verbs). It is never the case that we have to list all 57 forms to define a conjugation: usually a number of two to four stems is enough. This economy is based, first of all, on sets of suffixes, one suffix per number-person combination:
type UnoNumero = Persona -> String type UnoTempo = Numero -> UnoNumero type SoffNumero = (String, String, String) type SoffTempo = (SoffNumero,SoffNumero)To use a suffix set to conjugate a verb is simply to select the correct form:
useSoffNumero :: String -> SoffNumero -> UnoNumero
useSoffNumero am (o,i,a) p = am ++ case p of
P1 -> o
P2 -> i
P3 -> a
useSoffTempo :: String -> SoffTempo -> UnoTempo
useSoffTempo am (sg,pl) n p = case n of
Sg -> useSoffNumero am sg p
Pl -> useSoffNumero am pl p
For instance, the imperfect indicative, the future, the imperfect
conjunctive, and the conditional are almost always formed by the
following suffixes:
soffImperf, soffFut, soffCongImp, soffCond :: SoffTempo
soffImperf = (("vo","vi","va"),("vamo","vate","vano"))
soffFut = (("rò","rai","rà"),("remo","rete","ranno"))
soffCongImp = (("ssi","ssi","sse"),("ssimo","ste","ssero"))
soffCond = (("rei","resti","rebbe"),("remmo","reste","rebbero"))
The other tenses have more variation in suffixes, but still usually
employ one of a small number of sets, such as those for the present
indicative, remote past, and present conjunctive of the first conjugation:
soffPresAre = (("o","i","a"),("iamo","ate","ano"))
soffPassAre = (("ai","asti","ò"),("ammo","aste","arono"))
soffCongAre = (("i","i","i"),("iamo","iate","ino"))
In general, it is these three groups that characterize a conjugation.
Thus it is convenient to group them together:
soffAre = (soffPresAre, soffPassAre, soffCongAre)
Most conjugations can be created by the following "worst case macro", which implements all the generalizations that are valid for (almost) all conjugations. The macro takes as arguments seven strings and a triple of suffix sets.
mkVerbo :: String -> String -> String -> String -> String -> String -> String ->
(SoffTempo, SoffTempo, SoffTempo) -> Verbo
mkVerbo am amia ama amo ame ami amat (pres,pass,cong) v = case v of
Inf -> ama ++ "re"
Indic Pres n p -> useSoffTempo (atonPres (n,p) amia am) pres n p
Indic Imperf n p -> useSoffTempo ama soffImperf n p
Pass n p -> useSoffTempo (brevPass (n,p) amo amia) pass n p
Fut n p -> useSoffTempo ame soffFut n p
Cong Pres n p -> useSoffTempo (atonPres (n,p) amia am) cong n p
Cong Imperf n p -> useSoffTempo ama soffCongImp n p
Cond n p -> useSoffTempo ame soffCond n p
Imp IPs2 -> ami
Imp IPp2 -> mkv (Indic Pres n p) where (n,p) = imp2pers IPp2
Imp pi -> mkv (Cong Pres n p) where (n,p) = imp2pers pi
Ger -> ama ++ "ndo"
Part PresP g n -> agg3 (ama ++ "nt") g n
Part PassP g n -> agg1 amat g n
where
brevPass np x y = if elem np [(Sg,P1),(Sg,P3),(Pl,P3)] then x else y
atonPres np x y = if elem np [(Pl,P1),(Pl,P2)] then x else y
mkv = mkVerbo am amia ama amo ame ami amat (pres,pass,cong)
In addition to the suffixes, the most important generalizations
expressed by this macro are the following:
v6amare :: Coniugazione v6amare am = mkVerbo am am (am++"a") am (am++"e") (am++"a") (am++"at") soffAreThe second conjugation comes in two variants: one with stress on the penultimate syllable in the infinitive (= the Latin second conjugation),
v20temere :: Coniugazione
v20temere tem =
mkVerbo tem (tem++"e") tem (tem++"e") (tem++"i") (tem++"ut") soffEre
the other with stress on the antepenultimate (= the Latin third conjugation).
This latter case provides too much variation really to be defined as
one conjugation; the Bescherelle has about 70 different
tables! It was, however, mostly pure enjoyment to define these tables,
with the help of the macros that tell whether the verb needs
four, three, or two different variants of the stem:
vReGen :: String -> String -> String -> String -> Coniugazione
vReGen v vv r vut be =
mkVerbo (be++v) (be++v++"e") (be++vv) (be++r) (be++v++"i") (be++vut) soffRe
vReGen3 :: String -> String -> String -> Coniugazione
vReGen3 d s st = vReGen d s (d ++ "e") st
vReGen2 :: String -> String -> Coniugazione
vReGen2 nd s = vReGen3 nd s s
Here some examples:
v21accendere = vReGen2 "nd" "s" v22affiggere = vReGen2 "gg" "ss" v23ardere = vReGen2 "d" "s" v25assolvere = vReGen3 "v" "s" "t" v26assumere = vReGen3 "m" "ns" "nt" v28cadere = vReGen "" "d" "" "ut" v29chiedere = vReGen3 "d" "s" "st" v30chiudere = v23ardere -- !!! v31cingere = vReGen3 "g" "s" "t"The third conjugation (corresponding to the fourth conjugation in Latin) has two major peculiarities: the appearance of isc in some verbs (e.g. finire, finisco in contrast to sentire, sento) and the present participle and gerund forms with the vowel e instead of i, as would be expected from the infinitive. To handle this in a uniform way, we introduce a generic third conjugation macro,
vIreGen :: String -> Coniugazione
vIreGen isc fin =
let fini = fin ++ "i" in
mkVerbo fin fini fin fini (fin ++isc++"i") (fini ++"t") (soffIre isc) `except`
((Ger,fin ++ "endo") : exceptPart (fin ++ "ent"))
and define the individual conjugations as special cases:
v99sentire = vIreGen "" v100finire = vIreGen "isc"
v27bere be = vReGen "v" "vv" "r" "vut" be `except` [(Inf,be ++ "re")]
v24assistere assist =
vReGen3 "" "" "it" assist `except`
[(Pass n p, assist ++ suff) |
(n,p,suff) <- [(Sg,P1,"ei"),(Sg,P3,"é"),(Pl,P3,"erono")]]
Notice how convenient it is to keep the exceptions in a list: we can use
list comprehension to generate exceptions in a way that captures
a generalization.
changePref :: [(String,String)] -> String -> String
changePref cs t = case cs of
(s,r) : rs -> if isPrefixOf s t then (r ++ drop (length s) t) else changePref rs t
_ -> t
The changes function matches sunstrings against a list of
search-replace pairs. Its full definition is in the file
Generale.hs.
Here are the main examples of phonological changes in the first conjugation:
v7cercare cer v = cer ++ changePref [("ce","che"),("ci","chi")] (v6amare "c" v)
v8legare le v = le ++ changePref [("ge","ghe"),("gi","ghi")] (v6amare "g" v)
v9cominciare comin v =
comin ++ changePref [("cii","ci"),("cie","ce")] (v6amare "ci" v)
v10mangiare man v =
man ++ changePref [("gii","gi"),("gie","ge")] (v6amare "gi" v)
Regular phonological changes are not so common in the other conjugations:
in the course of language development, they have been integrated in
the conjugations in subtle ways. Occasionally, however, we
have found a phonological change in describing what otherwise would
require a massive list of exceptions:
v86spegnere spe v =
spe ++ changePref [("gno","ngo"),("gna","nga")] (vReGen3 "gn" "ns" "nt" "" v)
conjug1 :: String -> Verbo
conjug1 amare = v6amare (dropSuffix 3 amare)
conjug2a :: String -> Verbo
conjug2a temere = v20temere (dropSuffix 3 temere)
conjug2b :: String -> String -> String -> String -> String -> Verbo
conjug2b chiedere chiedo chiesi chiedero chiesto =
mkVerbo (dropSuffix 1 chiedo) (dropSuffix 3 chiedere)
(dropSuffix 2 chiedere) (dropSuffix 1 chiesi)
(dropSuffix 2 chiedero) (dropSuffix 1 chiedo ++ "i")
(dropSuffix 1 chiesto) soffRe
conjug3 :: String -> String -> Verbo
conjug3 finire finisco = vIreGen isc fin where
(fin,isc) = case splitAt (length finisco - 3) (init finisco) of
(fin,"isc") -> (fin,"isc")
_ -> (init finisco,[])
Notice the use of the function
dropSuffix :: Int -> [a] -> [a] dropSuffix i s = take (length s - i) swhich seems to say that the inflection "erases" letters from a verb form. While this is hardly a linguistically realistic explanation of a conjugation, it is a common everyday understanding of inflection. Computationally, accepting "erasure" of this kind has no strange consequences, since it is just a shorthand for telling where the stem boundary goes.
table :: (Enum a, Bounded a) => (a -> b) -> [(a,b)] table f = [(v, f v) | v <- [minBound .. maxBound]]We define instances of the standard Haskell classes Enum and Bounded for morphological parameter types so that we can use the same function table not only for verbs but for other word classes as well.
Once we have a table, we can print it e.g. in the Latex format:
prLatexTable :: (Show a) => [(a,String)] -> String
prLatexTable tab =
"\\begin{center}\n\\begin{tabular}{|l|l|}\\hline\n" ++
unlines [show a ++ " & {\\em " ++ b ++ "} \\\\" | (a,b) <- tab] ++
"\\hline\n\\end{tabular}\n\\end{center}\n\\newpage\n\n"
We can also produce a sorted
full-form lexicon, which
can be efficiently compiled to a binary search tree or to
a trie to perform morphological analysis.
The Latex tables, full-form lexicon, and other related formats are defined in the file Usaggio.hs. One should notice that these operations are in no way specific to Italian grammar, but can be applied to any string-valued functions on finite datatypes.
Writing an efficient transducer compiler would be a substantial work, and we have therefore chosen to rely on Xerox Finite State Compiler. Thus we define, in the file Usaggio.hs, a function than transforms a list of tables into an XFST source file.
prXFSTRules :: (Show a) => String -> [[(a,String)]] -> String
prXFSTRules cat tabs =
"define " ++ cat ++
" [\n" ++ unlines (intersperse " |" (map prEntry tabs)) ++ " ] ;"
where
prEntry tab@((_,s):_) = concat (intersperse " |\n" (map (prForm s) tab))
prForm stem (a,b) = changes [("{*}","%*")]
(" [ {"++ stem ++"}" ++prTags a ++ " .x. {" ++ b ++"}]")
prTags a = unwords [" %+" ++ t | t <- words (show a)]
This source defines one gicantic regular expression, named after the
desired category cat. The file looks like
this.
prGFOper :: (Show a) => String -> [(a,String)] -> String
prGFOper cat tab@((_,f):_) =
"oper " ++ f ++ " : Lin " ++ cat ++ " = table {\n" ++
concat (intersperse " ;\n" [" "++ show a ++ " => "++ show b | (a,b) <- tab]) ++
"\n } ;\n"
or creating both an abstract and a concrete syntax (as in
this GF file).
The latter kind of files are readily usable for GF applications,
such as morphological analysis and synthesis, as well as
a language-training program generating random
exercises (instructions here).
The former kind of files are more suitable as resources,
i.e. as auxiliary grammars that the application grammars proper
can import.
Resource grammars look quite similar to XFST source files, consisting of entries such as
oper amare : Lin V = table {
Inf => "amare" ;
Indic Pres Sg P1 => "amo" ;
Indic Pres Sg P2 => "ami" ;
Indic Pres Sg P3 => "ama" ;
Indic Pres Pl P1 => "amiamo" ;
---
Part PassP Masc Pl => "amati" ;
Part PassP Fem Sg => "amata" ;
Part PassP Fem Pl => "amate"
} ;
There are 57 forms for each verb. However, when a resource grammar
is imported by an application grammar, GF may notice that
only some of the forms are actually needed. For instance,
a grammar for mathematical language typically only needs four forms:
the present indicative and conjunctive third person singular and plural forms.
The massive inflection table is then reduced to a much smaller table,
table {
Ind Sg => "ama" ;
Ind Pl => "amano" ;
Con Sg => "ami" ;
Con Pl => "amino"
}
so that the run-time environment of GF needs less memory and the
applications run faster. This type of program optimization,
which exploits the fact that the input of a funtion is partly
known at compile time, is known as partial evaluation.
The way in which GF uses partial evaluation is very similar
to the solution of the
aircraft crew planning problem by Lennart
Augustsson.
XFST is one way to implement finite-state morphology, another way being the two-level morphology of Kimmo Koskenniemi. In parallel to XFST, researchers at Xerox have developed the language TWOLC, which implements the rule format of two-level morphology rather than regular expressions, but which is also compiled to finite-state transducers.
The main arguments for using dedicated morphology languages such as XFST and TWOLC are
The high-level point is a strong case for Haskell: it is possible to define much stronger generalizations in a functional language than in a dedicated finite-state language, for instance, by using higher-order functions. In XFST, only constant macros ara available (although it would be rather straightforward to add non-recursive macros with arguments). It may still be the case, though, that the customized notations of TWOLC and XFST are easier for the working linguist to learn than the general, abstract notions of Haskell. There are, for instance, nice ways of expressing phonological changes, e.g.
c -> c h || _ e | isays that c becomes ch when followed by e or i. There is, moreover, yet another language in the XFST family, LEXC, with a format specifically tailored for defining lexica with classes of words behaving in the same ways, such as the conjugations of Bescherelle.
The way we can tailor Haskell for morphology and lexica is by creating libraries, in a way building an embedded language of morphology. Typical examples of embedded language constructs are the except and changePref functions used in the present work.
The efficiency issue is settled by the compilation of inflection functions into finite-state transducers. In this way, we attain the same efficiency as programs whose source is written in a finite-state language.
The one-source idea and reusability are similarly consequences of the compilation. There is no problem even in using Haskell source in combination with XFST source in this way, since we can generate XFST code from Haskell code.
One aspect that is not well supported by the XFST is correctness: there is neither type checking to guarantee that only meaningful parameters are used for a word class, nor completeness checking to guarantee that all required forms are defined for a given word.
What remains outside the reach of the ideas of the present paper but within the reach of XFST are transducers that exploit the full power of finite-state methods, using the Kleene closure, which compiles into cyclic transducers. Our inflection tables are in the end just a special case of transducers built by alternation and sequencing, but without the Kleene closure. Sophisticated ideas for creating cyclic automata by suitable combinators have been presented in functional programming literature. However, we see no motivation for going beyond finite tables in the morphology of Italian and other languages in which inflection is non-recursive.
Out main concern was the level of abstraction and factorization that the description can attain. In this respect, GF and Haskell turned out to be quite similar: both can define a higher-order function to capture almost any conceivable generalization. This is with such exceptions as erasure of characters (as in the school grammar patterns) and the definitions of exceptions by list comprehension, which are not available in GF.
In contrast to Haskell, GF has a built-in compilation of finite functions into tables (without having to tweak with instantiating the classes Enum and Bounded), as well as a type-checker that guarantees the termination and completeness of inflection tables. We can, however, force Haskell to perform these checks simply by print the inflection tables: if there is a loop or a missing case, it will come out in the course of printing.
As regards code consumption, the Haskell source for the Italian conjugation and and the GF source for the French conjugation both have about the same figure, around 500 lines/20kB. Part of the length of the GF code comes from the fact that GF more often requires type signatures; the Italian grammar, on the other hand, is slightly larger than the French one.
All in all, those who have experience in Haskell programming, should choose Haskell rather than GF for programming morphologies, even ones to be used in GF. Those who want to do everything in GF, however, will not lose enormously if doing so.
Simple pretty-printers were defined for producing other representations of the morphology: Latex-typeset inflection tables, XFST code, and GF code.
Given that other formats, which are possibly required by linguistic applications, are so easily generated from the Haskell source, we would seriously consider implementing the morphology part of any future language-processing project in Haskell, and generate the code we need in special-purpose formalisms from this implementation. What speaks for Haskell is the arbitrary high level of abstraction achieved by higher-order types, which is hard to attain in special-purpose languages.
We have not used the most idiomatic features of Haskell in an essential way: what we have done could have equally well been carried out in another functional language, such as ML.
To make this way of implementing morphology more accessible for the working linguist, we have started - but far from finished - the development of a combinator library that enables morphology programming in an economical and intuitive way.