Department of Sociolinguistics - Hungarian Academy of Sciences
The paper presents a statistical examination of the Hungarian noun paradigm. The results are based on a tagged corpus that was collected and analysed for the Historical Dictionary of Hungarian. The process of building and analysing the corpus is described briefly, together with a summary of the Hungarian noun paradigm. The frequencies of the possible noun suffix combinations are presented in tables. The results can be used in further corpus-linguistic studies (statistical taggers, syntactic analysers, etc.).
Introduction
The very first result in Hungarian computational lexicography was the monograph of [5], which was based on an analysis of the database of the Dictionary of Hungarian (Magyar Értelmező Szótár). This database offered the first opportunity to study the most widely used Hungarian lexemes from several viewpoints. The project to computerize this dictionary was carried out in the sixties; at that time it was a pilot project even by international standards.
Today the possibilities offered by computers, computerized dictionaries and corpora are very different from those of that era. Although no complete Hungarian monolingual explanatory dictionary exists in computerized form yet, we do at least possess a computerized corpus of Hungarian, compiled during the last decade as the source material for the planned Historical Dictionary of Hungarian. The morphological analysis of the currently available corpus was carried out recently, which, among other things, made it possible to examine the behavior of Hungarian suffixes in running text. This paper presents the first statistical results of the analysis of the noun paradigm.
The process of the investigation
The corpus of Hungarian contains short excerpts taken from a variety of sources. It is, or at least was meant to be, a balanced corpus of the language. The samples to be recorded were selected by literary historians. The on-line corpus currently contains 17,000,000 running words: 7 million taken from the literature of the 19th century, the rest from the 20th. The enlargement and correction of the corpus are still in progress, so any results presented here can only be partial.
It was soon realised that retrieving the bare running words would not be adequate for lexicographic purposes, since Hungarian morphology is much too complex. In many cases the lexicographers could not easily find all the occurrences of the word they were looking for, either because some roots change before certain endings, or because the inflected forms of one word coincide with other words [3]. Therefore a morphological analyser programme was developed [6],[7] and applied to the corpus. The programme is able to analyse the running words even when the actual root differs from the lexeme, and it can correctly segment even the most complicated suffixed words. For example, the running word lovaimét (a complex possessive form of the word `horse`, in the accusative case) is segmented by the programme in this way:
ló[FN]=lov+aim[PSe1i]+é[POS]+t[ACC]
where ló is the lexeme for `horse` and lov is the actual root; aim is a complex first person singular possessive suffix containing the infix i, which marks the possession as plural (`I have more than one horse`). The é is a different possessive suffix, here meaning something that belongs to my horses (an anaphoric meaning, see [1]), and t is the accusative ending. The codes in square brackets identify the part of speech of the lexeme and the type of each suffix.
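The notation above can be read mechanically. The following is a minimal sketch in Python, not the analyser's own code: the names `Morph`, `parse_segment` and `parse_analysis` are hypothetical, and the sketch assumes that an analysis with no '=' sign means the actual root is identical to the lexeme.

```python
import re
from typing import List, NamedTuple, Tuple


class Morph(NamedTuple):
    form: str  # surface form of the segment, e.g. "aim"
    tag: str   # code from the square brackets, e.g. "PSe1i"


# One "form[TAG]" segment; the form may not contain brackets, '+' or '='.
SEGMENT = re.compile(r"([^\[\]+=]+)\[([^\]]+)\]")


def parse_segment(text: str) -> Morph:
    match = SEGMENT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a form[TAG] segment: {text!r}")
    return Morph(*match.groups())


def parse_analysis(analysis: str) -> Tuple[Morph, str, List[Morph]]:
    """Return (lexeme, actual root, suffixes) for one analysis string."""
    head, sep, tail = analysis.partition("=")
    if sep:
        # "lexeme[POS]=root+suffix[TAG]+...": the root differs from the lexeme.
        lexeme = parse_segment(head)
        root, *rest = tail.split("+")
    else:
        # Assumption: without '=', the root is the lexeme itself.
        first, *rest = analysis.split("+")
        lexeme = parse_segment(first)
        root = lexeme.form
    return lexeme, root, [parse_segment(part) for part in rest]


lexeme, root, suffixes = parse_analysis("ló[FN]=lov+aim[PSe1i]+é[POS]+t[ACC]")
print(lexeme.form, lexeme.tag, root, [s.tag for s in suffixes])
```

Keeping the surface form and the code together in one record makes it possible to count either the code combinations or the individual suffix forms later on.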
Since a large proportion (30%) of the running words in the texts can have more than one correct analysis, the question of homograph disambiguation also had to be solved. For this purpose a tool that uses local rules for disambiguation was developed and tested on the corpus. This programme examines the context of each word that has more than one analysis and tries to choose the appropriate one [2]. It is still under development, but it has been run on the whole corpus. Where it could not decide among the alternatives, a postprocessor chose the first analysis given by the morphological analyser. The error rate of this process has yet to be checked, so any statistical observation based on the analysed corpus can only be considered a preliminary result. We think, however, that these observations already show the trend in the use of the suffixes and suffix combinations.
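The control flow described here (try local rules, otherwise fall back to the analyser's first analysis) can be sketched as follows. This is an illustration only: the actual rules of [2] are not given in this paper, so the rule `noun_after_article`, the verb code `IGE` and the example analyses are hypothetical.

```python
from typing import Callable, Optional, Sequence

# A rule inspects the candidate analyses plus the left and right context and
# returns the analysis it prefers, or None when it cannot decide.
Rule = Callable[[Sequence[str], Sequence[str], Sequence[str]], Optional[str]]


def noun_after_article(analyses, left, right):
    """Toy rule (hypothetical): after the article 'a'/'az' prefer a noun reading."""
    if left and left[-1].lower() in {"a", "az"}:
        for analysis in analyses:
            if "[FN]" in analysis:
                return analysis
    return None


def disambiguate(analyses: Sequence[str], left: Sequence[str],
                 right: Sequence[str], rules: Sequence[Rule]) -> str:
    if len(analyses) == 1:
        return analyses[0]
    for rule in rules:
        choice = rule(analyses, left, right)
        if choice is not None:
            return choice
    # Fallback described in the text: keep the analyser's first analysis.
    return analyses[0]


candidates = ["vár[IGE]", "vár[FN]"]  # hypothetical analyser output
decided = disambiguate(candidates, ["a"], [], [noun_after_article])
undecided = disambiguate(candidates, ["ő"], [], [noun_after_article])
```

Because the fallback always returns the first analysis, the order in which the analyser lists its alternatives directly affects the statistics of the undecided cases.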
From the analysed corpus the theoretically possible noun suffix combinations were selected using Perl regular expressions. The result is a database of noun suffixes which contains the code combinations (e.g. [FN][PSe1i][POS][ACC] in the example above) as well as the actual form of each suffix (aim, é, t), because in a later study we would like to examine which forms of the suffixes are most widely used. The summaries were prepared from this database.
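A rough sketch of such a selection step is given below. The original work used Perl; Python regular expressions are substituted here, and the function names, the patterns and the two-item input list are assumptions made for illustration (the two analysis strings themselves are quoted from this paper).

```python
import re
from collections import Counter

TAG = re.compile(r"\[([^\]]+)\]")        # every code in square brackets
FORM = re.compile(r"\+([^\[\]+]+)\[")    # every suffix form: "+form["


def code_combination(analysis: str) -> str:
    """'ló[FN]=lov+aim[PSe1i]+...' -> '[FN][PSe1i]...'"""
    return "".join(f"[{tag}]" for tag in TAG.findall(analysis))


def tally(analyses):
    """Count code combinations and (code, form) pairs for noun analyses only."""
    combinations, forms = Counter(), Counter()
    for analysis in analyses:
        tags = TAG.findall(analysis)
        if not tags or tags[0] != "FN":  # keep nouns only
            continue
        combinations[code_combination(analysis)] += 1
        # The first tag belongs to the lexeme, the rest to the suffixes.
        for form, tag in zip(FORM.findall(analysis), tags[1:]):
            forms[(tag, form)] += 1
    return combinations, forms


combinations, forms = tally([
    "ló[FN]=lov+aim[PSe1i]+é[POS]+t[ACC]",
    "szellem[FN]+ük[PSt3]+éi[POSi]+vel[INS]",
])
```

Counting the code combinations gives frequency tables of the kind summarised in this paper, while the (code, form) pairs keep the information needed for the planned study of suffix variants.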
The Hungarian noun paradigm
As the example above indicates, nouns can carry rather complex suffixes in Hungarian. While the English noun hotel has two forms, hotel and hotels, its Hungarian equivalent can have the following forms:
In this table the possible forms of the suffixes stand in front of the '=' sign. Where more than one form can follow the word, the alternatives are separated by '/'. Not all Hungarian nouns can be followed by so many alternative forms. Hotel, like many loan words, can take both the low and the high form of the same suffix. The possessive suffix in the second column has not only low/high alternatives but a/ja, e/je, etc. alternatives as well. Loan words tend to take the version beginning with 'j' (Papp 1974), and hence this word may have even more alternative suffixed forms than most nouns. The infix i, as shown earlier, can be inserted in the possessive suffix; the analyser handles the result as one complex suffix. The other i, which occurs after the anaphoric possessive suffix é, also marks the plural. The 'normal' plural is coded PL; this is simply the plural form of the noun, which can be followed by the possessive coded POS (or its plural) and by the case suffixes.
The result of the analysis
The overall summary of the noun suffix database can be seen in Table 1. The first column contains the number of nouns suffixed only by a case ending. The succeeding columns are: PL and case suffix, POS and case suffix, the plural form of POS and case suffix, the combination of PL and POS and case suffix, etc. PS stands for the sum of the other kind of possessive suffix (coded PSe1, PSe2, etc.). From this table we can see that although the highly complex suffixed forms described above are theoretically possible, on the whole they occur very rarely. In fact the most complicated possibility, PSi+POSi, did not occur at all (which is why it was not included in the statistics). The less complicated possibility PS+POSi occurred 24 times altogether in the corpus, but it was nearly always a suffixed form of the pronoun maga (magadei, magunkei 'the objects belonging to you, the objects belonging to us'). The only noun with this suffix combination was szellem[FN]+ük[PSt3]+éi[POSi]+vel[INS] 'with their mind's belongings'. This single occurrence was also left out of the table.
One can also note that the case ending SOC (sociative), meaning 'together with someone or something', does not behave like the other case endings: it cannot be preceded by another suffix.
Tables 2 and 3 contain the details of the possessive endings and their combinations. The plural forms (PSe1i, ..., PSt3i) are much less frequent (about 20% of the singular). In the possessive paradigm the third person singular is the most frequent and the first person singular the second, before any case suffix. From Table 3 we can conclude that combinations of the two kinds of possessive suffix are very rare even in their singular forms, but they do occur, even followed by a case ending.
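The proportions quoted above can be recomputed from the SUM row of Table 2. The short Python sketch below only restates that arithmetic; the variable names are illustrative.

```python
# SUM row of Table 2: possessive suffixes before any case ending.
singular = {"PSe1": 38617, "PSe2": 12703, "PSe3": 377539,
            "PSt1": 14502, "PSt2": 896, "PSt3": 15269}
plural = {"PSe1i": 3661, "PSe2i": 1569, "PSe3i": 70098,
          "PSt1i": 4283, "PSt2i": 229, "PSt3i": 6303}

total_singular = sum(singular.values())  # SUMMA PS
total_plural = sum(plural.values())      # SUMMA PSi

# Plural possession relative to singular possession.
plural_to_singular = total_plural / total_singular

# Share of each person within the singular-possession forms, most frequent first.
ranking = sorted(singular, key=singular.get, reverse=True)
shares = {tag: singular[tag] / total_singular for tag in ranking}

print(f"PSi / PS = {plural_to_singular:.3f}")
for tag in ranking:
    print(f"{tag}: {shares[tag]:.1%}")
```

The ratio comes to roughly 18.7%, in line with the "about 20%" stated in the text, and the ranking starts with PSe3 followed by PSe1.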
Summary
The results presented here are the first to be based on a relatively large corpus of Hungarian (17 million words). With this study we hope to contribute to Hungarian corpus linguistics research. It can provide useful information for statistics-based taggers, and the method can serve as a model for similar examinations of the inflectional system of verbs or the paradigm of adjectives in Hungarian.
Table 1. Noun suffix combinations by case ending (the CASES column counts nouns suffixed by a case ending only).
Case     CASES       PL      POS     POSi   PL+POS  PL+POSi       PS      PSi   PS+POS  PSi+POS    SUMMA
ABL      20076     4082       75        0       35        0     6799     1832        7        4    32910
ACC     315093    66096      567        2      140        4   153840    35016       73       11   570842
ADE      12408     3510       78        0       25        0     4878      862        4        2    21767
ALL      22479     3962      150        4       52        0     9815     1802       10        4    38278
CAU       6558      897        2        0        0        0     3225      467        3        0    11152
DAT      89685    16930       23        1       14        1    41360    11412        8        0   159434
DEL      21068     4840       18        0        3        0    11490     2377        5        1    39802
ELA      31620     6063       42        0        9        0    16376     3137        9        0    57256
FAC       9128      976       12        0       11        0     1970      222        0        0    12319
FOR       4613      282        1        0        0        0     1134       90        0        0     6120
ILL      53023     4142       52        0       11        0    26187     2041        6        1    85463
INE     124146    18913       57        0       15        0    63239     8317       10        0   214697
INS     102592    23372      200        4       57        4    31822     8563       24        6   166644
SOC        292        0        0        0        0        0        0        0        0        0      292
SUB      87597    12177       78        0       20        0    36333     4407        5        4   140621
SUP      87719    11976       94        0       32        0    48147     5497        8        0   153473
TEM       1853       89        0        0        0        0      648       71        0        0     2661
TER      14558     1804        1        0        0        0     2212       30        1        0    18606
SUMMA  1004508   180111     1450       11      424        9   459475    86143      173       33  1732337
Table 2. The possessive (PS) suffixes by person and number, before each case ending.
CASES   PSe1   PSe2   PSe3   PSt1   PSt2   PSt3  PSe1i  PSe2i  PSe3i  PSt1i  PSt2i  PSt3i   SUMMA PS  SUMMA PSi
ABL      608    221   5552    203     13    202    111     28   1463     85      6    139       6799       1832
ACC    13460   4123 125386   4326    299   6246   1113    540  28866   1545     97   2855     153840      35016
ADE      384    163   3916    209     14    192     45     36    639     73      2     67       4878        862
ALL      862    265   8042    332     19    295    109     38   1453     77      5    120       9815       1802
CAU      403    250   2336    102     14    120     34     22    331     26      2     52       3225        467
DAT     3510   1556  33792   1741     87    674    448    175   9858    506     25    400      41360      11412
DEL      791    272   9685    337     13    392    134     50   1887    125      4    177      11490       2377
ELA     1205    326  13796    448     38    560    181     66   2520    112     12    246      16373       3137
FAC      116     34   1718     56      4     42      3      1    204      3      0     11       1970        222
FOR        8      3   1109      4      0     10      0      0     90      0      0      0       1134         90
ILL     4152    960  19055   1095     90    835    149     64   1489    128     12    199      26187       2041
INE     4874   1167  52058   2879    107   2208    424    135   6324    824     20    590      63293       8317
INS     2365    861  26734    613     59   1190    482    210   6752    271     22    826      31822       8563
SUB     3456   1355  28857   1423     96   1146    299    139   3361    273      9    326      36333       4407
SUP     2263   1083  43009    672     43   1077    127     65   4832    169     13    291      48147       5497
TEM       35      6    575     16      0     16      0      0      4     66      0      1        648         71
TER      125     58   1919     46      0     64      2      0     25      0      0      3       2212         30
SUM    38617  12703 377539  14502    896  15269   3661   1569  70098   4283    229   6303     459526      86143
Table 3. Combinations of the PS possessive suffixes with a following POS suffix, before each case ending.
Cases   PSe1   PSe2   PSe3   PSt1   PSt2   PSt3  PSe1i  PSe2i  PSe3i  PSt1i  PSt2i  PSt3i    Summa    Summa
        +POS   +POS   +POS   +POS   +POS   +POS   +POS   +POS   +POS   +POS   +POS   +POS   PS+POS  PSi+POS
ABL        1      0      3      1      0      2      0      0      3      0      0      1        7        4
ACC       30     11     24      4      3      1      4      0      5      0      2      0       73       11
ADE        0      0      2      1      0      1      0      0      1      1      0      0        4        2
ALL        3      0      5      2      0      0      0      0      2      2      0      0       10        4
CAU        1      0      1      1      0      0      0      0      0      0      0      0        3        0
DAT        2      6      0      0      0      0      0      0      0      0      0      0        8        0
DEL        1      2      2      0      0      0      1      0      0      0      0      0        5        1
ELA        4      2      3      0      0      0      0      0      0      0      0      0        9        0
FAC        0      0      0      0      0      0      0      0      0      0      0      0        0        0
FOR        0      0      0      0      0      0      0      0      0      0      0      0        0        0
ILL        1      5      0      0      0      0      0      0      1      0      0      0        6        1
INE        1      6      2      1      0      0      0      0      0      0      0      0       10        0
INS       14      1      4      3      0      2      1      0      5      0      0      0       24        6
SOC        0      0      0      0      0      0      0      0      0      0      0      0        0        0
SUB        0      4      1      0      0      0      0      0      4      0      0      0        5        4
SUP        4      2      1      1      0      0      0      0      0      0      0      0        8        0
TEM        0      0      0      0      0      0      0      0      0      0      0      0        0        0
TER        0      1      0      0      0      0      0      0      0      0      0      0        1        0
NOM      120     30    145     36      0      7      7      4     49      7      1      5      338       73
SUM      182     70    193     50      3     13     13      4     70     12      1      5      511      106
References
1. Kornai A.: (1994) On Hungarian Morphology. Research Institute for Linguistics HAS, Budapest.
2. Pais J. - Pajzs J.: (1997) Using local rules for disambiguation of homographs in Hungarian corpora. Submitted to EURALEX '98.
3. Pajzs J.: (1991) The Use of a Lemmatized Corpus for Compiling the Dictionary of Hungarian. In: Using Corpora. Proceedings of the 7th Annual Conference of the OUP & Centre for the New OED and Text Research. University of Waterloo Centre for the New OED, pp. 129-136.
4. Pajzs J.: Project Report on the Historical Dictionary of Hungarian. Papers in Computational Lexicography. Proceedings of COMPLEX '94. Edited by F. KIEFER, G. KISS, J. PAJZS. Budapest 1992. pp. 205-213.
5. Papp F.: (1975) A magyar főnév paradigmatikus rendszere. Akadémiai Kiadó, Budapest.
6. Prószéky G. - Tihanyi L.: (1992) A Fast Morphological Analyser for Lemmatizing Corpora of Agglutinative Languages. Papers in Computational Lexicography. Proceedings of COMPLEX '92. Edited by F. KIEFER, G. KISS, J. PAJZS. Budapest 1992. pp. 275-278.
7. Prószéky G.: (1996) HUMOR - A Morphological System for Corpus Analysis. Proceedings of the first TELRI Seminar in Tihany. Ed. Rettig, H. Budapest 1996. pp. 149-158.
Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)
Debrecen, Hungary
July 5, 1998 - July 10, 1998
109 works by 129 authors indexed
Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/
References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm
Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)