The present authors have developed an Internet-based information retrieval system for Japanese and Chinese full-text databases. Retrieval is based on matching indexed words in the stored text against words supplied by the user. Because Japanese and Chinese text is not segmented into words, the system relies on the Multiple-Hash Screening Algorithm (MHSA) for text segmentation. In the written forms of English and most European languages, words are delimited by blanks or other obvious delimiters, so stored words and query words can be found simply by locating the delimiters and segmenting the data string into words. In languages such as Japanese and Chinese, however, the standard written form has no obvious set of delimiters, and detecting a "word" in a text string is not a trivial problem. This, in turn, complicates both the indexing and the retrieval processes. In the present paper, a text retrieval system based on MHSA for text segmentation is presented. The design principles were:
(1) Index permissively. Index all possible combinations of words to maximize recall. Do not try to reduce them by using overly strict grammatical rules, which are susceptible to change and fluctuation in actual word usage.
(2) Shield the user from details of segmentation by using the same dictionary and similar segmentation algorithms in processing the database text and the query.
(3) Segment the query selectively. To reduce noise (unwanted matches) as much as possible, do not use all possible combinations of words obtained by segmenting the query, but only the more probable ones.
(4) In spite of principle (2) above, give the user feedback and choice in case of doubt.

MHSA is a method, based only on an open hash table (similar to a signature file) of a dictionary of words (morphemes), for finding all possible ways of segmenting a text string into a sequence of dictionary entries. Because MHSA does not require a large dictionary file on disk, it is faster for large amounts of data. We developed two MHSA programs: an earlier version for Japanese text coded in EUC (Extended UNIX Code), and the current one for Japanese and Chinese texts coded in Unicode. We have also developed a method for statistically compiling feature-word lists from the words that the MHSA program generates from Japanese abstract texts in three large fields: "information processing", "agricultural chemistry", and "civil engineering". Such feature-word lists can be used for the classification and retrieval of Japanese texts. We then study the feasibility of applying the method to the classification and retrieval of other texts, for example Japanese literature, or Japanese and Chinese newspapers. The present authors will show some applications of their system.
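The core task MHSA addresses can be illustrated with a minimal sketch. The code below is not the authors' MHSA (it uses a plain set rather than an open hash table and does no hash screening); it only shows, for a toy dictionary, what "finding all possible ways of segmenting a text string into a sequence of dictionary entries" means, in the permissive spirit of design principle (1). The dictionary contents and the Latin-alphabet stand-in string are hypothetical.

```python
from typing import List, Set


def all_segmentations(text: str, dictionary: Set[str]) -> List[List[str]]:
    """Return every segmentation of `text` into dictionary entries.

    An unsegmented string (as in Japanese or Chinese running text) may
    split into dictionary words in several ways; all are returned, so
    that an indexer can index permissively and maximize recall.
    """
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results: List[List[str]] = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in dictionary:  # candidate word found at the start
            for rest in all_segmentations(text[i:], dictionary):
                results.append([prefix] + rest)
    return results


# Toy Latin-alphabet stand-in for an unsegmented string:
words = {"to", "ky", "o", "tokyo", "kyo"}
print(all_segmentations("tokyo", words))
# → [['to', 'ky', 'o'], ['to', 'kyo'], ['tokyo']]
```

For real text a memoized (dynamic-programming) variant would be used, since the number of recursive calls grows with string length; selective query segmentation, as in principle (3), would then keep only the more probable of these candidate segmentations.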
1. K. Hasebe, K. Nakamoto, T. Yamamoto, "An information retrieval system on Internet for languages without obvious word delimiters", Proc. of Int. Symp. on Digital Libraries '95, pp. 181-185 (Aug. 1995, Tsukuba).
2. E. Ishida, H. Ishizuka, M. Negishi, T. Yamamoto, "Formulation and application of word lists for classifying texts into large fields", Inf. Proc. Soc. Jpn. SIG-FI Note, No. 47, pp. 109-116 (1997) (in Japanese).
Hosted at Debreceni Egyetem (University of Debrecen, formerly Lajos Kossuth University)
July 5, 1998 - July 10, 1998
109 works by 129 authors indexed
Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/