Academic Digital Libraries and Contents in Japan

multipaper session
  1. Hisashi Yasunaga

    Dept. of Research Information - National Institute of Japanese Literature

Work text
Session Abstract:

In this session, we present current digital activities at universities and inter-university research institutes in Japan. Three representative papers will be introduced, covering:

(1) a digital library system for Japanese academic journals,

(2) a digital library and archive system for Japanese literature, and

(3) a new algorithm for Japanese text processing.

First, the National Center for Science Information Systems (NACSIS) has been providing an electronic document delivery service called NACSIS-ELS (NACSIS Electronic Library Service) since April 1997. In this service, Japanese academic journals are captured and made available to researchers through the Internet. Journals are acquired from Japanese academic societies that have given NACSIS permission to use their journals for the NACSIS-ELS service. As of November 1997, 29 societies are participating in NACSIS-ELS, and 48 journals will be available on the Internet by the end of 1997.

Second, the National Institute of Japanese Literature (NIJL) has been constructing a digital library and archive system consisting of multimedia databases, such as catalogs, images and texts, for Japanese classical literary books. These databases will be opened to the public over the Internet in 1998. At present, although NIJL's catalog databases are undoubtedly useful for finding out whether materials exist, direct access to the materials themselves is difficult for distant users (particularly foreign users). The digital library and archive system is one solution to this problem. The system is built from the catalog database and the image archives, connected as a hyperlinked system. The digital image archives are being developed from the microfilms of the archival manuscripts and printed books on classical Japanese literature held at NIJL, with about 150,000 frames digitized in 1997.

Third, the University of Library and Information Science (ULIS) has been developing a new high-speed algorithm for text processing, called the Multiple-Hash Screening Algorithm (MHSA). MHSA is a segmentation algorithm for full text written without word separation, such as Japanese or Chinese text. Based on an open hash table of a dictionary of words (morphemes), MHSA can find all the possible ways of segmenting full text into morphemes faster than other methods. As an application of MHSA, ULIS has also developed an information retrieval system for Japanese and Chinese full-text databases, accessible through the Internet.

Paper #1

Digitization of Japanese Academic Journals and its Services
Jun Adachi

Research and Development Department,

KEYWORDS: electronic journal, digital contents, internet

NACSIS (National Center for Science Information Systems), 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112, Japan


The National Center for Science Information Systems (NACSIS) is one of the national inter-university research institutions under the Japanese Ministry of Education, Science, Sports and Culture. Since its establishment in 1986, NACSIS has supported researchers and universities through its information services. NACSIS has been developing its electronic library system, NACSIS-ELS (NACSIS Electronic Library System), since the early 1990s. After a few years of trial service, full service was launched in April 1997 as NACSIS's new service for scholarly information dissemination. Through the operation of NACSIS-ELS, the digitization of Japanese academic journals is also progressing on a large scale, and many Japanese academic societies in the humanities are considering participating in this project. The functions now available on NACSIS-ELS are online document delivery capabilities on the Internet. NACSIS-ELS has unique features compared with other digital library projects that are based on conventional libraries. For example, the materials provided by NACSIS-ELS are restricted to scholarly journals, all of which are copyrighted, and potential users are assumed to be mainly researchers. It stores no paper materials and permits access only through networks. In this paper, the NACSIS activities and the background of the NACSIS-ELS project are described in section 2. The profile of the service is then described in section 3. Issues concerning academic societies, database compilation, copyright, and so on are described thereafter.

2. Overview of NACSIS Activities

2.1 Cataloging Services

One of the most heavily used services is the NACSIS-CAT service, an online cataloging system for compiling union catalogs of university library materials. Catalog and holdings records of books and serials at university libraries accumulate in the databases day by day through the NACSIS-CAT service, which started in 1984. More than 400 university libraries and several public libraries are connected online to the NACSIS central catalog database server, which stores over 20 million holdings records in total. These records are also used for OPAC services at each university library.

2.2 Document Delivery

The service that combines catalog information and general databases is NACSIS-ILL (NACSIS Inter-Library Loan service), which runs a kind of e-mail system for passing document delivery requests from users to libraries. A user can issue a request to receive photocopies of an article that he or she has found in a conventional information retrieval system. The request is sent to one of the libraries that hold the journal or proceedings carrying the article.

2.3 Cooperation with Academic Societies

NACSIS supports the activities of Japanese academic societies by compiling bibliographic records of technical reports and conference papers that cannot be obtained easily through the usual publication distribution systems. NACSIS considers gray literature (i.e., literature outside the usual distribution market that is difficult to obtain, such as SIG reports and conference proceedings) to be as important for researchers as academic periodicals. Therefore, NACSIS initiated the compilation of a bibliographic database of Japanese gray literature. NACSIS has also initiated the compilation of full-text records of articles in some scientific journals in cooperation with the publishing departments of several societies[3][4]. NACSIS proposes to use the SGML format for full-text encoding, but the number of participating journals is limited at present.

2.4 Japanese Academic Societies

A major reason for government support of the activities of academic societies through NACSIS operations is that, in Japan, individual societies are small in scale compared with those in the U.S., even though there are more than one thousand societies. The financial state of the societies is therefore weak, leading to occasional difficulties in publishing materials. Furthermore, Japanese societies are responsible for scholarly publications in the Japanese language, and readers outside Japan do not readily subscribe to these publications. This makes the societies financially weaker still in the age of the Internet, where English prevails. While some societies publish journals in English, journals with an international reputation are few, and most have limited subscriptions from overseas.

3. NACSIS-ELS Service

3.1 New Service and Digitization of Journals

NACSIS considered that the next stage in the further dissemination of scholarly information was an electronic journal service for Japanese academic societies. In this context, NACSIS-ELS was developed to accumulate scholarly information such as machine-readable journal articles, proceedings and technical reports[2]. Journals are acquired from Japanese academic societies that have given NACSIS permission to use their journals for the NACSIS-ELS service. As of April 1998, 41 societies are participating in NACSIS-ELS, and 171 journals will be digitized. Most of them will become available on the Internet within this year. We have already digitized over 1 million pages, and more than 500,000 pages are to be digitized in 1998. Since discussions with societies on copyright charging issues are still going on, there is no charge for the time being. NACSIS-ELS intends to institute a system of charges in October 1998.

3.2 NACSIS-ELS Functions

The service operates in the following way. First, the database server stores conventional bibliographic databases. In addition to these databases, the server holds document image databases of scientific journals, which include all pages from cover to cover. Pages are captured in a raster image format. Documents stored in the databases are retrieved and transmitted through the high-speed Japanese academic Internet. Users can browse articles on their workstation monitors and, if necessary, print papers on a nearby high-quality laser printer. This system is therefore a superset of conventional library services such as document delivery and photocopying, and will soon supplant those conventional services. NACSIS-ELS is unique in that it provides the usual functions of online information retrieval systems while also enabling users to browse pages on a monitor and then print those of interest.

3.3 Service Operations

NACSIS launched the full service of NACSIS-ELS in April 1997. The definition of the copyright charging policy is one of the major concerns of academic societies, and efforts have been made to define a desirable policy. NACSIS is in talks not only with academic societies but also with organizations concerned with copyright issues and copyright clearance, with the aim of establishing a harmonized scheme of copyright charges for the online dissemination of scholarly information. As of April 1998, the negotiations have largely been settled.

4. Concluding Remarks

NACSIS-ELS can be categorized as an online document delivery system integrated with bibliographic databases and specially designed for scholarly publications. The full-fledged service of NACSIS-ELS, including copyright charge collection, is scheduled to start in October 1998. It will expedite the dissemination of scholarly information and facilitate access to Japanese scholarly publications, in particular giving overseas scientists access to academic journals written in English. Although we give higher priority to the digitization of current issues for the time being, retrospective digitization is also being considered. Several titles have already been converted to digital form from their first issues. The latest information concerning the NACSIS-ELS project can be accessed on the World Wide Web[5].


Paper #2

Digital Image and Text Archives for Japanese Classical
Shoichiro Hara, Hisashi Yasunaga

Dept. of Research Information

KEYWORDS: digital library and archives, internet, image and text processing, Japanese classical literature

National Institute of Japanese Literature, 1-16-10, Yutaka-Cho, Shinagawa-Ku, Tokyo 142 Japan
FAX: +81-3-3784-8875
PHONE: +81-3-3785-7131

1. Introduction

The National Institute of Japanese Literature (NIJL), founded in 1972, is one of the inter-university research institutes of Japan. It was established to survey most of the printed and handwritten Japanese classical materials from the Edo period (1603-1868) and before, and to collect originals and/or microfilm reproductions of them in order to preserve them and provide public access. Over more than two decades of activity, NIJL has established itself as the center of archival activity in this field. At present, NIJL provides only three catalogue databases. However, further catalogue data, fulltext databases, and an image archive are under preparation, and they will be made public within this year as part of the digital library system. In the following, section 2 describes the present NIJL information system and the background of the digital library project, and section 3 outlines the digital library system. Finally, section 4 describes a new study, the "Digital Study System" for the humanities.

2. Present NIJL Information System

NIJL's information system comprises computers, networks, and printing devices. Using this system, NIJL provides the following three catalogues both as an online database service and as printed materials.

1) Catalogue of Holding Microfilms of Manuscripts and Printed Books on Japanese Classical Literature,

2) Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature, and

3) Bibliography of Research Papers on Japanese Classical Literature.

A feature of NIJL's information system is that all data processing, from data compilation and correction to database service and publishing, is executed on a mainframe computer system. However, over more than ten years, the system has accumulated many software and hardware problems awaiting solutions. To solve these problems, we started the digital library project for Japanese classical literature. Over several years, this project will downsize the mainframe computer system and reconstruct it as a so-called distributed computer system. The key words of the digital library project are "standardization of data," "data independent from systems" and "multimedia oriented."

3. Digital Library System

The digital library system consists of catalogue databases, fulltext databases, and image archives.

3.1 Catalogue Databases

NIJL's databases were designed more than 10 years ago, based on the devices available at that time. As the latest computer systems cannot support these devices, we are taking this opportunity to begin reconstructing the whole database system. Having reviewed the old systems, we are applying a new system policy of making data independent from hardware and software; specifically, we have introduced SGML to describe the data. At present, we are reconstructing the three catalogue databases listed above. Two further catalogue databases, 1) the Union Catalogue of Japanese Classical Materials and 2) the Catalogue of Historical Materials, are also under preparation. All these catalogue databases will be made public within this year as part of the digital library system. One of the main dissatisfactions expressed by catalogue database users has been that "the catalogue databases are undoubtedly useful to find the existence of materials, but accessing the materials themselves is difficult for distant users (especially for foreign users)." Fulltext databases and image archives are our response to this complaint.

3.2 Fulltext Database

Since 1987, NIJL has been running a project to compile fulltext data. At present, the following four text collections have been compiled.

1) Anthology of Japanese Classical Literature (Nihon-Koten-Bungaku-Taikei: 100 volumes, about 560 works),

2) Anthology of Story Telling (Hanashibon-Taikei: 20 volumes, about 320 works, about 20,000 stories),

3) Anthology of Story in KANA (Kana-Zoshi-Shusei: 12 volumes, about 70 works, about 1,000 stories), and

4) Anthology of Poem in Shoho Version (Shoho-Hanpon-Kashu: 21 volumes).

Among these, "Anthology of Japanese Classical Literature" and "Anthology of Poem in Shoho Version" can be accessed on the World Wide Web. When we began constructing the fulltext databases, SGML was not popular in Japan and, unfortunately, there were no SGML applications that could process the Japanese language. For these reasons, we created our own text markup rules, which resemble SGML in their basic idea. We call them the "KOKIN Rules" (from KOKubungaku, meaning Japanese literature, and INformation; "KOKIN" is also the title of a famous anthology of Japanese classical poetry). Because the KOKIN rules were designed to be easy to understand and to be used by researchers of Japanese classical literature, all the fulltext data at NIJL have been compiled according to them. However, because the KOKIN rules are independent of any other standard, few tools exist to parse and check KOKIN text. SGML has recently come to be regarded as an encoding scheme for transmitting text data between systems. Against this background, we decided to convert our KOKIN-marked text to SGML-marked text for the sake of effective data circulation.

3.3 Image Archives

NIJL has collected microfilm reproductions of classical materials. At present, NIJL holds about 15,000,000 image frames in the form of microfilm. About 90% of them reproduce materials held outside NIJL, and the remainder reproduce NIJL's own holdings. The image archive is derived from the microfilms of NIJL's own holdings, both to get around copyright problems and to allow speedy construction. Each image is sampled at 1 bit per pixel (black and white) and 600 DPI resolution, compressed by the G4 method, and stored in TIFF format. The image database is linked with the database of the "Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature." Users of the image archive first consult the catalogue database to search for the materials they want, and then access the images by following the link between the two databases (this link is based on the call number of the materials, which appears in both databases). We digitized about 400,000 frames of microfilm in 1996 and about 150,000 frames in 1997.
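The catalogue-to-image link described above can be sketched as a simple join on the call number. This is only an illustrative sketch, not NIJL's implementation: the record layouts, the field names (`call_number`, `frame_no`, `tiff_path`), the helper functions, and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueRecord:
    call_number: str  # hypothetical field: key shared with the image archive
    title: str


@dataclass(frozen=True)
class ImageFrame:
    call_number: str  # same key as in the catalogue record
    frame_no: int     # position of the frame within the material
    tiff_path: str    # hypothetical location of the stored TIFF file


def build_image_index(frames):
    """Group image frames by call number, ordered by frame number."""
    index = {}
    for frame in frames:
        index.setdefault(frame.call_number, []).append(frame)
    for group in index.values():
        group.sort(key=lambda f: f.frame_no)
    return index


def find_images(catalogue, image_index, title_keyword):
    """Search the catalogue first, then follow the call-number link."""
    results = {}
    for record in catalogue:
        if title_keyword in record.title:
            results[record.call_number] = image_index.get(record.call_number, [])
    return results


# Invented sample data, for illustration only.
catalogue = [
    CatalogueRecord("A-001", "Kokin Wakashu"),
    CatalogueRecord("B-014", "Ise Monogatari"),
]
frames = [
    ImageFrame("A-001", 2, "a001/0002.tif"),
    ImageFrame("A-001", 1, "a001/0001.tif"),
    ImageFrame("B-014", 1, "b014/0001.tif"),
]
image_index = build_image_index(frames)
hits = find_images(catalogue, image_index, "Kokin")
print([f.tiff_path for f in hits["A-001"]])
```

Keeping the link as a shared key, rather than embedding image paths in the catalogue, lets either database be rebuilt independently, which fits the project's stated aim of data that is independent from systems.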

4. Digital Study System

We have constructed various kinds of databases. However, over the past few years of development, we have recognized that these databases alone cannot always contribute to the research activities of humanities researchers. A database is only a bank of raw material data, whereas valuable results are produced in individual research environments. We therefore feel that effective tools are needed to support researchers' own methods and skills. The Digital Study System is a user-side tool that is intended to let humanities researchers organize multimedia data using their own methods and skills. The Digital Study System contains an "Image Annotation Program," a "Version Control Mechanism," and a "Text Analyzer." The Image Annotation Program is the center of the Digital Study System; it allows researchers to attach text annotations to a certain position or area on an image. If a researcher attaches keywords or codes to images, he or she can access the desired images by searching for a specific string among the attached annotations. In the same way, a researcher can collect images on a specific subject. Furthermore, images from different materials can be linked; for example, a researcher can compare various versions of a specific sentence in the authentic text and its variants by attaching the same keywords or codes to several materials. The "Version Control Mechanism" constructs a version tree showing the history of data development. By reviewing the history, users themselves can assess the quality of the data. The "Text Analyzer" is a collection of programs for lexical analysis, vocabulary statistics and so on. At present, we are examining some programs to see whether they will be helpful in constructing the tool. These programs will have a modular architecture so that researchers can perform complex text analysis by easily assembling tools.
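The annotation idea described above (text attached to a region of an image, then searched by keyword to collect or link images) can be illustrated with a minimal sketch. It does not reproduce the actual Image Annotation Program: the class names, the `(x, y, width, height)` region layout, and the sample image identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    image_id: str   # hypothetical identifier of the annotated image
    region: tuple   # assumed layout: (x, y, width, height) in pixels
    text: str       # free text, keywords, or codes chosen by the researcher


class AnnotationStore:
    """In-memory collection of annotations, searchable by substring."""

    def __init__(self):
        self._annotations = []

    def annotate(self, image_id, region, text):
        note = Annotation(image_id, region, text)
        self._annotations.append(note)
        return note

    def search(self, keyword):
        """Return annotations whose text contains the keyword."""
        return [a for a in self._annotations if keyword in a.text]

    def images_for(self, keyword):
        """Return the sorted ids of all images carrying the keyword."""
        return sorted({a.image_id for a in self.search(keyword)})


# Invented sample data: the same code attached to two different materials.
store = AnnotationStore()
store.annotate("manuscript-A/p12", (40, 80, 300, 60), "poem-0001 authentic text")
store.annotate("manuscript-B/p07", (55, 90, 280, 64), "poem-0001 variant")
store.annotate("manuscript-B/p08", (10, 10, 100, 40), "seal")
print(store.images_for("poem-0001"))
```

Searching for the shared code returns both materials, which is the mechanism by which a researcher could place an authentic text and its variants side by side.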

5. Conclusion

NIJL is reconstructing its databases using SGML to cope with the multimedia age, and this reconstruction project is on the right track. We have also begun a software development program to support individual research environments. Some of these databases will be made public within this year as part of the digital library system.


Paper #3

An Information Retrieval System on Internet for Japanese and Chinese Full-Text Database
Kigen Hasebe, Hidehiro Ishizuka, Takeo Yamamoto

University of Library and Information Science

KEYWORDS: information retrieval system, text processing, Japanese and Chinese text

1-2, Kasuga, Tsukuba-shi, Ibaragi-ken 305-8550, Japan
E-MAIL: {kh, ishizuka, yamamoto}
FAX: +81-298-59-1093
PHONE: +81-298-59-1111

The present authors have developed an information retrieval system on the Internet for Japanese and Chinese full-text databases. Retrieval of the text data is based on matching indexed words in the stored text with words supplied by the user. Because Japanese and Chinese text is not segmented into words, the system uses the Multiple-Hash Screening Algorithm (MHSA)[1] for text segmentation. In the written forms of English and most European languages, words are delimited by blanks or other obvious delimiters. In those languages, stored words and query words can be found simply by locating the delimiters in the given text string and segmenting the string into words. However, in some other languages such as Japanese and Chinese, there is no obvious set of delimiters in the standard written form, and detecting a "word" in a text string is not a trivial problem. This, in turn, complicates the indexing and retrieval processes. In the present paper, a text retrieval system based on MHSA for text segmentation will be presented. The design principles were:

(1) Index permissively. Index all possible combinations of words to maximize recall. Do not try to reduce them by using overly strict grammatical rules which are susceptible to change and fluctuation in actual word usage.

(2) Shield the user from details of segmentation by using the same dictionary and similar segmentation algorithms in processing the database text and the query.

(3) Segment the query selectively. To reduce noise (unwanted matches) as much as possible, do not use all the possible combinations of words obtained by segmenting the query, but only the more probable ones.

(4) In spite of principle (2) above, give the user feedback and choice in case of doubt.

MHSA is a method of finding all the possible ways of segmenting a text string into a sequence of dictionary entries, based only on an open hash table (similar to a signature file) of a dictionary of words (morphemes). Because MHSA does not require a large dictionary file on disk, it is faster for large amounts of data. We developed two MHSA programs: the earlier version[1] for Japanese text coded in EUC (Extended UNIX Code), and the current one for Japanese and Chinese texts coded in Unicode. We have also developed a method[2] for statistically formulating feature-word lists from the words that the MHSA program generates from Japanese abstract texts, for three broad fields: "information processing", "agricultural chemistry" and "civil engineering". Feature-word lists can be used for the classification and retrieval of Japanese texts. We are now studying the feasibility of applying the method to the classification and retrieval of other texts, for example Japanese literature or Japanese or Chinese newspapers. The present authors will show some applications of their system.
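The task that MHSA addresses, finding every way of splitting an unsegmented string into dictionary entries, can be illustrated with a short sketch. This is not MHSA itself: the multiple-hash screening step is not reproduced, a Python `set` stands in for the in-memory hash table, and the enumeration uses plain dynamic programming. The function name and the tiny sample dictionary are hypothetical.

```python
def all_segmentations(text, dictionary, max_word_len=None):
    """Return every way to split `text` into a sequence of dictionary words.

    `dictionary` is a set of words; membership tests are hash lookups,
    so no dictionary file on disk is consulted during segmentation.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=0)
    n = len(text)
    # suffixes[i] holds all segmentations of text[i:]; filled from the end.
    suffixes = [[] for _ in range(n + 1)]
    suffixes[n] = [[]]  # the empty suffix has exactly one (empty) segmentation
    for start in range(n - 1, -1, -1):
        last_end = min(n, start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in suffixes[end]:
                    suffixes[start].append([word] + rest)
    return suffixes[0]


# Invented sample dictionary of morphemes, for illustration only.
dictionary = {"東京", "京都", "東", "都", "東京都", "に", "住む"}
for segmentation in all_segmentations("東京都に住む", dictionary):
    print(" / ".join(segmentation))
```

Under principle (1), an indexer would keep every word that appears in any of these segmentations; under principle (3), the query side would keep only the more probable ones. Note that the number of segmentations can grow quickly with text length, so a practical system would enumerate candidate words per position rather than materialize every full segmentation.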


