KOTONOHA: The NINJAL Corpus Development Project
Since its inception in 1948, NINJAL has conducted large-scale scientific studies to establish how modern Japanese is actually used.
More recently, a long-term project to create a database of the Japanese language, the KOTONOHA project (hereafter simply "KOTONOHA"), has been designed for this purpose. It aims to create an extremely broad representation of Japanese from the Heian period to modern times. KOTONOHA is a generic term for multiple corpora that together attempt to cover the entirety of the Japanese language.
The image below shows the entirety of KOTONOHA.
The horizontal axis is the time axis, ranging from the Heian period to modern times. The vertical axis shows the source of the words included in the corpora: the top half represents corpora of written language, and the bottom half corpora of spoken language.
KOTONOHA covers all of the corpora shown in the figure. Because it is not possible to compile everything into a single corpus, the project uses a number of corpora to cover different areas, so that its coverage can be improved step by step.
Of those shown above, the "Modern Magazine" corpora, the CSJ (Corpus of Spontaneous Japanese), the BCCWJ (Balanced Corpus of Contemporary Written Japanese), and the NINJAL Web Japanese Corpus on the right have been completed and are available to the public. The Corpus of Historical Japanese (CHJ) is currently under construction.
- Balanced Corpus of Contemporary Written Japanese (BCCWJ)
- Corpus of Spontaneous Japanese (CSJ) (in Japanese only)
- Taiyou Corpus (in Japanese only)
- Modern Women's Magazine Corpus (in Japanese only)
- Meiroku Magazine Corpus (in Japanese only)
- Kokumin no Tomo Magazine Corpus (in Japanese only)
- Corpus of Historical Japanese (CHJ) (in Japanese only)
- NINJAL Web Japanese Corpus (in Japanese only)
Even once all of these corpora are completed, a number of empty areas will still remain in the figure above. For example, there is a gap of over 50 years between the Taiyou corpus, which covers up to 1925, and the BCCWJ. In addition, the CSJ covers mainly monologues, so there is a lack of data on conversational speech and consultations.
Filling in these gaps is an important challenge. It will also be necessary to extend the BCCWJ periodically. The long-term aim is to improve the corpus to an unprecedented scale and quality.