The BCCWJ is segmented into two types of lexical items, Short Unit Words and Long Unit Words, each annotated with part-of-speech information.
Both Short Unit Words and Long Unit Words are also used as classifications of lexical items in the Corpus of Spontaneous Japanese (CSJ). The Short Unit Words originated as the β lexical items in NINJAL's earlier survey of modern magazines, while the Long Unit Words originated as the long lexical items in a study of spoken language on television. Using these units keeps the BCCWJ compatible with the CSJ and draws on NINJAL's previous experience.
Short Unit Words are the basic lexical items from which running text is built. Identifying them starts from the smallest lexical items that carry meaning (referred to below as Smallest Lexical Units).
Short Unit Words are then derived by taking these Smallest Lexical Units and combining them, or leaving them uncombined, according to a set of rules. The criteria for Short Unit Words are fairly simple, so there should be little variance in their interpretation.
"Smallest Lexical Unit" refers to the smallest linguistic unit that carries meaning in modern Japanese. This covers native Japanese words, Sino-Japanese words, borrowed words, symbols, proper names, and place names, which are categorized as follows. A '/' indicates a boundary between Smallest Lexical Units.
As the Smallest Lexical Units identified above are necessary to the definition of Short Unit Words, they will be further broken down as shown in the table below.
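| Category | Subcategory | Examples |
|---|---|---|
| General | Native Japanese | 豊か 大 雨 ... |
| General | Sino-Japanese | 国 語 研 究 所 ... |
| General | Borrowing | コール センター オレンジ ... |
| Numbers | | 一 二 十 百 千 ... |
| Other | Affixes: prefix | 相 御 各 ... |
| Other | Affixes: suffix | 兼ねる がたい 的 ... |
| Other | Particles / Auxiliaries | う だ ます か から て の ... |
| Other | Proper Names / Place Names | 星野 仙一 大阪 六甲 ... |
| Other | Symbols | Ａ Ｂ ω イ ロ ア JR ... |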
The table above establishes the categories used to identify Short Unit Words. Short Unit Words are formed by combining (or not combining) the basic Smallest Lexical Units.
The following sections will summarize how Short Unit Words are generated in the "General", "Numbers" and "Other" categories.
Abbreviated borrowed words are treated in the same way as Native Japanese and Sino-Japanese words.
Instances where one of the Smallest Lexical Units of the borrowed word is abbreviated and one is not will also be combined into a single Short Unit Word.
Instances where three or more Smallest Lexical Units occur contiguously.
In cases where a Smallest Lexical Unit representing a type or name of some object comes before a Smallest Lexical Unit representing the object, the two will each be a Short Unit Word.
If three contiguous Smallest Lexical Units make up an organization's name.
If the point of division in a word is not clear, or if dividing it would make the meaning opaque.
Numbers are not combined with Smallest Lexical Units other than other Numbers. Each of the places for 1, 10, 100, and 1000 makes up a single Short Unit Word. The units for 10,000, 100 million, and 1 trillion each make up a separate Short Unit Word. Each part of a fraction makes up one Short Unit Word.
In the "Other" category, a single Smallest Lexical Unit corresponds to a single Short Unit Word.
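The rule for Numbers can be sketched in a few lines of Python. This is only an illustration under one reading of the rule: a digit together with its place marker (十, 百, 千) forms one Short Unit Word, a bare ones digit forms its own, and 万, 億, and 兆 always stand alone. The function name `split_number` is hypothetical, and fractions and positional notation are not handled.

```python
from typing import List

DIGITS = set("〇一二三四五六七八九")
SMALL_UNITS = set("十百千")   # place markers that attach to a preceding digit
LARGE_UNITS = set("万億兆")   # units that always form their own word


def split_number(kanji: str) -> List[str]:
    """Split a kanji numeral into Short Unit Words (illustrative sketch)."""
    words: List[str] = []
    pending = ""  # a digit waiting for its place marker
    for ch in kanji:
        if ch in DIGITS:
            if pending:
                # Assumption: consecutive digits are emitted one by one.
                words.append(pending)
            pending = ch
        elif ch in SMALL_UNITS:
            # Digit plus place marker (or a bare marker) closes one word.
            words.append(pending + ch)
            pending = ""
        elif ch in LARGE_UNITS:
            if pending:
                words.append(pending)  # ones place before the large unit
                pending = ""
            words.append(ch)
        else:
            raise ValueError(f"not a numeral character: {ch!r}")
    if pending:
        words.append(pending)
    return words


print(split_number("二千五百万三千"))
```

Under these assumptions the input 二千五百万三千 is split after 二千, after 五百, and on both sides of 万.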
Each Short Unit Word is assigned the following metadata: a representative form, a representative transcription, a part of speech, usage information, and inflected forms. The representative form corresponds to the headword found in Japanese language dictionaries, while the representative transcription corresponds to the Kanji transcription.
Determining the parsing and metadata of such a large number of Short Unit Words requires an automated system. The BCCWJ uses the UniDic dictionary, developed in cooperation with Chiba University, to parse Short Unit Words. UniDic's headers and appendices can be used to obtain a high degree of accuracy in parsing. The most recent version of UniDic is available from the UniDic Web page.
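As a rough illustration, the metadata attached to one Short Unit Word could be held in a record like the following. The class and field names are hypothetical English renderings of the items listed above, and the example values are illustrative rather than taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShortUnitWord:
    surface: str                        # the form as it appears in the text
    representative_form: str            # dictionary headword
    representative_transcription: str   # Kanji transcription
    part_of_speech: str
    usage_information: Optional[str] = None
    inflected_form: Optional[str] = None


# Illustrative values only (an assumption, not corpus data).
word = ShortUnitWord(
    surface="食べ",
    representative_form="タベル",
    representative_transcription="食べる",
    part_of_speech="verb",
    inflected_form="continuative",
)
print(word)
```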
Long Unit Words are lexical items based on phrases (e.g. noun phrases). After determining the contents of a phrase, the Long Unit Words can be determined through a process of separating the phrase into independent and auxiliary words.
Rather than breaking down compound words into their component forms, Long Unit Words treat them as whole units. Based on these Long Unit Words, it is possible to get a broader view of the characteristic words of particular sources.
In general, phrasal borders are found at or after ancillary words.
The BCCWJ follows the standards of Japanese education in also treating composite words as ancillary words. After identifying the phrases, there remains the problem of how to treat proper nouns, animal/plant names, and compound words. For these, the decision was made not to cut off phrases after internal particles or auxiliary verbs. In the example below, a "|" indicates the place of phrase segmentation.
After segmenting (or not segmenting) the phrases as described above, the Long Unit Words can be identified using a set of rules. Long Unit Words never cross phrasal boundaries.
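The general principle, that a phrasal border falls after a run of ancillary words, can be sketched as follows. This is a simplified illustration: the token format `(surface, is_ancillary)` and the function name `split_phrases` are assumptions, and the exceptions for proper nouns, animal/plant names, and compound words are not handled.

```python
from typing import List, Tuple

Token = Tuple[str, bool]  # (surface, is_ancillary): hypothetical input format


def split_phrases(tokens: List[Token]) -> List[List[str]]:
    """Place a phrase border wherever an independent word follows an ancillary word."""
    phrases: List[List[str]] = []
    current: List[str] = []
    prev_ancillary = False
    for surface, is_ancillary in tokens:
        if current and prev_ancillary and not is_ancillary:
            phrases.append(current)  # close the phrase after the ancillary run
            current = []
        current.append(surface)
        prev_ancillary = is_ancillary
    if current:
        phrases.append(current)
    return phrases


tokens = [
    ("国語", False), ("研究所", False), ("が", True),
    ("調査", False), ("を", True),
    ("行っ", False), ("た", True),
]
print(" | ".join("".join(p) for p in split_phrases(tokens)))
```

With the sample tokens, the borders fall after が and after を, giving three phrases.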
The following sections outline the rules for identifying Long Unit Words. A "|" indicates a cutoff point.
Ancillary words (including composite words) are treated as 1 Long Unit Word.
In cases where one of the irregular verb forms with semantic value - 「する」「できる」「なさる」or 「いたす」 - occurs following an uninflected word or adverb, it will not be separated from the Long Unit Word.
Continuous, related words will not be separated.
In addition, in cases where multiple uninflected words appear contiguously, or when an uninflected word is followed by a form of "suru" with formal meaning - 「する」「できる」「なさる」or 「いたす」 - the words will not be separated.
Parts of a word related to apposition following an uninflected word will not be separated.
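A few of the rules above can be combined into a small sketch that groups the words of one phrase into Long Unit Words. It covers only three cases: each ancillary word is its own Long Unit Word, contiguous uninflected words stay together, and する, できる, なさる, or いたす stays attached to a preceding uninflected word or adverb. The token format `(surface, lemma, kind)`, the `kind` labels, and the function name are all hypothetical, and the apposition rule is not handled.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]  # (surface, lemma, kind): hypothetical input format
SURU_FORMS = {"する", "できる", "なさる", "いたす"}


def long_unit_words(phrase: List[Token]) -> List[str]:
    """Group the tokens of one phrase into Long Unit Words (partial sketch)."""
    words: List[str] = []
    prev_kind: Optional[str] = None
    for surface, lemma, kind in phrase:
        attach = False
        if kind != "ancillary" and words:
            if lemma in SURU_FORMS and prev_kind in ("uninflected", "adverb"):
                attach = True   # e.g. 研究 + し stays together
            elif kind == "uninflected" and prev_kind == "uninflected":
                attach = True   # contiguous uninflected words stay together
        if attach:
            words[-1] += surface
        else:
            words.append(surface)  # ancillary words always start a new word
        prev_kind = kind
    return words


phrase = [
    ("調査", "調査", "uninflected"),
    ("研究", "研究", "uninflected"),
    ("し", "する", "other"),
    ("た", "た", "ancillary"),
]
print(" | ".join(long_unit_words(phrase)))
```

For the sample phrase, 調査, 研究, and し form one Long Unit Word, and the ancillary word た forms another.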