言語資源開発センター -Center for Language Resource Development- Japanese NINJAL
 

Morphological Information

Two Types of Lexical Items

The text of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) is segmented into the following two types of lexical items, each provided with part-of-speech information:

  • Short unit words, with the objective of examining individual elements.
  • Long unit words, with the objective of examining linguistic properties.

Both short unit words and long unit words are also used as lexical units in the Corpus of Spontaneous Japanese (CSJ). Short unit words originated as the β lexical items in NINJAL's earlier survey of modern magazines, while long unit words originated as the long lexical items in a study of spoken language on television. Using these units allows the BCCWJ to remain compatible with the CSJ and to draw on NINJAL's previous experience.

Description of Short Unit Words

Short unit words are lexical items that make up the parts that form natural language. To be classified as a short unit word, a lexical item must be among the smallest lexical items which carry meaning (referred to as Smallest Lexical Units below).

In this way, short unit words can be derived by looking at the smallest lexical items that are combined (or possibly not combined) together, and breaking them down into the individual units. The criteria for short unit words are fairly simple, so there should be minimal variance in their interpretation.

Classification of Smallest Lexical Units

"Smallest Lexical Unit" refers to the smallest linguistic unit that carries meaning in modern Japanese. These units include native Japanese words, Sino-Japanese words, borrowed words, symbols, proper names, and place names, as illustrated below. A '/' indicates a boundary between Smallest Lexical Units.

Native Japanese words: /豊か/な/暮らし/に/つい/て/   /大/雨/が/降っ/た/の/で/
Sino-Japanese words: /国/語/  /研/究/所/
Borrowed Words: /コール/センター/  /オレンジ/色/
Proper Names: /星野/仙一/  /ジェフ/・/ウィリアムス/  /林/威助/
Place Names: /大阪/府/豊中/市/待兼山町/  /六甲/山/  /琵琶/湖/
Symbols: /図/A/  /JR/

As the Smallest Lexical Units identified above are necessary to the definition of Short Unit Words, they will be further broken down as shown in the table below.

Category: Example
General
  Native Japanese: 豊か 大 雨 ...
  Sino-Japanese: 国 語 研 究 所 ...
  Borrowing: コール センター オレンジ ...
Numbers: 一 二 十 百 千 ...
Other
  Affixes
    Prefix: 相 御 各 ...
    Suffix: 兼ねる がたい 的 ...
  Particles / Auxiliaries: う だ ます か から て の ...
  Proper Names / Place Names: 星野 仙一 大阪 六甲 ...
  Symbols: A B ω イ ロ ア JR ...

Identification of Short Unit Words

Short Unit Words are identified according to the categories in the table above. They are formed by combining (or, in some cases, not combining) the basic Smallest Lexical Units.

The following sections will summarize how Short Unit Words are generated in the "General", "Numbers" and "Other" categories.

General

General Rules

 
  • For Native Japanese and Sino-Japanese words, two Smallest Lexical Units that combine to form a word make up one Short Unit Word.

    |母=親| |食べ=歩く| |言=語|資=源| |研=究|所| |本=箱|作り|
  • For borrowed words, one Smallest Lexical Unit forms one Short Unit Word.

    |コール|センター| |オレンジ|色|
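The bracketed examples above can be read mechanically. The following is a minimal Python sketch, assuming that "|" marks a Short Unit Word boundary and "=" marks a join between two Smallest Lexical Units inside one Short Unit Word. This reading of "=" is inferred from the examples rather than stated in the text, and the function name `parse_short_units` is hypothetical.

```python
def parse_short_units(annotated):
    """Parse a string such as '|研=究|所|' into Short Unit Words.

    Assumed notation (inferred from the examples, not defined in the text):
      '|'  boundary between Short Unit Words
      '='  join between Smallest Lexical Units inside one Short Unit Word
    Returns a list of Short Unit Words, each given as a list of its
    Smallest Lexical Units.
    """
    words = []
    for chunk in annotated.split("|"):
        chunk = chunk.strip()
        if not chunk:
            # Empty pieces come from leading/trailing bars and from the
            # spaces that separate independent examples.
            continue
        words.append(chunk.split("="))
    return words


if __name__ == "__main__":
    for example in ["|研=究|所|", "|コール|センター|"]:
        units = parse_short_units(example)
        # Joining the units of each word gives its surface form.
        print(example, [("".join(u), len(u)) for u in units])
```

Under this reading, a Native Japanese or Sino-Japanese Short Unit Word such as 研究 carries two Smallest Lexical Units, while each borrowed word carries exactly one.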

Regular Exceptions

 
  • The treatment of abbreviated borrowed words...

    Abbreviated borrowed words are treated in the same way as Native Japanese and Sino-Japanese words.

    |パソ=コン| |塩=ビ| |ピン=ぼけ|

    Instances where one of the Smallest Lexical Units of the borrowed word is abbreviated and one is not will also be combined into a single Short Unit Word.

    |エア=コン| |マス=コミ|
  • 1 Smallest Lexical Unit corresponds to 1 Short Unit Word...

    Instances where three or more Smallest Lexical Units appear contiguously as parallel items, as in the examples below.

    |衣|食|住| |松|竹|梅| |都|道|府|県|

    In cases where a Smallest Lexical Unit representing a type or name of some object comes before a Smallest Lexical Unit representing the object, the two will each be a Short Unit Word.

    |さくら|屋| |歌舞伎|座| |のぞみ|号|
  • Three or more contiguous Smallest Lexical Units corresponding to 1 Short Unit Word...

    If 3 contiguous Smallest Lexical Units make up an organization's name.

    |日経連| |通総研|

    If the point of division in a word is not clear, or if dividing it would make the meaning opaque.

    |大統領| |不可解| |明後日| |殺風景|
    |輸出入| |国内外| |原水爆| |市町村長|
    |大袈裟| |大丈夫| |二枚目| |十八番|

    However, if two or more Sino-Japanese Smallest Lexical Units are found contiguously, Short Unit Words will be created as follows:
    |中|小|企業| |小|中|学校| |都|道|府|県|知事|

Numbers

Numbers are not combined with Smallest Lexical Units other than other Numbers. Within a number, each place value (ones, tens, hundreds, thousands) forms its own Short Unit Word. The units for 10,000 (万), 100 million (億), and 1 trillion (兆) each form a separate Short Unit Word. Parts of fractions each make up one Short Unit Word.

|十|二|月|二十|三|日| |七百|五十|二|万|語| |五|分|の|二| |二三十|回| |〇|.|四|五|
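The place-value rule can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: it handles plain kanji integers such as 七百五十二万, and it deliberately rejects approximate expressions such as 二三十 as well as decimals and fractions, which the examples above treat differently. The function name `split_kanji_number` and the constant names are hypothetical.

```python
DIGITS = set("一二三四五六七八九")
SMALL_UNITS = set("十百千")   # tens, hundreds, thousands
LARGE_UNITS = set("万億兆")   # 10,000 / 100 million / 1 trillion


def split_kanji_number(numeral):
    """Split a plain kanji integer into Short Unit Words.

    A digit joins the following small unit (七 + 百 -> 七百); a bare digit
    and each large unit stand alone as their own Short Unit Word.
    """
    words = []
    pending = ""  # a digit still waiting for its place-value unit
    for ch in numeral:
        if ch in DIGITS:
            if pending:
                # Two digits in a row (e.g. 二三十) are outside this sketch.
                raise ValueError("consecutive digits are not handled: " + numeral)
            pending = ch
        elif ch in SMALL_UNITS:
            words.append(pending + ch)
            pending = ""
        elif ch in LARGE_UNITS:
            if pending:
                words.append(pending)
                pending = ""
            words.append(ch)
        else:
            raise ValueError("not a kanji numeral character: " + ch)
    if pending:
        words.append(pending)
    return words


if __name__ == "__main__":
    for n in ["十二", "二十三", "七百五十二万"]:
        print(n, "|" + "|".join(split_kanji_number(n)) + "|")
```

The three inputs in the usage lines are the numeric parts of the examples above (十二月, 二十三日, 七百五十二万語), so the output can be compared against them directly.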

Other

A single Smallest Lexical Unit corresponds to a single Short Unit Word.

Affixes: |筒|状| |扱い|兼ねる|
Particles/Auxiliaries: |豊か|な|暮らし|に|つい|て|
Proper Names: |星野|仙一| |ジェフ|・|ウィリアムス| |林|威助|
Place Names: |大阪|府|豊中|市|待兼山町| |六甲|山| |琵琶|湖|
Symbols: |図|A| |JR|

Dictionary for Parsing Short Unit Words

Short Unit Words are each assigned the following meta-data: a representative form, a representative transcription, a part of speech, usage information, and inflected forms. The representative forms refer to the headings found in Japanese language dictionaries, while the representative transcription corresponds to the Kanji transcription.
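As a rough illustration of the meta-data listed above, one record per Short Unit Word could be modelled as follows. This is a sketch only: the class name, the field names, and the sample values are hypothetical and are not taken from the corpus or from any dictionary format.

```python
from dataclasses import dataclass


@dataclass
class ShortUnitWord:
    """One Short Unit Word with the meta-data fields named in the text.

    All field names are hypothetical labels for illustration.
    """
    surface: str                       # the form as it appears in the text
    representative_form: str           # dictionary-style heading
    representative_transcription: str  # kanji transcription of the heading
    part_of_speech: str
    usage: str = ""                    # usage information, if any
    inflected_form: str = ""           # empty for uninflected words


if __name__ == "__main__":
    # Illustrative values only, not an actual corpus entry.
    word = ShortUnitWord(
        surface="食べ",
        representative_form="タベル",
        representative_transcription="食べる",
        part_of_speech="verb",
        inflected_form="continuative",
    )
    print(word)
```

Keeping the surface form separate from the representative form lets differently inflected or differently written occurrences be grouped under one heading.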

Determining the parsing and meta-data of such a large number of Short Unit Words requires an automated system. The BCCWJ makes use of the UniDic dictionary, developed in cooperation with Chiba University, for the parsing of Short Unit Words. UniDic's headers and appendices can be used to obtain a high degree of accuracy in parsing. The most recent version of UniDic is available from the UniDic Web page.

Description of Long Unit Words

Long Unit Words are lexical items based on phrases (e.g. noun phrases). After determining the contents of a phrase, the Long Unit Words can be determined through a process of separating the phrase into independent and auxiliary words.

Rather than breaking down compound words into their component forms, Long Unit Words treat them as whole units. Based on these Long Unit Words, it is possible to get a broader view of the characteristic words of particular sources.

Identification of Phrases

In general, phrasal borders are found at or after ancillary words.

The BCCWJ follows the standards of Japanese education in also treating composite words as ancillary words. After the phrases are identified, there remains the problem of how to treat proper nouns, animal and plant names, and compound words. For these, the decision was made not to cut off phrases after internal particles or auxiliary verbs. In the examples below, a "|" indicates a point of phrase segmentation.

|源=頼朝| |虎の=門交差点| |タツノ=オトシゴ| |ユキノ=シタ|
|案の=定| |万が=一|

Identification of Long Unit Words

After segmenting (or not segmenting) the phrases as described above, the Long Unit Words can be identified using a set of rules. A Long Unit Word never extends across a phrasal boundary.

The following sections outline the rules for identifying Long Unit Words. A "|" indicates a cutoff point.

Symbols

Punctuation marks denoting pauses each correspond to 1 Long Unit Word.

|湾岸戦争後|、|英|、|仏|など|と|

Symbols which continue on from a word, or symbols representing an acronym are treated as 1 Long Unit Word.

|2,000=m2| |WHO| |PHS|

Ancillary Words

Ancillary words (including composite words) are treated as 1 Long Unit Word.

|公害紛争処理法|における|公害紛争処理|の|手続|は|、|原則|として|紛争当事者|から|の|申請|によって|開始さ|れる|。|

Irregular Verbs

In cases where one of the irregular verb forms with semantic value (「する」, 「できる」, 「なさる」, or 「いたす」) occurs following an uninflected word or adverb, it will not be separated from the Long Unit Word.

|往復運動=し|ている| |きちんと=できる|

Continuous Words

Continuous, related words will not be separated.

|公正妥当|な|実務慣行|

In addition, in cases where multiple uninflected words appear contiguously, or when an uninflected word is followed by a form of "suru" with formal meaning - 「する」, 「できる」, 「なさる」, or 「いたす」 - then the words will not be separated.

|英語=日本語-間| |芸術家=、=文化人等| |新-学年=・=学期| |在学=・=在校する|

Apposition

Parts of a word related to apposition following an uninflected word will not be separated.

|機関誌=計量国語学|が|発刊さ|れ|

Numerical Expressions

For elements including numbers, the cutoff occurs after the unit.

|平成|15年|9月|15日|午後|7時|33分|

Cutoffs also occur before numbers.

|延べ|23時間|30分|
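The two cutoff rules for numerical expressions (cut after the unit, cut before the number) can be sketched with a regular expression. This is a minimal illustration, assuming Arabic digits and a small hypothetical unit list, `UNITS`, limited to the units that occur in the examples above; it does not segment the non-numeric text any further.

```python
import re

# Hypothetical unit list covering only the examples in the text.
# Longer units come first so that 時間 is tried before 時.
UNITS = ["時間", "年", "月", "日", "時", "分"]

# One numeric element = a run of digits followed by one unit.
NUMERIC = re.compile(r"(\d+(?:" + "|".join(UNITS) + r"))")


def split_numeric_expressions(text):
    """Cut before each number and after each unit.

    re.split with a capturing group keeps the matched numeric elements;
    the empty strings produced between adjacent matches are dropped.
    """
    return [piece for piece in NUMERIC.split(text) if piece]


if __name__ == "__main__":
    for s in ["平成15年9月15日午後7時33分", "延べ23時間30分"]:
        print("|" + "|".join(split_numeric_expressions(s)) + "|")
```

Running the usage lines on the two example strings reproduces the segmentations shown above.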
 
 
