Center for Language Resource Development, NINJAL
 

TSV Data Details

TSV-format data is located under the LUW and SUW directories on Disc 2, which are further divided into one directory per sub-corpus. Each directory contains a compressed file that, once extracted, yields the tabular data for that sub-corpus. For the sub-corpora with extremely large amounts of data (LB and PB), the data is split across multiple files.

The TSV data contains the morphological information discussed previously in tab-delimited tabular format, and is based on the BCCWJ online service "Chunagon". There are separate tables for short unit words (SUWs) and long unit words (LUWs), each further divided by sub-corpus. The text is encoded in UTF-8 (without a BOM).

The SUW and LUW TSV files each contain some redundant data, so that either can be used on its own.
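As a minimal sketch of how the extracted files might be read, the Python function below streams records from every TSV file in one sub-corpus directory, which also covers the case where the data is split across several files. The function name `iter_tsv_rows` and the `*.txt` file-name pattern are assumptions made for illustration; the actual file names on the disc are not specified here.

```python
import csv
from pathlib import Path
from typing import Iterator, List


def iter_tsv_rows(directory: Path, pattern: str = "*.txt") -> Iterator[List[str]]:
    """Yield each record (one list of fields per line) from every matching file.

    `pattern` is a hypothetical file-name pattern; adjust it to the names of
    the files actually produced by extracting the archive.
    """
    # Sorting gives a stable order when the data is split across several files.
    for path in sorted(directory.glob(pattern)):
        # Plain "utf-8" (not "utf-8-sig") because the data has no BOM.
        with path.open(encoding="utf-8", newline="") as handle:
            # QUOTE_NONE treats quotation marks in the text as ordinary characters.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                yield row
```

Because the function is a generator, large sub-corpora can be processed one record at a time without loading a whole file into memory.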

Short Unit Word TSV Fields

The contents of the TSV data fields regarding Short Unit Words are listed (from left to right) in the table "Short Unit Word TSV Fields". One short unit word corresponds to a single record (line).

Short Unit Word TSV Fields

Field Name | Notes
Subcorpus Name |
Sample ID |
Character Starting Position | The offset from the head of the original sample where the SUW is found (in increments of 10).
Character Ending Position |
Sequence Number | The position of the word within a long unit word (increments of 10).
Surface Form Start Position | Offset of the start/end positions of the surface and infinitive forms from the sample's head (increments of 10).
Surface Form End Position |
Fixed Length Flag | 0: Not fixed length, 1: Fixed length
Variable Length Flag | 0: Not variable length, 1: Variable length
Sentence Head Label | B: Is a sentence head, I: Not a sentence head
Word List ID | An ID identifying the word at the level of surface or infinitive form (as the number of digits is very large, the bigint format is used).
Lexeme ID | An ID corresponding to UNIDIC lexemes.
Lexeme | Short unit word information.
Lexeme Reading |
Detailed Lexeme Classification |
Word Type |
Part of Speech |
Inflectional Pattern |
Inflectional Form |
Word Form |
Usage |
Infinitive Form |
Infinitive Form and Surface Form |
Original Character String |
Surface Pronunciation |
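The left-to-right field order listed above can be turned into named records. The following Python sketch assumes that each line holds exactly the 25 fields in the table, that there is no header row, and that the offset and ID fields are decimal integers; none of these points is confirmed here, and the snake_case key names are invented for illustration.

```python
from typing import Dict, Union

# Hypothetical key names, in the left-to-right order of the field table.
SUW_FIELDS = [
    "subcorpus_name", "sample_id", "char_start", "char_end", "sequence_number",
    "surface_start", "surface_end", "fixed_length_flag", "variable_length_flag",
    "sentence_head_label", "word_list_id", "lexeme_id", "lexeme",
    "lexeme_reading", "lexeme_subclass", "word_type", "pos",
    "inflectional_pattern", "inflectional_form", "word_form", "usage",
    "infinitive_form", "surface_form", "original_string", "pronunciation",
]

# Fields assumed to be integers. Python ints have arbitrary precision,
# so the very long Word List ID needs no special handling.
INT_FIELDS = (
    "char_start", "char_end", "sequence_number",
    "surface_start", "surface_end", "word_list_id",
)


def parse_suw_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one tab-delimited SUW record into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(SUW_FIELDS):
        raise ValueError(
            f"expected {len(SUW_FIELDS)} fields, found {len(values)}"
        )
    record: Dict[str, Union[str, int]] = dict(zip(SUW_FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record
```

Raising an error on an unexpected field count makes a mismatch between this assumed layout and the real files visible immediately rather than silently shifting columns.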

The "Sentence Head Label" is "B" for a word at the starting point of a "sentence" tag in C-XML, and "I" otherwise.
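Because each sentence start carries the label "B", records read in file order can be regrouped into sentences. The sketch below assumes the records are dictionaries with a `sentence_head_label` key (a hypothetical name) and that they arrive in their original order.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def group_into_sentences(records: Iterable[Record]) -> List[List[Record]]:
    """Group consecutive records into sentences using the B/I label."""
    sentences: List[List[Record]] = []
    for record in records:
        # A "B" label opens a new sentence. The very first record also opens
        # one, in case the input happens to begin in the middle of a sentence.
        if record["sentence_head_label"] == "B" or not sentences:
            sentences.append([])
        sentences[-1].append(record)
    return sentences
```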
The "Character Starting Position" and "Surface Form Start Position" fields differ in the same way as the "Original Character String" and "Infinitive Form and Surface Form" fields: the former refers to the original character string, the latter to the infinitive and surface forms. The "Original Character String" included in the short unit word information is the character string before any numerals are converted. If converting a numeral produces a string that is segmented into several short unit words, the fields are assigned as shown in the table "Examples of interactions based on the start point of a string of converted numerical characters".

Examples of interactions based on the start point of a string of converted numerical characters

Character Starting Position | Character Ending Position | Sequence Number | Surface Form Start Position | Surface Form End Position | Infinitive Form and Surface Form | Original Character String
10 | 50 | 10 | 10 | 30 | 二千 | 2011
10 | 50 | 20 | 30 | 40 | | 2011
10 | 50 | 30 | 40 | 50 | | 2011
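The offsets in the table above can be checked with a little arithmetic. The Python sketch below reads the table as one character per step of 10, which is an inference from the example rather than a documented rule, and verifies that the surface-form spans of the three records cover the shared character span without gaps. The function names are invented for illustration.

```python
from typing import Iterable, Tuple

STEP = 10  # offsets are given in increments of 10

# (char_start, char_end, sequence_number, surface_start, surface_end),
# copied from the three rows of the example table.
ROWS = [
    (10, 50, 10, 10, 30),
    (10, 50, 20, 30, 40),
    (10, 50, 30, 40, 50),
]


def span_length(start: int, end: int) -> int:
    """Number of STEP-sized units between two offsets."""
    if end < start or (end - start) % STEP != 0:
        raise ValueError(f"invalid span: {start}..{end}")
    return (end - start) // STEP


def spans_tile(start: int, end: int, spans: Iterable[Tuple[int, int]]) -> bool:
    """True if the spans, taken in order, cover start..end with no gap or overlap."""
    position = start
    for span_start, span_end in sorted(spans):
        if span_start != position:
            return False
        position = span_end
    return position == end


char_start, char_end = ROWS[0][0], ROWS[0][1]
surface_spans = [(row[3], row[4]) for row in ROWS]
```

For these rows, `span_length(char_start, char_end)` gives 4, matching the four characters of "2011", and `span_length(10, 30)` gives 2, matching the two characters of 二千.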

Long Unit Word TSV Fields

The contents of the TSV data fields regarding Long Unit Words are listed (from left to right) in the table "Long Unit Word TSV Fields". One long unit word corresponds to a single record (line).

Long Unit Word TSV Fields

Field Name | Notes
Subcorpus Name |
Sample ID |
Surface Form Start Point | Offset from the head of the original sample where the LUW's infinitive and surface forms are found (in increments of 10).
Surface Form End Point |
Clause | B: Is a clause, blank: Not a clause
Short/Long Differentiation Flag | Indicates whether the scope of the SUW and the LUW match. 0: Short and long match, 1: Short and long differ.
Fixed Length Flag | 0: Not fixed length, 1: Fixed length
Variable Length Flag | 0: Not variable length, 1: Variable length
Lexeme | Long unit word information.
Lexeme Reading |
Word Type |
Part of Speech |
Inflectional Pattern |
Inflected Form |
Word Form |
Infinitive Form |
Infinitive and Surface Form |
Original Character String |
Surface Pronunciation |
Sequence Number | The ordering of the long unit word within a sample (increments of 10).
Character Start Point | Offset of the characters from the head of the original character string (increments of 10).
Character End Point |
Sentence Head Label | B: Sentence head, I: Not a sentence head.
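As a rough illustration of working with the LUW layout, the Python sketch below reads LUW records with `csv.DictReader` and selects those whose Short/Long Differentiation Flag is 1. It assumes each line holds exactly the 23 fields listed above, with no header row; the snake_case key names and function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, List, TextIO

# Hypothetical key names, in the left-to-right order of the field table.
LUW_FIELDS = [
    "subcorpus_name", "sample_id", "surface_start", "surface_end", "clause",
    "short_long_flag", "fixed_length_flag", "variable_length_flag", "lexeme",
    "lexeme_reading", "word_type", "pos", "inflectional_pattern",
    "inflected_form", "word_form", "infinitive_form", "surface_form",
    "original_string", "pronunciation", "sequence_number", "char_start",
    "char_end", "sentence_head_label",
]


def read_luw_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per LUW record from an open tab-delimited text stream."""
    reader = csv.DictReader(
        handle,
        fieldnames=LUW_FIELDS,   # no header row is assumed
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # quotation marks are ordinary characters
    )
    yield from reader


def luws_differing_from_suws(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the records whose flag says the short and long units differ."""
    return [record for record in records if record["short_long_flag"] == "1"]
```

All values are left as strings here; the offset fields can be converted with `int()` where arithmetic is needed.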
 
 
