言語資源開発センター -Center for Language Resource Development- Japanese NINJAL
 

Paid Edition Contents

The Paid edition is a system of access for users who wish to be able to personally access and analyse the entire database of written language, such as information processing researchers. We currently offer access to all data, contained on four DVDs. Please note that software environments for accessing all the data are not provided.

  • Provides access to all corpus data.
  • Data cannot be redistributed.
  • Before accessing the data, in order to protect copyright you must enter a usage contract with NINJAL.

sashie1
An image of a file included in the Paid version.

The following data is compiled on the Paid edition.

disk1

Core Data (Roughly 1% of the total)  Approx. 1 million short unit words.

  • The core data comprising Short Unit Words (SUW) and Long Unit Words (LUW), integrated XML (morphological analyses included), with core data for six different registers offered. Total number of samples: 1,980.

Character-based XML (C-XML)  Approx. 100 million short unit words*.

  • XML data containing tags related to document structure. Data is divided into Fixed- and Variable-length samples. 172,675 total samples.
  • Fixed-length samples are taken from the following 5 registers: OW, PB, PN, PM, LB, while variable-length samples are taken from all 13 registers.

* White space and symbols excluded

Integrated-format XML (M-XML)  Approx. 100 million short unit words*.

  • Integrated-format XML documents (morphological information included) is provided (there is no distinction between the data for fixed- and variable-length samples). Data is provided for all 13 registers.

* White space and symbols excluded

Documentation

  • Manuals and bibiliographical information are included (5 types). Copyright data is also provided.

disk2

Tabulated Data

sashie2

Tabulated data is offered for both Short and Long Unit Words. Data from all 13 registers is provided.

 

Total Word Count

The included morphological analyses cover the short unit words and long unit words from each category of each subcorpus as listed in the table below (both TSV and M-XML). Here, the totals for each sub-corpora is presented, with white spaces and symbols excluded.

Word Counts of Small and Long Unit Words in Each Sub-Corpus

Sub-Corpus Name Sample Count SUW Count LUW Count
Publication ・Newspapers 1,133 1,061,729 773,395
Publication ・ Newspapers (Core) 340 308,504 224,140
Publication ・ Magazines 1,910 4,242,224 3,320,948
Publication ・ Magazines (Core) 86 202,268 159,883
Publication ・ Books 10,034 28,348,233 22,688,202
Publication ・ Books (Core) 83 204,050 169,730
Library ・ Books 10,551 30,377,866 25,092,641
Specialized ・ Reports 1,438 4,685,801 2970,971
Specialized ・ Reports (Core) 62 197,011 129,646
Specialized ・ Bestsellers 1,390 3,742,261 3,185,745
Specialized ・ Answers 90,507 10,162,945 8,534,840
Specialized ・ Answers (Core) 938 93,932 78,770
Specialized ・Blogs 52,209 10,101,397 8,210,312
Specialized ・Blogs (Core) 471 92,746 75,242
Specialized ・Legal 346 1,079,146 706,313
Specialized ・National Diet Minutes 159 5,102,469 4,007,842
Specialized ・PR Documents 354 3,755,161 2,308,450
Specialized ・Textbooks 412 928,448 746,170
Specialized ・ Poems 252 225,273 202,425
Total 172,675 104,911,464 83,585,665

Important Points Regarding the Text

The provided XML data corresponds closely to the original text ofthe BCCWJ, but before morphological analysis is conducted, an XML process (numTrans) is applied to ease reading. This process produces output as in the following example:

2011年に完成した
    ↓
<NumTrans originalText="2011" >二千十一</NumTrans>年に完成した

In this way, the text "2011年" is parsed as 「二千」「十」「一」「年」 (two thousand eleven) rather than as 「2」「0」「1」「1」「年」 (two zero one one). The actual meaning of the original character string will be retained through the parsing process. Also, at points where fractions occur, the process will also shift characters as necessary to maintain readability.

50/45
  ↓
<fraction>
<denominator><NumTrans originalText="45">四十五</NumTrans></denominator>
<vinculum><NumTrans originalText="/">分</NumTrans></vinculum>
<numerator><NumTrans originalText="50">五十</NumTrans></numerator>
</fraction>

Because these processes occur, the XML containing morphological analyses (M-XML) will not always correspond exactly to the text extracted and used in the character-based XML (C-XML). However, the TSV and M-XML data is retrieved from the same source as the C-XML data.

In portions of the data including morphological analyses, original character strings are marked with the tag "originalText", while character strings modified for morphological analysis are marked with the tag "orthToken".

In addition, because this number conversion process occurs automatically on extracted text, there is a chance of errors on data outside of the core data set which is checked manually.

*Reference:C-XML Details

Important Points Regarding Morphological Analysis

The morphological information regarding short and long unit words is the same in both TSV and M-XML formats, and so there is no need to parse identical SUWs and LUWs as separate categories.

The Short Unit Words were all manually checked based on analyses of UniDic. Because the analysis was not done using only a single version of UniDic, the SUWs of the BCCWJ and UniDic may not always be the same when compared. This is also true of Long Unit Words.

All of the morphological analysis includes any information that was thought to be important, regardless of the danger of redundancy. Although it would be enough to include an ID with each SUW to indicate its correspondence with data from UniDic, it was decided not to pursue this method.

In addition, the M-XML and TSV formats do not include bibliographical information, so it would be necessary to retrieve the sample IDs separately in some cases.

 
 

リンク Links