Paid Edition Contents

The Paid edition is a system of access for users who wish to be able to personally access and analyse the entire database of written language, such as information processing researchers. We currently offer access to all data, contained on four DVDs. Please note that software environments for accessing all the data are not provided.

Provides access to all corpus data.
Data cannot be redistributed.
Before accessing the data, in order to protect copyright you must enter a usage contract with NINJAL.

sashie1
An image of a file included in the Paid version.

The following data is compiled on the Paid edition.

Core Data (Roughly 1% of the total) Approx. 1 million short unit words.

The core data comprising Short Unit Words (SUW) and Long Unit Words (LUW), integrated XML (morphological analyses included), with core data for six different registers offered. Total number of samples: 1,980.

Character-based XML (C-XML) Approx. 100 million short unit words*.

XML data containing tags related to document structure. Data is divided into Fixed- and Variable-length samples. 172,675 total samples.
Fixed-length samples are taken from the following 5 registers: OW, PB, PN, PM, LB, while variable-length samples are taken from all 13 registers.

* White space and symbols excluded

Integrated-format XML (M-XML) Approx. 100 million short unit words*.

Integrated-format XML documents (morphological information included) is provided (there is no distinction between the data for fixed- and variable-length samples). Data is provided for all 13 registers.

* White space and symbols excluded

Documentation

Manuals and bibiliographical information are included (5 types). Copyright data is also provided.

Tabulated Data

sashie2

Tabulated data is offered for both Short and Long Unit Words. Data from all 13 registers is provided.

Total Word Count

The included morphological analyses cover the short unit words and long unit words from each category of each subcorpus as listed in the table below (both TSV and M-XML). Here, the totals for each sub-corpora is presented, with white spaces and symbols excluded.

Word Counts of Small and Long Unit Words in Each Sub-Corpus

Sub-Corpus Name	Sample Count	SUW Count	LUW Count
Publication ・Newspapers	1,133	1,061,729	773,395
Publication ・ Newspapers (Core)	340	308,504	224,140
Publication ・ Magazines	1,910	4,242,224	3,320,948
Publication ・ Magazines (Core)	86	202,268	159,883
Publication ・ Books	10,034	28,348,233	22,688,202
Publication ・ Books (Core)	83	204,050	169,730
Library ・ Books	10,551	30,377,866	25,092,641
Specialized ・ Reports	1,438	4,685,801	2970,971
Specialized ・ Reports (Core)	62	197,011	129,646
Specialized ・ Bestsellers	1,390	3,742,261	3,185,745
Specialized ・ Answers	90,507	10,162,945	8,534,840
Specialized ・ Answers (Core)	938	93,932	78,770
Specialized ・Blogs	52,209	10,101,397	8,210,312
Specialized ・Blogs (Core)	471	92,746	75,242
Specialized ・Legal	346	1,079,146	706,313
Specialized ・National Diet Minutes	159	5,102,469	4,007,842
Specialized ・PR Documents	354	3,755,161	2,308,450
Specialized ・Textbooks	412	928,448	746,170
Specialized ・ Poems	252	225,273	202,425
Total	172,675	104,911,464	83,585,665

Important Points Regarding the Text

The provided XML data corresponds closely to the original text ofthe BCCWJ, but before morphological analysis is conducted, an XML process (numTrans) is applied to ease reading. This process produces output as in the following example:

２０１１年に完成した
　　　　↓
<NumTrans originalText="２０１１" >二千十一</NumTrans>年に完成した

In this way, the text "２０１１年" is parsed as 「二千」「十」「一」「年」 (two thousand eleven) rather than as 「２」「０」「１」「１」「年」 (two zero one one). The actual meaning of the original character string will be retained through the parsing process. Also, at points where fractions occur, the process will also shift characters as necessary to maintain readability.

５０／４５
　　↓
<fraction>
<denominator><NumTrans originalText="４５">四十五</NumTrans></denominator>
<vinculum><NumTrans originalText="／">分</NumTrans></vinculum>
<numerator><NumTrans originalText="５０">五十</NumTrans></numerator>
</fraction>

Because these processes occur, the XML containing morphological analyses (M-XML) will not always correspond exactly to the text extracted and used in the character-based XML (C-XML). However, the TSV and M-XML data is retrieved from the same source as the C-XML data.

In portions of the data including morphological analyses, original character strings are marked with the tag "originalText", while character strings modified for morphological analysis are marked with the tag "orthToken".

In addition, because this number conversion process occurs automatically on extracted text, there is a chance of errors on data outside of the core data set which is checked manually.

*Reference：C-XML Details

Important Points Regarding Morphological Analysis

The morphological information regarding short and long unit words is the same in both TSV and M-XML formats, and so there is no need to parse identical SUWs and LUWs as separate categories.

The Short Unit Words were all manually checked based on analyses of UniDic. Because the analysis was not done using only a single version of UniDic, the SUWs of the BCCWJ and UniDic may not always be the same when compared. This is also true of Long Unit Words.

All of the morphological analysis includes any information that was thought to be important, regardless of the danger of redundancy. Although it would be enough to include an ID with each SUW to indicate its correspondence with data from UniDic, it was decided not to pursue this method.

In addition, the M-XML and TSV formats do not include bibliographical information, so it would be necessary to retrieve the sample IDs separately in some cases.