The DVD edition is intended for users, such as information processing researchers, who wish to access and analyse the entire written-language database themselves. All data is currently supplied on two DVDs. Please note that no software environment for accessing the data is provided.
The following data is compiled on the DVD edition.
Tabulated data is offered for both Short and Long Unit Words. Data from all 13 registers is provided.
The included morphological analyses cover the short unit words (SUWs) and long unit words (LUWs) from each category of each subcorpus, as listed in the table below (in both TSV and M-XML formats). The totals for each subcorpus are shown, with white space and symbols excluded.
|Sub-Corpus Name||Sample Count||SUW Count||LUW Count|
|Publication ・ Newspapers (Core)||340||308,504||224,140|
|Publication ・ Magazines||1,910||4,242,224||3,320,948|
|Publication ・ Magazines (Core)||86||202,268||159,883|
|Publication ・ Books||10,034||28,348,233||22,688,202|
|Publication ・ Books (Core)||83||204,050||169,730|
|Library ・ Books||10,551||30,377,866||25,092,641|
|Specialized ・ Reports||1,438||4,685,801||2,970,971|
|Specialized ・ Reports (Core)||62||197,011||129,646|
|Specialized ・ Bestsellers||1,390||3,742,261||3,185,745|
|Specialized ・ Answers||90,507||10,162,945||8,534,840|
|Specialized ・ Answers (Core)||938||93,932||78,770|
|Specialized ・ Blogs (Core)||471||92,746||75,242|
|Specialized ・ National Diet Minutes||159||5,102,469||4,007,842|
|Specialized ・ PR Documents||354||3,755,161||2,308,450|
|Specialized ・ Poems||252||225,273||202,425|
The provided XML data corresponds closely to the original text of the BCCWJ, but before morphological analysis is conducted, an XML process (numTrans) is applied to ease reading. This process produces output as in the following example:
In this way, the text "２０１１年" is parsed as 「二千」「十」「一」「年」 (two thousand eleven) rather than as 「２」「０」「１」「１」「年」 (two zero one one). The meaning of the original character string is retained through the parsing process. At points where fractions occur, the process also shifts characters as necessary to maintain readability.
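The conversion described above can be illustrated with a minimal sketch. This is not the numTrans implementation, whose details are not given here. The function name `to_positional_tokens`, the limit of four digits, and the handling of zero are all assumptions made for illustration.

```python
import unicodedata

KANJI_DIGITS = "〇一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def to_positional_tokens(digits: str) -> list[str]:
    """Split a digit string below 10000 into positional kanji tokens.

    Hypothetical helper: it only mimics the kind of output described
    for numTrans and makes no claim about how numTrans works internally.
    """
    # NFKC normalisation folds full-width digits (２０１１) to ASCII (2011).
    ascii_digits = unicodedata.normalize("NFKC", digits)
    if not (ascii_digits.isascii() and ascii_digits.isdigit()):
        raise ValueError(f"not a digit string: {digits!r}")
    if len(ascii_digits) > 4:
        raise ValueError("this sketch only handles values below 10000")

    tokens = []
    width = len(ascii_digits)
    for position, char in enumerate(ascii_digits):
        digit = int(char)
        if digit == 0:
            continue  # zeros contribute no token of their own
        unit = UNITS[width - 1 - position]
        if digit == 1 and unit:
            tokens.append(unit)  # 十, 百, 千 are written without a leading 一
        else:
            tokens.append(KANJI_DIGITS[digit] + unit)
    # Assumption: an all-zero input is rendered as a single 〇.
    return tokens or ["〇"]


print(to_positional_tokens("２０１１"))
```

For the input "２０１１" the sketch yields the three tokens 「二千」「十」「一」, matching the example in the text. Fractions and larger numbers are deliberately left out.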
Because these processes occur, the XML containing morphological analyses (M-XML) will not always correspond exactly to the text extracted and used in the character-based XML (C-XML). However, the TSV and M-XML data is retrieved from the same source as the C-XML data.
In portions of the data including morphological analyses, original character strings are marked with the tag "originalText", while character strings modified for morphological analysis are marked with the tag "orthToken".
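As a rough illustration of how the two markers might be read side by side, the sketch below parses a small inline fragment. The element name `SUW`, the choice to treat `originalText` and `orthToken` as attributes, the fallback when `originalText` is absent, and the sample fragment itself are all assumptions for illustration. The actual M-XML schema should be checked against the files on the DVD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real M-XML layout may differ.
SAMPLE = """<sentence>
  <SUW originalText="２０" orthToken="二千"/>
  <SUW originalText="１" orthToken="十"/>
  <SUW originalText="１" orthToken="一"/>
  <SUW orthToken="年"/>
</sentence>"""


def surface_pairs(xml_text: str) -> list[tuple[str, str]]:
    """Return (original string, analysed string) pairs for each token."""
    root = ET.fromstring(xml_text)
    pairs = []
    for suw in root.iter("SUW"):
        token = suw.get("orthToken", "")
        # Assumption: when originalText is absent, the text was not modified.
        original = suw.get("originalText", token)
        pairs.append((original, token))
    return pairs


pairs = surface_pairs(SAMPLE)
original_string = "".join(original for original, _ in pairs)
analysed_string = "".join(token for _, token in pairs)
print(original_string, analysed_string)
```

Keeping both strings per token makes it possible to map an analysed token back to the characters that appear in the character-based data.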
In addition, because this number conversion is applied automatically to the extracted text, errors may remain in data outside the core data set. Only the core data set is checked manually.
The morphological information regarding short and long unit words is the same in both TSV and M-XML formats, and so there is no need to parse identical SUWs and LUWs as separate categories.
The Short Unit Words were all manually checked based on analyses of UniDic. Because the analysis was not done using only a single version of UniDic, the SUWs of the BCCWJ and UniDic may not always be the same when compared. This is also true of Long Unit Words.
The morphological analysis data includes all information thought to be important, even at the risk of redundancy. Although including an ID with each SUW would have been enough to indicate its correspondence with UniDic data, this approach was not adopted.
In addition, the M-XML and TSV formats do not include bibliographical information, so in some cases it will be necessary to look this information up separately by sample ID.
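One way to combine the two sources is to index a separate bibliography table by sample ID and attach its fields to each morphological row. The sketch below assumes that each row carries a sample ID and that both tables are tab-separated with header rows. The column names, IDs, and titles are placeholders invented for illustration, not the actual file layout.

```python
import csv
import io

# Placeholder data; real column names and values will differ.
MORPH_TSV = "sample_id\tstart\tlemma\nXX00_00001\t10\t年\nXX00_00002\t20\t本\n"
BIB_TSV = "sample_id\ttitle\tyear\nXX00_00001\tPlaceholder title A\t2001\n"


def read_tsv(text: str) -> list[dict[str, str]]:
    """Read tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_bibliography(morph_rows, bib_rows):
    """Attach title and year to each morphological row via its sample ID."""
    bib_by_id = {row["sample_id"]: row for row in bib_rows}
    merged = []
    for row in morph_rows:
        bib = bib_by_id.get(row["sample_id"], {})
        # Rows with no matching bibliography entry get None for these fields.
        merged.append({**row, "title": bib.get("title"), "year": bib.get("year")})
    return merged


merged = attach_bibliography(read_tsv(MORPH_TSV), read_tsv(BIB_TSV))
for row in merged:
    print(row)
```

Building the lookup dictionary once keeps the join linear in the number of rows, which matters for subcorpora with tens of millions of words.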