言語資源開発センター -Center for Language Resource Development- Japanese NINJAL
 

Character-based XML (C-XML)

The C-XML data can be found in the C-XML directory on Disc 1, divided into subdirectories for each sub-corpus. Each directory holds a compressed file, which when extracted contains a number of XML files, each corresponding to a single sample. XML files for registers with a particularly large number of samples (the LB, PB, OC, and OY registers) will be extracted into individual subdirectories.
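
Once the archives have been extracted, the sample files can be enumerated with a recursive search. The following is a minimal sketch, not part of the corpus distribution. It assumes the extracted files carry an ".xml" extension, and `cxml_root` is a hypothetical name for wherever the C-XML directory was extracted.

```python
from pathlib import Path
from typing import Iterator


def iter_sample_files(cxml_root: str) -> Iterator[Path]:
    """Yield every XML file below cxml_root, in sorted path order.

    The search is recursive because registers with many samples
    (LB, PB, OC, OY) extract into additional subdirectories.
    """
    yield from sorted(Path(cxml_root).rglob("*.xml"))
```

Sorting the paths keeps the processing order stable between runs, which helps when results are compared across sessions.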

Document Structure Tag Sets and Their Relationship with the Sub-Corpora

The BCCWJ is made up of a number of sub-corpora. The document structure tag sets reflect the particular properties of each sub-corpus, and are provided as shown in the table "Relationships Between Tag Sets and Sub-Corpora". Individual tag sets are defined as XML documents. In addition, the application of otherwise identical tags may change depending on whether the source text was printed or electronic, so the properties and usage of certain tags can differ from one sub-corpus to another.

The tag sets (TS) are broadly divided into the following 3 types. "Variable length TS (partially amended)" in the table refers to a tag set which has been amended in parts as compared with the regular variable length TS.

  • Variable Length TS:
    A tag set made to describe variable length samples (where each sample is made up of a single "Article").
  • Fixed Length TS:
    A tag set made to describe fixed length samples (where each sample consists of 1000 characters).
  • Yahoo! Answers TS:
    A tag set made to describe samples from Yahoo! Answers.

Relationships Between Tag Sets and Sub-Corpora

Sub-Corpus                    Tag Set                                 Media Type of Source
Publication Sub-corpus        Variable length TS, Fixed length TS     Printed
Library Sub-corpus            Variable length TS, Fixed length TS     Printed
White Papers                  Variable length TS, Fixed length TS     Printed
Textbooks                     Variable length TS (partially amended)  Printed
PR Documents                  Variable length TS                      Electronic
Bestsellers                   Variable length TS                      Printed
Yahoo! Answers                Yahoo! Answers TS                       Electronic
Yahoo! Blogs                  Variable length TS (partially amended)  Electronic
Poetry                        Variable length TS (partially amended)  Printed
Legal Documents               Variable length TS                      Electronic
Minutes of the National Diet  Variable length TS                      Electronic
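
For scripted processing, the table above can be held as a simple lookup. The sketch below transcribes the table by hand. The short labels ("variable", "fixed", "variable-amended", "yahoo-answers") and the function name are shorthand invented for this example and are not identifiers used in the corpus.

```python
from typing import Dict, List, Tuple

# (tag sets, media type of source) per sub-corpus, transcribed from the table.
# The tag set labels are shorthand chosen for this sketch.
TAG_SETS: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "Publication Sub-corpus": (("variable", "fixed"), "printed"),
    "Library Sub-corpus": (("variable", "fixed"), "printed"),
    "White Papers": (("variable", "fixed"), "printed"),
    "Textbooks": (("variable-amended",), "printed"),
    "PR Documents": (("variable",), "electronic"),
    "Bestsellers": (("variable",), "printed"),
    "Yahoo! Answers": (("yahoo-answers",), "electronic"),
    "Yahoo! Blogs": (("variable-amended",), "electronic"),
    "Poetry": (("variable-amended",), "printed"),
    "Legal Documents": (("variable",), "electronic"),
    "Minutes of the National Diet": (("variable",), "electronic"),
}


def sub_corpora_using(tag_set: str) -> List[str]:
    """Return the sub-corpora whose samples may use the given tag set."""
    return [name for name, (sets, _media) in TAG_SETS.items() if tag_set in sets]
```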

Variable Length Tag Set

The variable length tag set was made to describe variable length samples (samples consisting of one full "Article"). In total, there are 46 tag types.
The information provided by this tag set falls into the following 3 groups.

Tags Related to Sampling

"sample" and "sampling" are the tags related to samples. The "sample" tag marks the scope of a given sample, while the "sampling" tag contains information regarding various points of interest about the sample.

Tags Related to Characters and Transcription

The roles of this type of tag are: (1) To increase the convenience of data retrieval and processing, and (2) To allow the accurate description of the original texts in electronic format. An example of the former is the "correction" tag (denoting places where misprinted text was corrected).

生活基<correction type="erratum" originalText="盟">盤</correction>に
伸びを示し<correction type="omission">て</correction>いる
整備を<correction type="excess" originalText="を" />図るべく
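
Judging from the three examples above, the element content appears to hold the corrected reading, while the "originalText" attribute holds what was printed (empty for an omission). Under that assumption, both readings can be recovered from a fragment. The sketch below is illustrative only: the function name and the `<s>` wrapper element are invented for the example, and only "correction" elements that are direct children of the fragment are handled.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def both_readings(fragment: str) -> Tuple[str, str]:
    """Return (corrected, original) text for a fragment with <correction> tags."""
    # <s> is only a wrapper so that the fragment parses as one XML element.
    root = ET.fromstring(f"<s>{fragment}</s>")
    corrected = [root.text or ""]
    original = [root.text or ""]
    for child in root:
        inner = "".join(child.itertext())
        if child.tag == "correction":
            corrected.append(inner)                          # corrected reading
            original.append(child.get("originalText", ""))   # text as printed
        else:
            corrected.append(inner)
            original.append(inner)
        # Text following the child element belongs to both readings.
        corrected.append(child.tail or "")
        original.append(child.tail or "")
    return "".join(corrected), "".join(original)
```

For the "excess" example, the corrected reading drops the duplicated particle and the original reading restores it from the attribute.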

Examples of the latter are the "ruby" (denoting kana superscripts) and "missingCharacter" (denoting non-standard characters) tags.

<ruby rubyText="ご">語</ruby><ruby rubyText="い">彙</ruby>
<missingCharacter attribute="HanIdeograph" unicode="U+5AEB"
daikanwa="M06673" description="女偏に莫">〓</missingCharacter>
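
These two tags can be read with a standard XML parser. The sketch below pairs each base character with its reading and resolves a "missingCharacter" element to the code point named in its "unicode" attribute. The function names and the `<s>` wrapper are invented for the example, and it assumes the attribute always has the "U+XXXX" form shown above, which may not hold for every file.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def ruby_pairs(fragment: str) -> List[Tuple[str, str]]:
    """Return (base text, reading) for every <ruby> element in a fragment."""
    root = ET.fromstring(f"<s>{fragment}</s>")  # <s> is only a parsing wrapper
    return [(r.text or "", r.get("rubyText", "")) for r in root.iter("ruby")]


def resolve_missing_character(elem: ET.Element) -> str:
    """Return the character named by the unicode attribute, if there is one.

    Falls back to the placeholder text of the element when the attribute
    is absent or not in the assumed "U+XXXX" form.
    """
    code = elem.get("unicode", "")
    if code.startswith("U+"):
        return chr(int(code[2:], 16))
    return elem.text or ""
```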

Tags Related to Document Structure

These types of tags are used to assign logical roles to different parts of the document, and as seen in the variable length tag set tag list below they are divided into the following categories: (a) Hierarchical structure, (b) Figures, (c) Citations, (d) Annotations, and (e) Other.

The following is an explanation of tags related to hierarchical structure. These tags present information related to the high-level structure of the article, such as "cluster", "sentence", and "paragraph". The following example shows these elements as they appear in an extracted portion of an XML file. Indentation in the example indicates the subordinate levels of the structural hierarchy. For example, in the figure below you can see that there are "titleBlock", "cluster", and "paragraph" elements subordinate to the "article" element.

  article
   titleBlock 第2節 内外均衡の背景
   paragraph  53年度中にみられた...
   cluster
    titleBlock 1.財政金融政策の効果
    paragraph  石油危機後,...
    cluster
     titleBlock (公共投資の拡大)
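
An outline like the one above can be produced by walking the element tree and indenting by nesting depth. The following is a sketch under assumptions: the `SAMPLE` string is a hand-made fragment modelled on the listing (real files carry more elements and attributes), the nesting of "title" inside "titleBlock" and "sentence" inside "paragraph" is assumed, and `outline` and `STRUCTURAL` are names invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-made fragment modelled on the listing above, not an actual corpus file.
SAMPLE = (
    "<article>"
    "<titleBlock><title>第2節 内外均衡の背景</title></titleBlock>"
    "<paragraph><sentence>53年度中にみられた</sentence></paragraph>"
    "<cluster>"
    "<titleBlock><title>1.財政金融政策の効果</title></titleBlock>"
    "<paragraph><sentence>石油危機後,</sentence></paragraph>"
    "<cluster>"
    "<titleBlock><title>(公共投資の拡大)</title></titleBlock>"
    "</cluster>"
    "</cluster>"
    "</article>"
)

STRUCTURAL = {"article", "cluster", "titleBlock", "paragraph"}
WITH_PREVIEW = {"titleBlock", "paragraph"}


def outline(elem: ET.Element, depth: int = 0, width: int = 20) -> List[str]:
    """Return one indented line per structural element, in document order."""
    lines = []
    if elem.tag in STRUCTURAL:
        preview = ""
        if elem.tag in WITH_PREVIEW:
            # Show the first few characters of the text the element contains.
            preview = " " + "".join(elem.itertext()).strip()[:width]
        lines.append(" " * depth + elem.tag + preview)
        depth += 1
    for child in elem:
        lines.extend(outline(child, depth, width))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(SAMPLE))))
```

Elements outside `STRUCTURAL` (such as "sentence") are skipped in the output, but their children are still visited at the same depth.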


  Tag Name            Contents

Data Sampling
  sample              Marks the scope of one sample.
  sampling            Gives meta-data regarding sampling.

Hierarchical Structure (Document Structure)
  article             A single element of coherent text by one author.
  blockEnd            Marks semantic boundaries.
  cluster             Denotes the scope covered by a title tag.
  titleBlock          Denotes the title and directly related elements.
  title               The title of a definite portion of the sample.
  orphanedTitle       The title of an indefinite portion of the sample.
  list                Itemized or listed elements.
  paragraph           A paragraph element.
  sentence            A sentence element.

Figures (Document Structure)
  figureBlock         Denotes the figure and any accompanying elements.
  figure              The figure element itself.
  caption             Any caption describing a figure.
  table               A table element.

Citations (Document Structure)
  quotation           A quoted element from outside the text of the current article (e.g. from lists, photos, illustrations, etc.).
  citation            A cited element from another document.
  source              Information regarding a cited document (document title, author name, document data, etc.).
  speech              Denotes transcriptions of speech or internal monologues.
  speaker             Marks character strings explicitly indicating the speaker.
  quote               Denotes quotations, utterances, internal monologues, and descriptions directly referenced from a different article.

Annotations (Document Structure)
  noteBody            Indicates a note and its scope.
  noteBodyInline      Indicates an inline note.

Other (Document Structure)
  abstract            Indicates an abstract of an article or cluster element.
  authorsData         Indicates information about the author of the article.
  contents            A table of contents.
  profile             Profiles of the author or characters.
  rejectedBlock       Indicates the existence of a block deleted from the sample.
  verse               Indicates a poem, tanka, haiku, or song lyrics.
  verseLine           Marks a single line in a verse.

Characters and Annotation
  ruby                Indicates kana superscript showing kanji readings.
  correction          Marks corrections of the original text.
  missingCharacter    Characters from outside the JIS X 0213:2004 character set.
  enclosedCharacter   Marks characters enclosed by shapes which serve as bibliographical labels, such as ©.
  cursive             Indicates characters written in cursive.
  image               Symbols from outside the JIS X 0213:2004 character set.
  superScript         Marks a superscript element.
  subScript           Marks a subscript element.
  fraction            Marks the proper fraction portion of a compound number.
  delete              Marks text with a strike-through line.
  br                  A line break.
  info                Info concerning characters in the original text, such as erasure lines.
  rejectedSpan        Indicates the existence of deleted inline characters.
  substitution        Indicates that a character has been substituted.

Fixed Length Tag Set

The fixed length tag set was made to describe fixed length samples (those samples consisting of exactly 1,000 characters). The specifications of the tag set are largely the same as those of the variable length TS, but with the following differences.

Fixed Length Samples are Limited by Character Count

Because a fixed length sample is cut off by character count, its block elements may not match the definitions used for variable length elements. For example, an "article" element may contain a number of articles, chapters, or sections, and it is possible that even the first "titleBlock" element will not cover the entirety of the following text.

The "isWholeArticle" attribute of an "article" element is taken as implied (optional).

The Following Elements are not Used

The "cluster" element.
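
The two differences that can be checked mechanically are the absence of "cluster" elements and the optional "isWholeArticle" attribute. The sketch below shows one way to check them. The function names are invented for the example, and because the permitted values of "isWholeArticle" are not given here, the attribute is returned as a raw string rather than interpreted.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def fixed_length_problems(sample: ET.Element) -> List[str]:
    """Report uses of elements that the fixed length tag set does not use."""
    problems = []
    cluster_count = sum(1 for _ in sample.iter("cluster"))
    if cluster_count:
        problems.append(
            f"{cluster_count} cluster element(s) found; "
            "cluster is not used in the fixed length tag set"
        )
    return problems


def whole_article_flag(article: ET.Element) -> Optional[str]:
    """Return the raw isWholeArticle value, or None when it is omitted."""
    return article.get("isWholeArticle")
```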

Yahoo! Answers Sub-Corpus Tag Set

 Samples from the "Yahoo! Answers" sub-corpus are arranged logically in pairs of questions and answers. However, because it was not possible to fully describe this structure using the existing variable and fixed length tag sets, these structures were treated as independent document types. There are 9 types of tags.

  Tag Name        Contents
  sample          Indicates the scope of a question/answer pair.
  OCQuestion      Marks the original question.
  OCAnswer        Marks the original answer.
  br              A line break.
  webLine         Marks automatic logical operators RE: web data.
  sentence        Marks a unified sentence.
  rejectedBlock   Marks an erased element.
  ncr             Marks the erasure of mathematical operands or the substitution of an "=".
  Info            Information deemed auxiliary.
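
A question and its answers can be pulled out of a sample by looking for the "OCQuestion" and "OCAnswer" elements and turning "br" elements into newlines. This is a sketch under assumptions: the exact nesting inside a real sample is not described here, so the code searches at any depth, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def text_with_breaks(elem: ET.Element) -> str:
    """Concatenate the text of elem, rendering each <br> as a newline."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append("\n" if child.tag == "br" else text_with_breaks(child))
        parts.append(child.tail or "")
    return "".join(parts)


def question_and_answers(sample: ET.Element) -> Tuple[str, List[str]]:
    """Return (question text, list of answer texts) for one sample."""
    question = sample.find(".//OCQuestion")
    answers = [text_with_breaks(a) for a in sample.iter("OCAnswer")]
    return (text_with_breaks(question) if question is not None else ""), answers
```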

The "Other" Tag Set

As shown in the "Relationships between tag sets and sub-corpora" table above, certain sub-corpora make use of an amended variable length tag set. This section will explain the differences between amended versions of the variable length tag set.

Yahoo! Blogs

"ASCIIArt" has been added as a possible value of the "type" attribute of the "rejectedBlock" tag. This indicates so-called ASCII art which was omitted during sample creation.

Poetry

A single "sample" element may contain multiple subordinate "article" elements. This is due to the sampling methods used for the "Poetry" sub-corpus, wherein multiple works (i.e. "article" elements) are included in each sample. It should be noted that in the regular variable length tag set a "sample" element will only have a single "article" sub-element.
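
Code written for the regular variable length tag set may therefore assume exactly one "article" per sample. A minimal sketch of a loop that works for both cases follows; the function name is invented for the example, and the search is made at any depth because the exact position of "article" within "sample" is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import List


def articles_in(sample: ET.Element) -> List[ET.Element]:
    """Return every article element in a sample, in document order.

    A Poetry sample may yield several elements; a regular variable
    length sample is expected to yield one.
    """
    return list(sample.iter("article"))
```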

Textbooks

22 tag types have been omitted from and 8 types added to the variable length tag set for use with the "Textbook" sub-corpus.

Omitted Tags

 abstract, authorsData, blockEnd, contents, cursive, delete, info, insert, list, listItem,
 orphanedTitle, paragraph, profile, quotation, quote, source, speaker, speech, table,
 titleBlock, verse, verseLine

Added Tags

 book, copyright, supplement, skippedBlock, surrogatePair, subRuby, root, skippedSpan
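
The size of the resulting textbook tag set follows from these lists. The total below is derived for illustration and is not stated in the source; it assumes that every omitted tag belongs to the 46 variable length tag types and that none of the added tags does.

```python
VARIABLE_LENGTH_TAG_COUNT = 46  # stated for the variable length tag set

OMITTED = {
    "abstract", "authorsData", "blockEnd", "contents", "cursive", "delete",
    "info", "insert", "list", "listItem", "orphanedTitle", "paragraph",
    "profile", "quotation", "quote", "source", "speaker", "speech", "table",
    "titleBlock", "verse", "verseLine",
}
ADDED = {
    "book", "copyright", "supplement", "skippedBlock", "surrogatePair",
    "subRuby", "root", "skippedSpan",
}

# The two lists must not overlap for the simple arithmetic to hold.
assert OMITTED.isdisjoint(ADDED)

# Derived count under the assumptions above: 46 - 22 + 8.
textbook_tag_count = VARIABLE_LENGTH_TAG_COUNT - len(OMITTED) + len(ADDED)
```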

Tags Added or Modified For Use With the Textbook Sub-Corpus


  Tag Name        Status      Contents

Elements describing hierarchical linguistic structure
  book            [Added]     Indicates a full textbook.
  cluster         [Modified]  Indicates the scope of the text covered by a chapter title, or section title as shown in the TOC.

Elements describing specific language structures
  copyright       [Modified]  Indicates elements where special copyright treatment is necessary beyond what is found in "citation" elements.
  supplement      [Modified]  Indicates an element that is in a different format from the main text (the main scholastic content), or that is supplemental to the main body of the textbook.
  skippedBlock    [Added]     A text element that was not covered in the lexical survey when creating the textbook corpus word list.

Elements relating to characters and annotation
  surrogatePair   [Added]     Denotes a surrogate pair character in the JIS X 0213:2004 format as indicated by a "=".
  subRuby         [Added]     Indicates kana transcriptions of kanji readings below the text in cases of horizontal writing, and to the left of the text in cases of vertical writing.
  root            [Added]     Indicates areas where there is a danger of misinterpretation of the scope of the text indicated by a √ mark.
  skippedSpan     [Added]     Indicates a character string that was omitted from the lexical survey when creating the textbook corpus word list.

* Reference: Tanaka et al. (2011). "II Textbook corpus character input and tag usage."

 
 
