<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Encoding Guidelines for the ELTeC</title>
            <author>COST Action CA16204 – WG1</author>
         </titleStmt>
         <publicationStmt>
            <p>Unpublished draft for discussion</p>
         </publicationStmt>
         <sourceDesc>
            <p>A born-digital document drafted in TEI format by LB</p>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change when="2018-01-17">Expanded metadata section a bit; added comments from CO and
            BN</change>
         <change when="2017-12-17">First (partial) discussion draft</change>
      </revisionDesc>
   </teiHeader>
   <text>
      <body>
         <p>This is the first draft of a reference document defining the encoding scheme to be used
            for the European Literary Text Collection (ELTeC) which will be a major deliverable of
            COST Action 16204, <title>Distant Reading</title>. As a first draft it necessarily
            contains much uncertainty, either because policies have yet to be determined, or because
            issues are currently under-specified. Particular topics on which policy remains to be
            defined are signalled below with the label <q>Open Question</q>. Input from other
            participants in the Action about these topics would be particularly useful. </p>
         <div>
            <head>Principles</head>
            <p>The MoU for the project points out that <q>Distant Reading methods cover a wide range
                  of computational methods for literary text analysis, such as authorship
                  attribution, topic modelling, character network analysis, or stylistic
                  analysis.</q> The focus of the ELTeC encoding scheme is thus not to represent
               texts in all their original complexity of structure or appearance, but rather to
               facilitate a richer and better-informed distant reading than a transcription of its
               lexical content alone would permit. For example, it seems useful to distinguish
               headings and annotations from the rest of the text, and to be able to locate
               stretches of text within gross structural features such as chapters and paragraphs.
               It is probably also useful to distinguish passages belonging to different narrative
               levels (for example, direct speech versus narrative or quotation versus narrative)
               and to identify reference points such as page breaks. It is less useful to record
               exact nuances of rendition or spelling in a particular version of a text. Our goal is
               not to duplicate the work of scholarly editors or to produce (yet another) digital
               edition of a specific source document. </p>
            <!-- CO: OK. With respect to the selection criteria, we need to consider that we will encode the first edition of each text. 
               The first edition will reveal spelling variances and variances concerning different text levels, e.g, direct speech
               (like any other modern edition will do in some way or the other)-->
            <!-- LB : the idea of encoding specifically the first edition wasn't what I thought the
               brief was when I started work on this document! Licensing issues-->
            <p>In selecting features for inclusion in the markup scheme, we have been guided, but
               not limited, by existing practice as far as possible. Particular collections we have
               examined are listed in section <ptr target="#sources"/> but the main goal has been to
               identify a small core set of essential textual features which can be readily
               (preferably automatically) identified in existing digital transcriptions, or easily
               and consistently provided by new transcriptions. </p>
            <!-- BN : I definitely agree:-->
            <p>This document lists all the textual features which are to be distinguished in an
               ELTeC-conformant transcription. Whenever a given feature exists in a text, it will be
               marked up as indicated here. No other features will be captured by the markup: if
               a marked-up source text identifies some textual feature not provided for here,
               that markup will be removed. The goal is to ensure that the ELTeC texts can be
               processed by very simple-minded (but XML-aware) systems primarily concerned with
               lexis and to make life easier for the developers of such systems. </p>
            <!-- CO: In addition, our scheme should make sure that we have no variance in the actual encoding.
               Especially a fixed sequence of attributes and a fixed hierarchical structure of elements is needed for parsing issues. 
               For example, allowing diverging xpath may reveal problems for automatically processing ELTeC.  -->
            <!-- LB : I am assuming we will produce a tightly constrained XML schema to ensure that
               the hierarchy of XML elements is simple and predictable. We can't however do anything
               to enforce the order in which attributes appear on an XML tag, nor should we. And I
               don't know what you mean by "allowing diverging xpath": we can't control the
               structure of the documents we are encoding, just represent them! My
               assumption is that ELTeC documents will be well formed and valid XML documents: no
               more, but no less. What parsing issues are you imagining? -->
            <!-- CO: For this period, it may not be a frequent problem, but we also need to think of special character encoding:
               a full and extensive transcription guideline is needed. 
               We should further rely on UTF8 encoding. -->
            <!-- LB: Not sure what you mean here. Can you give an example? I am assuming we will use
              Unicode and UTF8. If we need anything beyond that we could use TEI <g> I suppose -->
            <p>We make no attempt to propose markup for linguistic annotations here. The assumption
               is that this will be produced by different annotation systems in different ways,
               though with an association between such annotations and the basic lexical structures
               represented by the core ELTeC markup. </p>
            <!-- CO: These linguistic or any other annotations may be assigned in different formats. 
               Thus, the tei representation serves as a starting point for some other automatic annotations and analysis. I think, -->
         </div>
         <div>
            <head>Basic Transcription Guidelines</head>
            <!-- CO: General thoughts: I think, we should have in mind when the encoding can be done automatically 
               and when we need man power for encoding. -->
            <!-- LB : in what way ? this sounds like a workflow issue -->
            <p>The basic unit of the ELTeC corpus is the text of a single novel, represented by a
               TEI <gi>text</gi> element. We propose no mechanism (other than metadata) to encode
               units larger than a single novel, such as multipart novel series like Proust's
                  <title>A la recherche du temps perdu</title> or Zola's <title>Les
                  Rougon-Macquart</title>. </p>
            <p><label>Open Question</label> Should we include liminal matter (titlepages, prefaces,
               appendixes...) in our transcriptions? The following policies seem possible: <list>
                  <item>No : these typically belong to a particular edition or version of the text,
                     and should therefore systematically be excluded</item>
                  <item>Yes : these often form a significant part of the reader's experience (cf.
                     the foreword to most editions of <title>David Copperfield</title>). Mark them
                     up using <gi>front</gi> and <gi>back</gi> as appropriate.</item>
                  <!-- CO: As we would like to propose the use of first editions, we represent the
                     original version. Additionally, I know studies where these non-continuous texts 
                     are analyzed to investigate and question traditional classifications of text genres.  
                  -->
                  <!-- LB : what do you mean by "non-continuous"? -->
                  <!-- BN: From my point of view, liminal matter should not be included in the
               transcription. I think this information is related to the book as physical object,
               but the project is focused on the novel, on the content of the book. At least at the
               moment I think this information is not useful for a distant reading. -->
                  <item>Sort of : do not transcribe them, but indicate that they have been
                     suppressed by using the <gi>gap</gi> element. </item>
               </list></p>
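            <p>As a rough illustration only, the second and third policies might be realised as
               shown here. The <att>type</att> and <att>reason</att> values and the wording of the
                  <gi>desc</gi> are invented for this sketch and are not part of any agreed scheme.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- Policy 2: liminal matter transcribed inside front -->
                  <text>
                     <front>
                        <div type="preface">
                           <head>Preface</head>
                           <p><!-- text of the preface --></p>
                        </div>
                     </front>
                     <body>
                        <div>
                           <head>Chapter 1</head>
                           <p><!-- text of the chapter --></p>
                        </div>
                     </body>
                  </text>
                  <!-- Policy 3: liminal matter suppressed, its absence recorded by gap -->
                  <text>
                     <gap reason="liminal">
                        <desc>preface not transcribed</desc>
                     </gap>
                     <body>
                        <div>
                           <head>Chapter 1</head>
                           <p><!-- text of the chapter --></p>
                        </div>
                     </body>
                  </text>
               </egXML>
            </p>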
            <p>Within the body of a text, major structural divisions (parts, sections, chapters
               etc.) will be captured using the generic <gi>div</gi> element, with attributes
                  <att>type</att>, <att>xml:lang</att>, <att>xml:id</att> and <att>n</att> used as
               further detailed below.</p>
            <p>The names used for hierarchic structural divisions of a novel above the chapter are
               arbitrary, culture-specific, and often inconsistent : in some novels things called
                  <q>part</q> contain things called <q>book</q> and in others the reverse. We
               propose to follow TEI in using a single element (<gi>div</gi>) for every hierarchical
               structural division, down to the level of <q>chapter</q>.</p>
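            <p>For instance, a novel whose parts contain books, which in turn contain chapters,
               would be represented by three levels of nested <gi>div</gi>. The headings in this
               sketch are invented; only the nesting is the point.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <body>
                     <div>
                        <head>Part the First</head>
                        <div>
                           <head>Book I</head>
                           <div>
                              <head>Chapter 1</head>
                              <p><!-- text of the chapter --></p>
                           </div>
                        </div>
                     </div>
                  </body>
               </egXML>
            </p>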
            <p><label>Open Question</label> Is it useful to retain the name used for each level in
               the original source (the type of div) ? <list>
                  <!-- CO: Maybe I just missed this information. We can keep the structuring texts/words of the text as headings.
                     We can mark them with the appropriate element. In this way, we can keep the information without having to define 
                     a fixed list of subdivisions in a novel (you already pointed out that this is not possible). Having marked the existing 
                     headings (and subheadings) they can be included or ignored automatically when processing the document.  -->
                  <!-- LB : I think it would be problematic to just use <head> without using <div>
                     if that's what you are suggesting. I am only asking whether or not we
                     include e.g. @type="chapter" on the top level div -->
                  <!-- BN: I'm not sure. Maybe it could be a good solution the third one:-->
                  <item> Yes: it is easy to keep and may help referencing : use the <att>type</att>
                     attribute to hold the name used for each level of div in the work in
                     question</item>
                  <item>No : this name adds no useful information beyond the level indicated by the
                     XML structure </item>
                  <item>No : it would be more useful to provide an explicit and normalised
                     indication of the hierarchic level for the benefit of non-XML-aware processors
                     (e.g. <code>level1</code>, <code>level2</code> etc.)</item>
               </list></p>
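            <p>To make the difference concrete, the first and third options might be contrasted as
               follows. The values <code>part</code> and <code>chapter</code> stand for whatever
               names a given source uses; they are placeholders, not a proposed list.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- Option 1: type holds the name used in the source -->
                  <div type="part">
                     <div type="chapter">
                        <p><!-- text of the chapter --></p>
                     </div>
                  </div>
                  <!-- Option 3: type holds a normalised hierarchic level -->
                  <div type="level1">
                     <div type="level2">
                        <p><!-- text of the chapter --></p>
                     </div>
                  </div>
               </egXML>
            </p>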
            <p>The (human) language in which a text is expressed is indicated explicitly by the
                  <att>xml:lang</att> attribute, which supplies the two-letter ISO 639-1 code for the
               language concerned. This attribute will always be supplied on the <gi>text</gi>
               element to specify a default, and may also appear on other elements, for example
                  <gi>foreign</gi>, to indicate passages where the language changes. The various
               different languages used in a given text will be itemized in its metadata (see
                  <gi>langUsage</gi> element in the header). </p>
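            <p>A minimal sketch of this arrangement, using an invented English sentence containing
               a French phrase, might look as follows. The first fragment belongs in the header,
               the second in the transcription.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- in the header: every language used in the text -->
                  <langUsage>
                     <language ident="en">English</language>
                     <language ident="fr">French</language>
                  </langUsage>
                  <!-- in the transcription: default on text, change signalled by foreign -->
                  <text xml:lang="en">
                     <body>
                        <div>
                           <p>She murmured <foreign xml:lang="fr">au revoir</foreign> and left
                              the room.</p>
                        </div>
                     </body>
                  </text>
               </egXML>
            </p>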
            <p><label>Open Question</label> Should passages exhibiting regional or dialectal
               variation be specially signalled? <list>
                  <item>No : this is too fine grained and controversial a distinction to be made
                     with reliable consistency </item>
                  <item>Yes : treat this in the same way as any other kind of code switching and
                     define a set of appropriate language codes for the project</item>
                  <!-- CO: From a linguistic point of view, I would like to say yes but detecting dialectal variation 
                     is something which cannot be done automatically. I think, without more explicit guidelines concerning 
                     the detection of foreign material we might end up with confusing analysis. The
                     definition of the element foreign is broad. For example,
                     neoclassic words may or may not count as foreign. Especially in a corpus containing Romance and Germanic languages.  -->
                  <!-- BN: This annotation could be complex. I think it will be better to annotate
dialectal variations during linguistic annotation.-->
                  <!-- LB : maybe we should just use foreign for passages which are signalled
                     specially in some way in the text. e.g. in italics -->
                  <item>Maybe : just use the <gi>distinct</gi> element to indicate the kind of
                     variation concerned</item>
                  <!-- CO: This element may then be applied to other distinct phenomena as well. I
               think this is not the best way. -->
                  <!-- LB: Many TEI elements have multiple uses, so this is not an argument against using
                  <distinct> in my view-->
               </list></p>
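            <p>For comparison, the second and third options might look like this for a single
               sentence. The sentence itself, the private-use language code
                  <code>en-x-dialect</code> and the <att>type</att> value are all invented for
               illustration; no set of project codes has been defined.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- Option 2: treated as code switching, with a project-defined language code -->
                  <p>The old man shook his head. <foreign xml:lang="en-x-dialect">Tha knows nowt
                        about it</foreign>, he said.</p>
                  <!-- Option 3: signalled with distinct -->
                  <p>The old man shook his head. <distinct type="dialect">Tha knows nowt about
                        it</distinct>, he said.</p>
               </egXML>
            </p>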
            <p>A single reference scheme will be defined for the whole corpus, with the following
               components: <list>
                  <item>text identifier : every text will have an identifier consisting of its two
                     letter language code and a three digit serial number, for example
                        <code>FR042</code></item>
                  <item>chapter identifier: each chapter or equivalent will have an identifier
                     concatenating the text identifier and a three digit serial number, for example
                        <code>FR042012</code> is the twelfth chapter of the 42nd French novel. </item>
                  <item>If sub-chapter segmentation (see below) is implemented, then the segments
                     will append a further four digit serial number.</item>
                  <!--  CO: In addition, each other part of the book may also get an identifier too (table of contents etc.).  
               -->
                  <!-- LB: assuming that they are included, yes, they will. Do we really want to
                     include tables of contents? -->
               </list>The identifier will be supplied as the value of an <att>xml:id</att> attribute
               on each <gi>text</gi>, <gi>div</gi> or <gi>s</gi> element as appropriate. Adding this
               identifier is an easily automated task which can be built into the workflow for
               accession to the ELTeC.</p>
            <p>Note that these identifiers will not necessarily correspond with the numbering used
               in a particular source text. In a work where the first twelve chapters are considered
               to form part one, and the next twelve constitute part two, the first chapter of the
               second part will have an identifier ending <code>013</code>, even though it may be
               numbered <code>1</code> in a source text. </p>
            <!-- CO: good idea. -->
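            <p>The reference scheme can be sketched for the two-part novel just described. The
               headings and the sentence are invented. This sketch also assumes that part-level
               divisions carry no identifier of their own, which the scheme above does not settle,
               and the <gi>s</gi> identifier applies only if sub-chapter segmentation is adopted.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <text xml:id="FR042" xml:lang="fr">
                     <body>
                        <div>
                           <head>Première partie</head>
                           <div xml:id="FR042001">
                              <head>I</head>
                              <p><!-- text of the chapter --></p>
                           </div>
                           <!-- chapters FR042002 to FR042012 follow here -->
                        </div>
                        <div>
                           <head>Deuxième partie</head>
                           <!-- numbered I in the source, but thirteenth in the serial count -->
                           <div xml:id="FR042013">
                              <head>I</head>
                              <p><s xml:id="FR0420130001">Il pleuvait.</s></p>
                           </div>
                        </div>
                     </body>
                  </text>
               </egXML>
            </p>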
            <p><label>Open Question</label> Is it important to preserve the original numbering,
               particularly for deeply structured texts? <list>
                  <item>Yes : the original numbering is widely used to reference the text: it should
                     be supplied using the <att>n</att> attribute on the <gi>div</gi>.</item>
                  <!-- CO: we may use here the "head" element as the numbering of a chapter may be analysed as a head, maybe next to other heads. See comment above-->
                  <!-- BN: I'm not sure. Sometimes scholars refer to specific passages by original numbering ("The chapter two of Don Quixote bla bla bla"). In this case
this information is necessary. -->
                  <!-- LB: head is not the same as @n : I agree with Borja -->
                  <item>No : the original numbering and referencing scheme are of no use in our
                     intended applications, introduce unnecessary complexity, and may be a source of
                     confusion. </item>
               </list></p>
            <p>The chapters of a novel mostly consist of prose, arranged in paragraphs, for which we
               will use the TEI <gi>p</gi> element. It is not unusual to find other structures
               however, specifically verse, or passages of dialogue presented as if in a play, with
               speaker labels and even stage directions. Less frequently, novels may contain
               material presented in list or tabular formats. Graphics with their own associated
               heading or other text are also frequent. </p>
            <p><label>Open Question</label> How should material other than running prose and
               dialogue be encoded? <list type="ordered">
                  <item>Use the appropriate TEI elements for verse or drama (<gi>lg</gi>,
                     <gi>l</gi>, <gi>sp</gi>, <gi>stage</gi>)</item>
                  <!-- CO: These elements are far more specific concerning text encoding. For example: Why do we need information about lines in the corpus (in the prose text, we haven't)?  -->
                  <!-- LB: because lines of poetry are structural units, but lines of prose are
                     typographic artefacts -->
                  <item>Use the appropriate TEI elements for lists and tables (<gi>list</gi>,
                        <gi>label</gi>, <gi>item</gi>, <gi>table</gi>, <gi>cell</gi>,
                     <gi>row</gi>)</item>
                  <!-- CO: I think this might help to distinguish between different types of text material -->
                  <!-- LB : yes, but which ones? all these? -->
                  <item>Use the appropriate TEI elements for embedded graphics (<gi>figure</gi>,
                        <gi>graphic</gi>, <gi>head</gi>)</item>
                  <!-- CO: This might be applied, if we good texts which are figure heads. I wonder whether we need the information that there 'was' figure when there is no trace of it and no head-->
                  <!-- LB:  gap prevents you from making stupid mistakes e.g. when looking for
                     collocates -->
                  <item>Suppress all non-prose material, replacing it by <gi>gap</gi></item>
               </list></p>
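            <p>As an illustration of the first and last options, here is an invented passage in
               which a character reads two lines of verse, encoded first with <gi>lg</gi> and
                  <gi>l</gi> and then with the verse suppressed. The <att>reason</att> value and
               the wording of the <gi>desc</gi> are assumptions made for this sketch.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- Option 1: verse retained -->
                  <div>
                     <p>He took up the paper and read aloud:</p>
                     <lg>
                        <l>First invented line of verse,</l>
                        <l>second invented line of verse.</l>
                     </lg>
                  </div>
                  <!-- Option 4: verse suppressed and replaced by gap -->
                  <div>
                     <p>He took up the paper and read aloud:</p>
                     <gap reason="verse">
                        <desc>two lines of verse not transcribed</desc>
                     </gap>
                  </div>
               </egXML>
            </p>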
            <p>Novels are also full of direct speech, represented using various different
               conventions, but almost always distinguished from the narrative voice. First-person
               narrative is also common, but may be regarded as a special case.
               <!-- CO: Narrative voice might be good to have for the metadata! -->How exactly
               different narrative strands are articulated in a novel, and the extent to which they
               may be characterised by their lexis has been a preoccupation of many <q>distant
                  reading</q> style analyses. It might therefore be helpful to distinguish material
               purporting to be direct speech from material purporting to be narrative in our basic
               encoding, though to do so consistently and accurately may occasionally be
               problematic.</p>
            <!-- CO: Do you know how direct speech was identified in this analysis? -->
            <!-- LB: probably by hand, but that's not the issue -->
            <p><label>Open Question</label> Should passages presented as direct speech in a novel be
               distinguished from passages presented as narrative? <list>
                  <item>Yes : use <gi>q</gi> and avoid nesting problems by always nesting it within
                        <gi>p</gi></item>
                  <item>Yes : use a <gi>milestone</gi> to mark the beginning and end of each passage
                     of direct speech</item>
                  <item>Sort of : provide an attribute on <gi>p</gi> to indicate whether or not the
                     paragraph contains direct speech</item>
                  <!-- CO:  I don't know whether the kind of suggestion annotation help to process the corpus. 
                     Either the whole paragraph needs to be excluded from analysis or you include the paragraph 
                     and you know that the paragraph contain  some speech text. I think, this is not a big benefit.    -->
                  <item>No : rely on (or normalise) typographic conventions such as quote marks or
                     dashes to distinguish direct speech only. </item>
                  <!-- CO: At the moment, I would prefer this solution. -->
                  <!-- LB: it's certainly the easiest! tho normalising punctuation marks may be
                     problematic-->
               </list>
            </p>
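            <p>Three of these options can be compared on one invented sentence. The
                  <att>unit</att> values on <gi>milestone</gi> are hypothetical, since no values
               have been agreed. The option of an attribute on <gi>p</gi> is not shown because no
               attribute name has yet been proposed for it.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- q nested within p -->
                  <p><q>Where have you been?</q> she asked.</p>
                  <!-- milestones marking where speech begins and where narrative resumes -->
                  <p><milestone unit="speech"/>Where have you been?<milestone unit="narrative"/>
                     she asked.</p>
                  <!-- no markup: quotation marks alone distinguish the speech -->
                  <p>“Where have you been?” she asked.</p>
               </egXML>
            </p>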
            <p>Printed texts typically deploy a number of conventions which can cause problems for
               linguistic analyses of even the most basic kind. Changes of font or style
               (italicization or use of superscript, for example) can have particular lexical
               significance which should be taken into account. End-of-line hyphenation can make it
               harder to identify the exact form of a token. Non-standard (i.e. non-modern)
               spellings can mislead parsers. Our proposed encoding aims above all for consistency
               and transparency in what is reliably achievable, leaving more difficult and
               problematic issues to be addressed by linguistic annotations. </p>
            <p>We do not preserve the lineation of running prose in our source texts, since this is
               always purely an artefact of the source edition. For the same reason we will
               reassemble words broken across a line break, silently removing any hyphen present.
               (This will make it impossible to use our texts for hyphenation studies. So be it.) </p>
            <!-- CO: This contradicts the idea of encoding the first edition in a philological way. -->
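            <p>The treatment of lineation and end-of-line hyphenation described above amounts to
               the following, shown on an invented sentence. The comment shows the source as
               printed; the <gi>p</gi> shows the resulting transcription.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- Source as printed, with a word broken across two lines:
                          It was impos-
                          sible to say.
                  -->
                  <p>It was impossible to say.</p>
               </egXML>
            </p>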
            <p><label>Open Question</label> Should page breaks in the source text be preserved? <list>
                  <item>Yes : this is useful information (e.g. to determine words-per-page, or to
                     anchor links to an image of the source text) which is usually available at
                     no-cost in existing digital texts</item>
                  <!-- CO: This might help. -->
                  <!-- BN: From my point of view this information is not necessary. As I said
before, it is related to the book as physical object, not to the novel itself.-->
                  <item>No : the proposed uses don't justify the cost of providing the information
                     if it is missing. And pagination is inherently copy-specific.</item>
               </list></p>
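            <p>If page breaks are kept, one plausible encoding is sketched here. This draft does
               not name an element for the purpose; the sketch assumes the TEI <gi>pb</gi>
               element, placed where the new page begins, with the page number of the source
               (invented here) in its <att>n</att> attribute.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <p>They walked on to the end of the street. <pb n="43"/>The house stood a little
                     apart from its neighbours.</p>
               </egXML>
            </p>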
            <p>Font and style variations in the source text usually signal something. Italics may
               signal emphasis, quotation, foreign language terms etc. Superscripts almost always
               signal abbreviation. The visual salience of these variations is of considerably less
               interest to distant readers than the intended function they signal. However, it is
               not always easy to determine that function reliably and consistently by algorithm.
               Some simple cases could however be addressed. A possible strategy is outlined below.
               It assumes the existence of a digital version of the text in which visual features
               are explicit, whether by means of TEI-style markup or styling information such as
               that provided by Word. <list>
                  <item>if possible, replace indications of highlighting by an appropriate TEI
                     element, chosen from the following list : <gi>foreign</gi>, <gi>title</gi>,
                        <gi>emph</gi></item>
                  <!-- CO: Why would title and foreign be good elements for the task?  Emph refers to linguistic or rhetorical effect -->
                  <!-- LB: because titles and foreign passages are often represented by
                     highlighting -->
                  <item>otherwise, replace all indications of highlighting by the TEI <gi>hi</gi>
                     element</item>
                  <!-- CO: this might be a good way of encoding. hi can get a @rend for determining bold, underlined etc.?! -->
                  <!-- LB: indeed it can, but why would we want to use that? -->
                  <item>indications of superscript characters (such as French
                        <soCalled>14&#x1d49;</soCalled>) should be removed. Instead, the TEI element
                        <gi>abbr</gi> should be used to indicate the presence of an abbreviated
                     word: <code>&lt;abbr>14e&lt;/abbr></code></item>
               </list></p>
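            <p>Applied to a handful of invented sentences, each of which is assumed to contain one
               italicised or superscripted span in its source, this strategy might produce the
               following. The sentences and the choice of element in each case are illustrative
               only.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- italics recognised as a title -->
                  <p>He had read <title>Waverley</title> twice.</p>
                  <!-- italics recognised as a foreign phrase -->
                  <p>It was a real <foreign xml:lang="fr">faux pas</foreign>.</p>
                  <!-- italics recognised as emphasis -->
                  <p>I <emph>will</emph> go.</p>
                  <!-- function of the highlighting not determined -->
                  <p>The sign said <hi>Closed</hi>.</p>
                  <!-- superscript removed, abbreviation marked -->
                  <p>Au <abbr>14e</abbr> siècle, la ville était plus petite.</p>
               </egXML>
            </p>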
            <p><label>Open Question</label> Is it feasible or useful to recode highlighted spans of
               text in this way? <!-- BN: I don't know. I'm not sure if it is worth annotating this information. It
seems that it is not really useful for a distant reading.-->
               <!-- LB : I think I am coming down on Borja's side here: it's a LOT of work to do it
                  properly, and most distant readers won't want to use it -->
               <list>
                  <item>Yes : in many cases this can be an automatic process and the results justify
                     investing the effort </item>
                  <item>No : there are likely to be too many borderline or debatable cases to do
                     this automatically so this would have to be done as part of a major proof
                     reading exercise</item>
               </list></p>
            <p>Whichever solution is adopted, it should be applied uniformly across the ELTeC. A
               collection in which some texts make distinctions ignored by others is
               unsatisfactory.</p>
         </div>
         <div>
            <head>TEI Elements used</head>
            <p>This section will provide a checklist of TEI elements used in the body of each ELTeC
               text, with descriptions and examples of their intended applications. </p>
         </div>
         <div>
            <head>Metadata in the TEI Header</head>
            <!-- BN: About the metadata:
Licences (creative common) will be included in metadata, isn't it?
I think it could be useful to link author names with WikiData, VIAF, ISNI
or similar linked open data resources. The ID of each author in these
resources could be included in the metadata.
WikiData:
https://www.wikidata.org/wiki/Wikidata:Main_Page
VIAF (Virtual International Authority File):
https://www.oclc.org/en/viaf.html
http://viaf.org/
ISNI (International Standard Name Identifier (ISO 27729))
http://www.isni.org/
http://isni.oclc.org/-->
            <!-- LB : agreed. Since wikidata includes the others, should we use that for
               preference? -->
            <p>This section describes the metadata associated with each text (title, authorship,
               date etc.) and with the collection as a whole. The intention is to provide this in a
               standardised way to facilitate subsetting of the collection, using (for example)
               coded values for the descriptive selection criteria associated with the text. As far
               as possible, our text should represent the first complete printed edition of each
               novel selected. </p>
            <p>The TEI Header provides a very large number of possibilities for encoding such
               metadata. We will provide a checklist of the TEI Header elements which are always to
               be provided for each text, possibly in the form of a template. As in the body of the
               text, the intention is to provide a guaranteed minimal level of information,
               consistent across all parts of the ELTeC. </p>
            <!-- CO: maybe you can include the metadata in the discussion paper of the selection criteria? -->
            <p>Note that metadata may be supplied at (at least) two levels: the level of the ELTeC
               as a whole, and that of individual texts within it. Information which applies
               uniformly to all parts of the collection should be supplied in the ELTeC header;
               information specific to a particular document in the text header. </p>
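            <p>One way of arranging these two levels is sketched here, on the assumption that the
               collection is packaged as a single TEI corpus (a choice this draft does not yet
               make). The <gi>teiCorpus</gi> element carries the ELTeC header, and each
                  <gi>TEI</gi> element inside it carries its own text header.
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <teiCorpus>
                     <teiHeader>
                        <!-- ELTeC header: information common to the whole collection -->
                     </teiHeader>
                     <TEI>
                        <teiHeader>
                           <!-- text header: information specific to this novel -->
                        </teiHeader>
                        <text>
                           <!-- transcription of the novel -->
                        </text>
                     </TEI>
                     <!-- further TEI elements, one per novel -->
                  </teiCorpus>
               </egXML>
            </p>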
         </div>
         <div>
            <head>Text-level metadata</head>
            <p>Here is an example template for an individual text header
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <teiHeader type="novelHeader">
                     <fileDesc>
                        <titleStmt>
                           <title><!-- standard title of work -->
                           </title>
                           <author>
                              <!-- information about the author -->
                           </author>
                        </titleStmt>
                        <extent>
                           <!-- size of the text, in pages and words -->
                        </extent>
                        <publicationStmt>
                           <!-- boilerplate statement about status as part of ELTeC -->
                        </publicationStmt>
                        <sourceDesc>
                           <bibl>
                              <!-- bibliographic description of the printed source -->
                           </bibl>
                        </sourceDesc>
                     </fileDesc>
                     <profileDesc>
                        <!-- additional descriptive information -->
                     </profileDesc>
                     <revisionDesc>
                        <!-- revision information -->
                     </revisionDesc>
                  </teiHeader>
               </egXML>
            </p>
            <p>Within the <gi>teiHeader</gi>, a <gi>fileDesc</gi>, a <gi>profileDesc</gi>, and a
                  <gi>revisionDesc</gi> are all required. The <gi>encodingDesc</gi> may be supplied
               in the (hopefully unlikely) event that some aspect of this document's encoding is
               anomalous. </p>
            <div>
               <head>Components of the file description</head>
               <p>The <gi>fileDesc</gi> contains the following mandatory elements: <specList>
                     <specDesc key="titleStmt"/>
                     <specDesc key="extent"/>
                     <specDesc key="publicationStmt"/>
                     <specDesc key="sourceDesc"/>
                  </specList>
               </p>
               <p> Taking these in turn, the <gi>titleStmt</gi> contains the title, author, and
                  encoder of the document. For novels with multiple authors, titles, or encoders the
                  element concerned is simply repeated. The <gi>title</gi> should be taken from an
                  authoritative bibliographic source, and should include a phrase such as
                     <soCalled>ELTeC edition</soCalled>. The <gi>author</gi> may contain one or more
                  of the following descriptive elements: <specList>
                     <specDesc key="persName"/>
                     <specDesc key="forename"/>
                     <specDesc key="surname"/>
                     <specDesc key="birth"/>
                     <specDesc key="death"/>
                     <specDesc key="affiliation" atts="type"/>
                     <specDesc key="sex" atts="value"/>
                     <specDesc key="idno" atts="type"/>
                  </specList>
               </p>
               <p>In addition to one or more <gi>author</gi> elements, a <gi>titleStmt</gi> should
                  contain at least one <gi>respStmt</gi> element indicating the person responsible
                  for the ELTeC encoded version, using the following elements: <specList>
                     <specDesc key="resp"/>
                     <specDesc key="respStmt"/>
                     <specDesc key="name"/>
                  </specList></p>
               <p>Here is an example:
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <titleStmt>
                        <title>Howards End : ELTeC edition</title>
                        <author>
                           <persName>
                              <forename>Edward</forename>
                              <forename>Morgan</forename>
                              <surname>Forster</surname>
                           </persName>
                           <persName>E.M. Forster</persName>
                           <birth when="1879"/>
                           <death when="1970"/>
                           <sex value="M"/>
                           <idno type="viaf">https://viaf.org/viaf/31996364</idno>
                           <idno type="wiki">https://www.wikidata.org/wiki/Q189119</idno>
                        </author>
                        <respStmt>
                           <resp>ELTeC encoding</resp>
                           <name>Lou Burnard</name>
                        </respStmt>
                     </titleStmt>
                  </egXML>
               </p>
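               <p>Where a novel has more than one author, the <gi>author</gi> element is simply
                  repeated, as noted above. The following sketch illustrates this for <title>The
                     Diary of a Nobody</title>, written jointly by George and Weedon Grossmith. The
                  choice of work is an assumption made purely for illustration, and the encoder's
                  name is left as a placeholder.</p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <titleStmt>
                     <title>The Diary of a Nobody : ELTeC edition</title>
                     <!-- one author element per author, each with its own description -->
                     <author>
                        <persName>George Grossmith</persName>
                        <sex value="M"/>
                     </author>
                     <author>
                        <persName>Weedon Grossmith</persName>
                        <sex value="M"/>
                     </author>
                     <respStmt>
                        <resp>ELTeC encoding</resp>
                        <name><!-- name of the encoder (placeholder) --></name>
                     </respStmt>
                  </titleStmt>
               </egXML>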
               <p> The <gi>extent</gi> provides information about the size of the document, given by
                  means of the following elements: <specList>
                     <specDesc key="extent"/>
                     <specDesc key="measure" atts="unit quantity"/>
                  </specList> Exactly which measurements will be most useful and easily incorporated
                  is yet to be determined: probably a count of words and pages will suffice. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples"><extent>
                     <measure unit="words" quantity="20010"/>
                     <measure unit="pages" quantity="245"/>
                  </extent>
               </egXML>
               <p>The <gi>publicationStmt</gi> is required for TEI conformance: in individual text
                  headers it will contain some standard boilerplate text referring to the fuller
                  statement which will be furnished by the collection-level header. <!--<specList>
                  <specDesc key="idno"/>
                  <specDesc key="pubPlace"/>
                  <specDesc key="publisher"/>
                  <specDesc key="date" atts="when"/>
                  <specDesc key="biblScope"/>
               </specList>-->
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <publicationStmt>
                        <p>Incorporated into the ELTeC <date>2018-02-12</date></p>
                     </publicationStmt>
                  </egXML>
               </p>
               <p>The <gi>sourceDesc</gi> element is also required for TEI conformance. It will
                  contain a bibliographic description of the source text against which the digital
                  text has been validated, typically the first published edition of the work
                  concerned. Where the ELTeC version derives from a pre-existing digital version of
                  this work, a reference to that source will also be provided. The following
                  elements are used to record this information: <specList>
                     <specDesc key="bibl"/>
                     <specDesc key="title"/>
                     <specDesc key="author"/>
                     <specDesc key="publisher"/>
                     <specDesc key="pubPlace"/>
                     <specDesc key="ref"/>
                  </specList>
               </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <sourceDesc>
                     <bibl>
                        <author>E.M. Forster</author>
                        <title>Howards End</title>
                        <pubPlace>London</pubPlace>
                        <publisher>Edward Arnold</publisher>
                        <date>1910</date>
                        <idno type="wiki">https://www.wikidata.org/wiki/Q1146642</idno>
                     </bibl>
                     <bibl>
                        <title>The Project Gutenberg Etext of Howards End, by E. M. Forster</title>
                        <ref target="http://www.gutenberg.org/files/2891/2891-h/2891-h.htm">HTML
                           version downloaded on <date>2017-12-26</date></ref>
                     </bibl>
                     <note type="editions" source="worldcat"> WorldCat lists 484 print editions in
                        English</note>
                  </sourceDesc>
               </egXML>
            </div>
            <div>
               <head>Components of the profile description</head>
               <p>The <gi>profileDesc</gi> of an ELTeC text has the following mandatory components: <specList>
                     <specDesc key="langUsage"/>
                     <specDesc key="textClass"/>
                  </specList></p>
               <p>The <gi>langUsage</gi> element contains one or more <gi>language</gi> elements,
                  one for each language, dialect, sublanguage etc. explicitly identified in the body
                  of the text, indicating roughly how much of the text uses this language. For
                  example, a text which is almost entirely in British English, but also contains
                  some parts in US English would have an entry like this: </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <langUsage>
                     <language ident="en-GB" usage="90">British English</language>
                     <language ident="en-US" usage="10">North American English</language>
                  </langUsage></egXML>
               <p>The TEI <gi>textClass</gi> element can contain one or more of the following
                  elements: <specList>
                     <specDesc key="catRef"/>
                     <specDesc key="classCode"/>
                     <specDesc key="keywords" atts="source"/>
                     <specDesc key="term"/>
                  </specList> The three classification methods (<gi>catRef</gi>,
                  <gi>classCode</gi>, and <gi>keywords</gi> with its <gi>term</gi> children) can be
                  used in parallel. It is an <label>open question</label> which of them we should
                  use for the ELTeC collection: the schema proposed here permits any combination. </p>
               <p>The <gi>keywords</gi> option allows us to supply one or more <gi>term</gi>
                  elements to categorise a text in some way. If the values are taken from a known
                  closed list or authority file, that file should be specified using the
                     <att>source</att> attribute. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <textClass>
                     <keywords source="http://wikidata.org">
                        <term>social class</term>
                        <term>social convention</term>
                        <term>modernity</term>
                        <term>family drama</term>
                     </keywords>
                  </textClass>
               </egXML>
               <p><label>Open Question</label>: should we invent our own taxonomy, use a
                  pre-existing one, or make no attempt to constrain or predefine the terms used here?</p>
               <p>The <gi>classCode</gi> option allows us to use classification codes used or
                  defined by existing authorities, such as library catalogue schemes, while the
                     <gi>catRef</gi> option allows us to specify such codes using our own
                  classification scheme. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <catRef target="#author_m #reprint_3"/>
                  <classCode scheme="UDC">8231.111</classCode>
               </egXML>
               <p>Since our selection and descriptive criteria are likely to be specific to the
                  project, we will probably have to define them in the corpus header using the
                  following elements: <specList>
                     <specDesc key="taxonomy"/>
                     <specDesc key="category"/>
                     <specDesc key="catDesc"/>
                  </specList></p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <taxonomy>
                     <category xml:id="author_m"><catDesc>male authorship</catDesc></category>
                     <category xml:id="author_f"><catDesc>female authorship</catDesc></category>
                     <category xml:id="author_u"><catDesc>author gender unknown</catDesc></category>
                     <category xml:id="reprint_0"><catDesc>no reprints found</catDesc></category>
                     <category xml:id="reprint_1"><catDesc>1 to 50 reprints</catDesc></category>
                     <category xml:id="reprint_2"><catDesc>51 to 100 reprints</catDesc></category>
                     <category xml:id="reprint_3"><catDesc>over 100 reprints</catDesc></category>
                  </taxonomy>
               </egXML>
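               <p>Putting these components together, the <gi>profileDesc</gi> of a single text might
                  look something like the following sketch. It reuses the category identifiers and
                  keywords from the examples above; the language proportion shown is an assumed
                  value for illustration, not a measurement of any actual text.</p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <profileDesc>
                     <langUsage>
                        <!-- assumed: the whole text is in British English -->
                        <language ident="en-GB" usage="100">British English</language>
                     </langUsage>
                     <textClass>
                        <!-- pointers to categories defined in the corpus header taxonomy -->
                        <catRef target="#author_m #reprint_3"/>
                        <keywords source="http://wikidata.org">
                           <term>social class</term>
                           <term>modernity</term>
                        </keywords>
                     </textClass>
                  </profileDesc>
               </egXML>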
               <!-- LB: some examples are needed here. Check recommendations of TEI in Libraries -->
            </div>
            <div>
               <head>Components of the revision description</head>
               <p>The <gi>revisionDesc</gi> element is used to document significant points in the
                  version history of the document. At least one entry should be provided for an
                  ELTeC document, specifying when it was first added to the collection. The
                  following elements can be used: <specList>
                     <specDesc key="revisionDesc"/>
                     <specDesc key="change" atts="when who"/>
                  </specList>
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <revisionDesc>
                        <change when="2018-02-21" who="ELTeC:LB">Added new linguistic
                           classifications</change>
                        <change when="2018-01-29" who="ELTeC:LB">Added to the ELTeC</change>
                     </revisionDesc></egXML>
               </p>
            </div>
            <div>
               <head>Encoding description</head>
               <p>The TEI allows for the specification of encoding practice, by which is meant
                  documentation of the specific editorial policies followed during transcription
                  (treatment of printed hyphens, lexical normalisation, sampling procedures,
                  features included, ignored, or normalised, etc.). Such specification may be
                  supplied at the individual document level, or once for all across the whole of a
                  corpus. It is even possible to specify that different parts of a document follow
                  different policies, provided that all the available policies are defined
                  somewhere. </p>
               <p><label>Open Question</label>: we propose, as far as possible, not to allow for any
                  variation in encoding policies applied within the ELTeC. We will still need to
                  determine our encoding policies, of course, and to document them appropriately in
                  the ELTeC corpus header, but there should be no need for separate specifications
                  at the document level. </p>
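               <p>Should a document-level statement nevertheless prove unavoidable, it could be as
                  simple as the following sketch. This assumes that plain prose in a <gi>p</gi> is
                  sufficient, since the schema proposed below includes <gi>encodingDesc</gi> but none
                  of its more specialised child elements; the anomaly described is invented for
                  illustration.</p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <encodingDesc>
                     <!-- hypothetical anomaly, shown only to illustrate the form of the note -->
                     <p>The digital source used for this text does not record the page breaks of
                        the printed edition, so no page breaks are encoded in this document.</p>
                  </encodingDesc>
               </egXML>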
            </div>
         </div>
         <div>
            <head>Linguistic and semantic annotation</head>
            <p> Later stages of the project will need to use additional markup facilities to
               represent more sophisticated annotations, which may be motivated linguistically (for
               example, to provide a normalised form, part of speech, etc.) or semantically (for
               example, to distinguish proper names, names of people, places, events, etc.). These
               will form an additional layer, not discussed here: the principle should however be
               that the base text we provide is always available in a uniform encoding.</p>
         </div>
         <div>
            <head>Formal specification</head>
            <p>The ELTeC encoding scheme defined by this document is a TEI-conformant customization,
               from which user documentation and formal RELAX NG or DTD specifications can be
               generated automatically. This section contains the schema specification itself.</p>
            <schemaSpec ident="ELTeC" start="TEI">
               <moduleRef key="tei"/>
               <moduleRef key="core"
                  include="author bibl date head item l label lg list measure
                  name p pb publisher pubPlace quote ref resp respStmt sp stage hi term title"/>
               <moduleRef key="textstructure" include="TEI text div body front back"/>
               <moduleRef key="header"
                  include="catDesc category catRef change classCode  encodingDesc extent fileDesc idno  keywords langUsage language  
                  profileDesc publicationStmt revisionDesc sourceDesc taxonomy teiHeader textClass titleStmt"/>
               <moduleRef key="namesdates"
                  include="affiliation forename  persName  surname"/>
               <!-- Class modifications to reduce attribute clutter -->
               <!-- first remove classes which provide attributes we don't want -->
               <classSpec ident="att.declaring" type="atts" mode="delete"/>
               <classSpec ident="att.declarable" type="atts" mode="delete"/>
               <classSpec ident="att.personal" type="atts" mode="delete"/>
               <classSpec ident="att.ranging" type="atts" mode="delete"/>
               <classSpec ident="att.written" type="atts" mode="delete"/>
               <classSpec ident="att.breaking" type="atts" mode="delete"/>
               <classSpec ident="att.datable.iso" type="atts" mode="delete"/>
               <classSpec ident="att.datable.custom" type="atts" mode="delete"/>
               <classSpec ident="att.divLike" type="atts" mode="delete"/>
               <classSpec ident="att.editLike" type="atts" mode="delete"/>
               <classSpec type="atts" ident="att.global.responsibility" mode="delete"/>
               <classSpec ident="att.naming" type="atts" mode="delete"/>
               <!-- next modify classes which provide some attributes we want and some we don't -->
               <classSpec type="atts" ident="att.global.rendition" mode="change">
                  <attList>
                     <attDef ident="rendition" mode="delete"/>
                     <attDef ident="style" mode="delete"/>
                  </attList>
               </classSpec>
               <classSpec type="atts" ident="att.typed" mode="change">
                  <attList>
                     <attDef ident="subtype" mode="delete"/>
                  </attList>
               </classSpec>
               <classSpec type="atts" ident="att.dimensions" mode="change">
                  <attList>
                     <attDef ident="precision" mode="delete"/>
                     <attDef ident="scope" mode="delete"/>
                  </attList>
               </classSpec>
               <classSpec type="atts" ident="att.datable" mode="change">
                  <attList>
                     <attDef ident="calendar" mode="delete"/>
                     <attDef ident="period" mode="delete"/>
                  </attList>
               </classSpec>
               <classSpec type="atts" ident="att.pointing" mode="change">
                  <attList>
                     <attDef ident="targetLang" mode="delete"/>
                     <attDef ident="evaluate" mode="delete"/>
                  </attList>
               </classSpec>
               
               <!-- now tweaks for individual elements -->
               
               <!-- make xml:id and xml:lang obligatory for text -->
               <elementSpec ident="text" mode="change">
                  <attList>
                     <attDef ident="xml:id" mode="change" usage="req"/>
                     <attDef ident="xml:lang" mode="change" usage="req"/>
                  </attList>
               </elementSpec>
               
             <!-- define two new attributes for author -->
            <elementSpec ident="author" mode="change">
               <attList>
                  <attDef ident="dates" mode="add">
                     <desc xml:lang="en">supplies the years of birth and death of an author</desc>
                     <datatype minOccurs="2" maxOccurs="2">
                        <dataRef name="gYear"/>
                     </datatype>
                     <!-- any pair of 4 digit values will do here; we could
                        add schematron to check that it is in the
                        right range and that the second is after the first -->
                     
                  </attDef><attDef ident="sex" mode="add">
                     <desc xml:lang="en">Specifies the sexual identification usually associated with the author</desc>
                     <valList type="closed">
                        <valItem ident="m"><desc>male</desc></valItem>
                        <valItem ident="f"><desc>female</desc></valItem>
                        <valItem ident="u"><desc>unknown or inapplicable</desc></valItem>                        
                     </valList>
                  </attDef> 
               </attList>
            </elementSpec>
            </schemaSpec>
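            <p>The comment on the <att>dates</att> attribute above notes that Schematron could be
               used to check that its two values are plausible and correctly ordered. One possible
               form for such a rule is sketched below. It is not yet part of the ELTeC schema, and
               it rests on several assumptions: that the ODD processor accepts a
                  <gi>constraintSpec</gi> inside the <gi>attDef</gi> for <att>dates</att>, that the
               prefix <ident>tei</ident> is bound to the TEI namespace, and that the XPath 2.0
               query binding is available. The identifier <ident>author-dates-order</ident> is
               hypothetical, and no check on the permitted range of years is attempted.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <constraintSpec ident="author-dates-order" scheme="schematron">
                  <constraint>
                     <sch:rule xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                        context="tei:author[@dates]">
                        <!-- two four-digit years separated by a space, the first earlier than the second -->
                        <sch:assert
                           test="matches(normalize-space(@dates), '^\d{4} \d{4}$')
                                 and number(substring-before(normalize-space(@dates), ' '))
                                  lt number(substring-after(normalize-space(@dates), ' '))"
                           >The dates attribute should supply a four-digit year of birth followed
                           by a later four-digit year of death.</sch:assert>
                     </sch:rule>
                  </constraint>
               </constraintSpec>
            </egXML>
            <p>Under such a rule, an <gi>author</gi> element using the two attributes defined
               above would be accepted in the following form, whereas the same element with the two
               years reversed would be reported as an error:</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <author dates="1879 1970" sex="m">E.M. Forster</author>
            </egXML>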
         </div>
      </body>
      <back xml:id="sources">
         <p>Sources consulted</p>
         <listBibl>
            <bibl>An introduction to TEI Simple Print <idno type="URI"
                  >http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_simplePrint.doc.html</idno></bibl>
            <bibl>Burnard, Lou. <date>2005</date>.
               <title level="a">Metadata for corpus work</title>, in <title>Developing Linguistic
                  Corpora: A guide to good practice</title>, ed. Martin Wynne. Oxford: Oxbow Books,
               pp. 30-46. <!--<ref target="2005-metadata.xml">XML source</ref>--></bibl>
            <bibl> Odebrecht, Carolin. (2017). Metadata for Historical Corpora. Realization of the
               Metamodel for Corpus Metadata with the help of TEI Customization [Data set]. Zenodo.
               http://doi.org/10.5281/zenodo.267999</bibl>
            <bibl>
               <idno type="URI">github.com/cligs/textbox</idno>
            </bibl>
         </listBibl>
      </back>
   </text>
</TEI>