<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Encoding Guidelines for the ELTeC</title>
<author>COST Action CA16204 – WG1</author>
</titleStmt>
<publicationStmt>
<p>Unpublished draft for discussion</p>
</publicationStmt>
<sourceDesc>
<p>A born digital document drafted in TEI format by LB</p>
</sourceDesc>
</fileDesc>
<revisionDesc>
<change when="2018-01-17">Expanded metadata section a bit; added comments from CO and
BN</change>
<change when="2017-12-17">First (partial) discussion draft</change>
</revisionDesc>
</teiHeader>
<text>
<body>
<p>This is the first draft of a reference document defining the encoding scheme to be used
for the European Literary Text Collection (ELTeC) which will be a major deliverable of
COST Action CA16204, <title>Distant Reading</title>. As a first draft it necessarily
contains much uncertainty, either because policies have to be determined, or because
issues are currently under-specified. Particular topics on which policy remains to be
defined are signalled below with the label <q>Open Question</q>. Input from other
participants in the Action about these topics would be particularly useful. </p>
<div>
<head>Principles</head>
<p>The MoU for the project points out that <q>Distant Reading methods cover a wide range
of computational methods for literary text analysis, such as authorship
attribution, topic modelling, character network analysis, or stylistic
analysis.</q> The focus of the ELTeC encoding scheme is thus not to represent
texts in all their original complexity of structure or appearance, but rather to
facilitate a richer and better-informed distant reading than a transcription of its
lexical content alone would permit. For example, it seems useful to distinguish
headings and annotations from the rest of the text, and to be able to locate
stretches of text within gross structural features such as chapters and paragraphs.
It is probably also useful to distinguish passages belonging to different narrative
levels (for example, direct speech versus narrative or quotation versus narrative)
and to identify reference points such as page breaks. It is less useful to record
exact nuances of rendition or spelling in a particular version of a text. Our goal is
not to duplicate the work of scholarly editors or to produce (yet another) digital
edition of a specific source document. </p>
<!-- CO: OK. With respect to the selection criteria, we need to consider that we will encode the first edition of each text.
The first edition will reveal spelling variances and variances concerning different text levels, e.g., direct speech
(like any other modern edition will do in some way or the other)-->
<!-- LB : the idea of encoding specifically the first edition wasn't what I thought the
brief was when I started work on this document! Licensing issues-->
<p>In selecting features for inclusion in the markup scheme, we have been guided, but
not limited, by existing practice as far as possible. Particular collections we have
examined are listed in section <ptr target="#sources"/> but the main goal has been to
identify a small core set of essential textual features which can be readily
(preferably automatically) identified in existing digital transcriptions, or easily
and consistently provided by new transcriptions. </p>
<!-- BN : I definitely agree:-->
<p>This document lists all the textual features which are to be distinguished in an
ELTeC-conformant transcription. Whenever a given feature exists in a text, it will be
marked up as indicated here. No other features will be captured by the markup: if
some textual feature not provided for here is identified by a marked up source text,
that markup will be removed. The goal is to ensure that the ELTeC texts can be
processed by very simple-minded (but XML-aware) systems primarily concerned with
lexis and to make life easier for the developers of such systems. </p>
<!-- CO: In addition, our scheme should make sure that we have no variance in the actual encoding.
Especially a fixed sequence of attributes and a fixed hierarchical structure of elements is needed for parsing issues.
For example, allowing diverging xpath may reveal problems for automatically processing ELTeC. -->
<!-- LB : I am assuming we will produce a tightly constrained XML schema to ensure that
the hierarchy of XML elements is simple and predictable. We can't however do anything
to enforce the order in which attributes appear on an XML tag, nor should we. And I
don't know what you mean by "allowing diverging xpath": we can't control the
structure of the documents we are encoding, just represent them! My
assumption is that ELTeC documents will be well formed and valid XML documents: no
more, but no less. What parsing issues are you imagining? -->
<!-- CO: For this period, it may not be a frequent problem, but we also need to think of special character encoding:
a full and extensive transcription guideline is needed.
We should further rely on UTF8 encoding. -->
<!-- LB: Not sure what you mean here. Can you give an example? I am assuming we will use
Unicode and UTF8. If we need anything beyond that we could use TEI <g> I suppose -->
<p>We make no attempt to propose markup for linguistic annotations here. The assumption
is that this will be produced by different annotation systems in different ways,
though with an association between such annotations and the basic lexical structures
represented by the core ELTeC markup. </p>
<!-- CO: These linguistic or any other annotations may be assigned in different formats.
Thus, the tei representation serves as a starting point for some other automatic annotations and analysis. I think, -->
</div>
<div>
<head>Basic Transcription Guidelines</head>
<!-- CO: General thoughts: I think, we should have in mind when the encoding can be done automatically
and when we need man power for encoding. -->
<!-- LB : in what way ? this sounds like a workflow issue -->
<p>The basic unit of the ELTeC corpus is the text of a single novel, represented by a
TEI <gi>text</gi> element. We propose no mechanism (other than metadata) to encode
units larger than a single novel, such as multipart novel series like Proust's
<title>À la recherche du temps perdu</title> or Zola's <title>Les
Rougon-Macquart</title>. </p>
<p><label>Open Question</label> Should we include liminal matter (titlepages, prefaces,
appendixes...) in our transcriptions? The following policies seem possible: <list>
<item>No : these typically belong to a particular edition or version of the text,
and should therefore systematically be excluded</item>
<item>Yes : these often form a significant part of the reader's experience (cf.
the foreword to most editions of <title>David Copperfield</title>). Mark them
up using <gi>front</gi> and <gi>back</gi> as appropriate.</item>
<!-- CO: As we would like to propose the use of first editions, we represent the
original version. Additionally, I know studies where these non-continuous texts
are analyzed to investigate and question traditional classifications of text genres.
-->
<!-- LB : what do you mean by "non-continuous"? -->
<!-- BN: From my point of view, liminal matter should not be included in the
transcription. I think this information is related to the book as physical object,
but the project is focused on the novel, on the content of the book. At least at the
moment I think this information is not useful for a distant reading. -->
<item>Sort of : do not transcribe them, but indicate that they have been
suppressed by using the <gi>gap</gi> element. </item>
</list></p>
<p>Within the body of a text, major structural divisions (parts, sections, chapters
etc.) will be captured using the generic <gi>div</gi> element, with attributes
<att>type</att>, <att>xml:lang</att>, <att>xml:id</att> and <att>n</att> used as
further detailed below.</p>
<p>The names used for hierarchic structural divisions of a novel above the chapter are
arbitrary, culture-specific, and often inconsistent : in some novels things called
<q>part</q> contain things called <q>book</q> and in others the reverse. We
propose to follow TEI in using a single element (<gi>div</gi>) for every hierarchical
structural division, down to the level of <q>chapter</q>.</p>
<p><label>Open Question</label> Is it useful to retain the name used for each level in
the original source (the type of div) ? <list>
<!-- CO: Maybe I just missed this information. We can keep the structuring texts/words of the text as headings.
We can mark them with the appropriate element. In this way, we can keep the information without having to define
a fixed list of subdivisions in a novel (you already pointed out that this is not possible). Having marked the existing
headings (and subheadings) they can be included or ignored automatically when processing the document. -->
<!-- LB : I think it would be problematic to just use <head> without using <div>
if that's what you are suggesting. I am only asking whether or not we
include e.g. @type="chapter" on the top level div -->
<!-- BN: I'm not sure. Maybe it could be a good solution the third one:-->
<item> Yes: it is easy to keep and may help referencing : use the <att>type</att>
attribute to hold the name used for each level of div in the work in
question</item>
<item>No : this name adds no useful information beyond the level indicated by the
XML structure </item>
<item>No : it would be more useful to provide an explicit and normalised
indication of the hierarchic level for the benefit of non-XML-aware processors
(e.g. <code>level1</code>, <code>level2</code> etc.)</item>
</list></p>
<p>The (human) language in which a text is expressed is indicated explicitly by the
<att>xml:lang</att> attribute, which supplies the two-letter ISO 639-1 code for the
language concerned. This attribute will always be supplied on the <gi>text</gi>
element to specify a default, and may also appear on other elements, for example
<gi>foreign</gi>, to indicate passages where the language changes. The various
different languages used in a given text will be itemized in its metadata (see
<gi>langUsage</gi> element in the header). </p>
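<p>As a purely illustrative sketch of this use of <att>xml:lang</att>, a predominantly
English text containing a French phrase might be marked up as follows. The sentence is
invented for the purpose and is not drawn from any ELTeC text. The header of such a text
would then list both languages in its <gi>langUsage</gi> element. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:lang="en">
<body>
<div>
<!-- invented sentence: the default language is set on text, the switch on foreign -->
<p>She smiled, murmured <foreign xml:lang="fr">à bientôt</foreign>, and was gone.</p>
</div>
</body>
</text>
</egXML>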
<p><label>Open question</label> Should passages exhibiting regional or dialectal
variation be specially signalled? <list>
<item>No : this is too fine grained and controversial a distinction to be made
with reliable consistency </item>
<item>Yes : treat this in the same way as any other kind of code switching and
define a set of appropriate language codes for the project</item>
<!-- CO: From a linguistic point of view, I would like to say yes but detecting dialectal variation
is something which cannot be done automatically. I think, without a more explicit guidelines concerning
the detection of foreign material we might end up with confusing analysis. The
definition of the element foreign is broad. For example,
neoclassic words may or may not count as foreign, especially in a corpus containing Romance and Germanic languages. -->
<!-- BN: This annotation could be complex. I think it will be better to annotate
dialectal variations during linguistic annotation.-->
<!-- LB : maybe we should just use foreign for passages which are signalled
specially in some way in the text. e.g. in italics -->
<item>Maybe : just use the <gi>distinct</gi> element to indicate the kind of
variation concerned</item>
<!-- CO: This element may then be applied to other distinct phenomena as well. I
think this is not the best way. -->
<!-- LB: Many TEI elements have multiple uses, so this is not an argument against using
<distinct> in my view-->
</list></p>
<p>A single reference scheme will be defined for the whole corpus, with the following
components: <list>
<item>text identifier : every text will have an identifier consisting of its two
letter language code and a three digit serial number, for example
<code>FR042</code></item>
<item>chapter identifier: each chapter or equivalent will have an identifier
concatenating the text identifier and a three digit serial number, for example
<code>FR042012</code> is the twelfth chapter of the 42nd French novel. </item>
<item>If sub-chapter segmentation (see below) is implemented, then the segments
will append a further four digit serial number.</item>
<!-- CO: In addition, each other part of the book may also get an identifier too (table of contents etc.).
-->
<!-- LB: assuming that they are included, yes, they will. Do we really want to
include tables of contents? -->
</list>The identifier will be supplied as the value of an <att>xml:id</att> attribute
on each <gi>text</gi>, <gi>div</gi> or <gi>s</gi> element as appropriate. Adding this
identifier is an easily automated task which can be built into the workflow for
accession to the ELTeC.</p>
<p>Note that these identifiers will not necessarily correspond with the numbering used
in a particular source text. In a work where the first twelve chapters are considered
to form part one, and the next twelve constitute part two, the first chapter of the
second part will have an identifier ending <code>013</code>, even though it may be
numbered <code>1</code> in a source text. </p>
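<p>A minimal sketch of how these identifiers might appear in a text follows. It assumes,
purely for illustration, that the two parts are themselves represented by <gi>div</gi>
elements carrying no identifier of their own, a point which this draft does not settle.
Most chapters and all chapter content are omitted. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:id="FR042" xml:lang="fr">
<body>
<div>
<!-- part one: chapters FR042001 to FR042011 omitted here -->
<div xml:id="FR042012">
<p><!-- text of the twelfth chapter --></p>
</div>
</div>
<div>
<!-- part two -->
<div xml:id="FR042013">
<p><!-- text of the chapter numbered 1 in the source --></p>
</div>
<!-- remaining chapters omitted -->
</div>
</body>
</text>
</egXML>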
<!-- CO: good idea. -->
<p><label>Open question</label> is it important to preserve the original numbering,
particularly for deeply structured texts? <list>
<item>Yes : the original numbering is widely used to reference the text: it should
be supplied using the <att>n</att> attribute on the <gi>div</gi>.</item>
<!-- CO: we may use here the "head" element as the numbering of a chapter may be analysed as a head, maybe next to other heads. See comment above-->
<!-- BN: I'm not sure. Sometimes scholars refer to specific passages by original numbering ("The chapter two of Don Quixote bla bla bla"). In this case
this information is necessary. -->
<!-- LB: head is not the same as @n : I agree with Borja -->
<item>No : the original numbering and referencing scheme are of no use in our
intended applications, introduce unnecessary complexity, and may be a source of
confusion. </item>
</list></p>
<p>The chapters of a novel mostly consist of prose, arranged in paragraphs, for which we
will use the TEI <gi>p</gi> element. It is not unusual to find other structures
however, specifically verse, or passages of dialogue presented as if in a play, with
speaker labels and even stage directions. Less frequently, novels may contain
material presented in list or tabular formats. Graphics with their own associated
heading or other text are also frequent. </p>
<p><label>Open Question</label> how should material other than running prose and
dialogue be encoded? <list type="ordered">
<item>Use the appropriate TEI elements for verse or drama (<gi>lg</gi>,
<gi>l</gi>, <gi>sp</gi>, <gi>stage</gi>)</item>
<!-- CO: These elements are far more specific concerning text encoding. For example: Why do we need information about lines in the corpus (in the prose text, we haven't)? -->
<!-- LB: because lines of poetry are structural units, but lines of prose are
typographic artefacts -->
<item>Use the appropriate TEI elements for lists and tables (<gi>list</gi>,
<gi>label</gi>, <gi>item</gi>, <gi>table</gi>, <gi>cell</gi>,
<gi>row</gi>)</item>
<!-- CO: I think this might help to distinguish between different types of text material -->
<!-- LB : yes, but which ones? all these? -->
<item>Use the appropriate TEI elements for embedded graphics (<gi>figure</gi>,
<gi>graphic</gi>, <gi>head</gi>)</item>
<!-- CO: This might be applied, if we have texts which are figure heads. I wonder whether we need the information that there 'was' a figure when there is no trace of it and no head-->
<!-- LB: gap prevents you from making stupid mistakes e.g. when looking for
collocates -->
<item>Suppress all non-prose material, replacing it by <gi>gap</gi></item>
</list></p>
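<p>To make the alternatives concrete, the following sketch shows a chapter in which a short
verse passage is first encoded with <gi>lg</gi> and <gi>l</gi>, and then suppressed using
<gi>gap</gi>. The content is represented by placeholders, and the <att>reason</att> value
<code>verse</code> is a hypothetical choice, not a value defined by this proposal. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- first option: the verse passage is retained -->
<div>
<p><!-- running prose --></p>
<lg>
<l><!-- first line of verse --></l>
<l><!-- second line of verse --></l>
</lg>
<p><!-- running prose resumes --></p>
</div>
<!-- last option: the verse passage is suppressed; the reason value is an assumption -->
<div>
<p><!-- running prose --></p>
<gap reason="verse"/>
<p><!-- running prose resumes --></p>
</div>
</egXML>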
<p>Novels are also full of direct speech, represented using various different
conventions, but almost always distinguished from the narrative voice. The first
person narrative is also common, but may be regarded as a special case.
<!-- CO: Narrative voice might be good to have for the metadata! -->How exactly
different narrative strands are articulated in a novel, and the extent to which they
may be characterised by their lexis has been a preoccupation of many <q>distant
reading</q> style analyses. It might therefore be helpful to distinguish material
purporting to be direct speech from material purporting to be narrative in our basic
encoding, though to do so consistently and accurately may occasionally be
problematic.</p>
<!-- CO: Do you know how direct speech was identified in this analysis? -->
<!-- LB: probably by hand, but that's not the issue -->
<p><label>Open Question</label> Should passages presented as direct speech in a novel be
distinguished from passages presented as narrative? <list>
<item>Yes : use <gi>q</gi> and avoid nesting problems by always nesting it within
<gi>p</gi></item>
<item>Yes : use a <gi>milestone</gi> to mark the beginning and end of each passage
of direct speech</item>
<item>Sort of : provide an attribute on <gi>p</gi> to indicate whether or not the
paragraph contains direct speech</item>
<!-- CO: I don't know whether the kind of suggestion annotation help to process the corpus.
Either the whole paragraph needs to be excluded from analysis or you include the paragraph
and you know that the paragraph contain some speech text. I think, this is not a big benefit. -->
<item>No : rely on (or normalise) typographic conventions such as quote marks or
dashes to distinguish direct speech only. </item>
<!-- CO: At the moment, I would prefer this solution. -->
<!-- LB: it's certainly the easiest! tho normalising punctuation marks may be
problematic-->
</list>
</p>
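<p>The following sketch contrasts three of these options, applied to a single invented
sentence. The values given for the <att>unit</att> attribute of <gi>milestone</gi> are
hypothetical, since this draft does not define them. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- using q, always nested within p -->
<p><q>Where are you going?</q> she asked.</p>
<!-- using milestone to mark start and end; the unit values are assumptions -->
<p><milestone unit="speechStart"/>Where are you going?<milestone unit="speechEnd"/> she asked.</p>
<!-- relying on typographic conventions only -->
<p>“Where are you going?” she asked.</p>
</egXML>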
<p>Printed texts typically deploy a number of conventions which can cause problems for
linguistic analyses of even the most basic kind. Changes of font or style
(italicization or use of superscript, for example) can have particular lexical
significance which should be taken into account. End-of-line hyphenation can make it
harder to identify the exact form of a token. Non-standard (i.e. non-modern)
spellings can mislead parsers. Our proposed encoding aims above all for consistency
and transparency in what is reliably achievable, leaving more difficult and
problematic issues to be addressed by linguistic annotations. </p>
<p>We do not preserve the lineation of running prose in our source texts, since this is
always purely an artefact of the source edition. For the same reason we will
reassemble words broken across a line break, silently removing any hyphen present.
(This will make it impossible to use our texts for hyphenation studies. So be it.) </p>
<!-- CO: This contradicts the idea of encoding the first edition in a philological way. -->
<p><label>Open Question</label> : Should page breaks in the source text be preserved ? <list>
<item>Yes : this is useful information (e.g. to determine words-per-page, or to
anchor links to an image of the source text) which is usually available at
no-cost in existing digital texts</item>
<!-- CO: This might help. -->
<!-- BN: From my point of view this information is not necessary. As I said
before, it is related to the book as physical object, not to the novel itself.-->
<item>No : the proposed uses don't justify the cost of providing the information
if it is missing. And pagination is inherently copy-specific.</item>
</list></p>
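<p>If page breaks are retained, one possible encoding would use the TEI <gi>pb</gi>
element at the point where each new page begins. This draft does not itself name that
element, so the following is only a sketch, and the page number shown is invented. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<p><!-- text ending page 42 --><pb n="43"/><!-- text beginning page 43 --></p>
</egXML>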
<p>Font and style variations in the source text usually signal something. Italics may
signal emphasis, quotation, foreign language terms etc. Superscripts almost always
signal abbreviation. The visual salience of these variations is of considerably less
interest to distant readers than the intended function they signal. However, it is
not always easy to determine that function reliably and consistently by algorithm.
Some simple cases could however be addressed. A possible strategy is outlined below.
It assumes the existence of a digital version of the text in which visual features
are explicit, whether by means of TEI-style markup or styling information such as
that provided by Word. <list>
<item>if possible, replace indications of highlighting by an appropriate TEI
element, chosen from the following list : <gi>foreign</gi>, <gi>title</gi>,
<gi>emph</gi></item>
<!-- CO: Why would title and foreign be good elements for the task? Emph refers to linguistic or rhetorical effect -->
<!-- LB: because titles and foreign passages are often represented by
highlighting -->
<item>otherwise, replace all indications of highlighting by the TEI <gi>hi</gi>
element</item>
<!-- CO: this might be a good way of encoding. hi can get a @rend for determining bold, underlined etc.?! -->
<!-- LB: indeed it can, but why would we want to use that? -->
<item>indications of superscript characters (such as French
<soCalled>14ᵉ</soCalled>) should be removed. Instead, the TEI element
<gi>abbr</gi> should be used to indicate the presence of an abbreviated
word: <code>&lt;abbr&gt;14e&lt;/abbr&gt;</code></item>
</list></p>
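<p>By way of illustration only, this strategy might yield markup such as the following.
The sentences are invented, and the choice of element in each case is an editorial
judgement and not the output of any existing tool. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- highlighting whose function can be identified: title, emph, foreign -->
<p>He had read <title>Waverley</title> twice, though he would <emph>never</emph> admit
it, and liked to call the habit a <foreign xml:lang="fr">faiblesse</foreign>.</p>
<!-- highlighting whose function is unclear: hi -->
<p>The sign read <hi>Private</hi>.</p>
<!-- superscript removed and the abbreviation marked with abbr -->
<p xml:lang="fr">Il habitait au <abbr>14e</abbr> étage.</p>
</egXML>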
<p><label>Open Question</label>: Is it feasible or useful to recode highlighted spans of
text in this way? <!-- BN: I don't know. I'm not sure if it is worth annotating this information. It
seems that it is not really useful for a distant reading.-->
<!-- LB : I think I am coming down on Borja's side here: it's a LOT of work to do it
properly, and most distant readers won't want to use it -->
<list>
<item>Yes : in many cases this can be an automatic process and the results justify
investing the effort </item>
<item>No : there are likely to be too many borderline or debatable cases to do
this automatically so this would have to be done as part of a major proof
reading exercise</item>
</list></p>
<p>Whichever solution is adopted, it should be applied uniformly across the ELTeC. A
collection in which some texts make distinctions ignored by others is
unsatisfactory.</p>
</div>
<div>
<head>TEI Elements used</head>
<p>This section will provide a checklist of TEI elements used in the body of each ELTeC
text, with descriptions and examples of their intended applications. </p>
</div>
<div>
<head>Metadata in the TEI Header</head>
<!-- BN: About the metadata:
Licences (Creative Commons) will be included in the metadata, won't they?
I think it could be useful to link author names with WikiData, VIAF, ISNI
or similar linked open data resources. The ID of each author in these
resources could be included in the metadata.
WikiData:
https://www.wikidata.org/wiki/Wikidata:Main_Page
VIAF (Virtual International Authority File):
https://www.oclc.org/en/viaf.html
http://viaf.org/
ISNI (International Standard Name Identifier (ISO 27729))
http://www.isni.org/
http://isni.oclc.org/-->
<!-- LB : agreed. Since wikidata includes the others, should we use that for
preference? -->
<p>This section describes the metadata associated with each text (title, authorship,
date etc.) and with the collection as a whole. The intention is to provide this in a
standardised way to facilitate subsetting of the collection, using (for example)
coded values for the descriptive selection criteria associated with the text. As far
as possible, our text should represent the first complete printed edition of each
novel selected. </p>
<p>The TEI Header provides a very large number of possibilities for encoding such
metadata. We will provide a checklist of the TEI Header elements which are always to
be provided for each text, possibly in the form of a template. As in the body of the
text, the intention is to provide a guaranteed minimal level of information,
consistent across all parts of the ELTeC. </p>
<!-- CO: maybe you can include the metadata in the discussion paper of the selection criteria? -->
<p>Note that metadata may be supplied at (at least) two levels: the level of the ELTeC
as a whole, and that of individual texts within it. Information which applies
uniformly to all parts of the collection should be supplied in the ELTeC header;
information specific to a particular document in the text header. </p>
</div>
<div>
<head>Text-level metadata</head>
<p>Here is an example template for an individual text header:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<teiHeader type="novelHeader">
<fileDesc>
<titleStmt>
<title><!-- standard title of work -->
</title>
<author>
<!-- information about the author -->
</author>
</titleStmt>
<extent>
<!-- size of the text, in pages and words -->
</extent>
<publicationStmt>
<!-- boilerplate statement about status as part of ELTeC -->
</publicationStmt>
<sourceDesc>
<bibl>
<!-- bibliographic description of the printed source -->
</bibl>
</sourceDesc>
</fileDesc>
<profileDesc>
<!-- additional descriptive information -->
</profileDesc>
<revisionDesc>
<!-- revision information -->
</revisionDesc>
</teiHeader>
</egXML>
</p>
<p>Within the <gi>teiHeader</gi>, a <gi>fileDesc</gi>, a <gi>profileDesc</gi>, and a
<gi>revisionDesc</gi> are all required. The <gi>encodingDesc</gi> may be supplied
in the (hopefully unlikely) event that some aspect of this document's encoding is
anomalous. </p>
<div>
<head>Components of the file description</head>
<p>The <gi>fileDesc</gi> contains the following mandatory elements: <specList>
<specDesc key="titleStmt"/>
<specDesc key="extent"/>
<specDesc key="publicationStmt"/>
<specDesc key="sourceDesc"/>
</specList>
</p>
<p> Taking these in turn, the <gi>titleStmt</gi> contains the title, author, and
encoder of the document. For novels with multiple authors, titles, or encoders the
element concerned is simply repeated. The <gi>title</gi> should be taken from an
authoritative bibliographic source, and should include a phrase such as
<soCalled>ELTeC edition</soCalled>. The <gi>author</gi> may contain one or more
of the following descriptive elements: <specList>
<specDesc key="persName"/>
<specDesc key="forename"/>
<specDesc key="surname"/>
<specDesc key="birth"/>
<specDesc key="death"/>
<specDesc key="affiliation" atts="type"/>
<specDesc key="sex" atts="value"/>
<specDesc key="idno" atts="type"/>
</specList>
</p>
<p>In addition to one or more <gi>author</gi> elements, a <gi>titleStmt</gi> should
contain at least one <gi>respStmt</gi> element indicating the person responsible
for the ELTeC encoded version, using the following elements: <specList>
<specDesc key="resp"/>
<specDesc key="respStmt"/>
<specDesc key="name"/>
</specList></p>
<p>Here is an example :
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<titleStmt>
<title>Howards End : ELTeC edition</title>
<author>
<persName>
<forename>Edward</forename>
<forename>Morgan</forename>
<surname>Forster</surname>
</persName>
<persName>E.M. Forster</persName>
<birth when="1879"/>
<death when="1970"/>
<sex value="M"/>
<idno type="viaf">https://viaf.org/viaf/31996364</idno>
<idno type="wiki">https://www.wikidata.org/wiki/Q189119</idno>
</author>
<respStmt>
<resp>ELTeC encoding</resp>
<name>Lou Burnard</name>
</respStmt>
</titleStmt>
</egXML>
</p>
<p> The <gi>extent</gi> provides information about the size of the document, given by
means of the following elements: <specList>
<specDesc key="extent"/>
<specDesc key="measure" atts="unit quantity"/>
</specList> Exactly which measurements will be most useful and easily incorporated
is yet to be determined: probably a count of words and pages will suffice. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples"><extent>
<measure unit="words" quantity="20010"/>
<measure unit="pages" quantity="245"/>
</extent>
</egXML>
<p>The <gi>publicationStmt</gi> is required for TEI conformance: in individual text
headers it will contain some standard boilerplate text referring to the fuller
statement which will be furnished by the collection-level header. <!--<specList>
<specDesc key="idno"/>
<specDesc key="pubPlace"/>
<specDesc key="publisher"/>
<specDesc key="date" atts="when"/>
<specDesc key="biblScope"/>
</specList>-->
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<publicationStmt>
<p>Incorporated into the ELTeC <date>2018-02-12</date></p>
</publicationStmt>
</egXML>
</p>
<p>The <gi>sourceDesc</gi> element is also required for TEI conformance. It will
contain a bibliographic description of the source text against which the digital
text has been validated, typically the first published edition of the work
concerned. Where the ELTeC version derives from a pre-existing digital version of
this work, a reference to that source will also be provided. The following
elements are used to record this information: <specList>
<specDesc key="bibl"/>
<specDesc key="title"/>
<specDesc key="author"/>
<specDesc key="publisher"/>
<specDesc key="pubPlace"/>
<specDesc key="ref"/>
<specDesc key="date"/>
<specDesc key="idno"/>
</specList>
</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<sourceDesc>
<bibl>
<author>E.M. Forster</author>
<title>Howards End</title>
<pubPlace>London</pubPlace>
<publisher>Edward Arnold</publisher>
<date>1910</date>
<idno type="wiki">https://www.wikidata.org/wiki/Q1146642</idno>
</bibl>
<bibl>
<title>The Project Gutenberg Etext of Howards End, by E. M. Forster</title>
<ref target="http://www.gutenberg.org/files/2891/2891-h/2891-h.htm">HTML
version downloaded on <date>2017-12-26</date></ref>
</bibl>
<note type="editions" source="worldcat">WorldCat lists 484 print editions in
English</note>
</sourceDesc>
</egXML>
</div>
<div>
<head>Components of the profile description</head>
<p>The <gi>profileDesc</gi> of an ELTeC text has the following mandatory components: <specList>
<specDesc key="langUsage"/>
<specDesc key="textClass"/>
</specList></p>
<p>The <gi>langUsage</gi> element contains one or more <gi>language</gi> elements,
one for each language, dialect, sublanguage, etc. explicitly identified in the body
of the text. The <att>usage</att> attribute of each indicates roughly what percentage
of the text uses that language. For example, a text which is almost entirely in
British English but also contains some passages in US English would have an entry
like this: </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<langUsage>
<language ident="en-GB" usage="90">British English</language>
<language ident="en-US" usage="10">North American English</language>
</langUsage></egXML>
<p>The TEI <gi>textClass</gi> element can contain one or more of the following
elements: <specList>
<specDesc key="catRef"/>
<specDesc key="classCode"/>
<specDesc key="keywords" atts="source"/>
<specDesc key="term"/>
</specList> The first three of these offer alternative methods of classifying a text
(<gi>term</gi> is used only within <gi>keywords</gi>), and they can be used in
parallel. It is an <label>open question</label> which we should use for the ELTeC
collection: the schema proposed here permits any combination. </p>
<p>The <gi>keywords</gi> option allows us to supply one or more <gi>term</gi>
elements to categorise a text in some way. If the values are taken from a known
closed list or authority file, that file should be specified using the
<att>source</att> attribute. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<textClass>
<keywords source="http://wikidata.org">
<term>social class</term>
<term>social convention</term>
<term>modernity</term>
<term>family drama</term>
</keywords>
</textClass>
</egXML>
<p><label>Open Question</label> : should we invent our own taxonomy, use a
pre-existing one, or make no attempt to constrain or predefine the terms used here?</p>
<p>The <gi>classCode</gi> option allows us to use classification codes defined by
existing authorities, such as library catalogue schemes, while the
<gi>catRef</gi> option allows us to assign categories drawn from our own
classification scheme. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<catRef target="#author_m #reprint_3"/>
<classCode source="UDC">8231.111</classCode>
</egXML>
<p>Since our selection and descriptive criteria are likely to be specific to the
project, we will probably have to define them in the corpus header using the
following elements: <specList>
<specDesc key="taxonomy"/>
<specDesc key="category"/>
<specDesc key="catDesc"/>
</specList></p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<taxonomy>
<category xml:id="author_m"><catDesc>male authorship</catDesc></category>
<category xml:id="author_f"><catDesc>female authorship</catDesc></category>
<category xml:id="author_u"><catDesc>author gender unknown</catDesc></category>
<category xml:id="reprint_0"><catDesc>no reprints found</catDesc></category>
<category xml:id="reprint_1"><catDesc>1 to 49 editions</catDesc></category>
<category xml:id="reprint_2"><catDesc>50 to 100 editions</catDesc></category>
<category xml:id="reprint_3"><catDesc>over 100 editions</catDesc></category>
</taxonomy>
</egXML>
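<p>Taken together, the fragments above suggest what a complete profile description
might look like. The following is a sketch only, not a settled recommendation: it
assumes a text written entirely in British English, and it simply reuses category
identifiers and keywords from the examples above.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<profileDesc>
<!-- assumption: the whole text is in one language -->
<langUsage>
<language ident="en-GB" usage="100">British English</language>
</langUsage>
<textClass>
<!-- pointers to categories defined in a taxonomy such as the one above -->
<catRef target="#author_m #reprint_3"/>
<keywords source="http://wikidata.org">
<term>social class</term>
<term>modernity</term>
</keywords>
</textClass>
</profileDesc>
</egXML>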
<!-- LB: some examples are needed here. Check recommendations of TEI in Libraries -->
</div>
<div>
<head>Components of the revision description</head>
<p>The <gi>revisionDesc</gi> element is used to document significant points in the
version history of the document. At least one entry should be provided for an
ELTeC document, specifying when it was first added to the collection. The
following elements can be used: <specList>
<specDesc key="revisionDesc"/>
<specDesc key="change" atts="when who"/>
</specList>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<revisionDesc>
<change when="2018-02-21" who="ELTeC:LB">Added new linguistic
classifications</change>
<change when="2018-01-29" who="ELTeC:LB">Added to the ELTeC</change>
</revisionDesc></egXML>
</p>
</div>
<div>
<head>Encoding description</head>
<p>The TEI allows for the specification of encoding practice, by which is meant
documentation of the specific editorial policies followed during transcription
(treatment of printed hyphens, lexical normalisation, sampling procedures,
features included, ignored, or normalised, etc.). Such a specification may be
supplied for each individual document, or once for the whole of a corpus. It is
even possible to specify that different parts of a document follow different
policies, provided that each policy is defined somewhere. </p>
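<p>By way of illustration only, an editorial declaration of this kind might be
expressed as follows. The elements shown (<gi>editorialDecl</gi>,
<gi>hyphenation</gi>, <gi>normalization</gi>) belong to the full TEI header module
and are not among those selected for the ELTeC schema specified later in this
document. The policies stated in the prose are invented for the example and are not
decisions of the project.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<encodingDesc>
<editorialDecl>
<!-- hypothetical policy: soft end-of-line hyphens are not retained -->
<hyphenation eol="none">
<p>End-of-line hyphenation in the source has been removed.</p>
</hyphenation>
<!-- hypothetical policy: changes are made without being marked up -->
<normalization method="silent">
<p>Typographic ligatures and the long s have been replaced by their modern
equivalents.</p>
</normalization>
</editorialDecl>
</encodingDesc>
</egXML>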
<p><label>Open Question</label> : We propose as far as possible not to allow for any
variation in encoding policies applied within the ELTeC. We will still need to
determine our encoding policies, of course, and to document them appropriately in
the ELTeC corpus header, but there should be no need for separate specifications
at the document level. </p>
</div>
</div>
<div>
<head>Linguistic and semantic annotation</head>
<p> Later stages of the project will need to use additional markup facilities to
represent more sophisticated annotations, which may be motivated linguistically (for
example, to provide a normalised form, part of speech, etc.) or semantically (for
example to distinguish proper names, names of people, places, events, etc.). These
will form an additional layer, not discussed here: the principle should however be
that the base text we provide is always available in a uniform encoding.</p>
</div>
<div>
<head>Formal specification</head>
<p>The ELTeC encoding scheme defined by this document is a TEI-conformant customization,
from which user documentation and formal RELAX NG or DTD specifications can be
generated automatically. This section contains the schema specification itself.</p>
<schemaSpec ident="ELTeC" start="TEI">
<moduleRef key="tei"/>
<moduleRef key="core"
include="author bibl date head item l label lg list measure
name p pb publisher pubPlace quote ref resp respStmt sp stage hi term title"/>
<moduleRef key="textstructure" include="TEI text div body front back"/>
<moduleRef key="header"
include="catDesc category catRef change classCode encodingDesc extent fileDesc idno keywords langUsage language
profileDesc publicationStmt revisionDesc sourceDesc taxonomy teiHeader textClass titleStmt"/>
<moduleRef key="namesdates"
include="affiliation forename persName surname"/>
<!-- Class modifications to reduce attribute clutter -->
<!-- first remove classes which provide attributes we don't want -->
<classSpec ident="att.declaring" type="atts" mode="delete"/>
<classSpec ident="att.declarable" type="atts" mode="delete"/>
<classSpec ident="att.personal" type="atts" mode="delete"/>
<classSpec ident="att.ranging" type="atts" mode="delete"/>
<classSpec ident="att.written" type="atts" mode="delete"/>
<classSpec ident="att.breaking" type="atts" mode="delete"/>
<classSpec ident="att.datable.iso" type="atts" mode="delete"/>
<classSpec ident="att.datable.custom" type="atts" mode="delete"/>
<classSpec ident="att.divLike" type="atts" mode="delete"/>
<classSpec ident="att.editLike" type="atts" mode="delete"/>
<classSpec type="atts" ident="att.global.responsibility" mode="delete"/>
<classSpec ident="att.naming" type="atts" mode="delete"/>
<!-- next modify classes which provide some attributes we want and some we don't -->
<classSpec type="atts" ident="att.global.rendition" mode="change">
<attList>
<attDef ident="rendition" mode="delete"/>
<attDef ident="style" mode="delete"/>
</attList>
</classSpec>
<classSpec type="atts" ident="att.typed" mode="change">
<attList>
<attDef ident="subtype" mode="delete"/>
</attList>
</classSpec>
<classSpec type="atts" ident="att.dimensions" mode="change">
<attList>
<attDef ident="precision" mode="delete"/>
<attDef ident="scope" mode="delete"/>
</attList>
</classSpec>
<classSpec type="atts" ident="att.datable" mode="change">
<attList>
<attDef ident="calendar" mode="delete"/>
<attDef ident="period" mode="delete"/>
</attList>
</classSpec>
<classSpec type="atts" ident="att.pointing" mode="change">
<attList>
<attDef ident="targetLang" mode="delete"/>
<attDef ident="evaluate" mode="delete"/>
</attList>
</classSpec>
<!-- now tweaks for individual elements -->
<!-- make xml:id and xml:lang obligatory for text -->
<elementSpec ident="text" mode="change">
<attList>
<attDef ident="xml:id" mode="change" usage="req"/>
<attDef ident="xml:lang" mode="change" usage="req"/>
</attList>
</elementSpec>
<!-- define two new attributes for author -->
<elementSpec ident="author" mode="change">
<attList>
<attDef ident="dates" mode="add">
<desc xml:lang="en">supplies the years of birth and death of an author</desc>
<datatype minOccurs="2" maxOccurs="2">
<dataRef name="gYear"/>
</datatype>
<!-- any pair of 4 digit values will do here; we could
add schematron to check that it is in the
right range and that the second is after the first -->
</attDef><attDef ident="sex" mode="add">
<desc xml:lang="en">specifies the sexual identification usually associated with the author</desc>
<valList type="closed">
<valItem ident="m"><desc>male</desc></valItem>
<valItem ident="f"><desc>female</desc></valItem>
<valItem ident="u"><desc>unknown or inapplicable</desc></valItem>
</valList>
</attDef>
</attList>
</elementSpec>
</schemaSpec>
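<p>As a sketch of what these modifications imply for a document instance, the
following fragment uses the two attributes added to <gi>author</gi> and supplies the
two attributes made obligatory on <gi>text</gi>. The identifier
<val>ENG0001</val> is hypothetical, since no convention for identifiers is fixed
here.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- in the header: two four-digit years, then a value from the closed list m, f, u -->
<author dates="1879 1970" sex="m">E.M. Forster</author>
<!-- ... -->
<!-- in the document: both attributes are required by the schema above -->
<text xml:id="ENG0001" xml:lang="en">
<body>
<div>
<p>...</p>
</div>
</body>
</text>
</egXML>
<p>The additional check suggested in the comment on the <att>dates</att> attribute
could be expressed as a <gi>constraintSpec</gi> placed inside the
<gi>elementSpec</gi> for <gi>author</gi>. What follows is an untested sketch, not
part of the schema: it assumes that the generated Schematron uses an XPath 2.0 query
binding (needed for <code>tokenize()</code>), that the <code>tei</code> prefix is
bound to the TEI namespace, and that <val>schematron</val> is an accepted value for
<att>scheme</att> in the version of the TEI being used. It checks only the order of
the two years, not their range.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<constraintSpec ident="author-dates-order" scheme="schematron">
<constraint>
<sch:rule xmlns:sch="http://purl.oclc.org/dsdl/schematron"
context="tei:author[@dates]">
<!-- split the attribute value into its two year tokens -->
<sch:let name="years" value="tokenize(normalize-space(@dates), ' ')"/>
<sch:assert test="number($years[2]) ge number($years[1])">The year of death
must not precede the year of birth.</sch:assert>
</sch:rule>
</constraint>
</constraintSpec>
</egXML>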
</div>
</body>
<back xml:id="sources">
<p>Sources consulted</p>
<listBibl>
<bibl>An introduction to TEI Simple Print <idno type="URI"
>http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_simplePrint.doc.html</idno></bibl>
<bibl>Burnard, Lou. (<date>2005</date>).
<title level="a">Metadata for corpus work</title>. In <title>Developing Linguistic
Corpora: A guide to good practice</title>, ed. Martin Wynne. Oxford: Oxbow Books,
pp. 30-46. <!--<ref target="2005-metadata.xml">XML source</ref>--></bibl>
<bibl> Odebrecht, Carolin. (2017). Metadata for Historical Corpora. Realization of the
Metamodel for Corpus Metadata with the help of TEI Customization [Data set]. Zenodo.
http://doi.org/10.5281/zenodo.267999</bibl>
<bibl>
<idno type="URI">github.com/cligs/textbox</idno>
</bibl>
</listBibl>
</back>
</text>
</TEI>