Technical Checklist for TEI/SGML documents
(version 1 / 2 Nov 2002 / Tobias Rischer)
Purpose and assumptions
These checks are meant to compare and classify sample TEI/SGML
documents and their properties with a view to XML migration.
The starting point is a sample of TEI/SGML, consisting of one or
several files, in a separate subdirectory. Some of the probing
suggestions given below only help if the sample is a valid,
parseable SGML document.
The person checking the sample need not be well acquainted with it,
and should not need to spend too much time per sample in order to
gather meaningful information.
When answering the items of the checklist, please state briefly how
the answer was obtained. So, "No" is not as good as "No
(grepped for SUBDOC and didn't find any)."
On the system used for checking, the following could be useful:
* the P3 TEILite DTD and the full TEI DTD, stored somewhere locally;
* the SP utilities: nsgmls, spam, sx;
* a validating XML parser, for example rxp;
* some general UNIX tools or equivalents: grep, diff, perl, wc.
Checklist
1. First visual check
101. What is the title and source repository of the sample? What files
are there?
102. Does the sample come with: no DTD / copy of standard DTD (which?)
/ Pizza DTD?
103. Was the sample (by the look of it) generated by a program, or
written/edited by a person?
104. Is the sample document all in one file, or distributed over
several SGML files?
2. SGML check:
201. Is the sample valid (parseable) SGML? (with its own DTD / with
some standard TEI DTD?)
How to find out:
parse with "nsgmls -s" and retry with your locally stored DTDs.
202. Is the sample already XML? Is it valid?
How to find out:
the first hint is the extension (".xml" or ".sgm(l)") of course.
Then, look for "/>" in the code (the key feature of the XML syntax
for empty elements). You can try to parse with rxp or other
validating XML parsers.
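The "/>" hint can be checked quickly from the shell. The following is a minimal sketch, not a definitive test: the function names are hypothetical helpers made up for this example, and a hit only suggests XML syntax, it does not prove it.

```shell
# Count the lines that contain "/>", the XML empty-element syntax.
# (count_empty_element_lines is a hypothetical helper name.)
count_empty_element_lines() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under "set -e".
    grep -c '/>' "$1" || true
}

# Succeed if the first line of the file starts with an XML declaration.
# In a basic regular expression, "?" is an ordinary character.
starts_with_xml_declaration() {
    head -n 1 "$1" | grep -q '^<?xml'
}
```

A count of 0 together with a missing XML declaration makes it rather unlikely that the sample is already XML; anything else calls for a run through a validating XML parser.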
203. Does the sample use SUBDOCs? For anything other than WSDs?
How to find out:
grep for the word "SUBDOC" - most likely, it will be used to refer
to a Writing System Declaration, as in:
204. Are all elements fully tagged without minimization techniques?
How to find out:
Search for empty end tags ("</>") and compare the number of "<p>"
start tags with the number of "</p>" end tags, counting each with
grep piped into "wc -l" (other likely candidates could be "li"
or "item" - these suggestions are just heuristics, of course). Run
"spam -p -momittag" on the sample, redirect the result into a new
file, and check for differences with diff.
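The start-tag versus end-tag comparison could be sketched as follows. This is only a heuristic under stated assumptions: count_tag_pairs is a hypothetical helper name, "grep -o" is assumed to be available (GNU and BSD grep have it, POSIX does not require it), and tags that are split over several lines are not counted correctly.

```shell
# count_tag_pairs FILE NAME
# Print the number of start tags and end tags found for element NAME.
count_tag_pairs() {
    file=$1
    name=$2
    # Start tags: "<name" followed by whitespace, ">" or the end of the line.
    # -i ignores case, since SGML element names are usually case-insensitive.
    starts=$(grep -o -i -E "<${name}([[:space:]>]|\$)" "$file" | wc -l)
    # End tags: "</name", optional whitespace, ">".
    ends=$(grep -o -i -E "</${name}[[:space:]]*>" "$file" | wc -l)
    # The arithmetic expansion strips any padding that wc may add.
    echo "start=$((starts)) end=$((ends))"
}
```

Calling "count_tag_pairs sample.sgm p" (with sample.sgm standing for the file under test) prints both counts on one line; a clear difference between them hints at omitted end tags.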
205. Are all attribute values quoted?
How to find out:
Use a regular expression to find any equal sign not followed by a
single or double quote, then quickly look for suspicious lines.
The following Perl script should do it:
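while (my $line = <>) {
    # Check every equal sign on the line, not just the first one.
    while ($line =~ /=\s*(\S)/g) {
        if ($1 ne "'" && $1 ne '"') {
            print $line;
            last;    # one report per line is enough
        }
    }
}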
Even better: use spam -p -mattvalue on the sample, redirect the
output into a new file, and compare it against the sample with
diff.
206. Are there any omitted attribute names (that is, a bare attribute
value inside a start tag, with no name and no equal sign)?
How to find out:
It will come up with the spam technique proposed for the previous
check item, or by running "spam -p -mattname" and comparing results.
If you can't parse, you can try a complicated regular expression
(a word within a tag, after the tag name, and not followed by an
equal sign -- but don't forget that tags can spread over several
lines).
It would be useful to have a list of TEI tags where this can
happen (those with attributes of NMTOKENS type, I would think).
207. Does the text use SDATA entity references for well-known
(Unicode) characters? Are there any self-defined / non-ISO /
non-Unicode SDATA entities?
How to find out:
You can use the following little perl program to get statistics on
the entity references used in the sample:
my %entities;
while (my $line = <>) {
    # Count every entity reference of the form "&name;" on the line.
    while ($line =~ /(&[^;\s&]+;)/g) {
        ++$entities{$1};
    }
}
# Print each reference with its number of occurrences.
foreach my $ent (sort keys %entities) {
    print "$ent \t$entities{$ent}\n";
}
What remains to be done is checking the names. By starting at the
other end, you can check the document prolog or extension files
for entity definitions (assuming the extension files are part of
the sample).
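Starting "at the other end" might look like the sketch below, which lists the general entities declared in the prolog or in extension files. It assumes that each declaration keeps "<!ENTITY" and the entity name on one line; list_declared_entities is a hypothetical helper name, and "grep -o" is again an assumption about the local grep.

```shell
# list_declared_entities FILE...
# Print the names of all general entities declared in the given files.
# Parameter entities ("<!ENTITY % name ...") are skipped on purpose.
list_declared_entities() {
    grep -h -o -i -E '<!ENTITY[[:space:]]+[^%[:space:]][^[:space:]]*' "$@" |
        awk '{ print $2 }' |   # keep only the entity name
        LC_ALL=C sort -u       # one line per name, stable ordering
}
```

The resulting list can be compared by eye with the entity references found in the text; names that appear in neither a standard ISO entity set nor this list deserve a closer look.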
208. Are there comments? In formats not legal in XML?
How to find out:
XML comments must be of the form "<!-- ... -->" with no "--" in
between. Empty declarations "<!>" are forbidden. Pragmatically,
you can look for "--" and "<!>" in the sample.
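One pragmatic way to do this from the shell is sketched below. It first removes every comment that opens and closes on the same line and has no "--" inside, then reports whatever "--" or "<!>" is left. The helper name list_comment_suspects is hypothetical, "sed -E" is assumed to be supported (GNU and BSD sed), and comments that span several lines will show up as false positives.

```shell
# list_comment_suspects FILE
# Print "line-number:text" for lines that still contain "--" or "<!>"
# after all harmless one-line comments have been removed.
list_comment_suspects() {
    # ([^-]|-[^-])* matches a comment body that never contains "--".
    sed -E 's/<!--([^-]|-[^-])*-->//g' "$1" |
        grep -n -e '--' -e '<!>' || true
}
```

An empty result suggests that the comments in the sample are already acceptable to an XML parser.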
209. Are there Processing Instructions? Do they start with a name?
How to find out:
Processing instructions start with "<?"; in XML, this must be
followed directly by a name. The name "xml" is reserved (and
forbidden in any spelling other than lowercase). You can simply
grep for "<?" and check what you get.
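The grep can be narrowed to the suspicious cases, as in this sketch. It reports every "<?" that is not directly followed by a letter, "_" or ":"; this is a simplification of the XML name rules, and list_pi_suspects is a hypothetical helper name.

```shell
# list_pi_suspects FILE
# Print "line-number:text" for processing instructions that do not
# start with a name (or that are cut off at the end of the line).
list_pi_suspects() {
    grep -n -E '<\?([^[:alpha:]_:]|$)' "$1" || true
}
```

Processing instructions that pass this filter still need a look at how they end, because SGML closes them with ">" while XML requires "?>".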
210. Does the sample use really obscure SGML features? (CONCUR, ...)
How to find out:
If you are an SGML specialist, you could have a quick look at the
SGML declaration and/or the beginning of the document. But the
item is here mostly for completeness' sake. If in doubt, just
guess "no".
211. What kind of warnings and errors do you get from sx? Something
not yet probed by the previous checks?
How to find out:
Try to run sx on the sample and look at the errors and warnings.
This, of course, only works with parseable samples.
3. TEI check:
301. On which TEI DTD is the sample based? (P2, P3, P4, TEILite,
unknown)
How to find out:
Check the DOCTYPE declaration at the beginning. If the sample
comes with its own DTD file, have a short look at that one. Have a
quick look at the TEI Header. You can use the perl code for the
camelCase check to get a list of tags and check for non-TEI ones.
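Pulling the public identifier out of the DOCTYPE declaration could be sketched like this. The sketch assumes that the keyword PUBLIC is written in uppercase and sits on the same line as "<!DOCTYPE"; doctype_public_id is a hypothetical helper name, and "sed -E" is assumed to be available.

```shell
# doctype_public_id FILE
# Print the public identifier of the first DOCTYPE declaration, if any.
doctype_public_id() {
    grep -i '<!DOCTYPE' "$1" | head -n 1 |
        # Keep only the quoted string that follows the PUBLIC keyword.
        sed -n -E "s/.*PUBLIC[[:space:]]+[\"']([^\"']*)[\"'].*/\\1/p"
}
```

An empty result means that the declaration uses a SYSTEM identifier only, or that it does not fit the one-line assumption; in both cases the declaration has to be read by eye.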
302. Does the sample (consistently) use the TEI camelCase spelling?
How to find out:
Systematic problems (all uppercase or all lowercase) can be
spotted with one look at the start of the document, thanks to the
spelling of "teiHeader". For a deeper check, the following perl
code should give you a sorted list of tags as they occur in the
input (in the spelling used there). Below that is a list of all
camelCased tag names (no guarantee, it's a copy/paste from Sebastian
Rahtz's XSL). If I had a list of all TEI tags, this perl code could
be extended into an automatic checker for new and mis-cased tags.
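my %tags;
while (my $line = <>) {
    # Collect the name of every start tag and end tag on the line.
    while ($line =~ /<\/?\s*([^\s>\/]+)/g) {
        ++$tags{$1};
    }
}
# Print each tag name with its number of occurrences.
foreach my $tag (sort keys %tags) {
    print " $tags{$tag} \t $tag \n";
}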
Tags that are not all-lowercase: TEI.2, addName, addSpan,
addrLine, altGrp, attDef, attList, attName, attlDecl, baseWsd,
biblFull, biblScope, biblStruct, castGroup, castItem, castList,
catDesc, catRef, classCode, classDecl, classDoc, codedCharSet,
dataDesc, dateRange, dateStruct, delSpan, divGen, docAuthor,
docDate, docEdition, docImprint, docTitle, eLeaf, eTree,
editionStmt, editorialDecl, elemDecl, encodingDesc, entDoc,
entName, entitySet, entryFree, extFigure, fAlt, fDecl, fDescr,
fLib, figDesc, fileDesc, firstLang, foreName, forestGrp, fvLib,
genName, geogName, gramGrp, handList, handShift, headItem,
headLabel, iNode, interpGrp, joinGrp, lacunaEnd, lacunaStart,
langKnown, langUsage, linkGrp, listBibl, metDecl, nameLink,
notesStmt, oRef, oVar, offSet, orgDivn, orgName, orgTitle,
orgType, otherForm, pRef, pVar, particDesc, particLinks, persName,
personGrp, placeName, postBox, postCode, profileDesc, projectDesc,
pubPlace, publicationStmt, rdgGrp, recordingStmt, refsDecl,
respStmt, revisionDesc, roleDesc, roleName, samplingDecl,
scriptStmt, seriesStmt, settingDesc, soCalled, socecStatus,
sourceDesc, spanGrp, stdVals, tagDoc, tagUsage, tagsDecl,
teiCorpus.2, teiFsd2, teiHeader, termEntry, textClass, textDesc,
timeRange, timeStruct, titlePage, titlePart, titleStmt, vAlt,
vDefault, vRange, valDesc, valList, variantEncoding, witDetail,
witEnd, witList, witStart.
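With the list above saved to a file, one name per line, a check for mis-cased tags could be sketched as follows. The names file, the helper name find_miscased_tags and the use of "grep -o" are all assumptions of this example; tags that are all-lowercase in TEI are not covered unless they are added to the names file.

```shell
# find_miscased_tags NAMES SAMPLE
# NAMES:  file with one correctly spelled tag name per line
# SAMPLE: the SGML file to check
# Print "found -> expected" for every tag whose spelling differs from
# the names file only in its use of upper and lower case.
find_miscased_tags() {
    names=$1
    sample=$2
    grep -o -E '</?[A-Za-z][A-Za-z0-9.]*' "$sample" |
        tr -d '</' |           # strip the tag delimiters
        LC_ALL=C sort -u |     # each spelling only once
        awk -v names="$names" '
            BEGIN {
                # Index the correct spellings by their lowercase form.
                while ((getline n < names) > 0) canon[tolower(n)] = n
            }
            {
                key = tolower($0)
                if ((key in canon) && (canon[key] != $0))
                    print $0 " -> " canon[key]
            }'
}
```

An all-uppercase sample will produce one line per camelCased tag, which also gives a rough idea of how much renaming the migration would involve.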
303. Is there a substantial DTD subset? (That is the part between [
and ] at the beginning of the document, within the DOCTYPE
declaration.) Does it contain more than ENTITY declarations with
TEI DTD parameters and invocations of character sets?
304. Does the sample DTD rename TEI tags?
How to find out:
Check DTD extension files (if available) or the DTD itself. A
checker for unknown TEI tags would be nice to spot them
automatically.
305. Are there real DTD modifications? With recommended technique or
by editing DTD files?
How to find out:
Modifications should be done in DTD extension files and documented
somewhere. Even if "they" forgot to pack extension files, these
should be referred to in the DTD subset at the beginning of the
document. Hand-edited DTD files (if available) could contain
comments or other indications. One could try a diff of an alleged
TEILite DTD against an official one (but I think there is more
than one official TEI Lite DTD, and diff might report a lot of
noise anyway). If the sample is parseable, one could try parsing
against an official DTD and wait for errors.
last modified on: 2002-11-02 19:55 GMT
(c) 2002 Tobias Rischer, http://rischer.com/