(version 1 / 2 Nov 2002 / Tobias Rischer)
These checks are meant to compare and classify sample TEI/SGML documents and their properties with a view to XML migration.
We assume a given sample of TEI/SGML, consisting of one or several files, in a separate subdirectory. Some of the suggestions below for probing the sample are only helpful if the file is a valid, parseable SGML document.
The person checking the sample need not be well acquainted with it, and should not need to spend too much time per sample in order to gather meaningful information.
When answering the items of the checklist, it would be good to state briefly how the answer was obtained. So "No" is not as good as "No (grepped for SUBDOC and didn't find any)."
In the system used for checking, the following could be useful:
"nsgmls -s" and retry with your locally stored DTDs.
".xml" or ".sgm(l)") of course. Then, look for "/>" in the code (the key feature of the XML syntax for empty elements). You can try to parse with rxp or other validating XML parsers.
<!ENTITY wsd.english SYSTEM "teien.wsd" SUBDOC>
"</>"
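and compare the number of "<p" followed by whitespace or ">" with the number of "</p" using something like "egrep '<p[[:space:]>]' | wc -l" (this counts matching lines, not occurrences; inside a bracket expression grep would read "\t" as a backslash and a "t", hence "[[:space:]]"). Other likely candidates could be "li" or "item" - these suggestions are just heuristics, of course.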
Run "spam -p -momittag" on the sample, redirect the result into a new file, and check for differences with diff.
    # Print every input line in which some "=" is followed by a value
    # that does not start with a quote (heuristic for unquoted attribute values).
    while (<>) {
        print if /=\s*[^\s"']/;
    }
Even better: use "spam -p -mattvalue" on the sample, redirect the output into a new file, and compare it against the sample with diff.
<title m>)?
"spam -p -mattname" and comparing results. If you can't parse, you can try a complicated regular expression (a word within a tag, after the tag name, and not followed by an equal sign -- but don't forget that tags can spread over several lines).
    # Count every entity reference of the form &name; and print a sorted tally.
    # References without the closing ";" are not found.
    my %entities;
    while (<>) {
        $entities{$1}++ while /(&[^;\s]+;)/g;
    }
    print "$_ \t$entities{$_}\n" for sort keys %entities;
What remains to be done is checking the names. By starting at the other end, you can check the document prolog or extension files for entity definitions (assuming the extension files are part of the sample).
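Starting from the other end could look roughly like the sketch below: list the general entities declared in the sample, list the ones referenced, and print the references that have no declaration. This is only a grep-based approximation. The function names are made up for this example, parameter entities are skipped, and entities declared in files that are not passed in (for instance the ISO character sets pulled in by the DTD) will show up as undeclared.

```shell
# Distinct names referenced as &name; (numeric references like &#228; are skipped).
referenced_entities() {
    { grep -h -o -E '&[A-Za-z][A-Za-z0-9._-]*;' -- "$@" || true; } |
        sed -e 's/^&//' -e 's/;$//' | LC_ALL=C sort -u
}

# Distinct names declared with <!ENTITY name ...>; "<!ENTITY % name" is not matched.
declared_entities() {
    { grep -h -o -i -E '<!ENTITY[[:space:]]+[A-Za-z][A-Za-z0-9._-]*' -- "$@" || true; } |
        sed -e 's/^[^[:space:]]*[[:space:]]*//' | LC_ALL=C sort -u
}

# Names that are referenced but not declared in the given files.
undeclared_entities() {
    ref=$(mktemp)
    dec=$(mktemp)
    referenced_entities "$@" > "$ref"
    declared_entities "$@" > "$dec"
    LC_ALL=C comm -23 "$ref" "$dec"
    rm -f "$ref" "$dec"
}
```

Pass the sample together with its extension files, all as arguments of one call, so that declarations and references are seen side by side.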
"<!-- ... -->" with no "--" in between. Empty declarations "<!>" are forbidden. Pragmatically, you can look for "--" and "<!>" in the sample.
"<?" and in XML, they must be followed by a name; the name "xml" is reserved (and forbidden in all other forms than lowercase). You can simply grep for "<?" and check what you get.
"teiHeader". For a deeper check, the following perl code
should give you a sorted list of tags as they occur in the input (in
their spelling). Below that is a list of all camelCased tag names (no
guarantee, it's a copy/paste from Sebastian Rahtz's XSL). If I had a
list of all TEI tags, this perl code could be enhanced to an
automatic checker for new and mis-cased tags.
    # Count start and end tags by name, in the spelling used in the input.
    # Declarations and processing instructions (<!..., <?...) are counted too.
    my %tags;
    while (<>) {
        $tags{$1}++ while m{</?\s*([^\s>/]+)}g;
    }
    print " $tags{$_} \t $_ \n" for sort keys %tags;

Tags that are not all-lowercase: TEI.2, addName, addSpan, addrLine, altGrp, attDef, attList, attName, attlDecl, baseWsd, biblFull, biblScope, biblStruct, castGroup, castItem, castList, catDesc, catRef, classCode, classDecl, classDoc, codedCharSet, dataDesc, dateRange, dateStruct, delSpan, divGen, docAuthor, docDate, docEdition, docImprint, docTitle, eLeaf, eTree, editionStmt, editorialDecl, elemDecl, encodingDesc, entDoc, entName, entitySet, entryFree, extFigure, fAlt, fDecl, fDescr, fLib, figDesc, fileDesc, firstLang, foreName, forestGrp, fvLib, genName, geogName, gramGrp, handList, handShift, headItem, headLabel, iNode, interpGrp, joinGrp, lacunaEnd, lacunaStart, langKnown, langUsage, linkGrp, listBibl, metDecl, nameLink, notesStmt, oRef, oVar, offSet, orgDivn, orgName, orgTitle, orgType, otherForm, pRef, pVar, particDesc, particLinks, persName, personGrp, placeName, postBox, postCode, profileDesc, projectDesc, pubPlace, publicationStmt, rdgGrp, recordingStmt, refsDecl, respStmt, revisionDesc, roleDesc, roleName, samplingDecl, scriptStmt, seriesStmt, settingDesc, soCalled, socecStatus, sourceDesc, spanGrp, stdVals, tagDoc, tagUsage, tagsDecl, teiCorpus.2, teiFsd2, teiHeader, termEntry, textClass, textDesc, timeRange, timeStruct, titlePage, titlePart, titleStmt, vAlt, vDefault, vRange, valDesc, valList, variantEncoding, witDetail, witEnd, witList, witStart.
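Restricted to the mixed-case names listed above, a checker for mis-cased tags could look roughly like this. It is only a sketch: the function name miscased_tags and the file camelcase-tags.txt (the names from the list, one per line) are made up for the example, and all-lowercase TEI names are not checked at all.

```shell
# miscased_tags NAMES_FILE SAMPLE...
# Prints "found -> expected" for every tag name in SAMPLE that equals an entry
# of NAMES_FILE when case is ignored, but is spelled differently.
miscased_tags() {
    names=$1
    shift
    { grep -h -o -E '</?[A-Za-z][A-Za-z0-9.-]*' -- "$@" || true; } |
        sed -e 's/^<//' -e 's/^\///' | LC_ALL=C sort -u |
        awk -v names="$names" '
            # Index the expected spellings by their lowercase form.
            BEGIN { while ((getline line < names) > 0) canon[tolower(line)] = line }
            {
                key = tolower($0)
                if (key in canon && canon[key] != $0) print $0 " -> " canon[key]
            }'
}
```

Called as "miscased_tags camelcase-tags.txt sample.sgm", it reports for example "TEIHEADER -> teiHeader". Tags that are new or simply misspelled are not reported, since that would need the complete list of TEI names.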
"[" and "]" at the beginning of the document, within the DOCTYPE declaration.) Does it contain more than ENTITY declarations with TEI DTD parameters and invocations of character sets?
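For a first impression, one can tally the declaration keywords that occur in the file; anything besides DOCTYPE and ENTITY deserves a closer look. The sketch below assumes that declarations occur only in the prolog of the sample (it does not isolate the text between the brackets), and the function name declaration_kinds is made up for this example.

```shell
# declaration_kinds FILE...
# Prints one "KEYWORD=count" line per kind of markup declaration found,
# e.g. DOCTYPE, ENTITY, ELEMENT, ATTLIST. Comments ("<!--") and marked
# sections ("<![") are not counted.
declaration_kinds() {
    { grep -h -o -i -E '<![A-Za-z]+' -- "$@" || true; } |
        tr '[:lower:]' '[:upper:]' | sed -e 's/^<!//' |
        LC_ALL=C sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

If the output shows ELEMENT, ATTLIST or NOTATION, the subset does more than set TEI parameters and pull in character sets.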