Technical Checklist for TEI/SGML documents

(version 1 / 2 Nov 2002 / Tobias Rischer)

Purpose and assumptions

These checks are meant to compare and classify sample TEI/SGML documents by the properties that matter for migration to XML.

We assume a sample of TEI/SGML consisting of one or several files in a separate subdirectory. Some of the probing suggestions below only help if the sample is a valid, parseable SGML document.

The person checking the sample need not be well acquainted with it, and should not need to spend too much time per sample in order to gather meaningful information.

When answering the checklist items, briefly state how you arrived at each answer. So, "No" is not as good as "No (grepped for SUBDOC and didn't find any)."

In the system used for checking, the following could be useful:

Checklist

1. First visual check

  1. What is the title and source repository of the sample? What files are there?
  2. Does the sample come with: no DTD / copy of standard DTD (which?) / Pizza DTD?
  3. Was the sample (by the look of it) generated by a program, or written/edited by a person?
  4. Is the sample document all in one file, or distributed over several SGML files?

2. SGML check

  1. Is the sample valid (parseable) SGML? (with its own DTD / with some standard TEI DTD?)
    How to find out:
    Parse with "nsgmls -s", first against the sample's own DTD, then again against your locally stored DTDs.
  2. Is the sample already XML? Is it valid?
    How to find out:
    The first hint is the file extension (".xml" or ".sgm(l)"), of course. Then look for "/>" in the code, the key feature of the XML syntax for empty elements. To check validity, try to parse with rxp or another validating XML parser.
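    If no XML parser is installed, a quick well-formedness test can be sketched in Python with the standard library's expat binding. This is only a sketch: the function name is made up for illustration, and expat does not validate against a DTD, so it answers "is it XML at all?" rather than "is it valid?".

```python
import xml.parsers.expat


def well_formedness_error(text):
    """Return None if text is well-formed XML, else the parser's error message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)   # True: this is the final chunk of input
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # Tiny demonstration: the second string omits end tags, as SGML may allow.
    for sample in ("<text><p>one</p><lb/></text>", "<text><p>one<p>two</text>"):
        print(sample, "->", well_formedness_error(sample) or "well-formed")
```

    A sample that fails here may still be perfectly good SGML; the error message usually points at the first SGML-only construct.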
  3. Does the sample use SUBDOCs? For anything other than WSDs?
    How to find out:
    grep for the word "SUBDOC" - most likely, it will be used to refer to a Writing System Declaration, as in:
    <!ENTITY wsd.english SYSTEM "teien.wsd" SUBDOC>
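    The grep can be refined a little so that it also answers the second half of the question. The following Python sketch lists every entity declared with SUBDOC and guesses whether it refers to a WSD; the ".wsd" and "wsd" naming conventions it relies on are assumptions taken from the example above, not a rule.

```python
import re

# An entity declaration that ends with the SUBDOC keyword, e.g.
# <!ENTITY wsd.english SYSTEM "teien.wsd" SUBDOC>
SUBDOC_DECL = re.compile(
    r"<!ENTITY\s+(?P<name>\S+)\s+[^>]*?\bSUBDOC\s*>",
    re.IGNORECASE,
)


def subdoc_entities(text):
    """Return (entity name, looks_like_wsd) for every SUBDOC entity declaration."""
    found = []
    for match in SUBDOC_DECL.finditer(text):
        name = match.group("name")
        declaration = match.group(0).lower()
        # Heuristic only: WSD entities tend to be named wsd.* or point to *.wsd files.
        looks_like_wsd = name.lower().startswith("wsd") or ".wsd" in declaration
        found.append((name, looks_like_wsd))
    return found
```

    Any entry reported with False deserves a closer look.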
  4. Are all elements fully tagged without minimization techniques?
    How to find out:
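    Search for "</>" (an empty end tag). Then compare the number of "<p" followed by whitespace or ">" with the number of "</p", using something like "egrep '<p[[:space:]>]' | wc -l". Two caveats: inside a bracket expression "\t" does not mean a tab (it would match "<pt", and therefore every "<ptr"), hence the "[[:space:]]" class; and "wc -l" counts matching lines rather than occurrences, so the numbers are only comparable when tags rarely share a line. Other likely candidates are "li" or "item". These suggestions are just heuristics, of course. If the sample parses, run "spam -p -momittag" on it, redirect the result into a new file, and check for differences with diff.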
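    The counting heuristic can also be sketched in Python, which counts occurrences rather than lines. The function names and the default list of element names are illustrative choices, not part of any tool mentioned here.

```python
import re


def tag_balance(text, name):
    """Count start tags and end tags for one element name (case-insensitive)."""
    escaped = re.escape(name)
    starts = len(re.findall(rf"<{escaped}(?=[\s>])", text, re.IGNORECASE))
    ends = len(re.findall(rf"</{escaped}(?=[\s>])", text, re.IGNORECASE))
    return starts, ends


def minimization_hints(text, names=("p", "li", "item")):
    """Collect heuristic hints that tag minimization is in use."""
    hints = {"empty_end_tags": text.count("</>")}
    for name in names:
        starts, ends = tag_balance(text, name)
        if starts != ends:
            hints[name] = (starts, ends)   # (start tags, end tags)
    return hints
```

    An element name that shows up in the result has unequal numbers of start and end tags, which suggests omitted tags (or an element declared EMPTY).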
  5. Are all attribute values quoted?
    How to find out:
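    Use a regular expression to find any equals sign not followed by a single or double quote, then scan the output for suspicious lines (equals signs in running text will show up as well). The following Perl script should do it:
    while (<>) {
        # Test every "=" on the line, not only the first one.
        while (/=\s*(\S)/g) {
            if ($1 ne "'" && $1 ne '"') {
                print;
                last;    # one report per line is enough
            }
        }
    }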
    
    Even better: use spam -p -mattvalue on the sample, redirect the output into a new file, and compare it against the sample with diff.
  6. Are there any omitted attribute names (as in <title m>)?
    How to find out:
    It will come up with the spam technique proposed for the previous check item, or by running spam -p -mattname and comparing the results with diff. If you can't parse, you can try a complicated regular expression (a word within a tag, after the tag name, and not followed by an equals sign; but don't forget that tags can spread over several lines).
    It would be useful to have a list of TEI tags where this can happen: those with attributes declared as an enumerated list of name tokens (a name token group), since SGML allows the attribute name to be omitted only for such values.
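    The "complicated regular expression" could look roughly like the following Python sketch, which works on the whole file at once so that tags spread over several lines are handled. It is a heuristic under stated assumptions: attribute values containing ">" will confuse it, and the names used here are made up for illustration.

```python
import re

# A start tag, possibly spread over several lines: element name, then attribute text.
START_TAG = re.compile(r"<([A-Za-z][\w.-]*)((?:\s+[^<>]*)?)>")
# Inside the attribute text: either name=value (value quoted or bare), or a lone token.
ATTR_TOKEN = re.compile(
    r"""([\w.:-]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)
      | ([^\s=]+)
    """,
    re.VERBOSE,
)


def bare_attribute_values(text):
    """Return (tag name, token) pairs for tokens that carry no attribute name."""
    hits = []
    for tag in START_TAG.finditer(text):
        for token in ATTR_TOKEN.finditer(tag.group(2)):
            lone = token.group(2)
            if lone and lone != "/":   # ignore the "/" of an XML empty-element tag
                hits.append((tag.group(1), lone))
    return hits
```

    For the example from the question, bare_attribute_values("<title m>") reports the pair ("title", "m").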
  7. Does the text use SDATA entity references for well-known (Unicode) characters? Are there any self-defined / non-ISO / non-Unicode SDATA entities?
    How to find out:
    You can use the following little Perl program to count the entity references used in the sample:
    my %entities;
    
    while (<>) {
        # Count every reference of the form &name; (or &#number;) on the line.
        while (/(&[\w.#-]+;)/g) {
            ++$entities{$1};
        }
    }
    
    foreach my $ent (sort keys %entities) {
        print "$ent \t$entities{$ent}\n";
    }
    
    What remains to be done is checking the names against the ISO entity sets. Working from the other end, you can check the document prolog or extension files for entity definitions (assuming the extension files are part of the sample).
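    Working from the other end can be sketched in Python as follows: collect the general entities declared in the prolog or in an extension file, then see which referenced names are not declared locally. The function names are hypothetical, and the pattern only covers the common declaration forms.

```python
import re

# <!ENTITY name SDATA "...">, <!ENTITY name SYSTEM "...">, <!ENTITY name "...">, ...
# Parameter entities (<!ENTITY % name ...>) are skipped on purpose.
ENTITY_DECL = re.compile(
    r"<!ENTITY\s+(?P<name>[^\s%]\S*)\s+(?:(?P<kind>SDATA|CDATA|SYSTEM|PUBLIC)\s+)?",
    re.IGNORECASE,
)


def declared_entities(prolog):
    """Map each general entity name to its kind ('' for a plain literal)."""
    return {
        m.group("name"): (m.group("kind") or "").upper()
        for m in ENTITY_DECL.finditer(prolog)
    }


def undeclared(references, prolog):
    """Names that are referenced in the text but not declared in the given prolog."""
    declared = declared_entities(prolog)
    return sorted(name for name in references if name not in declared)
```

    Names declared locally as SDATA are the candidates for self-defined entities; names that are not declared locally presumably come from entity sets pulled in by the DTD.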
  8. Are there comments? In formats not legal in XML?
    How to find out:
    XML comments must be of the form "<!-- ... -->" with no "--" in between. Empty declarations "<!>" are forbidden. Pragmatically, you can look for "--" and "<!>" in the sample.
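    A pragmatic scan along these lines, sketched in Python, reports line numbers so that hits can be inspected quickly. It treats everything from "<!--" to the next "-->" as one comment, which is an approximation of the SGML rules.

```python
import re

XML_STYLE_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def line_of(text, offset):
    """1-based line number of a character offset."""
    return text.count("\n", 0, offset) + 1


def comment_problems(text):
    """Return sorted (line, description) pairs for comment forms that XML rejects."""
    problems = []
    for match in XML_STYLE_COMMENT.finditer(text):
        if "--" in match.group(1):
            problems.append((line_of(text, match.start()), 'comment contains "--"'))
    for match in re.finditer(r"<!>", text):
        problems.append((line_of(text, match.start()), 'empty declaration "<!>"'))
    return sorted(problems)
```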
  9. Are there Processing Instructions? Do they start with a name?
    How to find out:
    Processing instructions start with "<?". In XML they must begin with a name (the target) and end with "?>", whereas SGML closes them with a plain ">". The name "xml" is reserved (and forbidden in all forms other than lowercase). You can simply grep for "<?" and check what you get.
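    To sort the grep results automatically, something like the following Python sketch could be used. The classification labels are invented for this example, the name pattern is only an approximation of the XML name rules, and a ">" inside an XML processing instruction will be misread as its end.

```python
import re

# "<?" up to the next ">"; the captured body excludes "<?" and the final ">".
PI = re.compile(r"<\?(.*?)>", re.DOTALL)
NAME = re.compile(r"[A-Za-z_:][\w.:-]*")


def classify_pi(body):
    """Classify the text between "<?" and ">" of one processing instruction."""
    if not body.endswith("?"):
        return "SGML-style close"
    target = NAME.match(body)
    if target is None:
        return "no name at start"
    if target.group(0).lower() == "xml" and target.group(0) != "xml":
        return "reserved name in wrong case"
    return "ok"


def processing_instructions(text):
    """Return (source text, classification) for every processing instruction."""
    return [(m.group(0), classify_pi(m.group(1))) for m in PI.finditer(text)]
```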
  10. Does the sample use really obscure SGML features? (CONCUR, ...)
    How to find out:
    If you are an SGML specialist, you could have a quick look at the SGML declaration and/or the beginning of the document. But the item is here mostly for completeness' sake. If in doubt, just guess "no".
  11. What kind of warnings and errors do you get from sx? Something not yet probed by the previous checks?
    How to find out:
    Try to run sx on the sample and look at the errors and warnings. This, of course, only works with parseable samples.

3. TEI check

  1. On which TEI DTD is the sample based? (P2, P3, P4, TEILite, unknown)
    How to find out:
    Check the DOCTYPE declaration at the beginning. If the sample comes with its own DTD file, take a brief look at that one. Also take a quick look at the TEI Header. You can use the Perl code given for the camelCase check (next item) to get a list of tags and check for non-TEI ones.
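    Reading the DOCTYPE declaration can be sketched in Python as below. The guess is based purely on substrings of the public identifier, which is an assumption about how the identifiers are usually written; a sample with a SYSTEM identifier only, or with a renamed DTD, will come out as "unknown".

```python
import re

DOCTYPE = re.compile(
    r"<!DOCTYPE\s+(?P<root>[^\s\[>]+)"
    r"(?:\s+PUBLIC\s+(?P<q>[\"'])(?P<public>.*?)(?P=q))?",
    re.IGNORECASE | re.DOTALL,
)


def doctype_info(text):
    """Return (root element name, public identifier or None); None if no DOCTYPE."""
    match = DOCTYPE.search(text)
    if match is None:
        return None
    return match.group("root"), match.group("public")


def guess_tei_version(public_id):
    """Very rough guess from the public identifier: P2, P3, P4, TEILite or unknown."""
    if not public_id:
        return "unknown"
    lowered = public_id.lower()
    # "Lite" is tested first because a Lite identifier may also mention a P-number.
    for needle, label in (("lite", "TEILite"), ("p2", "P2"), ("p3", "P3"), ("p4", "P4")):
        if needle in lowered:
            return label
    return "unknown"
```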
  2. Does the sample (consistently) use the TEI camelCase spelling?
    How to find out:
    Systematic problems (all uppercase or all lowercase) can be spotted with one look at the start of the document, thanks to the spelling of "teiHeader". For a deeper check, the following Perl code gives you a sorted list of tags as they are spelled in the input, with their counts. Below that is a list of all camelCased tag names (no guarantee; it is a copy/paste from Sebastian Rahtz's XSL). With a list of all TEI tags, this Perl code could be extended into an automatic checker for new and mis-cased tags.
    my %tags;
    
    while (<>) {
        # Match start and end tags; comments, declarations and PIs are skipped.
        while (/<\/?([A-Za-z][\w.:-]*)/g) {
            ++$tags{$1};
        }
    }
    foreach my $tag (sort keys %tags) {
        print " $tags{$tag} \t $tag \n";
    }
    
    Tags that are not all-lowercase: TEI.2, addName, addSpan, addrLine, altGrp, attDef, attList, attName, attlDecl, baseWsd, biblFull, biblScope, biblStruct, castGroup, castItem, castList, catDesc, catRef, classCode, classDecl, classDoc, codedCharSet, dataDesc, dateRange, dateStruct, delSpan, divGen, docAuthor, docDate, docEdition, docImprint, docTitle, eLeaf, eTree, editionStmt, editorialDecl, elemDecl, encodingDesc, entDoc, entName, entitySet, entryFree, extFigure, fAlt, fDecl, fDescr, fLib, figDesc, fileDesc, firstLang, foreName, forestGrp, fvLib, genName, geogName, gramGrp, handList, handShift, headItem, headLabel, iNode, interpGrp, joinGrp, lacunaEnd, lacunaStart, langKnown, langUsage, linkGrp, listBibl, metDecl, nameLink, notesStmt, oRef, oVar, offSet, orgDivn, orgName, orgTitle, orgType, otherForm, pRef, pVar, particDesc, particLinks, persName, personGrp, placeName, postBox, postCode, profileDesc, projectDesc, pubPlace, publicationStmt, rdgGrp, recordingStmt, refsDecl, respStmt, revisionDesc, roleDesc, roleName, samplingDecl, scriptStmt, seriesStmt, settingDesc, soCalled, socecStatus, sourceDesc, spanGrp, stdVals, tagDoc, tagUsage, tagsDecl, teiCorpus.2, teiFsd2, teiHeader, termEntry, textClass, textDesc, timeRange, timeStruct, titlePage, titlePart, titleStmt, vAlt, vDefault, vRange, valDesc, valList, variantEncoding, witDetail, witEnd, witList, witStart.
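    With the list above, the check for mis-cased tags can already be automated to some degree. The Python sketch below uses only a handful of names from the list to stay short; a real run would load all of them. It cannot spot new (non-TEI) tags, since that would need the complete list of TEI tag names.

```python
import re
from collections import Counter

# A few names copied from the list above; extend with the full list for real use.
CAMEL_CASE_NAMES = [
    "TEI.2", "teiHeader", "fileDesc", "titleStmt", "publicationStmt",
    "sourceDesc", "biblStruct", "persName", "placeName",
]
CANONICAL = {name.lower(): name for name in CAMEL_CASE_NAMES}
TAG_NAME = re.compile(r"</?([A-Za-z][\w.-]*)")


def miscased_tags(text):
    """Count tags whose spelling differs only in case from a known camelCase name."""
    counts = Counter()
    for name in TAG_NAME.findall(text):
        expected = CANONICAL.get(name.lower())
        if expected is not None and name != expected:
            counts[(name, expected)] += 1
    return counts
```

    Each key of the result pairs the spelling found in the sample with the expected camelCase spelling.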
  3. Is there a substantial internal DTD subset (that is, the part between [ and ] at the beginning of the document, within the DOCTYPE declaration)? Does it contain more than ENTITY declarations with TEI DTD parameters and invocations of character sets?
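    One way to find out, sketched in Python: cut out the internal subset and count its declarations by keyword. Anything other than ENTITY in the result (ELEMENT, ATTLIST, ...) points to more than the usual parameter settings. The sketch assumes the subset contains no "]" followed by ">" of its own (marked sections would break it).

```python
import re

# Internal subset: the text between "[" and "]" inside the DOCTYPE declaration.
INTERNAL_SUBSET = re.compile(
    r"<!DOCTYPE[^\[>]*\[(.*?)\]\s*>", re.IGNORECASE | re.DOTALL
)
DECLARATION = re.compile(r"<!([A-Za-z]+)")


def subset_summary(text):
    """Count declarations in the internal subset by keyword; {} if there is none."""
    match = INTERNAL_SUBSET.search(text)
    if match is None:
        return {}
    summary = {}
    for keyword in DECLARATION.findall(match.group(1)):
        key = keyword.upper()
        summary[key] = summary.get(key, 0) + 1
    return summary
```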
  4. Does the sample DTD rename TEI tags?
    How to find out:
    Check DTD extension files (if available) or the DTD itself. A checker for unknown TEI tags would be nice to spot them automatically.
  5. Are there real DTD modifications? With recommended technique or by editing DTD files?
    How to find out:
    Modifications should be done in DTD extension files and documented somewhere. Even if "they" forgot to pack the extension files, these should be referred to in the DTD subset at the beginning of the document. Hand-edited DTD files (if available) could contain comments or other indications. One could try a diff of an alleged TEILite DTD against an official one (but I think there is more than one official TEI Lite DTD, and diff might find a lot of noise anyway). If the sample is parseable, one could try parsing against an official DTD and wait for errors.
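    Some of the diff noise can be avoided by comparing declarations instead of lines. The following Python sketch strips stand-alone comments, collapses whitespace and then reports only declarations that occur in one DTD but not the other. It is a rough approximation: comments inside declarations and ">" inside entity literals are not handled, and the function names are made up.

```python
import difflib
import re


def normalize_dtd(text):
    """Drop stand-alone comments and return declarations with collapsed whitespace."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    declarations = re.findall(r"<![^>]*>", text)
    return [" ".join(declaration.split()) for declaration in declarations]


def dtd_differences(official, candidate):
    """Return declarations found in only one DTD, prefixed with '- ' or '+ '."""
    diff = difflib.ndiff(normalize_dtd(official), normalize_dtd(candidate))
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

    Lines starting with "+ " exist only in the candidate DTD, lines starting with "- " only in the official one.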