Metadata in document, extraction
I was recently working on a project that required several documents to be chosen and displayed. The easiest way to do this was to format the documents as HTML "snippets" and include them, as required, with PHP's echo file_get_contents($fn); statement.
This worked just fine, but I wanted to allow the user to choose which document to display. Without some sort of database holding the author, description and status of each document, all I could do was offer the documents' filenames as the choices.
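As a rough sketch of that include step: the helper name loadSnippet, the directory and the example filename below are made up for illustration and are not from the project. It only accepts a bare filename that exists directly inside the given directory, which seems a sensible precaution once users get to pick the document.

```php
<?php
// Hypothetical helper: returns the snippet's HTML, or null when $name is
// not a plain file sitting directly inside $dir.
function loadSnippet(string $dir, string $name): ?string
{
    // basename() drops directory components such as "../", so any name
    // that differs from its own basename is rejected outright.
    if ($name !== basename($name)) {
        return null;
    }

    $path = rtrim($dir, '/') . '/' . $name;
    if (!is_file($path)) {
        return null;
    }

    $html = file_get_contents($path);
    return $html === false ? null : $html;
}

// Usage (paths are placeholders):
// echo loadSnippet('/path/to/docs', 'intro.html') ?? '<p>Document not found.</p>';
```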
The answer I came up with was to format the documents like this:
<article>
<metadata author="JMP" description="Metadata in html document" status="complete"/>
<h3>Metadata in html document</h3>
<h2>The document</h2>
<p>... ... ... ...</p>
</article>
The documents themselves would be the "database": iterating through the document filenames, I could call a function to extract the metadata and build a much more informative document choice list.
Extract the metadata
To extract the metadata, I wrote this function:
<?php
/*
getMetadata($theDocument)
$theDocument string - document filename/path
returns
associative array "attributename"=>"attributevalue"
*/
function getMetadata($theDocument)
{
    $dom = new DOMDocument();
    // load() parses the file as XML, so the document needs
    // "strict" xhtml type of tags, i.e. <br/> not <br>
    $dom->load($theDocument);
    // or the slightly more "permissive"
    //$dom->loadHTMLFile($theDocument, LIBXML_NOERROR | LIBXML_NOWARNING);
    // These options keep the parser from reporting problems,
    // which would otherwise occur since <metadata> isn't a
    // "valid" html tag
    $ret = array();
    $p = $dom->getElementsByTagName('metadata')->item(0);
    // item(0) is null when the document has no <metadata> element
    if ($p === null) {
        return $ret;
    }
    foreach ($p->attributes as $attr) {
        $ret[$attr->nodeName] = $attr->nodeValue;
    }
    return $ret;
}
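For snippets that are not strict XHTML, or that may lack a <metadata> tag altogether, a more forgiving variant might look like the sketch below. The name getMetadataOrEmpty is hypothetical; it assumes the HTML parser route (loadHTMLFile) and uses libxml_use_internal_errors() to keep parser complaints about the unknown tag out of the output.

```php
<?php
// Hypothetical variant of getMetadata(): returns an empty array instead of
// failing when the file is missing, unparsable, or has no <metadata> element.
function getMetadataOrEmpty(string $theDocument): array
{
    // Collect libxml errors internally so unknown tags such as <metadata>
    // do not surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = is_file($theDocument) && $dom->loadHTMLFile($theDocument);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    $node = $dom->getElementsByTagName('metadata')->item(0);
    if ($node === null) {
        return [];
    }

    $ret = [];
    foreach ($node->attributes as $attr) {
        $ret[$attr->nodeName] = $attr->nodeValue;
    }
    return $ret;
}
```

Returning an empty array keeps the calling loop simple: a document without metadata can be skipped or listed by filename rather than stopping the page.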
This allows me to do something like this:
$files = array_diff(scandir('/path/to/docs'), array('..', '.'));
printf('<select>');
foreach ($files as $file) {
    $metadata = getMetadata('/path/to/docs/'.$file);
    printf('<option value="%s">%s - %s (%s)</option>',
        $file, $metadata['author'], $metadata['description'], $metadata['status']);
}
printf('</select>');
which makes each document's select option display as AUTHOR - Description (status), a much better UI, IMHO.
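Since the metadata values come straight out of the files and end up inside HTML, it is probably worth escaping them on the way out. A small sketch, assuming a hypothetical helper named renderOption and made-up fallback values for missing attributes:

```php
<?php
// Hypothetical helper: builds one <option> element from a filename and its
// metadata array, escaping every value and tolerating missing keys.
function renderOption(string $file, array $metadata): string
{
    $esc = static fn(string $s): string => htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

    return sprintf(
        '<option value="%s">%s - %s (%s)</option>',
        $esc($file),
        $esc($metadata['author'] ?? 'unknown'),      // fallback is an assumption
        $esc($metadata['description'] ?? $file),     // fall back to the filename
        $esc($metadata['status'] ?? 'unknown')
    );
}

// Example with characters that need escaping:
echo renderOption('notes.html', [
    'author'      => 'JMP',
    'description' => 'Tips & "tricks"',
    'status'      => 'draft',
]), "\n";
// prints: <option value="notes.html">JMP - Tips &amp; &quot;tricks&quot; (draft)</option>
```

The call inside the foreach loop would then be echo renderOption($file, $metadata); in place of the printf.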
There is, also, something rather pleasing about the documents being completely self-contained. There is no separate database maintenance, since the documents are the database. The author(s) can simply create, update, delete the documents as required.