Wikipedia:Historical archive/GNE project files/GNE Architecture
My (MikeWarren) current favourite name is GNE: GNE's Not an Encyclopedia! (Dan Geiser's suggestion.)
In any case, my ideas for how this should all work:
OVERVIEW
Since the goal of GNE is to keep submissions almost completely open
(barring completely obvious spam), some method of classification is
needed. Since almost everyone has suggested a different way of
classifying articles, it makes the most sense to keep the
classification information separate and allow for multiple
classifiers.
THE BACK END ARTICLE REPOSITORY
Hence, only information absolutely essential to the article should be
kept in the actual article repository. I think keeping this in XML has
some advantages: it is readily human readable; some simple semantic
hints can be included by the author if she chooses (here I mean things
like <date>Jurassic</date> or <name>Mike Warren</name>); changing the
DTD/Schema can be quite easy in many cases (unlike changing the schema
of a database). Unique IDs will need to be assigned to each article,
so that the classifiers can reference them. Anything from the really
simple (sequentially-assigned 128-bit integers) to the complicated
(MD5 or similar hashes of the content) can be employed for this
purpose.
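The two ends of that range can be sketched in a few lines. This is only an illustration: the function and class names are hypothetical, and rendering IDs as 32 hex digits is an assumption, not something decided above.

```python
import hashlib
import itertools


def content_hash_id(article_xml: bytes) -> str:
    """Derive an ID from the article bytes (MD5, as mentioned in the text)."""
    return hashlib.md5(article_xml).hexdigest()


class SequentialIds:
    """Hand out sequential integers rendered as fixed-width 128-bit hex."""

    def __init__(self, start: int = 0) -> None:
        self._counter = itertools.count(start)

    def next_id(self) -> str:
        value = next(self._counter)
        if value >= 2 ** 128:
            raise OverflowError("128-bit ID space exhausted")
        # 128 bits = 32 hex digits, zero-padded so IDs sort and shard evenly.
        return format(value, "032x")
```

One thing to keep in mind with the hash approach: any change to the content yields a different ID, so a content hash identifies one particular revision rather than the article as a whole.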
Using this method to store the articles, it seems to make sense to
just use an existing Web server like Apache to serve these articles,
which can then be accessed easily:
http://www.gne.org/article/unique-id-12345.xml
If a single directory becomes insufficient (as seems likely at some
point), then the first bits of the unique IDs can be used to make
sub-directories, and the URL re-writing ability of Apache (and
presumably other Web servers) can be used to change the above URLs
into the actual URL. This has the advantage that existing free
software is employed to implement the back-end and mirroring the data
is extremely easy (just tar it, or rsync, or FTP). It will also be
necessary to include an index of all articles which exist on the
server. This need be nothing more than a complete list of all the
unique IDs of the articles, enabling easy access by the classifiers
(see below).
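A minimal sketch of the sub-directory idea and the index, assuming IDs are strings and that "first bits" can be approximated by the first two characters of the ID. Both function names and the one-ID-per-line index format are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def sharded_path(article_id: str, prefix_len: int = 2) -> PurePosixPath:
    """Map an ID to article/<prefix>/<id>.xml, using the ID's leading characters."""
    if len(article_id) <= prefix_len:
        raise ValueError("ID is too short to shard")
    return PurePosixPath("article") / article_id[:prefix_len] / f"{article_id}.xml"


def build_index(article_ids) -> str:
    """Render the index as a sorted, de-duplicated list with one ID per line."""
    return "\n".join(sorted(set(article_ids))) + "\n"
```

The Web server's URL rewriting would then apply the same mapping in the other direction, so the public URL never has to expose the sub-directory.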
Versioning of the articles is also desired. This can be as simple
as:
http://www.gne.org/article/unique-id.version.xml
Whichever method of assigning the IDs is used, many have expressed the
need/desire to digitally sign the articles. Since the article itself
needs to be signed, it makes little sense to include the signature in
the actual XML of the article (since then it would need to be removed
to check the signature, which raises questions like, ``exactly which
bytes do I remove to take out the signature?''). The above scheme for
serving the articles fits in well here: signatures can just go in
well-known signature files:
http://www.gne.org/article/unique-id-12345.xml.asc
for the above example. If such a file doesn't exist, the article is
not signed.
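The URL conventions so far (plain, versioned, and signature) are mechanical enough to sketch. The helper names are hypothetical, and `exists` stands in for whatever HTTP check a client would really perform.

```python
from typing import Callable, Optional

BASE_URL = "http://www.gne.org/article"


def article_url(article_id: str, version: Optional[int] = None) -> str:
    """Build the URL of an article, optionally pinned to one version."""
    name = article_id if version is None else f"{article_id}.{version}"
    return f"{BASE_URL}/{name}.xml"


def signature_url(article_id: str, version: Optional[int] = None) -> str:
    """The detached signature lives next to the article, with .asc appended."""
    return article_url(article_id, version) + ".asc"


def is_signed(article_id: str, exists: Callable[[str], bool]) -> bool:
    """An article counts as signed only if its signature file exists."""
    return exists(signature_url(article_id))
```

Because the signature is a separate file, the article bytes that were signed are exactly the bytes that are served, with nothing to strip out first.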
THE CLASSIFIERS
No actual user or client program should be accessing the repository
directly, in all likelihood. A number of classification databases will
exist, which will index all the articles in the repository according
to their own criteria. One of the simplest ones (which the GNE project
itself will likely supply) will be a simple author and title
index. Every night (say) this classifier will request the article
index from one of the back-end repositories. Looking through the list
of unique IDs, it will note any IDs which are not already in its own
database. For these, it will request the XML file from the repository
and parse it, extracting whatever information is relevant to this
classification (in this example, just the author(s) and title). If
this classifier is more complicated (perhaps it's the Nupedia one),
then these new articles will be sent to mailing lists for comments
about which category each should go in (or whether it should be excluded
from the classification, by putting it in an ``ignore'' category).
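The nightly pass of the author and title classifier might look roughly like this. It is a sketch under assumptions: the element names (`title`, `author`) are invented because the DTD/Schema has not been decided, the index is taken to be one ID per line, and `fetch` is a hypothetical callable that returns an article's XML given its ID.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Set


def new_ids(index_text: str, known_ids: Set[str]) -> List[str]:
    """Return the IDs listed in the repository index that we have not seen yet."""
    listed = [line.strip() for line in index_text.splitlines() if line.strip()]
    return [article_id for article_id in listed if article_id not in known_ids]


def extract_author_title(article_xml: str) -> dict:
    """Pull out only what this classifier cares about: title and author(s)."""
    root = ET.fromstring(article_xml)
    return {
        "title": (root.findtext("title") or "").strip(),
        "authors": [a.text.strip() for a in root.iter("author") if a.text],
    }


def sync(index_text: str, known: Dict[str, dict],
         fetch: Callable[[str], str]) -> Dict[str, dict]:
    """Fetch and classify every article that is in the index but not in `known`."""
    for article_id in new_ids(index_text, set(known)):
        known[article_id] = extract_author_title(fetch(article_id))
    return known
```

A more elaborate classifier would replace `extract_author_title` with its own step, for instance queuing the new article for discussion on a mailing list.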
The classifiers can do absolutely anything imaginable, from providing
a kids-only view of the information, to a keyword search capability to
more complex things. As classifications become irrelevant or unused,
they can simply be deleted; nothing needs to change in the article
repository. If someone is dissatisfied with the current
classification schemes, they can take some base software provided by
GNE and modify it to suit their needs; classifiers based on voting of
users, voting of ``experts'', or many other schemes can be created.
These classifier systems should probably store much of their
information in a database like MySQL since they will be accessing it a
lot.
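As a rough idea of what such a database could hold for the author and title index, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table layout and function names are assumptions for illustration only.

```python
import sqlite3
from typing import List


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) the classifier's own database; the repository is untouched."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS article (
            id    TEXT PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS author (
            article_id TEXT NOT NULL REFERENCES article(id),
            name       TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS author_name ON author(name);
    """)
    return conn


def add_article(conn: sqlite3.Connection, article_id: str,
                title: str, authors: List[str]) -> None:
    """Insert or refresh one article's entry in a single transaction."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO article (id, title) VALUES (?, ?)",
                     (article_id, title))
        conn.execute("DELETE FROM author WHERE article_id = ?", (article_id,))
        conn.executemany("INSERT INTO author (article_id, name) VALUES (?, ?)",
                         [(article_id, name) for name in authors])


def titles_by_author(conn: sqlite3.Connection, name: str) -> List[str]:
    """Look up all titles by one author, sorted alphabetically."""
    rows = conn.execute(
        "SELECT a.title FROM article a JOIN author u ON u.article_id = a.id "
        "WHERE u.name = ? ORDER BY a.title", (name,))
    return [row[0] for row in rows]
```

Since everything here lives on the classifier's side, retiring a classification amounts to dropping its database.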
CONVERTERS/CLIENTS
No matter what classifier is used, the user needs to actually see the
articles. This needs conversion from the XML format into some other
suitable format, like HTML or DVI. The GNE project will produce such
converters, which will be employed by the classifiers when showing an
article to an actual user. It might also be useful (in the future) to
make clients which interact (using some known protocol) with the
classifiers and then allow the user to choose from whatever formats
the client software knows to convert to. I wouldn't suggest doing such
a thing until well after the classifier software is stable, if ever.
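A converter to HTML could start out as small as the sketch below. The element names (`title`, `author`, `para`) are assumptions, since the DTD/Schema is still open, and semantic hint tags such as `<date>` are simply flattened to their text here.

```python
import html
import xml.etree.ElementTree as ET


def article_to_html(article_xml: str) -> str:
    """Render an article as a minimal HTML fragment (heading, byline, paragraphs)."""
    root = ET.fromstring(article_xml)
    title = html.escape((root.findtext("title") or "Untitled").strip())
    authors = ", ".join(
        html.escape(a.text.strip()) for a in root.iter("author") if a.text
    )
    lines = [f"<h1>{title}</h1>"]
    if authors:
        lines.append(f'<p class="byline">{authors}</p>')
    for para in root.iter("para"):
        # itertext() walks nested hint tags, so their text is kept in place.
        text = "".join(para.itertext()).strip()
        lines.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(lines)
```

A classifier that wants its own look would wrap or restyle this fragment rather than touch the stored XML.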
DIAGRAM
So, here's what will happen:
                               +---------------------+
                               | Classifier Projects |
                               +---------------------+
                                        . . .                                    +---------+
+------+                       +---------------------+                           | Backend |
| user | <--- Web browser ---> | Dewey Decimal Sys.  | <--- fetches article ---> +---------+
+------+                       +---------------------+                           |12345.xml|
                               | Library of Congress |                           |33213.xml|
                               +---------------------+                           |77662.xml|
                               | Children-only       |                           |   ...   |
                               +---------------------+                           +---------+
                                        . . .
Web browser/GNE client         Database-driven lookup                            Directory(s) of
                               mechanism                                         XML files, with index
                                                                                 & optional signature
SOFTWARE BY GNE
The software which GNE would need to write includes:
- A PARSER which will take almost-plain-text and convert it into XML in the DTD/Schema decided upon for the back-end articles. This will include author, revision and content. It will not include a digital signature or any classification information whatsoever. The content can contain optional semantic hint tags intended for use by the classifiers.
- BASIC CLASSIFIER software, which will ease the task of writing some classifier. As a minimum, this should include options to send articles to mailing lists and some basic hierarchical classification mechanism (since many classifiers will likely be hierarchical).
- CONVERTERS to take XML content in our DTD/Schema and produce LaTeX, HTML, DVI or other formats, for use by the classifiers when serving content to users. This means no load on repository servers, and flexibility for the classifiers, which might like to change the HTML to conform to their design style.
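To make the PARSER item concrete, here is a minimal sketch. The input convention (`Title:` and `Author:` header lines, a blank line, then paragraphs separated by blank lines) and the output element names are assumptions, since neither the almost-plain-text format nor the DTD/Schema has been fixed. This version escapes any markup in the body, so it does not yet support the optional semantic hint tags.

```python
import xml.etree.ElementTree as ET


def text_to_article_xml(source: str, revision: int = 1) -> str:
    """Convert header lines plus blank-line-separated paragraphs into article XML."""
    header, _, body = source.strip().partition("\n\n")
    article = ET.Element("article", {"revision": str(revision)})
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in ("title", "author"):
            ET.SubElement(article, key).text = value.strip()
    content = ET.SubElement(article, "content")
    for block in body.split("\n\n"):
        text = " ".join(block.split())  # join wrapped lines into one paragraph
        if text:
            ET.SubElement(content, "para").text = text
    # No signature and no classification data: those live outside the article.
    return ET.tostring(article, encoding="unicode")
```

For example, a source with two `Author:` lines produces two `<author>` elements, which is the shape an author and title classifier would expect to find.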
CONCLUSION
A simple, efficient and almost completely inclusive back-end article
repository is indexed by a number of classifier projects, with which
the user interacts to get at content they're interested in. These
classifiers are limited only by the imagination, and don't affect what
is stored in the article repository; no classifier group can
``censor'' the GNE project, since another classifier group could start
up to correct the wrong (in their opinion) classifications. No
articles are rejected from the back-end repository, which is easy to
mirror, easy to maintain and presents no major load on the server
beyond its duties as a Web page server. Classifier databases can run
on any system, anywhere in the world and use any mirror of the
repository.