Wikipedia:Historical archive/GNE project files/GNE Architecture

My (MikeWarren) current favourite name is GNE: GNE's Not an Encyclopedia! (Dan Geiser's suggestion.)

In any case, my ideas for how this should all work:

OVERVIEW

Since the goal of GNE is to keep submissions almost completely open
(barring completely obvious spam), some method of classification is
needed. Since almost everyone has suggested a different way of
classifying articles, it makes the most sense to keep the
classification information separate and allow for multiple
classifiers.

THE BACK END ARTICLE REPOSITORY

Hence, only information absolutely essential to the article should be
kept in the actual article repository. I think keeping this in XML has
some advantages: it is readily human readable; some simple semantic
hints can be included by the author if she chooses (here I mean things
like <date>Jurassic</date> or <name>Mike Warren</name>); changing the
DTD/Schema can be quite easy in many cases (unlike changing the schema
of a database). Unique IDs will need to be assigned to each article,
so that the classifiers can reference them. Anything from the really
simple (sequentially-assigned 128-bit integers) to the complicated
(MD5 or similar hashes of the content) can be employed for this
purpose.
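
To make the two ends of that range concrete, here is a rough Python
sketch. The function names are made up for illustration and nothing
here is decided; it only shows that both approaches yield a 128-bit
value that can be written as 32 hex digits.

```python
import hashlib
import itertools

# Hypothetical helpers; the proposal does not fix either scheme.
_counter = itertools.count(1)


def sequential_id() -> str:
    """Return the next sequentially assigned ID as 32 hex digits (128 bits)."""
    return format(next(_counter), "032x")


def content_id(article_xml: bytes) -> str:
    """Return the MD5 digest of the article bytes (also 128 bits, 32 hex digits)."""
    return hashlib.md5(article_xml).hexdigest()


article = (
    b"<article><name>Mike Warren</name> wrote this in the "
    b"<date>Jurassic</date>.</article>"
)
print(sequential_id())
print(content_id(article))
```

One trade-off worth noting: a content hash changes whenever the
article text changes, while a sequential ID stays the same across
revisions.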

Using this method to store the articles, it seems to make sense to
just use an existing Web server like Apache to serve these articles,
which can then be accessed easily:

   http://www.gne.org/article/unique-id-12345.xml

If a single directory becomes insufficient (as seems likely at some
point), then the first bits of the unique IDs can be used to make
sub-directories, and the URL re-writing ability of Apache (and
presumably other Web servers) can be used to change the above URLs
into the actual URL. This has the advantage that existing free
software is employed to implement the back-end, and mirroring the data
is extremely easy (just tar it, or rsync, or FTP). It will also be
necessary to include an index of all articles which exist on the
server. This need be nothing more than a complete list of all the
unique IDs of the articles, enabling easy access by the classifiers
(see below).
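
The sub-directory idea and the index could look roughly like the
sketch below. It assumes hex IDs (so each character is four bits), two
directory levels of two characters each, and an index file with one ID
per line; all three are my assumptions for illustration, and
shard_path and read_index are hypothetical names.

```python
from typing import List


def shard_path(unique_id: str, levels: int = 2, width: int = 2) -> str:
    """Map an ID to a nested path, using its leading characters as directories."""
    parts = [unique_id[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [unique_id + ".xml"])


def read_index(index_text: str) -> List[str]:
    """Parse an index that is simply one unique ID per line."""
    return [line.strip() for line in index_text.splitlines() if line.strip()]


print(shard_path("d41d8cd98f00b204e9800998ecf8427e"))
print(read_index("12345\n33213\n77662\n"))
```

The Web server's URL re-writing would perform the same mapping, so the
public URL can stay flat while the files are spread over directories.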

Versioning of the articles is also desired. This can be done
simply:

   http://www.gne.org/article/unique-id.version.xml

Whichever method of assigning the IDs is used, many have expressed the
need/desire to digitally sign the articles. Since the article itself
needs to be signed, it makes little sense to include the signature in
the actual XML of the article (since then it would need to be removed
to check the signature, which raises questions like, ``exactly which
bytes do I remove to take out the signature?''). The above scheme for
serving the articles fits in well here: signatures can just go in
well-known signature files:

   http://www.gne.org/article/unique-id-12345.xml.asc

for the above example. If such a file doesn't exist, the article is
not signed.
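
On a local mirror, the ``is this article signed?'' check is then just
a file lookup. A minimal sketch, with hypothetical function names;
actually verifying the signature would be left to an OpenPGP tool and
is not shown here.

```python
from pathlib import Path
from typing import Optional


def signature_path(article_path: Path) -> Path:
    """The detached signature sits next to the article, with '.asc' appended."""
    return article_path.with_name(article_path.name + ".asc")


def find_signature(article_path: Path) -> Optional[Path]:
    """Return the signature file if it exists; None means the article is unsigned."""
    sig = signature_path(article_path)
    return sig if sig.is_file() else None


print(signature_path(Path("article/unique-id-12345.xml")))
```

Because the signature is detached, the article bytes being checked are
exactly the bytes served, and nothing has to be cut out first.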

THE CLASSIFIERS

No actual user or client program should be accessing the repository
directly, in all likelihood. A number of classification databases will
exist, which will index all the articles in the repository according
to their own criteria. One of the simplest ones (which the GNE project
itself will likely supply) will be a simple author and title
index. Every night (say) this classifier will request the article
index from one of the back-end repositories. Looking through the list
of unique IDs, it will note any IDs which are not already in its own
database. For these, it will request the XML file from the repository
and parse it, extracting whatever information is relevant to this
classification (in this example, just the author(s) and title). If
this classifier is more complicated (perhaps it's the Nupedia one),
then these new articles will be sent to mailing lists for comments
about which category they should go in (or whether they should be
excluded from the classification, by putting them in an ``ignore''
category).
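
The nightly run of the author and title classifier could be sketched
as below. The element names (article, title, author, content) are
placeholders, since the DTD/Schema is not decided, and the repository
is stood in for by a dictionary so the example runs without a network.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List, Tuple

Entry = Tuple[List[str], str]  # (authors, title)


def new_ids(index_ids: Iterable[str], known_ids: Iterable[str]) -> List[str]:
    """IDs listed in the repository index that the classifier has not seen yet."""
    known = set(known_ids)
    return [i for i in index_ids if i not in known]


def extract_author_title(article_xml: str) -> Entry:
    """Pull out only what this classifier cares about: author(s) and title."""
    root = ET.fromstring(article_xml)
    authors = [(a.text or "").strip() for a in root.iter("author")]
    title = (root.findtext("title") or "").strip()
    return authors, title


def sync(index_ids: Iterable[str], fetch: Callable[[str], str],
         db: Dict[str, Entry]) -> None:
    """Fetch and index every article that is new since the last run."""
    for article_id in new_ids(index_ids, db):
        db[article_id] = extract_author_title(fetch(article_id))


# Stand-in for a back-end repository: ID -> article XML.
repo = {
    "12345": "<article><title>Ferns</title><author>Mike Warren</author>"
             "<content>...</content></article>",
}
db: Dict[str, Entry] = {}
sync(repo.keys(), repo.__getitem__, db)
print(db)
```

A real classifier would pass a fetch function that requests the
article URL from whichever mirror it prefers; the rest stays the same.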

The classifiers can do absolutely anything imaginable, from providing
a kids-only view of the information, to a keyword search capability, to
more complex things. As classifications become irrelevant or unused,
they can simply be deleted; nothing needs to change in the article
repository. If someone is dissatisfied with the current
classification schemes, they can take some base software provided by
GNE and modify it to suit their needs; classifiers based on voting of
users, voting of ``experts'' or many other schemes can be created.

These classifier systems should probably store much of their
information in a database like MySQL since they will be accessing it a
lot.
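
As an illustration of what such a store might hold for the author and
title index, here is a small sketch. It uses Python's built-in sqlite3
module purely as a stand-in so the example runs anywhere; the table
and column names are invented, and the same layout would carry over to
MySQL.

```python
import sqlite3
from typing import List

# In-memory database as a stand-in for the classifier's real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (article_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE authors (article_id TEXT NOT NULL, author TEXT NOT NULL)"
)


def add_entry(conn: sqlite3.Connection, article_id: str, title: str,
              authors: List[str]) -> None:
    """Record one article's title and authors in a single transaction."""
    with conn:
        conn.execute("INSERT INTO entries VALUES (?, ?)", (article_id, title))
        conn.executemany(
            "INSERT INTO authors VALUES (?, ?)",
            [(article_id, a) for a in authors],
        )


def ids_by_author(conn: sqlite3.Connection, author: str) -> List[str]:
    """Look up the IDs of all articles by a given author."""
    rows = conn.execute(
        "SELECT article_id FROM authors WHERE author = ? ORDER BY article_id",
        (author,),
    )
    return [r[0] for r in rows]


add_entry(conn, "12345", "Ferns", ["Mike Warren"])
print(ids_by_author(conn, "Mike Warren"))
```

Note that only IDs and classification data live here; the article text
itself stays in the repository.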

CONVERTERS/CLIENTS

No matter what classifier is used, the user needs to actually see the
articles. This needs conversion from the XML format into some other
suitable format, like HTML or DVI. The GNE project will produce such
converters, which will be employed by the classifiers when showing an
article to an actual user. It might also be useful (in the future) to
make clients which interact (using some known protocol) with the
classifiers and then allow the user to choose from whatever formats
the client software knows to convert to. I wouldn't suggest doing such
a thing until well after the classifier software is stable, if ever.
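
A converter to HTML could start out as small as the following sketch.
Again the element names are placeholders for whatever DTD/Schema is
chosen, and the page layout is deliberately bare; a classifier would
wrap the result in its own design.

```python
import xml.etree.ElementTree as ET
from html import escape


def article_to_html(article_xml: str) -> str:
    """Render an article as a bare HTML page (hypothetical element names)."""
    root = ET.fromstring(article_xml)
    title = escape(root.findtext("title") or "")
    authors = ", ".join(escape(a.text or "") for a in root.iter("author"))
    content = root.find("content")
    # itertext() flattens semantic hint tags such as <date> into plain text.
    body = escape("".join(content.itertext())) if content is not None else ""
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1><p>By {authors}</p><p>{body}</p></body></html>"
    )


print(article_to_html(
    "<article><title>Ferns</title><author>Mike Warren</author>"
    "<content>Seen since the <date>Jurassic</date>.</content></article>"
))
```

This version simply drops the semantic hints when rendering; a fancier
converter might turn them into links or styled spans instead.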

DIAGRAM

So, here's what will happen:




                                 +---------------------+
                                 | Classifier Projects |
                                 +---------------------+
                                          . . .                                    +---------+
  +------+                       +---------------------+                           | Backend |
  | user | <--- Web browser ---> | Dewey Decimal Sys.  | <--- fetches article ---> +---------+
  +------+                       +---------------------+                           |12345.xml|
                                 | Library of Congress |                           |33213.xml|
                                 +---------------------+                           |77662.xml|
                                 | Children-only       |                           |  ...    |
                                 +---------------------+                           +---------+
                                          . . .

  Web browser/GNE client          Database-driven lookup                      Directory(s) of
                                       mechanism                              XML files, with
                                                                              index & optional
                                                                              signature

SOFTWARE BY GNE

The software which GNE would need to write includes:

A PARSER which will take almost-plain-text and convert it into XML in the DTD/Schema decided upon for the back-end articles. This will include author, revision and content. It will not include a digital signature or any classification information whatsoever. The content can contain optional semantic hint tags intended for use by the classifiers.

BASIC CLASSIFIER software, which will ease the task of writing some classifier. As a minimum, this should include options to send articles to mailing lists and some basic hierarchical classification mechanism (since many classifiers will likely be hierarchical).

CONVERTERS to take XML content in our DTD/Schema and produce LaTeX, HTML, DVI or other formats, for use by the classifiers when serving content to users. This means no load on repository servers, and flexibility for the classifiers, which might like to change the HTML to conform to their design style.
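
To give a feel for the PARSER item in the list above, here is a very
rough sketch. The input format (a few ``Key: value'' header lines, a
blank line, then the text) and the output element names are both
invented for the example; the real ones depend on the DTD/Schema that
gets decided upon.

```python
import xml.etree.ElementTree as ET

# Header keys this hypothetical format accepts; anything else is rejected.
ALLOWED_KEYS = ("title", "author", "revision")


def text_to_xml(source: str) -> str:
    """Convert 'Key: value' headers plus a body into a small XML article."""
    header, _, body = source.partition("\n\n")
    article = ET.Element("article")
    for line in header.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown header line: {line!r}")
        ET.SubElement(article, key).text = value.strip()
    # The body is stored as plain text, so any markup in it gets escaped.
    ET.SubElement(article, "content").text = body.strip()
    return ET.tostring(article, encoding="unicode")


print(text_to_xml(
    "Title: Ferns\nAuthor: Mike Warren\nRevision: 1\n\nFerns are very old."
))
```

This sketch does not yet pass semantic hint tags through (they would
be escaped like any other text), and it writes no signature or
classification data, in keeping with the description above.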

CONCLUSION

A simple, efficient and almost completely inclusive back-end article
repository is indexed by a number of classifier projects, with which
the user interacts to get at content they're interested in. These
classifiers are limited only by the imagination, and don't affect what
is stored in the article repository; no classifier group can
``censor'' the GNE project, since another classifier group could start
up to correct the wrong (in their opinion) classifications. No
articles are rejected from the back-end repository, which is easy to
mirror, easy to maintain and presents no major load on the server
beyond its duties as a Web page server. Classifier databases can run
on any system, anywhere in the world, and use any mirror of the
repository.