
Wikipedia:Parallelize Architecture


Consider the Google architecture: indices and documents are replicated across many servers, and the replication process is itself parallelized. It would require a lot of coding, but it's scalable.
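To make that concrete, here is a minimal sketch of parallelized replication: one index segment is pushed to every replica concurrently instead of server by server. The thread-pool size, host list, and copyTo method are illustrative assumptions, not drawn from Google's or Wikipedia's actual code.

 import java.util.List;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 // Hypothetical sketch: copy one index segment to all replicas in parallel.
 public class ParallelReplicator {
     private final ExecutorService pool = Executors.newFixedThreadPool(8);

     public void replicate(byte[] segment, List<String> replicaHosts) {
         for (String host : replicaHosts) {
             pool.submit(() -> copyTo(host, segment)); // each copy runs concurrently
         }
     }

     private void copyTo(String host, byte[] segment) {
         // Placeholder: open a connection to `host` and stream the segment bytes.
     }
 }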

That is also compatible with creating a user-hostable application for distributing the load across the machines of high-bandwidth power users and volunteers. In that scenario, a central index knows several remote locations where the data for the latest version of a wiki node is stored. Call each remote location a data peer. When a request is made via the Web interface, the server consults its index and immediately forwards the request to multiple data peers (for redundancy), each of which attempts to format a response page as quickly as possible. The page from the first-responding peer is then piped through the central server to the waiting client.
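A sketch of that fan-out step, assuming a hypothetical DataPeer interface for the remote call: ExecutorService.invokeAny blocks until the first task completes successfully and cancels the rest, which is exactly the first-responder behavior described above.

 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 // Hypothetical sketch of the central server's fan-out. The index maps a
 // node name to the data peers currently holding its latest version.
 public class FanOutDispatcher {
     private final ExecutorService pool = Executors.newCachedThreadPool();

     public String fetchPage(String nodeName, List<DataPeer> peers) throws Exception {
         // Start one task per peer; all run in parallel, for redundancy.
         List<Callable<String>> tasks = new ArrayList<>();
         for (DataPeer peer : peers) {
             tasks.add(() -> peer.formatPage(nodeName));
         }
         // Blocks until the first peer responds, then cancels the others.
         return pool.invokeAny(tasks);
     }

     interface DataPeer {
         String formatPage(String nodeName) throws Exception; // a remote call in practice
     }
 }

Piping the winner's output through the central server keeps the client-facing URL stable even as individual data peers come and go.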

At first glance, it looks like that distributed variant would dramatically increase bandwidth requirements at the central server. Bandwidth would be consumed by replicating data nodes onto new data peers, replicating updated nodes to existing data peers, and piping formatted responses from data peers through the central server. But that neglects the possibility of the remote data peers also serving the media components (such as images and sounds) featured on the requested page. If that were implemented, there would be a bandwidth breakeven point as mature pages become more image- and sound-rich, and the breakeven point would be reached sooner if propagated node updates were diffs rather than full copies (see the sketch below). A minor drawback of serving media items from (relatively less available and less reliable) data peers is that direct client links to individual media items would eventually break.
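To illustrate the diff idea, here is a sketch assuming a hypothetical line-based edit format, not an actual wiki storage format: the central server ships a few small edits per update, and each data peer patches its local copy instead of receiving the full node text again.

 import java.util.ArrayList;
 import java.util.List;

 // Hypothetical line-based diff: each edit deletes `deleteCount` lines starting
 // at `startLine` and inserts `newLines` in their place. A handful of these per
 // update costs far less bandwidth than re-sending the whole node.
 public class NodePatcher {
     public record Edit(int startLine, int deleteCount, List<String> newLines) {}

     // Assumes edits are sorted by startLine and do not overlap.
     public static List<String> apply(List<String> node, List<Edit> edits) {
         List<String> patched = new ArrayList<>(node);
         // Apply from the bottom up so earlier line numbers stay valid.
         for (int i = edits.size() - 1; i >= 0; i--) {
             Edit e = edits.get(i);
             for (int j = 0; j < e.deleteCount(); j++) {
                 patched.remove(e.startLine());
             }
             patched.addAll(e.startLine(), e.newLines());
         }
         return patched;
     }
 }

An update touching three lines of a long article would then cost a few hundred bytes rather than the article's full text.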

Two options for implementing any variant are an open-source project and a set of master's theses. In either case, you're likely to end up with a lot of spaghetti code unless you come up with some good architecture and coding standards. Universities seem like a better bet, since they'll have clusters of machines available for testing. For the distributed variant, you'd probably want to implement the data peer on the J2EE platform (Java 2 Enterprise Edition) in order to accommodate a variety of client OSs. With the NIO library (introduced in J2SE 1.4), a "Pure Java" implementation would have good performance, but the data peer would undoubtedly have a large memory footprint.
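For a sense of what that would look like, here is a minimal non-blocking server loop using java.nio; the port number and buffer size are arbitrary, and the actual request handling is elided.

 import java.io.IOException;
 import java.net.InetSocketAddress;
 import java.nio.ByteBuffer;
 import java.nio.channels.SelectionKey;
 import java.nio.channels.Selector;
 import java.nio.channels.ServerSocketChannel;
 import java.nio.channels.SocketChannel;
 import java.util.Iterator;

 // Minimal non-blocking server loop with java.nio: one thread multiplexes
 // many connections instead of dedicating a blocking thread to each.
 public class PeerServer {
     public static void main(String[] args) throws IOException {
         Selector selector = Selector.open();
         ServerSocketChannel server = ServerSocketChannel.open();
         server.socket().bind(new InetSocketAddress(8080));
         server.configureBlocking(false);
         server.register(selector, SelectionKey.OP_ACCEPT);

         while (true) {
             selector.select(); // block until some channel is ready
             Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
             while (keys.hasNext()) {
                 SelectionKey key = keys.next();
                 keys.remove();
                 if (key.isAcceptable()) {
                     SocketChannel client = server.accept();
                     client.configureBlocking(false);
                     client.register(selector, SelectionKey.OP_READ);
                 } else if (key.isReadable()) {
                     SocketChannel client = (SocketChannel) key.channel();
                     ByteBuffer buf = ByteBuffer.allocate(4096);
                     if (client.read(buf) < 0) {
                         client.close(); // peer hung up
                     }
                     // Real code would parse the request and format a page here.
                 }
             }
         }
     }
 }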