Wikipedia:PHP script tech talk

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Magnus Manske (talk | contribs) at 16:06, 7 February 2002 (removed splitting specialPages.php; already did it;)). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This is the place to discuss bug fixes and planned features on a more "technical" level. Please do not just add bugs and feature requests here at random.

Serious bugs

Things that should be repaired ASAP.

#REDIRECTs that end in an eternal edit conflict

I'm probably stating the obvious here, but this appears to be caused by the page trying to redirect to a different page than has been specified. If this page doesn't exist (as is usually the case) then the page goes into edit mode when you click Save, which gives an edit conflict. If the page does exist, then no edit conflict occurs, but the redirect does not go to the expected place (as for Zundark/Old_Talk, which is redirected to user:Zundark but actually ends up at user:Zundark/Old_Talk). --Zundark, 2002 Feb 3
(2002/02/03 20:43 PST) Fix is in CVS for the problem when the redirected page does exist. (s/$this->$subPageTitle/$this->subPageTitle/) Doesn't seem to have fixed the doesn't-exist problem, I'll look at it some more. --Brion VIBBER

Change password does not work

For more details, see Wikipedia:Bug Reports. Until this is fixed any users who change their password cannot log in (like me). --User:Chuck Smith

Strange. It works fine on my local copy. Tried to log in with your old password? --Magnus Manske

Volunteers wanted

These tasks need volunteers to hack 'em!

Mask minor edits on Recent Changes

Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske

Fix the Recent Changes "(# changes)" counter

Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske

  • I just looked in CVS (that's 2002/2/4 00:02 Amsterdam time) and it seems you still add a variable $addoriginal to the count. But I think that is silly because you should never count the current page if you are counting the changes. So just remove $addoriginal and the problem is solved. -- Jan Hidders (PS. Wouldn't it be nice if the sign-shortcut ~ ~ ~ would always be replaced with name and time? :-))
Hey, that's a good idea! Especially for the bug-report pages... Brion VIBBER (2002/2/4 15:18 PST)
Yes, unfortunately it's the only thing that made sense in my remark. :-/ What I should have said was the following. The variable $addoriginal should be 0 if the page did not already exist the previous day and current page is a minor edit and the user does not want to see minor edits. -- Jan Hidders (2001/2/5 8:45 GMT+1)
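The corrected rule for $addoriginal can be written down as a small truth function. This is only a sketch of the condition as Jan states it, in Python rather than the script's PHP; the function and parameter names are made up for illustration, and the value 1 for all other cases is an assumption, since the discussion only says when it should be 0.

```python
def add_original(existed_previous_day: bool,
                 current_is_minor: bool,
                 hide_minor_edits: bool) -> int:
    """Extra amount added to the "(# changes)" counter for a page.

    Mirrors the rule from the discussion: the value is 0 when the page
    did not exist the previous day, the current revision is a minor edit,
    and the user has chosen not to see minor edits.
    """
    if not existed_previous_day and current_is_minor and hide_minor_edits:
        return 0
    # Assumption: every other combination counts the original revision once.
    return 1
```

For example, a brand-new page whose only revision is a minor edit contributes nothing for a user who hides minor edits, while the same page contributes 1 for a user who shows them.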

Fixing some parser bugs

Especially the <pre> tags.
I've replaced removeHTMLtags() with behavior more like the old usemod version; i.e., instead of forbidding a few tags, it allows only a small number. Thus, no <span>, <object> etc. However it still needs to be able to strip out unknown elements/parameters; I can still write naughty things like this. (2002/02/04 20:48 PST) --Brion VIBBER
Also, I commented out the line in subParseContents that makes &amp; followed by text that could be an entity into the entity. I suspect it was put in to fix pages that were getting over-escaped during editing, but that bug seems to be gone now and it just makes it hard to write the name of an entity. I.e., "&amp;" should *not* appear as just an ampersand, but an ampersand followed by "amp;". --BV
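A whitelist filter of the kind described above might look roughly like the following. It is a Python sketch, not the actual removeHTMLtags() code: the set of allowed tags is hypothetical (the real list is not given here), and dropping every attribute is one possible answer to the "unknown parameters" problem, not necessarily the one that was adopted.

```python
import html
import re

# Hypothetical whitelist; the real set of permitted tags is not stated in the discussion.
ALLOWED_TAGS = {"b", "i", "u", "pre", "br", "p"}

# Matches anything that looks like a tag: optional slash, a name, then the rest.
TAG_RE = re.compile(r"<(/?)([^\s<>/]+)([^<>]*)>")


def remove_html_tags(text: str) -> str:
    """Keep whitelisted tags (without attributes); show all other tags as literal text."""
    def fix(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2).lower()
        if name in ALLOWED_TAGS:
            # Attributes are discarded, so style= or onclick= cannot sneak through.
            return f"<{closing}{name}>"
        # Unknown element: escape it so the browser displays it instead of interpreting it.
        return html.escape(match.group(0))

    return TAG_RE.sub(fix, text)
```

With this sketch, `<b onclick="x()">hi</b>` comes out as `<b>hi</b>`, and `<span>` is shown as literal text. Ampersands outside of tags are left untouched, which matches the point about not rewriting entity names.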

Brainstorming

Ideas for solutions needed here.

  • Caching of pages for reading only (Jimbo's idea).
    • Could be tricky. Would have to adapt to viewing preferences and newly created pages (red/blue links).
It may be possible to cache a 'common' almost-final version, which can then have a regexp run over it to set the link color and paragraph justification, and then inserted into the header/footer; this would at least save parsing the wiki page every time. Still need to deal with new pages though... Simplest way might be to run the "which pages link to this" check on a newly created page and expire the cached versions of anything that does. --Brion VIBBER
A lot of the regexp work could be skipped by using a quick preprocessor (ideally one that slaps in the header/footer without even looking at the text) and/or CSS. --Uriyan
Yup. CSS won't handle the difference between red links and [classic links]? for new pages, though. I recommend we change or eliminate one or the other. --Brion VIBBER

I take that back, CSS should do fine there. How does this sound:

   This is a <span class="newlinkedge">[</span><a href="foo" class="newlink">new
   link</a><span class="newlinkedge">]<a href="foo">?</a></span>.

where we define either:

   a.newlink { color: red; }
   .newlinkedge { display: none; }

or

   a.newlink { color: black; text-decoration: none; }
   .newlinkedge { }

in the style sheet? The text portion will still be clickable in the old-style case, though that could probably be "fixed" if desired. --Brion VIBBER

(2002/02/03 15:05 PST) I've changed the CVS version to use style sheets for the link colors, paragraph justification, and text/background color. (Try it at my test server, if you can find your way around the partially Esperanto-localized interface.) Keeps down the number of things that need to be changed if somebody wants to change the styles further, and should make the HTML-ized page guts cacheable.

Caching proposal

  1. Create a new field in the cur table named cur_cache (MEDIUMTEXT), empty by default
  2. When a page is saved after edit, the cache is cleared
  3. When a page is viewed,
    1. and the cache has been used X times, it is cleared (enforced up-to-date)
    2. and the cache entry contains text, the cache is adapted to current user settings and displayed
    3. and the cache is empty, the text is rendered, displayed and stored in the cache field
  4. When a new page is created, the cache of all pages that link to the new page is cleared
  5. Pages with {{...}} variables are not cached

--Magnus Manske
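To make the proposal concrete, here is a minimal in-memory sketch of steps 1 to 4 in Python (not the PHP of the actual script). Everything in it is an assumption for illustration: the `Page` and `Wiki` classes, the `render` and `adapt` callbacks, and the value chosen for X. Step 5 is left out.

```python
from dataclasses import dataclass, field

MAX_CACHE_USES = 100  # the "X" of step 3.1; the value is arbitrary


@dataclass
class Page:
    text: str
    cache: str = ""          # stands in for the proposed cur_cache field (step 1)
    cache_uses: int = 0
    links_to: set = field(default_factory=set)


class Wiki:
    def __init__(self, render):
        self.pages = {}
        self.render = render  # hypothetical renderer: wiki text -> HTML

    def save(self, title, text, links_to=()):
        is_new = title not in self.pages
        # Step 2: a saved page starts with an empty cache.
        self.pages[title] = Page(text=text, links_to=set(links_to))
        if is_new:
            # Step 4: expire every page that links to the newly created one.
            for other in self.pages.values():
                if title in other.links_to:
                    other.cache, other.cache_uses = "", 0

    def view(self, title, adapt=lambda cached: cached):
        page = self.pages[title]
        if page.cache and page.cache_uses >= MAX_CACHE_USES:
            # Step 3.1: enforce a refresh after X uses.
            page.cache, page.cache_uses = "", 0
        if not page.cache:
            # Step 3.3: render and store.
            page.cache = self.render(page.text)
        else:
            page.cache_uses += 1
        # Step 3.2: adapt the cached text to the current user's settings.
        return adapt(page.cache)
```

Viewing a page twice renders it once; creating a page that it links to forces one more render on the next view.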

Why not simply update the cache field every time the page is edited? You have to parse the page then anyway because it is presented after the edit. -- Jan Hidders PS. I know I'm getting annoying but can I say again that we first should measure which pages are eating up the server CPU time? Otherwise the implementation of caching might be a waste of time and effort that unnecessarily complicates the code.
On viewing an uncached page, the contents are rendered anyway. The result can be slightly altered and stored as the cache. Generating the cache upon saving means it will have to be rendered especially for that purpose, thus wasting resources. Also, when the cache is flushed, the page won't be cached again until after the next edit.
That said, we should of course check the special pages and improve their speed. I already (kinda) cached the Most Wanted. The other candidate is Orphans, but I don't know how to cache that; anyway, with de-orphanising progressing, the orphans list will get shorter, and the popularity of that page might drop.
Also, the Main Page has to run a database request each time it is viewed, to keep an up-to-date article count.
Editing this page, connected with 10MBit to the Internet, Wikipedia performs quite well; I know this will change once the US goes online ;) --Magnus Manske
I must be misunderstanding something. After submitting I am always redirected to the page I just edited, right? So are you now suggesting that you would allow that I would then see an older cached version without the changes I just made? Isn't that a bit confusing for the writer? -- Jan Hidders
I suspect the sane behavior would be to clear the cache field when a page is saved (and the cache fields of any pages that link to it, if it's a new page). Then, when the page is loaded up again (for the edited page, that would be immediately), the empty cache is noticed and the page is rerendered and stored.
Yes, that's what I meant. Sorry for being unclear. I added a line to the proposal. --Magnus Manske
But why then do you need the rule for "enforce up-to-date"? -- Jan Hidders
Just a safety mechanism to ensure every page is updated once in a while. Might not be necessary, but it won't do much harm if set to a high value. --Magnus Manske
As far as the mainpage... how many pages actually use those {{Blahblah}} things? Should we not cache pages that contain them? -- Brion VIBBER
No other pages have these, AFAIK. We could count the number of "{{" occurrences before and after variable replacement, and if it is unchanged, the page can be cached, otherwise not. --Magnus Manske
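The "{{" counting check could be sketched like this. It is Python for illustration only; the variable name ARTICLECOUNT and the `replace_variables` helper are hypothetical and not taken from the script.

```python
def replace_variables(text: str, variables: dict) -> str:
    """Substitute each known {{NAME}} with its current value."""
    for name, value in variables.items():
        text = text.replace("{{" + name + "}}", str(value))
    return text


def is_cacheable(text: str, variables: dict) -> bool:
    """A page is cacheable if variable replacement left the number of "{{" unchanged."""
    before = text.count("{{")
    after = replace_variables(text, variables).count("{{")
    return before == after
```

A page containing `{{ARTICLECOUNT}}` loses one "{{" during replacement and is therefore not cached, while a page with no known variables (or only unknown ones) keeps its count and can be cached. The check would be fooled if a replacement value itself contained "{{".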

  • Database operations & efficiency
    • I notice there are a lot of mysql_connect()/mysql_close() pairs in the code; depending on the page loaded, the database can be opened and closed from 5 to 12 times. This seems excessive to me... mysql_connect() doesn't open a new connection if one is left open, and the connection is automatically closed when the script finishes. Surely the overhead of opening and closing multiple connections is worse than the overhead of having one connection open for the whole fraction of a second that it takes to run the several database operations needed by each page? Taking them out doesn't seem to affect performance significantly on my test machine, but I have a small database and I'm the only one using it. (Indeed, it may even make sense to use persistent connections.) --Brion VIBBER (2002/02/06 02:37 PST)
      • I can't access the CVS from here. Why don't you comment out the mysql_close() lines, and see what happens when Jimbo actually updates the running version (in a month or so;). If that works out, we could try the persistent connections, if not, we just remove the # again. --Magnus Manske
        • Okay, done. Guess we'll see what happens... mwoooh haa haa haa... Also, turns out I missed a lot of them in my initial count; revise that to "the database can be opened and closed from 10 to 21 times" per page. Brion VIBBER
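The idea of opening the database once per page request and reusing that connection can be illustrated as follows. This is a Python sketch using the standard sqlite3 module as a stand-in for MySQL; the `Database` class and its `opened` counter are invented for the example and do not correspond to anything in the script.

```python
import sqlite3


class Database:
    """Opens the connection lazily, once, and reuses it for every query."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.connection = None
        self.opened = 0  # instrumentation for the example only

    def connect(self):
        if self.connection is None:
            self.connection = sqlite3.connect(self.path)
            self.opened += 1
        return self.connection

    def query(self, sql, params=()):
        # No connect/close pair per query: the existing connection is reused.
        return self.connect().execute(sql, params).fetchall()

    def close(self):
        # Called once, at the end of the request.
        if self.connection is not None:
            self.connection.close()
            self.connection = None
```

However many queries a page needs, `opened` stays at 1 until `close()` is called.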

  • Browser-specific page layout
I notice though, that the tables in the page layout are setting their border properties based on whether the user agent is Internet Explorer. This explains why the tables have thin black borders in Internet Explorer and no borders in Mozilla... Magnus, is there any reason for this? I'd prefer to replace the table with some CSS markup in any case. --Brion VIBBER
The reason is that I like the thin black lines in IE, but other browsers don't support that, they draw all lines black, which looks ugly (try it!). If you know how to change it, go right ahead :) --Magnus Manske
Done, checked in. Looks ever so slightly different in IE and Mozilla, but approximately the same as the previous behavior in IE. Also looks okay in Konqueror 2.2.1, ugly but visible in Opera 6 (some beta version I have), but still doesn't show in Netscape 4.x. Brion VIBBER 2002/02/04 15:21 PST
  • Optimize slow code parts (where? why?)
    • Do we actually know what the slow parts are? My gut feeling is that the Recent Changes page is the slowest, but it would be nice if we could do some actual measurements on a server that is serving only one client but has a large database. (Does somebody have a big SQL dump?) Anyway, presuming that it is, I looked at the code and I think it can be made much more efficient by combining the two SQL queries into one. Right now it computes a JOIN and a GROUP BY in PHP which can be done far more efficiently by the database. However, it should then be possible to do a GROUP BY on the day, which is now hidden in the time stamp. So we would split this column into a day column and a time column. Do I have your permission to attempt this? (But first I would like to know if it is worth it, i.e., if the Recent Changes page plays a major part in the slowing down. I thought Magnus said something about a memory leak in Apache, so perhaps we should try to find that first.) -- Jan Hidders
      • My (old) suggestion for Recent Changes optimization was to make it a separate table containing the last 5000+ changes. (The table would store only the RC-related data, not the actual page contents. It could be simply added to by the edit function, and trimmed in daily/weekly maintenance.) This table would eliminate the need for each RC page to search the *entire* DB looking for the most-recently-updated pages. --Clifford Adams
        • It doesn't, it uses the indexes to do that. That's the whole point of using a database; they are usually very clever at these things. :-) I would like to add the remark that letting the database do the joins for you does often lead to a performance improvement of orders of magnitude. If I'm right this could be a major boost. -- Jan Hidders (2001/2/5 8:48 GMT+1)
  • Eliminate the access count function ("This page has been accessed 6 times"). I have a feeling that the huge number of writes to the database may be killing performance. A much less costly way to do this would be to process the Apache log files daily or hourly with a separate script. --Clifford Adams
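Jan Hidders' Recent Changes suggestion above (let the database do the JOIN and the GROUP BY, with the day split out of the time stamp) could look something like this. The sketch is in Python with sqlite3 standing in for MySQL, and the `page` and `revision` tables and their columns are hypothetical; the real schema is not shown in this discussion.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: the time stamp is already split into day and time columns.
    conn.executescript(
        """
        CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE revision (page_id INTEGER, day TEXT, time TEXT);
        CREATE INDEX revision_day ON revision (day);
        """
    )


def recent_changes(conn):
    """One query: join pages to revisions and count the changes per page per day."""
    return conn.execute(
        """
        SELECT r.day, p.title, COUNT(*) AS changes, MAX(r.time) AS last_time
        FROM revision AS r
        JOIN page AS p ON p.id = r.page_id
        GROUP BY r.day, p.title
        ORDER BY r.day DESC, last_time DESC
        """
    ).fetchall()
```

Each result row carries the day, the page title, the number of changes on that day, and the time of the latest one, so no grouping is left to do in the application code.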

Resolved issues

These could probably be deleted or moved to a separate "what were we thinking? / what did we do when this broke before?" page.

Pages with "wiki.phtml" subpages

  • Does anyone have an idea why this is? I couldn't reproduce it with my local copy. --Magnus Manske
I haven't seen any of these lately. I *think* it was fixed by using an absolute path for $THESCRIPT. --Brion VIBBER

130.94.122.xxx bug

  • This is serious, because of its potential for masking vandalism.
Possible fix submitted; I suspect that there's some kind of proxying going on at the server end. But I could be totally wrong. --Brion VIBBER
We'll see, it is in the mail now... --Magnus Manske
Fixed as of Feb 4 02.