Wikipedia:PHP script tech talk
This is the place to discuss bug fixes and planned features on a more "technical" level. Please do not just add bugs and feature requests here at random.
Serious bugs
Things that should be repaired ASAP.
#REDIRECTs that end in an eternal edit conflict
- Suggestions? --Magnus Manske
- I'm probably stating the obvious here, but this appears to be caused by the page trying to redirect to a different page than has been specified. If this page doesn't exist (as is usually the case) then the page goes into edit mode when you click Save, which gives an edit conflict. If the page does exist, then no edit conflict occurs, but the redirect does not go to the expected place (as for Zundark/Old_Talk, which is redirected to user:Zundark but actually ends up at user:Zundark/Old_Talk). --Zundark, 2002 Feb 3
- (2002/02/03 20:43 PST) Fix is in CVS for the problem when the redirected page does exist. (s/$this->$subPageTitle/$this->subPageTitle/) Doesn't seem to have fixed the doesn't-exist problem, I'll look at it some more. --Brion VIBBER
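The one-character fix above matters because, in PHP, `$this->$name` reads the property whose name is stored in the variable `$name`, while `$this->name` reads the property literally called `name`. A rough Python analogue follows, purely as an illustration. The class, its attribute names, and the assumption that the local variable was unset are all hypothetical, and Python stands in for PHP here.

```python
class Page:
    """Hypothetical stand-in for the script's page object."""

    def __init__(self, sub_page_title):
        self.sub_page_title = sub_page_title

    def target_fixed(self):
        # Analogue of $this->subPageTitle: read the attribute directly.
        return self.sub_page_title

    def target_buggy(self, sub_page_title=""):
        # Analogue of $this->$subPageTitle: the *value* of a local variable
        # is used as the attribute name. With an empty local (assumed here)
        # no attribute matches, so the lookup falls back to "".
        return getattr(self, sub_page_title, "")


page = Page("user:Zundark")
fixed = page.target_fixed()
buggy = page.target_buggy()
```

The fixed accessor returns the stored title, while the buggy one silently returns an empty string, which is the kind of failure that sends a redirect somewhere unexpected.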
Change password does not work
For more details, see Wikipedia:Bug Reports. Until this is fixed any users who change their password cannot log in (like me). --User:Chuck Smith
- Strange. It works fine on my local copy. Tried to log in with your old password? --Magnus Manske
Volunteers wanted
These tasks need volunteers to hack 'em!
Mask minor edits on Recent Changes
Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske
Fix the Recent Changes "(# changes)" counter
Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske
- I just looked in CVS (that's 2002/2/4 00:02 Amsterdam time) and it seems you still add a variable $addoriginal to the count. But I think that is silly because you should never count the current page if you are counting the changes. So just remove $addoriginal and the problem is solved. -- Jan Hidders (PS. Wouldn't it be nice if the sign-shortcut ~ ~ ~ would always be replaced with name and time? :-))
- Hey, that's a good idea! Especially for the bug-report pages... Brion VIBBER (2002/2/4 15:18 PST)
- Yes, unfortunately it's the only thing that made sense in my remark. :-/ What I should have said was the following. The variable $addoriginal should be 0 if the page did not already exist the previous day and current page is a minor edit and the user does not want to see minor edits. -- Jan Hidders (2001/2/5 8:45 GMT+1)
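The corrected rule for $addoriginal can be written down as a small predicate. This is a minimal sketch of the rule exactly as stated in the comment above. The function and parameter names are hypothetical, and returning 1 in every other case is an assumption, since the comment only says when the value should be 0.

```python
def add_original(existed_previous_day, current_is_minor, hide_minor_edits):
    """Return 0 or 1: how much the current revision adds to the change count.

    Implements the stated rule only; the fallback value of 1 is assumed.
    """
    skip = (not existed_previous_day) and current_is_minor and hide_minor_edits
    return 0 if skip else 1
```

For example, a page created today whose current revision is a minor edit contributes 0 for a user who hides minor edits, and 1 for a user who shows them.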
Fixing some parser bugs
- Especially the <pre> tags.
- I've replaced removeHTMLtags() with behavior more like the old usemod version; ie instead of forbidding a few tags, it allows only a small number. Thus, no <span>, <object> etc. However it still needs to be able to strip out unknown elements/parameters; I can still write naughty things like this. (2002/02/04 20:48 PST) --Brion VIBBER
- Also, I commented out the line in subParseContents that makes & followed by text that could be an entity into the entity. I suspect it was put in to fix pages that were getting over-escaped during editing, but that bug seems to be gone now and it just makes it hard to write the name of an entity. Ie, "&amp;" should *not* appear as just an ampersand, but as an ampersand followed by "amp;". --BV
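The two changes described above (allowing only a small set of tags, and no longer turning entity-like text back into entities) could look roughly like the sketch below. It is written in Python rather than PHP and is not the code of removeHTMLtags(). The allow-list, the regular expression and the function name are assumptions made for illustration.

```python
import html
import re

# Hypothetical allow-list; the real list used by the script is not given here.
ALLOWED_TAGS = {"b", "i", "pre", "br", "hr"}

# An opening or closing tag: optional slash, tag name, then any attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def remove_html_tags(text):
    """Keep allow-listed tags (attributes dropped); escape everything else."""
    out = []
    pos = 0
    for m in TAG_RE.finditer(text):
        # Text between tags is escaped, so "&amp;" typed by an editor is
        # displayed literally instead of collapsing to a bare ampersand.
        out.append(html.escape(text[pos:m.start()], quote=False))
        closing, name, _attrs = m.groups()
        if name.lower() in ALLOWED_TAGS:
            # Re-emit the bare tag without any of its attributes.
            out.append("<" + closing + name.lower() + ">")
        else:
            # Unknown element: show it as text rather than as markup.
            out.append(html.escape(m.group(0), quote=False))
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)
```

Dropping every attribute is the bluntest way to deal with unknown parameters. A gentler variant would keep a second allow-list of harmless attributes per tag.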
Brainstorming
Ideas for solutions needed here.
Speeding up the PHP script
- Caching of pages for reading only (Jimbo's idea).
- Could be tricky. Would have to adapt to viewing preferences and newly created pages (red/blue links).
- It may be possible to cache a 'common' almost-final version, which can then have a regexp run over it to set the link color and paragraph justification, and then inserted into the header/footer; this would at least save parsing the wiki page every time. Still need to deal with new pages though... Simplest way might be to run the "which pages link to this" check on a newly created page and expire the cached versions of anything that does. --Brion VIBBER
- A lot of the regexp work could be skipped by using a quick preprocessor (ideally one that slaps in the header/footer without even looking at the text) and/or CSS. --Uriyan
- Yup. CSS won't handle the difference between red links and [classic links]? for new pages, though. I recommend we change or eliminate one or the other. --Brion VIBBER
I take that back, CSS should do fine there. How does this sound:
This is a <span class="newlinkedge">[</span><a href="foo" class="newlink">new link</a><span class="newlinkedge">]<a href="foo">?</a></span>.
where we define either:
a.newlink { color: red; } .newlinkedge { display: none; }
or
a.newlink { color: black; text-decoration: none; } .newlinkedge { }
in the style sheet? The text portion will still be clickable in the old-style case, though that could probably be "fixed" if desired. --Brion VIBBER
- (2002/02/03 15:05 PST) I've changed the CVS version to use style sheets for the link colors, paragraph justification, and text/background color. (Try it at my test server, if you can find your way around the partially Esperanto-localized interface.) Keeps down the number of things that need to be changed if somebody wants to change the styles further, and should make the HTML-ized page guts cacheable.
Caching proposal
- Create a new field in the cur table named cur_cache (MEDIUMTEXT), empty by default
- When a page is saved after edit, the cache is cleared
- When a page is viewed,
- and the cache has been used X times, it is cleared (enforced up-to-date)
- and the cache entry contains text, the cache is adapted to current user settings and displayed
- and the cache is empty, the text is rendered, displayed and stored in the cache field
- When a new page is created, the cache of all pages that link to the new page is cleared
- Pages containing "{{" variables are not cached
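The rules of the proposal can be sketched as follows. This is an in-memory Python illustration, not the PHP script. The `Page` class, the `render` and `adapt_to_user` placeholders, the `links_to` mapping and the value chosen for "X" are all assumptions, and the rule about uncacheable pages is left out for brevity.

```python
MAX_CACHE_USES = 100  # hypothetical value for "X" in the proposal


class Page:
    def __init__(self, title, text):
        self.title = title
        self.text = text      # wiki source
        self.cache = ""       # rendered HTML (the proposed cur_cache field)
        self.cache_uses = 0   # how often the current cache entry was served


def render(page):
    """Placeholder for the expensive wiki-to-HTML parse."""
    return "<p>" + page.text + "</p>"


def adapt_to_user(rendered, user_settings):
    """Placeholder for the cheap per-user step (here: justification only)."""
    justify = user_settings.get("justify", "left")
    return rendered.replace("<p>", '<p align="%s">' % justify)


def view(page, user_settings):
    if page.cache and page.cache_uses >= MAX_CACHE_USES:
        page.cache = ""       # enforce an up-to-date render
    if not page.cache:
        page.cache = render(page)
        page.cache_uses = 0
    else:
        page.cache_uses += 1
    return adapt_to_user(page.cache, user_settings)


def save(page, new_text):
    page.text = new_text
    page.cache = ""           # cleared on every edit


def create(pages, links_to, title, text):
    """Add a new page and clear the cache of every page that links to it."""
    pages[title] = Page(title, text)
    for linking_title in links_to.get(title, ()):
        if linking_title in pages:
            pages[linking_title].cache = ""
```

Note that the cache holds the user-independent HTML. Everything that depends on preferences happens in `adapt_to_user`, which is what makes a single cache entry per page sufficient.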
- Why not simply update the cache field every time the page is edited? You have to parse the page then anyway because it is presented after the edit. -- Jan Hidders PS. I know I'm getting annoying but can I say again that we first should measure which pages are eating up the server CPU time? Otherwise the implementation of caching might be a waste of time and effort that unnecessarily complicates the code.
- On viewing an uncached page, the content is rendered anyway. The result can be slightly altered and stored as the cache. Generating the cache upon saving means it will have to be rendered specially for that purpose, thus wasting resources. Also, when the cache is flushed, the page won't be cached again until after the next edit.
- That said, we should of course check the special pages and improve their speed. I already (kinda) cached the Most Wanted. The other candidate is Orphans, but I don't know how to cache that; anyway, with de-orphanising progressing, the orphans list will get shorter, and the popularity of that page might drop.
- Also, the Main Page has to run a database request each time it is viewed, to keep an up-to-date article count.
- Editing this page, connected with 10MBit to the Internet, wikipedia performs quite well; I know this will change once the US goes online ;) --Magnus Manske
- I must be misunderstanding something. After submitting I am always redirected to the page I just edited, right? So are you now suggesting that you would allow that I would then see an older cached version without the changes I just made? Isn't that a bit confusing for the writer? -- Jan Hidders
- I suspect the sane behavior would be to clear the cache field when a page is saved (and the cache fields of any pages that link to it, if it's a new page). Then, when the page is loaded up again (for the edited page, that would be immediately), the empty cache is noticed and the page is rerendered and stored.
- Yes, that's what I meant. Sorry for being unclear. I added a line to the proposal. --Magnus Manske
- But why then do you need the rule for "enforce up-to-date"? -- Jan Hidders
- As far as the mainpage... how many pages actually use those {{Blahblah}} things? Should we not cache pages that contain them? -- Brion VIBBER
- No other pages have these, AFAIK. We could count the number of "{{" occurrences before and after variable replacement, and if it is unchanged, the page can be cached, otherwise not. --Magnus Manske
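The "count the braces before and after" test suggested above might be sketched like this. The variable name `EXAMPLEVAR` and both function names are hypothetical, and the substitution is a plain string replace rather than whatever the script actually does.

```python
def replace_variables(text, variables):
    """Substitute each {{NAME}} with its value; unknown names are left alone."""
    for name, value in variables.items():
        text = text.replace("{{" + name + "}}", str(value))
    return text


def is_cacheable(text, variables):
    """A page is cacheable if variable replacement removed no "{{"."""
    before = text.count("{{")
    after = replace_variables(text, variables).count("{{")
    return before == after
```

One caveat of this sketch: a substituted value that itself contains "{{" would keep the count unchanged and make the page look cacheable, so it only works while values are plain numbers or text.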
- Browser-specific page layout
- I notice though, that the tables in the page layout are setting their border properties based on whether the user agent is Internet Explorer. This explains why the tables have thin black borders in Internet Explorer and no borders in Mozilla... Magnus, is there any reason for this? I'd prefer to replace the table with some CSS markup in any case. --Brion VIBBER
- The reason is that I like the thin black lines in IE, but other browsers don't support that, they draw all lines black, which looks ugly (try it!). If you know how to change it, go right ahead :) --Magnus Manske
- Done, checked in. Looks ever so slightly different in IE and Mozilla, but approximately the same as the previous behavior in IE. Also looks okay in Konqueror 2.2.1, ugly but visible in Opera 6 (some beta version I have), but still doesn't show in Netscape 4.x. Brion VIBBER 2002/02/04 15:21 PST
- Pre-compiling of the PHP code.
- Perhaps the Alternative PHP Cache could help? (I haven't tried it.) --Clifford Adams
- Optimize slow code parts (where? why?)
- Do we actually know what the slow parts are? My gut feeling is that the Recent Changes page is the slowest, but it would be nice if we could do some actual measurements on a server that is serving only one client but has a large database. (Does somebody have a big SQL dump?) Anyway, presuming that it is, I looked at the code and I think it can be made much much more optimal by combining the two SQL queries into one. Right now it computes a JOIN and a GROUP BY in PHP which can be done far more efficiently by the database. However, it should then be possible to do a GROUP BY on the day, which is now hidden in the time stamp. So we would split this column into a day column and a time column. Do I have your permission to attempt this? (But first I would like to know if it is worth it, i.e., if the Recent Changes page plays a major part in the slowing down. I thought Magnus said something about a memory leak in Apache, so perhaps we should try to find that first.) -- Jan Hidders
- My (old) suggestion for Recent Changes optimization was to make it a separate table containing the last 5000+ changes. (The table would store only the RC-related data, not the actual page contents. It could be simply added to by the edit function, and trimmed in daily/weekly maintenance.) This table would eliminate the need for each RC page to search the *entire* DB looking for the most-recently-updated pages. --Clifford Adams
- It doesn't, it uses the indexes to do that. That's the whole point of using a database; they are usually very clever at these things. :-) I would like to add the remark that letting the database do the joins for you does often lead to a performance improvement of orders of magnitude. If I'm right this could be a major boost. -- Jan Hidders (2001/2/5 8:48 GMT+1)
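To make the GROUP BY suggestion concrete, here is a small sketch using Python's sqlite3 module in place of PHP and MySQL. The `changes` table, its columns and the sample rows are invented for illustration and do not reflect the real schema. The join with the page table is left out, so this shows only the grouping on a separate day column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE changes (
        title TEXT,
        day   TEXT,    -- split out of the timestamp, e.g. '2002-02-04'
        time  TEXT,    -- the remainder, e.g. '15:21:00'
        minor INTEGER
    );
    CREATE INDEX changes_day ON changes (day, title);
""")
conn.executemany(
    "INSERT INTO changes VALUES (?, ?, ?, ?)",
    [
        ("Main Page", "2002-02-04", "09:00:00", 0),
        ("Main Page", "2002-02-04", "15:21:00", 1),
        ("Esperanto", "2002-02-04", "11:30:00", 0),
        ("Main Page", "2002-02-03", "20:43:00", 0),
    ],
)

# One query: the database groups by day and title and counts the edits,
# instead of the script looping over rows and grouping them itself.
rows = conn.execute("""
    SELECT day, title, COUNT(*) AS n_changes, MAX(time) AS last_time
    FROM changes
    GROUP BY day, title
    ORDER BY day DESC, last_time DESC
""").fetchall()
```

Each result row is one line of a Recent Changes listing: the day, the page, its "(# changes)" count and the time of the latest edit. Whether this is actually faster on the live database would still need the measurements asked for above.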
- Eliminate the access count function ("This page has been accessed 6 times"). I have a feeling that the huge number of writes to the database may be killing performance. A much less costly way to do this would be to process the Apache log files daily or hourly with a separate script. --Clifford Adams
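A separate log-processing script of the kind suggested above could be as small as the following sketch. It assumes Apache's Common Log Format and counts only successful GET requests. The sample lines, the /wiki/ paths and the function name are made up for illustration.

```python
import re
from collections import Counter

# Matches the request and status part of a Common Log Format line.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def count_page_views(log_lines):
    """Count successful (status 200) GET requests per requested path."""
    counts = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m and m.group(2) == "200":
            counts[m.group(1)] += 1
    return counts


sample = [
    '10.0.0.1 - - [04/Feb/2002:15:21:00 -0800] "GET /wiki/Main_Page HTTP/1.0" 200 5120',
    '10.0.0.2 - - [04/Feb/2002:15:22:10 -0800] "GET /wiki/Main_Page HTTP/1.0" 200 5120',
    '10.0.0.2 - - [04/Feb/2002:15:22:30 -0800] "GET /wiki/Missing HTTP/1.0" 404 210',
]
views = count_page_views(sample)
```

Run hourly or daily, the resulting counts could be added to the database in one batch, replacing one write per page view with one write per page per run.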
Resolved issues
These could probably be deleted or moved to a separate "what were we thinking? / what did we do when this broke before?" page.
Pages with "wiki.phtml" subpages
- Does anyone have an idea why this is? I couldn't reproduce it with my local copy. --Magnus Manske
- I haven't seen any of these lately. I *think* it was fixed by using an absolute path for $THESCRIPT. --Brion VIBBER
130.94.122.xxx bug
- This is serious, because of its potential for masking vandalism.
- Possible fix submitted; I suspect that there's some kind of proxying going on at the server end. But I could be totally wrong. --Brion VIBBER
- We'll see, it is in the mail now... --Magnus Manske
- Fixed as of Feb 4 02.