Wikipedia talk:Category intersection/Archive 1

This is an old revision of this page, as edited by Sam (talk | contribs) at 02:28, 11 August 2006 (new archive). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Old discussion

--Copied from user talk:SamuelWantman#Subcategories--
Hi Rick, Haven't talked to you for a while. I'd like your opinion about something I've been thinking about quite a bit recently. As you probably know, I've been pushing for wider acceptance of duplication of articles in parents and their children for quite a while. Here's my present thinking about this, and what I'd like to propose to the developers:

  • Categories are a tool for browsing.
  • Categories are sometimes useful as an index of a subject, but often are not available as an index because they have been broken into subcategories and depopulated.
  • Many of these subcategories are in essence intersections of larger categories. For example, Category:American film directors can be thought of as the intersection of Category:Film directors and Category:American people.
  • It would be useful to have categories fully populated at the "level of notability", by which I mean that directors are much more likely to be notable as "film directors" than as "American film directors".
  • There are many category intersections that do not exist that some people might find useful. Adding more and more intersections clutters up the category listings for articles.

To address all of these things I propose the following:

  • Categories be fully populated at the level of notability.
  • The software be modified so that category intersections get created on the fly.

Here's how it would work:

  • All the categories that are intersections would be deleted and their members moved to the larger categories at the level of notability. Some of these categories would be rather large (like Category:American people).
  • New wiki-markup would be added to the software to create dynamically created subcategories. Here's how it might look:

[[Subcategory:American people:Film directors]]

This markup would be added to the page Category:American film directors. The markup would initiate a database comparison of the categories listed to find the articles and subcategories listed in both categories. The page would be displayed as a "Sub-category" instead of as a "Category", which would indicate that it was dynamically created. There might be automatically generated text that would say something like, "This sub-category contains all the articles in Category:American people that are also in Category:Film directors." Additional text for the page could be created as normal, and the subcategory could be categorized as normal.
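The "database comparison" described above amounts to a set intersection. A minimal sketch in Python, assuming a hypothetical in-memory mapping from category name to article titles (the sample data and the function name intersect are illustrative only, not part of MediaWiki):

```python
# Hypothetical model: category name -> set of article titles in that category.
members = {
    "American people": {"Orson Welles", "Mark Twain", "Spike Lee"},
    "Film directors": {"Orson Welles", "Akira Kurosawa", "Spike Lee"},
}

def intersect(categories, members):
    """Return the articles listed in every one of the given categories."""
    sets = [members.get(name, set()) for name in categories]
    if not sets:
        return set()
    return set.intersection(*sets)

# Contents of the dynamically created sub-category "American film directors".
american_film_directors = intersect(["American people", "Film directors"], members)
print(sorted(american_film_directors))
```

This only works as intended if both parent categories are fully populated, which is the premise of the proposal.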

Articles could be placed in the category directly. For example List of American film directors could still be put in the category. There should be some visual indicator of the articles that are in the category directly and those that were from the intersection of the parents to help alert editors of miscategorized articles.

Articles would only list Categories on the bottom and not list all the Subcategories that they may be found in (unless they have been put in these categories directly by mistake). Perhaps, each category listed might have a check box, by clicking on some of the check boxes and then clicking on a link to "display subcategory" the user could go directly from the article to the dynamically created subcategory.

Does this sound like a good idea to you? Comments? Suggestions? Thanks. -- Samuel Wantman 10:26, 22 July 2006 (UTC)

Hi Samuel - yes, it's been a while. I hope things with you are going well. I haven't spent much time on categories lately, except for adding some comments about the naming conventions just recently. So, on the fly intersections? First, the general notion has been around for quite some time. Looking through the wikitech-l mailing list archives, someone even wrote the code implementing a version of category intersection ("category:Film directors/American people" would be the intersection of these two categories). The ensuing discussion pointed out that "/" was not a great choice, and brought up concerns that, without including subcategories in the results, this would be of only limited use. Looking at requests currently open in bugzilla, there's bugzilla:5244 and bugzilla:2285. user:Steve block and I had a discussion a while ago about using flickr style tags (which I think exists nowhere except in the VPT archives), which I think is at least similar to what you're thinking. Let me turn this around - what do you think about the flickr sort of idea (and, if you've never visited flickr, give it a try)? -- Rick Block (talk) 14:35, 22 July 2006 (UTC)
So, what do you think? -- Rick Block (talk) 04:12, 27 July 2006 (UTC)
I have been thinking about this quite a bit. I think the current system is a mess and needs changing. Have you seen the latest about category duplication at Wikipedia talk:Categorization? After months of work on this, I feel like I'm starting all over. The reality seems to be that it doesn't matter what decisions are reached through discussion; what matters is what common practice is and who is the most insistent.
So if there is going to be a change, I think it should try and hold onto the aspects of the current system that are good, and enhance it. I have looked at the flickr system, and I find it to be very disorganized. I suspect that unless handled well it could be a real mess here. But there are some good things about it, and I have been thinking about how to incorporate it into what we have. I don't want to abandon what we have, and I don't want people adding oodles of meaningless tags to thousands of articles. I'd like to keep the multiple taxonomies that we currently have, and encourage additional taxonomies. So if you take the ideas I mentioned above, I think it could work like this: sort of a cross between what we have and flickr, apparently already doable, with most of the code already written according to the links you posted above.
As I mentioned, there would be no subcategories posted with articles, only primary categories, so instead of this (using Laurence Fishburne as an example):
Categories: 1961 births | African-American actors | American film actors | American television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia
You would have this:
Categories: 1961 births [ ] | Living people [ ] | American people [•] | People from Georgia (United States) [ ] | People from Augusta, Georgia [ ] | People of African descent [•] | Film people [ ] | Television people [ ] | Actors [•] | Best Actor Academy Award nominees [ ] | M*A*S*H [ ] | Miami Vice [ ] | A Nightmare on Elm Street [ ]
Show Sub-category matching all checked boxes
These categories are sort of like Flickr. None of them are intersections of other traits. Each listing would have a check box next to each category. You could check off whatever categories you want and then click below to get the sub-category. In this case it is Category:African-American actors. It is interesting to me that most of these categories already exist. This adds a small amount of category "clutter". There are a few more categories than originally. But with this setup, ALL of the primary categories listed here would be fully populated and so would all the possible intersections of these categories. You would be able to see the intersection categories even if nobody had created the page for it, such as Category:African-American film actors from Augusta, Georgia who appeared on M*A*S*H. This is just like when somebody puts an article in a category without creating the page. The sub-category would be created dynamically by finding the intersection. If the page hadn't been created yet, it would list all the articles and also have links to the primary categories used for the intersection. Editors could continue to create pages for these intersections and structure them however they want, just as done now. There would probably need to be a new way to indicate how to code an intersection, as I mentioned above.
I also think some of the process for categorization could be automated. For example, if someone just created the Laurence Fishburne article, and put him in Category:African-American actors, perhaps the system could look at the page for the category and see that it is an intersection sub-category that has three parents. The software could make the changes to the article so that it gets categorized in all three parents.
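The automated recategorization described above could look roughly like the following Python sketch. The mapping intersection_parents and the function expand_categories are hypothetical names for illustration, and the three parents are taken from the example in this discussion:

```python
# Hypothetical definitions: intersection sub-category -> its primary (parent) categories.
intersection_parents = {
    "African-American actors": ["Actors", "American people", "People of African descent"],
}

def expand_categories(article_categories, intersection_parents):
    """Replace each intersection category with its parents,
    keeping first-seen order and dropping duplicates."""
    result = []
    for name in article_categories:
        # A category with no intersection definition expands to itself.
        for expanded in intersection_parents.get(name, [name]):
            if expanded not in result:
                result.append(expanded)
    return result

# What an editor typed, and what the software would store on save.
saved = expand_categories(
    ["1961 births", "African-American actors", "Actors"], intersection_parents
)
print(saved)
```

Because the rewrite happens whenever an article is saved, existing articles would migrate to the primary categories gradually as they are edited.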
In reading through the links you posted above, I notice that this proposal might not have the problems that were discussed. Perhaps it might actually be easy to implement. -- Samuel Wantman 07:28, 27 July 2006 (UTC)
At first reading (and I'll read it again) this sounds almost exactly like the flickr setup, but using "category" as the name rather than "tag". I agree it would be good to keep the name as category. As it stands, categories are internally a page tied to a database search with a combination of user entered text (the "prologue" bit you enter when you edit a category) and dynamically generated content based on the database search (the list of articles or subcategories). Allowing "intersection categories" be able to be explicitly created seems like a reasonable idea as well (it's the same setup as an existing category, but with a more complicated search than a simple "all articles in this one category"). As you suggest, if these are only intersections the software could allow adding articles to such categories (by actually adding the article to all the categories that are intersected).
A couple of the Fishburne example categories bother me a bit, but I'm not quite sure what to do about it. Specifically, the "born in" hierarchy (if he was born in Augusta, Georgia, he was obviously born in Georgia and obviously born in the U.S., so Category:People from Georgia (U.S. state) and Americans seem to be implied); similarly, Actors and Best Actor Academy Award nominees seem to overlap. These are strict subsets rather than intersections, which means the "parent" category could theoretically be done as a union, but I'm not sure if most people would immediately understand the difference.
I think it might fundamentally be a quibble, but I'd prefer the intersection UI to be on the category page, rather than the article page. Perhaps the categories are all listed (on the article page) and you can click on any of them individually or click on the "categories" header (which takes you to the intersection of all of them). Then the "category page" shows the current list of "intersection" categories, each clickable to show all the articles in that category and with something to click (trailing "[-]"?) to remove the category from the current "intersection set"). The available intersection categories would be in a separate spot in the display (like underneath), again each individually clickable but also with something to click (trailing "[+]"?) to add them to the current intersection set. In any event, however the exact UI details get worked out I think the operations of refining or expanding the current intersection set would need to be available. -- Rick Block (talk) 14:07, 27 July 2006 (UTC)
I too am a little bothered by the Fishburne example and don't exactly know what to do with it. One criterion in designing this is that it should remain easy to get to the categories that you can now get to. Since there are many nationality sub-categories currently (Category:American actors), and virtually none by state or city, it seems useful to make them easy to create. The Oscar one does not bother me, because it is already acceptable as a place to duplicate listing people. I also think that all people categories should be populated using an ALL or NONE rule. By this I mean, if you are going to have a few people in Category:People from Georgia (U.S. state) (which I just fixed in the examples above), it should contain EVERYONE from Georgia, or NOBODY from Georgia because they are all in subcategories. The reason for this is so categories can be used as subject indexes. So having the multiple categorizations shows that there is community acceptance of having this duplication.
I have reservations about taking the intersection off of the article page. If I'm understanding you, you'd first go to the category which is the intersection of everything and then remove categories from the intersection. I suspect for most articles there will only be one article listed in the intersection category. I doubt there are any other Oscar nominated African-American actors from Augusta, Georgia that appeared on those TV series. So what you are in essence suggesting is that you go to another page to make the intersection selection. If we can come up with a good interface for doing it on the article page, I think that would be better than doing it on a separate page.
Here's another idea I've been kicking around. What if there are some built-in categories for all articles? The set I'm thinking of is PEOPLE, PLACE, THING, TOPIC, LIST, EVENT. Every article would have to be classified as one of these things. Perhaps there is a name-space for each of these things, and the first thing you have to do when you create a page is decide which name-space it belongs in. For example, Suspension bridge would be a topic, Golden Gate Bridge a thing, San Francisco a place, Battle of Gettysburg an event, World War Two a topic, etc. Along with this, when you create or edit a category there would be checkboxes that say which namespaces are allowed in the category. There would be a checkbox for each of CATEGORIES, IMAGES, TEMPLATES, WIKIPEDIA PAGES, TALK PAGES, PORTALS, PEOPLE, PLACES, THINGS, TOPICS, LISTS, EVENTS. So if Category:Entertainers does not have PEOPLE checked, you would not be able to put a PEOPLE article in the category. Perhaps Entertainers would show up in grey on the article to indicate that the article was not put in the category. If you clicked on the grey link you'd get a message that explained that you could not put PEOPLE articles in Category:Entertainers and to look in the subcategories of Category:Entertainers for categories where PEOPLE belong. There could also be separate sections for each of these namespaces in the category listings.
It is a clear consensus to not put people into Category:Entertainers, yet I think it would be useful to be able to see a complete index of what is in Category:Entertainers. So I've been wondering about having the ability to turn any Category into an INDEX. Perhaps there is a link at the top of each category that says "View as an Index". When you clicked on the link, you'd see the category presented as an outline. All the subcategories and articles would be combined into a single alphabetical list. The subcategories would be formatted differently from the articles. There'd also be another option that said "Show contents of all subcategories". Clicking on this would add the contents of the subcategories to the category or list. If both options are selected, the subcategory contents would be indented and listed directly under the subcategory heading. Indexes would only go a set number of levels deep. Perhaps the depth of the index could be a user preference. -- Samuel Wantman 22:29, 27 July 2006 (UTC)
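A rough Python sketch of the "View as an Index" idea with both options selected, assuming a hypothetical category tree held in two dictionaries (subcategories and articles). The names, the sample data, and the rule that a category is expanded only once are assumptions made for illustration:

```python
# Hypothetical category tree; names and structure are illustrative only.
subcategories = {
    "Entertainers": ["Actors", "Singers"],
    "Actors": ["Film actors"],
}
articles = {
    "Entertainers": ["Entertainment"],
    "Actors": ["Acting"],
    "Film actors": ["Laurence Fishburne"],
    "Singers": ["Singing"],
}

def build_index(category, max_depth, depth=0, seen=None):
    """Return (indent, kind, title) rows: one alphabetical list per level,
    with subcategory contents nested directly under their heading."""
    seen = set() if seen is None else seen
    seen.add(category)
    rows = []
    entries = [("category", c) for c in subcategories.get(category, [])]
    entries += [("article", a) for a in articles.get(category, [])]
    for kind, title in sorted(entries, key=lambda e: e[1].lower()):
        rows.append((depth, kind, title))
        # Descend only within the depth limit, and never expand a category
        # twice (this also guards against loops in the category graph).
        if kind == "category" and depth + 1 < max_depth and title not in seen:
            rows.extend(build_index(title, max_depth, depth + 1, seen))
    return rows

index = build_index("Entertainers", max_depth=2)
for indent, kind, title in index:
    marker = "+ " if kind == "category" else "  "
    print("    " * indent + marker + title)
```

Here max_depth plays the role of the suggested user preference: with max_depth=2 the contents of Actors and Singers are shown, but Film actors is listed without being expanded.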
In rereading what I just wrote, I noticed that my new idea could change things a little. If there are separate namespaces as described, and if each is shown in a different section on category pages, then you could redo Fishburne's categorization like this:
Categories: 1961 [ ] | Living people [ ] | United States [•] | Georgia (United States) [ ] | Augusta, Georgia [ ] | African descent [•] | Film [ ] | Television [ ] | Acting [•] | Best Actor Academy Award nominees [ ] | M*A*S*H [ ] | Miami Vice [ ] | A Nightmare on Elm Street [ ]
[GO]
This scheme would combine many categories together. For example, American people would be part of United States. To make this work, perhaps each section of a category could have a show/hide button. By default, perhaps any section with more than 50 entries starts out hidden. If not, then the categories would probably be too huge. -- Samuel Wantman 22:46, 27 July 2006 (UTC)
From the point of view of a general software package, I'm not sure I like the people/place/thing classification. Showing a category as an index is interesting, but I suspect it only works for subset hierarchies. It might be possible to have both explicit "intersection" categories (e.g. American actors) and "subset" categories (people born in Augusta, GA), although this might get pretty complicated pretty fast. I think since there is an example (flickr) that shows a way to deal with intersections, it might be worth keeping these notions separate and addressing only one (at first).
OK. So where would you like to go with this? We could enter it as a bugzilla request, or write something up as a proposal in wikipedia space to solicit more input, or post it to the Wikitech-l mailing list. Do you have a strong preference between these, or any other ideas for what to do next? -- Rick Block (talk) 17:55, 29 July 2006 (UTC)
I think we should come up with as good a proposal as we can and then invite people to come and discuss it, especially the developers. I have not been involved with the mailing lists or the irc channels, so I have no opinion about them. I notice that virtually all the old-timers who used to hang out at Wikipedia:Categorization and WP:CFD are no longer around. Things seem broken. There have been two discussions just today at Wikipedia talk:Categorization about this problem. One involves breaking up categories into English, Scottish, Welsh, etc. vs. just using British. Another is about Category:Board games.
Would you mind copying what you think makes the most sense from what we have written and starting a proposal? That way I could understand better where you are, and see if we are close to being in the same place. -- Samuel Wantman 08:51, 30 July 2006 (UTC)
Sure. I'll draft something up today. -- Rick Block (talk) 14:54, 30 July 2006 (UTC)
I'm working on it, but not done yet. I'll let you know when I have something that I think is reasonable (might be a few days even). It's harder than I thought to come up with something that's easy to use (and playing around with Flickr I can't figure out how to make it do intersections - I could have sworn this at least used to be possible). -- Rick Block (talk) 04:08, 31 July 2006 (UTC)

Start of proposal

The current state of categories in Wikipedia is somewhat chaotic due at least in part to the lack of a category intersection feature. Many categories are in essence intersections of larger categories. For example, Category:American film directors can be thought of as the intersection of Category:Film directors and Category:American people. Use of these "subset" categories makes it difficult to find all members of a "higher level" category; either articles have to be added to both the "subset" and "higher level" categories or the members of the "subcategories" (and, recursively, their subcategories) have to be enumerated. Precisely defining the circumstances in which articles should be added to both "lower level" and "higher level" categories, and even whether this is ever appropriate, remains a source of continuing discussion among editors (see, for example, Wikipedia talk:Categorization/Archive 11).

Category intersection has been a desired feature for quite some time. Looking through the wikitech-l mailing list archives, someone even wrote the code implementing a version of category intersection. With this change, "category:Film directors/American people" would be the intersection of these two categories. The ensuing discussion pointed out that "/" was not a great choice, and brought up concerns that, without including subcategories in the results, this would be of only limited use. Looking at requests currently open in bugzilla, there's bugzilla:5244 and bugzilla:2285. The CatScan tool on the toolserver machine is used as a current workaround, although a feature implemented directly in the MediaWiki software itself would have applicability to all users of the software.

How should such a feature work?

First, any existing "intersection categories" would be decomposed into primary categories. So instead of this (using Laurence Fishburne as an example):

Categories: 1961 births | African-American actors | American film actors | American television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia

we would have this:

Categories: 1961 births | People of African descent | American people | Film actors | Television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia

Clicking on any of the categories would act very much like a category does today (more on this below). However, note the categories link. Today, this link goes to Special:Categories (which is a relatively useless list of all categories that exist). With this proposal, this link would go to Categories, which would be interpreted as the dynamically created "intersection category" of all categories the article Laurence Fishburne is in. From any category listing, the total number of articles in the category would be displayed (truncated to some reasonable number, like 999) and, instead of "subcategories", an interface would be provided to reduce the number of matching articles (by adding a category to the current intersection set) or to increase the number of matching articles (by removing a category from the current intersection set).

Staying with Laurence Fishburne, clicking on the "categories" link (not a specific category) would show the articles that are in all the same categories Laurence Fishburne is in (likely, just the one article). The list of categories comprising the intersection would be displayed, perhaps near the top of the category listing. Clicking any of these would remove the category from the current intersection set, and recompute the intersection (resulting in more articles being displayed). To add a category to the current intersection set (reducing the number of articles being displayed), a list of "subset categories" would be displayed plus an input box for entering an arbitrary category. The "subset categories" would be manually added as meta-data to each category. The list displayed would be the union of the subset categories added to all categories in the current intersection set.
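The refine/expand behaviour described above can be sketched in Python as follows. The article data, the subset-category metadata, and the function names matching_articles and suggestions are all hypothetical and serve only to illustrate the operations:

```python
# Hypothetical data: article -> its categories, and
# category -> manually maintained "subset categories" metadata.
article_categories = {
    "Laurence Fishburne": {"American people", "Film actors", "Living people"},
    "Spike Lee": {"American people", "Film directors", "Living people"},
    "Orson Welles": {"American people", "Film actors", "Film directors"},
}
subset_categories = {
    "American people": {"People from Georgia (U.S. state)"},
    "Film actors": {"Best Actor Academy Award nominees"},
}

def matching_articles(current):
    """Articles that belong to every category in the current intersection set."""
    return {a for a, cats in article_categories.items() if current <= cats}

def suggestions(current):
    """Union of the subset categories attached to each category in the set,
    leaving out anything already part of the intersection."""
    found = set()
    for name in current:
        found |= subset_categories.get(name, set())
    return found - current

# Clicking the "categories" link: intersect everything the article is in.
current = frozenset(article_categories["Laurence Fishburne"])
narrow = matching_articles(current)
# Clicking one category in the set removes it, which widens the result.
wider = matching_articles(current - {"Film actors"})
offered = suggestions(current)
```

Removing a category can only keep or grow the result, and adding one can only keep or shrink it, which matches the "increase" and "reduce" operations in the proposal.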

What would the user interface for all this look like?

<need to work out more details>

New discussion

My big question to you is what is wrong with putting the interface on the article page. Working from my previous example it could look something like this:

Categories: People | 1961 births | Living | United States | Georgia (United States) | Augusta, Georgia | African descent | Film | Television | Acting | Best Actor Academy Award nominees | M*A*S*H | Miami Vice | A Nightmare on Elm Street
[Create subcategory from the selected categories]

The check boxes should look better (I don't know how to code them), and "|" is probably not the best way to separate them. This is almost a full flickr implementation. I'm trying to think of a way to implement my primary categories (People, Places, Things, etc...) I'm thinking that these would ALWAYS be the first categories, and that the software would require all articles to be placed in a primary category. Also, for the interface, the People category would always be checked and could not be unchecked. It might be possible to make it a pull down list that would let you select the other primary categories, but this seems to be a complication that is not required. -- Samuel Wantman 09:16, 1 August 2006 (UTC)

In reading through the discussions about the patch that was created to do intersections, I found this explanation about why the patch would not be useful:

"I don't see how this can be more than marginally useful unless it also searches all subcategories to infinite depth (with recursion checks?!)."

This assumes the current system of putting articles into the lowest level of subcategories and removing the articles from the parents. If this were no longer the case, then this would not be a problem. Since we are discussing a system where each category is fully populated, the code will work just fine without having to search through any subcategories. -- Samuel Wantman 09:37, 1 August 2006 (UTC)

I've changed the checkboxes to Unicode check box characters (you can't check and uncheck, but the look is probably closer). The concern I have is that putting them on the article page makes the software change bigger. I think whatever we do, we're affecting the code that generates a category listing. Adding the selection mechanism to an article page affects the basic page presentation code as well. Not that this can't be done (and, I agree that it might be nice to be able to directly select the categories you're interested in), but the magnitude of this change is a little daunting. I'm OK with writing it up this way (selection from article pages).
I don't think adding a mandatory primary category is very feasible. There are over a million articles in en.wikipedia, none of which currently have a primary category. I think we have to propose a change that doesn't require touching all articles at the point the change becomes "live". We're talking about recategorizing probably every single article, but this doesn't have to be done immediately. Even if we did have primary categories, I don't understand why "people" could not be unchecked. Wouldn't you want to be able to show all articles related to, say, Augusta from the Fishburne article?
I agree with your comment about the prior implementation limitation. I think we're talking about completely flattening many of the existing category hierarchies, although I don't think we have a solution yet for the augusta/georgia/u.s. sort of issue (other than add to all). "Television actor" vs. "film actor" is another one of these, although I see in your example above you split these into "television", "film", and "actor" (acting). Without some sort of additional semantics, doing intersections with these is likely to have non-obvious results (for example, if you're looking for folks in the movie MASH, "film x actor x MASH" will likely include anyone associated with the TV show as well, as long as they were also in at least one movie). So long as "category" is just one dimensional (a single value), there isn't any way to fix this (we could have categories be a type/value pair, but this would be a MUCH bigger change than we've been talking about). -- Rick Block (talk) 19:18, 1 August 2006 (UTC)

I see what you are getting at. I hadn't really thought about this much, but this is one of the problems of the Flickr system. If you put people in an actor category and you also put people in a film category, the intersection of these is not just film actors. The intersection is film actors plus actors who worked on a film but never acted in one (like a stage actor who is also a film director). My first take on this is that a pure Flickr system will not work well for what we are trying to do.
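The ambiguity can be reproduced with a few lines of Python. The three people and their tags are invented purely to show the effect; nothing here reflects real articles:

```python
# Hypothetical people and tags, invented only to show the ambiguity.
tags = {
    "Person A": {"Actors", "Film"},  # acted in films
    "Person B": {"Actors", "Film"},  # stage actor who directed a film, never acted in one
    "Person C": {"Actors"},          # stage actor with no film work
}

def tagged_with(wanted):
    """People carrying every tag in `wanted` (a plain flickr-style intersection)."""
    return {person for person, t in tags.items() if wanted <= t}

# Person B is included although they never acted in a film.
flat_result = tagged_with({"Actors", "Film"})
print(sorted(flat_result))

# Keeping the compound category "Film actors" as a single value avoids the mix-up.
categories = {
    "Person A": {"Film actors"},
    "Person B": {"Stage actors", "Film directors"},
    "Person C": {"Stage actors"},
}
film_actors = {p for p, c in categories.items() if "Film actors" in c}
print(sorted(film_actors))
```

Single-valued tags cannot record which role goes with which medium, so the flat intersection over-matches, while the compound category returns only the person who acted in a film.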

So one way around this is to leave the categorization system pretty much the way it is, but remove those categories which clearly are the intersections of other categories.

So this would change the category structure to this:

Categories: | 1961 births | Living People | American people | People from Georgia (United States) | People from Augusta, Georgia | People of African descent | Film actors | Television actors | Best Actor Academy Award nominees | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors
[Create subcategory from the selected categories]

The general rule would be: If a category can be completely and totally determined by finding the intersection of a single set of a small number of other categories it should not be populated. If there are articles that relate to the topic they can get linked manually to an intersection category by adding a "See also" comment. For example there might be a comment to see List of American film directors in Category:American film directors which would be populated with the intersection category of Category:Film directors and Category:American people.

I'd reword the general rule a bit: If a category can be completely and totally expressed as the intersection of other categories, it should be defined only as this intersection.

This isn't as much of an overhaul as I was hoping for, but perhaps that is a good thing.

To make this system work the software would need the following upgrades:

  1. An interface to allow a user to easily choose categories to intersect.
    Both on the fly and "statically" (e.g. a "precreated" intersection category that can have intro text). Category intersections need a URL and wikilink syntax as well. While we're at it, I'd like to see the search interface extended to include the ability to find articles in specific categories as well.
  2. Mark-up code to add the display of intersections to categories. (I'm wondering about using double colons to delineate between categories.) I'm assuming that once defined as an intersection, no articles will remain in the category. It would still be possible for the category to have subcategories. The procedure to make subcategories does not need to change.
    Hmm. Seems like there are three topics here. One is how you precreate an intersection category (so it can include a text intro and "see also" links). This includes the issues of how do you get to the create this page interface from an intersection display, what the syntax is for specifying the intersection set, and whether an intersection category can have a name (other than the intersection syntax). Another is how to display the intersection set when you're viewing an intersection category, either a precreated intersection or one done on the fly. I think this is much like the checkbox interface we're presuming is on the articles (right?). The third topic is how intersection categories relate to subcategories. I think intersection categories should probably be treated like subcategories of every category in the "intersection set". Beyond that, I suppose they should be able to be explicitly added as a subcategory to any other category. However, I don't think you can make a category a subcategory of an intersection category except by adding the category to each of the categories in the intersection set.
  3. A database to match category pages to their intersected categories. This is needed so that when someone checks off three categories for an intersection, the page to display can be found. This way every intersection category can have the same names they now have, following normal category naming conventions.
    I've been thinking more about how this is done on the fly, but if we're going to have precreated intersection categories as well then there needs to be a way to find the precreated one. I think this gets a little tricky since an "intersection" is not order dependent, i.e. [American people, Directors] is the same as [Directors, American people]. When we get 3 or 4 or 5 categories in an intersection, the number of combinations that differ only in order grows pretty fast (it's n!). I think to support this the software would have to store the intersection set in some canonical order (alphabetically sorted, perhaps) and then, when looking for a precreated intersection, put the desired intersection set in the same canonical order before searching. Per below, the pages for these intersection categories probably have to be in a new namespace as well, since if they're in the category namespace there wouldn't be any way to prevent someone from explicitly adding articles or categories to intersection categories.
  4. If someone tries to put an article in a category defined as an intersection, perhaps the software automatically puts the article in the parent categories. This would also have the added benefit of recategorizing the entire database of articles as they get edited. Without this feature, this will be a very difficult system to maintain.
    If intersection categories are in a different namespace, adding articles or categories to them can simply be disallowed. I think this is a better idea, but does create a transition issue. I guess if the names are distinct enough (like include "::" as you suggest), they could effectively be treated like they're in a different namespace.
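The canonical-order idea in item 3 can be sketched roughly as follows. This is only an illustration, not anything MediaWiki provides: `canonical_key`, `find_precreated` and the `precreated` table are hypothetical names, and alphabetical sorting is just one possible canonical order.

```python
from itertools import permutations

def canonical_key(categories):
    """Return an order-independent key: unique names in alphabetical order."""
    return tuple(sorted(set(categories)))

# Hypothetical table of precreated intersections, stored under the canonical key.
precreated = {
    canonical_key(["Directors", "American people"]): "American directors",
}

def find_precreated(categories, table):
    """Put the requested set in canonical order, then do a single lookup."""
    return table.get(canonical_key(categories))

# Three categories can be listed in 3! = 6 orders, but all share one key.
selection = ["People of African descent", "Actors", "American people"]
orderings = list(permutations(selection))
distinct_keys = {canonical_key(order) for order in orderings}
print(len(orderings), len(distinct_keys))
```

Because every ordering collapses to one key, only one entry per intersection set needs to be stored, however many categories are combined.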

Some sort of protection scheme will be needed to keep people from wreaking havoc on the system by turning existing categories into intersection categories that contain categorization errors or vandalism. For example, someone could go into Category:Living people and change the code so that it becomes the intersection of Category:LGBT Wikipedians and Category:Gay actors. This would vandalize 1000s of articles at once. Intersection code should probably only be added by admins. Perhaps all recategorizations as intersections need to be agreed to by the community when this proposal gets implemented. Once underway, I would think that anybody could create the page to go with an intersection as long as they could not change the intersection. -- Samuel Wantman 08:11, 2 August 2006 (UTC)

If intersection categories are in a different namespace, you can't turn an existing category into an intersection category except by creating the new intersection category and removing the old category. You could change the intersection set for an existing intersection category (e.g. change "American Directors" to the intersection of Americans and Murderers). Maybe this is a reason to have the intersection set only be in the name and not as editable metadata. -- Rick Block (talk) 14:21, 2 August 2006 (UTC)

Putting this all together

I had to start a new section. This was just getting too long.

I've been thinking about a set of criteria for what we are trying to do. Would you agree to the following?

  1. Many topic level categories that now only hold subcategories should be fully populated.
  2. Many subcategories can be created automatically by finding the intersection of their parents.
  3. The current categorization structure should not be affected by this proposal. The only perceived differences might be:
    • The ability to create category intersections on the fly. Many of these subcategories do not currently exist. All users should be able to create categories using these intersections if possible.
    • Articles will show only primary (topic level) categories on their pages.
  4. An interface will be needed for users to create intersection categories.
    • Preferably, this will be possible from any article page.
    • If possible, articles miscategorized into categories that are intersections should be automatically fixed.
  5. The system needs to be protected from vandalism.

So how about this:

  • The mark-up for creating a category intersection will just be the automatic transclusion of a page from a new namespace (sub-category? intersection? I'll use "Subcat" for the examples). Pages in this new namespace will just be lists of the categories to be used for an intersection. They will have the same name as the category page that uses them. There won't be any markup for transcluding. If a subcat page exists with the same name as a category page it will be automatically transcluded. For example Category:African American actors would have the corresponding page Subcat:African American actors which would have the following editable text:
Actors
American people
People of African descent
  • This list would only appear when the subcat page is being edited, and should always be in alphabetical order. The software can alphabetize any lists not entered in order. When not being edited the page will look like a list of links to all the articles that are the intersection of all the categories listed, so when it is transcluded it will be the contents of a category. It might also have a header with links to the categories that were used for the intersection.
  • This page can be created several ways:
    1. Administrators can create or edit the page manually.
    2. Anyone can create a category intersection by selecting categories listed under an article by checking off the desired categories and then clicking on a link to view the intersection set. The user would then see the subcat page, but would not be able to edit it. It would look like a blank category page, without a title, just displaying the categories being intersected and links to articles that are the results of the intersection. If there was already a category page created for this intersection it will be displayed, so to the user it will appear that they have just moved to that category. If this is a new subcat page the user will be able to save it and create a new category using it by selecting an option that says something like, "Wikipedia does not currently have a category like this. If you would like to create a new subcategory that is the intersection of these categories: {list of selected categories}, enter a name for the category here _______ and select 'create'". Etc... with instructions and links to the relevant policies. If the user enters the name of a category page that already exists they would be informed and asked to enter a different name, or abort. Once a valid name is entered the category page would be created, and so would the subcat page with the intersections. The user would be able to edit and categorize this page just like any other category. The only difference is that the subcat page gets transcluded as well.
    3. Using the procedure above, it would be possible to create a subcat page by adding the desired categories to a sandbox page, previewing it, checking off all the categories and saving it as a category.
  • It would be possible with this system to have the software or a bot automatically move miscategorized articles to their parents. I would suggest that this be done by adding a tag or flag to the subcat page that does not get displayed. Only admins would be able to set this flag. The flag would be needed because it would be possible to create intersections of categories that were not meant to be populated with articles. It would not be possible to vandalize existing categories, because once created the categories used for the intersection would only be editable by admins. Admins would also be able to rename and delete subcat pages if necessary.
  • The subcat pages will simply be a database of lists of category names along with the name of the page that contains the lists. When a user creates a category intersection on the fly, the categories selected will be matched with the lists. Since the lists are in alphabetical order, the selected categories can be compared in the same order to quickly find a match. If there is a match, the category that uses the subcat page will be displayed.
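As a rough illustration of the flow in the list above, here is a minimal sketch of matching a checked-off selection against the stored subcat lists and computing the intersection set. The data structures and function names (`subcat_pages`, `article_categories`, `find_saved_category`, `intersection_members`) are assumptions made for the example, and the article titles are placeholders rather than real pages.

```python
# Hypothetical subcat pages: category name -> alphabetical list of categories.
subcat_pages = {
    "African American actors": [
        "Actors",
        "American people",
        "People of African descent",
    ],
}

# Hypothetical membership data: article title -> primary categories it is in.
article_categories = {
    "Article A": {"Actors", "American people", "People of African descent"},
    "Article B": {"Actors", "American people"},
    "Article C": {"American people", "People of African descent"},
}

def find_saved_category(selected):
    """Return the category whose subcat list matches the selection, if any."""
    wanted = sorted(set(selected))  # same alphabetical order as the stored lists
    for name, listed in subcat_pages.items():
        if listed == wanted:
            return name
    return None

def intersection_members(selected):
    """Return the articles that are in every one of the selected categories."""
    wanted = set(selected)
    return sorted(
        title for title, cats in article_categories.items() if wanted <= cats
    )

checked = ["People of African descent", "Actors", "American people"]
print(find_saved_category(checked), intersection_members(checked))
```

If no stored list matches, `find_saved_category` returns `None`, which would correspond to the case where the user is offered the chance to name and save a new category.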

Does this address all your concerns? -- Samuel Wantman 08:37, 5 August 2006 (UTC)

I'm not sure if this is "stop the presses" or not, but I just ran into m:DynamicPageList (a MediaWiki extension that is not currently installed here). Hmmm. I think I need some time to think about this. -- Rick Block (talk) 18:53, 5 August 2006 (UTC)
I would not be surprised if everything we need to implement this already exists. --Samuel Wantman 20:59, 5 August 2006 (UTC)

I generally agree with the criteria, although I think I might tweak the wording a little bit. For example, your first two are related and could be combined into:

  1. Many existing categories are logically the intersection of attributes for which "primary" categories exist, for example Category:American actors is logically the intersection of Category:Actors and Category:American people. Although these "primary" categories are today generally subdivided into subcategories, if they were directly (fully) populated the "intersection categories" could be automatically generated.

I like explicitly listing the criteria for the solution. One more, perhaps implied by your #3, is that the software change to implement the new solution must fundamentally be an "add-on" not requiring wholesale changes to existing articles or categories. Other additional ones might be:

  • articles should not be permitted to be directly added to intersection categories
  • both "on the fly" and "static" intersection categories must have a URL syntax, and both should have a wikilink syntax

My understanding of the substance of your proposal is that an admin creates a "static" intersection category by editing the intersection list maintained in a parallel, protected, namespace ("subcat"), while non-admins could create new intersection categories but not edit the intersection list, right? So, for example, to create Category:African American actors as an intersection an admin would edit Subcat:African American actors and include in it the intersection list. If a non-admin user was currently viewing a "non-existent" (equivalent to red-linked) intersection category, he/she could "save" this intersection as a new pseudo-category by giving it a name.

Our understandings are exactly the same.

Following this through a bit, the previously created intersection categories would have a URL and wikilink syntax exactly like existing categories, so when the software generates a category listing it has to check if the parallel name exists and, if so, then treat the category as an intersection rather than as a "regular" category. Part of treating a category as an intersection might be to disallow adding articles to the category. For statically created intersections, I think this could clearly work, although the ability to turn an existing category into an intersection category seems problematic (what happens to the articles that are already in the existing category when this is done?).

The ability to turn an existing category into an intersection category is part of the elegance of the system. First of all, only an admin would be able to do this because an on-the-fly subcat cannot be created with a name that already exists. So the process is that there has to be agreement to change the nature of the category, perhaps occurring at WP:CFD, and then an admin creates the subcat page and sets the flag to move the articles. The software will move all the articles to the categories listed in the subcat page as the converted category is depopulated. So part of the process of changing a category to a subcat would be checking to see that all the articles in the category will actually belong in the parents, and creating a "See also:" section for articles that should remain associated with the category. In the examples we've been talking about it might be an eponymous article or list (List of African-American actors). If someone miscategorizes an article into a category that has a subcat list, the article would be in the category until the software or bot moved it. Perhaps the category would be listed in grey to show that it will not be there for long. This way of implementing subcats will make it possible to undertake the massive repopulation of categories that will be needed.
There should be a new section in category listings for the subcats. I'm thinking the new section would be after the display of subcategories and before the display of articles. It might say "Articles that are in category:xxx, category:yyy and category:zzz". I can see some possible uses for not setting the flag to depopulate the category. If there are a fair number of articles about the topic they could remain in the article section. But, I suspect this won't happen much and perhaps this type of category should be discouraged. The more likely use I see for this is a way for Admins to preview what the category will look like without implementing any changes that would be hard to reverse.
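The depopulation step described here (moving an article out of a flagged intersection category and into the categories listed on its subcat page) might look something like the sketch below. It assumes a hypothetical `subcat_pages` structure with a `depopulate` flag standing in for the admin-only flag; nothing here reflects existing software.

```python
# Hypothetical subcat pages with the admin-only "depopulate" flag.
subcat_pages = {
    "African American actors": {
        "categories": ["Actors", "American people", "People of African descent"],
        "depopulate": True,
    },
    "American actors": {
        "categories": ["Actors", "American people"],
        "depopulate": False,  # flag not set: articles are left where they are
    },
}

def move_to_parents(article_cats, pages):
    """Replace flagged intersection categories with their parent categories."""
    result = []
    for cat in article_cats:
        page = pages.get(cat)
        if page is not None and page["depopulate"]:
            targets = page["categories"]
        else:
            targets = [cat]
        for target in targets:
            if target not in result:  # avoid listing a parent twice
                result.append(target)
    return result

before = ["African American actors", "Actors", "1954 births"]
after = move_to_parents(before, subcat_pages)
print(after)
```

An article that was already in one of the parents ends up listed there only once, and categories without a flagged subcat page pass through unchanged.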

For "on the fly" intersections, I don't quite see how this works. I think the list of categories being intersected has to be provided in the URL, which means an "on the fly" intersection would have to have a different URL syntax than a statically created one. I think I like using the new namespace for this rather than using it to parallel the existing category namespace, so perhaps something like "Intersection:" rather than "Category:" could be used for intersection categories. Then, intersection:American people::Actors could mean the intersection of these two categories as a wikilink, leading to the URL http://en.wikipedia.org/wiki/Intersection:American_people::Actors (which might or might not lead to a "previously created" intersection). The ordering issue, where intersection:American people::Actors should be the same as intersection:Actors::American people could be addressed using a completely hidden intersection list like you suggest. In fact, the actual internal name could be in canonical (sorted) order and all other permutations effectively treated as redirects (more like synonyms) to this name (before doing the lookup, the software would parse the URL and then sort the category list, and then do the lookup). Doing it this way would keep intersections completely separate from existing categories (which I think would be a good thing).
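A minimal sketch of the "parse, sort, then look up" step suggested above, assuming the proposed "Intersection:" prefix and "::" separator. The function name `canonical_title` is hypothetical, and real title normalization in MediaWiki involves more than the underscore handling shown here.

```python
NAMESPACE = "Intersection:"
SEPARATOR = "::"

def canonical_title(title):
    """Normalize an intersection title so every ordering maps to one name."""
    if not title.lower().startswith(NAMESPACE.lower()):
        raise ValueError("not an intersection title: " + title)
    body = title[len(NAMESPACE):]
    # URLs use underscores where page titles use spaces.
    names = [part.replace("_", " ").strip() for part in body.split(SEPARATOR)]
    # Drop empty parts and duplicates, then sort into canonical order.
    unique = sorted({name for name in names if name})
    return NAMESPACE + SEPARATOR.join(unique)

from_url = canonical_title("Intersection:American_people::Actors")
from_link = canonical_title("intersection:Actors::American people")
print(from_url)
```

Both spellings resolve to the same internal name, so the other permutations behave like synonyms without any redirect pages having to exist.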

Using "intersection" as a separate namespace resolves the vandalism issue as well. The categories being intersected are embedded in the name, and can't be changed (by anyone). This would mean it wouldn't require any special permissions to create or edit intersections. -- Rick Block (talk) 17:28, 7 August 2006 (UTC)

I think we are on the same page. -- Samuel Wantman 20:13, 7 August 2006 (UTC)
I'm not sure if you picked up on this, but I'm suggesting using the "Intersection:" namespace all the time, even for statically created intersection categories. I think there are enough issues with recasting an existing category as an intersection that we should avoid this. Perhaps an existing category could be turned into a REDIRECT to an intersection, but manipulating all the articles when this is done seems like a pretty big deal. What I've suggested makes intersections truly an add-on feature, related to, but without any direct impact on the existing category feature. One issue we haven't talked about is sort order. If an article (or category) has two different sort keys for the categories that constitute an intersection, what happens? This has to be algorithmically specified, and should not be too complicated. I don't know exactly how this is done with regular categories, but I suspect the article's sort key is stored in some database record associated with the category (in addition to the source category reference, which is in the article). -- Rick Block (talk) 21:01, 7 August 2006 (UTC)
I'm not sure if YOU picked up on this, but what I was proposing all along in this section was that the subcat: or intersection: (or whatever it is called) namespace would be used all the time, even for statically created intersection categories. There doesn't need to be a redirect. We can use the existing category pages and simply transclude the subcat page into a new section. The first thing I proposed at the top of this section was "The mark-up for creating a category intersection will just be the automatic transclusion of a page from a new namespace (sub-category? intersection? I'll use "Subcat" for the examples)." and then later said: "Anyone can create a category intersection by selecting categories listed under an article by checking off the desired categories and then clicking on a link to view the intersection set. The user would then see the subcat page, but would not be able to edit it." But perhaps I'm not understanding what you are getting at. I am saying that all the intersections happen in the new namespace, and if there is a category page with the same name the subcat page gets automatically transcluded. I think we are in agreement. This would truly be an add-on feature of the existing category structure.
As for the sort keys, it is possible to add some parameters for the categories listed in a subcat page so that the software or bot can decide how to sort the articles when doing an intersection. This sort of thing already exists in WP:AWB. I suspect the easiest way to implement this is to simply select which category's sort key will be used. It might be as easy as adding an empty pipe to the category you want to use (e.g. [[Category:American people|]]). Going in the other direction, when the bot or software moves articles to the parent categories the piping can just be copied to all of the parents. Since most of the intersected categories deal with people, most of them will all be piped the same way, so I'm guessing this won't be a big problem. -- Samuel Wantman 01:23, 8 August 2006 (UTC)
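One way to read the "select which category's sort key will be used" suggestion is sketched below. The `sort_keys` structure, the `sorted_members` function and the article titles are placeholders invented for the example; how sort keys are actually stored is not specified in this discussion.

```python
# Hypothetical per-category sort keys: article -> {category: sort key}.
sort_keys = {
    "Zed Adams": {"American people": "Adams, Zed"},
    "Amy Young": {"American people": "Young, Amy"},
}

def sorted_members(members, keys, key_category):
    """Order an intersection using the sort keys of one chosen category.

    Articles with no sort key in that category fall back to their title.
    """
    def sort_key(title):
        return keys.get(title, {}).get(key_category, title)
    return sorted(members, key=sort_key)

members = ["Zed Adams", "Amy Young"]
by_surname = sorted_members(members, sort_keys, "American people")
by_title = sorted_members(members, sort_keys, "Actors")
print(by_surname, by_title)
```

The two calls give different orders, which is why the choice of category has to be specified somewhere, for instance by the empty-pipe marker proposed above.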
If names in the new namespace are "tied" to names in the existing category namespace there has to be some other mechanism to specify (and manipulate, etc.) the list of categories that are intersected. If the name itself specifies the categories to be intersected, this avoids the issue of permission to change the "intersection list" and I think would lead to a simpler implementation. Creating an indirection between the name and the intersected categories works well for existing categories that could be done via intersection (and, likely, for most statically created intersections), but doesn't provide a solution for "on the fly" intersections. Assuming there has to be some solution to "on the fly", I'd reuse the same mechanism for static intersections if there's a way to make it work. On the flip side, this means there wouldn't be a convenient way to do something like the sort key trick you suggest. -- Rick Block (talk) 03:21, 8 August 2006 (UTC)
If the name of the subcat page specifies the categories used to find the intersection set, how do you specify which category uses the subcat page? I think we have to set it up so that everything that needs to be specified (which also needs to be protected) happens in the subcat pages. Our criterion is that average users should be able to do as much as possible. If we have to create a redirect to a subcat page, it has the potential to be abused. That was why I proposed transcluding from a protected namespace. I don't think we want to display the category as being "Category:Actors::American people::People of African descent". This is not going to mean anything to most people. We want to call it "Category:African-American actors" and allow people to edit the page. So I don't see how we can avoid some way of pointing from the category page to the list of categories. If that pointer is not in a protected namespace, we could be creating the potential for widespread vandalism.
The only alternative I can think of at the moment is to switch the name of the subcat page with the list of categories. Perhaps this is what you are suggesting? So instead of the page containing a list of categories, it would contain a link to the category page. Since the name is not identical to the category page, you'd have to create a database to know which subcat page gets displayed. If a category used to create the subcat gets renamed, the page would have to be moved. The subcat page contents could still specify which category to use for piping and still have the switch for depopulating and moving articles. So in essence this is pretty much the same -- we'd just be switching the subcat page name (and URL) with the list of categories. One disadvantage I see is that an admin would need to know the categories used for the intersection to be able to edit the page, or there would have to be a "subcat" tab that only appeared for admins when looking at categories. But the biggest problem is that it becomes much more complicated to prevent the situation where a category ends up with two different subcat pages. It seems more natural to have a one to one mapping. I'm not seeing an advantage to doing it this way.
I guess I don't understand the problem you are trying to solve. It seems you are concerned about what the URL for the on-the-fly intersection set would be. It could get some temporary name, perhaps generated from the categories intersected as you suggest. When the user decides to save it as a category, it would get renamed to the category name (except in the subcat namespace) and the user would then be able to edit the category page. Why is this a problem? -- Samuel Wantman 04:47, 8 August 2006 (UTC)
This is a problem because the fundamental way a wiki works is that there's a URL corresponding to the page the user wants to see. However you pick the categories comprising an (on the fly) intersection, the result is a URL. The URL must include the categories since it's the only thing the web server sees. There basically can't be a "hidden" file (of any sort) that stores the list of categories to be intersected for the "on the fly" case. It could have an invented name, but the list of categories has to be included in the URL. Hence, just invent a namespace (I'm suggesting "intersection:") for this purpose where the name, in this namespace, is a list of categories separated by some specific separation string ("::"). Once we do this, I think we're done. I think this means the name doesn't match any name in "category" namespace. It seems you're trying to ensure there's a correspondence between names in "category" namespace and "subcat" namespace. I think this is not a fruitful approach. This clearly can't be the case for an "on the fly" intersection. So, why not just let the names in the new namespace simply be different? -- Rick Block (talk) 05:16, 8 August 2006 (UTC)
OK. I understand what you are getting at. So I'll write it up as the intersection pages will contain the link to the category page. Functionally it is the same. -- Samuel Wantman 05:49, 8 August 2006 (UTC)