Quality & Importance Metrics

Quality content is the Wikimedia projects' top concern.
GRE essays are assessed by machines.
Systems that have done quality assessment:^[1]

Information Metrics

1. Entropy/Information (compressed size)
2. Normed Entropy (compressed size/word count)
3. Unique 3-grams (unique triplets)
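
The three information metrics above can be sketched roughly as follows. This is a minimal illustration, not a fixed definition: the use of zlib as the compressor, the word tokenizer, and the function name `information_metrics` are all assumptions made for the example.

```python
import re
import zlib


def information_metrics(text: str) -> dict:
    """Rough information metrics for a block of article text (illustrative only)."""
    words = re.findall(r"\w+", text.lower())
    # Compressed size in bytes stands in for entropy / information content.
    compressed_size = len(zlib.compress(text.encode("utf-8"), 9))
    # Normed entropy: compressed bytes per word (0.0 for empty text).
    normed = compressed_size / len(words) if words else 0.0
    # Unique 3-grams: distinct triplets of consecutive words.
    trigrams = {tuple(words[i:i + 3]) for i in range(len(words) - 2)}
    return {
        "word_count": len(words),
        "compressed_size": compressed_size,
        "normed_entropy": normed,
        "unique_trigrams": len(trigrams),
    }
```

Highly repetitive text compresses well and yields few unique 3-grams, so both values drop for padded or boilerplate-heavy articles.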

Linking Structure

WikiLinks
1. Orphan, near orphan
2. Categories
3. Category Relevance (Kullback-Leibler divergence | cross entropy across the category)
4. Incoming Link Count | Outgoing Link Count (Kullback-Leibler divergence | cross entropy | PLSI - across the category)
5. Incoming Link Quality | Outgoing Link Quality (Kullback-Leibler divergence | cross entropy | PLSI - across the category)
Link Rot
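
One way to read "Kullback-Leibler divergence across the category" is to compare an article's distribution (of link targets, or of words) with the pooled distribution of its category. The sketch below assumes that reading. The add-alpha smoothing and the name `kl_divergence` are assumptions introduced for the example.

```python
import math
from collections import Counter


def kl_divergence(article_counts: Counter, category_counts: Counter, alpha: float = 1.0) -> float:
    """D_KL(article || category) in bits, with add-alpha smoothing over the joint vocabulary."""
    vocab = set(article_counts) | set(category_counts)
    if not vocab:
        return 0.0
    a_total = sum(article_counts.values()) + alpha * len(vocab)
    c_total = sum(category_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for item in vocab:
        # Counter returns 0 for unseen items; smoothing keeps q strictly positive.
        p = (article_counts[item] + alpha) / a_total
        q = (category_counts[item] + alpha) / c_total
        divergence += p * math.log2(p / q)
    return divergence
```

A value near zero suggests the article links (or reads) like a typical member of its category. Larger values flag articles that may be miscategorised.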

Stability - Add to Recency

Logic indicates that high-quality articles would become complete and require no more editing. Research indicates that editing can degrade an article's quality.

Edit spikes
Edit time series model (i.e. how long till next edit is expected based on full history|recent 7 edits)
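
A very simple version of the edit time series model predicts the next edit as the last edit plus the mean gap between edits, computed over either the full history or the most recent 7 edits. The sketch below assumes that naive mean-gap model. The function name and the `window` parameter are hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence


def expected_next_edit(timestamps: Sequence[datetime], window: Optional[int] = None) -> Optional[datetime]:
    """Predict the next edit time as last edit + mean gap between edits.

    window=None uses the full history; window=7 uses only the 7 most recent edits.
    """
    ordered = sorted(timestamps)
    if window is not None:
        ordered = ordered[-window:]
    if len(ordered) < 2:
        return None  # not enough history to estimate a gap
    # Mean gap = total span divided by the number of gaps.
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return ordered[-1] + mean_gap
```

Comparing the full-history prediction with the 7-edit prediction gives a crude signal of whether editing is speeding up or settling down.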

Importance Metrics

Cache Traffic Rank.

Verifiability

Citation Metrics.
1. Citation Count.
2. Citation Density.
3. Citation Relevance (Coverage)
4. Citation Quality.
5. BLP Citation Density.
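
Citation count and density can be approximated directly from wikitext. The sketch below counts `<ref>` tags and assumes that "density" means citations per word of prose outside the references. That definition, the regular expression, and the name `citation_metrics` are assumptions for illustration. Template-based citations such as {{sfn}} are not counted.

```python
import re

# Matches <ref ...>...</ref> as well as self-closing <ref name="x" /> reuses.
REF_TAG = re.compile(r"<ref\b[^>]*?(?:/>|>.*?</ref>)", re.IGNORECASE | re.DOTALL)


def citation_metrics(wikitext: str) -> dict:
    """Count <ref> citations and relate them to the amount of prose."""
    citations = REF_TAG.findall(wikitext)
    # Remove reference bodies so their words do not inflate the prose count.
    prose = REF_TAG.sub(" ", wikitext)
    words = re.findall(r"\w+", prose)
    count = len(citations)
    density = count / len(words) if words else 0.0
    return {"citation_count": count, "word_count": len(words), "citation_density": density}
```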

Media

Images.
1. Image Count.
2. Image Caption.
3. image - fair use
4. image - commons
video
audio
Quote
Poem
Source
Template Count.
1. Problem Tags (Inline|Section|Article).
2. Non-Problem Templates.
Table Count and Mass.

Words per editor (mean, standard deviation) for (Article; Talk; Both; Only Talk; Only Article).
Edits per editor (mean, standard deviation) for (Article; Talk; Both; Only Talk; Only Article).
Permission per editor (none, ... , admin)
Edits per editor on Good and Featured articles (mean, standard deviation).
Editors' contribs per namespace (mean, standard deviation).

Calls for Action (Huggle/Twinkle/common content templates)

Total Template Count
Total Inline Template Count
Red Link Count

Conflict (Researchers believe that Conflict is a marker of Quality)

Number of non-vandalism edits reverted, deletions, rollbacks (based on the cosine between the diff vector and the current version)
Edit Wars
Three-revert rule (3RR) violation
AfD nomination
Move event
Semi/Full Protection events
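
The cosine test mentioned for reverted edits could look roughly like this. The sketch assumes a bag-of-words representation in which the "diff vector" is the set of tokens an edit added, and a cosine near zero against the current version suggests the contribution was reverted or removed. The representation and the names `tokens`, `cosine` and `edit_survival` are assumptions, not an established method from the notes.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Bag-of-words token counts for a revision."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def edit_survival(before: str, after: str, current: str) -> float:
    """Cosine between the tokens an edit added and the current revision."""
    # Counter subtraction keeps only positive counts, i.e. tokens the edit added.
    added = tokens(after) - tokens(before)
    return cosine(added, tokens(current))
```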

Coordination and Communication

Talk page Discussion
History Comment length
Inline Template count
Page/Section Template count

Style

The style guide is full of both quality features and sources of small tasks. Many are context sensitive. Is there research on style guide compliance? I think style compliance is what separates articles rated C and below from those rated B and above.

Title compliance
- Title is in sentence case
- Title starts with A, An, The
- Title starts with a noun or noun phrase
- Title ends with punctuation
- Title formatting
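
Some of the title checks above are mechanical enough to sketch. The heuristics below are assumptions for illustration: the sentence-case test only flags titles in which every later word is capitalised (proper nouns will produce false positives), and the noun-phrase check is omitted because it would need a part-of-speech tagger. The name `title_issues` is hypothetical.

```python
import string

LEADING_ARTICLES = ("a ", "an ", "the ")


def title_issues(title: str) -> list:
    """Return a list of possible title-compliance problems (heuristic)."""
    issues = []
    words = title.split()
    if title.lower().startswith(LEADING_ARTICLES):
        issues.append("starts with article")
    # A closing parenthesis is allowed, e.g. disambiguated titles.
    if title and title[-1] in string.punctuation and not title.endswith(")"):
        issues.append("ends with punctuation")
    # Crude sentence-case test: every alphabetic word after the first is capitalised.
    rest = [w for w in words[1:] if w[0].isalpha()]
    if len(rest) >= 2 and all(w[0].isupper() for w in rest):
        issues.append("possibly title case")
    return issues
```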

Hat Notes - must be at start
- {{About|text}} {{redirect}} {{Hatnote|text}} {{Hatnote|text [[link]]}} {{see also?}} {{for?}} {{Other uses}} {{Other uses of}} {{Two other uses}} {{Three other uses}} {{Details}} {{Further?}} {{other people?}} {{other places?}} {{other ships?}} {{other hurricanes}} {{otherArticle}} {{Main}} {{Main list}} {{article preceding}} {{article succeeding}} {{ArticlePair}} {{Cat main}}

Has Lead
Section Titles
Section Templates

{{Main|Article name}} {{Details|Article name}} {{Further|Article name}} {{Related|Article name}} {{also|Article name}} {{Distinguish?}} {{Redirect-distinguish?}}

Appendices in this order:
- Works or Publications
- See also
- Notes and/or References
- Further reading
- External links
- Navigation templates (footer navboxes)
- Geographical coordinates (if not in Infobox) or
- Persondata template
- Defaultsort
- Categories
- Stub templates
- Interlanguage links
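
The ordering of the appendix sections listed above can be checked mechanically once section headings have been extracted. The sketch below covers only the five heading-based appendices. The rank table, the assumption that headings arrive as a list of plain strings, and the name `appendices_in_order` are all illustrative.

```python
# Rank of each standard appendix heading; equal ranks may appear in either order.
APPENDIX_RANK = {
    "works": 0,
    "publications": 0,
    "see also": 1,
    "notes": 2,
    "references": 2,
    "further reading": 3,
    "external links": 4,
}


def appendices_in_order(headings: list) -> bool:
    """True if the recognised appendix headings appear in the standard order."""
    ranks = [
        APPENDIX_RANK[h.strip().lower()]
        for h in headings
        if h.strip().lower() in APPENDIX_RANK
    ]
    # Ranks must never decrease as we move down the page.
    return all(a <= b for a, b in zip(ranks, ranks[1:]))
```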

deprecated items
- horizontal rule ----

Categories included may be disputed (have infoboxes

etc.....

NLP

Lexical

word count
- word count - top 0- 99 most frequent
- word count - top 100- 999 most frequent
- word count - top 1,000-9,999 most frequent
- word count - core
- word count - 3 times or less in Wikipedia
- word count - words unique to article
sentence length (mean,sd)
collocations
compound words
lead to section seme match
typos
style errors
English variants
- US
- UK
- other
- non-English
IPA info

Sentiment^[2]

Sentiment is the most interesting research topic in this area.
It requires sophisticated NLP analysis.
It allows similar articles to be compared by the level of sentiment they express.
- It can be used to discriminate promotional or attack articles from regular ones.

Classify sentences into facts and opinions.
Opinion sentences have
- object
- opinion (implicit or explicit) (comparative or direct)
- emotion
Opinion quintuples $(o_{j}, f_{jk}, oo_{ijkl}, h_{i}, t_{l})$ in a document $d$, where $o_{j}$ is the object, $f_{jk}$ a feature of that object, $oo_{ijkl}$ the opinion orientation, $h_{i}$ the opinion holder and $t_{l}$ the time.

Opinion sentences -
The Primary Emotions^[3]
- love
- joy, surprise, anger, sadness and fear

- cool,ok,sucks,lousy
- too (adj|short|long|expensive)
- is (good|no good|bad|awful)

Semantics

redirects to article

Others

watch listed by users
markup sophistication score
- bold,
- italics,
- templates,
- math,
- ref,
- head1..head5,
- <nowiki>, <noinclude>, <includeonly>,
- timeline,
- links: internal, external, inter-wiki, categories, images, pipe trick
- isbn
- image,
- bullets,lists,indents,
- tables, etc...
microformats
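
A markup sophistication score is not defined in these notes. One simple reading is the number of distinct markup feature types an article uses. The sketch below assumes that reading and covers only some of the features listed above. The regular expressions are approximations of wikitext syntax, and the names `MARKUP_FEATURES`, `markup_profile` and `markup_sophistication` are hypothetical.

```python
import re

# Approximate detectors for a subset of wikitext features.
MARKUP_FEATURES = {
    "bold": re.compile(r"'''"),
    "italics": re.compile(r"(?<!')''(?!')"),
    "template": re.compile(r"\{\{"),
    "math": re.compile(r"<math\b", re.IGNORECASE),
    "ref": re.compile(r"<ref\b", re.IGNORECASE),
    "heading": re.compile(r"^==+.*==+\s*$", re.MULTILINE),
    "internal_link": re.compile(r"\[\[(?!Category:|File:|Image:)", re.IGNORECASE),
    "category": re.compile(r"\[\[Category:", re.IGNORECASE),
    "image": re.compile(r"\[\[(?:File|Image):", re.IGNORECASE),
    "external_link": re.compile(r"\[https?://"),
    "isbn": re.compile(r"\bISBN\b"),
    "list": re.compile(r"^[*#:;]", re.MULTILINE),
    "table": re.compile(r"^\{\|", re.MULTILINE),
}


def markup_profile(wikitext: str) -> dict:
    """Number of matches per markup feature."""
    return {name: len(p.findall(wikitext)) for name, p in MARKUP_FEATURES.items()}


def markup_sophistication(wikitext: str) -> int:
    """Score = number of distinct markup feature types present."""
    return sum(1 for count in markup_profile(wikitext).values() if count > 0)
```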

Readability

Domain experts are important for supplying the scarcest resource: dependable new content with professional referencing. They should not be penalised for other issues such as style. Readability adds a new dimension to quality: rather than aggregating information metrics, it tells us how well written the text is. Writing readable text is more difficult than writing unreadable text.

SMOG (Simple Measure Of Gobbledygook) looks like the recommended metric ^[4]^[5] - requires 30 sentences.
Gunning fog index
Flesch-Kincaid readability tests:
Flesch Reading Ease
Flesch-Kincaid Grade Level
Accelerated Reader ATOS
Automated Readability Index (ARI) - character dependent, fast to calculate, language independent.
Coleman-Liau Index ^[6] - character dependent, fast to calculate, language independent.
Dale-Chall Readability Formula
Fry Readability Formula^[7]
Lexile Framework for Reading
Linsear Write
LIX
Raygor Estimate Graph
Spache Readability Formula

Action

It would be great if we could assess, per edit, whether readability has gone up or down, since more readable articles provide greater value for readers. A repeated decrease in readability should also trigger a request for a copy-editing expert to improve the text.
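
The per-edit check could be sketched as follows, assuming each revision has already been scored with some grade-level formula in which a higher grade means harder text. The run length of 3 and the names `readability_deltas` and `needs_copy_edit` are assumptions for illustration.

```python
def readability_deltas(grade_levels: list) -> list:
    """Change in grade level contributed by each edit (positive = harder to read)."""
    return [after - before for before, after in zip(grade_levels, grade_levels[1:])]


def needs_copy_edit(grade_levels: list, run_length: int = 3) -> bool:
    """True if each of the last `run_length` edits made the text harder to read."""
    recent = readability_deltas(grade_levels)[-run_length:]
    return len(recent) == run_length and all(delta > 0 for delta in recent)
```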

A CSCW system that aims to assist editors to improve readability should be able to highlight the problematic sentences/section using color or other visuals.

Integration

PHP Readability kit ^[8]

Issues

These metrics are English only.
They do not look at syntactic or semantic complexity.
Most of these are language dependent.
To devise a new formula incorporating the above, text for calibrating against different school and university grades would need to be located. Alternatively, we could check against other formulas.

Other Facets

References

[1] Macdonald, N.; Frase, L.; Gingrich, P.; Keenan, S. (1982). "The Writer's Workbench: Computer Aids for Text Analysis". IEEE Transactions on Communications. 30 (1): 105–110. doi:10.1109/TCOM.1982.1095380.
[2] Liu, Bing. "Sentiment Analysis and Subjectivity" (PDF). Retrieved 7 June 2012.
[3] Parrott, W. (2001). Emotions in Social Psychology. Philadelphia: Psychology Press.
[4] Hedman, Amy S. (January 2008). "Using the SMOG formula to revise a health-related document". American Journal of Health Education. 39 (1): 61–64. doi:10.1080/19325037.2008.10599016. Retrieved 2009-01-19.
[5] Ley, P.; Florio, T. (1996). "The use of readability formulas in health care". Psychology, Health & Medicine. 1 (1): 7–28. doi:10.1080/13548509608400003. Retrieved 2010-12-14.
[6] Coleman, M.; Liau, T. L. (1975). "A computer readability formula designed for machine scoring". Journal of Applied Psychology. 60 (2): 283–284. doi:10.1037/h0076540.
[7] Gunning, T. G. (2003). Building Literacy in the Content Areas. Boston: Allyn & Bacon.
[8] Blog Entry on this library
[9] Stvilia, Besiki (07). "An activity theoretic model for information quality change". First Monday. 13 (4).
[10] Rassbach, Laura (07). "Using Natural Language Processing to determine the quality of Wikipedia article". Proceedings of Wikimania 2007.
[11] Dondio, Pierpaolo; Barrett, Stephen (2007). "Computational Trust in Web Content Quality: A Comparative Evaluation on the Wikipedia Project". Informatica (Slovenia). 31 (2): 151–160.
[12] Dalip, Daniel Hasan; Gonçalves, Marcos André; Cristo, Marco; Calado, Pável (2011). "Automatic Assessment of Document Quality in Web Collaborative Digital Libraries". Journal of Data and Information Quality. 2 (3): 1–30. doi:10.1145/2063504.2063507. ISSN 1936-1955.
