User:TronBot/QA
Quality & Importance Metrics
- Quality content is the Wikimedia project's top concern.
- The GRE is assessed by machines.
- Systems that have done quality assessment[1]
Information Metrics
- Entropy/Information (compressed size)
- Normed Entropy (compressed size/word count)
- Unique 3-grams (unique triplets)
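The three metrics above can be sketched as follows. This is only an illustration: using zlib as the compressor, reading "3-grams" as word triplets rather than character triplets, and the function name `information_metrics` are all assumptions, not choices stated on this page.

```python
import re
import zlib


def information_metrics(text: str) -> dict:
    """Rough information metrics for a piece of article text (illustrative)."""
    words = re.findall(r"\w+", text.lower())
    # Compressed size in bytes is used as a cheap stand-in for entropy.
    compressed_size = len(zlib.compress(text.encode("utf-8"), 9))
    # Word 3-grams: every run of three consecutive words, deduplicated.
    trigrams = {tuple(words[i:i + 3]) for i in range(len(words) - 2)}
    return {
        "compressed_size": compressed_size,
        "normed_entropy": compressed_size / len(words) if words else 0.0,
        "unique_trigrams": len(trigrams),
    }
```

Very short texts are dominated by the compressor's fixed overhead, so the normed value is probably only meaningful for articles of reasonable length.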
Linking Structure
- WikiLinks
- Orphan, near orphan
- Categories
- Category Relevance (Kullback-Leibler divergence | cross entropy across the category)
- Incoming Link Count | Outgoing Link Count (Kullback-Leibler divergence | cross entropy | PLSI across the category)
- Incoming Link Quality | Outgoing Link Quality (Kullback-Leibler divergence | cross entropy | PLSI across the category)
- Link Rot
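The Kullback-Leibler comparisons listed above could look roughly like this. The sketch compares an article's token distribution with the pooled distribution of its category; what counts as a token (words, link targets, category names) is left open on this page, and the additive smoothing and the name `kl_divergence` are assumptions.

```python
import math
from collections import Counter


def kl_divergence(article_tokens, category_tokens, smoothing=1.0):
    """KL(article || category) in bits, with additive smoothing (assumed)."""
    vocab = set(article_tokens) | set(category_tokens)
    p_counts = Counter(article_tokens)
    q_counts = Counter(category_tokens)
    p_total = len(article_tokens) + smoothing * len(vocab)
    q_total = len(category_tokens) + smoothing * len(vocab)
    divergence = 0.0
    for token in vocab:
        # Smoothing keeps both probabilities strictly positive.
        p = (p_counts[token] + smoothing) / p_total
        q = (q_counts[token] + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence
```

A value near zero would suggest the article looks like a typical member of its category; larger values suggest it is an outlier.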
Stability - Add to Recency
Logic indicates that high-quality articles would become complete and require no more editing. Research indicates that editing can degrade an article's quality.
- Edit spikes
- Edit time series model (i.e. how long until the next edit is expected, based on the full history | the most recent 7 edits)
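As a minimal sketch of the edit time series idea, the expected wait until the next edit can be estimated as the mean gap between consecutive edits, over either the full history or only the most recent edits. Using the plain mean gap is an assumption (a proper time series model would be more elaborate), and `expected_time_to_next_edit` is a hypothetical name.

```python
from datetime import datetime, timedelta


def expected_time_to_next_edit(timestamps, window=None):
    """Mean gap between consecutive edits, optionally over the last `window` edits."""
    ordered = sorted(timestamps)
    if window is not None:
        ordered = ordered[-window:]
    if len(ordered) < 2:
        return None  # not enough edits to measure a gap
    # Mean gap = total span divided by the number of gaps.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


start = datetime(2012, 1, 1)
edits = [start + timedelta(days=d) for d in (0, 10, 20, 21, 22, 23, 24, 25, 26)]
full_history = expected_time_to_next_edit(edits)
recent_seven = expected_time_to_next_edit(edits, window=7)
```

A recent estimate much shorter than the full-history estimate would be one simple signal of an edit spike.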
Importance Metrics
- Cache Traffic Rank.
Verifiability
- Citation Metrics.
- Citation Count.
- Citation Density.
- Citation Relevance (Coverage)
- Citation Quality.
- BLP Citation Density.
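Citation count and density could be approximated directly from wikitext, as in the sketch below. It assumes citations appear as `<ref>` tags (other citation mechanisms such as shortened footnote templates are ignored) and defines density as citations per 100 words of prose; both the definition and the name `citation_metrics` are assumptions.

```python
import re

REF_OPEN = re.compile(r"<ref\b", re.IGNORECASE)
REF_SPAN = re.compile(r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>",
                      re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def citation_metrics(wikitext: str) -> dict:
    """Count <ref> citations and relate them to the amount of prose."""
    citation_count = len(REF_OPEN.findall(wikitext))
    # Drop the citations themselves and any other tags before counting words.
    prose = ANY_TAG.sub(" ", REF_SPAN.sub(" ", wikitext))
    word_count = len(re.findall(r"\w+", prose))
    density = 100.0 * citation_count / word_count if word_count else 0.0
    return {
        "citation_count": citation_count,
        "word_count": word_count,
        "citations_per_100_words": density,
    }
```

The same function applied only to biographies of living persons would give a BLP citation density.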
Media
- Images.
- Image Count.
- Image Caption.
- image - fair use
- image - commons
- video
- audio
- Quote
- Poem
- Source
- Template Count.
- Problem Tags (Inline|Section|Article).
- Non-Problem Templates.
- Table Count and Mass.
Social
- Words per editor (mean, standard deviation) for (Article; Talk; Both; Only Talk; Only Article).
- Edits per editor (mean, standard deviation) for (Article; Talk; Both; Only Talk; Only Article).
- Permission per editor (none, ... , admin)
- Edits per editor on Good and Featured articles (mean, standard deviation).
- Editor contributions per namespace (mean, standard deviation).
Calls for Action (Huggle/Twinkle/common content templates)
- Total Template Count
- Total Inline Template Count
- Red Link Count
Conflict (Researchers believe that Conflict is a marker of Quality)
- Number of non-vandalism edits reverted, deletions, rollbacks (detected using the cosine between the diff vector and the current version)
- Edit Wars
- Three-revert rule (3RR) violations
- AfD nomination
- Move event
- Semi/Full Protection events
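One possible reading of the cosine measure in the first bullet is sketched below: the words an edit added form a diff vector, and its cosine with the current version indicates how much of the edit survived. This interpretation, the bag-of-words representation, and the names `diff_vector` and `edit_survival` are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def diff_vector(before: str, after: str) -> Counter:
    """Words added by an edit (multiset difference, additions only)."""
    return Counter(after.split()) - Counter(before.split())


def edit_survival(before: str, after: str, current: str) -> float:
    """Close to 0 when the edit's additions are gone from the current text."""
    return cosine(diff_vector(before, after), Counter(current.split()))


before = "cats purr"
after = "cats purr and dogs bark"
reverted = edit_survival(before, after, current=before)
kept = edit_survival(before, after, current=after)
```

An edit whose survival score drops to zero in the very next revision, without being flagged as vandalism, would count toward the reverted-edit tally.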
Coordination and Communication
- Talk page Discussion
- History Comment length
- Inline Template count
- Page/Section Template count
Style
The style guide is full of both quality features and sources of small tasks. Many are context sensitive. Is there research on style guide compliance? I think style compliance is what separates articles rated C and below from those rated B and above.
- Title compliance
- Title is in sentence case
- Title starts with A, An, The
- Title starts with a noun/noun phrase
- Title ends with punctuation
- Title formatting
- Hat Notes - must be at start
- {{About|text}} {{redirect}} {{Hatnote|text}} {{Hatnote|text [[link]]}} {{see also?}} {{for?}} {{Other uses}} {{Other uses of}} {{Two other uses}} {{Three other uses}} {{Details}} {{Further?}} {{other people?}} {{other places?}} {{other ships?}} {{other hurricanes}} {{otherArticle}} {{Main}} {{Main list}} {{article preceding}} {{article succeeding}} {{ArticlePair}} {{Cat main}}
- Has Lead
- Section Titles
- Section Templates
{{Main|Article name}} {{Details|Article name}} {{Further|Article name}} {{Related|Article name}} {{also|Article name}} {{Distinguish?}} {{Redirect-distinguish?}}
- Appendices in this order:
- Works or Publications
- See also
- Notes and/or References
- Further reading
- External links
- Navigation templates (footer navboxes)
- Geographical coordinates (if not in Infobox) or
- Persondata template
- Defaultsort
- Categories
- Stub templates
- Interlanguage links
- deprecated items
- horizontal rule ----
Categories included may be disputed (have infoboxes, etc.)
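Two of the style checks above are simple enough to sketch: whether a title starts with an article or ends with punctuation, and whether the standard appendices appear in the listed order. The function names, the punctuation set, and the choice to treat "Works"/"Publications" and "Notes"/"References" as sharing a position are assumptions; checks such as "starts with a noun phrase" would need real NLP and are not attempted.

```python
APPENDIX_RANK = {
    "works": 0, "publications": 0,
    "see also": 1,
    "notes": 2, "references": 2,
    "further reading": 3,
    "external links": 4,
}


def title_issues(title: str) -> list:
    """Flag a few mechanical title problems for human review."""
    issues = []
    stripped = title.strip()
    words = stripped.split()
    if words and words[0].lower() in {"a", "an", "the"}:
        issues.append("starts with an article")
    # Closing brackets are allowed because of disambiguators like "(planet)".
    if stripped and stripped[-1] in ".,;:!?":
        issues.append("ends with punctuation")
    return issues


def appendices_in_order(section_titles) -> bool:
    """True when the standard appendices appear in the listed order."""
    ranks = [APPENDIX_RANK[t.strip().lower()]
             for t in section_titles
             if t.strip().lower() in APPENDIX_RANK]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))
```

Since many style rules are context sensitive, a flagged title is a candidate small task for an editor rather than a definite error.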
NLP
Lexical
- word count
- word count - top 0-99 most frequent
- word count - top 100-999 most frequent
- word count - top 1,000-9,999 most frequent
- word count - core
- word count - 3 times or less in Wikipedia
- word count - words unique to article
- sentence length (mean,sd)
- collocations
- compound words
- lead to section seme match
- typos
- style errors
- English variants
- US
- UK
- other
- non-English
- IPA info
- Sentiment is the most interesting research topic in this area.
- It requires sophisticated NLP analysis.
- It allows similar articles to be compared by the level of sentiment they express.
- It can be used to distinguish promotional or attack articles from regular ones.
- Classify sentences into facts and opinions.
- Opinion sentences have
- object
- opinion (implicit or explicit)(comparative or direct)
- emotion
- opinion quintuples in d
- Opinion sentences -
- The Primary Emotions[3]
- love
- joy, surprise, anger, sadness and fear
- cool,ok,sucks,lousy
- too (adj|short|long|expensive)
- is (good|no good|bad|awful)
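The simpler lexical counts near the top of this list (word counts by frequency band, sentence length mean and standard deviation) might be computed as in this sketch. The `rank` mapping from word to corpus frequency rank is a hypothetical input that would have to be built from Wikipedia itself, and the sentence splitter is deliberately naive.

```python
import re
import statistics

BANDS = ((100, "top 0-99"), (1000, "top 100-999"), (10000, "top 1,000-9,999"))


def frequency_band_counts(words, rank):
    """Count words per frequency band; `rank` maps word -> 0-based corpus rank."""
    counts = {label: 0 for _, label in BANDS}
    counts["rarer"] = 0
    for word in words:
        r = rank.get(word.lower())
        label = "rarer"
        if r is not None:
            for upper, band_label in BANDS:
                if r < upper:
                    label = band_label
                    break
        counts[label] += 1
    return counts


def sentence_length_stats(text):
    """Mean and population standard deviation of sentence length in words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    if not lengths:
        return 0.0, 0.0
    return statistics.mean(lengths), statistics.pstdev(lengths)
```

The sentiment items lower in the list need far more than counting and are not covered by this sketch.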
Semantics
- redirects to article
Others
- watchlisted by users
- markup sophistication score
- bold,
- italics,
- templates,
- math,
- ref,
- head1..head5,
- <nowiki>, <noinclude>, <includeonly>,
- timeline,
- links: internal, external, interwiki, categories, images, pipe trick
- ISBN
- image,
- bullets,lists,indents,
- tables, etc...
- microformats
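A markup sophistication score could be as simple as counting how many distinct kinds of markup an article uses. The sketch below covers only some of the features listed; the regular expressions are rough approximations of wikitext syntax, and scoring by the number of feature types present is an assumption.

```python
import re

FEATURES = {
    "bold": re.compile(r"(?<!')'''(?!')"),
    "italics": re.compile(r"(?<!')''(?!')"),
    "templates": re.compile(r"\{\{"),
    "math": re.compile(r"<math\b", re.IGNORECASE),
    "ref": re.compile(r"<ref\b", re.IGNORECASE),
    "headings": re.compile(r"^==+.*==+\s*$", re.MULTILINE),
    "internal links": re.compile(r"\[\["),
    "external links": re.compile(r"\[https?://", re.IGNORECASE),
    "tables": re.compile(r"^\{\|", re.MULTILINE),
    "lists": re.compile(r"^[*#:;]", re.MULTILINE),
}


def markup_sophistication(wikitext: str):
    """Return (score, per-feature counts); score = number of feature types used."""
    counts = {name: len(rx.findall(wikitext)) for name, rx in FEATURES.items()}
    score = sum(1 for n in counts.values() if n > 0)
    return score, counts


sample = ("== History ==\n"
          "'''Paris''' is the ''capital'' of [[France]].<ref>Atlas</ref>\n"
          "* item\n")
score, counts = markup_sophistication(sample)
```

Weighting rarer features such as math or timelines more heavily would be an obvious refinement.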
Readability
Domain experts are important for supplying the scarcest resource: dependable new content with professional referencing. They should not be penalised for other issues such as style. Readability adds a new dimension to quality: rather than aggregating information metrics, it tells us how well written the text is. Writing readable text is more difficult than writing unreadable text.
- SMOG (Simple Measure Of Gobbledygook) looks like the recommended metric [4][5] - requires 30 sentences.
- Gunning fog index
- Flesch-Kincaid readability tests:
- Flesch Reading Ease
- Flesch-Kincaid Grade Level
- Accelerated Reader ATOS
- Automated Readability Index (ARI) - character dependent, fast to calculate, language independent.
- Coleman-Liau Index [6] - character dependent, fast to calculate, language independent.
- Dale-Chall Readability Formula
- Fry Readability Formula[7]
- Lexile Framework for Reading
- Linsear Write
- LIX
- Raygor Estimate Graph
- Spache Readability Formula
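The two character-based formulas in the list are the easiest to compute because they need no syllable counting. The sketch below uses the published coefficients for ARI and Coleman-Liau; the tokenisation (regex words, every run of ".", "!" or "?" ends a sentence) is a crude assumption, and the function names are illustrative.

```python
import re


def text_counts(text: str):
    """Return (letters_and_digits, words, sentences) using naive tokenisation."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        raise ValueError("text contains no words")
    characters = sum(len(w.replace("'", "")) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return characters, len(words), sentences


def automated_readability_index(text: str) -> float:
    chars, words, sentences = text_counts(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text: str) -> float:
    letters, words, sentences = text_counts(text)
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8
```

Both return an approximate US school grade level. Syllable-based formulas such as SMOG or Flesch-Kincaid would additionally need a syllable counter.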
Action
It would be great if we could assess, per edit, whether readability has gone up or down, since more readable articles provide greater value for readers. Also, a repeated decrease in readability should trigger a request for a copy-editing expert to improve the text.
A CSCW system that aims to assist editors in improving readability should be able to highlight the problematic sentences/sections using color or other visuals.
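A per-edit check of this kind might look like the sketch below, which uses the Automated Readability Index because it needs no syllable counts. The choice of ARI, the threshold of 14, the naive sentence splitting, and the names `readability_change` and `hardest_sentences` are all assumptions.

```python
import re


def ari(text: str) -> float:
    """Automated Readability Index with naive tokenisation; 0.0 for empty text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(w.replace("'", "")) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


def readability_change(before: str, after: str) -> float:
    """Positive result: the edit made the text harder to read."""
    return ari(after) - ari(before)


def hardest_sentences(text: str, threshold: float = 14.0) -> list:
    """Sentences whose own ARI exceeds the (hypothetical) threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s and ari(s) > threshold]


simple = "The cat sat on the mat."
dense = ("The domesticated feline positioned itself upon the "
         "aforementioned floor covering.")
change = readability_change(simple, dense)
flagged = hardest_sentences(simple + " " + dense)
```

The flagged sentences are what an editing interface could highlight, and a run of positive changes across successive edits could trigger the copy-editing request.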
Integration
Issues
- These metrics are English only.
- These do not look at syntax or semantic complexity.
- Most of these are language dependent.
- To cook up a new formula incorporating the above, text for calibrating against different school and university grades would need to be located. Or we could check against other formulas.
Other Facets
See also
References
[edit]- ^ Macdonald, N. ; Frase, L. Gingrich, P. ; Keenan, S. (1982). "The Writer's Workbench: Computer Aids for Text Analysis". Communications, IEEE Transactions on. 30 (1): 105–110. doi:10.1109/TCOM.1982.1095380.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ^ Liu, Bing. "Sentiment Analysis and Subjectivity" (PDF). Retrieved 7 June 2012.
- ^ W. Parrott. Emotions in Social Psychology, Psychology Press, Philadelphia, 2001.
- ^ Hedman, Amy S. (January 2008). "Using the SMOG formula to revise a health-related document". American Journal of Health Education. 39 (1): 61–64. doi:10.1080/19325037.2008.10599016. Retrieved 2009-01-19.
{{cite journal}}
: CS1 maint: date and year (link) - ^ Ley, P.; Florio, T. (1996). "The use of readability formulas in health care". Psychology, Health & Medicine. 1 (1): 7–28. doi:10.1080/13548509608400003. Retrieved 2010-12-14.
{{cite journal}}
: Unknown parameter|month=
ignored (help)CS1 maint: date and year (link) - ^ Coleman, M.; and Liau, T. L. (1975). "A computer readability formula designed for machine scoring'". Journal of Applied Psychology. 60 (2): 283–284. doi:10.1037/h0076540.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ^ Gunning, T. G. (2003). Building Literacy in the Content Areas. Boston: Allyn & Baco.
- ^ Blog Entry on this library
- ^ Stvilia, Besiki (07). "An activity theoretic model for information quality change". First Monday. 13 (4).
{{cite journal}}
: Check date values in:|date=
and|year=
/|date=
mismatch (help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help); Unknown parameter|month=
ignored (help) - ^ Rassbach, Laura (07), "Using Natural Language Processing to determine the quality of Wikipedia article", Proceeding of Wikimania 2007
{{citation}}
: Check date values in:|date=
and|year=
/|date=
mismatch (help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help) - ^ Pierpaolo Dondio, Stephen Barrett: Computational Trust in Web Content Quality: A Comparative Evalutation on the Wikipedia Project. Informatica (Slovenia) 31(2): 151-160 (2007)
- ^ Dalip, Daniel Hasan; Gonçalves, Marcos André; Cristo, Marco; Calado, Pável (2011). "Automatic Assessment of Document Quality in Web Collaborative Digital Libraries". Journal of Data and Information Quality. 2 (3): 1–30. doi:10.1145/2063504.2063507. ISSN 1936-1955.