User:ConvertibleFlyingSaucer/Evaluate an Article
Evaluate an article
Complete your article evaluation below. Here are the key aspects to consider:
Lead section
A good lead section defines the topic and provides a concise overview. A reader who just wants to identify the topic can read the first sentence. A reader who wants a very brief overview of the most important things about it can read the first paragraph. A reader who wants a quick overview can read the whole lead section.
Content
A good Wikipedia article should cover all the important aspects of a topic, without putting too much weight on one part while neglecting another.
Tone and Balance
Wikipedia articles should be written from a neutral point of view; if there are substantial differences of interpretation or controversies among published, reliable sources, those views should be described as fairly as possible.
Sources and References
A Wikipedia article should be based on the best sources available for the topic at hand. When possible, this means academic and peer-reviewed publications or scholarly books.
Organization and writing quality
The writing should be clear and professional, the content should be organized sensibly into sections.
Images and Media
Talk page discussion
The article's talk page, and any discussions among other Wikipedia editors that have been taking place there, can be a useful window into the state of an article, and might help you focus on important aspects that you didn't think of.
Overall impressions
Examples of good feedback
A good article evaluation can take a number of forms. The most essential things are to clearly identify the biggest shortcomings, and provide specific guidance on how the article can be improved.
Which article are you evaluating?
Why you have chosen this article to evaluate?
I was looking for a topic relevant to my master's program. This article was a good fit, since we have just covered leakage in an introductory Machine Learning class, although only at a surface level. Leakage is a problem that can affect any Machine Learning model, and it is not too complex: understanding it requires neither deep expertise in the field nor extensive knowledge of many concepts, unlike a topic such as Transformers. However, its relevance cannot be overstated since, as the article states, it has led to reproducibility issues in many papers.
My impression after reading was that the article was quite brief. At first, this made me think that perhaps this was all that needed to be said about Leakage, but after reading other articles, even those on topics that might seem "simple", I saw that they could be quite lengthy. I therefore suspect that it has been difficult to gather information on this topic from the type of sources Wikipedia recommends, and that the article is still a work in progress.
Evaluate the article
After evaluating the recommended points and answering the questions provided as a guide, here is my evaluation of the most relevant changes to be made:
- In the "Leakage modes" section, the first paragraph mentions sub-classifying leakage causes, but it is not clear what the main classification criterion is. If one exists, the article would benefit from making it explicit and naming the main classes.
- An outline of how the different stages of a Machine Learning process might suffer from leakage would also be a useful expansion.
- The example provided on "Future Leakage" could be visualized in a table that has the features, the undesired features that cause the leakage, and the ground truth, so that we might have an example for users that resembles a real case.
- Under Training Example Leakage, the article only covers row-wise leakage. I am not certain whether this is the only form of Training Example Leakage, but if it is not, we could mention other types and give some examples.
- "Premature featurization" has neither a clear definition nor examples; the article defines it by naming the term again. This should be rewritten to clearly describe what this row-wise leakage type is.
- The other types of row-wise leakage also need rewriting; as it stands, the text reads like an extended side comment, meant not to explain these types but to remind someone who is already familiar with them. It would be better to rewrite it so that each type is defined and explained clearly with examples.
- The paragraph at the end of "Leakage modes" should perhaps be moved to a separate section that addresses how Data Leakage has contributed to the Reproducibility Crisis. However, the Reproducibility Crisis should not be addressed in its entirety here, as it involves other concerns, such as the reliability of p-values given sample sizes.
- The "Detection" section could use images as well, detailing an example of performance with leakage vs performance without it.
- In general, the article could add links to concepts mentioned that are relevant in Machine Learning and have Wikipedia articles, such as "data pipeline", "pre-processing", "feature engineering", and "time series".
- The second paragraph of the "Detection" section could benefit from examples of what counter-intuitive features and unexpected patterns look like.
- Some of the sources are a bit dated and could be supplemented with more current articles that strengthen and validate the concepts they define, such as the 2011 article used for the definition in the Lead section, or the 2008 article used for the "Feature Leakage" section.
- The sources from IBM's website and a Twitter comment should be replaced with the articles that these sources themselves rely on. This would first require consulting those articles to check that they provide the same information, and then rewriting the original paragraphs accordingly.
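
As an illustration of the kind of example I have in mind for "premature featurization", here is a minimal sketch. It assumes my reading of the term is correct, namely that a preprocessing step (here, standardization) is fitted on all rows before the train/test split. The toy numbers and the helper name `standardize` are hypothetical and are not taken from the article or its sources.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Scale values using an externally supplied mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy feature column; the last two rows play the role of the test split.
feature = [1.0, 2.0, 3.0, 4.0, 100.0, 110.0]
train, test = feature[:4], feature[4:]

# Leaky version: the statistics are computed on ALL rows before splitting,
# so information about the test rows flows into the training features.
leaky_mu, leaky_sigma = mean(feature), pstdev(feature)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Clean version: the statistics are computed on the training rows only and
# are then reused, unchanged, to transform the test rows.
clean_mu, clean_sigma = mean(train), pstdev(train)
clean_train = standardize(train, clean_mu, clean_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)

print("mean used (leaky):", leaky_mu)
print("mean used (clean):", clean_mu)
```

The two versions differ only in which rows the statistics are computed from, which is the point a rewritten definition could make explicit before moving on to the other row-wise leakage types.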