Local case-control sampling

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by KSFT (talk | contribs) at 20:24, 12 June 2015 (fixed grammar/formatting, added a category). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In machine learning, local case-control sampling is an algorithm used to reduce the complexity of training a logistic regression classifier. It does so by selecting a small subsample of the original dataset for training. The algorithm assumes the availability of a (possibly unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset, using the pilot estimate to identify the most "surprising" samples, that is, those whose observed labels the pilot model considers unlikely. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods such as case-control sampling and weighted case-control sampling.[1]
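The single pass described above can be illustrated with a minimal sketch. It assumes the acceptance rule and correction step as given in the cited paper: each sample (x, y) is kept with probability |y − p̃(x)|, where p̃(x) is the pilot's predicted probability, a logistic regression is fitted to the kept samples, and the pilot estimate is added back to the fitted coefficients. The function names (`sigmoid`, `fit_logistic`, `local_case_control`) and the plain Newton solver are illustrative choices, not part of any library, and the design matrix is assumed to already contain an intercept column.

```python
import numpy as np


def sigmoid(z):
    # Clip to avoid overflow warnings for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson logistic regression (illustrative; no line search).

    X is assumed to already include an intercept column if one is wanted.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Hessian of the negative log-likelihood, with a tiny ridge
        # term for numerical stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        hess += ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta


def local_case_control(X, y, theta_pilot, rng):
    """One pass of local case-control subsampling (sketch).

    Returns the corrected estimate and the boolean mask of kept samples.
    """
    # Pilot probability that y = 1 for every sample.
    p_pilot = sigmoid(X @ theta_pilot)
    # "Surprise": large when the pilot assigns low probability to the
    # observed label.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    # Fit on the subsample, then add the pilot back to undo the bias
    # introduced by the label-dependent sampling.
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep
```

A pilot could be obtained, for example, by calling `fit_logistic` on a small uniform subsample and passing the result as `theta_pilot`. On a strongly imbalanced dataset, the mask `keep` then typically selects only a small fraction of the rows.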

  1. Fithian, William; Hastie, Trevor (2014). "Local case-control sampling: Efficient subsampling in imbalanced data sets". The Annals of Statistics. 42 (5): 1693–1724.