Synthetic minority oversampling technique
In statistics, the synthetic minority oversampling technique (SMOTE) is a method of oversampling used when dealing with imbalanced classes in a dataset. In contrast to undersampling, which is also used for imbalanced datasets, SMOTE generates new synthetic samples of the minority class.[1][2]
Algorithm
The SMOTE algorithm can be abstracted with the following pseudocode:[2]
if N < 100 then
    Randomize the T minority class samples
    T = (N/100) ∗ T
    N = 100
endif
N = (int)(N/100)
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[ ][ ]: array for original minority class samples
newindex: keeps a count of the number of synthetic samples generated, initialized to 0
Synthetic[ ][ ]: array for synthetic samples
for i ← 1 to T
    Compute the k nearest neighbors for i, and save the indices in nnarray
    Populate(N, i, nnarray)
endfor

Populate(N, i, nnarray):
    while N ≠ 0
        Choose a random number between 1 and k, call it nn
        for attr ← 1 to numattrs
            Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
            Compute: gap = random number between 0 and 1
            Synthetic[newindex][attr] = Sample[i][attr] + gap ∗ dif
        endfor
        newindex = newindex + 1
        N = N − 1
    endwhile
    return
where
- N is the amount of SMOTE, expressed as a percentage and assumed to be an integral multiple of one hundred
- T is the number of minority class samples
- k is the number of nearest neighbors
- Populate() is the generating function for new synthetic minority samples
If N is less than 100%, the minority class samples are first randomized, since only a random subset of them will have SMOTE applied.
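The core loop above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `smote` and its signature are chosen here for clarity, the N < 100% randomization branch is omitted, and nearest neighbors are found by brute-force distance computation.

```python
import numpy as np

def smote(X, n_percent=200, k=5, rng=None):
    """Generate synthetic minority class samples (sketch of Chawla et al. 2002).

    X: (T, numattrs) array of minority class samples.
    n_percent: amount of SMOTE as a percentage, a multiple of 100.
    k: number of nearest neighbors to interpolate toward.
    """
    rng = np.random.default_rng(rng)
    T, numattrs = X.shape
    n = n_percent // 100                  # synthetic samples per original sample
    synthetic = np.empty((n * T, numattrs))
    newindex = 0
    for i in range(T):
        # k nearest neighbors of sample i (index 0 is the sample itself)
        dists = np.linalg.norm(X - X[i], axis=1)
        nnarray = np.argsort(dists)[1:k + 1]
        for _ in range(n):
            nn = rng.integers(k)          # pick one of the k neighbors at random
            dif = X[nnarray[nn]] - X[i]   # vector toward the chosen neighbor
            gap = rng.random()            # random point along that segment
            synthetic[newindex] = X[i] + gap * dif
            newindex += 1
    return synthetic
```

Each synthetic sample lies on the line segment between an original minority sample and one of its k nearest minority neighbors, which is what distinguishes SMOTE from simply duplicating minority samples.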
Variations
Two variations to the SMOTE algorithm were proposed in the initial SMOTE paper:[2]
- SMOTE-NC: applies to datasets with a mix of nominal and continuous data
- SMOTE-N: accounts for nominal features, with the nearest neighbors algorithm being computed using the modified version of Value Difference Metric (VDM), which looks at the overlap of feature values over all feature vectors
Other variations include:[3]
- ADASYN: use a weighted distribution for different minority class examples according to their level of difficulty in learning[4][5]
- Borderline-SMOTE: only the minority examples near the borderline are over-sampled[4][6]
- SMOTE-Tomek: applying Tomek links to the oversampled training set as a data cleaning step to remove samples overlapping the category boundaries[7]
- SMOTE-ENN: uses the Edited Nearest Neighbor Rule, which removes any example whose class label differs from the class of at least two of its three nearest neighbors[7]
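The Edited Nearest Neighbor cleaning step used by SMOTE-ENN can be sketched as follows. This is an illustrative fragment, not the library implementation: the function name `enn_clean` is hypothetical, and it applies only the ENN rule (removing samples whose label disagrees with the majority of their three nearest neighbors) to an already-resampled dataset.

```python
import numpy as np

def enn_clean(X, y, k=3):
    """Edited Nearest Neighbor rule (sketch): drop any sample whose class
    differs from the class of at least two of its three nearest neighbors."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]       # k nearest neighbors, excluding self
        # keep the sample only if its label matches the neighborhood majority
        if np.sum(y[nn] == y[i]) >= (k + 1) // 2:
            keep.append(i)
    return X[keep], y[keep]
```

Applied after oversampling, this removes synthetic or original samples that sit inside the opposing class's region, cleaning the category boundary.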
Limitations
SMOTE comes with some limitations and challenges:[8]
- Overfitting during the training process
- Favorable results during model evaluation that may not translate to practical use
- Synthetically created samples that fall within the region of a different class
- Synthetic data that does not match the original distribution of the minority class
Implementations
- imbalanced-learn (Python)
- smote_variants: implementation of 86 SMOTE variations (Python)
- ImbalancedLearningRegression (Python)
References
- ^ Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2011-06-09), SMOTE: Synthetic Minority Over-sampling Technique, arXiv, doi:10.48550/arXiv.1106.1813, arXiv:1106.1813, retrieved 2025-07-16
- ^ a b c Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2002-06-01). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research. 16: 321–357. doi:10.1613/jair.953. ISSN 1076-9757.
- ^ "Over-sampling methods — Version 0.13.0". imbalanced-learn.org. Retrieved 2025-07-16.
- ^ a b Elreedy, Dina; Atiya, Amir F. (2019-12-01). "A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance". Information Sciences. 505: 32–64. doi:10.1016/j.ins.2019.07.070. ISSN 0020-0255.
- ^ He, Haibo; Bai, Yang; Garcia, Edwardo A.; Li, Shutao (2008-06-01). "ADASYN: Adaptive synthetic sampling approach for imbalanced learning". 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE: 1322–1328. doi:10.1109/ijcnn.2008.4633969.
- ^ Han, Hui; Wang, Wen-Yuan; Mao, Bing-Huan (2005-08-23). "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning". Proceedings of the 2005 international conference on Advances in Intelligent Computing - Volume Part I. ICIC'05. Part I. Berlin, Heidelberg: Springer-Verlag: 878–887. doi:10.1007/11538059_91. ISBN 978-3-540-28226-6.
- ^ a b Batista, Gustavo E. A. P. A.; Prati, Ronaldo C.; Monard, Maria Carolina (2004-06-01). "A study of the behavior of several methods for balancing machine learning training data". SIGKDD Explor. Newsl. 6 (1): 20–29. doi:10.1145/1007730.1007735. ISSN 1931-0145.
- ^ Alkhawaldeh, Ibraheem M.; Albalkhi, Ibrahem; Naswhan, Abdulqadir Jeprel (2023-12-20). "Challenges and limitations of synthetic minority oversampling techniques in machine learning". World Journal of Methodology. 13 (5): 373–378. doi:10.5662/wjm.v13.i5.373. PMC 10789107.