Simulation study on the performance of robust outlier labelling methods

Date
2023-10
Publisher
Kampala International University, College of Economics and Management
Abstract
The identification and labeling of outliers plays a crucial role in data analysis and modeling. Robust outlier labeling methods aim to identify observations that deviate markedly from the bulk of the data while remaining resilient to noise, measurement error, and data corruption. In this simulation study, we evaluate the performance of several robust outlier labeling methods on synthetic datasets. The simulation setup specifies the characteristics of each dataset: the number of variables, the sample size, the distributional assumptions, and the proportion of outliers. Synthetic datasets were generated from these specifications and contain both non-outlying observations and outliers with known characteristics. A set of robust outlier labeling methods was selected for evaluation, implemented in a programming language, and applied to each generated dataset. Effectiveness was assessed with accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC), all computed against the known labels of the synthetic outliers. The recorded results were analyzed and compared to identify the strengths and limitations of each method in terms of accuracy, robustness, and computational efficiency. To check the reliability of the findings, the study was repeated under different simulation setups and datasets, and the consistency of the results across repetitions was verified.
Based on these findings, conclusions were drawn about the performance of the evaluated methods, and the most effective methods for the dataset characteristics considered in the study were identified. The results can guide researchers, practitioners, and data analysts in choosing appropriate outlier labeling methods for their own analysis and modeling tasks. In summary, this simulation study contributes to the understanding of how robust outlier labeling methods perform and provides a systematic framework for comparing and selecting methods in the presence of outliers.
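The workflow described in the abstract (generate data with known outliers, apply each labelling rule, score the labels, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the methods or the programming language, so the two rules shown here (Tukey's boxplot fences and a median/MAD rule), the univariate normal-plus-shift contamination model, and all function names and parameter values are hypothetical stand-ins.

```python
import random
import statistics


def simulate(n=200, outlier_frac=0.05, shift=8.0, seed=0):
    """Draw n points: N(0, 1) inliers plus a known fraction of shifted outliers.

    The contamination model (mean shift of `shift`) is an assumption
    made for illustration only.
    """
    rng = random.Random(seed)
    n_out = int(round(n * outlier_frac))
    values = [rng.gauss(0.0, 1.0) for _ in range(n - n_out)]
    labels = [False] * (n - n_out)
    values += [rng.gauss(shift, 1.0) for _ in range(n_out)]
    labels += [True] * n_out
    return values, labels


def tukey_fences(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]


def mad_rule(values, cutoff=3.0):
    """Flag points whose median/MAD-based robust z-score exceeds cutoff.

    Assumes continuous data, so the MAD is not zero.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to the SD under normality
    return [abs(v - med) / scale > cutoff for v in values]


def precision_recall_f1(truth, flagged):
    """Score predicted outlier flags against the known simulated labels."""
    tp = sum(t and f for t, f in zip(truth, flagged))
    fp = sum((not t) and f for t, f in zip(truth, flagged))
    fn = sum(t and (not f) for t, f in zip(truth, flagged))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def run_study(methods, n_reps=100, **setup):
    """Average each method's metrics over n_reps independently seeded datasets."""
    totals = {name: [0.0, 0.0, 0.0] for name in methods}
    for rep in range(n_reps):
        values, truth = simulate(seed=rep, **setup)
        for name, method in methods.items():
            scores = precision_recall_f1(truth, method(values))
            for i, score in enumerate(scores):
                totals[name][i] += score
    return {name: tuple(s / n_reps for s in sums) for name, sums in totals.items()}


# Hypothetical method registry; the thesis may have compared different rules.
METHODS = {"tukey": tukey_fences, "mad": mad_rule}

if __name__ == "__main__":
    results = run_study(METHODS, n_reps=50, n=200, outlier_frac=0.05)
    for name, (p, r, f1) in results.items():
        print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Varying `n`, `outlier_frac`, and `shift` across calls to `run_study` would correspond to the "different simulation setups" mentioned in the abstract. Accuracy, AUC-ROC, and timing are omitted here to keep the sketch short.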
Description
A research thesis submitted to the School of Mathematics in partial fulfillment of the requirements for the award of the Master of Science in Statistics of Kampala International University.
Keywords
Simulation study, Performance, Robust outlier labelling methods