Algoritmo para la selección de instancias en problemas de clasificación basado en arreglos de cobertura

Solarte Martínez, Jhonattan

dc.contributor.author	Solarte Martínez, Jhonattan
dc.date.accessioned	2022-06-16T15:51:23Z
dc.date.available	2022-06-16T15:51:23Z
dc.date.issued	2019
dc.identifier.uri	http://repositorio.unicauca.edu.co:8080/xmlui/handle/123456789/4163
dc.description.abstract	La extracción de conocimiento en bases de datos soporta la toma de decisiones y la mejora de diversas tareas del quehacer humano. Un Modelo de Aprendizaje Supervisado (MAS) se obtiene de un conjunto de datos, que representan situaciones específicas de la vida cotidiana, los cuales han sido recopilados y registrados (datos de entrenamiento) con dos componentes principales: una variable de interés denominada variable de salida, objetivo o de respuesta y otras variables denominadas variables de entrada o explicativas. El proceso de obtención de un MAS para un problema del mundo real tiene una fase de “Preprocesamiento de los datos”, donde se aplica la selección de instancias (reducción de datos), si el dataset lo amerita. La selección de instancias busca obtener datos de calidad que, al ser utilizados en la etapa de modelamiento, permitan definir modelos de calidad similar o superior con menor cantidad de datos, en menor tiempo y con menor esfuerzo computacional. Los Covering Arrays (CA) o arreglos de cobertura son objetos matemáticos que han sido usados en el diseño de experimentos con aplicaciones en biología, análisis de fallas en ingeniería y recientemente en las pruebas de calidad de software y hardware. Sin embargo, a la fecha no han sido utilizados para la selección de instancias en datasets. Por lo anterior, en esta investigación se propone un algoritmo basado en Covering Arrays binarios, que permite seleccionar instancias o filas de datasets que se utilizan en procesos de clasificación (aprendizaje supervisado), propuesta denominada ISCA. La evaluación de la propuesta se realizó con 26 datasets del repositorio de aprendizaje de máquina de la Universidad de California en Irvine, usando dos medidas de comparación: la calidad de la clasificación (menor porcentaje de error en la clasificación) y el porcentaje de reducción de instancias. 
Los datasets reducidos con ISCA se usaron con dos clasificadores (1NN y C4.5) y la calidad de los modelos se comparó frente al uso del dataset completo (original). Como resultado, se observó que ISCA versión 1 permite reducir el dataset en promedio a un 39,5% del tamaño original y mejora la calidad de la clasificación en promedio un 2,5%. ISCA versión 2 se comparó con cuatro algoritmos del estado del arte de selección (reducción) de instancias, CNN, ELH, ENN y RMHC, permitiendo obtener una reducción de instancias mucho mayor, de 11,5% en promedio, mientras que para los otros fue de 38,3%, 25,7%, 77,2% y 9,8% respectivamente. En cuanto al promedio del % de error, ISCA obtiene un valor de 19,460% mientras que el mejor resultado de los otros algoritmos comparados lo obtiene CNN con 12,622%. Los resultados son prometedores y alientan el desarrollo de nuevos trabajos de investigación en el área.	spa
dc.description.abstract	Knowledge Discovery in Databases (KDD) supports decision-making and the improvement of various tasks of human activity. A Supervised Learning Model (SLM) is obtained from a dataset that represents specific daily life situations. The data has been collected and recorded (training data) with two main components: a variable of interest called the output, target or response variable, and other variables called the input or explanatory variables. The process of obtaining an SLM for a real-life problem has a “Data preprocessing” phase, in which instance selection (data reduction) is applied if the dataset needs it. Instance selection seeks to obtain quality data which, when used in the modeling stage, allows the definition of models of similar or higher quality with less data, in less time and with less computational effort. Covering Arrays (CA) are mathematical objects that have been used in the design of experiments with applications in biology, analysis of engineering failures, and recently in the quality testing of software and hardware. However, they have not been used to date for instance selection in datasets. Therefore, this research proposes an algorithm based on binary Covering Arrays, named ISCA, which selects instances (rows) of datasets that are used in classification processes (supervised learning). The proposal was evaluated with 26 datasets from the machine learning repository of the University of California, Irvine, using two comparison measures: the quality of the classification (lower error percentage in the classification) and the percentage of instance reduction. The datasets reduced with ISCA were used with two classifiers (1NN and C4.5) and the quality of the models was compared against the use of the complete (original) dataset. 
As a result, it was observed that on average ISCA version 1 reduces the dataset to 39.5% of the original size and improves the quality of the classification by 2.5%. ISCA version 2 was compared with four state-of-the-art instance selection (reduction) algorithms, CNN, ELH, ENN and RMHC, and obtained a much greater instance reduction, 11.5% on average, while for the others it was 38.3%, 25.7%, 77.2%, and 9.8%, respectively. Regarding the average error percentage, ISCA obtains 19.460%, while the best result among the other algorithms is obtained by CNN, with 12.622%. The results are promising and encourage further research in the area.	eng
dc.language.iso	spa
dc.publisher	Universidad del Cauca	spa
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject	Algoritmos de clasificación	spa
dc.subject	Arreglos de cobertura	spa
dc.subject	Selección de instancias	spa
dc.subject	Classification algorithms	eng
dc.subject	Covering arrays	eng
dc.subject	Instance selection	eng
dc.subject	RMHC (Random Mutation Hill Climbing)	eng
dc.subject	ELH (Encoding Length Heuristic)	eng
dc.subject	CNN (Condensed Nearest Neighbor Rule)	eng
dc.subject	ENN (Edited Nearest Neighbour)	eng
dc.title	Algoritmo para la selección de instancias en problemas de clasificación basado en arreglos de cobertura	spa
dc.type	Tesis maestría	spa
dc.rights.creativecommons	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.type.driver	info:eu-repo/semantics/masterThesis
dc.type.coar	http://purl.org/coar/resource_type/c_bdcc
dc.publisher.faculty	Facultad de Ingeniería Electrónica y Telecomunicaciones	spa
dc.publisher.program	Maestría en Computación	spa
dc.rights.accessrights	info:eu-repo/semantics/openAccess
dc.type.version	info:eu-repo/semantics/acceptedVersion
dc.identifier.instname
dc.identifier.reponame
oaire.accessrights	http://purl.org/coar/access_right/c_abf2
dc.identifier.repourl
oaire.version	http://purl.org/coar/version/c_ab4af688f83e57aa
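
The two comparison measures named in the abstract (size of the reduced dataset relative to the original, and classification error percentage with a 1NN classifier) can be illustrated with a minimal sketch. This is not the ISCA algorithm: the function names, the toy data, and the `selected` subset below are all hypothetical and serve only to show how the two measures might be computed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def retained_percentage(n_original, n_selected):
    """Size of the reduced dataset as a percentage of the original size."""
    return 100.0 * n_selected / n_original


def one_nn_predict(train_X, train_y, x):
    """Return the label of the training instance closest to x (1NN)."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]


def error_percentage(train_X, train_y, test_X, test_y):
    """Percentage of test instances misclassified by 1NN."""
    wrong = sum(
        one_nn_predict(train_X, train_y, x) != y
        for x, y in zip(test_X, test_y)
    )
    return 100.0 * wrong / len(test_y)


# Toy data, invented for illustration only.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = ["a", "a", "b", "b"]
test_X = [(0.2, 0.1), (1.2, 0.9), (0.4, 0.4)]
test_y = ["a", "b", "b"]

# Hypothetical selection of row indices; NOT produced by ISCA.
selected = [0, 2]
sub_X = [train_X[i] for i in selected]
sub_y = [train_y[i] for i in selected]

size_pct = retained_percentage(len(train_X), len(sub_X))
full_error = error_percentage(train_X, train_y, test_X, test_y)
reduced_error = error_percentage(sub_X, sub_y, test_X, test_y)

print(f"reduced size: {size_pct:.1f}% of original")
print(f"1NN error, full dataset: {full_error:.1f}%")
print(f"1NN error, reduced dataset: {reduced_error:.1f}%")
```

Under these assumptions, an instance selection method is judged favorably when the reduced-size percentage is low and the error on the reduced dataset stays close to, or below, the error on the full dataset.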