Data mining applied to the
segmentation of students and the acquisition of new profiles by means of
clustering techniques.
Minería de datos aplicada a la segmentación de estudiantes y captación
de nuevos perfiles mediante técnicas de clustering
Carlos
Rafael Guffantte Salazar *
Santiago
Israel Logroño Naranjo *
SUMMARY
This project proposes the
application of data mining tools, in particular clustering techniques, to
identify the most representative profiles that have passed through the
CONDUESPOCH driving school located in the city of Riobamba.
The research is justified by
the need to improve the recruitment of students before each new academic
period, and through the processing of historical information available to the
institution it is possible to segment these individuals into different groups.
The development is based on the application of K-Means and DBSCAN models, each
with its strengths and weaknesses depending on the nature and distribution of
the data available; the DBSCAN model is the most appropriate for the case
study, reaching validation metrics of 0.78 for the silhouette coefficient and
0.27 for the Davies-Bouldin index.
The study adopts a
quantitative and exploratory approach, with an applied methodology. The main
characteristics of the most significant groups will be analyzed in order to
propose recruitment strategies that can be effective for the institution.
The program seeks to improve
the integration of data mining in institutional management, simplifying student
recruitment processes so that the school maintains a high profile in the region
with its peers; in addition to ensuring efficient and informed decision making
based on the knowledge acquired.
Keywords: Data Mining, Clustering, Profile Segmentation,
Capture Strategies, Data Science.
Resumen
This project proposes the
application of data mining tools, in particular clustering techniques, to identify
the most representative profiles that have passed through the CONDUESPOCH
driving school located in the city of Riobamba.
The research is justified by
the need to improve the recruitment of students before each new academic
period, and through the processing of historical information available to the
institution it is possible to segment these individuals into different groups.
The development is based on the application of K-Means and DBSCAN models, each
with its strengths and weaknesses depending on the nature and distribution of
the data available; the DBSCAN model is the most appropriate for the case
study, reaching validation metrics of 0.78 for the silhouette coefficient and
0.27 for the Davies-Bouldin index.
The study adopts a
quantitative and exploratory approach, with an applied methodology. The main
characteristics of the most significant groups will be analyzed in order to
propose recruitment strategies that can be effective for the institution.
The program seeks to improve
the integration of data mining in institutional management, simplifying student
recruitment processes so that the school maintains a high profile in the region
with its peers; in addition to ensuring efficient and informed decision making
based on the knowledge acquired.
Key words: Data Mining, Clustering, Profile Segmentation,
Capture Strategies, Data Science.
INTRODUCTION
In the current educational landscape, institutions
face the constant challenge of better understanding the characteristics and
needs of their students. This understanding is not only key to design more
effective educational programs, but also to attract new profiles that fit their
objectives
A very useful tool that helps to overcome this type of
challenge is related to the application of data mining techniques, specifically
clustering techniques, which allow grouping individuals belonging to the
institution into segments with similar characteristics, revealing patterns that
under ordinary circumstances might go unnoticed. For this purpose, there are
algorithms such as K-means and DBSCAN, which facilitate the management of large
volumes of data, in addition to providing practical results for decision making
The impact of this study focuses on turning
CONDUESPOCH's historical data into a strategic advantage. By identifying clear
student segments, the institution will be able to personalize its educational
offerings, optimize resources and focus its recruitment efforts on the most
representative and successful profiles. The main objective of this work is to
use clustering techniques to segment students according to their demographic
and academic characteristics, thus contributing to improve both institutional
planning and the attraction of new profiles. To achieve this goal, three
specific objectives are proposed: to identify student profiles from key data,
to evaluate the performance of the K-means and DBSCAN algorithms in this
context, and to design recruitment strategies aimed at the most prominent
segments.
The purpose of this research, beyond the benefits that
the CONDUESPOCH school will obtain, is that it can become a model that other
educational institutions can replicate; offering an innovative alternative to
face the challenges of the educational sector based on an approach based on
data analysis and the subsequent segmentation and understanding of the students
who belong to them. In addition, it is intended to answer the following
question: How can clustering techniques improve the segmentation of students in
CONDUESPOCH and optimize its recruitment process?
Theoretical
Framework
The use of data analytics or data analysis in
education has become a transcendental procedure to aspire to improve both
teaching and the recruitment of new students in a given institution, not to
mention the support it represents for strategic decision making. However,
applying these techniques is usually not simple; on the contrary, it is
certainly common to face obstacles related to the lack of information derived
from inconsistencies or lack of registration in the institutions, in addition
to the fact that some of them may not even have systems that allow them to
manage the data efficiently, affecting the quality of the analyses. In
addition, there is often a lack of technical training to use these tools
properly. There is also the issue of the use of personal data, which requires
careful consideration of ethical and privacy aspects
Despite the complications associated with these
challenges, the potential of data mining to change the way educational
institutions are managed is enormous, establishing itself as a highly relevant
resource for improving the student experience and making better use of the
resources available to the institution. To mention an example, Dutt et al
In the case of new student recruitment, Estrada-Danell
et al.
MATERIALS AND METHODS
The present research was developed in an exploratory
and applied manner, since it is intended to analyze patterns in the data by
means of grouping techniques, without a predefined structure and with a
quantitative approach.
Research
design and type of research
This study uses a quantitative approach to analyze the
data, which will be numerical in nature, of students who have historically
attended the CONDUESPOCH driving school, in order to detect the possible
existence of significant patterns among them. This approach is imperative due
to the need to resort to statistical techniques that facilitate obtaining
concrete and verifiable results. This starts with the collection of historical
data available in the institutional systems, followed by a detailed analysis to
identify the key relationships between variables
The design is applied, as it seeks to use existing
data to recruit new students, identifying relevant profiles. In addition, it is
descriptive, focusing on categorizing the main characteristics of current
students to extrapolate trends in the recruitment of new profiles. This
analysis is performed at a single moment in time and without affecting the
weight of the information represented by any variable is simply limited to the
observation and analysis of the available data, classifying the study as
non-experimental and cross-sectional
Population
and sample
The population is composed of approximately 1000
student records of the CONDUESPOCH driving school, in which information
academic and demographic information is available for each one of them. Taking
into consideration that the totality of the available data will be analyzed, it
will not be necessary to apply a sampling under any criteria, ensuring a
general representativeness of the results.
Research
methods and techniques
The following methods and techniques were used for
data collection and analysis:
Collection of information
Data were extracted from multiple tables stored in the
institution's central database. Variables considered include:
·
Demographics: Age,
gender and place of origin (province, canton, parish, etc.).
·
Academics:
Qualifications, Status.
Python software was used to develop the models, which
has specialized libraries for data analysis and management such as
Scikit-learn, Pandas and Matplotlib, among others, facilitating both the
pre-processing of the data and the visualization of the results obtained.
Data analysis procedures
The data
collected from the different tables stored in the institutional system were grouped
and integrated into a single table using Python. The procedure began with the
preprocessing of the data, where missing and duplicate values were eliminated
as part of the cleaning process, avoiding working with irregular information.
Subsequently, the most relevant variables were identified using correlation and
exploratory analysis techniques. Irregular distributions were identified in
most of the data, so it was necessary to enrich the characteristics, obtaining
new variables from the original ones, such as performance from grades; in
addition, a column for age was generated from the dates of birth and beginning
of the courses. The numerical variables were normalized using MinMaxScaler to
ensure a uniform scale, which improves the performance of the algorithms; while
for the categorical variables, one-hot coding was performed to ensure
compatibility with the corresponding clustering techniques
Once the data were processed, two clustering
algorithms were implemented:
·
K-Means: The
optimal number of clusters was determined using the elbow method.
·
DBSCAN: An
adjustment was made to the epsilon and min_samples parameters in such a way
that significant clusters could be identified, in addition to detecting the
existence of possible outliers. .
The quality of the clusters obtained for each case was
evaluated using metrics such as the silhouette coefficient to estimate the
cohesion and separation between groups, and the Davies-Bouldin index to assess
their level of compactness and dispersion (
RESULTS
This section presents the findings obtained from the
clustering of the register of students who have studied at CONDUESPOCH during
the last five periods, in order to identify the most popular profiles and adapt
strategies to improve future recruitment at the institution.
Evaluation Metrics
As shown inError!
Reference source not found. , the DBSCAN model is the
most appropriate to work with because it has better metrics than its similar
K-Means model. This can be explained due to the quite particular distribution
of the data, complicating the homogeneous grouping performed by K-Means and
favoring the density analysis of DBSCAN
Table 1: Evaluation
metrics for the applied models
MODEL |
# CLUSTERS |
COEF. SILUETA |
D-B INDEX |
DBSCAN |
8 |
0,78 |
0,27 |
KMEANS |
7 |
0,64 |
0,7 |
Distribution of Clusters by Performance
InError!
Reference source not found. , the results of the
clustering are presented. The most significant clusters were cluster 0, with
more than 350 individuals, clusters 2 and 3 with more than 150 individuals and
cluster 5 with 75 individuals; on the other hand, cluster 4 had only 34
individuals grouped together but with a high recorded performance. The large
and well-performing clusters were used as a starting point for analyses leading
to the development of recruitment strategies
Figure
1: Performance by cluster
Distribution by location
According toError!
Reference source not found. , most of the students
are from the city of Riobamba, as indicated by the three largest clusters,
while clusters 4 and 5, which were also of interest for the academic part, are
distributed among other cantons of the country. This information allows for a
more in-depth analysis of the regions and possible educational establishments
in which the institution can be promoted.
Figure
2: Customer segmentation: Location by Cluster
Within the groups focused on the analysis in Riobamba,
there is a clear tendency of individuals who live in the urban parishes of the
city, especially in Lizarzaburu; in addition, they are generally registered as
members of public educational institutions. In the groups of interest outside
of Riobamba, the majority of people indicate Guano as their city of residence
or study at ESPOCH. It is worth noting the low presence of people from parishes
close to the institution such as Licán or Yaruquíes, as well as from
neighboring cantons such as Chambo on the other hand.
Age Distribution
As illustrated Error!
Reference source not found. , clusters 6 and 7 are
the most differentiated in terms of student age with respect to the others,
with mean values of 36 and 46 years, respectively. The other clusters have a
similar behavior, with a density concentrated between 20 and 25 years of age.
Figure
3: Ages by Cluster
Age distribution for each group formed
This suggests that the vast majority of students who
have historically been part of the institution did so tentatively while they
were finishing their higher education studies or shortly after completing high
school, and few have chosen to enter the institution after the age of 35.
Additional Observations
InError!
Reference source not found. , the distribution of
people according to the section in which they took their courses is presented.
There are no clear preferences between groups in this section beyond the
apparent preference for the evening and weekend option over the afternoon, even
so, this section has a fairly decent volume of students that keeps it as a good
option to continue, especially for people based in Riobamba.
Figure
4: Sections per Cluster
In
addition, the gender distribution for the clusters found is summarized inError! Reference source not
found. . Of the representative clusters, only in the
outstanding ones is there an even distribution between male and female, while
for the rest there is a clear majority tendency for the former.
Figure
5: Gender
by Cluster
Proposed Strategies
Once the relevant information was known, and taking as
a starting point what is mentioned by Kotler et al.
Error!
Reference source not found. summarizes the
recruitment strategies proposed based on the clustering results obtained.
Table 2:
Summary of Strategies based on cluster
distributions
Segmentation |
Description |
Remarks |
Proposals |
Location |
Riobamba |
High concentration of people in Lizarzaburu parish, followed by the
other urban parishes. |
Focus the campaign in the urban area of the city, especially in
Lizarzaburu. Suggest promoting the offer in Licán or Yaruquíes, which are
areas adjacent to the school. |
High
tendency of people in fiscal institutions such as: Carlos Cisneros, Riobamba,
Maldonado, Juan de Velasco. |
Conduct lectures in these institutions to maintain or improve the
influx of people. |
||
Within the higher education institutions, there is a majority of
people belonging to ESPOCH, with almost no presence of people from other
universities. |
Offer exclusive benefits to students who belong to ESPOCH. Promote the
offer in other higher education institutions in the city. |
||
Others |
The vast majority is concentrated in Guano, with no influence from the
other surrounding cantons. |
Consider this trend for future campaigns to make a presence in Guano.
Evaluate the possibility of visiting nearby cantons such as Chambo to
publicize the offer and try to expand the volume of people from these areas. |
|
Age |
20-27 years |
Great majority of people under 27 years old, high concentration around
22 years old. |
Have an active social media approach to reach this age group
effectively. |
DISCUSSION
The results reflect that data mining techniques, in
the particular case of clustering algorithms, can serve as a tool of notable
relevance in the educational field to identify segments of students with
different characteristics, as pointed out by Baker & Inventado
With regard to the selected models, we agree with
Lazaro-Camasca & Nuñez-Medrano
Particularly, the proposed strategies are aligned with
the stipulations of Parra Armendariz et al.
In the case of CONDUESPOCH, applying clustering
techniques to segment students represents a unique opportunity to improve
internal planning and develop more effective recruitment strategies. This
approach would not only strengthen the institution's competitiveness, but can
also serve as a replicable model for other educational organizations interested
in leveraging their data to make informed decisions
This study has allowed us to learn more about the main
characteristics of the students who have been part of the CONDUESPOCH driving
school, which was useful in order to plan recruitment strategies with targeted
offers based on the sector with which the groups were identified.
The distribution and nature of the school data had a
notable influence at the time of interpreting the results of both models
applied, clearly evidencing in the evaluation metrics that the DBSCAN model
formed a more solid grouping than the K-Means model, making it the ideal model
to propose recruitment proposals with a high level of confidence.
The recruitment strategies were designed taking into
account the most successful academic profiles within the total population, and
for the most part, in accordance with the most numerous groups identified; it
is important to adapt the way in which people will be reached based on factors
such as their age and distance from the institution.
The study has achieved its main objective, which was
to apply clustering techniques to segment the school's students into groups,
based on their demographic and academic characteristics, allowing the
development of proposals aimed at benefiting the recruitment of students in an
informed and strategically planned manner.
REFERENCES
Baker, R. S., & Inventado, P. S.
(2014). Educational Data Mining and Learning Analytics. In J. A. Larusson
& B. White (Eds.), Learning Analytics: From Research to Practice (pp.
61-75). Springer New York. https://doi.org/10.1007/978-1-4614-3305-7_4
Bhuyan, R., & Borah, S. (2023). A
Survey of Some Density Based Clustering Techniques. National Conference on
Advancements in Information, Computer & Communication, 1.
https://doi.org/10.13140/2.1.4554.6887
Castro R., L. F., Espitia P., E., &
Montilla, A. F. (2018). Applying CRISP-DM
in a KDD Process for the Analysis of Student Attrition. In J. E. Serrano C.
& J. C. Martínez-Santos (Eds.), Advances in Computing (pp.
386-401). Springer International Publishing.
Dutt, A., Aghabozrgi, S., Ismail, M. A.,
& Mahroeian, H. (2015). Clustering Algorithms Applied in Educational Data
Mining. International Journal of Information and Electronics Engineering
(IJIEE), 5. https://doi.org/10.7763/IJIEE.2015.V5.513.
https://doi.org/10.7763/IJIEE.2015.V5.513
Estrada-Danell, R. I., Zamarripa-Franco,
R. A., Zúñiga-Garay, P. G., & Martínez-Trejo, I. (2016). Contributions from data mining to the enrollment
capture process in private higher education institutions. Educare
Electronic Journal, 20(3), 1-21.
https://www.redalyc.org/articulo.oa?id=194146862011.
https://www.redalyc.org/articulo.oa?id=194146862011
Han, J., Kamber, M., & Pei, J. (2021).
Data Mining. Concepts and Techniques,
3rd Edition (The Morgan Kaufmann Series in Data Management Systems).
Khan, A., & Ghosh, S. K. (2021).
Student performance analysis and prediction in classroom learning: A review
of educational data mining studies. Education and Information Technologies,
26(1), 205-240. https://doi.org/10.1007/s10639-020-10230-3.
https://doi.org/10.1007/s10639-020-10230-3
Kotler, P., Keller, K. L., Edition, D.,
Maria, T., Mues, A., Monica, Z., Gay, M., De La, M., Eloisa, L., Rivera, A.,
Hernandez, M., Enrique, E., & Bianchi, C. (2012). Marketing management ADAPTATION AND TECHNICAL REVIEW
(G. Domínguez, Ed.; 14th ed.). Pearson.
Lazaro-Camasca, E. N., & Nuñez-Medrano,
Y. (2023). Segmentation of
University Students using Clustering and considering a Virtual Cycle. Proceedings
of the LACCEI International Multi-Conference for Engineering, Education and
Technology, 2023-July. https://doi.org/10.18687/laccei2023.1.1.1355
Liao, S. H., Chu, P. H., & Hsiao, P.
Y. (2012). Data mining techniques and applications - A decade review from
2000 to 2011. In Expert Systems with Applications (Vol. 39, Issue 12,
pp. 11303-11311). https://doi.org/10.1016/j.eswa.2012.02.063
Luan, J. (2002). Data Mining and Its
Applications in Higher Education. New Directions for Institutional
Research, 2002, 17-36. https://doi.org/10.1002/ir.35
Parra Armendariz, C., Ulloa Viteri, S.,
& Medina, P. (2022). Systematic
literature review on educational marketing. Religation. Journal of Social
Sciences and Humanities, 7(33).
https://doi.org/10.46652/rgn.v7i33.943.
https://doi.org/10.46652/rgn.v7i33.943.
Perreault, K. (2011). Research Design:
Qualitative, Quantitative, and Mixed Methods Approaches. Manual Therapy,
16(1), 103. https://doi.org/https://doi.org/10.1016/j.math.2010.09.003
Romero, C., & Ventura, S. (2020).
Educational data mining and learning analytics: An updated survey. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3).
https://doi.org/10.1002/widm.1355. https://doi.org/10.1002/widm.1355.
Shrestha, S., & Pokharel, M. (2020).
Data Mining Applications Used in Education Sector. Journal of Education
and Research, 10, 27-51. https://doi.org/10.3126/jer.v10i2.32721.
https://doi.org/10.3126/jer.v10i2.32721
Xu, R., & Wunsch, D. (2018). Cluster
Analysis. In Clustering (pp. 1-13). John Wiley & Sons.
https://doi.org/https://doi.org/10.1002/9780470382776.ch1