Machine Learning Models Based on Geospatial Data for
Customer Churn Prediction in ISP
Modelos de Aprendizaje Automático Basados en Datos
Geoespaciales para la Predicción de Fuga de Clientes en Proveedores de
Servicios de Internet (ISP)
Gabriela Solano Aguilar*
Néstor Estrada Brito*
ABSTRACT
The purpose of this study is
to address the problem of customer churn in the telecommunications sector,
specifically in an Internet Service Provider (ISP) operating in Ecuador,
through a machine learning model and the application of geospatial analysis. To
carry it out, it was necessary to apply class balancing techniques, since the
dataset contained far fewer churned customers than customers who remained in
active service. In addition, customer segmentation was
carried out using K-Means. In terms of churn prediction, several classification
models were evaluated and the one with the best performance was selected. Based
on their predictions, geospatial analysis was applied to examine the
territorial distribution of customers, identify patterns and enable the
development of more effective retention strategies. To evaluate each model,
the ROC-AUC and recall metrics were applied; Random Forest combined with
random downsampling presented the best performance. Through segmentation, four
groups of customers with similar characteristics were identified. In
conclusion, this study demonstrates that
integrating machine learning and geospatial analysis is an effective
combination for predicting customer churn in the telecommunications sector. The
combination of Random Forest with data balancing techniques, together with
customer segmentation using K-Means, resulted in a robust and accurate model.
Keywords: Machine learning, geospatial
analysis, customer churn, Random Forest, K-Means
RESUMEN
El
propósito de este estudio es abordar el problema de la pérdida de clientes en
el sector de las telecomunicaciones, específicamente en un Proveedor de
Servicios de Internet (ISP) que opera en Ecuador, mediante un modelo de
aprendizaje automático y la aplicación de análisis geoespacial. Para llevarlo a
cabo, fue necesario aplicar técnicas de balanceo de clases, ya que el conjunto
de datos presentaba una menor cantidad de clientes que abandonaban el servicio,
en contraste con aquellos que permanecían activos. Además, se realizó una
segmentación de clientes utilizando K-Means. En
cuanto a la predicción de abandono, se evaluaron varios modelos de
clasificación y se seleccionó el de mejor rendimiento. Con base en sus
predicciones, se aplicó análisis geoespacial para examinar la distribución
territorial de los clientes, identificar patrones y permitir el desarrollo de
estrategias de retención más efectivas. Para evaluar cada modelo se utilizaron
las métricas ROC-AUC y recall, siendo Random Forest en combinación con Random Downsampling el que
presentó el mejor desempeño. A través de la segmentación, se identificaron
cuatro grupos de clientes con características similares. En conclusión, este
estudio demuestra que integrar el aprendizaje automático con el análisis
geoespacial es una combinación eficaz para predecir la pérdida de clientes en
el sector de las telecomunicaciones. La combinación de Random
Forest con técnicas de balanceo de datos, junto con
la segmentación de clientes utilizando K-Means, dio
como resultado un modelo robusto y preciso.
Palabras clave: Aprendizaje automático, análisis geoespacial, pérdida de clientes, Random Forest, K-Means
INTRODUCTION
According to
In Ecuador, the increase in Internet use has also been
notable. According to the Ministry of Telecommunications report published in 2024,
the percentage of rural parishes and cantonal capitals with access to fixed
internet via fiber optics grew from 75.82 % in 2022 to 80.80 % in the first
quarter of 2024
Customer churn or abandonment understood as the
unilateral severance of the contractual relationship with a provider
Traditionally, retention strategies have not incorporated the analysis of user
behavior or of the geographical factors that could motivate the decision to
leave the service. Faced with these difficulties, this study presents a novel
solution founded on two complementary pillars, machine learning and geospatial
analysis, which together allow a comprehensive approach to preventing customer
defection in ISPs. Both rest on probability theory and statistics, which
provide the mathematical basis for developing models capable of robust and
accurate prediction.
Machine learning focuses on identifying patterns in large amounts of data that
are difficult for humans to find. Through unsupervised learning techniques
such as clustering, customers can be segmented into groups with common
characteristics, which provides useful information about their behavior.
Geospatial analysis, on the other hand, provides a
strategic facet to the study, as it explores how geographic factors affect the
client's decision, which in turn allows identifying patterns related to
abandonment and locating key areas that need special attention
To address this challenge, the study proposes the following research
objectives: first, the preprocessing of the data provided by the ISP to improve
their quality; second, the application of clustering techniques to segment
customers into groups with similar characteristics; third, the evaluation of
different classification models, using several metrics and cross-validation, to
select the most appropriate one according to its performance; and finally, the
interpretation of the results of the selected classification model through
geospatial analysis, to identify spatial patterns related to client attrition.
Several studies have addressed methods to predict
customer churn in the telecommunications sector, for example;
The present study focuses on an ISP located in Ecuador that operates in the
provinces of Chimborazo and Tungurahua. The ISP lacks systems for identifying
and predicting customers at risk of abandonment. The data provided by the
company correspond to residential customers with fiber optic plans. Using these
data, this study addresses the problem of churn with a technological and
institutional approach, applying machine learning and geospatial analysis to
understand the factors related to churn and to predict when customers are at
risk of churning.
This study is oriented to answer the research
question: How can a predictive model that integrates machine learning and
geospatial analysis predict customer churn in an ISP operating in Ecuador?
MATERIALS AND METHODS
Customer churn detection
There are two ways in which the problem of customer
defection can be addressed: reactive or proactive. The first is when the
company acts after the customer has expressed a desire to cancel the service.
On the other hand, in the proactive approach, the company seeks to anticipate
abandonment through predictive techniques
Within the proactive paradigm, customer defection in
the telecommunications sector has been extensively addressed with supervised
classification models. Decision trees are among the models presented because
they allow for the interpretation of interaction between various
characteristics.
One such prominent study was performed by
Another of the most widely used models in the
classification of customer churn is XGBoost, since it
can adequately deal with unbalanced data sets and also optimizes computational
performance.
XGBoost was applied in the study presented by
The study by
In turn,
Other studies have integrated models with ensemble or hybrid schemes, thereby combining the strengths
of different algorithms. For example,
In
Another study that proposes an ensemble approach is
that of
In the study by
Customer segmentation and churn detection
There is also research that has explored the use of
clustering techniques to segment customers based on their characteristics and
behavioral patterns. Segmentation allows a better understanding of
user profiles and their interaction with the service, which can be useful both
for retention strategies and for improving predictive models of attrition.
K-Means is one of the clustering algorithms used in
recent studies related to customer defection prediction. An example is the work
of
K-Medoids is also presented as an alternative to K-Means, since it is more
robust to outliers. In fact,
On the other hand,
Alternatively, there are more advanced approaches in which deep learning is
applied to perform customer segmentation. The study conducted by
Geospatial analysis
Geospatial analysis makes it possible to examine how location-related
characteristics relate to and influence certain phenomena. It adds an extra
dimension when studying behaviors or patterns in different contexts. In this
framework, it has high potential for the prediction of customer defection,
since aspects such as the geographical environment, local infrastructure, the
presence of competitors, or the socioeconomic characteristics of an area can
influence the abandonment of services.
Even though, up to the date of this study, there have
not been numerous investigations that incorporate dropout prediction models and
geospatial analysis, the work of
In other sectors, such as e-commerce, the authors
In sectors outside the telecommunications sector,
geospatial analysis has also been used to identify spatial patterns. An example
of this is the research
Through this literature review, several papers have
addressed customer defection prediction in telecommunications using supervised
classification models, some of which have chosen to integrate segmentation
techniques. However, this study proposes a global approach, where in addition
to predicting churn and segmenting customers, geospatial analysis is considered
to identify spatial patterns associated with churn. This approach has not been
addressed in previous studies, and is of great help, as it provides valuable
information for the development of retention strategies and decision-making.
This study follows a methodology based on the principles of design science,
which consists of three interconnected phases: the relevance cycle, the design
cycle, and the rigor cycle, as illustrated in Figure 1. These cycles are core
to solving the problem under study and ensuring that the findings are
scientifically valid and practically meaningful.
Figure 1: Design Science Methodology
Relevance cycle
A central problem in ISPs is client attrition and its
financial and operational impact, which is the starting point for this study.
To address the problem, we conducted a systematic literature review using the
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)
methodology, which provides a rigorous and transparent framework for the
selection of studies and yields a scientifically backed knowledge base.
The implementation of the PRISMA process in this study involved an exhaustive
search in recognized academic databases such as Scopus and SpringerLink, from
which publications from the period 2021-2024 were collected, prioritizing
studies related to the telecommunications sector in which machine learning is
applied to customer segmentation and attrition prediction, as well as
geospatial analysis approaches.
Articles that were retracted or unrelated to the study
were also excluded. Relevant algorithms, methodologies, and applications, as
well as metrics, forms of validation, and analytical tools, were identified.
Research Design
The present research follows a quantitative,
correlational-predictive approach and non-experimental design, where the aim is
to predict customer defection using historical and geo-referenced data.
Population and sample
Population: ISP residential customers operating in the
provinces of Chimborazo and Tungurahua, Ecuador, with fiber optic plans.
Sample: Subset of active and churned customers selected through
non-probability convenience sampling based on historical records provided by
the ISP.
Inclusion criteria: Customer records provided by the ISP, including both loyal
and churned customers, selected through non-probability sampling.
Exclusion criteria: Customer records received from the ISP that are incomplete
or contain erroneous information.
Collection instruments
All data used in this study
were provided by the ISP through its internal database
Design cycle
The Cross Industry Standard
Process for Data Mining (CRISP-DM) methodology has been used for the
development of this study. This standard is widely used in the structuring of
projects related to data analysis, given that it presents an iterative and
systematic process, which allows the objectives of this study to be addressed
in an organized manner. The main phases of CRISP-DM adapted to this environment
are the following:
Segmentation: K-Means is
applied because it has been the most mentioned in the literature review
Classification: Models
based on decision trees and ensemble techniques, which are highly effective in
the literature reviewed, are employed. Decision Trees
The Random Forest model
Gradient Boosted Trees
models
On the other hand, XGBoost
To obtain reliable performance estimates, stratified cross-validation
(StratifiedKFold) was applied, which preserves the class proportions in each
fold
Model evaluation: The elbow method is employed to identify the optimal number
of groups for K-Means by analyzing the variation in inertia (the sum of squared
distances from each point to its assigned centroid) as the number of groups or
clusters increases.
In the case of classification models, given a dataset with unbalanced classes,
the AUC-ROC metric has been chosen because it provides an aggregate measure of
model performance across all possible decision thresholds. This metric, derived
from the ROC curve, assesses the model's ability to distinguish between
churning and non-churning customers by relating the true positive rate to the
false positive rate. It is particularly useful in unbalanced datasets, where
the number of positive cases (churned customers) is significantly lower than
the number of negative cases.
However, since the AUC-ROC metric does not show in detail how well the model
identifies at-risk clients, an additional metric is incorporated: recall
(sensitivity), which measures the proportion of clients who actually dropped
out that the model correctly identified, thus prioritizing the reduction of
false negatives.
Deployment:
An interactive map is proposed to visualize
the results of the final model. This involves the geospatial variables of the
customers and shows both the customer segments and the geospatial patterns
associated with defection.
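The two evaluation metrics chosen for the classification models can be illustrated with a minimal sketch. The labels, scores, and the 0.5 decision threshold below are illustrative assumptions, not values from the study; only the scikit-learn functions `roc_auc_score` and `recall_score` are real library calls.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Hypothetical ground truth (1 = churned) and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.60, 0.15, 0.40, 0.80])

# AUC-ROC is threshold-free: it equals the probability that a randomly chosen
# churner receives a higher score than a randomly chosen non-churner.
auc_value = roc_auc_score(y_true, y_score)

# Recall needs a decision threshold (0.5 here, an assumed default):
# recall = TP / (TP + FN), the share of actual churners that were flagged.
y_pred = (y_score >= 0.5).astype(int)
recall_value = recall_score(y_true, y_pred)

print(f"AUC-ROC = {auc_value:.3f}, recall = {recall_value:.3f}")
```

In this toy case the ranking is good while half of the churners are missed at the chosen threshold, which is the kind of gap between the two metrics that motivates reporting both.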
Validation and Reproducibility
Cross-validation: For the classification algorithms, a 5-fold stratified
cross-validation scheme (StratifiedKFold) is applied, which maintains the class
distribution in each fold and provides a robust model evaluation, particularly
on a class-imbalanced dataset.
Comparison of results: K-Means was used for segmentation, and two tests were
performed: one without geospatial data (latitude and longitude) and one with
them. The aim was to analyze which configuration produced better metrics and
thus better segmentation.
As for the classification
models, the performance is compared with the results obtained by the different
algorithms tested in this study to select those that offer the best
performance.
Reproducibility: Because the project uses data from an ISP that include
sensitive information, the code of the models developed will not be
open-sourced. However, the development process is described in this report, and
it can be reproduced on similar datasets or in other business environments.
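As an aid to reproduction, the 5-fold stratified scheme mentioned above can be sketched as follows. The data are synthetic: the label vector only mimics the roughly 7% churn share reported for the dataset, and the feature matrix is a random placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with about 7% positives (churn); features are placeholders.
rng = np.random.default_rng(0)
n_samples = 1000
y = np.zeros(n_samples, dtype=int)
y[:73] = 1
rng.shuffle(y)
X = rng.normal(size=(n_samples, 4))

# Stratification keeps the churn share of every fold close to the global one.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_churn_rates = []
for train_idx, test_idx in skf.split(X, y):
    fold_churn_rates.append(float(y[test_idx].mean()))

print([round(rate, 3) for rate in fold_churn_rates])
```

Without stratification, a fold could end up with very few churned customers, which would make recall estimates on that fold unstable.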
Design Limitations
Limited data: Lack of
demographic characteristics can restrict analysis of other potential factors
that could be relevant.
Geographic bias: The results are shaped by characteristics specific to
Ecuador, particularly the provinces of Chimborazo and Tungurahua; generalizing
the research to other contexts will therefore require corresponding
adjustments.
Technological dependence:
When deploying the models, a scalable technological infrastructure may be
required at the ISP.
Rigor cycle
The scientific rigor of the study rests on a PRISMA review, which defines a
robust theoretical framework; on validation and reproducibility methods that
document all phases of the process so that it can be replicated; and on an
assessment of limitations and of the potential for generalization, which
facilitates continuous improvement.
RESULTS
A cleaning and validation process was applied to the data provided by the ISP
to ensure their quality. Duplicate records were eliminated, while null records
were corrected by requesting information from the ISP. Likewise, an outlier
analysis was carried out, which made it possible to purge inconsistent records
introduced during data capture.
Description of the dataset
As a result of the first data processing step, a cleaned and formatted dataset
was obtained, comprising information on customers with residential fiber optic
plans, both churned and active.
Table 1 summarizes the structure of the dataset: 9653 instances with 11
attributes, 8 of them numeric and 3 nominal. A class imbalance can be noticed,
as dropouts represent only 7.32% of customers. This imbalance must be taken
into account, since it can affect the performance of the churn prediction
model. The issue is addressed at the data preparation stage, where class
balancing techniques have been used, whose impact will be discussed later.
Table 1: Summary of the dataset
# Instances | # Characteristics | # C. Num. | # C. Cat. | Churn proportion
9653 | 11 | 8 | 3 | 0.073
On the other hand, Table 2 presents each of the variables included in the
dataset. These characteristics contain information about the contracted
service, as well as the history of customer interactions with the ISP.
Table 2: Description of the variables contained in the dataset
Variable | Description
Product_name | Type of internet plan contracted by the customer.
Product_speed | Data transmission capacity measured in megabits per second.
Product_price | Monthly cost in dollars of the service contracted by the customer.
Geo_latitud | Geographic position of the customer with respect to the horizontal axis (latitude).
Geo_longitud | Geographic position of the customer with respect to the vertical axis (longitude).
Nodo_name | Primary network equipment to which the customer is connected.
Total_invoices_loyalty | Number of months that the customer has kept the service active.
Compliance_index | Ratio of payments made on time to the total number of payments.
Tickets_resolved | Number of critical support tickets resolved.
Average_hours_tickets_resolved | Average time in hours to resolve critical tickets.
Churn | Indicates whether the customer has abandoned (1) or not (0).
Exploratory data analysis
In any study it is vital to carry out an exploratory
analysis, given that in this way it is possible to get to know the data set
better in terms of its structure, to find certain patterns, or to identify
anomalies. It is at this point that an interesting factor was discovered when
applying correlation analysis between numerical variables.
As shown in Figure 2, the correlation analysis yielded a very high value (0.99)
between the variables Product_speed and Product_price. This indicates a strong
dependence between them, so Product_speed was excluded from the analysis to
avoid collinearity problems. On the other hand, no high correlations were found
between the predictor variables and the target variable (Churn), which suggests
that machine learning models are needed to capture possible non-linear
relationships.
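The collinearity check described above can be sketched as follows. The data frame is synthetic, the helper `drop_collinear` and its 0.95 threshold are hypothetical, and only the column names come from Table 2; the price column is generated to follow speed almost linearly so that it mimics the reported correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for part of the numeric columns (values are made up).
rng = np.random.default_rng(42)
speed = rng.choice([100, 200, 300, 500], size=200).astype(float)
df = pd.DataFrame({
    "Product_speed": speed,
    # Price follows speed almost linearly, giving a correlation close to 1.
    "Product_price": 15 + 0.05 * speed + rng.normal(0, 0.3, size=200),
    "Total_invoices_loyalty": rng.integers(1, 36, size=200).astype(float),
})

def drop_collinear(frame, threshold=0.95, keep=()):
    """Drop one column from each pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    cols = list(frame.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                # Prefer dropping the column that is not listed in `keep`.
                to_drop.add(a if b in keep and a not in keep else b)
    return frame.drop(columns=sorted(to_drop)), sorted(to_drop)

reduced, dropped = drop_collinear(df, threshold=0.95, keep=("Product_price",))
print(dropped)
```

Keeping the price rather than the speed mirrors the choice reported in the text; either column would carry nearly the same information.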
Customer segmentation
In business management, customer segmentation is key to identifying behavioral
patterns and thereby optimizing retention strategies. Understanding how
customers interact with the service and which factors influence their
retention or churn improves business decision-making in the telecommunications
sector. Segmentation also makes it possible to design better-targeted marketing
campaigns, personalize offers, and anticipate potential churn, which
contributes to customer satisfaction and thus to business sustainability.
In this study, the variables below were
selected for segmentation: customer dwell time, payment fulfillment rate,
number of critical support tickets and average time in hours for the resolution
of those tickets, and geographic location.
Figure 2: Correlation
heat map between quantitative variables
These variables were chosen for their relevance in
characterizing customer behavior, in contrast to other variables that only
describe the characteristics of the contracted service.
Two tests were carried out to obtain information on
whether geographic coordinates influence customer segmentation. In the first
one, these variables were incorporated and in the second one they were omitted,
and K-Means was used, where k was given values from 2 to 10. The results showed
that the test carried out without the geographical characteristics obtained a
higher Silhouette Score, which means that there is greater separation between
groups and that the elements within each of them are more similar.
It is also important to note that the elbow method was applied, without
including the geospatial variables, to obtain the optimal number of clusters.
Figure 3 shows the result of this analysis: the inertia decreases progressively
as the number of clusters increases, with a marked reduction up to K=4,
indicating that this number represents a good balance between explained
variability and model complexity. The Silhouette Score confirmed this choice,
since K=4 obtained the highest value (0.3406); for this reason, 4 clusters were
selected.
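A minimal sketch of this selection procedure is shown below, scanning k from 2 to 10 and recording both inertia (for the elbow plot) and the Silhouette Score. The data are synthetic blobs with four hand-placed centers, and the standardization step is an assumption, since the study does not state how the features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups stands in for the customer
# features (tenure, compliance, tickets, resolution time).
centers = np.array([[0, 0, 0, 0], [10, 10, 0, 0],
                    [0, 10, 10, 0], [10, 0, 10, 10]], dtype=float)
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=7)
X_scaled = StandardScaler().fit_transform(X)  # assumed preprocessing step

inertias, silhouettes = {}, {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X_scaled)
    inertias[k] = model.inertia_  # sum of squared distances to centroids
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# The elbow is read from the inertia curve; the silhouette gives a numeric pick.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, round(silhouettes[best_k], 3))
```

Running the same loop once with and once without the coordinate columns is one way to reproduce the comparison between the two segmentation tests.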
To better understand the behavior of customers in each cluster, the average of
each characteristic used in the segmentation was analyzed. Table 3 presents
these values. Clusters 0 and 2 are made up of the customers with the longest
time with the ISP (about 23 months) and a high rate of on-time payment.
On the other hand, clusters 1 and 3 group together customers with shorter
tenure. Cluster 1 has the lowest payment compliance (0.449), while cluster 3
stands out for having the lowest tenure (7.6 months) and, in turn, the lowest
number of support tickets (0.194 on average).
Figure 3: Elbow method for determining the optimal
number of clusters
Table 3: Average value of the relevant characteristics in each customer cluster
Cluster | Total_invoices_loyalty | Compliance_index | Tickets_resolved | Average_hours_tickets_resolved
0 | 23.203 | 0.881 | 0.681 | 7.590
1 | 15.034 | 0.449 | 0.464 | 5.912
2 | 23.090 | 0.845 | 2.943 | 66.646
3 | 7.589 | 0.928 | 0.194 | 2.860
In addition, an analysis of attrition distribution for
each cluster was conducted, which indicates that Cluster 1 has higher
attrition. This suggests that clients with lower payment compliance and shorter
dwell time have a higher propensity to drop out of the service.
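The per-cluster attrition comparison amounts to a grouped mean of the churn label. The sketch below uses a tiny hypothetical table; the cluster assignments and labels are invented for illustration and do not reproduce the study's figures.

```python
import pandas as pd

# Hypothetical cluster assignments and churn labels (1 = churned).
customers = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "Churn":   [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Size and churn rate of each cluster, highest attrition first.
churn_by_cluster = (
    customers.groupby("cluster")["Churn"]
    .agg(customers="size", churn_rate="mean")
    .sort_values("churn_rate", ascending=False)
)
print(churn_by_cluster)
```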
Customer churn prediction
For the development of the predictive models, the previously described dataset
was used, without the variable Product_speed due to its high correlation with
Product_price. Then, different classification models were evaluated: Decision
Tree, Random Forest, Gradient Boosted Trees (GBT), and XGBoost.
The performance measures of all the models are presented in Table 4. Initially,
the models were evaluated as a baseline, without applying any balancing or
sampling strategy to the data, which exposed the effect of the majority class.
In this setting, Random Forest gave the best AUC-ROC (0.794), but it also
yielded a very low recall (0.051), reflecting a low capacity to identify
churned clients. Similarly, GBT and XGBoost had AUC-ROC values of 0.778 and
0.776, respectively, with even lower recall.
Table 4: Evaluation of models.
Model | Sampling technique | AUC-ROC | Recall (sensitivity)
Decision Tree | Baseline | 0.714 | 0.074
Random Forest | Baseline | 0.794 | 0.051
GBT | Baseline | 0.778 | 0.029
XGBoost | Baseline | 0.776 | 0.044
Decision Tree | Sklearn's class_weight ‘balanced’ | 0.720 | 0.875
Random Forest | Sklearn's class_weight ‘balanced’ | 0.784 | 0.287
GBT | Sklearn's class_weight ‘balanced’ | 0.779 | 0.757
XGBoost | Sklearn's class_weight ‘balanced’ | 0.778 | 0.743
Decision Tree | SMOTE | 0.671 | 0.360
Random Forest | SMOTE | 0.768 | 0.360
GBT | SMOTE | 0.761 | 0.162
XGBoost | SMOTE | 0.754 | 0.213
Decision Tree | ADASYN | 0.674 | 0.426
Random Forest | ADASYN | 0.759 | 0.324
GBT | ADASYN | 0.741 | 0.118
XGBoost | ADASYN | 0.740 | 0.162
Decision Tree | Random downsampling | 0.720 | 0.610
Random Forest | Random downsampling | 0.777 | 0.809
GBT | Random downsampling | 0.764 | 0.757
XGBoost | Random downsampling | 0.763 | 0.772
Decision Tree | NCR | 0.755 | 0.117
Random Forest | NCR | 0.781 | 0.081
GBT | NCR | 0.779 | 0.066
XGBoost | NCR | 0.776 | 0.088
Decision Tree | Tomek Links | 0.745 | 0.147
Random Forest | Tomek Links | 0.783 | 0.015
GBT | Tomek Links | 0.781 | 0.029
XGBoost | Tomek Links | 0.781 | 0.044
Decision Tree | SMOTE + Random downsampling | 0.663 | 0.397
Random Forest | SMOTE + Random downsampling | 0.773 | 0.294
GBT | SMOTE + Random downsampling | 0.757 | 0.176
XGBoost | SMOTE + Random downsampling | 0.748 | 0.206
Decision Tree | SMOTE + NCR | 0.666 | 0.397
Random Forest | SMOTE + NCR | 0.763 | 0.478
GBT | SMOTE + NCR | 0.744 | 0.265
XGBoost | SMOTE + NCR | 0.735 | 0.324
Decision Tree | SMOTE + Tomek Links | 0.647 | 0.375
Random Forest | SMOTE + Tomek Links | 0.769 | 0.331
GBT | SMOTE + Tomek Links | 0.756 | 0.154
XGBoost | SMOTE + Tomek Links | 0.754 | 0.221
Decision Tree | ADASYN + Random downsampling | 0.632 | 0.250
Random Forest | ADASYN + Random downsampling | 0.760 | 0.250
GBT | ADASYN + Random downsampling | 0.747 | 0.162
XGBoost | ADASYN + Random downsampling | 0.747 | 0.206
Decision Tree | ADASYN + NCR | 0.675 | 0.449
Random Forest | ADASYN + NCR | 0.757 | 0.485
GBT | ADASYN + NCR | 0.744 | 0.228
XGBoost | ADASYN + NCR | 0.746 | 0.331
Decision Tree | ADASYN + Tomek Links | 0.680 | 0.404
Random Forest | ADASYN + Tomek Links | 0.760 | 0.324
GBT | ADASYN + Tomek Links | 0.751 | 0.162
XGBoost | ADASYN + Tomek Links | 0.746 | 0.154
The first class-balancing technique was sklearn's class_weight =
‘balanced’, where Decision Tree achieved the highest recall (0.875), with
AUC-ROC at 0.720, showing high sensitivity, although a higher risk of false
positives. GBT and XGBoost also improved their recall
to 0.757 and 0.743 respectively, and maintained an AUC-ROC close to 0.778.
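To make the effect of the 'balanced' option concrete, the sketch below computes the weights scikit-learn assigns with that heuristic, n_samples / (n_classes * count per class). The label vector uses the dataset size from Table 1 and an assumed count of 707 churned customers, a figure derived here from the reported 7.32% share rather than stated in the study.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 9653 instances (Table 1); 707 churners is an assumption (about 7.32%).
n_total, n_churn = 9653, 707
y = np.array([0] * (n_total - n_churn) + [1] * n_churn)

# Weights as computed by scikit-learn for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# The same heuristic written out by hand.
manual = n_total / (2 * np.bincount(y))

print(dict(zip([0, 1], np.round(weights, 3))))
```

Under this assumption, each churned customer weighs roughly an order of magnitude more than an active one during training, which helps explain the jump in recall and the accompanying rise in false positives.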
The impact of oversampling and undersampling
techniques on the performance of the classification models can also be seen, as
significant differences in terms of AUC-ROC and recall were obtained.
For example, moderate improvements in recall were obtained when applying
techniques such as SMOTE and ADASYN; however, the AUC-ROC was negatively
affected, as could be seen in the GBT and XGBoost models. For Random Forest,
SMOTE yielded an AUC-ROC of 0.768 and a recall of 0.360, whereas ADASYN reduced
the AUC-ROC to 0.759 and the recall to 0.324.
Among the tested combinations, Random Forest with random downsampling performed
best, with a recall of 0.809 and an AUC-ROC of 0.777, making it the most
suitable model for customer dropout risk prediction. With the same technique,
XGBoost and GBT also improved their recall (0.772 and 0.757, respectively), at
the cost of a slight loss in AUC-ROC.
The more advanced undersampling
strategies, such as NCR and Tomek Links, showed high
AUC-ROC in Random Forest (0.781 and 0.783, respectively), but extremely low
recall, indicating a lower ability to detect at-risk customers.
Interestingly, the combination of oversampling and undersampling techniques failed to improve the evaluation
metrics. Recall obtained lower values, which shows that these combinations were
not effective in addressing the problem with the data set under study. The best
result in this section was achieved by Random Forest with SMOTE + NCR, with an
AUC-ROC of 0.763 and a recall of 0.478, but it did not exceed the results
previously observed.
Taking into account all the results of the models
applied together with the oversampling and undersampling
techniques, it was found that the model with the best combination of AUC-ROC
and recall evaluation was Random Forest with Random Downsampling
(AUC-ROC = 0.777, Recall = 0.809). The measured values indicate a higher
capacity to identify customers prone to drop out, making it the best option in
this study for dropout prediction.
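The selected combination can be sketched as follows under stated assumptions. The data are synthetic, the helper `random_undersample` is a hypothetical hand-written equivalent of random downsampling (the study does not name the implementation used), and resampling only the training split is a design choice made here to keep the test data at its natural class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

def random_undersample(X, y, rng):
    """Keep all minority rows (label 1) plus an equal-sized random
    sample of majority rows (label 0)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    sampled = rng.choice(majority, size=len(minority), replace=False)
    keep = rng.permutation(np.concatenate([minority, sampled]))
    return X[keep], y[keep]

# Synthetic imbalanced data (about 7% positives) stands in for the ISP dataset.
X, y = make_classification(n_samples=4000, n_features=8, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Balance only the training split; the test split keeps its natural ratio.
rng = np.random.default_rng(0)
X_bal, y_bal = random_undersample(X_train, y_train, rng)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
proba = clf.predict_proba(X_test)[:, 1]
auc_value = roc_auc_score(y_test, proba)
recall_value = recall_score(y_test, (proba >= 0.5).astype(int))
print(f"AUC-ROC = {auc_value:.3f}, recall = {recall_value:.3f}")
```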
The selected model can be further examined through its Receiver Operating
Characteristic (ROC) curve, depicted in Figure 4. The continuous line plots the
relationship between the True Positive Rate (TPR) and the False Positive Rate
(FPR) at different decision thresholds.
The area under the curve is 0.777, indicating a moderately high discriminatory
ability. In other words, the model differentiates between defecting and loyal
customers better than chance. Chance is represented by the dashed diagonal
line, which corresponds to a model with no predictive ability and an AUC of
0.5.
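How the points of such a curve and its area are obtained can be sketched with scikit-learn's `roc_curve` and `auc`. The labels and scores are hypothetical; the diagonal is computed the same way to show where the 0.5 reference comes from.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90])

# Each threshold yields one (FPR, TPR) point of the continuous line.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# The dashed diagonal from (0, 0) to (1, 1) is the no-skill reference.
chance_area = auc([0.0, 1.0], [0.0, 1.0])

print(round(area, 4), chance_area)
```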
Meanwhile, Figure 5 shows the five most relevant characteristics in the
prediction of customer defection, according to the Random Forest model with
Random Downsampling. These variables stand out for having the highest
importance values within the model.
Figure 4: ROC curve for Random Forest model with Random Downsampling
Figure 5: Most influential characteristics in the best-performing attrition model
The number of active service months is the most important of all the features,
which suggests that billing history and customer loyalty are closely associated
with attrition. Likewise, the importance of longitude and latitude suggests
that geographic location influences customer tenure. The ratio of on-time
payments to total payments and the average hours to close critical tickets
also play an important role.
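A ranking of this kind can be obtained from a fitted Random Forest as sketched below. The study does not state which importance measure was used; the sketch assumes scikit-learn's impurity-based `feature_importances_`. The feature names follow Table 2, while the data and the target (driven almost entirely by the first column) are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "Total_invoices_loyalty", "Geo_longitud", "Geo_latitud", "Compliance_index",
    "Average_hours_tickets_resolved", "Tickets_resolved", "Product_price",
]

# Synthetic features; the target depends mostly on the first column so that
# the resulting ranking is predictable in this illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Impurity-based importances sum to 1; keep the five largest.
importances = pd.Series(clf.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

Impurity-based importances can favor continuous features with many distinct values, so a permutation-based check is a reasonable complement when interpreting such a ranking.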
Geospatial analysis of client attrition
To understand the spatial distribution of customer churn, geospatial analysis
tools were applied to the predictions generated by the Random Forest model with
random downsampling. Through this analysis, it was possible to identify spatial
patterns, as well as areas with a higher risk of churn, which allows the
results to be interpreted in a geographical context and provides strategic
information to the ISP to optimize the service.
The geospatial analysis of customer churn predictions is illustrated in
Figure 6, which shows three interactive maps of the areas where the ISP
operates. These allow the identification of spatial patterns related to
defection.
The first map (left) shows the geographical spread of
predictions for each of the customers using their positioning data. Three
colors are shown in this map, associated with the attrition probability: low in
green, medium in orange, and high in red. From this visualization, it can be
seen that there is a higher concentration of clients likely to drop out in
urban areas, and a lower concentration in rural areas.
The second map (center) provides an attrition analysis for the parishes in
which the ISP is present. For this purpose, the individual predictions were
aggregated to estimate, for each parish, the probability that its customers
will churn. In this map, darker shades represent higher chances of churn. These
territorial patterns support the design of retention strategies and
decision-making.
Figure 6: Visualizing customer churn prediction on maps from the model
The third map (right) is a heat map that shows
the distribution of clients with a high drop-out propensity, based
on predictions from the best model. Regions in red and yellow hues mark
concentrations of customers who tend to drop out. The visualization helps to
locate hotspots that require extra effort to earn customer loyalty.
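The idea behind such a heat map can be approximated by counting high-risk customers per grid cell. The following is a minimal sketch under assumptions: the cell size of 0.01 degrees, the 0.66 probability cut-off, the function name `hotspot_counts`, and the sample points are all illustrative and are not taken from the study, which used interactive mapping tools.

```python
import math
from collections import Counter


def hotspot_counts(points, cell_size=0.01, min_probability=0.66):
    """Count high-risk customers per grid cell of `cell_size` degrees.

    `points` is an iterable of (latitude, longitude, predicted_probability).
    Cells are identified by the integer indices of their south-west corner.
    """
    counts = Counter()
    for lat, lon, probability in points:
        if probability >= min_probability:
            cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
            counts[cell] += 1
    return counts


# Hypothetical customers; the first two fall in the same cell.
points = [
    (-1.672, -78.651, 0.9),
    (-1.678, -78.655, 0.8),
    (-1.674, -78.653, 0.2),  # low risk, ignored
    (-1.245, -78.625, 0.7),
]
counts = hotspot_counts(points)
densest_cell, densest_count = counts.most_common(1)[0]
```

Cells with the largest counts correspond to the hotspots that a heat map would render in the warmest colors.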
DISCUSSION
The research question posed in this study aims to
determine how a predictive model that integrates machine learning and
geospatial analytics can predict customer churn at an ISP operating in Ecuador.
This discussion analyses the results achieved in the execution of this study,
from data processing, customer segmentation, and predictive modeling to
geospatial analysis.
The results show that churn prediction is a complex
problem that benefits from multiple data sources and methodologies. One of the
most important phases of this study was the pre-processing of the data, which
ensured the quality and consistency of the
dataset. To build a robust dataset for further analysis and training,
duplicate records were eliminated, missing values were handled, and
outliers were identified. In addition, one of the most prominent
challenges related to the dataset was class imbalance, as only 7.32% of the
clients had dropped out. This aspect is noteworthy as it significantly
influenced the performance of the classification models, necessitating the
application of class balancing techniques to improve detection rates, in line
with the findings of
The segmentation identified four clusters of customers
with similar characteristics, which were analyzed to understand
customer behavior. These clusters were derived from data generated after customers
joined the company, i.e., length of service, payment compliance, and interaction with
the service. Cluster 1, which is composed of
customers with irregular payments and less time with the company, showed the
highest dropout. In contrast, clusters 0 and 2, which are composed of
customers with more time in the service and timely payments, showed lower
attrition. These results relate to studies by
The evaluations of each of the classification models,
presented in Error!
Reference source not found.,
highlight Random Forest with Random Downsampling as
the one that achieved the best balance between ROC-AUC (0.777) and recall
(0.809), while maintaining a reasonable false positive rate. The results also
highlight the effectiveness of Random Downsampling in addressing
class imbalance, as it allowed the model to learn from the
minority class without affecting the overall model performance. Studies by
The ability of the Random Forest model to measure
feature importance in churn prediction was particularly valuable, as it
identified customer subscription length as the most important factor in
predicting churn, which supports the hypothesis that customers with
longer subscription times are less likely to churn. Likewise,
geospatial characteristics, such as longitude and latitude, were highly
relevant, suggesting that geographical factors influence the decision to
leave the service. This aligns with
Geospatial analysis provided additional information about the pattern of
dropout predictions.
Interactive maps allowed for easy visualization of how clients behave
according to the model predictions. For example, the heat map revealed
urban areas with higher attrition rates, which may point to
service quality issues or aggressive commercial pressures. Similarly,
parish-level data identified areas where clients are more likely to leave,
pointing towards the need for localized retention efforts. These
results are in line with
For ISPs, geospatial information is relevant because
combining it with segmentation modeling and churn
forecasting allows them to develop focused retention programs, which could be
more effective than generic ones. Examples include running targeted loyalty programs
in areas where attrition is likely, improving infrastructure and service quality,
or investigating the potential of adjusting the price of internet plans. In
addition, recognizing geographic trends makes it possible to
anticipate churn patterns and allocate resources more
effectively.
The outcomes of this study on churn prediction in an ISP
highlight the importance of combining different methodologies such as machine
learning and geospatial analysis. Customer segmentation is critical to
understanding customer behavior, while class balancing and
classification models improve discrimination between loyal customers and
defectors. Even though geospatial characteristics may not be the strongest
predictors, they are useful for decision-making. This result corresponds with the
study of
For future research, it would be important to address
ensemble and hybrid techniques, such as those used by
This study demonstrates the effectiveness of combining
data pre-processing, segmentation, classification modeling, and geospatial
analysis for customer churn prediction in ISPs. The findings add to
the growing literature on predictive churn analysis in the telecommunication
sector and give the company valuable insights for devising strategies that
target customers who are at higher risk.
CONCLUSIONS
This research examined the prediction of customer
churn in an internet service provider in central Ecuador, using a model based
on machine learning techniques and geospatial analysis. It was possible to
determine significant patterns of customer churn and select the optimal model
for its prediction based on pre-processing of data, customer segmentation, and
evaluation of different classification models.
Based on the results, customers with short tenure and low
on-time payment compliance were more likely to leave the service.
In addition, the application of data balancing techniques improved the
performance of the model by allowing it to better discriminate between customers
likely to drop out of the service and those who are not. The best model in
this study was a Random Forest with Random Downsampling,
which achieved the best balance between ROC-AUC and recall.
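For readers unfamiliar with the balancing step, random downsampling and the recall metric can be sketched in a few lines of plain Python. This is an illustrative sketch, not the pipeline used in the study: the function names, the seed, and the toy dataset (7 churners out of 100 customers, chosen to approximate the class imbalance reported above) are assumptions.

```python
import random


def random_downsample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    `label_of` extracts the binary label (1 = churned, 0 = retained).
    """
    churned = [r for r in rows if label_of(r) == 1]
    retained = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted((churned, retained), key=len)
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def recall(y_true, y_pred):
    """Share of actual churners that were predicted as churners."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Toy dataset: 7 churners out of 100 customers (hypothetical).
rows = [{"id": i, "churn": 1 if i < 7 else 0} for i in range(100)]
balanced = random_downsample(rows, lambda r: r["churn"])
```

In common practice, downsampling is applied only to the training split, so that metrics such as recall are computed on data with the original class distribution.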
Furthermore, geospatial analysis allowed for the
identification of spatial patterns related to attrition. Such information is key for ISPs to design effective retention strategies.
Overall, the combination of all these approaches proved to be useful in
anticipating customer attrition.
Lastly, the combination of machine learning and
geospatial analysis is a valuable tool for improving customer management
in the telecom sector. Further research on
advanced modeling approaches and data enrichment is recommended to improve the
accuracy of the predictions and hence the effectiveness of retention
strategies.
REFERENCES
Adhikary, D. Das, & Gupta, D. (2021). Applying
over 100 classifiers for churn prediction in telecom companies. Multimedia
Tools and Applications, 80(28–29), 35123–35144.
https://doi.org/10.1007/S11042-020-09658-Z
Agasti, B. R., & Satpathy,
S. (2024). Predicting customer churn in telecommunication
sector using Naïve Bayes algorithm. Indonesian Journal of Electrical
Engineering and Computer Science, 35(3), 1610–1617.
https://doi.org/10.11591/ijeecs.v35.i3.pp1610-1617
Aguilar Cazar Miguel. (2024). Informe de
Avance del Indicador 8.1.2.: Porcentaje de parroquias rurales y cabeceras
cantonales con presencia del servicio de internet fijo a través de enlaces de
fibra óptica.
https://www.telecomunicaciones.gob.ec/wp-content/uploads/downloads/2024/07/Informe-de-Avance-del-Indicador_Primer-Trimestre_Indicador-8.1.2.pdf
Alteer, S. A., & Alariyibi,
A. (2024). Customer Churn
Prediction Using Machine Learning: A Case Study of Libyan Internet Service
Provider Company. 2024 IEEE 4th International Maghreb Meeting of the
Conference on Sciences and Techniques of Automatic Control and Computer
Engineering, MI-STA 2024 - Proceeding, 605–612.
https://doi.org/10.1109/MI-STA61267.2024.10599671
Bachan, L., & Gaber, T. (2021). Predicting Customer Churn in the Internet Service
Provider Industry of Developing Nations: A Single, Explanatory Case Study of
Trinidad and Tobago. Advances in Intelligent Systems and Computing, 1339, 835–844.
https://doi.org/10.1007/978-3-030-69717-4_77
Bilal, S. F., Almazroi,
A. A., Bashir, S., Khan, F. H., & Almazroi, A.
A. (2022). An ensemble based approach using a combination of clustering and
classification algorithms to enhance customer churn prediction in telecom
industry. PeerJ Computer Science, 8, e854.
https://doi.org/10.7717/PEERJ-CS.854
Cenggoro, T. W., Wirastari,
R. A., Rudianto, E., Mohadi,
M. I., Ratj, D., & Pardamean,
B. (2021). Deep Learning as a
Vector Embedding Model for Customer Churn. Procedia Computer Science, 179, 624–631.
https://doi.org/10.1016/J.PROCS.2021.01.048
Chong, A. Y. W., Khaw,
K. W., Yeong, W. C., & Chuah,
W. X. (2023). Customer Churn Prediction of Telecom Company
Using Machine Learning Algorithms. Journal of Soft Computing and Data
Mining, 4(2), 1–22. https://doi.org/10.30880/jscdm.2023.04.02.001
Colot, C., Baecke,
P., & Linden, I. (2021). Leveraging fine-grained mobile data for churn detection through
Essence Random Forest. Journal of Big Data, 8(1), 1–26. https://doi.org/10.1186/S40537-021-00451-9
Fatima, G., Khan,
S., Aadil, F., Kim, D. H., Atteia,
G., & Alabdulhafith, M. (2024). An autonomous mixed data
oversampling method for AIOT-based churn recognition and personalized
recommendations using behavioral segmentation. PeerJ
Computer Science, 10, 1–32. https://doi.org/10.7717/PEERJ-CS.1756
Geiler, L., Affeldt, S., & Nadif, M. (2022). An effective strategy
for churn prediction and customer profiling. Data
& Knowledge Engineering, 142, 102100. https://doi.org/10.1016/J.DATAK.2022.102100
Khoh, W. H., Pang, Y. H., Ooi, S. Y., Wang, L. Y. K., & Poh,
Q. W. (2023). Predictive Churn
Modeling for Sustainable Business in the Telecommunication Industry:
Optimized Weighted Ensemble Machine Learning. Sustainability, 15(11), 8631.
https://doi.org/10.3390/SU15118631
Krishna, R., Jayanthi, D., Shylu Sam, D. S.,
Kavitha, K., Maurya, N.
K., & Benil, T. (2024). Application of machine learning
techniques for churn prediction in the telecom business. Results in Engineering, 24, 103165.
https://doi.org/10.1016/J.RINENG.2024.103165
Lalwani, P., Mishra, M. K., Chadha, J. S.,
& Sethi, P. (2022). Customer churn prediction
system: a machine learning approach. Computing, 104(2), 271–294.
https://doi.org/10.1007/S00607-021-00908-Y
Liu, R., Ali, S.,
Bilal, S. F., Sakhawat, Z., Imran, A., Almuhaimeed, A., Alzahrani, A.,
& Sun, G. (2022). An Intelligent Hybrid Scheme for Customer Churn Prediction
Integrating Clustering and Classification Algorithms. Applied Sciences, 12(18), 9355.
https://doi.org/10.3390/APP12189355
Matuszelański, K., & Kopczewska,
K. (2022). Customer Churn in
Retail E-Commerce Business: Spatial and Machine Learning Approach. Journal of
Theoretical and Applied Electronic Commerce Research, 17(1), 165–198.
https://doi.org/10.3390/JTAER17010009
Number of internet users worldwide 2024
| Statista. (n.d.).
Retrieved January 22, 2025, from
https://www.statista.com/statistics/273018/number-of-internet-users-worldwide/
Ortakci, Y., & Seker,
H. (2024). Optimising customer retention: An AI-driven personalised pricing approach. Computers
and Industrial Engineering, 188, 109920.
https://doi.org/10.1016/j.cie.2024.109920
Ouf, S., Mahmoud, K. T., &
Abdel-Fattah, M. A. (2024). A proposed hybrid framework to improve the accuracy of customer
churn prediction in telecom industry. Journal of Big
Data, 11(1), 70. https://doi.org/10.1186/s40537-024-00922-9
Pejić Bach, M., Pivar,
J., & Jaković, B. (2021). Churn Management in Telecommunications: Hybrid
Approach Using Cluster Analysis and Decision Trees. Journal
of Risk and Financial Management, 14(11), 544.
https://doi.org/10.3390/JRFM14110544
Peng, X., Niu, Y. yan, Meng, B., Tao, Y., & Huang, Z. (2024). Big geo-data unveils influencing factors on
customer flow dynamics within urban commercial districts. International
Journal of Applied Earth Observation and Geoinformation,
134, 104231. https://doi.org/10.1016/J.JAG.2024.104231
Poudel, S. S., Pokharel, S.,
& Timilsina, M. (2024). Explaining
customer churn prediction in telecom industry using tabular machine learning
models. Machine Learning with Applications, 17,
100567. https://doi.org/10.1016/J.MLWA.2024.100567
Saha, L., Tripathy,
H. K., Gaber, T., El-Gohary, H., & El-kenawy, E. S. M. (2023). Deep Churn Prediction Method for
Telecommunication Industry. Sustainability, 15(5),
4543. https://doi.org/10.3390/SU15054543
Shrestha, S. M.,
& Shakya, A. (2022). A Customer Churn Prediction Model using XGBoost for the Telecommunication Industry in Nepal. Procedia
Computer Science, 215, 652–661. https://doi.org/10.1016/J.PROCS.2022.12.067
Soleiman-garmabaki, O., & Rezvani,
M. H. (2024). Ensemble classification using balanced data to predict customer
churn: a case study on the telecom industry. Multimedia Tools and
Applications, 83(15), 44799–44831. https://doi.org/10.1007/s11042-023-17267-9
Usman-Hamza, F. E., Balogun,
A. O., Amosa, R. T., Capretz,
L. F., Mojeed, H. A., Salihu,
S. A., Akintola, A. G., & Mabayoje,
M. A. (2024). Sampling-based
novel heterogeneous multi-layer stacking ensemble method
for telecom customer churn prediction. Scientific African, 24, e02223.
https://doi.org/10.1016/J.SCIAF.2024.E02223
Usman-Hamza, F. E., Balogun, A. O., Nasiru,
S. K., Capretz, L. F., Mojeed, H. A., Salihu, S. A., Akintola, A. G.,
Mabayoje, M. A., & Awotunde, J. B. (2024). Empirical analysis of tree-based classification models for customer
churn prediction. Scientific African, 23, e02054.
https://doi.org/10.1016/J.SCIAF.2023.E02054
Wagh, S. K., Andhale, A. A., Wagh, K. S., Pansare, J. R., Ambadekar,
S. P., & Gawande, S. H. (2024). Customer churn prediction in telecom sector using machine
learning techniques. Results in Control and Optimization, 14,
100342. https://doi.org/10.1016/J.RICO.2023.100342
Wahul, R. M., Kale, A. P., & Kota, P. N.
(2023). An Ensemble Learning Approach to Enhance Customer Churn
Prediction in Telecom Industry. International Journal of Intelligent
Systems and Applications in Engineering, 11(9s), 258–266.
Xu, T., Ma, Y., Ao, C., Qu, M., & Meng, X. H. (2023). A novel telecom customer churn analysis system based
on RFM model and feature importance ranking. Interdisciplinary Journal of Information,
Knowledge, and Management, 18, 719–737. https://doi.org/10.28945/5192
Xu, T., Ma, Y.,
& Kim, K. (2021). Telecom Churn
Prediction System Based on Ensemble Learning Using Feature Grouping. Applied Sciences, 11(11), 4742.
https://doi.org/10.3390/APP11114742
Zhao, Y., Shao, Z., Zhao, W., Han, J.,
Zheng, Q., & Jing, R. (2023). Combining unsupervised and supervised
classification for customer value discovery in the telecom industry: a deep
learning approach. Computing, 105(7), 1395–1417.
https://doi.org/10.1007/s00607-023-01150-4
*Master's in Systems Engineering, Lecturer at the
Escuela Superior Politécnica de Chimborazo (ESPOCH), Riobamba, Ecuador.
gabriela.solano@espoch.edu.ec, https://orcid.org/0009-0007-7565-4702
*Master's in Electronic Engineering in Telecommunications and Networks,
Lecturer at the Escuela Superior Politécnica de Chimborazo (ESPOCH), Riobamba, Ecuador.
nestor.estrada@espoch.edu.ec, https://orcid.org/0000-0002-4100-7351