Privacy preserving optimization of communication networks

Table of Contents

The CLUSTER framework

Consider a communication network with M users and N spatially located BSs that can alternate between active and sleep modes. As users exchange data they generate demand on each BS, typically driven by geographical proximity. Therefore, the BS load patterns are driven by the interplay between the geographical layout of the BSs and the temporal dynamics of the user connectivity. To capture this we seek to track the connection preferences (CP) of all users. At each point in time, user j’s CP is given by the N-dimensional vector l^j(t), whose ith term, ${l}_{i}^{\;j}(t)$, captures the probability of user j to connect to BS i; hence ${\sum }_{i=1}^{N}{l}_{i}^{\;j}(t)=1$. Collecting all preferences of all users via the matrix L(t) = [l¹(t), …, l^M(t)], we can express the demand at all BSs as

$$\tilde{{{{\bf{x}}}}}(t)={{{\bf{L}}}}(t){{{\bf{w}}}}(t),$$

(1)

where ${{{\bf{w}}}}(t)={({w}_{1}(t),\ldots,{w}_{M}(t))}^{\top }$ is the time dependent weight assigned to each user, capturing their varying activity levels.

Equation (1) describes the demand experienced by all BSs as a weighted sum of all CPs over all users. The weights w_j(t) ∈ [0, 1] account for the fact that users generate diverse traffic volumes, and hence the contribution of j’s CP to $\tilde{{{{\bf{x}}}}}(t)$ is proportional to the fraction of the total traffic generated by the j-user at time t; we normalize the total traffic volume at all times to follow $\mathop{\sum }_{j=1}^{M}{w}_{j}(t)=1$.

Equation (1) represents the incoming demand into all BSs, as determined by the user connectivity patterns. In reality, however, a BS may not always be available to accept this demand. This is because BSs may alternate between active and sleep modes, allowing to save on energy at times of low traffic. If at time t BS i is inactive, its demand will be redirected to one of the other stations, in accordance with all user CPs L(t). Hence, to obtain the actual BS loads we now consider the stations’ duty cycles ${{{\bf{u}}}}(t)={({u}_{1}(t),\ldots,{u}_{N}(t))}^{\top }$, in which the ith term describes the fraction of time that BS i is active during the interval (t, t + Δt). In simple terms, u_i(t) = 0 if i is at sleep mode for the entire interval; u_i(t) = 1 if, during this period, it is fully active; and 0≤u_i(t)≤1 if the BS is active for some of the time. In most practical applications u(t) is updated hourly, and hence Δt = 1 hour, and u_i(t) captures the fraction of this one-hour interval in which BS i is active.

The duty-cycles can be controlled by the network managers, and setting them determines the actual loads on all BSs, which we denote by ${{{\bf{x}}}}(t)={({x}_{i}(t),\ldots,{x}_{N}(t))}^{\top }$. These loads, therefore, arise from the interplay between the user CPs, as expressed within (1), and the BS duty cycles, which determine the available BSs at any given time point. For an individual user j, the load they introduce into BS i is proportional to the amount of traffic they generate (w_j(t)), their connection preference to the ith BS (${l}_{i}^{\;j}(t)$), and the probability that BS i is available to receive their incoming traffic (u_i(t)). Summing over all M users collectively, this provides us with ${x}_{i}(t)\propto \mathop{\sum }_{j=1}^{M}{u}_{i}(t){l}_{i}^{\;j}(t){w}_{j}(t)$, capturing the total contribution of all users to BS i’s load at time t. In matrix notation, this product sum can be written as (Supplementary Section 3.4)

$${{{\bf{x}}}}(t)={{\mathrm{diag}}}\,\left({{{\bf{u}}}}(t)\right){{{\bf{L}}}}(t){{{\bf{D}}}}(t){{{\bf{w}}}}(t)$$

(2)

where D(t) is a normalization factor, following

$${{{\bf{D}}}}(t)={{\mathrm{diag}}}\,{\left({{{{\bf{1}}}}}_{N}^{\top }{{\mathrm{diag}}}\,\left({{{\bf{u}}}}(t)\right){{{\bf{L}}}}(t)\right)}^{-1}.$$

(3)

In (2) and (3) we use diag(v) to describe a diagonal matrix whose ith term is the element v_i, hence, e.g., diag(u(t)) collects all duty cycles into an N × N diagonal matrix. The unity vector 1_N = (1, …, 1)^⊤ is an N dimensional vector of ones. Together, the diagonal matrix D(t) in (3) is used to normalize x(t) such that the total load at any time t sums to $\mathop{\sum }_{i=1}^{N}{x}_{i}(t)=1$. In case u_i(t) = 1, i.e. all BSs are constantly active, Eq. (2) recovers the demands of (1), up to the normalization factor D(t).

Equation (2) represents our first key observation. It establishes a direct relationship between the duty cycle settings u(t) and the subsequent load distribution x(t) across all BSs. This mapping provides essential guidance for BS controllers on how to design sleep/activation cycles to achieve a desired load distribution. It also enables to foresee the network’s dynamic response to changes in BS availability, predicting how the loads x(t) will adapt to a station’s shutdowns or sudden failure. Finally, the equation can help forecast the network’s ability to handle growing demand or shifts in user behavior, as, indeed, all these factors are inherently governed by the interplay of u(t), L(t) and w(t).

Protected data

The challenge is that the path from u(t) to x(t), as expressed in Eq. (2), is incomplete, unless we construct the two missing links: first, L(t), which encapsulates the time dependent CPs of all users, and second, w(t), which describes their usage dynamics. These two components are intrinsically tied to user behavior and thus involve privacy-sensitive information. For instance, user CPs are primarily influenced by geographical proximity to BSs, typically favoring nearby stations. Consequently, access to L(t) would enable network operators to infer user mobility patterns — a potential breach of individual privacy. Similarly, w(t), reflecting each user’s connectivity over time, reveals usage habits and cellular activity, again touching on protected information, subject to strict regulatory and ethical constraints.

To bypass these restrictions, we seek to use aggregated data, from which individual user information cannot be identified. Of course, aggregation risks the loss of important details that may impact x(t). Therefore, we wish to implement Eq. (2) based on the appropriate level of aggregation: on the one hand obscuring all PII, while on the other hand maintaining the relevant features by which to predict x(t) in a meaningful fashion.

From personal data to user clusters

To achieve the desired balance between aggregation and individual data, we seek to partition all M users into K ≪ M groups, or clusters, each cluster consisting of an order of ~ M/K individuals. Within each of these clusters we collect users that have been found to exhibit correlated CPs. Therefore, individuals within cluster α (α = 1, …, K) have similar l^j(t), and can be approximated by the cluster average ${{{{\bf{l}}}}}^{\alpha }(t)={\langle {{{{\bf{l}}}}}^{\;j}(t)\rangle }_{j\in \alpha }$. Now, instead of tracking individual users, we can model the BS loads by tracking the collective CPs characterizing each cluster.

This partition helps rewrite Eq. (2) in terms of the K collective CPs and weights. The clustered preference matrix L^c(t) will now have just K columns, i.e. L^c(t) = [l¹(t), …, l^K(t)], each column representing a single group-aggregated CP. Similarly, w(t) will also be clustered into a K-dimensional vector w^c(t), capturing the collective traffic generation within each cluster. Using these clustered L^c(t) and w^c(t) in (2) we obtain the desired link between u(t) and x(t). Only this time it is predicted by the behavior of the user clusters, rather than that of each individual user. It thus bypasses the PII restrictions.

The challenge is, therefore, to construct these reduced L^c(t), w^c(t) in an optimal fashion. The natural approach is to examine all user CPs, and use statistical reduction methods, such as principal component analysis or k-means, to consolidate them into separated clusters. This, however, still requires us to collect data on all individual CPs, leaving us precisely where we started. To overcome this we developed a Bayesian non-parametric (BNP) model ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ that allows us to (approximately) reconstruct L^c(t) and w^c(t) directly from the observed BS activation and loads u(t) and x(t), thus circumventing the need to track the individual usage patterns. In explicit terms, we only collect allowable data, which is, indeed, available at each BS, but then we use this data to infer the clustered CPs and weights. This results in L^c(t) and w^c(t) – aggregate enough to avoid PII, and yet, as we show below, sufficiently detailed to offer the desired predictive power.

The CLUSTER method—extracting the hidden clusters

The challenge of ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ is to translate the observed BS loads x(t) together with the known duty-cycles u(t), into the most likely set of hidden user CPs. In Supplementary Section 4 we show how to do this in an iterative fashion. We begin with an initial rough guess, the ansatz, and then use an iterative algorithm to refine this ansatz, until we reach a CP distribution that is satisfactorily consistent with the observed loads. To reduce the potentially immense search space we induce two prior assumptions. The first, represents the fundamental premise of CLUSTER – stating that the M user CPs can be aggregated into a limited number K of sets that follow statistically similar patterns. The second assumption is that user CPs change gradually over time, and hence exhibit temporal correlations across adjacent time points. After several rounds, ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ retrieves the most likely number of clusters K, their respective weights w^c(t) and their aggregated CPs L^c(t).

In simple terms, ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ takes as input a limited amount of observations of u(t) and x(t), data that contains no PII, and provides as output a well-approximated L^c(t) and w^c(t), thus allowing us to construct the N × K clustered Eq. (2). All without accessing any user protected data. In Supplementary Section 4.1 we also consider an alternative plain formulation, ${{{{\mathcal{M}}}}}_{{{{\rm{PLN}}}}}$, in which the number of clusters K is pre-set to some initially assumed value.

While the technical details of ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ are explained in Supplementary Section 4.2, here, to gain insight into the conceptual idea of the model, we offer an illustrative description in Fig. 1. We consider, at each specific time point t, the individual users and their CPs as beads in a Galton board. The K groups correspond to K initial release points on the board (funnels), forming clusters of different weights w^c(t). These weights are captured here by the number of beads within each funnel. Our observation of x(t) corresponds to the distribution of the beads at the N different bins located at the bottom of the board – each bin representing a specific BS. This final distribution of bin population contains information on the initial release points. However, this information is obscured by the stochastic nature of the bead-paths as they traverse the Galton board (dashed green arrow). Such stochasticity, in our system, originates in the fact that the users within each cluster exhibit some level of variability and randomness, and hence, using our analogy, they do not all begin at the exact same release point (scattered greed/orange beads), and, furthermore, they may follow slightly different paths along their trajectories from these release points to the bins at bottom of the board. Our model ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ uses the observed distribution x(t) across the bins (BSs) to retrieve the most likely number of release points (K), their placement (L^c(t)) and their initial bead population (w^c(t)).

Taken together ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ (and ${{{{\mathcal{M}}}}}_{{{{\rm{PLN}}}}}$) remove a crucial barrier along the path to effective communication network management. They do this by systematically linking the BS loads x(t) and the operator controlled duty cycles u(t) without the need to track all individual user connection patterns. Network managers can now set the optimal duty-cycle per each BS, attuned to their specifically desired load balance, using the clustered Eq. (2). They can also detect critical BSs, whose removal may significantly affect x(t), or alternatively design strategic BS deployment to reduce loads and address potential coverage gaps. The crucial point is that now, in lieu of the PII-protected M-dimensional L(t) and w(t), they can rely on the reduced K-dimensional L^c(t) and w^c(t) to achieve such predictions. However, before we advance to these applications, let us first examine the CLUSTER toolbox to establish its performance against both real and simulated data.

Testing CLUSTER

Numerical testing

(Supplementary Section 5.1). We first examine ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ under the controlled environment of a numerically simulated population, capturing the dynamics of M = 10³ users with distributed weights w_j(t), roaming around in a virtual town over the course of 365 days (Fig. 2a,b). Users are partitioned into four different groups, depending on their domestic-occupational routine. For example, group 1 lives in Neighborhood 1 and commutes Downtown daily for work (Fig. 2a,b, yellow); group 2 of the same neighborhood, in contrast, remains homeward bound (blue). These different daily cycles capture realistic routines observed in actual demographic data. We next, spread N = 13 BSs at random locations across town, and set, for each BS i, its randomly generated duty-cycle u_i(t) (Fig. 2c). As users move across town they connect preferably to the nearest active BS (Fig. 2d). This determines each user’s CP, as illustrated in Fig. 2e.

**Fig. 2: Testing CLUSTER in a simulated town.**

Next, we extract the BS loads x(t), as they emerge from the combination of user location and BS availability, and use ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ to uncover the most likely CP-clusters. Note that in our simulation design, just like in the real-world, the clusters are not explicitly introduced into the data. Rather they arise endogenously from the fact that different users have distinct routines, driven by their place of habitat and commuting patterns. Such population dynamics lead to the emergence of natural clusters, that are roughly coordinated with our distinct population groups. We, therefore, test CLUSTER on its ability to retrieve these grouping patterns, without any access to individual user movements – relying on the observed u(t) and x(t) alone. To be clear, while the simulation does model individual usage and movement patterns, our CLUSTER analysis was deliberately rendered blind to that data, hence tracking only the accessible BS parameters u(t) and x(t).

In Fig. 2f we show the time-averaged clustered weights w^c, as predicted by ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$. The algorithm detected K = 9 distinct clusters that together account for 99% of the user weights; the remaining small clusters that amount to ~ 1% are ignored. Looking at their collective CPs L^c(t) we find that, indeed, each of these clusters has a unique connection pattern across the 13 BSs (Fig. 2g). This is expressed through the dissimilarity in the BS (color) distribution across the nine inferred CP vectors, indicating that ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ successfully detected statistically distinct population groups.

We can now examine the extracted cluster CPs and link them to our simulated population. For example, clusters 2, 3 and 9 distribute their CPs across the three BSs located in neighborhood N2. Therefore, they likely represent the inhabitants of this neighborhood, some of whom live closer to BSs purple and light-green (clusters 2, 9), and some in the light and dark-green part of the neighborhood (cluster 3). Indeed, the weights of these three clusters combined sum up to the total weight of neighborhood 2 (Fig. 2h, middle bars). A similar analysis helps retrieve also the remaining population groups in neighborhoods N1 and N3.

Together, these results demonstrate our algorithm’s ability to successfully recover the hidden patterns in the observed BS load data. The algorithm not only detected the natural partition of the population into neighborhoods, but also exposed the distinct sub-groups within each neighborhood, based on their geographic locales and commuting patterns. Most crucially, it achieved this without relying on any individual user data, but only by tapping into the collective observable load patterns.

Empirical testing

(Supplementary Section 5.3). Next, we conduct a similar analysis using empirical data from two real-world networks, a 5G network in Liuzhou and a 4G network in Cixi, both towns located in China (Fig. 3). We begin with the Liuzhou dataset, which captures the behavior of M ≈ 7000 users over a span of about 60 days, at Δt = 3 h resolution (i.e. 8 samples/day). The network comprises N = 302 BSs, divided across five geographic regions (Fig. 3a,b): Countryside (30 BSs, red), Outskirts (46 BSs, green), Downtown (102 BSs, blue), Residential (83 BSs, yellow), and Industrial (41 BSs, violet). For each region, we collected time-series data on the BS duty cycles u(t) and their corresponding loads x(t). For instance, the Countryside region includes 30 BSs monitored over 64 days, thus producing two sets of 30-dimensional time-varying vectors—one for activation (u(t)) and one for load (x(t)). Both vectors span 512 time points (64 days × 8 samples/day). Sample traces of u_i(t) and x_i(t) for selected BSs—one from each region—are shown in Fig. 3c,d, illustrating typical daily patterns. We also tracked the average population in each region throughout the day (Fig. 3e), offering demographic snapshots that reflect temporal shifts in user distribution.

**Fig. 3: Testing CLUSTER against real-world data.**

In principle, had we been able to track all individual user information, we could construct the complete connection preference matrix L(t) and the individual user weights w(t). To illustrate this consider the local population in Countryside at t = 9: 00 AM, which, on average, amounts to 630 individuals (Fig. 3e). This sub-population collectively connects to Countryside’s 30 BSs. Therefore, tracking each of these users would provide us with a 30 × 630 matrix L(t) and a 630 dimensional vector w(t). These two observables, had we been able to access them, would together link the duty cycles u(t) with the measured loads x(t) via Eq. (2). Lacking direct access to these data, we instead use ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ to retrieve the clustered L^c(t) directly from the observed u(t), x(t). The 630 users in Countryside at 9: 00 AM, we find, break into 3 distinctive clusters, whose collective weights are shown in Fig. 3f (red); the remaining 18 clusters, at that time of day, for the four other regions are also shown. Both the number of clusters K and their specific composition may vary across time, adapting to the evolving dynamics of the user population. Supplementary Table S6 details the number of clusters identified in each region throughout the day.

Taking the ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ inferred L^c(t) and w^c(t), together with the empirically accessible u(t), we can now use Eq. (2) to predict the loads on all BSs through time. In Fig. 3g do just that. We use ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ to construct ${p}_{{{{\mathcal{M}}}}}({{{\bf{x}}}}(t)| {{{\bf{u}}}}(t))$, the probability distribution to observe loads x(t) given the empirically measured u(t). We then extract our predicted loads ${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t)$ from the inferred distribution as

$${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t) \sim {p}_{{{{\mathcal{M}}}}}\left({{{\bf{x}}}}(t)| {{{\bf{u}}}}(t)\right).$$

(4)

The loads in (4) represent the ultimate outcome of our framework: taking only u(t) as input, we predict the anticipated loads ${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t)$ as output, absent any protected user information. Such prediction is only enabled thanks to our ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ extracted clusters. As Fig. 3g shows, the predicted loads ${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t)$ (blue solid line) recover, quite accurately, the experimentally observed loads x_Exp(t) (green solid line). Some variability around the empirical loads is also observed (circles, shaded area), an inevitable outcome of the stochastic nature of our analytical construction. We quantify these discrepancies using the mean absolute error (MAE) of ${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t)$, which we find, in this dataset, to be MAE = 3.599 × 10⁻³, capturing a mean discrepancy of 6.69% (see Supplementary Section 6.3 for a rigorous precision analysis). The observed results clearly indicate that CLUSTER can successfully capture empirical user dynamics directly from the collective, low-resolution, non-restricted data accumulated at each BS.

We now turn to conduct a similar analysis on the 4G Cixi network data. This dataset confronts our algorithm with several challenges that were not present in Liuzhou. First, the database is an order of magnitude smaller, comprising a range of M = 300–800 users, depending on the examined time of day. It therefore exhibits smaller, and hence, noisier, statistics. Second in this dataset we do not have access to the load of each individual antenna. Instead, the data from all antennae on the same tower are aggregated, leaving us with only N = 8 BSs, each representing a tower with potentially several antennae. Finally, the modus operandi of the Cixi network control is to leave all BSs almost permanently active, i.e. u_i(t) = 1 for all t. This excludes all information rooted in the activation variability of the BSs, that ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ can potentially use to enhance its predictive power. Still, despite these limitations, we find in Fig. 3i that our predicted ${{{{\bf{x}}}}}_{{{{\mathcal{M}}}}}(t)$ (blue solid line) continues to successfully approximate the actual, empirically observed, load patterns (green solid line). This time, quite naturally, with a higher degree of uncertainty (circles, shaded area, MAE = 0.014).

CLUSTER in comparison to existing methods

To evaluate CLUSTER’s performance in context, we consider several relevant alternatives, all designed to link x(t) to the operator controlled u(t), while bypassing the hidden individualized user data. This includes traditional linear regression methods,⁴² alongside more advanced deep learning approaches, based on neural networks and their more sophisticated variants^43,44. Using the Liuzhou Countryside data as our benchmark, we find that CLUSTER achieves a smaller MAE as compared to all other examined methods (Fig. 4a, black). This trend is quite consistent throughout all the daily time-intervals (Fig. 4b), barring a few instances in which the Gaussian Process Regression (GPR) method⁴⁵ (red) features a minor advantage over CLUSTER.

**Fig. 4: CLUSTER alongside alternative optimization methods.**

Interestingly, CLUSTER and its leading contender, the GPR method, address this problem via fundamentally different modeling paradigms. GPR is a general nonparametric method that takes u(t) as input and directly predicts x(t) through a Gaussian process governed by a kernel function. It makes no explicit assumptions about the intermediate variable L(t), and therefore, despite its generality, it offers no interpretable insights into the hidden user CPs. In contrast, CLUSTER imposes a structural assumption on L(t), positing that it can break down into groups of users sharing common behavioral patterns. This design, which is perhaps less general, is specifically tailored to the anticipated characteristics of the user statistics—particularly the tendency of social networks to form homogeneous clusters. Therefore, while GPR benefits from flexible model adaptation, CLUSTER derives its strength from its unique structural alignment with the specific nature of the problem.

To examine this, we simulated 100 distinct realizations of user-BS interaction with varying levels of randomness R, and clustering C (see Methods Section 5 and Supplementary Section 5.4). To understand these two parameters, we refer, once again, to the Galton board analogy (Fig. 4c). In this illustration the clusters originate in the discrete funnels spread out along the top of the board. Therefore, in a highly clustered system, i.e. where C → 1, the users (beads) condense within a small number of funnels, thus following similar trajectories as they advance towards the bottom of the board. In the opposite limit of C → 0, they disperse evenly across all funnels. Once the beads are distributed across the funnels, the randomness in their dynamics is controlled by the number of pegs. The system is fully deterministic, i.e. R → 0 when there are no pegs, and fully random (R → 1) in the limit where the number of pegs approaches infinity.

In the C, R space, CLUSTER is most well-suited for intermediate levels of clustering, i.e. 0 ≤ C ≤ 1. In this range, since C > 0, users can, indeed, be partitioned into meaningful clusters. As C approaches unity, however, we arrive at a system where all users fall within just 1 or 2 clusters, and hence the clustering becomes non-informative. A similar reasoning renders GPR most appropriate for intermediate levels of R. To observe this we measured ΔMAE = MAE_GPR − MAE_CLR, the difference in MAE between the two methodologies. A positive ΔMAE indicates that CLUSTER outperforms GPR, whereas a negative value implies superior performance by GPR.

In Fig. 4d we present ΔMAE for our 100 simulated scenarios, plotted in the C, R parameter space. As expected, CLUSTER achieves peak performance at intermediate values of C, specifically within the range C ∈ (0.2, 0.8), indicated by the dark blue area. Conversely, GPR shows a slight performance advantage around R ≈ 0.6 (red regions), once again, in line with our expectations. The crucial point is, that apart from these localized advantages for GPR, CLUSTER consistently outperforms across the entire remainder of the C, R plane. Hence, it emerges as the preferred method under a broad range of conditions. A similar performance advantage holds in a systematic comparison of CLUSTER with all other evaluated methods (Supplementary Section 5.4).

Taken together, the results presented in Figs. 2–4 clearly demonstrate CLUSTER’s empirical relevance. We can now use its proven predictive power to gain insights towards optimal network planning.

CLUSTER-based network planning insights

Consider a user j in an area covered by many BSs. This user will likely experience high quality and robust service, benefiting from sufficient connection redundancy. In j’s CP this will be expressed through a non-localized dispersed l^j(t), whose entries will be spread across a variety of nearby BSs. In contrast, if j is at an underserved area, with only a small number of nearby BSs, their CP will be localized, with almost all connection probabilities concentrated on j’s few available BSs. The latter points to a service deficiency, and thus suggests that in these areas, where CPs tend to be localized, it is best to increase coverage and add BSs (or increase activation of existing BSs). Hence, once again, access to user CPs proves crucial for network management.

We, therefore, use CLUSTER to identify such patterns in the collective CP ensemble L^c(t). To achieve this we project all CPs on to a lower dimensional space ${{{\mathcal{S}}}}$, using a combination of two well-established algorithms: the uniform manifold approximation and projection (UMAP) method and the classic k-means clustering algorithm (Supplementary Section 6.2). In the resulting ${{{\mathcal{S}}}}$-projection, dispersed CPs will likely have much overlap, as they spread their weight over a large number of BSs. Hence, even if users belong to different ${{{\mathcal{M}}}}$-clusters under ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ (or ${{{{\mathcal{M}}}}}_{{{{\rm{PLN}}}}}$) they may still be grouped together within the same ${{{\mathcal{S}}}}$-cluster. This results in intermixed clusters, i.e. ${{{\mathcal{S}}}}$-clusters that exhibit a combination of many different CP types, originating in distinct ${{{\mathcal{M}}}}$-clusters. In contrast, localized CPs will only condense with their own type, those who rely on the same scarce collection of BSs. In ${{{\mathcal{S}}}}$, they will appear as homogeneous and isolated clusters, clearly separated from the primary intermixed cluster.

To observe this we, once again, simulated the population of a virtual town, featuring domestic and work-related regions (Supplementary Section 5.2). The population included homeward bound individuals, who travel sporadically and irregularly (Fig. 5a, green), alongside regular commuters who move daily from their neighborhood to their workplace (orange). To help examine the ${{{\mathcal{S}}}}$, ${{{\mathcal{M}}}}$ cluster relationship, here we used ${{{{\mathcal{M}}}}}_{{{{\rm{PLN}}}}}$ to divide the users into a pre-set number of clusters, setting K = M = 10³. Hence, in this simulation each cluster correspond to an individual simulated user.

**Fig. 5: Network planning insights by CLUSTER.**

In Fig. 5b, we present a series of ${{{\mathcal{S}}}}$ projections captured at different times of the day. In these projections, each point corresponds to a user CP, and the spatial ${{{\mathcal{S}}}}$-clustering helps group these users based on their correlated spatiotemporal movement patterns. To highlight the CP similarity, we color all points according to each user’s most-preferred BS (see Methods). Thus, a point’s position reflects its ${{{\mathcal{S}}}}$-cluster membership, while its color indicates its ${{{\mathcal{M}}}}$-cluster affiliation. As expected, we find a combination of multi-color and mono-chromatic clusters. The former represent intermixed clusters, i.e. users in a confined spatial region that spread their loads across a variety of available BSs. The latter, however, capture isolated clusters, whose users are served by just one or two BSs. These isolated clusters, we argue, indicate potential coverage gaps.

To gain deeper insight we focus on a specific projection at t = 0 h, this time using a 3-dimensional ${{{\mathcal{S}}}}$ clustering (Fig. 5c). The mapping features one central intermixed cluster (colorful), 4 clearly isolated mono-chromatic clusters (green, turquoise, yellow and red), and one dual-colored cluster (green/blue), which, given its low color diversity we also consider to be isolated. Next we project the ${{{\mathcal{S}}}}$-clusters back onto the two dimensional town map (Fig. 5d). We find that the isolated clusters, indeed, correspond to geographically concentrated populations (dashed ellipses) that are served by a small number of BSs (highlighted). The dual-colored cluster (green-blue) captures a population that is at the boundary between two neighboring BSs, and therefore, quite expectedly, exhibits the observed mixture of two CPs.

The crucial point is that, according to our prediction, the observed isolated ${{{\mathcal{S}}}}$-clusters point to areas that experience lacking coverage. To test this we measured the coverage gap G(x, y) at each coordinate on the map. This gap quantifies the difference between the user traffic at (x, y) and the collective coverage afforded by all BSs at that point (Supplementary Section 6.4). Fig. 5e clearly shows that the detected regions corresponding to the isolated clusters (dashed ellipses) suffer from significant coverage gaps (yellow-red patches), and would, indeed, benefit from additional BSs in their vicinity. Hence, CLUSTER can offer direct insights into future network expansion.

Empirical application

To examine our analysis in an empirical setting, we once again return to our Liuzhou dataset, and its five regions (Fig. 5f,g): Countryside (red), Outskirts (green), Downtown (blue), Residential (yellow) and Industrial (violet). We constructed five ${{{\mathcal{S}}}}$-projections for each of these regions, capturing the connectivity patterns at two distinctive times of the daily cycle – noon (Fig. 5f, t = 12 h) and midnight (Fig. 5g, t = 0 h). As expected, in Countryside at noontime we observe no isolated clusters, and hence no coverage gaps. This is likely due to the fact that there is little demand at these areas in the middle of the workday. In Outskirts we detect two isolated clusters, which indicate a locale with high demand (Fig. 5f, circled). Indeed, we find that these clusters correspond to a main avenue, where there is likely high midday traffic (highlighted on map). In Downtown and Residential we find large intermixed clusters, capturing the concentration of the different demographics at work or at home during the day. We also detect two isolated clusters, one of which is dual-colored, consisting of BSs from both regions (enlarged projection). These smaller clusters capture the constant traffic between residence and downtown throughout the day (e.g., for errands), linking BSs from both regions in the user CPs. Finally, two isolated clusters appear in Industrial, indicating coverage gaps at the interface of that region with Downtown (circled).

The above CP-patterns are consistent with the population’s peak daytime routine, but then they change dramatically when we shift to midnight (Fig. 5g). Now we observe the emergence of multiple isolated clusters in all areas, including also the peripheral areas Countryside and Residential. This reflects the nocturnal nature of the population, as traffic and mobility cease during the night and all residents remain at their homes. Therefore, with the majority of individuals remaining stationary, their CPs become localized, focusing on just one or two nearby BSs, thus leading to the observed isolated homogeneous ${{{\mathcal{S}}}}$-clusters.

Interestingly, our methodology detected a significantly higher density of coverage gaps in the simulated network compared to the empirical data. As shown in Fig. 5e, the simulated system with N = 15 BSs revealed five underserved regions—a gap-to-BS ratio of 1:3. In contrast, the empirical data featured only 10 gaps across N = 302 BSs, corresponding to a much lower ratio of 1:30—an order of magnitude smaller than in simulation. This outcome is expected: in our simulation, BSs were placed randomly, a deliberate choice aimed at generating a large number of gaps for CLUSTER to detect. As opposed to that, the real-world data from Liuzhou reflects a professionally planned network, where BSs are strategically deployed to minimize such gaps. The fact that this distinction between random and planned deployment is also reflected in our results further underscores CLUSTER’s predictive power.

Critical base stations

When a BS i malfunctions its user load is redistributed, transferred to nearby BSs. This user transfer can be predicted via ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$, by detecting a user cluster that relied on i and will now transition to its next preferred BS j. This results in a potential uptick in the load observed at j. The intensity of j’s increased load ${\tilde{x}}_{j}(t)$ helps quantify i’s criticality. As long as j’s response is small, service is maintained, and i is said to have little impact on the network functionality. If, however, j’s load spikes significantly following i’s failure, the system may observe service decline, and hence i is deemed critical. We can, therefore quantify a BS’s criticality by tracking its user load transfer to all other BSs^46,47,48,49. This leads to

$${G}_{i}(t)=\mathop{\sum }_{\begin{array}{c}j=1 \atop j \ne i\end{array}}^{N}\left\{\frac{1}{{x}_{j}(t)}\int\left(\tilde{x}-{x}_{j}(t)\right) \, {f}_{j}\left(\tilde{x}\,| \,{{{\bf{x}}}}(t),{{{{\bf{u}}}}}_{-i}(t),{{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}\right){{{\rm{d}}}}\tilde{x}\right\},$$

(5)

where u_−i(t) represents the event of i’s failure and ${f}_{j}(\tilde{x}| Y)$ is the probability density that ${x}_{j}(t)\in (\tilde{x},\tilde{x}+{{{\rm{d}}}}\tilde{x})$, conditional on event Y. The integral on the r.h.s. of (5) quantifies the expected increase in the load of BS j given the current loads x(t), the event of i’s failure u_−i(t), and the load transfer patterns from i to j as predicted by ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$. Normalized by the current x_j(t), the expression in the summation captures the relative increase in j’s load, which is then aggregated over all j ≠ i. We, therefore, have G_i → 0 if i’s load transfer has no impact on any other BS ($\tilde{x}-{x}_{j}(t)\to 0$), and G_i > 0 if i’s failure causes a surge in the load of some BSs.

We begin by extracting G_i(t) for all BSs in our simulated dataset (Fig. 6a, snapshot at t = 0 hours), dividing them into low risk (G_i(t) < 1, blue), moderate (1 ≤ G_i(t) ≤ 2, green) and critical (G_i(t) > 2, red). These risk values, however, change throughout the day, as user CPs and weights evolve. To observe this we show in Fig. 6b the daily averaged number of BSs at each category vs. time. We observe an increase in moderate and critical BSs that aligns, quite expectedly, with the peak morning and afternoon hours (red shaded).

In Fig. 6c we show P_t(G), the probability density to observe G_i(t) ∈ (G, G + dG), as obtained for four different times of the day. The majority of BSs, we find, peak around G_i(t) ~ 1.0, i.e. low to moderate risk. Yet, we also observe a consistent minority of critical BSs, captured by the secondary peak at the r.h.s. of the plot. We observe the highest criticality at midday, e.g., at t = 12 (purple), likely due to peaking activity coupled with the fact that users from various locations condense around occupational areas. In all cases, we find that the criticality risks are condensed on a small minority of BSs, indicating that relatively minor interventions can have a major impact on the network resilience. Of course, to take advantage of this fact, we must identify where such interventions are required, namely, pinpoint the critical BSs. This is precisely the power of our CLUSTER analysis, that it can help detect, among the many BSs, which are the network Achilles heels that must be reinforced.

In Fig. 6d we present the average daily risk cycle of all BSs (solid lines) superimposed with their observed loads x(t) (dashed lines). Interestingly, the critical stations are divided into two types: in BS 9, for example, G₉(t) is strongly correlated with the BS load x₉(t). Hence, 9’s criticality is rooted in its high traffic volume, which must be redistributed among its neighboring BSs. Such load-based criticality takes a sharp dip at times of the day when 9 experiences lower demand. In contrast, BS 5 exhibits a consistent G₅(t), largely detached from its load x₅(t). This form of criticality is driven by the 5’s neighborhood, which has typically low demand and hence displays a high sensitivity to 5’s removal. Identifying these two types of criticality—load-based vs. environment-driven, offers direct insight on mitigation. For BS 9 it is best to add an additional BS in the same area, that will help ease its load. In 5’s case, however, it seems the optimal strategy would be to increase the potential load capacity of its neighbors, rather than that of 5 itself, thus helping the neighboring BSs successfully absorb 5’s potentially redistributed load.

Next, we extract G_i(t) from our empirical Liuzhou dataset, examining its five regions at noon and at midnight (Fig. 6e,f). In Countryside, especially in isolated areas, BSs exhibit higher criticality due to sparse BS installation. In Residential, high G_i(t) BSs are often found in more densely populated regions. This observation is consistent with the urban residential patterns in China’s cities, where most residents live in high-density communities within residential buildings, resulting in clusters of high user concentration. Comparing G_i(t) across regions (Fig. 6g-k), we detect more critical BSs in Residential and Downtown at midnight, consistent with simulation results and indicative of the imbalanced geographical distribution of user homes. Interestingly, P_t(G) in Industrial remains nearly constant, possibly due to round-the-clock factory operations.

Practical considerations

Network planning is a complex, multi-faceted process that must balance service optimization with a range of additional factors, including environmental constraints, economic feasibility, and even geopolitical considerations. Among these diverse concerns, CLUSTER is designed to address the technical dimension — specifically, predicting BS loads, identifying potential service deficiencies, and assessing the network’s response to failures or interventions. In this capacity, CLUSTER provides professional and technical guidance for network optimization but is not intended to function as a standalone decision-making tool. Rather, it is meant to support and empower planners by enabling more informed decisions that appropriately weigh technical insights alongside economic, social, and geographical priorities

Another practical consideration is the computational complexity of CLUSTER, particularly how it scales with the number of BSs (N), the covered time period (T ), and the volume of users (M). In Supplementary Section 6.6, we demonstrate that the algorithmic complexity of ${{{{\mathcal{M}}}}}_{{{{\rm{BNP}}}}}$ grows as ${{{\mathcal{O}}}}({(NT\log M)}^{5/4})$. This reflects a slightly superlinear dependence on N and T, and a slow, logarithmic growth with respect to the user base M.

link

Privacy preserving optimization of communication networks

The CLUSTER framework

Protected data

From personal data to user clusters

The CLUSTER method—extracting the hidden clusters

Testing CLUSTER

Numerical testing

Empirical testing

CLUSTER in comparison to existing methods

CLUSTER-based network planning insights

Empirical application

Critical base stations

Practical considerations

athenahealth Launches Agentic Patient Communication Tools Across Its Provider Network, Which Serves One in Five Americans

Neural Network Models Human Concept Formation and Communication

A neural network for modeling human concept formation, understanding and communication

Leave a Reply Cancel reply

AI Enhances Pilot Training With Supercharged Debriefings, Embry-Riddle President Writes

News – Hexcel

Civil Aviation Industry Market: Trends, Growth,

Measure to improve tech research funding advances out of House committee

Metafuels Raises $24 Million to Scale Low-Cost Synthetic Sustainable Aviation Fuel Technology

The CLUSTER framework

Protected data

From personal data to user clusters

The CLUSTER method—extracting the hidden clusters

Testing CLUSTER

Numerical testing

Empirical testing

CLUSTER in comparison to existing methods

CLUSTER-based network planning insights

Empirical application

Critical base stations

Practical considerations

More Stories

Leave a Reply Cancel reply

You may have missed