Communication resource allocation method in vehicular networks based on federated multi-agent deep reinforcement learning

In this paper, we model the vehicular networking communication resource allocation problem as a multi-agent reinforcement learning task that combines Asynchronous Federated Learning (AFL) with Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Each vehicle is treated as an agent that optimizes its resource allocation strategy through reinforcement learning as it interacts with the environment. Multiple agents jointly explore the dynamically changing network environment and continuously adjust the resource allocation strategy by sharing global information and local experience, improving the performance of the whole vehicular networking system. To solve the privacy leakage problem of traditional approaches, an asynchronous federated learning mechanism is adopted so that each vehicle's learning process is carried out locally and only the model update parameters are aggregated into the global model, thereby protecting data privacy. To further improve the adaptability of the system and the efficiency of resource allocation, this paper also introduces a dynamic weight adjustment mechanism, which flexibly adjusts the update strategy of the global model parameters according to the real-time communication environment and task requirements of each vehicle. The algorithm is divided into two phases: a local training phase and a global aggregation phase17. In the local training phase, each agent is trained independently, optimizing its strategy based on the local environment and uploading its updated model parameters. In the global aggregation phase, the global server receives the model parameter updates, aggregates them with weights computed by the dynamic weight adjustment mechanism, and issues the new global model parameters. The design of the AFL-MADDPG-based communication resource allocation method for vehicular networking is described in detail next.

State space design

To support multi-agent collaboration, dynamic weight adjustment, and global model optimization in vehicular networking, this paper designs a state space containing local observation information, task demand information, and global statistical information. Each vehicle, acting as an agent, can only observe the local communication environment information relevant to it, including: its own channel gain \(g_{k} [m]\), the channel quality of vehicle k on channel m; the interference from other V2V links \(g_{{k,k{\prime} }} [m]\), the interference caused by other V2V link vehicles to vehicle k on channel m; the interference of V2I links \(g_{m,k} [m]\), the interference of V2I links to vehicle k on channel m; and the base-station interference \(g_{k,B} [m]\), the interference from the base station's transmitted signal to vehicle k. In vehicular networking, the heterogeneity of vehicles' task requirements has an important impact on the resource allocation strategy, so the following task-related information is incorporated into the state space: the bandwidth requirement \(b_{k}\) of the current task of vehicle k, and the task priority \(p_{k}\) of vehicle k, which determines the importance of the task. To better exploit the global model parameter aggregation in asynchronous federated learning, global statistical information is introduced to reflect the overall network state, including the global network load \(L_{global}\) (the overall load of the current network, e.g., channel occupancy) and the global interference level \(I_{global}\) (the overall interference intensity of all communication links in the network). To support the dynamic weight adjustment mechanism, further metrics affecting the optimization of the global model parameters are added to the state space: the communication quality \(C_{k}\) of vehicle k (e.g., channel strength and latency), the model update quality \(Q_{k}\) (the extent to which the local model update of vehicle k contributes to the global model), and the model update frequency \(F_{k}\) of vehicle k.

Combining the above information, the state space of vehicle k at moment t is designed as:

$$S_{k}^{t} = \{ g_{k} [m],g_{{k,k{\prime} }} [m],g_{k,B} [m],g_{m,k} [m],b_{k} ,p_{k} ,L_{global} ,I_{global} ,C_{k} ,Q_{k} ,F_{k} \}$$

(10)
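For concreteness, the following minimal Python sketch assembles the observation vector of Eq. (10); the function and argument names are illustrative and assume the per-channel gains have already been measured over the M sub-channels.

```python
import numpy as np

def build_state(g_k, g_kk, g_mk, g_kB, b_k, p_k,
                L_global, I_global, C_k, Q_k, F_k):
    """Assemble the observation S_k^t of vehicle k (Eq. 10).

    g_k, g_kk, g_mk, g_kB are length-M arrays (one entry per sub-channel);
    the remaining arguments are scalars.
    """
    channel_part = np.concatenate([
        np.asarray(g_k,  dtype=np.float32),   # own channel gain
        np.asarray(g_kk, dtype=np.float32),   # interference from other V2V links
        np.asarray(g_mk, dtype=np.float32),   # interference from V2I links
        np.asarray(g_kB, dtype=np.float32),   # base-station interference
    ])
    scalar_part = np.array([b_k, p_k,            # task demand information
                            L_global, I_global,  # global statistics
                            C_k, Q_k, F_k],      # dynamic-weight metrics
                           dtype=np.float32)
    return np.concatenate([channel_part, scalar_part])
```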

Action space design

In the resource allocation scheme for vehicular networking communication proposed in this paper, the design of the action space directly determines the decision-making ability of the agent (vehicle) under the multi-agent reinforcement learning framework. In order to adapt to the complexity of the dynamic environment and resource allocation requirements in Vehicular Networking, the design of the action space in this paper covers the three core aspects of spectrum access, transmit power control, and bandwidth allocation.

The action \(A_{k}^{t}\) of each agent (vehicle) k at time step t is defined as:

$$A_{k}^{t} = \{ f_{k} ,P_{k}^{d} [m],b_{k}^{alloc} \}$$

(11)

where \(f_{k}\) denotes the spectrum sub-band selected by the vehicle. It is a discrete variable, allowing the vehicle to dynamically select the currently optimal spectrum resource (e.g., occupying an unused sub-band or sharing the V2I band). \(P_{k}^{d} [m]\) denotes the transmit power of the vehicle on the V2V link and is a continuous variable with value range \(0 \le P_{k}^{d} [m] \le P_{\max }^{d}\).

\(b_{k}^{alloc}\) denotes the bandwidth allocated to the current task of vehicle k. It is a continuous variable used to satisfy the task requirements of the vehicle and to ensure the communication performance of tasks with different priorities. The bandwidth allocation constraint is as follows:

$$b_{k}^{alloc} = \sigma (z_{b}^{k} ) \cdot \frac{{B_{{{\text{total}}}} }}{K} \cdot \eta_{{{\text{priority}}}}^{k}$$

(12)

where \(\sigma (z) = \frac{1}{{1 + e^{ - z} }}\) is the normalization function; its output is limited to (0,1), which ensures that \(b_{k}^{alloc} \in (0,\frac{{1.5B_{{{\text{total}}}} }}{K})\). \(\eta_{{{\text{priority}}}}^{k}\) is the service priority coefficient, a discrete value: low priority = 0.8, medium = 1.0, high = 1.5 (predefined by QoS demand). \(B_{{{\text{total}}}} = 10\,{\text{MHz}}\) is the total system bandwidth. The Sigmoid function caps the bandwidth of a single node at \(\frac{{1.5B_{{{\text{total}}}} }}{K}\); the total bandwidth is allowed to overload up to \(1.5B_{{{\text{total}}}}\) for short periods, and the conflict between supply and demand is balanced by the penalty mechanism in the reward function. The bandwidth cap for high-priority tasks is 1.875 times (1.5/0.8) that of low-priority tasks.
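As an illustration of Eq. (12), the following minimal Python sketch computes a single vehicle's bandwidth allocation; the raw Actor output \(z_{b}^{k}\), the vehicle count K, and the function names are illustrative assumptions.

```python
import math

PRIORITY_COEFF = {"low": 0.8, "medium": 1.0, "high": 1.5}  # eta_priority^k
B_TOTAL_MHZ = 10.0                                          # B_total

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bandwidth_allocation(z_b, priority, num_vehicles):
    """Eq. (12): b_k^alloc = sigma(z_b) * (B_total / K) * eta_priority^k."""
    return sigmoid(z_b) * (B_TOTAL_MHZ / num_vehicles) * PRIORITY_COEFF[priority]

# Example: a high-priority task with raw Actor output z_b = 1.2 and K = 8 vehicles.
# Its allocation is capped at 1.5 * B_total / K = 1.875 MHz.
print(bandwidth_allocation(1.2, "high", 8))  # ~1.44 MHz
```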

Bandwidth allocation needs to take into account the QoS requirements of high-priority tasks and the basic rights and interests of low-priority tasks. In order to quantitatively evaluate the balance of bandwidth allocation and verify the reasonableness of the trade-off between fairness and efficiency under the prioritization mechanism, the Jain fairness index is introduced:

$$J = \frac{{\left( {\sum\limits_{k = 1}^{K} {b_{k}^{{{\text{alloc}}}} } } \right)^{2} }}{{K \cdot \sum\limits_{k = 1}^{K} {\left( {b_{k}^{{{\text{alloc}}}} } \right)^{2} } }}$$

(13)

where the numerator is the square of the total allocated bandwidth, reflecting the overall resource utilization, and the denominator is K times the sum of squared bandwidths allocated to the individual nodes, reflecting the differences between nodes. The range of values is \(J \in \left[ {\frac{1}{K},1} \right]\): J = 1 indicates absolute fairness (all nodes share equally), while J approaching 1/K indicates extreme unfairness (a single node monopolizes the resources). The data in Table 1 show that the prioritization mechanism trades a 6% fairness loss for a 6% efficiency gain, whereas the greedy strategy trades a 35% fairness loss for only a 10% efficiency gain, which verifies that the proposed method effectively balances spectrum efficiency and fairness.

Table 1 Comparison of Jain Fairness Index.
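For reference, a short NumPy sketch of Eq. (13) (not code from the paper) that reproduces the two fairness extremes discussed above:

```python
import numpy as np

def jain_fairness(b_alloc):
    """Eq. (13): Jain fairness index of the per-vehicle bandwidth allocations."""
    b = np.asarray(b_alloc, dtype=float)
    return b.sum() ** 2 / (b.size * np.square(b).sum())

print(jain_fairness([1.0, 1.0, 1.0, 1.0]))  # 1.0  -> absolute fairness
print(jain_fairness([4.0, 0.0, 0.0, 0.0]))  # 0.25 -> one node monopolizes (1/K)
```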

Vehicles determine the resource access strategy for communication by selecting spectrum sub-bands. The goal of spectrum selection is to maximize the transmission success rate and reduce system interference; the discrete selection range of \(f_{k}\) includes multiple predefined spectrum sub-bands (e.g., V2I sub-bands, idle sub-bands). The agent dynamically adjusts the transmit power based on the channel quality and interference to balance energy consumption and communication quality; \(P_{k}^{d} [m]\) is a continuous variable calculated by the Actor network of the agent based on the current state. To adapt to the heterogeneity of multiple tasks, the agent can dynamically adjust the bandwidth allocation: high-priority tasks are allocated more bandwidth while the fairness and efficiency of resource allocation are maintained. \(b_{k}^{alloc}\) is a continuous variable decided by the agent based on the task requirements and the global bandwidth constraint.

By simultaneously controlling spectrum access, transmit power and bandwidth allocation, the agent can optimize the resource allocation strategy in multiple dimensions and adapt to complex vehicular networking communication scenarios. Spectrum access and transmit power control are mainly based on local observation information, while bandwidth allocation combines information such as task priority provided by the global model, achieving the unification of local optimization and global collaboration. The design of the action space fully utilizes the processing capability of MADDPG for continuous action space, and at the same time conforms to the distributed execution characteristics of multi-agent reinforcement learning. In the case of dynamic changes in vehicle density, task demand or interference intensity, the action space can flexibly adjust the decision variables of the agent and improve the efficiency of resource allocation.

Reward function design

When applying reinforcement learning to solve optimization problems in high-dimensional complex scenarios, the design strategy of the reward function directly affects the algorithm convergence and the upper limit of performance. In this paper, the reward function is designed to simultaneously optimize the total capacity of the V2I link and the success rate of the V2V link payload transmission to improve the overall system performance. In addition, in order to adapt to the dynamic vehicular networking environment and the global model optimization requirements, the reward function introduces task priority, communication quality, and global performance related metrics.

Combining V2I link capacity, V2V link success rate, task priority and global optimization objectives, the reward function of the agent is designed as:

$$\begin{gathered} R_{i} = \lambda (t) \cdot \sum\limits_{m} {C_{m}^{I} [m,t]} + (1 - \lambda (t)) \cdot \sum\limits_{k} {\left( {Z_{k} (t) \cdot C_{k} \cdot p_{k} } \right)} + \varphi \cdot Q_{{{\text{global}}}} \\ R_{k} = R_{i} - \Upsilon \cdot \max \left( {0,\frac{{\sum\limits_{k = 1}^{K} {b_{k}^{alloc} } - B_{{{\text{total}}}} }}{{B_{{{\text{total}}}} }}} \right)^{2} \\ \end{gathered}$$

(14)

where \(\sum\limits_{m} {C_{m}^{I} } [m,t]\) denotes the total capacity of all V2I links and measures the communication performance between vehicles and the base station. In the bandwidth overload penalty term, \(\Upsilon\) = 2.0 is the penalty coefficient, which controls the strength of the overload penalty, and \(\frac{{\sum {b_{k}^{alloc} } - B_{{{\text{total}}}} }}{{B_{{{\text{total}}}} }}\) is the overload ratio, which quantifies the extent to which the total allocated bandwidth exceeds the rated value. The penalty term takes effect only when \(\sum {b_{k}^{alloc} } > B_{{{\text{total}}}}\). The gradient characteristics are:

$$\frac{\partial R}{{\partial b_{k} }} = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {\sum\limits_{k} {b_{k}^{alloc} } \le B_{{{\text{total}}}} } \hfill \\ { - 2\Upsilon \cdot \frac{{\sum\limits_{k} {b_{k}^{alloc} } - B_{{{\text{total}}}} }}{{B_{{{\text{total}}}}^{2} }},} \hfill & {\sum\limits_{k} {b_{k}^{alloc} } > B_{{{\text{total}}}} } \hfill \\ \end{array} } \right.$$

(15)

The data in Table 2 shows that the combined performance of conflict probability and spectral efficiency is optimal when \(\Upsilon\) = 2.0, and the spectral efficiency decreases significantly when \(\Upsilon\) = 3.0.

Table 2 Tuning results of penalty coefficients.

V2V link payload transmission term \(Z_{k} (t)\):

$$Z_{k} (t) = \left\{ {\begin{array}{*{20}l} {\sum\limits_{m} {\rho_{k} [m]C_{k}^{{\text{V}}} [m,t]} ,} \hfill & {L_{k} \ge 0} \hfill \\ {\omega ,} \hfill & {L_{k} < 0} \hfill \\ \end{array} } \right.$$

(16)

When the payload is successfully delivered, the reward equals the actual transmission capacity; when delivery fails, a constant penalty \(\omega\) is assigned.

Task Priority \(p_{k}\): Used to indicate the importance of the current task of vehicle k. Higher priority tasks are given higher weight. Communication Quality \(C_{k}\): Reflects the communication conditions (e.g., channel quality, delay, etc.) of vehicle k and is used to adjust its reward value. Global model optimization objective \(Q_{global}\): denotes the performance metrics (e.g., global loss reduction, global throughput, etc.) of the global model in the asynchronous federated learning framework.

Dynamic Weights \(\lambda (t)\): Dynamically adjusted weight parameters for balancing V2I link capacity and V2V link success:

$$\lambda (t) = \frac{{\sum\limits_{m} {C_{m}^{I} } [m,t]}}{{\sum\limits_{m} {C_{m}^{I} } [m,t] + \sum\limits_{k} {Z_{k} } (t)}}$$

(17)

By dynamically adjusting the weighting parameter \(\lambda (t)\), the optimization focus adapts to the current network state (i.e., the relative weight of V2I link capacity versus V2V link success rate). The task priority \(p_{k}\) ensures that high-priority tasks receive stronger guarantees in resource allocation. The global performance metric \(Q_{global}\) enhances the global convergence of the asynchronous federated learning framework and aligns each agent's local optimization with the global goal. By combining link capacity, communication quality, and task requirements, the reward function can dynamically adapt to the highly variable network environments of vehicular networking.
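To make the reward structure of Eqs. (14)-(17) concrete, here is a minimal Python sketch. The weight \(\varphi\) of the global objective and the small stabilizing epsilon are assumptions not stated in the paper, \(\Upsilon = 2.0\) follows the tuning in Table 2, and the \(Z_k(t)\) values are assumed to have been computed per Eq. (16).

```python
import numpy as np

def dynamic_lambda(v2i_capacities, z_values):
    """Eq. (17): weight between V2I capacity and the V2V payload term."""
    c_i = float(np.sum(v2i_capacities))
    return c_i / (c_i + float(np.sum(z_values)) + 1e-9)  # eps avoids division by zero

def reward(v2i_capacities, z_values, comm_quality, priorities,
           q_global, b_alloc, b_total=10.0, phi=0.1, upsilon=2.0):
    """Eqs. (14)-(15): reward with the bandwidth-overload penalty.

    phi weights the global objective Q_global (assumed value);
    upsilon = 2.0 follows Table 2.
    """
    lam = dynamic_lambda(v2i_capacities, z_values)
    v2v_term = float(np.sum(np.asarray(z_values)
                            * np.asarray(comm_quality)
                            * np.asarray(priorities)))
    r_i = lam * float(np.sum(v2i_capacities)) + (1 - lam) * v2v_term + phi * q_global
    overload = max(0.0, (float(np.sum(b_alloc)) - b_total) / b_total)
    return r_i - upsilon * overload ** 2
```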

Asynchronous federated multi-agent deep reinforcement learning algorithms

Asynchronous Federated Learning (AFL) is a distributed machine learning approach that ensures data privacy while avoiding the computational bottlenecks of centralized approaches when training models across multiple agents (e.g., vehicles). Each vehicle performs model training locally using its own communication data and periodically uploads updated model parameters to a global server for aggregation, without transferring the raw data to a central location. Unlike traditional synchronous federated learning, AFL allows agents to upload updates at different points in time, which reduces communication delays and improves computational efficiency.

Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a reinforcement learning algorithm proposed for collaboration and competition in multi-agent environments. It extends the Deep Deterministic Policy Gradient (DDPG) method of deep reinforcement learning and enables multiple agents to collaborate in a shared environment by means of centralized training and distributed execution. In MADDPG, each agent independently executes a policy (Actor network) and interacts with the environment, while a shared Critic network evaluates the joint actions of all agents18. The core principle of MADDPG lies in exploiting the Actor-Critic architecture of each agent:

Policy-based Actor network: optimal actions (e.g., power control, spectrum access) are selected based on the current state.

Value-based Critic Network: evaluates the effect of joint actions of all agents and calculates Q-values to guide policy updates.

MADDPG is optimized by centralized training and distributed execution, where all the agents share a global Q-function for training, but each one executes its own policy, avoiding overly complex global optimization.

Specifically, the optimization objective of MADDPG is to maximize the expected cumulative reward:

$$\Psi (\theta_{i} ) = {\mathbb{E}}[\sum\limits_{t = 0}^{T} {\delta^{t} } R_{i} (t)]$$

(18)

where: \(R_{i} (t)\) is the reward received by agent i at moment t and δ is the discount factor.

The Critic network evaluates the joint actions of all agents by minimizing the temporal-difference error, while each Actor network optimizes its own policy by gradient ascent. In this way, the agents are able to optimize resource allocation in the vehicular network based on global information and local decisions.

The communication resource allocation problem in vehicular networking is a highly dynamic multi-agent problem involving multiple factors such as spectrum access, power control, and bandwidth allocation. Traditional centralized approaches suffer from computational bottlenecks and privacy leakage risks, especially in large-scale distributed systems such as vehicular networks, where vehicle mobility and data privacy make centralized processing difficult to apply effectively. While multi-agent deep reinforcement learning alone (e.g., MADDPG) can optimize the resource allocation strategy well, it still faces challenges in data privacy and communication efficiency.

To address these issues, this study proposes an approach that combines Asynchronous Federated Learning (AFL) with Multi-Agent Deep Deterministic Policy Gradient (MADDPG). AFL supports local training and parameter updating on the vehicles to reduce communication overhead and protect data privacy, while MADDPG optimizes resource allocation through a multi-agent collaboration framework that accounts for both the collaborative and competitive relationships between vehicles in the connected vehicle network. In addition, in asynchronous federated learning, simple average weighting may cause certain agents to contribute too much or too little to the global model, because different vehicles differ in communication environment, task requirements, and model update quality. To improve the performance and adaptability of the global model, this study introduces a dynamic weight adjustment mechanism, which dynamically adjusts the weight of each agent in the global model parameter aggregation based on its update quality, communication quality, and update frequency, thereby improving the efficiency of resource allocation. The framework of the AFL-MADDPG algorithm is shown in Fig. 2:

Fig. 2. Framework of Asynchronous Federated Learning-Multi-Agent Deep Deterministic Policy Gradient (AFL-MADDPG).

This algorithm integrates the advantages of asynchronous federated learning and MADDPG, which is divided into a local training phase and a global aggregation phase, and introduces a dynamic weight adjustment mechanism. Its basic flow is as follows:

Local training phase (vehicle side): Each vehicle k acquires its local state \(S_{k}^{t}\) at time step t. Based on the current state \(S_{k}^{t}\), the vehicle selects an action \(A_{k}^{t}\) through its Actor network, performs the corresponding communication operations, and interacts with the environment (other vehicles and base stations). It observes the feedback from the environment and obtains the reward \(R_{k}^{t}\) and the next state \(S_{k}^{t + 1}\). The interaction experience \((S_{k}^{t} ,A_{k}^{t} ,R_{k}^{t} ,S_{k}^{t + 1} )\) is stored in the experience replay buffer. The vehicle then uses the data in the replay buffer to train its Actor network and Critic network locally:

Critic Network Update: Calculates the target value using the target network:

$$y = R_{k}^{t} + \varsigma Q_{{{\text{target}}}} (S_{k}^{t + 1} ,A_{k}^{t + 1} ;\theta_{{{\text{target}}}} )$$

(19)

The Critic network is updated by minimizing the loss:

$$\hbar = {\mathbb{E}}\left[ {\left( {Q(S_{k}^{t} ,A_{k}^{t} ;\theta ) - y} \right)^{2} } \right]$$

(20)

Actor network update: Update the policy network parameters so that the policy outputs actions that maximize the Q value of the Critic network.

After a vehicle completes a round of local training, it uploads its model updates (parameters) to the global server.
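A minimal PyTorch-style sketch of the local update described above (Eqs. 19-20) follows; the network architectures, the discount value, and the single-agent Critic input are simplifying assumptions (in the full MADDPG setting the Critic would also receive the other agents' actions).

```python
import torch
import torch.nn as nn

def critic_update(critic, critic_target, actor_target, critic_opt,
                  s, a, r, s_next, gamma=0.95):
    """Eqs. (19)-(20): TD target from the target networks, then MSE loss."""
    with torch.no_grad():
        a_next = actor_target(s_next)                             # next action from target Actor
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    loss = nn.functional.mse_loss(q, y)                           # Eq. (20)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()

def actor_update(actor, critic, actor_opt, s):
    """Policy step: maximize Q(s, actor(s)) by minimizing its negative."""
    a = actor(s)
    loss = -critic(torch.cat([s, a], dim=-1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```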

Global aggregation phase (global server-side): The global server asynchronously receives the local model update parameters \(\theta_{k}\) uploaded by each vehicle and stores these updates. The server calculates dynamic weights based on the communication quality, model update quality and update frequency of each vehicle:

$$w_{k} = \alpha \cdot Q_{k} + \beta \cdot C_{k} + \gamma \cdot F_{k}$$

(21)

where \(Q_{k}\) is the model update quality of the vehicle, \(C_{k}\) is the communication quality of the vehicle, \(F_{k}\) is the model update frequency of the vehicle, and \(\alpha ,\beta ,\gamma\) are weight adjustment coefficients used to balance the importance of each index.

The quality of a vehicle's model updates directly affects the effectiveness of the global model, and high-quality updates (e.g., a lower loss function value or higher accuracy) should receive higher weights. A vehicle's communication quality (e.g., signal strength, bandwidth, delay) affects the frequency and timeliness of its model updates, and vehicles with better communication quality should contribute more. In asynchronous updating, vehicles that update more frequently contribute more to the global model and should therefore be given higher weights.

The model parameters of all vehicles are weighted and averaged according to the dynamic weights wk to update the global model parameters:

$$\theta_{{{\text{global}}}} = \frac{{\sum\limits_{k} {w_{k} } \cdot \theta_{k} }}{{\sum\limits_{k} {w_{k} } }}$$

(22)

The updated global model parameters are stored as the latest version and distributed to all vehicles.
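The global aggregation step (Eqs. 21-22) can be sketched as follows; the dictionary representation of model parameters is an assumed convention, and the coefficient values follow the Table 6 combination (0.6, 0.3, 0.1).

```python
def dynamic_weight(q_k, c_k, f_k, alpha=0.6, beta=0.3, gamma=0.1):
    """Eq. (21): w_k = alpha*Q_k + beta*C_k + gamma*F_k."""
    return alpha * q_k + beta * c_k + gamma * f_k

def aggregate(global_params, updates):
    """Eq. (22): weighted average of the uploaded parameter sets.

    `updates` is a list of (theta_k, w_k) pairs, where theta_k maps
    parameter names to arrays/tensors (an assumed representation).
    """
    total_w = sum(w for _, w in updates)
    new_params = {}
    for name in global_params:
        new_params[name] = sum(w * theta[name] for theta, w in updates) / total_w
    return new_params
```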

In the local training phase, each vehicle trains the reinforcement learning model based on the local observation state, optimizing the resource allocation strategy while ensuring data privacy. In the global aggregation phase, the server integrates the model updates uploaded by vehicles through a dynamic weight adjustment mechanism to optimize the global model parameters and ensure the overall system performance. Through the collaboration of these two phases, the method is able to achieve efficient resource allocation in dynamic vehicular networking scenarios while taking into account privacy protection and global performance optimization.

The pseudo-code of AFL-MADDPG based resource allocation algorithm for vehicular networking communication is as follows:

(Pseudo-code presented as a figure in the original.)

Theoretical analysis of dynamic weighting mechanisms

Information entropy constraints

This constraint quantifies the fairness of the weight allocation and prevents certain nodes from being overly suppressed or becoming dominant due to environmental fluctuations. Dynamic weight allocation should follow the maximum entropy principle to guarantee fairness, so the information entropy of the weight distribution is defined as:

$$\begin{gathered} H(w) = - \sum\limits_{k = 1}^{K} {w_{k} } \ln w_{k} \hfill \\ {\text{s}}.{\text{t}}.\sum\limits_{k = 1}^{K} {w_{k} } = 1,\quad w_{k} \ge 0 \hfill \\ \end{gathered}$$

(23)

where wk is the aggregation weight of the kth agent and K is the total number of agents (vehicles) in the system. H(w) is the information entropy of the weight distribution, with range [0, ln K]; a larger entropy value indicates a fairer distribution. Without additional constraints, H(w) is maximized by the uniform distribution wk = 1/K.

Introducing the practical QoS constraint wk ≥ φk, the Lagrangian function is constructed as:

$${\mathcal{L}} = H({\mathbf{w}}) + \bar\lambda \left( {\sum {w_{k} } - 1} \right) + \sum {\mu_{k} } (w_{k} - \phi_{k} )$$

(24)

Solving this function by the Lagrange multiplier method yields a weight assignment that balances fairness and performance.
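A small NumPy sketch of the entropy computation in Eq. (23), useful for checking the fairness threshold (e.g., 0.8 Hmax) reported in Table 3; the example weight vectors are illustrative.

```python
import numpy as np

def weight_entropy(w):
    """Eq. (23): H(w) = -sum_k w_k ln w_k for a normalized weight vector."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # enforce the sum-to-one constraint
    w = np.clip(w, 1e-12, 1.0)         # avoid log(0)
    return float(-np.sum(w * np.log(w)))

K = 64
print(weight_entropy(np.full(K, 1.0 / K)))   # ln(64) ~= 4.16, the maximum entropy
print(weight_entropy(np.eye(K)[0] + 1e-6))   # near 0: one node dominates the weights
```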

The data in Table 3 show that, with 64 vehicles, the maximum entropy is Hmax = ln 64 ≈ 4.16. The attainment rate refers to the percentage of rounds in which the dynamic weight entropy is ≥ 0.8 Hmax. The results show that the dynamic weight entropy is consistently about 83% higher than that of the greedy strategy, but about 12% lower than that of pure average weighting (a trade-off between fairness and performance). The analysis concludes that the entropy constraint mechanism successfully prevents excessive concentration of the weight allocation.

Table 3 Comparative analysis of information entropy.

Gradient similarity criterion

This criterion measures the consistency of the local model update direction with the global objective and suppresses the influence of low-quality gradients. The cosine similarity between the local gradient and the global gradient is defined as:

$$\alpha = \frac{{\langle \nabla L_{k} ,\nabla L_{G} \rangle }}{{\left\| {\nabla L_{k} } \right\|\left\| {\nabla L_{G} } \right\|}}$$

(25)

where \(\nabla L_{k}\) denotes the local model gradient of agent k (with the same dimension as the neural network parameters) and \(\nabla L_{G}\) is the global model gradient (the gradient direction after federated aggregation). The cosine similarity \(\alpha \in [-1,1]\), with α = 1 when the directions coincide and α = -1 when they are completely opposite, quantifies the degree of synergy between the local update direction and the global objective. The contribution of directionally consistent nodes is amplified through α: nodes with α > 0.7 are regarded as "high-quality updates" and their weights wk are enhanced, while nodes with α < 0.3 may be down-weighted due to data anomalies or channel interference.

The weighting parameter β is determined by the channel quality Ck and the bandwidth requirement \(b_{k}^{req}\):

$$\beta = \frac{{C_{k} }}{{C_{max} }} \cdot \frac{{b_{k}^{req} }}{{B_{total} }}$$

(26)

where: Cmax is the maximum channel quality and Btotal is the total system bandwidth.
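The gradient similarity of Eq. (25) and the channel/bandwidth factor of Eq. (26) can be computed with the following NumPy sketch; the inputs are illustrative, and the small epsilon guarding against zero-norm gradients is an added assumption.

```python
import numpy as np

def cosine_similarity(grad_local, grad_global):
    """Eq. (25): cosine similarity between the local and global gradients."""
    g_l = np.asarray(grad_local, dtype=float).ravel()
    g_g = np.asarray(grad_global, dtype=float).ravel()
    return float(g_l @ g_g / (np.linalg.norm(g_l) * np.linalg.norm(g_g) + 1e-12))

def beta_weight(c_k, b_req, c_max, b_total):
    """Eq. (26): channel-quality and bandwidth-demand factor."""
    return (c_k / c_max) * (b_req / b_total)

alpha = cosine_similarity([0.2, -0.1, 0.4], [0.25, -0.05, 0.35])
print(alpha)  # close to 1 -> treated as a "high-quality update" (alpha > 0.7)
```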

The data in Table 4 show that the weight gain is the ratio of the dynamic weight to the average weight (for example, 1.8 times the average weight for α ∈ [0.6, 0.9)). The dynamic weights increase the proportion of high-quality nodes (α ≥ 0.6) to 67%, compared with only 10.5% under random weights.

Table 4 Gradient similarity distribution.

Proof of convergence

The state of the system is described by constructing a Lyapunov (energy) function V(t), which is shown to decrease over time.

$$V(t) = \frac{1}{2}\sum\limits_{k = 1}^{K} {w_{k} } (t)\left\| {\theta_{k} (t) - \theta^{*} } \right\|^{2}$$

(27)

where \(\theta_{k} (t)\) is the model parameter of agent k at round t, and \(\theta^{*}\) is the optimal global parameter, assumed to exist. The convergence condition is that the learning rate \(\eta_{t}\) satisfies \(\sum\limits_{t = 1}^{\infty } {\eta_{t} } = \infty\) and \(\sum\limits_{t = 1}^{\infty } {\eta_{t}^{2} } < \infty\), so that \({\mathbb{E}}[V(t + 1)|V(t)] \le V(t) - \eta_{t} \sum\limits_{k = 1}^{K} {w_{k} } (t)\left\| {\nabla L_{k} (\theta_{k} (t))} \right\|^{2}\); the convergence theorem then yields \(\mathop {\lim }\limits_{t \to \infty } {\mathbb{E}}[V(t)] = 0\), which ensures that the algorithm converges.

The data in Table 5 show that, in terms of convergence speed, the dynamic weights reach a loss value of 0.1 in 1500 rounds, 42.3% faster than the fixed weights. In terms of interference resistance, after the interference events at round 600, the dynamic weights recover within 200 rounds, while the fixed weights need 450 rounds. As for the final difference, at 2000 rounds the loss value of the dynamic weights is only 7.9% of that of the fixed weights.

Table 5 Comparison of convergence trajectories.

Parameter selection validation

Table 6 presents the performance of different parameter combinations; the optimal combination (α, β, γ) = (0.6, 0.3, 0.1) is obtained by Bayesian optimization.

Table 6 Sensitivity analysis experiments.

The data in Table 7 shows that each component of the dynamic weights contributes positively to the performance.

Table 7 Ablation experimental design.

Analysis of dynamic weighting strategies

Model quality term Qk (fault decision filter).

To suppress nodes with abnormal gradients, a gradient similarity index is introduced in place of the traditional loss function value:

$$Q_{k} = \exp \left( { - \frac{{\left\| {\nabla L_{k} - \nabla L_{G} } \right\|_{2}^{2} }}{{\sigma_{Q}^{2} }}} \right)$$

(28)

\(\sigma_{Q}\) = 1.5; the weights decay by 80% when the gradient anomaly exceeds 3. When a newly joined agent emits an erroneous gradient (\(\left\| {\nabla L_{k} - \nabla L_{G} } \right\|_{2} > \delta\)):

$$\frac{{\partial w_{k} }}{{\partial \Delta \nabla L_{2} }}\bigg|_{{{\text{error}}}} = - \frac{2\alpha }{{\sigma_{Q}^{2} }}\Delta \nabla L_{2} \cdot Q_{k} \le - 0.32\alpha < 0$$

(29)

Erroneous decisions decay the Qk index and the weights are automatically reduced to less than 40% of the baseline value.

Communication quality term Ck (transmission stability guarantee).

To filter out nodes with poor channels, a Sigmoid channel-adaptive model is introduced:

$$C_{k} = \frac{1}{{1 + e^{{ - \eta ({\text{SNR}}_{k} - {\text{SNR}}_{0} )}} }}$$

(30)

η = 2.0 and SNR0 = 15 dB; when the channel deteriorates (SNR < 10 dB), Ck tends to 0, suppressing the node's participation in the aggregation and preventing the aggregation from diverging.

Update frequency term Fk (cold-start acceleration term).

To control the intensity of participation of new nodes, a gradual participation mechanism is used:

$$F_{k} = \min \left( {1,\frac{{N_{k}^{{{\text{update}}}} }}{\kappa }} \right)\quad (\kappa = 5)$$

(31)

New agents start with Fk ≈ 0.2 and become fully engaged after 5 rounds of training, which avoids the initial propagation of erroneous decisions. In the early stage after a new node joins: \(\gamma F_{k} \ll \alpha Q_{k} + \beta C_{k}\).

The Qk term directly blocks the propagation of erroneous gradients by enforcing gradient consistency. Fk and Ck provide double insurance against new-node cold-start risk and transient channel degradation, respectively; with α = 0.6 and γ = 0.1, the system tolerates up to 32% node failures, a 45% improvement over FedAvg. This strategy is particularly suitable for highly dynamic vehicular networking scenarios, in which new nodes join frequently and the channel environment fluctuates dramatically.
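Combining Eq. (21) with Eqs. (28)-(31), a minimal Python sketch of the per-node weight computation is given below; the coefficient values follow Table 6, while the example inputs are illustrative assumptions.

```python
import numpy as np

SIGMA_Q = 1.5            # spread of the model-quality term (Eq. 28)
ETA, SNR0 = 2.0, 15.0    # Sigmoid steepness and SNR threshold in dB (Eq. 30)
KAPPA = 5                # rounds required for full participation (Eq. 31)

def model_quality(grad_local, grad_global):
    """Eq. (28): Q_k decays exponentially with the squared gradient deviation."""
    d2 = float(np.sum((np.asarray(grad_local) - np.asarray(grad_global)) ** 2))
    return float(np.exp(-d2 / SIGMA_Q ** 2))

def comm_quality(snr_db):
    """Eq. (30): Sigmoid channel-adaptive model of the SNR."""
    return float(1.0 / (1.0 + np.exp(-ETA * (snr_db - SNR0))))

def update_frequency(n_updates):
    """Eq. (31): gradual participation of newly joined nodes."""
    return min(1.0, n_updates / KAPPA)

def node_weight(grad_local, grad_global, snr_db, n_updates,
                alpha=0.6, beta=0.3, gamma=0.1):
    """Eq. (21) assembled from the three terms above (coefficients from Table 6)."""
    return (alpha * model_quality(grad_local, grad_global)
            + beta * comm_quality(snr_db)
            + gamma * update_frequency(n_updates))

# A fresh node (1 update) on a poor channel (8 dB) with a deviating gradient
# receives a small aggregation weight, limiting cold-start and channel risk.
print(node_weight([1.0, 0.0], [0.0, 1.0], snr_db=8.0, n_updates=1))
```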

Federated learning accuracy analysis

To address the complexity of communication resource allocation scenarios in vehicular networking, this paper constructs a three-level accuracy assessment framework covering decision-making accuracy, collaborative performance, and environmental adaptability:

Bandwidth demand prediction error

$${\mathrm{BW}} = \frac{1}{KT}\sum\limits_{k = 1}^{K} {\sum\limits_{t = 1}^{T} {\left| {\hat{B}_{k}^{t} - B_{k}^{{{\text{opt}}}} } \right|} }$$

(32)

where \(\hat{B}_{k}^{t}\) is the predicted bandwidth demand of agent k at time slot t and \(B_{k}^{{{\text{opt}}}}\) is the theoretical optimal bandwidth based on centralized optimization solution. The prediction error reflects the accuracy of the local model in sensing the dynamic communication demand, which directly affects the spectrum utilization.

Global consistency index, GCI

$${\text{GCI}} = \frac{1}{{KT}}\sum\limits_{{k = 1}}^{K} {\sum\limits_{{t = 1}}^{T} {\exp } } \left( { - \left\| {\nabla L_{k}^{t} - \nabla L_{G}^{t} } \right\|_{2}^{2} } \right)$$

(33)

where \(\nabla L_{k}^{t}\) denotes the local model gradient and \(\nabla L_{G}^{t}\) is the global aggregated gradient. Exponential mapping is used to enhance robustness to gradient outliers, with values closer to 1 indicating better strategy synergy19.

Cross-scenario generalization gap

$$\Delta {\mathbb{C}} = \frac{1}{K}\sum\limits_{k = 1}^{K} {{\mathbb{C}}_{k}^{{{\text{test}}}} } - \frac{1}{K}\sum\limits_{k = 1}^{K} {{\mathbb{C}}_{k}^{{{\text{train}}}} }$$

(34)

where \({\mathbb{C}}_{k} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left( {\hat{B}_{k}^{t} - B_{k}^{{{\text{opt}}}} } \right)^{2} }\) is the local loss function, calculated as the mean square error (MSE)20. The test scenario library contains three unknown interference modes: tunnel occlusion (Tunnel), storm attenuation (Rain), and multi-vehicle collision (Crash).
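The three metrics of Eqs. (32)-(34) can be sketched as follows (NumPy; the array shapes are assumptions chosen for clarity).

```python
import numpy as np

def bandwidth_error(pred, opt):
    """Eq. (32): mean absolute bandwidth-prediction error over a (K, T) grid."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(opt))))

def global_consistency_index(local_grads, global_grads):
    """Eq. (33): GCI averages exp(-||grad_k^t - grad_G^t||^2) over agents and slots.

    local_grads has shape (K, T, D); global_grads has shape (T, D).
    """
    diff = np.asarray(local_grads) - np.asarray(global_grads)[None, :, :]
    return float(np.mean(np.exp(-np.sum(diff ** 2, axis=-1))))

def generalization_gap(loss_test, loss_train):
    """Eq. (34): mean test loss minus mean training loss across the K agents."""
    return float(np.mean(loss_test) - np.mean(loss_train))
```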

Comprehensive comparative analysis of accuracy

The data in Table 8 show that the resource demand prediction error of AFL-MADDPG (8.7 MHz) is reduced by 52.5% compared with FedAvg + DDPG and is close to that of the centralized method (5.2 MHz). The dynamic weighting mechanism improves global policy consistency (GCI = 0.89) by 25.4%, indicating that the local policies effectively converge toward the global optimum. Under unknown interference, the generalization loss (\(\Delta {\mathbb{C}}\) = 0.15) decreases by 53.1% compared with FedAvg + DDPG, verifying the environmental adaptability. The proposed method effectively isolates the gradient noise of channel-degraded nodes and reduces its negative impact on the global model. The server-side Critic evaluates the global state based on statistical features, avoiding the policy bias caused by full-state transmission21.

Table 8 Accuracy experiment results.
