A novel image semantic communication method via dynamic decision generation network and generative adversarial network
Experimental setting
Experimental parameters
In this study, we implemented our framework in PyTorch33 and used a Tesla P40 GPU for both training and evaluation34. The training data are drawn from the CIFAR-10 collection, which comprises 50,000 training images and 10,000 test images, each 32 × 32 pixels. Model training was guided by the Adam optimizer, with \(\beta_1\) set to 0.9 and \(\beta_2\) set to 0.999. To prevent division-by-zero errors during the calculation, \(\epsilon\) is set to \(1\times10^{-8}\), as detailed in Ref.35. In the data preprocessing stage, the CIFAR-10 images were standardized and augmented: random horizontal flipping and cropping were applied with a 50% probability using reflection padding, pixel values were kept within the range [0, 1] and then normalized to [−1, 1]. These operations improve the training effectiveness and the generalization ability of the model (a minimal preprocessing sketch is given after the training-stage list below). The training plan is divided into three sequential stages:
- Step 1 Warm-up stage. The model converges quickly to a reasonably good state, with 300 epochs and a learning rate of \(5\times10^{-4}\).
- Step 2 Fine-tuning stage. The learning rate is then reduced so that the model parameters can be refined more delicately and data generalization is enhanced, with 300 epochs and a learning rate of \(5\times10^{-5}\).
- Step 3 Stabilization stage. The model's capability on specific tasks is further strengthened by freezing certain modules (the encoder Es and the decision network P) while fine-tuning the others for 200 epochs.
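For concreteness, the following is a minimal sketch of the preprocessing and augmentation pipeline described above, using standard torchvision transforms. The crop padding of 4 pixels and applying the random crop to every sample are assumptions not stated in the text:

```python
import torch
import torchvision
import torchvision.transforms as T

# Augmentation and normalization for CIFAR-10: random horizontal flip (p = 0.5),
# random crop with reflection padding (4-pixel padding is an assumed value),
# ToTensor() maps pixels to [0, 1], Normalize() then maps them to [-1, 1].
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(32, padding=4, padding_mode="reflect"),
    T.ToTensor(),
    T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=test_transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```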
Additionally, the batch size is set to 128, with the initial temperature parameter \(\tau\) set to 5 and then progressively decayed exponentially at a rate of −0.015. During training, we continuously optimize hyperparameters using an evolutionary hyperparameter method. Each input sample is assigned a random SNR drawn uniformly between 0 and 20 dB; in contrast, during the testing phase, all input samples are assigned the same SNR.
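The following sketch illustrates the optimizer configuration, the exponential temperature schedule, and the per-sample SNR draw described above. Applying the decay once per epoch and the placeholder model are assumptions:

```python
import math
import torch

# Placeholder network; in the paper this would be the D-JSCC encoder/decoder/decision network.
model = torch.nn.Linear(8, 8)

# Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-8; learning rate 5e-4 for the
# warm-up stage, later reduced to 5e-5 for the fine-tuning stage.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), eps=1e-8)

def temperature(epoch, tau0=5.0, rate=-0.015):
    # Exponential decay of the temperature: tau = tau0 * exp(rate * epoch).
    # Applying the decay per epoch (rather than per iteration) is an assumption.
    return tau0 * math.exp(rate * epoch)

def sample_train_snr(batch_size):
    # Training: each sample is assigned an SNR drawn uniformly from [0, 20] dB.
    return torch.empty(batch_size).uniform_(0.0, 20.0)
```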
Metrics
To assess the efficacy of the proposed image semantic communication scheme, we conducted simulation tests evaluating the fidelity of the reconstructed images and the efficiency of data compression during transmission. The evaluation compares our scheme against several benchmarks, including JSCC-No Discriminator (our proposed JSCC without the discriminator), JSCC-MSE (deep JSCC optimized for MSE), BPG + LDPC, and BPG + Capacity. The key performance metrics are PSNR, SSIM, LPIPS, and CR, detailed as follows:
$$\begin{array}{c}PSNR\left(\mathbf{x},\widehat{\mathbf{x}}\right)=10\,{\log}_{10}\left(\frac{255^{2}}{\mathrm{MSE}\left(\mathbf{x},\widehat{\mathbf{x}}\right)}\right),\end{array}$$
(4)
$$\begin{array}{c}SSIM\left(\mathbf{x},\widehat{\mathbf{x}}\right)=\frac{\left(2\mu_{\mathbf{x}}\mu_{\widehat{\mathbf{x}}}+C_{1}\right)\left(2\sigma_{\mathbf{x}\widehat{\mathbf{x}}}+C_{2}\right)}{\left(\mu_{\mathbf{x}}^{2}+\mu_{\widehat{\mathbf{x}}}^{2}+C_{1}\right)\left(\sigma_{\mathbf{x}}^{2}+\sigma_{\widehat{\mathbf{x}}}^{2}+C_{2}\right)},\end{array}$$
(5)
where \(\mu_{\mathbf{x}}\) and \(\sigma_{\mathbf{x}}^{2}\) are the mean and variance of \(\mathbf{x}\), respectively (with \(\mu_{\widehat{\mathbf{x}}}\) and \(\sigma_{\widehat{\mathbf{x}}}^{2}\) defined analogously for \(\widehat{\mathbf{x}}\)), and \(\sigma_{\mathbf{x}\widehat{\mathbf{x}}}\) is the covariance between \(\mathbf{x}\) and \(\widehat{\mathbf{x}}\). \(C_{1}\) and \(C_{2}\) are stability constants that prevent the denominator from being 0.
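A direct NumPy transcription of Eqs. (4) and (5) is sketched below. Note that it evaluates Eq. (5) with global image statistics exactly as written, whereas common SSIM implementations average over local windows, and the values of \(C_1\) and \(C_2\) (the usual 0.01/0.03 convention) are assumptions not stated in the text:

```python
import numpy as np

def psnr(x, x_hat):
    # Eq. (4): PSNR in dB for 8-bit images (peak value 255).
    mse = np.mean((x.astype(np.float64) - x_hat.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def ssim_global(x, x_hat, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # Eq. (5) evaluated with global means, variances, and covariance.
    x = x.astype(np.float64)
    y = x_hat.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```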
LPIPS measures perceptual similarity by comparing deep features extracted from the two images. When used as a perceptual loss, it compels the generator to preserve the essential features and details of the original images during image generation. A lower LPIPS score reflects higher similarity between two images, whereas a higher score denotes greater dissimilarity.
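One way to compute LPIPS in practice is via the publicly available lpips package, as in the minimal sketch below; the choice of the AlexNet backbone is an assumption, since the text does not specify one:

```python
import torch
import lpips

# LPIPS compares deep features of the two images; lower scores mean more similar.
loss_fn = lpips.LPIPS(net='alex')  # 'alex' backbone is an assumed choice

x = torch.rand(1, 3, 32, 32) * 2 - 1      # original image scaled to [-1, 1]
x_hat = torch.rand(1, 3, 32, 32) * 2 - 1  # reconstructed image scaled to [-1, 1]

score = loss_fn(x, x_hat)
print(score.item())
```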
$$\begin{array}{c}CR=\frac{\mathbf{s}-\widehat{\mathbf{s}}}{\mathbf{s}}\times100\%,\end{array}$$
(6)
where, in this paper, \(\mathbf{s}\) is the data amount of the source image and \(\widehat{\mathbf{s}}\) is the data amount transmitted by the deep-learning JSCC. The CR is an important indicator of the degree of compression achieved when transmitting images in image semantic communication.
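Eq. (6) is a one-line computation once \(\mathbf{s}\) and \(\widehat{\mathbf{s}}\) are measured in the same units; the unit choice and the numbers in the example below are illustrative assumptions, not values taken from the paper:

```python
def compression_ratio(s, s_hat):
    # Eq. (6): percentage reduction of the transmitted data amount relative to the source.
    return (s - s_hat) / s * 100.0

# Example: a 32x32 RGB source (32*32*3 = 3072 values) reduced to 190 transmitted
# symbols gives a CR of roughly 93.8% (illustrative only).
print(compression_ratio(3072, 190))
```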
Results analysis
Two prevalent channel models were examined: the AWGN channel and the slow-fading channel. The AWGN channel's transfer function is defined as \(\widehat{\mathbf{s}}=W\left(\mathbf{s}\right)=\mathbf{s}+\mathbf{n}\), where each component of the noise \(\mathbf{n}\) follows an independent and identically distributed Gaussian distribution, \(\mathbf{n}\sim\mathcal{CN}(0,\sigma^{2}\mathbf{I}_{k})\), with \(\sigma^{2}\) the average noise power. For the slow-fading channel, we adopt the Rayleigh fading model, characterized by the transfer function \(\widehat{\mathbf{s}}=W\left(\mathbf{s}\right)=\mathbf{h}\mathbf{s}\), where \(\mathbf{h}\sim\mathcal{CN}(0,\mathbf{H}_{\mathbf{c}})\) is a complex Gaussian random variable and \(\mathbf{H}_{\mathbf{c}}\) is the covariance matrix. The real and imaginary parts of the gain \(\mathbf{h}\) are independent Gaussian random variables, each with mean 0 and variance 1/2. The SNR in dB is converted into a linear SNR to compute the noise standard deviation, and the noise signal is then normalized. In the Rayleigh fading experiment, we used 10 time steps to simulate the variation of \(\mathbf{h}\), reflecting the time-varying characteristics of the channel.
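A sketch of the two channel models under the stated conventions follows. It assumes the transmitted symbols are normalized to unit average power, and adding AWGN after the multiplicative Rayleigh gain is an assumption, since the text gives only the transfer function \(\widehat{\mathbf{s}}=\mathbf{h}\mathbf{s}\) and separately describes deriving the noise standard deviation from the linear SNR:

```python
import torch

def awgn_channel(s, snr_db):
    # s: complex channel symbols, assumed normalized to unit average power.
    snr_linear = 10.0 ** (snr_db / 10.0)
    sigma = (1.0 / snr_linear) ** 0.5  # noise power sigma^2 = 1 / SNR_linear
    noise = (sigma / 2 ** 0.5) * torch.complex(torch.randn_like(s.real),
                                               torch.randn_like(s.real))
    return s + noise

def rayleigh_channel(s, snr_db):
    # h ~ CN(0, 1): real and imaginary parts are independent N(0, 1/2).
    # A new h can be drawn at each of the 10 time steps to mimic time variation.
    h = torch.complex(torch.randn(1), torch.randn(1)) / 2 ** 0.5
    return awgn_channel(h * s, snr_db)

# Example usage with a dummy block of unit-power complex symbols:
s = torch.complex(torch.randn(16), torch.randn(16)) / 2 ** 0.5
print(awgn_channel(s, snr_db=10.0).shape, rayleigh_channel(s, snr_db=10.0).shape)
```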
Initially, the efficacy of our proposed scheme, "D-JSCC", was evaluated on the AWGN channel and compared with JSCC-No Discriminator, JSCC-MSE, BPG + LDPC, and BPG + Capacity, with the channel utilization rate fixed at 0.5. JSCC-No Discriminator is D-JSCC without the GAN. For JSCC-MSE, the method described in Ref.6 was adopted, and we set \(\mathrm{SNR}_{\mathrm{train}}=\mathrm{SNR}_{\mathrm{test}}\). For BPG + LDPC, the BPG36 image codec is used for source encoding, combined with LDPC37 channel encoding. We also provide the envelope of the best-performing combination of LDPC coding rate and modulation at each SNR under ideal capacity, labeled BPG + Capacity. In Fig. 5, we report the evaluation performance on three metrics, SSIM, PSNR, and LPIPS, as well as the average number of active channels (PCS) over all images generated by D-JSCC. At 3 dB SNR, the SSIM of D-JSCC reaches 0.925, the best among all comparison schemes. Its PSNR matches that of BPG + Capacity, surpassing most previous JSCC schemes. As the SNR increases, both the SSIM and the PSNR of D-JSCC remain outstanding. These results indicate that our approach achieves comparable or even better performance in terms of pixel-level distortion. Interestingly, the LPIPS of D-JSCC stays consistently below 0.02 over the SNRs shown in Fig. 5; a smaller LPIPS means higher similarity between the original and generated images. This is consistent with our design, in which the proposed generative model delivers higher perceptual quality.
Figure 5d illustrates that at higher channel SNRs, our scheme activates only a few key channels to extract effective feature information. Meanwhile, Fig. 5e shows the correspondence between CR and SNR in D-JSCC. At a given SNR, the number of active channels and the CR behave as follows.
At low SNRs, DDGN tends to activate more channels, thereby reducing the CR; conversely, at high SNRs, DDGN tends to activate fewer channels, thereby increasing the CR. For example, when SNR = 3 dB, the number of active channels is 7.5 PCS and CR \(\approx\) 81.5%; when SNR = 20 dB, the number of active channels is 5 PCS and CR \(\approx\) 93.8%.
We also conducted the same measurements on a Rayleigh fading channel. As shown in Fig. 6, D-JSCC is again trained with the SNR sampled uniformly between 0 and 20 dB, and the designed model exhibits strong robustness against channel interference, as detailed below.
At 3 dB SNR, the PSNR and SSIM of D-JSCC still outperform the other schemes considered in this paper, and its LPIPS remains consistently below 0.02 over the SNRs shown in Fig. 6. When SNR = 3 dB, the number of active channels is 7.6 PCS and CR \(\approx\) 80.5%. This reflects the difference between Rayleigh interference and AWGN interference: at the same SNR, the Rayleigh channel introduces more interference than the AWGN channel, resulting in more active channels and a lower CR. At high SNRs, the impact of the noise type on communication is minimal and can largely be ignored; for example, when SNR = 20 dB, the number of active channels is 6.8 PCS and CR \(\approx\) 93.8%, almost the same as in Fig. 5.
The performance comparison between JSCC-No Discriminator and JSCC-MSE in Figs. 5 and 6 demonstrates the impact of ablating the discriminator and the perceptual loss. In the configuration without the discriminator (JSCC-No Discriminator), PSNR and SSIM are worse than those of D-JSCC at low SNR, demonstrating the effectiveness of the GAN in reducing image differences. Similarly, the LPIPS is high at low SNR, indicating that the perceptual loss plays an important role in enhancing the human-perceived quality of the reconstructed images. Therefore, our proposed D-JSCC system can achieve high-quality image semantic transmission under low channel SNR.