Long Text Feature Extraction Network with Data Augmentation
Abstract
The spread of COVID-19 has had a serious impact on both the work and the lives of people. With the decrease in physical social contact and the rise of anxiety about the pandemic, social media has become the primary channel through which people access information related to COVID-19. Social media is rife with rumors and fake news, which cause great damage to society. Existing Chinese datasets related to the epidemic are small, imbalanced, and noisy, which has hindered the detection of fake news. In addition, classification accuracy suffers because the edge characteristics of long text data are easily lost. In this paper, we propose a long text feature extraction network with data augmentation (LTFE), which improves the learning performance of the classifier by optimizing the data feature structure. In the encoding stage, Twice-Masked Language Modeling for Fine-tuning (TMLM-F) and Data Alignment that Preserves Edge Characteristics (DA-PEC) are proposed to extract the classification features of the Chinese dataset. Between the TMLM-F and DA-PEC steps, we use an attention mechanism to capture the dependencies between words and generate the corresponding vector representations. The experimental results show that this method is effective for detecting Chinese fake news related to the pandemic.
Contributions
Twice-Masked Language Modeling for Fine-tuning (abbreviated as TMLM-F). TMLM-F is a text masking learning method for fine-tuning tasks. Unlike the MLM objective of BERT, which is applied during pre-training, TMLM-F adds artificial symbols such as [MASK] to the real data at fine-tuning time. This eliminates the pretrain-finetune discrepancy. To reduce the chance that key characteristics are masked out, random masking is carried out twice on the same data. This not only increases the amount of data but also paves the way for the subsequent vector reconstruction.
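The double masking step can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function names `mask_once` and `twice_masked`, the token-level granularity, and the masking probability of 0.15 (borrowed from BERT's pre-training setup) are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # artificial symbol inserted at fine-tuning time


def mask_once(tokens, mask_prob, rng):
    """Return a copy of tokens in which each token is replaced by [MASK]
    independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def twice_masked(tokens, mask_prob=0.15, seed=0):
    """Mask the same token sequence twice with independent random masks.

    mask_prob=0.15 is an assumed value, not one stated in the paper.
    Returns two masked views of the same sample.
    """
    rng = random.Random(seed)
    view_a = mask_once(tokens, mask_prob, rng)
    view_b = mask_once(tokens, mask_prob, rng)
    return view_a, view_b


# Hypothetical example: two masked views of one tokenized sentence.
view_a, view_b = twice_masked(["this", "news", "is", "about", "the", "pandemic"])
```

Because the two masks are drawn independently, a token hidden in one view has a good chance of being visible in the other, which is the intuition behind masking the same data twice.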
Data Alignment that Preserves Edge Characteristics (abbreviated as DA-PEC). DA-PEC is a data reconstruction method for long texts. Unlike traditional data alignment methods, it performs a global reconstruction of the two vectors produced by TMLM-F. The concatenated vectors are then cut into segments of uniform length, and incomplete segments are discarded. In this way, the edge characteristics of the data feature vectors, which would otherwise be easily clipped, are retained, and the accuracy of feature learning is improved.
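A rough sketch of the cut-and-discard step follows, assuming that the global reconstruction amounts to concatenating the two TMLM-F views end to end. The function name `align_preserving_edges` and the use of plain Python lists in place of feature vectors are hypothetical choices for illustration, not details given in this paper.

```python
def align_preserving_edges(view_a, view_b, length):
    """Concatenate two views and cut the result into fixed-length segments.

    Assumption: "global reconstruction" is modeled here as simple
    concatenation. Any trailing segment shorter than `length` is discarded.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    joined = list(view_a) + list(view_b)
    n_full = len(joined) // length  # number of complete segments
    return [joined[i * length:(i + 1) * length] for i in range(n_full)]


# Hypothetical example: two views of a 5-element sequence, segment length 4.
segments = align_preserving_edges([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 4)
print(segments)  # [[0, 1, 2, 3], [4, 0, 1, 2]]
```

In this toy case, truncating a single view to length 4 would drop the final element 4, whereas here it survives at the start of the second segment; only the incomplete tail of the concatenated sequence is discarded.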
Combination novelty. This paper combines several concepts and methods in the context of fake news detection to form a new combination. It mainly covers the following elements: COVID-19, fake news, social media, data augmentation, and long text data reconstruction. This paper analyzes the characteristics of Chinese news on social media, identifies the existing problems, and addresses them. The dataset is optimized through data augmentation to improve the learning efficiency of the neural network model.