An Unsupervised Extractive Summarization Method for Long Documents Based on TextRank and Self-Attention

  • Abstract: For automatic summarization of long Chinese documents, two models that integrate self-attention with TextRank, TRAI and TRAO, are proposed. TRAI computes a weighted sum of sentence similarity based on the number of co-occurring words and sentence relevance derived from self-attention, and uses this sum as the edge weight in TextRank's iterative computation to score sentences. TRAO first scores sentences with standard TextRank; it then uses self-attention to re-encode each sentence as a distributed vector that incorporates information from the entire document, computes the cosine similarity between these sentence vectors, and uses it as the edge weight in a second TextRank iteration to produce a second set of scores; the weighted sum of the two scores is the final score of each sentence. Both models rank sentences by score to obtain a candidate summary. To remove redundancy, the maximal marginal relevance (MMR) method is used to select summary sentences from the candidates. The two proposed models were evaluated on a constructed long-document dataset; compared with the TextRank baseline, both achieve significant improvements on the ROUGE evaluation metrics.
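The pipeline described above, TextRank iteration over a sentence graph with externally supplied edge weights, followed by MMR selection of the final summary, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names (`textrank_scores`, `mmr_select`), the damping factor `d`, and the trade-off parameter `lam` are illustrative assumptions, and the edge-weight matrix is assumed to be precomputed (e.g. as the weighted sum of co-occurrence similarity and self-attention relevance used by TRAI).

```python
import numpy as np

def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=100):
    """Score sentences by power iteration over a weighted similarity graph.

    weights: (n, n) symmetric matrix of edge weights between sentences.
    Returns a 1-D array of TextRank scores.
    """
    n = weights.shape[0]
    W = weights.astype(float).copy()
    np.fill_diagonal(W, 0.0)            # no self-loops
    out = W.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                 # guard isolated sentences
    P = W / out                         # row-normalized transition weights
    scores = np.ones(n) / n
    for _ in range(max_iter):
        new = (1 - d) + d * (P.T @ scores)   # classic TextRank update
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new
    return scores

def mmr_select(scores, sim, k, lam=0.7):
    """Greedily pick k sentences, trading score against redundancy (MMR)."""
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lam * scores[i]
            - (1 - lam) * max((sim[i][j] for j in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)             # restore document order
```

A usage pass would build `sim` once from the chosen similarity measure, score with `textrank_scores(sim)`, and then call `mmr_select(scores, sim, k)` to extract the top `k` non-redundant sentences.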

     
