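Query result: Wang Xuyang, Zhu Pengfei. Chinese Automatic Semantic Role Labeling Based on Fuzzy Mechanism and Semantic Density Clustering [J]. Computer Applications and Software, 2019, 36(9): 76-82, 92.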
Chinese title
基于模糊机制和语义密度聚类的汉语自动语义角色标注研究
Section
Data Engineering
Abstract views
500
English title
CHINESE AUTOMATIC SEMANTIC ROLE LABELING BASED ON FUZZY MECHANISM AND SEMANTIC DENSITY CLUSTERING
Authors
王旭阳 (Wang Xuyang), 朱鹏飞 (Zhu Pengfei)
Affiliation (Chinese)
兰州理工大学计算机与通信学院 甘肃 兰州 730050     
Affiliation (English)
School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, Gansu, China
Keywords (Chinese)
SRL 模糊机制 语义密度聚类 神经网络 词向量
Keywords
SRL Fuzzy mechanism Semantic density clustering Neural network Word embedding
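Funding
Author information
Wang Xuyang, professor. Main research areas: database theory and applications, data mining, and knowledge engineering. Zhu Pengfei, master's student.
Abstract (Chinese)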
基于CPB (Chinese Proposition Bank)提出一种基于LSTM-Bi-LSTM的汉语自动语义角色标注方法,并提出语义密度聚类进行数据预处理以及“模糊”机制利用于词向量转换过程。语义密度聚类通过密度的概念对谓词进行全局统一的聚类,将稀疏谓词替换为其所属聚类集合中的常见谓词;利用语义距离概念,将“模糊”机制引入词向量的转换过程,能适当地减少词向量的语义性,并提升与谓词词向量的相关性。利用Bi-LSTM网络自动学习特征表达,然后利用CRF和IOBES标注策略转化为词序列标注问题,引进一种词性学习方法;利用LSTM网络学习生成的词性特征向量与“模糊化”后的词向量融合后一同作为模型的输入向量;训练过程中采用了小批量梯度下降算法和Dropout正则化,这既加快了训练速度,又易于得到全局最优解,还防止了参数过拟合情况的出现。多组对比实验表明,该方法标注结果的F值最高达到了81.24%。
Abstract
Based on the Chinese Proposition Bank (CPB), this paper proposes an LSTM-Bi-LSTM method for automatic Chinese semantic role labeling (SRL). Semantic density clustering is introduced for data preprocessing, and a "fuzzy" mechanism is applied to the word vector transformation process. Semantic density clustering uses the concept of density to cluster predicates in a globally uniform way, and then replaces each sparse predicate with a common predicate from the cluster to which it belongs. Using the concept of semantic distance, the fuzzy mechanism is introduced into the word vector transformation, which moderately reduces the word vector's own semantics and increases its correlation with the predicate word vector. A Bi-LSTM network automatically learns feature representations, and a CRF with the IOBES tagging scheme then casts the task as a word sequence labeling problem. A part-of-speech learning method is also introduced: the part-of-speech feature vectors learned by an LSTM network are fused with the fuzzified word vectors, and the result serves as the model's input vector. Training uses mini-batch gradient descent and Dropout regularization, which speeds up training, makes it easier to reach a global optimum, and prevents parameter overfitting. Multiple groups of comparative experiments show that the method reaches an F-score of up to 81.24%.
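The abstract mentions that labeled arguments are turned into a word sequence labeling problem with the IOBES scheme. As a minimal sketch of that encoding only (not the authors' code), the snippet below converts labeled spans into per-word IOBES tags. The function name `spans_to_iobes`, the example sentence, and the role labels are illustrative assumptions; the predicate is simply left as `O` here.

```python
from typing import List, Tuple

# (start index, inclusive end index, role label); hypothetical span format
Span = Tuple[int, int, str]


def spans_to_iobes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping labeled spans as IOBES tags.

    B = begin, I = inside, E = end, S = single-word span, O = outside.
    """
    tags = ["O"] * n_tokens
    for start, end, role in spans:
        if not (0 <= start <= end < n_tokens):
            raise ValueError(f"span out of range: {(start, end, role)}")
        if any(t != "O" for t in tags[start:end + 1]):
            raise ValueError(f"overlapping span: {(start, end, role)}")
        if start == end:
            tags[start] = f"S-{role}"
        else:
            tags[start] = f"B-{role}"
            for i in range(start + 1, end):
                tags[i] = f"I-{role}"
            tags[end] = f"E-{role}"
    return tags


# Illustrative sentence (assumed, not taken from the paper's data):
# police / currently / in detail / investigate / accident / cause
tokens = ["警方", "正在", "详细", "调查", "事故", "原因"]
# PropBank-style role labels used only for illustration.
spans = [(0, 0, "A0"), (1, 1, "AM-TMP"), (2, 2, "AM-MNR"), (4, 5, "A1")]

tags = spans_to_iobes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With tags in this form, a sequence model such as the Bi-LSTM plus CRF described in the abstract can predict one tag per word, and argument spans are recovered by reading off each B...E run or S tag.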