Query result: Qin Chenglei, Wei Xiao, Yang Yang. A statistics-based complex web text extraction method [J]. Computer Applications and Software, 2015, 32(7): 90-92, 147.
Chinese title
一种基于统计的复杂页面正文提取方法
Column
Applied Technology and Research
Abstract views
1066
English title
A STATISTICS-BASED COMPLEX WEB TEXT EXTRACTION METHOD
Authors
Qin Chenglei (秦成磊), Wei Xiao (魏晓), Yang Yang (杨阳)
Affiliation (Chinese)
上海应用技术学院计算机科学与信息工程学院 上海 201418
Affiliation (English)
School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China
Keywords
Complex web pages; text extraction; statistics; common subsequence; optimal text-length threshold
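Funding
Author information
Qin Chenglei, master's student; main research areas: Web intelligent information processing, semantic mining. Wei Xiao, associate professor. Yang Yang, master's student.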
Abstract
With the development of information technology, web pages have become increasingly complex and diverse, and conventional methods for extracting the main text of a page suffer from low efficiency and accuracy. To address this, we propose a statistics-based text extraction algorithm. Guided by the features of HTML tags, the algorithm extracts the filtered text between each pair of ">" and "<", records the length of each text segment, and orders the segments by matching sequence. It then delimits a range of text line numbers according to an optimal text-length threshold, and finally uses common subsequences to refine the result and complete the extraction. Experimental results show that the method extracts the main text of complex pages accurately and efficiently and generalizes well.
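The first steps named in the abstract (collecting the text between each ">" and "<", measuring its length in matching order, and delimiting a line-number range with a length threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the script/style/comment filter, the default threshold of 20 characters, and the rule "keep everything from the first to the last long segment" are all assumptions made for illustration. The abstract does not say how the optimal threshold is chosen or how common subsequences refine the result, so neither is sketched here.

```python
import re

# Assumed filter (not from the paper): drop script/style blocks and comments.
_NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
# Text between a ">" and the next "<".
_SEGMENT = re.compile(r">([^<>]+)<")


def text_segments(html):
    """Return (line_no, text) pairs, in match order, for non-empty text spans."""
    cleaned = _NOISE.sub("", html)
    segments = []
    for match in _SEGMENT.finditer(cleaned):
        text = match.group(1).strip()
        if text:
            segments.append((len(segments), text))
    return segments


def body_range(segments, threshold):
    """Return (first, last) line numbers of segments at least `threshold` long."""
    long_lines = [n for n, text in segments if len(text) >= threshold]
    if not long_lines:
        return None
    return long_lines[0], long_lines[-1]


def extract_body(html, threshold=20):
    """Join every segment inside the delimited range (hypothetical rule)."""
    segments = text_segments(html)
    span = body_range(segments, threshold)
    if span is None:
        return ""
    first, last = span
    return "\n".join(text for n, text in segments if first <= n <= last)
```

Under these assumptions, short navigation links and footer text fall outside the delimited range, while a short paragraph sitting between two long ones is kept because its line number lies inside the range.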