[1] Yuan Man, Ouyang Yuanxin, Xiong Zhang, et al. Short text feature extension method based on frequent term sets[J]. Journal of Southeast University (Natural Science Edition), 2014, 44(2): 256-260. [doi:10.3969/j.issn.1001-0505.2014.02.006]

Short text feature extension method based on frequent term sets

Journal of Southeast University (Natural Science Edition) [ISSN: 1001-0505 / CN: 32-1178/N]

Volume:
44
Issue:
2014, No. 2
Pages:
256-260
Section:
Computer Science and Engineering
Publication date:
2014-03-20

Article Info

Title:
Short text feature extension method based on frequent term sets
Author(s):
Yuan Man, Ouyang Yuanxin, Xiong Zhang, Luo Jianhui
School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Shenzhen Research Institute, Beihang University, Shenzhen 518000, China
Keywords:
frequent term sets; short text classification; feature extension
CLC number:
TP391
DOI:
10.3969/j.issn.1001-0505.2014.02.006
Abstract:
To address the limited ability of the vector space model (VSM) to represent short text content, a feature extension method based on frequent term sets is proposed. Co-occurrence and class-orientation relations between terms are defined, and frequent term sets whose terms share the same class orientation are mined by computing the support and confidence of term sets; these sets serve as the background knowledge base for short text feature extension. For each original term in a short text, the frequent term sets containing that term are retrieved from the background knowledge base and added to the original feature vector as extended features. Experimental results on the Sogou corpus show that support and confidence strongly affect the size of the background knowledge base, but extending with too many features introduces redundancy and yields no further improvement in classification. A background knowledge base built from frequent term sets is an effective source of extended features; when the number of training documents is limited, feature extension significantly improves the classification results of the support vector machine (SVM).
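
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the definitions of support and confidence used here, the restriction to two-term sets, the choice to add the partner terms as the extended features, and the names mine_term_pairs and extend_features are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine_term_pairs(docs, labels, min_support=0.3, min_confidence=0.8):
    """Return {term pair: dominant class} for pairs that pass both thresholds.

    Assumed definitions (not taken from the paper):
      support(S)    = share of training documents containing every term in S
      confidence(S) = share of those documents that carry the dominant class
    """
    pair_class_counts = defaultdict(Counter)
    for terms, label in zip(docs, labels):
        # Count each co-occurring pair once per document, per class.
        for pair in combinations(sorted(set(terms)), 2):
            pair_class_counts[pair][label] += 1

    knowledge = {}
    for pair, class_counts in pair_class_counts.items():
        total = sum(class_counts.values())
        top_class, top_count = class_counts.most_common(1)[0]
        if total / len(docs) >= min_support and top_count / total >= min_confidence:
            knowledge[pair] = top_class
    return knowledge


def extend_features(terms, knowledge):
    """Add the partner terms of every mined pair that contains an original term."""
    original = set(terms)
    extended = list(terms)
    for pair in sorted(knowledge):
        if original & set(pair):
            for term in pair:
                if term not in extended:
                    extended.append(term)
    return extended


# Toy training data (hypothetical, already tokenized).
docs = [
    ["stock", "market", "rise"],
    ["stock", "market", "fall"],
    ["match", "goal", "team"],
    ["match", "goal", "win"],
]
labels = ["finance", "finance", "sports", "sports"]

knowledge = mine_term_pairs(docs, labels, min_support=0.5, min_confidence=1.0)
print(knowledge)  # {('market', 'stock'): 'finance', ('goal', 'match'): 'sports'}
print(extend_features(["stock", "price"], knowledge))  # ['stock', 'price', 'market']
```

Lowering min_support or min_confidence in this sketch admits more pairs into the knowledge base, which mirrors the reported sensitivity of the knowledge base size to these two parameters. The extended term list would then be vectorized and passed to a classifier such as an SVM.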

Memo

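Received: 2013-11-02.
About the authors: Yuan Man (born 1987), male, Ph.D. candidate; Ouyang Yuanxin (corresponding author), female, Ph.D., associate professor, oyyx@buaa.edu.cn.
Funding: National Natural Science Foundation of China (61103095); International Science and Technology Cooperation Program of China (2010DFB13350); National High Technology Research and Development Program of China (863 Program) (2011AA010502); Fundamental Research Funds for the Central Universities.
Cite this article: Yuan Man, Ouyang Yuanxin, Xiong Zhang, et al. Short text feature extension method based on frequent term sets[J]. Journal of Southeast University (Natural Science Edition), 2014, 44(2): 256-260. [doi:10.3969/j.issn.1001-0505.2014.02.006]
Last Update: 2014-03-20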