    Textual features for corpus visualization using correspondence analysis
    Abstract: Exploratory data analysis in text mining relies on effective visualization techniques that can expose hidden relationships among documents and reveal correspondences between documents and their features. In text mining, documents are most often represented by very high-dimensional feature vectors, so dimensionality reduction is required to obtain visual projections in two- or three-dimensional space. Correspondence analysis is an unsupervised approach that allows for the construction of a low-dimensional projection space in which documents and features are placed simultaneously, making it ideal for exploratory analysis in text mining. Its use so far, however, has been limited to word-based features. In this paper, we investigate how this document representation compares to representations based on letter n-grams and word n-grams, and find that these alternative representations yield better results in separating documents of different classes. We perform our experimental analysis on a bilingual Croatian-English parallel corpus, which allows us to additionally explore how features in different languages affect the quality of the visualizations.
    Keywords: Text mining, Text visualization, Letter n-grams, Word n-grams, Correspondence analysis
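
    The pipeline summarized in the abstract (letter or word n-gram counts, followed by correspondence analysis that places documents and features in the same low-dimensional space) can be illustrated with a minimal sketch. This is not the implementation used in the paper. It assumes the textbook formulation of correspondence analysis via a singular value decomposition of standardized residuals, uses NumPy, and runs on four invented toy sentences. All function names, the choice of n, and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def letter_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    s = " ".join(text.lower().split())  # collapse runs of whitespace
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def word_ngrams(text, n=2):
    """Overlapping word n-grams, joined with underscores."""
    words = text.lower().split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_matrix(docs, featurize):
    """Document-by-feature contingency table of raw counts."""
    counts = [Counter(featurize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    table = np.array([[c[f] for f in vocab] for c in counts], dtype=float)
    return table, vocab


def correspondence_analysis(table, k=2):
    """Principal coordinates of rows (documents) and columns (features).

    Assumes every row and every column of `table` has a positive sum.
    """
    P = table / table.sum()   # correspondence matrix
    r = P.sum(axis=1)         # row masses
    c = P.sum(axis=0)         # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]
    return rows, cols, sigma[:k]


# Invented toy documents (not from the corpus used in the paper).
docs = [
    "the court ruled on the law",
    "the law was passed by the court",
    "the team won the match",
    "the match was won by the team",
]

table, vocab = count_matrix(docs, letter_ngrams)
rows, cols, sigma = correspondence_analysis(table, k=2)
# rows[i] is the 2-D position of document i; cols[j] is the position of
# feature vocab[j] in the same projection space.
```

    Passing `word_ngrams` (or a word unigram function) to `count_matrix` instead of `letter_ngrams` gives the alternative representations that the abstract compares. The resulting `rows` and `cols` arrays can be drawn in a single scatter plot.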
