Detection and Classification of Similar Pages Based on Identical Sentence Extraction (同一文抽出に基づく類似ページの検出と分類)

柴田知秀; 姜ナウン; 黒橋禎夫

J-GLOBAL ID: 201002240508220999  Reference number: 10A0089012

Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection

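Volume: 25  Issue: 1  Pages: 224-232 (J-STAGE)  Year of publication: 2010
JST material number: U0128A  ISSN: 1346-8030  Material type: Serial publication (A)
Article type: Original paper  Country of publication: Japan (JPN)  Language: Japanese (JA)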

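Near-duplicate Web pages fall into three types: 1) pages whose sentences are completely identical, 2) pages where one is contained in the other, and 3) pages that share a part. This paper proposes a method for detecting near-duplicate pages in a large-scale Web collection of 100 million pages. The method proceeds in three steps: i) rule-based extraction of content regions, ii) extraction of long sentences that occur with low frequency across the whole Web, and iii) classification of page pairs that share such sentences. In i), each page is converted into a DOM (Document Object Model) tree and divided into blocks; nodes at the same depth whose HTML (Hypertext Markup Language) tags match are concatenated, and a block is treated as a non-content region when its proportion of links is at or above a threshold and the character count of its longest sentence is at or below a threshold. In iii), page pairs are first classified into 1) to 3) using the sentence overlap ratio and containment ratio between the pages, and are then divided, using URL (Uniform Resource Locator) similarity and the links between the two pages, into a) mirror pages, b) intra-site containing pages, c) intra-site related pages, d) spam pages, e) linked-to pages, f) citing/cited pages, g) plagiarizing/plagiarized pages, and h) pages sharing a set of sentences. In the experiments, a), b), c), and h) were classified with high accuracy, while the other categories reached roughly 40-80%.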
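Steps i) and iii) of the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no threshold values or exact formulas, so the thresholds, the character-based link ratio, the Jaccard-style overlap ratio, the `Block` type, and all function names below are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds; the abstract does not state the actual values.
LINK_RATIO_MIN = 0.5
MAX_SENTENCE_LEN = 20


@dataclass
class Block:
    """A page block after DOM-based splitting (illustrative fields only)."""
    text_chars: int        # total characters in the block
    link_chars: int        # characters that appear inside link anchors
    longest_sentence: int  # character length of the longest sentence


def is_non_content(block: Block) -> bool:
    """Step i): a link-heavy block with only short sentences is non-content."""
    if block.text_chars == 0:
        return True  # assumption: an empty block carries no content
    link_ratio = block.link_chars / block.text_chars  # assumed definition
    return (link_ratio >= LINK_RATIO_MIN
            and block.longest_sentence <= MAX_SENTENCE_LEN)


def overlap_and_containment(a: set[str], b: set[str]) -> tuple[float, float]:
    """Return (overlap ratio, containment ratio) for two sentence sets.

    Assumed definitions: overlap = |A & B| / |A | B|,
    containment = |A & B| / min(|A|, |B|).
    """
    if not a or not b:
        raise ValueError("both pages need at least one sentence")
    shared = len(a & b)
    return shared / len(a | b), shared / min(len(a), len(b))


def classify_pair(a: set[str], b: set[str],
                  overlap_min: float = 1.0,
                  containment_min: float = 1.0) -> str:
    """Step iii), first stage: assign a page pair to types 1) to 3)."""
    overlap, containment = overlap_and_containment(a, b)
    if overlap == 0.0:
        return "none"       # no shared sentence, so not a candidate pair
    if overlap >= overlap_min:
        return "identical"  # type 1): the sentence sets coincide
    if containment >= containment_min:
        return "contained"  # type 2): one page is contained in the other
    return "partial"        # type 3): the pages share only a part


if __name__ == "__main__":
    print(is_non_content(Block(text_chars=100, link_chars=80, longest_sentence=10)))
    print(classify_pair({"s1", "s2"}, {"s1", "s2", "s3"}))
```

The default thresholds of 1.0 follow the literal wording of types 1) and 2); lowering them would tolerate small differences between pages. The second stage, which uses URL similarity and inter-page links to reach categories a) to h), is not sketched because the abstract does not describe its rules.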

Other information processing, Search technology

References (13):

[BarYossef 07] BarYossef, Z., Keidar, I., and Schonfeld, U.: Do Not Crawl in the DUST: Different URLs with Similar Text, in Proceedings of WWW2007, pp. 111--120 (2007)
[Broder 93] Broder, A. Z.: Some applications of Rabin's fingerprinting method, in Sequences II: Methods in Communications, Security, and Computer Science, pp. 143--152 (1993)
[Broder 97] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G.: Syntactic clustering of the Web, in Proceedings of the 6th International Conference on World Wide Web, pp. 1157--1166 (1997)
[Charikar 02] Charikar, M. S.: Similarity estimation techniques from rounding algorithms, in STOC '02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380--388 (2002)
[Henzinger 06] Henzinger, M.: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms, in Proceedings of 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284--291 (2006)
