Clustering template based web documents

MacDonald, Craig (Ed.). Advances in information retrieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30 - April 3, 2008; proceedings. Berlin et al.: Springer, 2008, pp. 40-51

Year of publication: 2008

ISBN/ISSN: 978-3-540-78645-0 ; 3-540-78645-7

Publication type: Book chapter (conference paper)

Language: English

Verified by: Library

Abstract


More and more documents on the World Wide Web are based on templates. On a technical level, this causes those documents to have quite similar source code and DOM tree structure. Grouping together documents which are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining those distance measures with different approaches for clustering, we show which combination of methods leads to the desired result.

Classification


DFG subject area:
Computer science

DDC subject group:
Computer science

Associated persons


Thomas Gottron