Content extraction: Identifying the main content in HTML Documents

Mainz: Univ., 2008, 252 pp.

Year of publication: 2008

Publication type: Book (dissertation)

Language: English

DOI/URN: urn:nbn:de:hebis:77-18591

Full text via DOI/URN

Verified by the library

Abstract


Most HTML documents on the World Wide Web comprise far more than the article or text that forms their main content. Navigation menus, functional and design elements, and commercial banners are typical examples of additional content found alongside the main text. In web data mining applications, and in technical solutions that improve accessibility via screen readers or small-screen devices, the distinction between main and additional content has to be drawn automatically. Solutions for determining the main content of a web document fall into two categories: content extraction and template detection. Content extraction solutions operate on single documents and are based on heuristics. Template detection algorithms instead analyse a collection of several training documents to determine a common template structure and use this knowledge to find the main content. This thesis gives an extensive overview of existing techniques and algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures make it possible to draw the first objective comparison of existing extraction solutions. The comparison also reveals typical problems of these solutions. The newly introduced content code blurring extraction filter overcomes at least some of these problems and proves to be the best content extraction algorithm at the moment.
An analysis of methods to cluster web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create high-quality sets of training documents for template detection algorithms. Because the whole process can be automated, it essentially makes it possible to perform template detection on a single document, thereby combining the advantages of single-document and multi-document algorithms: the independence from a manually created training set of the former with the better theoretical underpinning of the latter.
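
The idea behind template detection, as the abstract describes it, is to learn from several documents what they share and to treat the remainder as main content. The following is only a minimal sketch of that idea and not the thesis's algorithm: it assumes that a text block repeated verbatim in more than half of the training documents belongs to the template. The names `text_blocks`, `detect_template_blocks`, `extract_main_content` and the parameter `min_share` are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collects non-empty text blocks, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def text_blocks(html):
    """Return the visible text blocks of an HTML string in document order."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def detect_template_blocks(training_docs, min_share=0.5):
    """Assumption: blocks found in more than min_share of the documents are template."""
    counts = Counter()
    for html in training_docs:
        counts.update(set(text_blocks(html)))  # count each block once per document
    threshold = min_share * len(training_docs)
    return {block for block, n in counts.items() if n > threshold}


def extract_main_content(html, template_blocks):
    """Keep only the blocks that were not identified as template."""
    return [b for b in text_blocks(html) if b not in template_blocks]


# Three made-up pages that share a navigation menu and a footer.
page = ("<html><body><ul><li>Home</li><li>About</li></ul>"
        "<p>{}</p><div>Copyright</div></body></html>")
docs = [page.format(t) for t in
        ("Article one text.", "Article two text.", "Article three text.")]

template = detect_template_blocks(docs)
print(sorted(template))
print(extract_main_content(docs[0], template))
```

Exact string matching is the simplest possible notion of a shared template. It would miss template parts that vary slightly between pages, which is one reason a well-chosen set of training documents matters for approaches of this kind.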

Classification


DFG subject area:
Computer science

DDC subject group:
Computer science

Associated persons


Thomas Gottron