A comprehensive study of techniques for URL-based web page language classification

Baykan, Eda; Weber, Ingmar; Henzinger, Monika H

Baykan E, Weber I, Henzinger M. 2013. A comprehensive study of techniques for URL-based web page language classification. ACM Transactions on the Web. 7(1), 3.

DOI

10.1145/2435215.2435218

Journal Article | Published | English

Scopus indexed

Author

Baykan, Eda; Weber, Ingmar; Henzinger, Monika (ISTA)

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification and state-of-the-art algorithms for language identification of text. As features we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best performing classifiers, we obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract, and in the second, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
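
The abstract names a country code top-level domain baseline and word and n-gram features without giving details. The following Python sketch illustrates what such a baseline and such features could look like. It is not the authors' implementation: the ccTLD-to-language table, the tokenization rule, and the function names cctld_baseline, url_tokens, and char_ngrams are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately small mapping. The table actually used in the
# article is not given in this record.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_baseline(url):
    """Guess a language from the country code top-level domain, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANGUAGE.get(tld)


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (a 'words' feature).

    Assumption: any run of characters outside a-z acts as a separator.
    """
    return [t for t in re.split(r"[^a-z]+", url.lower()) if t]


def char_ngrams(token, n):
    """Return every character n-gram of a token, in order."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


if __name__ == "__main__":
    url = "http://www.wetter.de/vorhersage"
    print(cctld_baseline(url))
    print(url_tokens(url))
    print(char_ngrams("wetter", 3))
```

A generic domain such as .com yields None from the baseline in this sketch, which is the kind of URL for which token and n-gram features would have to carry the decision.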

Keywords

Computer Networks and Communications

Publishing Year

2013

Date Published

2013-03-01

Journal Title

ACM Transactions on the Web

Publisher

Association for Computing Machinery

Volume

7

Issue

1

Article Number

3

ISSN

1559-1131

eISSN

1559-114X

IST-REx-ID

11671

Cite this

Baykan E, Weber I, Henzinger M. A comprehensive study of techniques for URL-based web page language classification. ACM Transactions on the Web. 2013;7(1). doi:10.1145/2435215.2435218

Baykan, E., Weber, I., & Henzinger, M. (2013). A comprehensive study of techniques for URL-based web page language classification. ACM Transactions on the Web. Association for Computing Machinery. https://doi.org/10.1145/2435215.2435218

Baykan, Eda, Ingmar Weber, and Monika Henzinger. “A Comprehensive Study of Techniques for URL-Based Web Page Language Classification.” ACM Transactions on the Web. Association for Computing Machinery, 2013. https://doi.org/10.1145/2435215.2435218.

E. Baykan, I. Weber, and M. Henzinger, “A comprehensive study of techniques for URL-based web page language classification,” ACM Transactions on the Web, vol. 7, no. 1. Association for Computing Machinery, 2013.

Baykan, Eda, et al. “A Comprehensive Study of Techniques for URL-Based Web Page Language Classification.” ACM Transactions on the Web, vol. 7, no. 1, 3, Association for Computing Machinery, 2013, doi:10.1145/2435215.2435218.
