A comparison of techniques for sampling web pages
Baykan Eda, Henzinger M, Keller SF, de Castelberg S, Kinzler M. 2009. A comparison of techniques for sampling web pages. 26th International Symposium on Theoretical Aspects of Computer Science. STACS: Symposium on Theoretical Aspects of Computer Science, LIPIcs, vol. 3, 13–30.
          
        
            
            
            Conference Paper | Published | English
              
            
          
        Scopus indexed
Author
        
      Baykan, Eda;
      Henzinger, Monika (ISTA);
      Keller, Stefan F.;
      de Castelberg, Sebastian;
      Kinzler, Markus
Series Title
    
    LIPIcs
Abstract
    As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine the properties of the web. A uniform random sample of the web would be useful for determining the percentage of web pages in a specific language, on a specific topic, or in a specific top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but a comparison across different data sets is not possible. We directly compare these algorithms in this paper. We performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on the experimental results.
    
  Publishing Year
    2009
  Date Published
    2009-02-01
  Proceedings Title
    26th International Symposium on Theoretical Aspects of Computer Science
  Publisher
    Schloss Dagstuhl - Leibniz-Zentrum für Informatik
  Volume
      3
    Page
      13-30
    Conference
    
      STACS: Symposium on Theoretical Aspects of Computer Science
    
  Conference Location
    
      Freiburg, Germany
    
  Conference Date
    
      2009-02-26 – 2009-02-28
    
  Cite this
    Baykan Eda, Henzinger M, Keller SF, de Castelberg S, Kinzler M. A comparison of techniques for sampling web pages. In: 26th International Symposium on Theoretical Aspects of Computer Science. Vol 3. Schloss Dagstuhl - Leibniz-Zentrum für Informatik; 2009:13-30. doi:10.4230/LIPICS.STACS.2009.1809
    Baykan, Eda, Henzinger, M., Keller, S. F., de Castelberg, S., & Kinzler, M. (2009). A comparison of techniques for sampling web pages. In 26th International Symposium on Theoretical Aspects of Computer Science (Vol. 3, pp. 13–30). Freiburg, Germany: Schloss Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.STACS.2009.1809
    Baykan, Eda, Monika Henzinger, Stefan F. Keller, Sebastian de Castelberg, and Markus Kinzler. “A Comparison of Techniques for Sampling Web Pages.” In 26th International Symposium on Theoretical Aspects of Computer Science, 3:13–30. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009. https://doi.org/10.4230/LIPICS.STACS.2009.1809.
    Eda Baykan, M. Henzinger, S. F. Keller, S. de Castelberg, and M. Kinzler, “A comparison of techniques for sampling web pages,” in 26th International Symposium on Theoretical Aspects of Computer Science, Freiburg, Germany, 2009, vol. 3, pp. 13–30.
    Baykan Eda, Henzinger M, Keller SF, de Castelberg S, Kinzler M. 2009. A comparison of techniques for sampling web pages. 26th International Symposium on Theoretical Aspects of Computer Science. STACS: Symposium on Theoretical Aspects of Computer Science, LIPIcs, vol. 3, 13–30.
    Baykan, Eda, et al. “A Comparison of Techniques for Sampling Web Pages.” 26th International Symposium on Theoretical Aspects of Computer Science, vol. 3, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009, pp. 13–30, doi:10.4230/LIPICS.STACS.2009.1809.
  
      All files available under the following license(s):
            Copyright Statement:
            This Item is protected by copyright and/or related rights. [...]
  Access Level
     Open Access
Sources
 arXiv 0902.1604