|Published (Last):||22 September 2012|
The title instantly caught my attention, and I began reading as soon as a generous friend downloaded the restricted-access PDF and sent it to me. The theme of this paper is using the World Wide Web as a data source for various data-intensive tasks. Now, how is this related to the topic? Well, the usual way to enter the WWW is a search engine! Can you see the light?
Some of the examples of this approach mentioned in the article are: using hit counts to identify likely translations of compositional phrases, finding synonyms, and building models of noun-noun compound bracketing (what is that supposed to mean??). There was also a team that worked on validating the results of these web experiments by comparing them with human subjects. Of course, I use the World Wide Web and word counts for something else too: spell check! I will come to this in the coming lines. Imagine a language with more inflections or varied constructions!
Strangely enough, the reasons I expected did not find a mention here: 1) The unreliable and inconsistent search engine counts. Strange that this is not mentioned in the above reasons, but only later in the paper! Here is a good article about this.
My strong objection: this will perpetuate errors. Let us say a particular word appears only a small number of times on the web, and it has a popular misspelling. What if the people who use the correct spelling write less on the web than those who use the wrong one?
As time passes, the hits for the wrong spelling increase. Well, this was my experience the couple of times I tried relying on Google search counts to check the spellings of a few Telugu words. Yes, there was also a discussion of the presence of too many duplicate pages and too much spam. Bah, I hate those duplicate pages: I had to invent all sorts of ugly workarounds in our project, at a big cost, to avoid duplicates being shown in the results.
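To make the spelling worry concrete, here is a minimal sketch of a "trust the bigger hit count" spell check and how it can feed on itself. Everything in it is hypothetical: the function names, the placeholder variants, the counts and the assumption that new pages copy whatever the checker prefers are all invented for illustration, and no real search engine is queried.

```python
def pick_spelling(hit_counts):
    """Naive rule: trust whichever variant has the most hits."""
    return max(hit_counts, key=hit_counts.get)


def simulate_drift(hit_counts, rounds, new_pages_per_round=100):
    """Assume (hypothetically) that each round, new pages adopt the spelling
    the naive rule currently prefers, which then raises its own count."""
    counts = dict(hit_counts)  # work on a copy, leave the input untouched
    for _ in range(rounds):
        counts[pick_spelling(counts)] += new_pages_per_round
    return counts


# Made-up counts: the correct form is rarer on the web than its misspelling.
observed = {"correct_form": 400, "popular_misspelling": 650}
print(pick_spelling(observed))
print(simulate_drift(observed, rounds=3))
```

Under these assumptions the correct form never catches up: the rule keeps endorsing the misspelling, and every endorsement widens the gap.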
Duplicates, I think, are a big issue, even now, even in Google. Anyone who proceeds beyond page 1 of Google search results knows that. The alternative for language researchers' queries, according to this paper, is to build a search engine that works around these issues.
They actually tried this and prepared web corpora for German and Italian, which are publicly accessible.
Their hope is that a collaborative effort by the research community might be able to reach the efficiency level of a commercial search engine. To me, data cleaning appears to be an interesting problem.
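One piece of that data cleaning is getting rid of the duplicate pages complained about earlier. A rough sketch, assuming a standard technique (word shingles compared by Jaccard resemblance) rather than whatever the paper's authors or our project actually used; the function names, the shingle size `k=3` and the `0.8` threshold are all my own illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts get a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard resemblance of two texts: shared shingles over all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(pages, threshold=0.8, k=3):
    """Keep a page only if it is not too similar to any page already kept."""
    kept = []
    for page in pages:
        if all(resemblance(page, other, k) < threshold for other in kept):
            kept.append(page)
    return kept


pages = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about web corpora",
]
print(drop_near_duplicates(pages))
```

This compares every page against every kept page, which is fine for a toy list but far too slow at web scale; real crawls compare compact sketches of the shingle sets instead of the full sets.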
Ultimately, the aim is to develop a web-scale, commercial-quality, low-noise corpus that linguistic and language technology researchers can use in their experiments. Now comes the issue, which a cynical person like me would emphatically answer with a big NO! Is it? The article details: Googleology is bad science, A. Kilgarriff, Computational Linguistics 33(1).
Very good, informative, and I agree with you on contextual word help scenarios.
Bhaskara Rami Reddy: I actually forgot to mention one more point in that scenario I mentioned. I noticed that Google Transliterate has this problem.
However, there are a few issues with this approach, as the paper says: 1) Firstly, to get a realistic estimate, we might have to issue several queries to the search engine.
Last Words: Googleology is Bad Science