Thursday, March 15, 2007

A new type of community-based searching?

I happened to read the list of Google acquisitions, and it seems search algorithms are amongst its pet products, which is hardly surprising for a well-known search engine. One of their latest acquisitions is a search algorithm created by a computer science Ph.D. candidate, Ori Allon, from the University of New South Wales. Basically, this search algorithm produces hits not only for the initial search terms, but also for webpages on related topics. For example, if I enter the search terms "Albert Einstein", I would get hits not only for webpages containing the name "Albert Einstein", but also for webpages containing the terms "relativity", "photoelectric effect", "Grand Unified Theory", "Nobel Prize", etc.

However, it's still quite a hassle to use Google to search for a community of webpages with related content. For example, if a reader wants to search for comments on the AcidFlask VS Mr Philip Yeo exchange, he would have difficulty locating the community of webpages with material devoted to the exchange.

I understand that Google's search spider algorithms trawl the Internet and update the Google database. I doodled a possible algorithm that could allow searching for a community of websites.

The algorithm is as follows:
1) Generate a lexicon pattern from a particular webpage. This is a bit like the lexical analysis that most compilers perform. There should be an initial filtering step to remove grammatical terms and words like "the", "is", "they", "her", "him", "she", "it", etc. Lexicon patterns should contain important jargon, terminology and names. For example, if a reader is interested in the AcidFlask VS Mr Philip Yeo exchange, the lexicon pattern should consist of terms like "Philip Yeo", "AcidFlask", "Elia Diodati", "biomedical research", "A*STAR", "scholarship", "GPA", etc.

2) Sort the terms that make up the lexicon patterns into alphabetical and numeric order.

3) Align the lexicon pattern with the lexicon patterns of other webpages using an alignment-based algorithm similar to NCBI BLAST. Good ol' NCBI has a BLAST page that compares DNA sequences and calculates the extent of sequence similarity. In the same way, the degree of relation between the webpage and other webpages can be calculated, and the most relevant hits can be returned.
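
The three steps above can be roughed out in Python. This is only a sketch under several assumptions of mine: the stop-word list is a tiny placeholder, terms are single words rather than multi-word names like "Philip Yeo", and the alignment step is NOT BLAST. I have swapped in a much simpler stand-in that walks the two sorted term lists side by side, counts the terms they share, and divides by the number of distinct terms overall (a Jaccard score). The function names `lexicon_pattern` and `alignment_score` are made up for illustration.

```python
import re

# Placeholder stop-word list (an assumption); a real filter would be much larger.
STOP_WORDS = {"the", "is", "they", "her", "him", "she", "it",
              "a", "an", "and", "of", "to", "in", "on", "for", "with"}

def lexicon_pattern(text):
    """Steps 1 and 2: keep distinct non-stop-word terms, then sort them."""
    # Lowercase the text; '*' is kept so that "A*STAR" survives as one term.
    words = re.findall(r"[a-z0-9*]+", text.lower())
    terms = {w for w in words if w not in STOP_WORDS}
    return sorted(terms)

def alignment_score(pattern_a, pattern_b):
    """Step 3 stand-in: merge-style walk over two sorted term lists.

    Returns shared terms / distinct terms overall, from 0.0 to 1.0.
    """
    i = j = shared = 0
    while i < len(pattern_a) and j < len(pattern_b):
        if pattern_a[i] == pattern_b[j]:
            shared += 1
            i += 1
            j += 1
        elif pattern_a[i] < pattern_b[j]:
            i += 1  # term only in pattern_a, skip it
        else:
            j += 1  # term only in pattern_b, skip it
    total = len(pattern_a) + len(pattern_b) - shared
    return shared / total if total else 0.0

if __name__ == "__main__":
    page_a = lexicon_pattern("AcidFlask and Philip Yeo on the A*STAR scholarship")
    page_b = lexicon_pattern("Philip Yeo, A*STAR and the GPA of a scholarship holder")
    print(page_a)
    print(alignment_score(page_a, page_b))
```

Sorting in step 2 is what makes the side-by-side walk possible, since both lists can be scanned once from start to end. A real BLAST-like alignment would also tolerate near matches, which this stand-in does not.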

In that way, if I find a webpage through an initial web-wide search on Google, the algorithm can help me find a community of webpages devoted to discussing that particular topic. If I want to learn a bit more about related topics, I can select a candidate webpage from the same community that discusses the related topic of interest on top of the main topic, and look for a new community using the algorithm. For example, an interested reader might initially read about the AcidFlask VS Mr Philip Yeo exchange and might stumble on the related topics of "GPA recalculation" and "graduate school admission committees" in my blog. The reader activates the algorithm, which generates the lexicon pattern of my blog and aligns it with those of other webpages, and will soon be introduced to the whole list of webpages dedicated to the discussion of GPA recalculation and graduate school admission committees. From one community, the reader can broaden his horizons and learn about a list of related topics. In effect, this type of community-based searching achieves a result similar to that of the algorithm developed by Ori Allon.
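
To illustrate the hop from one community to the next, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the page names and their text are invented, terms are plain whitespace-separated words, similarity is a simple set overlap rather than a BLAST-style alignment, and `find_community` and its `threshold` parameter are hypothetical names.

```python
# Tiny placeholder stop-word list (an assumption).
STOP_WORDS = frozenset({"the", "is", "a", "of", "and", "on", "to"})

def terms(text):
    """Distinct lowercase words of a page, minus stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def similarity(a, b):
    """Shared terms divided by all distinct terms (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_community(seed, pages, threshold=0.2):
    """Return the pages most related to the seed page, best match first."""
    seed_terms = terms(pages[seed])
    scored = [(similarity(seed_terms, terms(text)), name)
              for name, text in pages.items() if name != seed]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]

# Invented pages, purely for illustration.
PAGES = {
    "blog-a": "acidflask philip yeo exchange scholarship",
    "blog-b": "philip yeo acidflask exchange comments",
    "blog-c": "gpa recalculation graduate school admission scholarship",
    "blog-d": "gpa recalculation admission committees graduate school",
}

if __name__ == "__main__":
    print(find_community("blog-a", PAGES))  # community around the exchange
    print(find_community("blog-c", PAGES))  # hop to the GPA community
```

Starting from "blog-a" the reader lands in the community around the exchange; picking "blog-c" as the next seed moves the search over to the pages about GPA recalculation. The threshold decides how loosely related a page may be and would need tuning on real data.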

Citations
1) Ori Allon. http://en.wikipedia.org/wiki/Ori_Allon

2) Google's purchase of search algorithm developed by Ori Allon. http://blogs.zdnet.com/Google/index.php?p=157

3) NCBI BLAST page. http://www.ncbi.nlm.nih.gov/BLAST/
