Monday, January 01, 2007

Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) seems to be the latest scientific approach within the search engine industry. It promises to bring more human sense to search results by focusing on the context of the content rather than on the keywords it contains. The idea is simple: if we were to search through web documents by hand for information related to a particular search term, we would likely judge each result by the theme, context and meaning of the site rather than by whether a word does or doesn't appear on the page.

For example, Latent Semantic Indexing should be able to recognize that the word "engine" in "search engine optimization" is not related to searches for terms like "steam engine" or "locomotive engine" and is instead related to Internet marketing topics. In theory, LSI gives a much more accurate page of results, and also a broader range of pages that are still geared towards a particular topic.

It is widely acknowledged that Google is the search engine at the forefront of latent semantic indexing. Essentially, Google has always tried to generate results pages that are filled with genuine, useful results, and LSI adds another string to the bow of Google’s existing algorithms. Yahoo and MSN, for now, seem more than happy to stay with keyword-specific indexing, although Yahoo is known to look at singular and plural keyword variations, as well as keyword stemming, when judging keyword density.

Aside from Google, Yahoo and MSN, there are many smaller companies taking different semantic approaches to organizing the information on the web. One in particular appears really helpful to Google’s LSI approach: it simply creates semantically precise in-(con)text hyperlinks from meaningful words, phrases and whole sentences, thereby connecting them to other content areas with the same context.

Let’s assume that Google has already implemented the Latent Semantic Indexing (LSI) model and that this company has expanded significantly across the web. It then becomes interesting to see how its unique contextual approach could help Google’s semantic bots on their journey, indexing the huge mass of information on the web on a much more contextually, grammatically and meaningfully precise basis.

Aside from that simple yet pragmatic contextual approach, there is another company that takes a more scientific one: Telcordia Technologies, Inc. Their demo machine, called Telcordia Latent Semantic Indexing (LSI), is built on a novel, patented information retrieval method developed at Telcordia Technologies, Inc. By using statistical algorithms, LSI can retrieve relevant documents even when they do not share any words with a query. LSI uses these statistically derived "concepts" and is claimed to improve search performance by up to 30%.
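
To make the "no shared words" claim more concrete, here is a minimal sketch of the textbook form of LSI: build a term-document matrix, take a truncated singular value decomposition (SVD), and compare the query and the documents in the reduced "concept" space. This is not Telcordia's or Google's implementation. The four toy documents, the choice of k = 2 concepts, and the helper names to_vector and lsi_similarities are all made up for illustration.

```python
import numpy as np

# Toy corpus (hypothetical): two documents about vehicles, two about search.
docs = [
    "car engine",
    "automobile engine",
    "search ranking keyword",
    "search keyword",
]

# Vocabulary and a word -> row index lookup.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def to_vector(text):
    """Turn text into a raw term-count vector over the vocabulary."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(doc) for doc in docs])

# Truncated SVD: keep only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumption: two concepts are enough for this tiny corpus
U_k = U[:, :k]


def lsi_similarities(query):
    """Cosine similarity between the query and every document,
    measured in the k-dimensional concept space."""
    q = U_k.T @ to_vector(query)   # query projected onto the concepts
    D = U_k.T @ A                  # documents projected onto the concepts
    q_norm = np.linalg.norm(q)
    if q_norm == 0.0:
        # The query contains no known words, so nothing can match.
        return np.zeros(len(docs))
    return (q @ D) / (np.linalg.norm(D, axis=0) * q_norm)


if __name__ == "__main__":
    for doc, score in zip(docs, lsi_similarities("car")):
        print(f"{score:6.3f}  {doc}")
```

In this sketch the query "car" scores "automobile engine" about as highly as "car engine", even though the query and that document share no word, because "car" and "automobile" both co-occur with "engine". The two search-related documents score close to zero. A real system would work with a far larger corpus, weight the term counts (for example with tf-idf), and keep many more concepts.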

More information on the topic: