Defining Search Engines

The primary purpose of a search engine is to direct users to data relevant to their query. This is done by returning a set of pointers to data - “links” along with a preview (small subset) of the data. A search engine can obtain such links in any way, including:

- crawling the web autonomously
- structured XML feeds (eg sitemaps)
- manual user submission

The distributed and hyperlinked nature of the web allows users to publish data at a single location, allowing it to be discovered by other users and search engines. In contrast to pushing data to users by publishing it in various locations, this approach eliminates management and versioning problems if data needs to be updated. In other words, allowing search engines to discover your data is better than publishing it in various centralized databases (web applications such as classified listings sites, Google Base, etc).

Vertical Search

A vertical search engine only contains pointers to data of certain type / class. Reducing the search space to data of a certain class greatly improves the relevancy of search results, as is evident by recent popularity of vertical search engines like indeed (jobs), edgeio (classifieds), sidestep (travel), kelkoo (shopping) and extate (property).

Generic search engines like google consider all data to be of the same class - text. The only attributes used for searching and refinement are keywords. This may have been appropriate at a time when most of the information on the web was static text - articles, reviews, stories, etc. Now, however, much information comes from databases. Therefore, different classes of data have different attributes, just as in the object-oriented programming world: a book for sale has a price, a property has N bedrooms, a business has an address. A vertical search engine allows searching and filtering data by these attributes, which is crucial.

Horizontal Search

Some may say that a vertical search engine is not generic enough - its a function that works only on certain inputs. Horizontal search on the other hand is not limited to data type - its limited to a set of attributes. In essence, it allows searching over a pre-defined set of attributes across different information types. Google is an example of horizontal search where allowed attributes are keywords. Possible other horizontal search engines may be:

- Local Search: find everything in a given location
- Price Search: find everything in given price range

although these are currently limited to certain classes of data as well to improve search results.

Search Matrix

We can combine both vertical and horizontal search scenarios in a matrix, where the verticals represent different classes of data and horizontals - different attributes that apply. Some of these attributes may be boolean - such as keywords, or quantitative - such as price.

Both vertical and horizontal search engines currently only support searching by a constant set of attributes. The primary reason is ease of implementation: a constant set of attributes allows data to be stored in a single table with the applicable schema. It also allows the user interface to map directly to SQL statements, and keeps the interface (such as filtering of results) more or less constant.

The ultimate goal however is obvious - a search engine that uses all possible attributes to search over all possible classes. The aim is thus to cover as much of the matrix as possible. In light of this, a query can be represented as a set of attributes:

Q = { m1a1, m2a2, m3a3 … }

where mN is a quantifier (the amount of attribute a), and aN is the attribute. For example, a query for a flight may be:

Q = { ‘london to new york’:keywords, 0-9:hours, 0-300:dollars }

Even if this can be achieved, the user may often want to limit results to a certain class. Yet the query does not always provide enough information to determine a unique class to return. For example, a query for:

Q = { 0-300:dollars }

will return everything from flights to books. Whilst allowing a user to filter results is possible, this is highly inefficient. Alternatively, the user may specify the class of data to return:

Q = { ‘books’, 0-300:dollars }

In essence, we encode the class as an attribute, thus turning our search matrix into a more regular representation: document - attribute matrix, where attributes are not limited to boolean values (absence or presence of a keyword). Given the above representation of the query, it may be possible to deduce a variation of the classic boolean model: a quantitative boolean model.

Data Class Hierarchy

Classes themselves form a hierarchy - books and gadgets is a subclass of items for sale, for example. As in OOP, attributes are inherited from parent classes: all items for sale have a price, yet books also have an author. Can such a directory of all things be constructed manually? Tedious, yet possible.

An important question remains: how to determine the type of data. Perhaps the set of extracted attributes will provide significant insight, or machine learning and classification techniques will do the trick. Current vertical search engines solve it by manually specifying the data sources, however this is very inefficient. Additionally, being able to identify the permanent URL of a single data object (not some list thereof) is another problem altogether. Perhaps structured initiatives, such as providing RSS feeds of data objects, will prove to be effective solutions.

Indeed natural language processing techniques may allow us to determine nearly all quantitative attributes, and perhaps even classify data accurately. However, will a single company be able to achieve this? The distributed nature of the web is more likely to produce vertical search engines that focus on parts of the search matrix, solving a single problem really well, rather than a difficult problem poorly.

Comments

Leave a Reply