Identifying Queries which Demand Recency Sensitive Results in Web Search

Since joining University of Glasgow for my Master’s I haven’t written a blog post. The main reason for this has been the amount of coursework which I have had to wade through in the last few months.

As part of a Research Methods and Techniques course in Semester 1, we had to submit a literature review on a topic of our choice. I chose the above topic since I had an inclination towards how freshness of web search results was maintained. I got interested in the topic after reading the papers by Dong et al. on the topic. [3,8]. I got a decent grade but, obviously, there were a huge number of errors. I thought it would be useful to blog it – just to give me a reason to correct some of  the mistakes.

WARNING: This is a lot of text for one blog post.

Title: Identifying queries which demand recency sensitive results in web search

1. Introduction

Recency ranking in web search deals with the incorporation of temporal aspects of documents into the ranking model. The aim in recency ranking is to provide results to the user which reflect the freshness of the documents without compromising on the relevance of the results. Recency, as an aspect of web search, started gaining importance with the advent of news-search. More recently, it is the demand for real-time search results which has been driving research into ranking algorithms which take freshness of content into account.

Dong et al. [3] states that there are two major challenges in providing recency based ranking in web search:

  • Firstly, The search engine has to identify which queries require recency sensitive ranking. This is important since recency sensitive ranking algorithms tend to perform poorly when used for queries where the retrieved data is relatively static.
  • Secondly, once the query has been identified, ranking recently crawled content in the absence of rich historical data, such as click-through rates and in-links is a major challenge. This is because prestige based ranking algorithms like PageRank [7] give a lot of weighting to such features.

Both challenges are closely related since the method used to identify queries which need recency sensitive treatment must drive the method used in the ranking. In this literature review, we will be looking at the solutions proposed to solve the first challenge, though we will touch upon some relevant relations between the two stated problems. Broadly, the solutions are from the fields of news search [1, 4], time-series analysis [2, 3, 5] and data mining [6].

In Section 2, we will first examine the problem is detail. Section 3 deals with the critical examination of the methods used to solve the problem. And section 4, presents a conclusion which takes a holistic approach to the problem and tries to identify a research gap.

2. Problem Statement

In traditional information retrieval, documents are considered as static entities which do not change with time. This implies that the relevance of a result set associated with a query is also seen as a constant. But new content is generated on the web very frequently in the form of blogs, news, tweets and other user-generated content. Thus in web search, users demand results which reflect the recency of the content while maintaining the relevance of the result set. Thus recency ranking is indispensable.

One way of approaching recency ranking would be to introduce a feature, as part of the ranking algorithm, which gives a certain weight to the recency of a document irrespective of the query. But Dong et al. [3] have shown that recency algorithms perform poorly when applied to queries with a static information need. For example, “liver” is a static query. The result set remains fairly constant year-on-year. Applying recency ranking to this query will result in a long established result losing out. Hence there is a need for identification of recency sensitive queries before ranking.

When a user enters a query they may or may not specify whether their intent is recency sensitive. For example, the intent behind a query such as “earthquake” could be interpreted in two ways. The user may be looking for geological information regarding earthquakes in general or they may be looking for information regarding a specific, recent earthquake. Since users do not clearly specify their information need in terms of recency, identification of such queries becomes a challenging problem.

It is not possible to identify the intent of the user query by analysing the query on its own. Thus, the methods used to tackle this problem consider more other sources to perform recency classification of the query.

3. Methods Used to Solve Problem

The methods proposed to solve the above problem can be broadly divided into three classes. They are:

  1. News Vertical solutions : This consists of methods which are focused only on the news vertical. The solutions proposed in [1, 4] perform well on news-related queries but have not been proved successful on a more general query.
  2. Temporal Modelling : It is often difficult to capture the recency intent of a user using only a single query. Methods put forward in [2, 3, 5 ] deal with either modelling of queries and time spaces looking for discrepancies from the normal frequencies. Discrepancies can be viewed as reason to provide recency sensitive results.
  3. Composite model : Zhang et al. [6] proposes a model which takes into account the time series, query log analysis, click through data and other features and feeds it into a machine learning model to produce a classifier. Using all the above measures, the classifier can identify queries which require recency sensitive ranking.

3.1 News Specific Solutions:

There are 2 closely related approaches in this class. Diaz [1] and Konig et al. [4] both aim to find the same solution to the problem. Both aim to predict whether a user will click on the dedicated news, if displayed, for a particular query. The approaches taken by both are different and they have their advantages and disadvantages, Diaz [1] proposes a method which uses the fact that there exists a relation between click-through rates and recency-sensitive, news related queries. Here first the probability of a user clicking the separate news display if it is indeed displayed for a particular query, is estimated. An initial guess at the probability is made from the context of the query is made. This is then presented as a Beta distribution such that it depends on both the assumption as well as past clicks.

Query similarity is considered and then since the probability is defined, a threshold value is set. If the probability of the query getting clicks is higher than the threshold, then results are shown. This threshold value is leniently set so that more queries have an opportunity to get exposure. Most importantly, different from the approach taken by Konig et al. [4], click feedback is used to correct any errors made by the algorithm. It is shown experimentally that in case of an erroneous judgement regarding the showing of news results, since that result would not receive any clicks, the judgement will be reversed in time automatically.

It was seen that using click feedback improves on the baseline taken in the experiments. The leniency introduced in the threshold value did not improve the overall performance. This was due to poor classification of queries which were under the threshold value.

In the method proposed by Konig et al. [4], first a set of features are considered from the corpora. Then the corpus is defined to include wikipedia, blogs and news. After feature extraction, like tf-idf, cosine similarity, Jensen-Shannon divergence, the learning model is applied. The learning model used is the Multiple Additive Regression Trees model or MART. It is built on Stochastic Gradient Boosting. It is claimed that MART has better accuracy and can handle the rich feature set provided it. Randomness is added to the algorithm to improve robustness. This algorithm is then applied to the training data and after making sure that there is no overlap between the training data and the test-collection, the testing is done. Quantitative CTR prediction for future queries is observed to be very accurate. PR curves for the threshold CTR values shown below (as seen in paper). Overall prediction success = 82%.

The major difference between the methods proposed by Konig et al. [4] and Diaz [1] lie in the usage of temporal features and machine-learning algorithms used. Both methods are biased towards news related queries and so fail to answer questions relating to coverage of different queries.

3.2 Temporal Modelling

This class of methods approaches the problem by modelling the temporal aspects of either the queries or the results of the queries to identify which ones are recency-sensitive.

The methods put forward by Vlachos et al. [5] and Dong et al. [3] are opposite in their approach. Vlachos et al. [5] proposed a method that follows the topic frequency over different time-slots, detecting any discrepancies in the frequency or “bursts” of activity. But in the methods proposed by Dong et al. [3], time slots are modelled and compared with each other.

Vlachos et al. [5] introduce methods to identify important features in the time series and also to identify bursts in activity. The paper aims to match queries with temporal bursts in activity. The aim of their compression algorithm is to minimize the Euclidean distance between the query and the time-series representation. Best Fourier coefficients are used to represent the compression because they describe the series better. 3 algorithms are put forward to improve the similarity search of the query with the compressed time-series – BestMin, BestError and BestMinError.

A cumulative distribution function of the time-series is used and the high-frequency of queries during important periods will be significantly off from the mean. Thus the periods with maximum query activity are identified. Finally, to identify bursts in query activity which are out of the ordinary, first identify and then compress the data to enable comparison later. The storage can be done on any relational database. For the detection phase, moving averages are used over the time-series. If a power value was far higher that the moving average, then it was deemed to be a burst. Next, to find all queries with similar burst patterns, the features and compressed and then compared. Similarity and overlap between the bursts are checked. Thus it becomes possible to identify a set of queries which show a spike in activity.

Dong et al. [3] proposed a method which aimed to classify a classify a query as recency sensitive. Time-slots, as opposed to topics or queries, were modelled. N-gram models were made for content and queries in each time slot and compared with past time-slot models. Any discrepancy in content and query models would then show as a recency trend. A final ‘buzz’ score is computed for every query. If this value exceeds a certain threshold, the query is considered recency sensitive. To determine the ideal constants of the function, some manual classification was done. Through the reported experiments we see that the query classifier has an accuracy approaching 90% at times when there is high breaking-news traffic. At other times the classifier doesn’t work as well.

Finally, as part of the recency ranking framework, four separate recency features were considered and compared -Timestamp, Linktime, WebBuzz and Page classification. The ranking used machine learning methods using recency training data as well as regular relevance training data. GBrank was the machine-learning algorithm used. Three ranking models were also tested – Composition Model, Overweighting Model and the Adaptation Model.

Another method which defines a query temporal model is proposed by Diaz & Jones [2]. This paper looks at the temporal nature of query results with the aim of predicting the relevance of the result set. A temporal profile is made for the query based on timestamps of the top 5 documents retrieved. The granularity is defined at 1 day. Smoothing of the model is required since some days may have an awkward spike in query traffic while other days may have no traffic at all. This concept of smoothing has been discussed in time series analysis. Once the model is prepared, features are extracted to predict the relevance of the result set. This method of modelling queries is both simple and elegant from a theoretical point of view. From the context of this paper, it was effective as well.

3.3 Composite Modelling

While the above models considered certain models and concepts to solve the recency-sensitive query identification problem, none have combined all the relevant metrics to see which ones perform the best.

In the paper by Zhang et al. [6], a very good coverage of metrics is seen to solve a unique problem. A number of queries occur at regular time intervals but require recency sensitive handling. E.g. “SIGIR”. In 2010 – the results should focus on SIGIR2010 and not on past conferences. These queries are called Recurrent Event Queries, or REQ. The time at which these queries occur is predictable and they account for 6% of all queries issued and hence are worth dealing with.

The paper presents an REQ classifier which uses a wide range of features together. Also they are all combined by machine learning algorithms. The paper defines 3 different machine learning algorithms and compares them to find the most effective.

The paper uses a number of features which it then applies to a machine learning framework. First deals classifying queries into ones with explicit time, implicit time, and no time markings. Query log analysis is used to obtain metrics like whether a query has an implicit or explicit temporal nature, the frequency of that query, how many different years are has that query been on for and a chi-square distance. Metrics on changes to queries done by users during a session are also included. Click log analysis is used to get the click-through rates of the queries. The result sets of implicit time queries are checked as to whether they show any year features. A list is made with a series of REQ words which are tokenized from implicit queries.

Three different learning algorithms are used to combine the above metrics – Naive Bayes, SVM [9] and Gradient Boosted Decision Tree (GBDT). For the purpose of evaluating the effectiveness of the learning algorithms, 6000 queries were marked as REQ or not manually by human editors.

After running the algorithms evaluation was done using a PR graph. Precision was determined as REQ/(total classified REQ). Recall was defined as REQ/(all REQ). Finally from the experiments it was found that the Gradient Boosted Decision Tree machine learning algorithm was the one which performed the best. The ratio of the no. of times a query is issued with year a qualifier to without, was the most important feature. And finally 3.6% gain in DCG at the top position was observed against regular existing web search.

4. Future Scope and Research Gap

There is work still to be done in the field of recency query identification. Twitter is a source of real-time news. It still needs to be investigated if mined Twitter data can be used effectively to identify recency sensitive queries. Also, incorporating data from book-marking sites like Digg or can be used as an indicator of recency sensitivity of a related query.

5. Bibliography

[1] Fernando Diaz. 2009. Integration of news content into web results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM ’09), Ricardo Baeza-Yates, Paolo Boldi, Berthier Ribeiro-Neto, and B. Barla Cambazoglu (Eds.). ACM, 182-191.

[2] Fernando Diaz and Rosie Jones. 2004. Using temporal profiles of queries for precision prediction. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’04). ACM, USA, 18-24.

[3] Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruiqiang Zhang, Karolina Buchner, Ciya Liao, and Fernando Diaz. 2010. Towards recency ranking in web search. In Proceedings of the third ACM international conference on Web search and data mining (WSDM ’10). ACM, USA, 11-20.

[4] Arnd Christian Konig, Michael Gamon, and Qiang Wu. 2009. Click-through prediction for news queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’09). ACM, USA, 347-354.

[5] Michail Vlachos, Christopher Meek, Zografoula Vagena, and Dimitrios Gunopulos. 2004. Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (SIGMOD ’04). ACM, USA, 131-142.

[6] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, and Zhaohui Zheng. 2010. Learning recurrent event queries for web search. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10). Association for Computational Linguistics, Morristown, NJ, USA, 1129-1139.

[7] PageRank

[8] Twitter for recency ranking dong et al.

[9] Corinna Cortes and V. Vapnik, “Support-Vector Networks”, Machine Learning, 20, 1995.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s