Citation: Elnaz Delpisheh, Aijun An, Heidar Davoudi, Emad Gohari Boroujerdi. Time aware topic based recommender System[J]. Big Data and Information Analytics, 2016, 1(2): 261-274. doi: 10.3934/bdia.2016008
People are confronted with an ever-growing amount of data, which in turn places greater demands on their ability to filter content according to their preferences. Among the increasingly overwhelming number of webpages, documents, pictures, and videos, it is no longer easy to find what we really need. Furthermore, several information sources often cover the same topics, sometimes with duplicate content. Users are sensitive to the recency of information, and their interests change over time along with the content of the Web [23].
During the past two decades, recommender systems have emerged to remedy the situation. Recommender systems are closely tied to extensive work in cognitive science, approximation theory, information retrieval, forecasting theories, and management science [1]. They have many applications, such as product recommendations at Amazon.com [18], movie recommendations by MovieLens [22], and news recommendations [1].
The increasing number of electronic news articles requires better tools for searching, exploring, and organizing news article collections. Previously, news articles were collected and stored in large text repositories and retrieved by a set of keywords. News articles were seldom analyzed by theme, because there were very few technologies to extract their thematic structures. Moreover, newspaper companies typically do not require users to subscribe and create user profiles, so users read news articles anonymously. Therefore, news recommender systems have to make recommendations without clear user profiles. In addition, many recommendation techniques face the cold start problem. This problem occurs when there is insufficient data to draw any inferences for new users or items [4].
To remedy the situation, in this paper, we design a news recommender system that eases reading and navigation through online newspapers. In essence, the recommender system acts as a filter, delivering only news articles that can be considered relevant to a user. The resulting recommender system based on the proposed method is currently operational at The Globe and Mail. The Globe and Mail offers the most authoritative news in Canada, featuring national and international news.
The major contributions of this paper are as follows:
● Inferring users' profiles and predicting users' preferences by analyzing contents of large collections of news articles.
● Designing a news recommender system based on both the content of news articles and users' time spent on each article.
● Conducting experimental studies on a news corpus, which show that the proposed system outperforms baseline recommendation approaches in terms of precision, recall, and F-measure.
The structure of this paper is as follows. In Section 2, the related literature is reviewed. Section 3 describes the main objectives of a news recommender system. Section 4 presents our proposed content-based news recommender system. In Section 5, we demonstrate the effectiveness of our approach through experiments. Section 6 concludes the paper and discusses future work.
All of the known recommender techniques have strengths and weaknesses. In this section, we briefly survey the different recommender techniques, the data that they support, and the algorithms they employ [5,6].
On this basis, the following three recommender techniques are distinguished: collaborative filtering-based, content-based, and hybrid.
Collaborative filtering-based recommender systems make recommendations based on the behavior of other users in the system. Intuitively, these systems assume that if users agree about the quality of some items, they will likely agree about other items [9]. For example, if a group of users has tastes similar to Mary's, then Mary is likely to like the items the group likes that she has not yet seen. However, in this approach the introduction of new users or new items can cause the cold start problem, as there is insufficient data on these new entries for collaborative filtering to draw any inferences. The system requires a substantial number of users to show interest in a new item before that item can be recommended [4,6]. Addressing the cold start problem can be important for a new user's engagement and is therefore of critical significance in commercial applications.
Content-based recommender systems recommend items similar to items a user preferred in the past [1]. For example, a content-based news recommender system observes the collection of news articles a user prefers and reads frequently. Then, only the news articles that have a high degree of similarity to the user's read articles are recommended. The greatest strength of this approach is that it only considers the properties of an item, i.e. the content of news articles, and accordingly makes recommendations. Therefore, in this approach, once a new user is introduced to the system, as soon as they read their first article, the content-based recommender system starts by recommending articles similar to the read article. Thus, this approach does not cause the cold start problem mentioned in collaborative recommender systems. The weakness of this approach is that users are limited to being recommended news articles that are similar to their read history.
Hybrid recommender systems generate recommendations by combining the above two techniques, aiming to maximize their benefits and minimize their disadvantages [1]. For example, a hybrid recommendation system that combines content-based and collaborative recommendation systems considers both the content of news articles and a user's demographic information to issue recommendations. Because this approach includes a collaborative component, it inherits the disadvantages of collaborative systems. Therefore, it also suffers from the cold start problem.
Because of the textual nature of our news application domain, and to avoid the cold start problem, we focus on content-based recommender systems. Most existing content-based news recommender systems are based on keywords; that is, they represent the content of news articles using a set of keywords and neglect the thematic structure of the articles. We apply topic models to discover hidden themes of the news articles, and we incorporate these themes into a content-based recommender system. The reasons we employ topic models in news recommender systems are as follows. Firstly, topic models yield insight into the different themes of a newspaper article. Secondly, topic models capture the probabilities of assigning different themes to newspaper articles. Thirdly, topic models provide a generative probabilistic model for the themes. As a consequence, topic models can assign probabilities to an unseen document. Our experimental studies show that the proposed recommender system yields more accurate results than its counterparts.
News recommender systems arise to efficiently handle the overwhelming number of news articles, simplify navigation, and retrieve relevant information. Formally, the recommendation problem can be formulated as follows: Let
Let
$$\forall u_l \in U, \quad c'_{u_l} = \arg\max_{c \in C} f(u_l, c). \tag{1}$$
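Equation 1 can be read as a per-user argmax over candidate items. The following is a minimal sketch of that reading, not the system's implementation: the function name `best_item_per_user` and the toy utility table are hypothetical, and the utility function is simply passed in as a callable.

```python
from typing import Callable, Dict, Hashable, Iterable


def best_item_per_user(
    users: Iterable[Hashable],
    items: Iterable[Hashable],
    utility: Callable[[Hashable, Hashable], float],
) -> Dict[Hashable, Hashable]:
    """For each user u, return the item c that maximizes utility(u, c)."""
    item_list = list(items)  # materialize once so it can be scanned per user
    return {u: max(item_list, key=lambda c: utility(u, c)) for u in users}


# Hypothetical toy utilities f(u, c), used only for illustration.
toy_utility = {
    ("u1", "a"): 0.2, ("u1", "b"): 0.9,
    ("u2", "a"): 0.7, ("u2", "b"): 0.1,
}
best = best_item_per_user(["u1", "u2"], ["a", "b"],
                          lambda u, c: toy_utility[(u, c)])
print(best)
```

In practice a recommender usually returns the top few items per user rather than a single maximizer; the argmax form is just the simplest case.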
In recommender systems, the sets
Footnote 1: In order to limit the effect of idle time spent on a news article, we normalize the time by scaling it between zero and five.
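The text does not say how the scaling to the range zero to five is done. One plausible sketch, under the assumption of min-max scaling with an optional cap on very long (possibly idle) reading times, is shown below; the function name `scale_times` and the `cap` parameter are hypothetical.

```python
from typing import List, Optional, Sequence


def scale_times(seconds: Sequence[float],
                cap: Optional[float] = None,
                upper: float = 5.0) -> List[float]:
    """Min-max scale reading times into [0, upper].

    `cap` is an assumed, optional ceiling applied before scaling so that a
    single idle session does not dominate the range.
    """
    values = [min(s, cap) for s in seconds] if cap is not None else list(seconds)
    lo, hi = min(values), max(values)
    if hi == lo:  # all times equal: no spread to scale
        return [0.0 for _ in values]
    return [upper * (v - lo) / (hi - lo) for v in values]


print(scale_times([0, 30, 60]))
print(scale_times([10, 20, 1000], cap=30))
```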
In our recommender system, the amount of time spent on the collection of non-read articles (
In this section, we propose a time aware content-based news recommender system that employs Latent Dirichlet Allocation (LDA)-based topic modeling to measure the similarity between read and non-read news articles. LDA-based approaches elicit a topic model from the collection of news articles. The topic model represents each news article as a multinomial distribution over topics, where each topic is a multinomial distribution over words. Then, given the time a user has spent on read news articles and the topic model of the collection of news articles, the time the user would spend on non-read news articles is estimated.
Latent Dirichlet Allocation (LDA), proposed by Blei et al. [3], is a generative probabilistic model for collections of discrete data such as text corpora. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA also assumes that a corpus is a collection of
Footnote 2: This set of vocabulary words can be the set of unique words contained in the corpus after removal of stop words.
In LDA, each document
This procedure is a joint probability distribution over the random variables
Note that words are the only observed variables. The hyperparameters
With a set of samples from the posterior distributions
$$\Theta_{d,t} = \frac{n_t^{(d)} + \alpha}{n_{\cdot}^{(d)} + K\alpha}, \tag{2}$$
where
Similarly,
$$\Phi_{t,w_i} = \frac{n_t^{(w_i)} + \beta}{n_t^{(\cdot)} + V\beta}, \tag{3}$$
where
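Equations 2 and 3 have the same shape: a count plus a Dirichlet hyperparameter, divided by the row total plus the hyperparameter times the number of categories. The sketch below assumes the usual reading of the symbols (document-topic counts with K topics for Equation 2, topic-word counts with V vocabulary words for Equation 3); the helper names are hypothetical and the count matrices are toy values.

```python
from typing import List, Sequence


def smoothed_rows(counts: Sequence[Sequence[float]], prior: float) -> List[List[float]]:
    """Return (count + prior) / (row_total + n_columns * prior) for every cell."""
    result = []
    for row in counts:
        denom = sum(row) + len(row) * prior
        result.append([(n + prior) / denom for n in row])
    return result


def estimate_theta(doc_topic_counts, alpha):
    """Equation 2: rows are documents, columns are the K topics."""
    return smoothed_rows(doc_topic_counts, alpha)


def estimate_phi(topic_word_counts, beta):
    """Equation 3: rows are topics, columns are the V vocabulary words."""
    return smoothed_rows(topic_word_counts, beta)


# Toy counts: one document with 4 tokens over 2 topics, 2 topics over 3 words.
theta = estimate_theta([[3, 1]], alpha=0.5)
phi = estimate_phi([[2, 1, 0], [0, 0, 1]], beta=0.1)
print(theta, phi)
```

Each row of the result is a proper probability distribution, which is what the smoothing terms in the denominators guarantee.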
Our content-based recommender system employs probabilistic topic models to uncover the thematic similarity between news articles and a user's preferences. Then, news articles that have a high degree of thematic similarity to the user's preferences are recommended.
We assume a collection of users is represented by
Our task is to appropriately recommend non-read articles to users or alternatively to assign users to non-read articles. In other words, for each non-read article
The proposed content-based news recommender system consists of the following three steps.
In this step, we use LDA-based topic models to best reflect the thematic structure of news articles. We build a topic model from the collection of read articles (
We use the topic model, built in Section 4.2.1, to infer the multinomial distribution of each non-read article (
For each user
The probability of article
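$$p(q_r \mid u_l, Q, D_{u_l}) = \frac{\mathrm{InterestingnessScore}(q_r, u_l, D_{u_l})}{\sum_{q_j \in Q} \mathrm{InterestingnessScore}(q_j, u_l, D_{u_l})}, \tag{4}$$
$$\mathrm{InterestingnessScore}(q_r, u_l, D_{u_l}) = \sum_{d_i \in D_{u_l}} \mathrm{DocSim}(q_r, d_i, D_{u_l}) \cdot \mathrm{timeSpent}[u_l, d_i]. \tag{5}$$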
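Equations 4 and 5 can be sketched as a time-weighted sum of similarities followed by normalization over the candidate articles. This is an illustrative sketch under simplifying assumptions: the similarity function is passed in as a two-argument callable (dropping the third argument of DocSim), the time values are for a single user, and all names and toy numbers are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable


def interestingness_score(candidate: Hashable,
                          read_articles: Iterable[Hashable],
                          time_spent: Dict[Hashable, float],
                          doc_sim: Callable[[Hashable, Hashable], float]) -> float:
    """Equation 5: sum over read articles of similarity times time spent."""
    return sum(doc_sim(candidate, d) * time_spent[d] for d in read_articles)


def recommendation_probabilities(candidates, read_articles, time_spent, doc_sim):
    """Equation 4: normalize scores over all candidate (non-read) articles."""
    read_list = list(read_articles)
    scores = {q: interestingness_score(q, read_list, time_spent, doc_sim)
              for q in candidates}
    total = sum(scores.values())
    if total == 0:  # no candidate resembles anything the user has read
        return {q: 0.0 for q in scores}
    return {q: s / total for q, s in scores.items()}


# Toy similarity table and normalized reading times for one user.
toy_sim = {("q1", "d1"): 0.5, ("q1", "d2"): 0.0,
           ("q2", "d1"): 0.25, ("q2", "d2"): 0.0}
probs = recommendation_probabilities(
    ["q1", "q2"], ["d1", "d2"], {"d1": 4.0, "d2": 1.0},
    lambda q, d: toy_sim[(q, d)])
print(probs)
```

Weighting by time spent means an article that the user read at length pulls similar candidates up more strongly than one that was only skimmed.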
We apply LDA-based topic model to compute the article similarity. We utilize two arrays
Cosine similarity is a measure of similarity between two vectors of an inner product space, defined as the cosine of the angle between them. The more similar, and hence the more co-oriented, the vectors are, the closer the cosine of the angle between them is to one. The cosine similarity measure is often used to compare documents for text mining, classification, and clustering purposes [7]. Equation 6 is used to calculate the similarity.
$$\text{cosine-similarity}(\vec{\Theta}_{q_r}, \vec{\Theta}_{d_i}) = \frac{\vec{\Theta}_{q_r} \cdot \vec{\Theta}_{d_i}}{|\vec{\Theta}_{q_r}| \times |\vec{\Theta}_{d_i}|}, \tag{6}$$
where "
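A minimal sketch of Equation 6 applied to two topic-distribution vectors follows. The function name and the handling of zero vectors (returning 0.0) are choices made for this illustration, not something the paper specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy topic distributions over three topics.
same_theme = cosine_similarity([0.8, 0.1, 0.1], [0.8, 0.1, 0.1])
different_theme = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_theme, different_theme)
```

Because topic distributions are non-negative, the value always falls between zero and one for such vectors.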
Finally, we return top
We conducted experiments on The Globe and Mail news article corpus. The Globe and Mail collection appeared on The Globe and Mail newswire during the period between January
We compare the performance of our proposed content-based recommender system against the following baseline recommendation systems:
This recommendation system relies solely on a bag-of-words tfidf representation of news articles. Term frequency-inverse document frequency (tfidf) [15] is a statistical measure that increases proportionally with the frequency of a term in a document but is offset by the number of documents in the corpus that contain the term. The
$$\mathrm{tfidf}(t, d) = tf_{t,d} \times \log\frac{M}{df_t}, \tag{7}$$
where
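Equation 7 can be sketched as follows, assuming that M is the number of documents in the corpus, that the document frequency counts the documents containing the term, and that the logarithm is natural. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one sparse tfidf vector (term -> weight) per tokenized document."""
    num_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({term: count * math.log(num_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


toy_corpus = [["news", "sports", "sports"], ["news", "politics"]]
vectors = tfidf_vectors(toy_corpus)
print(vectors)
```

A term that occurs in every document, such as "news" in the toy corpus, receives weight zero, which is why tfidf downplays terms that do not discriminate between articles.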
In this method, items are ranked by the time spent on each article, which serves as a popularity measure. The results based on popularity are not personalized, but this baseline is used in many studies [12] to show the effectiveness of methods.
The item-to-item collaborative filtering method has been used commercially by Amazon [18]. Each article is represented by a vector over the users who have spent time on it, and the cosine measure is then used to assess the similarity among articles. We tested the method with different numbers of neighbors and found
User-to-user collaborative filtering is a classical collaborative filtering method [24]. This method is similar to ItemKNN, except that each user is represented by a vector of the articles she has read and similarities are computed among users (rather than items). In this method, we used the same settings as for ItemKNN. UserKNN using Binary feature, like ItemKNN using Binary feature, ignores the amount of time spent on each article.
This method is based on non-negative matrix factorization [17], where the user-article matrix is factorized into two matrices with the property that all matrices have no negative values. Compared to traditional matrix factorization, the results of this method are interpretable and better suited to ranking tasks in recommendation. For this method, we set the number of factors to
Document Embedding learns a vector-space representation of the terms of a document by exploiting a two-layer neural network [21,16]. The architecture of the model that is used for training document embedding is the distributed memory phrase vector. When the model is trained on a dataset of documents, it tunes the word vectors and document vectors according to stochastic gradient descent optimization. In our experiment, the model is trained on a corpus of
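The difference between the ItemKNN and UserKNN baselines comes down to which axis of the user-article matrix supplies the vectors. The sketch below illustrates this on a toy matrix; the helper names, the tie-breaking rule, and the data are hypothetical, and the binary variant simply replaces time values with 0/1 indicators.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def item_vectors(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Transpose a users-by-articles matrix so that each row is an article."""
    return [list(column) for column in zip(*matrix)]


def binarize(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Binary feature: keep only whether an article was read, not for how long."""
    return [[1.0 if value > 0 else 0.0 for value in row] for row in matrix]


def most_similar(index: int, vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k vectors most similar to vectors[index] (ties by index)."""
    scored = [(cosine(vectors[index], v), j)
              for j, v in enumerate(vectors) if j != index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]


# Rows are users, columns are articles, values are normalized time spent.
time_matrix = [[5.0, 4.0, 0.0],
               [4.0, 5.0, 0.0],
               [0.0, 0.0, 3.0]]
item_neighbors = most_similar(0, item_vectors(time_matrix), k=1)  # ItemKNN view
user_neighbors = most_similar(0, time_matrix, k=1)                # UserKNN view
binary_neighbors = most_similar(0, binarize(time_matrix), k=1)
print(item_neighbors, user_neighbors, binary_neighbors)
```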
LDA-based recommendation system is explained in Section 4.2. The topic models were trained with
The optimum number of topics is expected to result in a fine-grained decomposition of the corpus into topics [11], where topic distributions over words are of minimum similarity. Furthermore, the optimum number of topics leads to a low cross-entropy between the term distribution learned by the topic model and the distribution of terms in an unseen test article. Thus, the optimum number of topics results in a lower perplexity score indicating that the model is better in predicting distribution of the test article [3].
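Perplexity on held-out articles is commonly computed as the exponential of the negative average per-token log-likelihood. The sketch below follows that standard definition and assumes that topic proportions for each test article are already available (how they are inferred is outside this sketch); the function name and toy inputs are hypothetical.

```python
import math
from typing import Sequence


def perplexity(test_docs: Sequence[Sequence[int]],
               thetas: Sequence[Sequence[float]],
               phi: Sequence[Sequence[float]]) -> float:
    """exp(-(1/N) * sum of log p(w)), with p(w | d) = sum_t theta[d][t] * phi[t][w].

    test_docs holds word indices; thetas[d] is the topic mixture of document d;
    phi[t] is the word distribution of topic t.
    """
    log_likelihood = 0.0
    num_tokens = 0
    for doc, theta in zip(test_docs, thetas):
        for word in doc:
            prob = sum(theta[t] * phi[t][word] for t in range(len(theta)))
            log_likelihood += math.log(prob)
            num_tokens += 1
    return math.exp(-log_likelihood / num_tokens)


# Two topics that are both uniform over a 4-word vocabulary.
uniform_phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
value = perplexity([[0, 1, 2, 3]], [[0.5, 0.5]], uniform_phi)
print(value)
```

A model that is no better than a uniform guess over V words has perplexity V, so lower values indicate that the model concentrates probability on the words that actually occur.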
In our experiments, we learn topics for different values of
As mentioned earlier, a topic model generates
In this section, we evaluate the performance of our proposed content-based news recommender system using the following metrics: precision, recall, and F-measure.
Precision, recall, and F-measure are well-known evaluation metrics in information retrieval literature [19]. For each user, we use the original set of read articles as the ground truth
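$$\text{precision} = \frac{|T_g \cap T_r|}{|T_r|}, \tag{8}$$
$$\text{recall} = \frac{|T_g \cap T_r|}{|T_g|}, \tag{9}$$
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{10}$$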
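Equations 8 to 10 can be sketched with set operations, treating the ground truth and the recommended articles as sets of article identifiers. The function name, the zero-division conventions, and the toy identifiers are assumptions made for this illustration.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(ground_truth: Iterable[Hashable],
                        recommended: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one user's recommendation list."""
    truth, recs = set(ground_truth), set(recommended)
    hits = len(truth & recs)
    precision = hits / len(recs) if recs else 0.0   # assumed 0.0 when nothing is recommended
    recall = hits / len(truth) if truth else 0.0    # assumed 0.0 when ground truth is empty
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: four articles actually read, three recommended, two in common.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "x"})
print(p, r, f1)
```

Averaging these per-user values over all users gives a single figure for each recommendation list size.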
In our experiments, the number of recommended articles ranges from
Empirical comparisons show that using topic models to represent articles improves the precision, recall, and F-measure. Since the only difference between the comparisons is the article similarity function
The bag-of-words with tfidf approach represents two articles by tfidf vectors. Then, the cosine similarity between these vectors is computed and used in the recommendation system. Generally speaking, the tfidf article similarity measures the amount of term overlap between the two articles, where each term has a different weight [25]. This approach ignores the thematic structures of articles when measuring similarity.
The LDA-based approaches first generate a set of topic vectors for the articles, each of which is represented by a distribution over terms. Terms in each topic are semantically coherent. Then, LDA-based recommender systems measure the cosine similarity between the topic vectors. Generally speaking, using LDA-based topic vectors quantifies the topic similarity between the two articles. These vectors yield a higher precision, recall, and F-measure than when using tfidf or document embedding vectors. Key to this improvement is incorporating thematic structure of news articles into the recommendation system. This leads to better estimates for topic similarity between two articles.
Hence we recommend using topic models to represent articles for time aware content-based news recommender systems.
This paper presents a time aware topic based recommender system for The Globe and Mail, a company that offers the most authoritative news in Canada, featuring national and international news. One of the important problems of The Globe and Mail newswire is the growing number of articles, which in turn demands a system to automatically filter and deliver content according to readers' preferences. Furthermore, in the collaborative-filtering-based recommender system at The Globe and Mail, the introduction of new news articles can cause the cold start problem, as there is insufficient data on these new entries for collaborative filtering to work accurately.
We propose to utilize the latent Dirichlet allocation (LDA) model to discover hidden themes of the news articles. We incorporate these themes into a content-based recommender system. Our experimental studies show that the proposed recommendation system yields better results than a bag-of-words with tfidf representation alone. Moreover, because our recommender system considers the content of news articles to make recommendations, introducing a new news article does not cause the cold start problem.
Applying topic models in a content-based recommender system yields more accurate results than other recommender systems. However, our content-based recommender system must effectively evolve with its content. In our current system, the topic model needs to be generated offline. For instance, once non-read news articles enter the collection of read articles, the topic model needs to be updated to reflect the themes of new articles. This offline generation of a topic model is a drawback, as it hinders the system's ability to evolve quickly. We could develop a real-time content-based recommender system that leverages a stream of news articles and is capable of handling online LDA [14] a
We would like to thank the data science group at The Globe and Mail, in particular, Michael O'Neill, Gordon Edall, and Shengqing Wu, for providing us with the data set used in this research, and insight and expertise that greatly assisted this research.
[1] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering, 17(2005), 734-749.
[2] D. M. Andrzejewski, Incorporating Domain Knowledge in Latent Topic Models, PhD thesis, University of Wisconsin-Madison, USA, 2010.
[3] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent dirichlet allocation, The Journal of Machine Learning Research, 3(2003), 993-1022.
[4] J. Bobadilla, F. Ortega, A. Hernando and J. Bernal, A collaborative filtering approach to mitigate the new user cold start problem, Knowledge-Based Systems, 26(2012), 225-238.
[5] H. Borges and A. Lorena, A survey on recommender systems for news data, in Smart Information and Knowledge Management (eds. E. Szczerbicki and N. Nguyen), vol. 260 of Studies in Computational Intelligence, Springer Berlin Heidelberg, 2010, 129-151.
[6] R. Burke, Hybrid recommender systems: Survey and experiments, User Modeling and User-Adapted Interaction, 12(2002), 331-370.
[7] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences, 1(2007), 300-307.
[8] T.-M. Chang and W.-F. Hsiao, Lda-based personalized document recommendation, Proceedings of the PACIS, 2013.
[9] M. D. Ekstrand, J. T. Riedl and J. A. Konstan, Collaborative filtering recommender systems, Journal of Foundations and Trends in Human-Computer Interaction, 4(2011), 81-173.
[10] T. Griffiths, Gibbs sampling in the generative model of latent dirichlet allocation, Stanford University, 518(2002), 1-3.
[11] T. L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences of the United States of America, 101(2004), 5228-5235.
[12] X. He, T. Chen, M.-Y. Kan and X. Chen, Trirank: Review-aware explainable recommendation by modeling aspects, in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM'15, ACM, New York, NY, USA, 2015, 1661-1670.
[13] G. Heinrich, Parameter estimation for text analysis, http://www.arbylon.net/publications/text-est.pdf.
[14] M. D. Hoffman, D. M. Blei and F. R. Bach, Online learning for latent dirichlet allocation, in NIPS (eds. J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta), Curran Associates, Inc., 2010, 856-864.
[15] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st edition, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2000.
[16] Q. V. Le and T. Mikolov, Distributed representations of sentences and documents, CoRR, abs/1405.4053, URL http://arxiv.org/abs/1405.4053.
[17] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Advances in Neural Information Processing Systems 13 (eds. T. K. Leen, T. G. Dietterich and V. Tresp), MIT Press, 2001, 556-562, URL http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf.
[18] G. Linden, B. Smith and J. York, Amazon.com recommendations: Item-to-item collaborative filtering, IEEE Internet Computing, 7(2003), 76-80.
[19] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, 1999.
[20] A. K. McCallum, Mallet: A machine learning for language toolkit, 2002, http://mallet.cs.umass.edu.
[21] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, CoRR, abs/1310.4546, URL http://arxiv.org/abs/1310.4546.
[22] B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan and J. Riedl, Movielens unplugged: Experiences with an occasionally connected recommender system, in Proceedings of the 8th International Conference on Intelligent User Interfaces, IUI'03, ACM, New York, NY, USA, 2003, 263-266.
[23] D. Z. Mária Bieliková Michal Kompan, Effective hierarchical vector-based news representation for personalized recommendation, Computer Science and Information Systems, 303-322, URL http://eudml.org/doc/252774.
[24] F. Ricci, L. Rokach and B. Shapira, Recommender Systems Handbook, chapter Introduction to Recommender Systems Handbook, Springer US, Boston, MA, 2011.
[25] S. Tuarob, L. C. Pouchard and C. L. Giles, Automatic tag recommendation for metadata annotation using probabilistic topic modeling, in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'13, ACM, New York, NY, USA, 2013, 239-248.