Abstract of publication no. 12623
Opinion mining is a sub-discipline of Information Retrieval (IR) and Computational Linguistics. It refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in online sources such as news articles, social media comments, and other user-generated content. It is also known by many other names, including opinion finding, opinion detection, sentiment analysis, sentiment classification, and polarity detection. Defined more specifically and simply, opinion mining is the task of retrieving opinions on an issue that the user expresses in the form of a query. The field poses many problems and challenges; in this thesis, we focus on some of the major ones.

One of the foremost challenges of opinion mining is to find opinions that are specifically relevant to the given topic (query). A document can cover many topics at once, and it may contain opinionated text about each of the topics discussed or about only a few of them. It is therefore important to select topic-relevant document segments together with their corresponding opinions. We approach this problem at two levels of granularity: sentences and passages. In our first approach, at the sentence level, we use the semantic relations of WordNet to find this opinion-topic association. In our second approach, at the passage level, we use a more robust IR model (i.e., a language model) to address this problem. The basic idea behind both contributions is that a document containing more opinionated topic-relevant textual segments (i.e., sentences or passages) is more opinionated than a document containing fewer such segments.

Most machine-learning-based approaches to opinion mining are domain-dependent (i.e., their performance varies from domain to domain).
In contrast, a domain- or topic-independent approach is more general and can sustain its effectiveness across different domains. However, topic-independent approaches generally suffer from poor performance. Developing an approach that is both effective and general is a major challenge in the field of opinion mining. Our contributions in this thesis include the development of such an approach, which combines simple heuristics-based topic-independent and topic-dependent features to find opinionated documents.

Entity-based opinion mining aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of textual documents. However, identifying entities and determining their relevance is itself a major challenge. In this thesis, we address this challenge by proposing an approach that uses information from both the current news article and past relevant articles to detect the most important entities in the current news. We examine different features at both the local (document) and global (data collection) levels to analyse their importance in assessing the relevance of an entity. Experiments with a machine learning algorithm show the effectiveness of our approach, with significant improvements over the baseline.

In addition, we present the idea of a framework for opinion-mining-related tasks. This framework exploits content and social evidence from the blogosphere for the tasks of opinion finding, opinion prediction, and multidimensional ranking. This preliminary contribution lays the foundations for our future work.

The evaluation of our approaches uses the TREC Blog 2006 data collection and the TREC Novelty track 2004 data collection. Most of the evaluations were performed within the framework of the TREC Blog track.
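
As an illustration of the opinion-topic association idea stated in this abstract (a document with more opinionated topic-relevant segments is treated as more opinionated), the following is a minimal sketch at the sentence level. It is not the method of the thesis: topic relevance is approximated here by plain term overlap with the query rather than by WordNet semantic relations or a language model, and the lexicon `OPINION_WORDS`, the helper names, and the example documents are all hypothetical.

```python
import re

# Hypothetical toy opinion lexicon (an assumption, not taken from the thesis).
OPINION_WORDS = {"good", "bad", "great", "terrible", "love", "hate"}


def tokenize(text):
    """Lowercase a string and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(document):
    """Naive sentence splitter on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]


def opinion_score(document, query):
    """Count sentences that are both topic-relevant and opinionated.

    Topic relevance is plain term overlap with the query, a simple
    stand-in (assumption) for a real opinion-topic association model.
    """
    query_terms = set(tokenize(query))
    score = 0
    for sentence in split_sentences(document):
        tokens = set(tokenize(sentence))
        is_relevant = bool(tokens & query_terms)
        is_opinionated = bool(tokens & OPINION_WORDS)
        if is_relevant and is_opinionated:
            score += 1
    return score


doc_a = "The new phone is great. I love the phone camera. The weather was bad."
doc_b = "The phone was released in May. The weather was terrible."
print(opinion_score(doc_a, "phone"), opinion_score(doc_b, "phone"))
```

Under these assumptions, `doc_a` scores higher than `doc_b` for the query "phone": the opinionated sentence about the weather is ignored in both documents because it is not topic-relevant.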