Breaking the barriers to Internet access / Faire tomber les obstacles entravant l’accès à Internet
Multi-document Summarization by Information Distance (2009)
Long, C; Huang, M L; Zhu, X Y; Li, M
Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC 2007 dataset and the TAC 2008 dataset have shown that our method closely correlates with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.

Answering Opinion Questions with Random Walks on Graphs (2009-08)
Li, Fangtao; Tang, Yang; Huang, Minlie; Zhu, Xiaoyan
Opinion Question Answering (Opinion QA), which aims to find the authors’ sentimental opinions on a specific target, is more challenging than traditional fact-based question answering problems. To extract the opinion-oriented answers, we need to consider both topic relevance and opinion sentiment issues. Current solutions to this problem are mostly ad-hoc combinations of question topic information and opinion information. In this paper, we propose an Opinion PageRank model and an Opinion HITS model to fully explore the information from different relations among questions and answers, answers and answers, and topics and opinions. Experimental results show that, by fully exploiting these relations, our proposed algorithms outperform several state-of-the-art baselines on a benchmark data set.
A gain of over 10% in F scores is achieved as compared to many other systems.

Specialized Review Selection for Feature Rating Estimation (IEEE, 2009-09)
Long, Chong; Zhang, Lei; Huang, Minlie; Zhu, Xiaoyan; Li, Ming
On participatory Websites, users provide opinions about products, with both overall ratings and textual reviews. In this paper, we propose an approach to accurately estimate feature ratings of the products. This approach selects user reviews that extensively discuss specific features of the products (called specialized reviews), using information distance of reviews on the features. Experiments on real data show that overall ratings of the specialized reviews can be used to represent their feature ratings. The average of these overall ratings can be used by recommender systems to provide feature-specific recommendations that better help users make purchasing decisions.

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance (2009-11)
Long, Chong; Huang, Minlie; Zhu, Xiaoyan
This paper presents our extractive summarization system at the update summarization track of TAC 2009. The system is based on our newly developed document summarization framework under the theory of conditional information distance among many objects. The best summary is defined in this paper to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the TAC dataset have shown that our method performs well in many categories.

THU QUANTA at TAC 2009 KBP and RTE Track (2009-11)
Li, Fan; Zheng, Zhicheng; Bu, Fan; Tang, Yang; Zhu, Xiaoyan
This paper describes the systems of THU QUANTA in the Text Analysis Conference (TAC) 2009. We participated in the Knowledge Base Population (KBP) track and the Recognizing Textual Entailment (RTE) track.
For the KBP track, we investigate two ranking strategies for the Entity Linking task. We employ a Listwise “Learning to Rank” model and an Augmenting Naïve Bayes model to rank the candidates. We use learned patterns to solve the Slot Filling task. For the RTE track, we propose a method called SEGraph (Semantic Elements based Graph). This method divides the Hypothesis and Text into two types of semantic elements: Entity Semantic Elements and Relation Semantic Elements. The SEGraph is then constructed, with Entity Elements as nodes and Relation Elements as edges, for both Text and Hypothesis. Finally, we recognize the textual entailment based on the SEGraph of the Text and the SEGraph of the Hypothesis. The evaluation results show that our two proposed frameworks are very effective for the KBP and RTE tasks, respectively.

New Approach for Multi-Document Update Summarization (2010)
Long, Chong; Huang, Min-Lie; Zhu, Xiao-Yan; Li, Ming
Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) have shown that our method closely correlates with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.

Summarizing Similar Questions for Chinese Community Question Answering Portals (2010)
Tang, Yang; Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
As online community question answering (cQA) portals like Yahoo!
Answers and Baidu Zhidao have attracted hundreds of millions of questions, how to utilize these questions and their corresponding answers becomes increasingly important for cQA websites. Prior approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarities to the query. Due to the high variance of question quality and answer quality, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system which directly provides an accurate and comprehensive answer summary instead of a list of questions similar to the user’s query. To fully explore the information of relations between queries and questions, between questions and answers, and between answers and sentences, we propose a new probabilistic scoring model to distinguish high-quality answers from low-quality answers. By fully exploiting these relations, we summarize answers using a maximum coverage model. Experimental results on data extracted from Chinese cQA websites demonstrate the efficacy of our proposed method.

Question Answering System Based on Community QA (2010-05)
Zheng, Zhicheng; Tang, Yang; Long, Chong; Bu, Fan; Zhu, Xiaoyan
After a long period of research in factoid QA, such questions have already been solved quite well. However, real users are often concerned with more complicated questions such as “Why XXXX?” or “How XXXX?”. Answers to these questions are difficult to retrieve directly from the Internet, but community question answering services provide good resources for solving them. As cQA portals like Yahoo! Answers and Baidu Zhidao have attracted hundreds of millions of questions, these questions can be treated as users’ query logs, and can help QA systems understand users’ questions better.
Common approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarity to the query. Due to the high variance in the quality of questions and answers, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system which directly provides an accurate and comprehensive answer summary besides a list of questions similar to the user’s query. To fully explore the information in the questions and answers posted on cQA portals, we adopt different strategies according to different situations. In this way, the system can output high-quality answers to users’ questions in practice.

Learning to Link Entities with Knowledge Base (2010-06)
Zheng, Zhicheng; Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
This paper addresses the problem of entity linking. Specifically, given an entity mentioned in unstructured texts, the task is to link this entity with an entry stored in the existing knowledge base. This is an important task for information extraction. It can serve as a convenient gateway to encyclopedic information, and can greatly improve the web users’ experience. Previous learning-based solutions mainly focus on a classification framework. However, it is more suitable to consider it as a ranking problem. In this paper, we propose a learning to rank algorithm for entity linking. It effectively utilizes the relationship information among the candidates when ranking. The experimental results on the TAC 2009 dataset demonstrate the effectiveness of our proposed framework. The proposed method achieves an 18.5% improvement in terms of accuracy over the classification models for those entities which have corresponding entries in the Knowledge Base.
The overall performance of the system is also better than that of the state-of-the-art methods.

Sentiment Analysis with Global Topics and Local Dependency (Association for the Advancement of Artificial Intelligence (AAAI), Palo Alto, California, 2010-07)
Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
With the development of Web 2.0, sentiment analysis has now become a popular research problem to tackle. Recently, topic models have been introduced for the simultaneous analysis of topics and sentiment in a document. These studies, which jointly model topic and sentiment, take advantage of the relationship between topics and sentiment, and are shown to be superior to traditional sentiment analysis tools. However, most of them make the assumption that, given the parameters, the sentiments of the words in the document are all independent. In our observation, in contrast, sentiments are expressed in a coherent way. The local conjunctive words, such as “and” or “but”, are often indicative of sentiment transitions. In this paper, we propose a major departure from the previous approaches by making two linked contributions. First, we assume that the sentiments are related to the topic in the document, and put forward a joint sentiment and topic model, i.e. Sentiment-LDA. Second, we observe that sentiments are dependent on local context. Thus, we further extend the Sentiment-LDA model to the Dependency-Sentiment-LDA model by relaxing the sentiment independence assumption in Sentiment-LDA. The sentiments of words are viewed as a Markov chain in Dependency-Sentiment-LDA.
Through experiments, we show that exploiting the sentiment dependency is clearly advantageous, and that the Dependency-Sentiment-LDA is an effective approach for sentiment analysis.

Recognizing Biomedical Named Entities using Skip-chain Conditional Random Fields (Association for Computational Linguistics, Stroudsburg, PA, 2010-07)
Liu, Jingchen; Huang, Minlie; Zhu, Xiaoyan
Linear-chain Conditional Random Fields (CRFs) have been applied to perform the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, the linear-chain CRF cannot capture long distance dependency, which is very common in the biomedical literature. In this paper, we propose a novel study of capturing such long distance dependency by defining two principles of constructing skip-edges for a skip-chain CRF: linking similar words and linking words having typed dependencies. The approach is applied to recognize gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and GENIA corpus, the approach contributes significant improvements over the linear-chain CRF. We also present in-depth error analysis on inconsistent labeling and study the influence of the quality of skip edges on the labeling performance.

Structure-Aware Review Mining and Summarization (2010-08)
Li, Fangtao; Han, Chao; Huang, Minlie; Zhu, Xiaoyan; Xia, Ying-Ju
In this paper, we focus on object feature based review summarization. Unlike most previous work based on linguistic rules or statistical methods, we formulate the review mining task as a joint structure tagging problem. We propose a new machine learning framework based on Conditional Random Fields (CRFs). It can employ rich features to jointly extract positive opinions, negative opinions and object features for review sentences. The linguistic structure can be naturally integrated into model representation.
Besides linear-chain structure, we also investigate conjunction structure and syntactic tree structure in this framework. Through extensive experiments on movie review and product review data sets, we show that structure-aware models outperform many state-of-the-art approaches to review mining.

Measuring the Non-compositionality of Multiword Expressions (2010-08)
Bu, Fan; Zhu, Xiaoyan

Comparative Study on Ranking and Selection Strategies for Multi-Document Summarization (2010-08)
Jin, Feng; Huang, Minlie; Zhu, Xiaoyan
This paper presents a comparative study on two key problems existing in extractive summarization: the ranking problem and the selection problem. To this end, we present a systematic study comparing different learning-to-rank algorithms and different selection strategies. This is the first work to provide a systematic analysis of these problems. Experimental results on two benchmark datasets demonstrate three findings: (1) pairwise and listwise learning-to-rank algorithms outperform the baselines significantly; (2) there is no significant difference among the learning-to-rank algorithms; and (3) the integer linear programming selection strategy generally outperforms the Maximum Marginal Relevance and Diversity Penalty strategies.

Function-based question classification for general QA (Association for Computational Linguistics, Stroudsburg, PA, 2010-10)
Bu, Fan; Zhu, Xingwei; Hao, Yu; Zhu, Xiaoyan
In contrast with the booming increase of Internet data, state-of-the-art QA (question answering) systems have handled data from specific domains or resources, such as search engine snippets, online forums and Wikipedia, in a somewhat isolated way. Users may welcome a more general QA system, integrated from existing specialized sub-QA engines, for its capability to answer questions from various sources. In this framework, question classification is the primary task.
However, the current paradigms of question classification focus on some specified types of questions, i.e. factoid questions, which are inappropriate for general QA. In this paper, we propose a new question classification paradigm, which includes a question taxonomy suitable for general QA and a question classifier based on MLN (Markov logic network), where rule-based methods and statistical methods are unified into a single framework in a fuzzy discriminative learning approach. Experiments show that our method outperforms traditional

K2Q: generating natural language questions from keywords with user refinements (Asian Federation of Natural Language Processing (AFNLP), 2011)
Zhicheng Zheng; Xiance Si; Chang, Edward Y.; Xiaoyan Zhu
Garbage in and garbage out. A Q&A system must receive a well-formulated question that matches the user’s intent, or the user has no chance of receiving satisfactory answers. In this paper, we propose a keywords to questions (K2Q) system to assist a user to articulate and refine questions. K2Q generates candidate questions and refinement words from a set of input keywords. After specifying some initial keywords, a user receives a list of candidate questions as well as a list of refinement words. The user can then select a satisfactory question, or select a refinement word to generate a new list of candidate questions and refinement words. We propose a User Inquiry Intent (UII) model to describe the joint generation process of keywords and questions for ranking questions, suggesting refinement words, and generating questions that may not have previously appeared. An empirical study shows UII to be useful and effective for the K2Q task.

Quality-biased ranking of short texts in microblogging services (Asian Federation of Natural Language Processing (AFNLP), 2011)
Minlie Huang; Yi Yang; Xiaoyan Zhu
The abundance of user-generated content comes at a price: the quality of content may range from very high to very low.
We propose a regression approach that incorporates various features to recommend short-text documents from Twitter, with a bias toward the quality perspective. The approach is built on top of a linear regression model which includes a regularization factor inspired by the content conformity hypothesis: documents similar in content may have similar quality. We test the system on the Edinburgh Twitter corpus. Experimental results show that the regularization factor inspired by the hypothesis can improve the ranking performance and that using unlabeled data can improve it further. Comparative results show that our method outperforms several baseline systems. We also conduct a systematic feature analysis and find that content quality features are dominant in short-text ranking.

New multiword expression metric and its applications (Springer Science+Business Media, 2011)
Fan Bu; Xiao-Yan Zhu; Ming Li
Multiword Expressions (MWEs) appear frequently and ungrammatically in natural languages. Identifying MWEs in free texts is a very challenging problem. This paper proposes a knowledge-free, unsupervised, and language-independent Multiword Expression Distance (MED). The new metric is derived from an accepted physical principle, measures the distance from an n-gram to its semantics, and outperforms other state-of-the-art methods on MWEs in two applications: question answering and named entity extraction.

Semantic relationship discovery with wikipedia structure (International Joint Conference on Artificial Intelligence (IJCAI), 2011)
Fan Bu; Yu Hao; Xiaoyan Zhu
Discovering semantic relationships between concepts is easily handled by humans but remains an obstacle for computers. Prior research on semantic computation using the Wikipedia structure only computes the tightness of the relationship between two concepts, but not which kind of relationship it is.
However, concepts can be related in two different ways: linking to the same categories or linking to each other through anchor texts. The algorithm RCRank (joint ranking of related concepts and categories) is proposed to jointly compute concept-concept relatedness and concept-category relatedness. The method can return a list of categories which best interpret the relationships between concepts.

Learning to Identify Review Spam (AAAI Press / International Joint Conferences on Artificial Intelligence, Menlo Park, California, 2011-07)
Li, Fangtao; Huang, Minlie; Yang, Yi; Zhu, Xiaoyan; Walsh, Toby
In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are real and trustworthy. However, they may encounter the fake opinion or opinion spam problem. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write fake reviews, called review spam, to promote their products or defame their competitors’ products. It is important to identify and filter out the review spam. Previous work focuses only on heuristic rules, such as helpfulness voting or rating deviation, which limits performance on this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features in spam identification. We also observe that a review spammer consistently writes spam. This provides us another view for identifying review spam: we can identify whether the author of the review is a spammer. Based on this observation, we provide a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experimental results show that our proposed method is effective. Our designed machine learning methods achieve significant improvements in comparison to the heuristic baselines.
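
The two-view co-training idea named in the last abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify features or classifiers, so the code below assumes one hypothetical numeric feature per view (a review-content score and a reviewer-behaviour score) and swaps in a simple nearest-centroid rule as the per-view classifier. The names `train_centroids`, `predict`, and `co_train` are illustrative only.

```python
from statistics import mean

def train_centroids(values, labels):
    """Nearest-centroid 'classifier' for one 1-D view: (mean of class 0, mean of class 1)."""
    c0 = mean(v for v, y in zip(values, labels) if y == 0)
    c1 = mean(v for v, y in zip(values, labels) if y == 1)
    return c0, c1

def predict(centroids, value):
    """Return (label, confidence); confidence is the gap between the two centroid distances."""
    d0 = abs(value - centroids[0])
    d1 = abs(value - centroids[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """Toy co-training loop.

    labeled:   list of ((view_a, view_b), label), with both classes present
    unlabeled: list of (view_a, view_b)
    Each round, the classifier of each view labels the pool examples it is
    most confident about and moves them into the shared labeled set.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            cents = train_centroids([x[view] for x, _ in labeled],
                                    [y for _, y in labeled])
            ranked = sorted(pool, key=lambda x: predict(cents, x[view])[1],
                            reverse=True)
            for x in ranked[:per_round]:
                labeled.append((x, predict(cents, x[view])[0]))
                pool.remove(x)
    # Final per-view models trained on the enlarged labeled set.
    models = tuple(
        train_centroids([x[v] for x, _ in labeled], [y for _, y in labeled])
        for v in (0, 1)
    )
    return models, labeled

if __name__ == "__main__":
    # Hypothetical data: label 1 = spam, label 0 = not spam.
    seed = [((0.1, 0.2), 0), ((0.9, 0.8), 1)]
    pool = [(0.2, 0.1), (0.8, 0.9), (0.15, 0.3), (0.85, 0.7)]
    models, grown = co_train(seed, pool)
    for example, label in grown:
        print(example, label)
```

In this sketch both views share a single labeled set, so an example labeled confidently by one view becomes training data for the other; a real system would use richer feature vectors and a stronger classifier for each view.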