Breaking the barriers to Internet access / Faire tomber les obstacles entravant l’accès à Internet

Permanent URI for this collection

https://hdl.handle.net/10625/49272

Answering Opinion Questions with Random Walks on Graphs
(2009-08) Li, Fangtao; Tang, Yang; Huang, Minlie; Zhu, Xiaoyan
Opinion Question Answering (Opinion QA), which aims to find the authors’ sentimental opinions on a specific target, is more challenging than traditional fact-based question answering problems. To extract the opinion-oriented answers, we need to consider both topic relevance and opinion sentiment issues. Current solutions to this problem are mostly ad-hoc combinations of question topic information and opinion information. In this paper, we propose an Opinion PageRank model and an Opinion HITS model to fully explore the information from different relations among questions and answers, answers and answers, and topics and opinions. Experimental results show that, by fully exploiting these relations, our proposed algorithms outperform several state-of-the-art baselines on a benchmark data set. A gain of over 10% in F scores is achieved as compared to many other systems.
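The abstract above does not spell out the Opinion PageRank model, so the following is only a generic sketch of the underlying idea: a random walk over a graph of candidate answers whose restart distribution is biased by a per-node score. The function name, the graph format, and the use of a single combined prior score are all assumptions made for illustration, not the authors' formulation.

```python
def personalized_pagerank(edges, prior, damping=0.85, iterations=50):
    """Power iteration for PageRank with a non-uniform restart distribution.

    edges: dict mapping every node to the list of nodes it links to
           (every target must also appear as a key).
    prior: dict mapping every node to a non-negative restart weight,
           e.g. a hypothetical topic-relevance or opinion score.
    """
    nodes = list(edges)
    total = sum(prior[n] for n in nodes)
    restart = {n: prior[n] / total for n in nodes}
    rank = dict(restart)  # start the walk from the restart distribution
    for _ in range(iterations):
        # Every node first receives its share of the restart mass.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            targets = edges[src]
            if targets:
                # Spread the damped rank evenly over outgoing links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling node: hand its mass back via the restart distribution.
                for n in nodes:
                    new_rank[n] += damping * rank[src] * restart[n]
        rank = new_rank
    return rank
```

Calling `personalized_pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}, {"a": 1.0, "b": 1.0, "c": 2.0})` returns a score per node that sums to 1; nodes with incoming links accumulate rank beyond their restart weight.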
Breaking the barrier of internet information acquisition : question answering systems for smartphone; final technical report
(Tsinghua University, People's Republic of China, 2014-08) Xiaoyan Zhu; Ming Li; Yu Hao
This project aims to overcome language and technology barriers to acquiring information through the Internet. We intend to invent new techniques to simplify internet search processes by developing a natural language search engine, technology to facilitate cross language search (Chinese and English), and technology that would enable elementary mobile devices to search the internet. Is it possible to enable 580 million Chinese cell phone users to search the Internet without actually being on the Internet? For instance a cell phone user sends a query via a short message to our search engine, and the answer is sent back as a short message.
Breaking the internet barrier
(2017-03) Li, Ming; Zhu, Xiaoyan
The paper outlines project outputs that are real-world applications: 1) A medical information management system which integrates clinical, healthcare, medical insurance, and hospital information in rural areas. The system has been distributed to clinical and health departments of Hainan Province, and is serving more than 6 million users. 2) A clinical information system that assists physicians to make better informed decisions at Point of Care, and can deliver relevant and evidence-based medical information to their patients. 3) A robot that can serve as a teaching assistant in schools.
Comparative Study on Ranking and Selection Strategies for Multi-Document Summarization
(2010-08) Jin, Feng; Huang, Minlie; Zhu, Xiaoyan
This paper presents a comparative study on two key problems existing in extractive summarization: the ranking problem and the selection problem. To this end, we present a systematic study comparing different learning-to-rank algorithms and different selection strategies. This is the first work to provide a systematic analysis of these problems. Experimental results on two benchmark datasets demonstrate three findings: (1) pairwise and listwise learning-to-rank algorithms outperform the baselines significantly; (2) there is no significant difference among the learning-to-rank algorithms; and (3) the integer linear programming selection strategy generally outperforms the Maximum Marginal Relevance and Diversity Penalty strategies.
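Of the selection strategies named above, Maximum Marginal Relevance is the simplest to illustrate. The sketch below is a textbook greedy version, assuming precomputed relevance scores and a pairwise similarity matrix; the function name, the `trade_off` parameter, and the input format are illustrative assumptions and are not taken from the paper.

```python
def mmr_select(relevance, similarity, k, trade_off=0.7):
    """Greedy Maximum Marginal Relevance selection.

    relevance:  list of relevance scores, one per sentence.
    similarity: square matrix (list of lists) of pairwise similarities.
    k:          number of sentences to select.
    trade_off:  weight on relevance; (1 - trade_off) penalizes redundancy.
    Returns the indices of the selected sentences in selection order.
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate high-scoring sentences and one less relevant but distinct sentence, a balanced `trade_off` picks one of the duplicates and then the distinct sentence, while `trade_off=1.0` reduces to plain ranking by relevance.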
Cross-domain co-extraction of sentiment and topic lexicons
(2012) Fangtao Li; Sinno Jialin Pan; Ou Jin; Qiang Yang; Xiaoyan Zhu
The goal is automatic extraction of relevant terms from a domain of interest. In the past few years, opinion mining and sentiment analysis have attracted much attention in natural language processing and information retrieval. The proposed method can utilize useful labeled data from the source domain as well as exploit the relationships between the topic and sentiment words to propagate information for lexicon construction in the target domain. Domain adaptation aims at transferring knowledge across domains where data distributions may be different. This model extracts both topic and sentiment words and also allows non-adjective sentiment words, achieving much better results on cross-domain lexicon extraction.
Estimating Feature Ratings through an Effective Review Selection Approach
(2012) Long, C; Zhang, J; Huang, M; Li, M; Ma, B
Most participatory websites collect overall ratings (e.g. five stars) of products from their customers, reflecting the overall assessment of the products. However, it is more useful to present ratings of product features (such as price, battery, screen and lens of digital cameras) to help customers make effective purchase decisions. Unfortunately, only a very few websites have collected feature ratings. In this paper, we propose a novel approach to accurately estimate feature ratings of products. This approach selects user reviews that extensively discuss specific features of the products (called specialized reviews), using information distance of reviews on the features. Experiments on both annotated and real data show that overall ratings of the specialized reviews can be used to represent their feature ratings. The average of these overall ratings can be used by recommender systems to provide feature-specific recommendations that can better help users make purchasing decisions.
Fine granular aspect analysis using latent structural models
(Association for Computational Linguistics, 2012) Lei Fang; Minlie Huang
In this paper, we present a structural learning model for joint sentiment classification and aspect analysis of text at various levels of granularity. Online reviews have become a major resource where users find opinions or comments on products or services they want to consume. Aspect level sentiment analysis may be useful for a more global picture of opinions on the product’s properties. The resulting model is able to predict the sentiment polarity of a document as well as to identify aspect-specific sentences. A machine-learning algorithm generalizes the Support Vector Machine (SVM) classifier.
Function-based question classification for general QA
(Association for Computational Linguistics, Stroudsburg, PA, 2010-10) Bu, Fan; Zhu, Xingwei; Hao, Yu; Zhu, Xiaoyan
In contrast with the booming increase of internet data, state-of-the-art QA (question answering) systems have dealt with data from specific domains or resources, such as search engine snippets, online forums and Wikipedia, in a somewhat isolated way. Users may welcome a more general QA system for its capability to answer questions from various sources, integrated from existing specialized sub-QA engines. In this framework, question classification is the primary task. However, the current paradigms of question classification have focused on some specified types of questions, i.e. factoid questions, which are inappropriate for general QA. In this paper, we propose a new question classification paradigm, which includes a question taxonomy suitable for general QA and a question classifier based on MLN (Markov logic network), where rule-based methods and statistical methods are unified into a single framework in a fuzzy discriminative learning approach. Experiments show that our method outperforms traditional
Incorporating Reviewer and Product Information for Review Rating Prediction
(AAAI / International Joint Conferences on Artificial Intelligence Press, Menlo Park, California, 2011-07) Li, Fangtao; Liu, Nathan; Jin, Hongwei; Zhao, Kai; Yang, Qiang; Walsh, Toby
Traditional sentiment analysis mainly considers binary classifications of reviews, but in many real-world sentiment classification problems, nonbinary review ratings are more useful. This is especially true when consumers wish to compare two products, both of which are not negative. Previous work has addressed this problem by extracting various features from the review text for learning a predictor. Since the same word may have different sentiment effects when used by different reviewers on different products, we argue that it is necessary to model such reviewer- and product-dependent effects in order to predict review ratings more accurately. In this paper, we propose a novel learning framework to incorporate reviewer and product information into the text-based learner for rating prediction. The reviewer, product and text features are modeled as a three-dimensional tensor. Tensor factorization techniques can then be employed to reduce the data sparsity problems. We perform extensive experiments to demonstrate the effectiveness of our model, which shows a significant improvement over state-of-the-art methods, especially for reviews with unpopular products and inactive reviewers.
K2Q : generating natural language questions from keywords with user refinements
(Asian Federation of Natural Language Processing (AFNLP), 2011) Zhicheng Zheng; Xiance Si; Chang, Edward Y.; Xiaoyan Zhu
Garbage in and garbage out. A Q&A system must receive a well formulated question that matches the user’s intent or she has no chance to receive satisfactory answers. In this paper, we propose a keywords to questions (K2Q) system to assist a user to articulate and refine questions. K2Q generates candidate questions and refinement words from a set of input keywords. After specifying some initial keywords, a user receives a list of candidate questions as well as a list of refinement words. The user can then select a satisfactory question, or select a refinement word to generate a new list of candidate questions and refinement words. We propose a User Inquiry Intent (UII) model to describe the joint generation process of keywords and questions for ranking questions, suggesting refinement words, and generating questions that may not have previously appeared. Empirical study shows UII to be useful and effective for the K2Q task.
Learning to Identify Review Spam
(AAAI Press / International Joint Conferences on Artificial Intelligence, Menlo Park, California, 2011-07) Li, Fangtao; Huang, Minlie; Yang, Yi; Zhu, Xiaoyan; Walsh, Toby
In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are real and trustworthy. However, they may encounter the faked opinion or opinion spam problem. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write faked reviews, called review spam, to promote their products or defame their competitors’ products. It is important to identify and filter out the review spam. Previous work focuses only on heuristic rules, such as helpfulness voting or rating deviation, which limits the performance of this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features in spam identification. We also observe that a review spammer consistently writes spam. This provides us with another view for identifying review spam: we can identify whether the author of the review is a spammer. Based on this observation, we provide a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experiment results show that our proposed method is effective. Our designed machine learning methods achieve significant improvements in comparison to the heuristic baselines.
Learning to Link Entities with Knowledge Base
(2010-06) Zheng, Zhicheng; Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
This paper addresses the problem of entity linking. Specifically, given an entity mentioned in unstructured texts, the task is to link this entity with an entry stored in the existing knowledge base. This is an important task for information extraction. It can serve as a convenient gateway to encyclopedic information, and can greatly improve the web users’ experience. Previous learning-based solutions mainly focus on a classification framework. However, it is more suitable to consider it as a ranking problem. In this paper, we propose a learning-to-rank algorithm for entity linking. It effectively utilizes the relationship information among the candidates when ranking. The experiment results on the TAC 2009 dataset demonstrate the effectiveness of our proposed framework. The proposed method achieves an 18.5% improvement in terms of accuracy over the classification models for those entities which have corresponding entries in the Knowledge Base. The overall performance of the system is also better than that of the state-of-the-art methods.
Measuring the Non-compositionality of Multiword Expressions
(2010-08) Bu, Fan; Zhu, Xiaoyan
Multi-document Summarization by Information Distance
(2009) Long, C; Huang, M L; Zhu, X Y; Li, M
Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC 2007 dataset and the TAC 2008 dataset have proved that our method closely correlates with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.
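Information distance is defined in terms of Kolmogorov complexity, which is not computable, and the abstract does not say how the authors approximate it. One widely used stand-in is the normalized compression distance, which replaces Kolmogorov complexity with the output size of a real compressor. The sketch below uses that substitute with `zlib`; the helper names and the idea of choosing among a fixed list of candidate summaries are assumptions for illustration only.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance, a computable proxy for
    information distance (an assumption, not the paper's own estimator)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def closest_summary(candidates, documents):
    """Return the candidate summary with the smallest distance
    to the concatenated document set."""
    corpus = "\n".join(documents)
    return min(candidates, key=lambda summary: ncd(summary, corpus))
```

Compression-based estimates are noisy on very short strings, so a sketch like this is only meaningful when the candidate summaries and documents are reasonably long.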
New Approach for Multi-Document Update Summarization
(2010) Long, Chong; Huang, Min-Lie; Zhu, Xiao-Yan; Li, Ming
Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) have proved that our method closely correlates with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.
New multiword expression metric and its applications
(Springer Science+Business Media, 2011) Fan Bu; Xiao-Yan Zhu; Ming Li
Multiword Expressions (MWEs) appear frequently and ungrammatically in natural languages. Identifying MWEs in free texts is a very challenging problem. This paper proposes a knowledge-free, unsupervised, and language-independent Multiword Expression Distance (MED). The new metric is derived from an accepted physical principle, measures the distance from an n-gram to its semantics, and outperforms other state-of-the-art methods on MWEs in two applications: question answering and named entity extraction.
Quality-biased ranking of short texts in microblogging services
(Asian Federation of Natural Language Processing (AFNLP), 2011) Minlie Huang; Yi Yang; Xiaoyan Zhu
The abundance of user-generated content comes at a price: the quality of content may range from very high to very low. We propose a regression approach that incorporates various features to recommend short-text documents from Twitter, with a bias toward quality perspective. The approach is built on top of a linear regression model which includes a regularization factor inspired from the content conformity hypothesis - documents similar in content may have similar quality. We test the system on the Edinburgh Twitter corpus. Experimental results show that the regularization factor inspired from the hypothesis can improve the ranking performance and that using unlabeled data can make ranking performance better. Comparative results show that our method outperforms several baseline systems. We also make systematic feature analysis and find that content quality features are dominant in short-text ranking.
Question Answering System Based on Community QA
(2010-05) Zheng, Zhicheng; Tang, Yang; Long, Chong; Bu, Fan; Zhu, Xiaoyan
After a long period of research in factoid QA, this kind of question has already been solved quite well. However, real users are often concerned with more complicated questions such as “Why XXXX?” or “How XXXX?”. Answers to these questions are difficult to retrieve directly from the internet, but community question answering (cQA) services provide good resources for answering them. As cQA portals like Yahoo! Answers and Baidu Zhidao have attracted hundreds of millions of questions, these questions can be treated as users’ query logs, and can help QA systems understand users’ questions better. Common approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarity to the query. Due to the high variance in the quality of questions and answers, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system which directly provides an accurate and comprehensive answer summary in addition to a list of questions similar to the user’s query. To fully explore the information in the questions and answers posted in the cQA, we adopt different strategies according to different situations. In this way, the system can output great answers to users’ questions in practice.
Recognizing Biomedical Named Entities using Skip-chain Conditional Random Fields
(Association for Computational Linguistics, Stroudsburg, PA, 2010-07) Liu, Jingchen; Huang, Minlie; Zhu, Xiaoyan
Linear-chain Conditional Random Fields (CRFs) have been applied to perform the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, the linear-chain CRF cannot capture long distance dependency, which is very common in the biomedical literature. In this paper, we propose a novel study of capturing such long distance dependency by defining two principles of constructing skip-edges for a skip-chain CRF: linking similar words and linking words having typed dependencies. The approach is applied to recognize gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and GENIA corpus, the approach contributes significant improvements over the linear-chain CRF. We also present in-depth error analysis on inconsistent labeling and study the influence of the quality of skip edges on the labeling performance.
Semantic relationship discovery with wikipedia structure
(International Joint Conference on Artificial Intelligence (IJCAI), 2011) Fan Bu; Yu Hao; Xiaoyan Zhu
Discovering semantic relationships between concepts is easily handled by humans but remains an obstacle for computers. Prior research on semantic computation using the Wikipedia structure only computes the tightness of the relationship between two concepts, but not which kind of relationship it is. However, concepts can be related in two different ways: linking to the same categories or linking to each other by anchor texts. The algorithm RCRank (joint ranking of related concepts and categories) is proposed to jointly compute concept-concept relatedness and concept-category relatedness. The method can return a list of categories which best interpret the relationships between concepts.
