Breaking the barriers to Internet access / Faire tomber les obstacles entravant l’accès à Internet

Recent Submissions

  • Item
    Breaking the internet barrier
    (2017-03) Li, Ming; Zhu, Xiaoyan
    The paper outlines project outputs that are real-world applications: 1) a medical information management system that integrates clinical, healthcare, medical insurance, and hospital information in rural areas; the system has been distributed to clinical and health departments of Hainan Province and now serves more than 6 million users; 2) a clinical information system that helps physicians make better-informed decisions at the point of care and can deliver relevant, evidence-based medical information to their patients; and 3) a robot that can serve as a teaching assistant in schools.
  • Item
    Breaking the barrier of internet information acquisition : question answering systems for smartphone; final technical report
    (Tsinghua University, People's Republic of China, 2014-08) Xiaoyan Zhu; Ming Li; Yu Hao
    This project aims to overcome the language and technology barriers to acquiring information through the Internet. We intend to develop new techniques that simplify Internet search: a natural language search engine, technology to facilitate cross-language search (Chinese and English), and technology that enables basic mobile devices to search the Internet. Is it possible to let 580 million Chinese cell phone users search the Internet without actually being on the Internet? For instance, a cell phone user sends a query to our search engine as a short message, and the answer is sent back as a short message.
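A minimal sketch of the SMS round trip described in the abstract: a question arrives as a short message, a QA backend produces a short answer, and the reply is trimmed to a single SMS. The `answer_question` function is a placeholder assumption standing in for the project's natural-language search engine, which is not described here.

```python
# Sketch of the SMS question-answering round trip (assumed workflow, not the project's code).

GSM_SMS_LIMIT = 160  # characters in a single standard SMS

def answer_question(question: str) -> str:
    """Placeholder QA backend: returns a short textual answer."""
    canned = {
        "who wrote the art of computer programming": "Donald Knuth",
    }
    return canned.get(question.strip().lower().rstrip("?"), "No answer found.")

def handle_incoming_sms(sender: str, text: str) -> str:
    """Build the reply SMS for an incoming question, truncated to one message."""
    answer = answer_question(text)
    if len(answer) > GSM_SMS_LIMIT:
        answer = answer[:GSM_SMS_LIMIT - 1] + "…"
    return answer

if __name__ == "__main__":
    print(handle_incoming_sms("+8613800000000", "Who wrote The Art of Computer Programming?"))
```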
  • Item
    Cross-domain co-extraction of sentiment and topic lexicons
    (2012) Fangtao Li; Sinno Jialin Pan; Ou Jin; Qiang Yang; Xiaoyan Zhu
    In the past few years, opinion mining and sentiment analysis have attracted much attention in natural language processing and information retrieval. The goal here is the automatic extraction of relevant topic and sentiment terms from a domain of interest. Domain adaptation aims at transferring knowledge across domains whose data distributions may differ. The proposed method can utilize labeled data from the source domain and exploit the relationships between topic and sentiment words to propagate information for lexicon construction in the target domain. The model extracts both topic and sentiment words, allows non-adjective sentiment words, and achieves much better results on cross-domain lexicon extraction.
  • Item
    String re-writing kernel
    (2012) Fan Bu; Hang Li; Xiaoyan Zhu
    Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as the string re-writing kernel, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings, each pair representing a re-writing of a string. It can capture the lexical and structural similarity between two pairs of sentences without the need to construct syntactic trees. We further propose an instance of the string re-writing kernel which can be computed efficiently. Experimental results on benchmark datasets show that our method achieves better results than state-of-the-art methods on two sentence re-writing learning tasks: paraphrase identification and recognizing textual entailment.
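To make "similarity between two pairs of strings" concrete, here is a deliberately simplified kernel over re-writing pairs (s1 → t1) and (s2 → t2): the product of bag-of-bigram dot products on the source and target sides. This is only an illustration (a product of two valid kernels is itself a valid kernel); the paper's string re-writing kernel is defined over shared re-writing rules and is not reproduced here.

```python
# Toy kernel over sentence re-writing pairs; an illustrative assumption, not the paper's kernel.
from collections import Counter

def bigrams(sentence: str) -> Counter:
    toks = sentence.lower().split()
    return Counter(zip(toks, toks[1:]))

def dot(a: Counter, b: Counter) -> int:
    return sum(a[k] * b[k] for k in a if k in b)

def rewriting_kernel(pair1, pair2) -> int:
    (s1, t1), (s2, t2) = pair1, pair2
    # Product of source-side and target-side similarities.
    return dot(bigrams(s1), bigrams(s2)) * dot(bigrams(t1), bigrams(t2))

print(rewriting_kernel(
    ("the cat ate the fish", "the fish was eaten by the cat"),
    ("the dog ate the bone", "the bone was eaten by the dog"),
))
```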
  • Item
    Fine granular aspect analysis using latent structural models
    (Association for Computational Linguistics, 2012) Lei Fang; Minlie Huang
    Online reviews have become a major resource where users find opinions or comments on products or services they want to consume, and aspect-level sentiment analysis can give a more complete picture of opinions on a product's properties. In this paper, we present a structural learning model for joint sentiment classification and aspect analysis of text at various levels of granularity, trained with a latent structural learning algorithm that generalizes the Support Vector Machine (SVM) classifier. The resulting model is able to predict the sentiment polarity of a document as well as to identify aspect-specific sentences.
  • Item
    Using first-order logic to compress sentences
    (Association for the Advancement of Artificial Intelligence (AAAI), 2012) Minlie Huang; Xing Shi; Feng Jin; Xiaoyan Zhu
    Sentence compression is one of the most challenging tasks in natural language processing and is of increasing interest to applications such as abstractive summarization and text simplification for mobile devices. In this paper, we present a novel sentence compression model based on first-order logic, implemented with a Markov Logic Network. Sentence compression is formulated as a word/phrase deletion problem in this model. By taking advantage of first-order logic, the proposed method is able to incorporate local linguistic features and to capture global dependencies between word deletion operations. Experiments on both written and spoken corpora show that our approach produces competitive performance against state-of-the-art methods in terms of manual evaluation measures such as importance, grammaticality, and overall quality.
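As a flavor of the "global dependencies between word deletion operations" that first-order rules can express, here is one illustrative hard constraint, enforced procedurally: if a head word is deleted, its dependents must be deleted too. The rule and the tiny dependency structure below are assumptions for illustration, not the paper's actual rule set or inference procedure.

```python
# Illustrative deletion-consistency constraint of the kind an MLN can encode (assumed example).

def propagate_deletions(deleted: set, head_of: dict) -> set:
    """Close the deletion set under 'delete(head) => delete(dependent)'."""
    changed = True
    while changed:
        changed = False
        for dep, head in head_of.items():
            if head in deleted and dep not in deleted:
                deleted.add(dep)
                changed = True
    return deleted

words = ["He", "bought", "a", "very", "old", "red", "car"]
head_of = {0: 1, 2: 6, 3: 4, 4: 6, 5: 6, 6: 1}  # toy dependency heads (child -> head)
kept = [w for i, w in enumerate(words) if i not in propagate_deletions({3, 4}, dict(head_of))]
print(" ".join(kept))  # "He bought a red car"
```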
  • Item
    K2Q : generating natural language questions from keywords with user refinements
    (Asian Federation of Natural Language Processing (AFNLP), 2011) Zhicheng Zheng; Xiance Si; Chang, Edward Y.; Xiaoyan Zhu
    Garbage in, garbage out: a Q&A system must receive a well-formulated question that matches the user's intent, or the user has little chance of receiving satisfactory answers. In this paper, we propose a keywords-to-questions (K2Q) system to assist users in articulating and refining questions. K2Q generates candidate questions and refinement words from a set of input keywords. After specifying some initial keywords, a user receives a list of candidate questions as well as a list of refinement words. The user can then select a satisfactory question, or select a refinement word to generate a new list of candidate questions and refinement words. We propose a User Inquiry Intent (UII) model to describe the joint generation process of keywords and questions, which is used for ranking questions, suggesting refinement words, and generating questions that may not have previously appeared. An empirical study shows UII to be useful and effective for the K2Q task.
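A toy keywords-to-questions sketch: fill simple templates with the input keywords and suggest refinement words from co-occurrence in a small question log. The templates and the log are illustrative assumptions; the paper's UII model instead learns a joint generative process over keywords and questions.

```python
# Toy K2Q-style candidate generation and refinement suggestion (assumed templates and log).
from collections import Counter

TEMPLATES = [
    "What is {kw}?",
    "How do I {kw}?",
    "Where can I find {kw}?",
]

QUESTION_LOG = [
    "how do i transfer photos from iphone to laptop",
    "how do i back up photos on iphone",
    "what is the best iphone camera app",
]

def generate_candidates(keywords):
    kw = " ".join(keywords)
    return [t.format(kw=kw) for t in TEMPLATES]

def suggest_refinements(keywords, top_k=3):
    counts = Counter()
    for q in QUESTION_LOG:
        toks = q.split()
        if all(k in toks for k in keywords):
            counts.update(t for t in toks if t not in keywords)
    return [w for w, _ in counts.most_common(top_k)]

print(generate_candidates(["iphone", "photos"]))
print(suggest_refinements(["iphone", "photos"]))
```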
  • Item
    Quality-biased ranking of short texts in microblogging services
    (Asian Federation of Natural Language Processing (AFNLP), 2011) Minlie Huang; Yi Yang; Xiaoyan Zhu
    The abundance of user-generated content comes at a price: its quality may range from very high to very low. We propose a regression approach that incorporates various features to recommend short-text documents from Twitter, with a bias toward high-quality content. The approach is built on a linear regression model that includes a regularization factor inspired by the content conformity hypothesis: documents similar in content may have similar quality. We test the system on the Edinburgh Twitter corpus. Experimental results show that the regularization factor improves ranking performance and that using unlabeled data improves it further. Comparative results show that our method outperforms several baseline systems. We also conduct a systematic feature analysis and find that content quality features are dominant in short-text ranking.
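A minimal sketch of a quality-regression objective with a content-conformity regularizer, i.e. similar documents are encouraged to receive similar predicted quality: minimize ||Xw - y||² + α||w||² + β Σᵢⱼ Sᵢⱼ (xᵢ·w - xⱼ·w)². The graph term equals wᵀXᵀLXw with L the Laplacian of the similarity matrix S, which gives a closed-form solution. The exact formulation in the paper may differ; this is an assumption-labeled sketch.

```python
# Regularized quality regression with a content-conformity (graph Laplacian) term; illustrative only.
import numpy as np

def fit_quality_regression(X, y, S, alpha=0.1, beta=0.1):
    n, d = X.shape
    L = np.diag(S.sum(axis=1)) - S                       # graph Laplacian of similarities
    A = X.T @ X + alpha * np.eye(d) + beta * X.T @ L @ X  # normal equations with both regularizers
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # 6 tweets, 4 quality features each (toy data)
y = rng.normal(size=6)                         # labeled quality scores (toy data)
S = np.abs(rng.normal(size=(6, 6))); S = (S + S.T) / 2; np.fill_diagonal(S, 0)
w = fit_quality_regression(X, y, S)
print("predicted quality:", X @ w)
```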
  • Item
    New multiword expression metric and its applications
    (Springer Science+Business Media, 2011) Fan Bu; Xiao-Yan Zhu; Ming Li
    Multiword Expressions (MWEs) appear frequently and ungrammatically in natural languages, and identifying MWEs in free text is a very challenging problem. This paper proposes a knowledge-free, unsupervised, and language-independent Multiword Expression Distance (MED). The new metric is derived from an accepted physical principle, measures the distance from an n-gram to its semantics, and outperforms other state-of-the-art methods in two applications: question answering and named entity extraction.
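For reference, the information-distance machinery that such a metric builds on (Li and Vitányi) is shown below; K(·) denotes Kolmogorov complexity. The exact form of MED in the paper, which measures the distance from an n-gram to its semantics, is not reproduced here.

```latex
% Information distance and its normalized form (standard definitions, not the paper's MED).
\[
  D(x, y) \;=\; \max\{\,K(x \mid y),\; K(y \mid x)\,\},
  \qquad
  d(x, y) \;=\; \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}} .
\]
```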
  • Item
    Summarizing Similar Questions for Chinese Community Question Answering Portals
    (2010) Tang, Yang; Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
    As online community question answering (cQA) portals like Yahoo! Answers and Baidu Zhidao have attracted hundreds of millions of questions, how to utilize these questions and their associated answers has become increasingly important for cQA websites. Prior approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarity to the query. Because question and answer quality vary widely, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system which directly provides an accurate and comprehensive answer summary, instead of a list of similar questions, in response to the user's query. To fully explore the relations between queries and questions, between questions and answers, and between answers and sentences, we propose a new probabilistic scoring model to distinguish high-quality answers from low-quality answers. By fully exploiting these relations, we summarize answers using a maximum coverage model. Experimental results on data extracted from Chinese cQA websites demonstrate the efficacy of the proposed method.
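A toy greedy approximation of a maximum-coverage summarizer: repeatedly pick the answer sentence that covers the most not-yet-covered content words, within a length budget. The definition of an "information unit" as a content word, the stopword list, and the budget are illustrative assumptions; the real system first scores answers with the probabilistic quality model.

```python
# Greedy maximum-coverage sentence selection (illustrative sketch, not the paper's system).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def units(sentence: str) -> set:
    return {t for t in sentence.lower().split() if t not in STOPWORDS}

def greedy_summary(sentences, budget=2):
    covered, summary, candidates = set(), [], list(sentences)
    while candidates and len(summary) < budget:
        best = max(candidates, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break                                  # nothing new is covered; stop early
        summary.append(best)
        covered |= units(best)
        candidates.remove(best)
    return summary

answers = [
    "Reset the router and wait thirty seconds",
    "Resetting the router usually fixes it",
    "Check that the network cable is plugged in",
]
print(greedy_summary(answers))
```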
  • Item
    Multi-document Summarization by Information Distance
    (2009) Long, C; Huang, M L; Zhu, X Y; Li, M
    Fast-changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach to multi-document update summarization. The best summary is defined to be the one with the minimum information distance to the entire document set; the best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC 2007 and TAC 2008 datasets show that our method correlates closely with the human summaries and outperforms other systems such as LexRank in many categories under the ROUGE evaluation criterion.
  • Item
    New Approach for Multi-Document Update Summarization
    (2010) Long, Chong; Huang, Min-Lie; Zhu, Xiao-Yan; Li, Ming
    Fast-changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach to multi-document update summarization. The best summary is defined to be the one with the minimum information distance to the entire document set; the best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) show that our method correlates closely with the human summaries and outperforms other systems such as LexRank in many categories under the ROUGE evaluation criterion.
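The two selection criteria stated in this abstract (and the related 2009 item above) can be written compactly as below; D(·,·) is the information distance, ℓ is a summary length budget, and the symbols are notational choices for this sketch rather than the paper's exact formulation.

```latex
% Summary and update-summary objectives, as stated informally in the abstract.
\[
  S^{*} \;=\; \arg\min_{|S| \le \ell} \; D\bigl(S,\; \{d_1, \dots, d_n\}\bigr),
  \qquad
  S^{*}_{\mathrm{update}} \;=\; \arg\min_{|S| \le \ell} \; D\bigl(S,\; C_{\mathrm{new}} \mid C_{\mathrm{read}}\bigr).
\]
```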
  • Item
    Specialized Review Selection for Feature Rating Estimation
    (IEEE, 2009-09) Long, Chong; Zhang, Lei; Huang, Minlie; Zhu, Xiaoyan; Li, Ming
    On participatory websites, users provide opinions about products in the form of both overall ratings and textual reviews. In this paper, we propose an approach to accurately estimate feature ratings of products. The approach selects user reviews that extensively discuss specific features of the products (called specialized reviews), using the information distance of reviews on those features. Experiments on real data show that the overall ratings of specialized reviews can be used to represent their feature ratings, and the average of these overall ratings can be used by recommender systems to provide feature-specific recommendations that better help users make purchasing decisions.
  • Item
    Answering Opinion Questions with Random Walks on Graphs
    (2009-08) Li, Fangtao; Tang, Yang; Huang, Minlie; Zhu, Xiaoyan
    Opinion Question Answering (Opinion QA), which aims to find authors' opinions about a specific target, is more challenging than traditional fact-based question answering. To extract opinion-oriented answers, we need to consider both topic relevance and sentiment. Current solutions to this problem are mostly ad hoc combinations of question topic information and opinion information. In this paper, we propose an Opinion PageRank model and an Opinion HITS model to fully explore the relations among questions and answers, among answers themselves, and between topics and opinions. By fully exploiting these relations, our proposed algorithms outperform several state-of-the-art baselines on a benchmark dataset, with a gain of over 10% in F score compared to many other systems.
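A schematic personalized-PageRank walk over an answer-answer similarity graph, with the teleport distribution mixing topic relevance and opinion scores, conveys the flavor of combining the two signals in one random-walk model. The transition matrix, damping factor, and prior weighting are illustrative assumptions; the paper defines its own Opinion PageRank and Opinion HITS models.

```python
# Personalized PageRank over answers with a relevance-times-opinion prior (illustrative sketch).
import numpy as np

def opinion_pagerank(sim, relevance, opinion, damping=0.85, iters=100):
    P = sim / sim.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    prior = relevance * opinion
    prior = prior / prior.sum()                # teleport distribution
    r = np.full(len(prior), 1.0 / len(prior))
    for _ in range(iters):
        r = damping * (P.T @ r) + (1 - damping) * prior
    return r

sim = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.2],
                [0.1, 0.2, 0.0]])              # toy answer-answer similarities
relevance = np.array([0.9, 0.7, 0.3])          # topic relevance of each answer to the question
opinion = np.array([0.8, 0.6, 0.9])            # opinion strength of each answer
print(opinion_pagerank(sim, relevance, opinion))
```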
  • Item
    THU QUANTA at TAC 2009 KBP and RTE Track
    (2009-11) Li, Fan; Zheng, Zhicheng; Bu, Fan; Tang, Yang; Zhu, Xiaoyan
    This paper describes the systems of THU QUANTA at the Text Analysis Conference (TAC) 2009. We participated in the Knowledge Base Population (KBP) track and the Recognizing Textual Entailment (RTE) track. For the KBP track, we investigate two ranking strategies for the Entity Linking task, employing a listwise "Learning to Rank" model and an Augmenting Naïve Bayes model to rank the candidates, and we use learned patterns for the Slot Filling task. For the RTE track, we propose SEGraph (Semantic Elements based Graph). This method divides the Hypothesis and Text into two types of semantic elements: entity semantic elements and relation semantic elements. An SEGraph is then constructed for both Text and Hypothesis, with entity elements as nodes and relation elements as edges. Finally, we recognize textual entailment based on the SEGraph of the Text and the SEGraph of the Hypothesis. The evaluation results show that our two proposed frameworks are very effective for the KBP and RTE tasks, respectively.
  • Item
    Sentiment Analysis with Global Topics and Local Dependency
    (Association for the Advancement of Artificial Intelligence (AAAI), Palo Alto, California, 2010-07) LI, Fangtao; Huang, Minlie; Zhu, Xiaoyan
    With the development of Web 2.0, sentiment analysis has become a popular research problem. Recently, topic models have been introduced for the simultaneous analysis of topics and sentiment in a document. These studies, which jointly model topic and sentiment, take advantage of the relationship between topics and sentiment and have been shown to be superior to traditional sentiment analysis tools. However, most of them assume that, given the parameters, the sentiments of the words in a document are all independent. In our observation, in contrast, sentiments are expressed coherently, and local conjunctive words such as "and" or "but" are often indicative of sentiment transitions. In this paper, we depart from previous approaches by making two linked contributions. First, we assume that the sentiments are related to the topics in the document and put forward a joint sentiment and topic model, Sentiment-LDA. Second, observing that sentiments depend on local context, we extend Sentiment-LDA to the Dependency-Sentiment-LDA model by relaxing the sentiment independence assumption; the sentiments of words are viewed as a Markov chain in Dependency-Sentiment-LDA. Through experiments, we show that exploiting sentiment dependency is clearly advantageous and that Dependency-Sentiment-LDA is an effective approach to sentiment analysis.
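Schematically, the Markov dependency on word sentiments described above can be sketched as follows; the notation is illustrative (not the paper's), with topic zᵢ and sentiment sᵢ jointly generating word wᵢ, and sᵢ depending on the previous sentiment sᵢ₋₁, with local conjunctions such as "and" or "but" steering the transition toward continuation or reversal.

```latex
% Schematic generative step with Markov-dependent sentiments (assumed notation).
\[
  s_i \sim p(s_i \mid s_{i-1}), \qquad
  z_i \sim \operatorname{Mult}(\theta_d), \qquad
  w_i \sim \operatorname{Mult}(\varphi_{z_i, s_i}).
\]
```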
  • Item
    Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance
    (2009-11) Long, Chong; Huang, Minlie; Zhu, Xiaoyan
    This paper presents our extractive summarization systems for the update summarization track of TAC 2009. The system is based on our newly developed document summarization framework built on the theory of conditional information distance among many objects. The best summary is defined to be the one with the minimum information distance to the entire document set; the best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the TAC dataset show that our method performs well in many categories.
  • Item
    Question Answering System Based on Community QA
    (2010-05) Zheng, Zhicheng; Tang, Yang; Long, Chong; Bu, Fan; Zhu, Xiaoyan
    After a long period of research on factoid QA, such questions can already be answered quite well. However, real users often ask more complicated questions such as "Why XXXX?" or "How XXXX?". It is difficult to retrieve answers to these questions directly from the Internet, but community question answering (cQA) services provide good resources for solving them. As cQA portals like Yahoo! Answers and Baidu Zhidao have attracted hundreds of millions of questions, these questions can be treated as a users' query log and can help QA systems better understand users' questions. Common approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarity to the query. Because question and answer quality vary widely, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system which directly provides an accurate and comprehensive answer summary in addition to a list of questions similar to the user's query. To fully explore the information in the questions and answers posted in the cQA, we adopt different strategies for different situations. In this way, the system can produce good answers to users' questions in practice.
  • Item
    Learning to Link Entities with Knowledge Base
    (2010-06) Zheng, Zhicheng; Li, Fangtao; Huang, Minlie; Zhu, Xiaoyan
    This paper addresses the problem of entity linking: given an entity mentioned in unstructured text, the task is to link this entity to an entry stored in an existing knowledge base. This is an important task for information extraction; it can serve as a convenient gateway to encyclopedic information and can greatly improve web users' experience. Previous learning-based solutions mainly adopt a classification framework, but the problem is more naturally treated as one of ranking. In this paper, we propose a learning-to-rank algorithm for entity linking that effectively utilizes the relationship information among the candidates during ranking. Experimental results on the TAC 2009 dataset demonstrate the effectiveness of the proposed framework: it achieves an 18.5% improvement in accuracy over classification models for entities that have corresponding entries in the Knowledge Base, and the overall performance of the system is also better than that of state-of-the-art methods.
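A minimal pairwise learning-to-rank sketch for entity linking: learn a weight vector so that the correct knowledge-base candidate scores above the incorrect ones. The features, training pairs, and candidate names below are toy assumptions; the paper uses a listwise learning-to-rank model over richer candidate features.

```python
# Pairwise ranking with hinge-style perceptron updates (illustrative sketch, not the paper's model).
import numpy as np

def train_pairwise_ranker(pairs, dim, epochs=50, lr=0.1):
    """pairs: list of (features_of_correct, features_of_incorrect) arrays."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for pos, neg in pairs:
            if w @ pos <= w @ neg + 1.0:       # enforce a margin of 1 between correct and incorrect
                w += lr * (pos - neg)
    return w

# toy features: [name similarity, context overlap, candidate popularity]
pairs = [
    (np.array([0.9, 0.7, 0.2]), np.array([0.9, 0.1, 0.8])),
    (np.array([0.8, 0.6, 0.1]), np.array([0.4, 0.2, 0.9])),
]
w = train_pairwise_ranker(pairs, dim=3)
candidates = {"Michael_Jordan_(basketball)": np.array([0.9, 0.8, 0.6]),
              "Michael_I._Jordan_(scientist)": np.array([0.9, 0.2, 0.4])}
print(max(candidates, key=lambda c: w @ candidates[c]))
```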
  • Item
    Recognizing Biomedical Named Entities using Skip-chain Conditional Random Fields
    (Association for Computational Linguistics, Stroudsburg, PA, 2010-07) Liu, Jingchen; Huang, Minlie; Zhu, Xiaoyan
    Linear-chain Conditional Random Fields (CRFs) have been applied to the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, the linear-chain CRF cannot capture long-distance dependencies, which are very common in the biomedical literature. In this paper, we capture such long-distance dependencies by defining two principles for constructing skip edges for a skip-chain CRF: linking similar words and linking words that have typed dependencies. The approach is applied to recognizing gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and the GENIA corpus, the approach yields significant improvements over the linear-chain CRF. We also present an in-depth error analysis on inconsistent labeling and study the influence of the quality of skip edges on labeling performance.
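A small sketch of the first skip-edge principle described above: connect pairs of matching tokens so a skip-chain CRF can share evidence between distant mentions of the same entity. The similarity test here (exact lowercase match on tokens longer than two characters) is an illustrative assumption; the paper also links words via typed dependencies.

```python
# Skip-edge construction for a skip-chain CRF (illustrative similarity test, not the paper's).
def skip_edges(tokens):
    edges = []
    for i in range(len(tokens)):
        for j in range(i + 2, len(tokens)):                # skip over adjacent positions
            if tokens[i].lower() == tokens[j].lower() and len(tokens[i]) > 2:
                edges.append((i, j))
    return edges

sentence = "IL-2 gene expression requires IL-2 receptor signalling".split()
print(skip_edges(sentence))  # [(0, 4)] links the two IL-2 mentions
```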