Compendium of Text Mining applications for life sciences (Bio-NLP tools)


The following BioNLP resources have been retrieved:

ABGene [BIONLP_2] Tool URL
Description Gene and protein name tagger trained on Medline abstracts. ABGene processing begins by using automatically generated rules from the Brill tagger to extract single-word gene and protein names from biomedical abstracts. This is followed by extensive filtering for false positives and false negatives. A key step during the filtering stage is the extraction of multi-word gene and protein names that are prevalent in the literature but inaccessible to the Brill tagger. The system has also been tested on full-text articles.
Reference Tanabe L, Wilbur WJ. Tagging Gene and Protein Names in Full Text Articles Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, July 2002, pp. 9-13
Abstract Tagging Gene and Protein Names in Full Text Articles. Current information extraction efforts in the biomedical domain tend to focus on finding entities and facts in structured databases or MEDLINE abstracts. We apply a gene and protein name tagger trained on Medline abstracts (ABGene) to a randomly selected set of full text journal articles in the biomedical domain. We show the effect of adaptations made in response to the greater heterogeneity of full text. (No PubMed ref.)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Free text (local);
Query output Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text;
Keywords Information extraction; Entity Recognition; Abstracts;

ABNER [BIONLP_3] Tool URL
Description Machine-learning-based NER system for tagging biological entities (genes, proteins, cell lines, cell types, RNA, DNA) in text. It is based on conditional random fields (CRFs) and trained on the NLPBA and BioCreative corpora. It is implemented in Java and can be used locally.
Reference Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005 Jul 15;21(14):3191-2. Epub 2005 Apr 28
Abstract ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora. (PMID: 15860559)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Free text (local); Sentences
Query output Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text;
Keywords Information extraction; Entity Recognition

Acromine [BIONLP_5] Tool URL
Description System that assists in searching for acronyms and their corresponding long forms, and also supports the reverse search (from long form to acronym). The output is provided both as a general summary table containing the year of the first mention of the acronym and as a detailed listing with the actual mentions of the acronym in PubMed sentences. A detailed performance analysis of this tool is provided in the reference paper. The approach uses overlapping definitions of an acronym stated by a number of authors and is based on co-occurrence of long form and acronym pairs.
Reference Okazaki N, Ananiadou S. Building an abbreviation dictionary using a term recognition approach. Bioinformatics. 2006 Dec 15;22(24):3089-95. Epub 2006 Oct 18.
Abstract Building an abbreviation dictionary using a term recognition approach. MOTIVATION: Acronyms result from a highly productive type of term variation and trigger the need for an acronym dictionary to establish associations between acronyms and their expanded forms. RESULTS: We propose a novel method for recognizing acronym definitions in a text collection. Assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form, our method identifies acronym definitions in a similar manner to the statistical term recognition task. Applied to the whole MEDLINE (7 811 582 abstracts), the implemented system extracted 886 755 acronym candidates and recognized 300 954 expanded forms in reasonable time. Our method outperformed base-line systems, achieving 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE. AVAILABILITY AND SUPPLEMENTARY INFORMATION: The implementations and supplementary information are available at our web site: http://www.chokkan.org/research/acromine/ (PMID: 17050571)
Availability Online: Y, Download: Y, Web service: Y
User relevance Biologist: 2, Curator: 1, NLP: 3
Query input Free text (paste); Keyword; Abbreviations/Acronyms
Query output Acronyms/Abbreviations; Date
Keywords Information extraction; Acronym/abbreviation extraction
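
The co-occurrence idea described in this entry (a word sequence that frequently precedes a parenthetical expression is a potential expanded form) can be illustrated with a minimal Python sketch. This is not Acromine's implementation: the function names, the regular expression, and the initial-letter filter are assumptions made for illustration, and the real system scores candidates with a statistical term recognition method rather than this simple filter.

```python
import re
from collections import Counter

# Assumed pattern for a parenthetical short form such as "(TSH)".
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

def candidate_long_forms(sentences, max_words=5):
    """Count word sequences that directly precede a parenthetical short form."""
    counts = Counter()
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            short = match.group(1)
            words = re.findall(r"[A-Za-z0-9-]+", sentence[:match.start()])
            for n in range(1, min(max_words, len(words)) + 1):
                long_form = " ".join(words[-n:]).lower()
                counts[(short, long_form)] += 1
    return counts

def initials_match(short, long_form):
    """Simplified filter (an assumption): word initials must spell the short form."""
    initials = "".join(word[0] for word in long_form.split())
    return initials.lower() == short.lower()

def best_expansion(counts, short):
    """Return the most frequent candidate that passes the initials filter, or None."""
    matching = {
        long_form: n
        for (s, long_form), n in counts.items()
        if s == short and initials_match(short, long_form)
    }
    return max(matching, key=matching.get) if matching else None

# Toy sentences written for this example.
sentences = [
    "We measured thyroid stimulating hormone (TSH) levels.",
    "Serum thyroid stimulating hormone (TSH) was elevated.",
    "The polymerase chain reaction (PCR) was used.",
]
counts = candidate_long_forms(sentences)
print(best_expansion(counts, "TSH"))
```

Counting every suffix of the preceding words means nested candidates such as "hormone" are counted as often as the full long form; the initials filter is only a stand-in for the term-recognition scoring that resolves this in the published method.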

AcroTagger [BIONLP_101] Tool URL
Description Tool to tag biomedical abbreviations in text using XML.
Reference No reference, contact person: Sylvain Gaudan
Abstract Identifies and tags the abbreviations in text with XML tags. If the long form is given in the text or can be guessed from the document context, then the tag surrounding the abbreviation will contain the expansion's normalised form. The system is written in Java and uses SVMlight. (No PubMed ref.)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 2
Query input Free text (local); Sentences
Query output Articles; Sentences; Abstracts; Acronyms/Abbreviations; Acronyms/Abbreviations tagged text
Keywords Information extraction; Term extraction; Acronym/abbreviation extraction

ADAM [BIONLP_100052015] Tool URL
Description ADAM is a database and online system for abbreviations extracted from PubMed.
Reference Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006 Nov 15;22(22):2813-8. Epub 2006 Sep 18
Abstract ADAM: another database of abbreviations in MEDLINE. MOTIVATION: Abbreviations are an important type of terminology in the biomedical domain. Although several groups have already created databases of biomedical abbreviations, these are either not public, or are not comprehensive, or focus exclusively on acronym-type abbreviations. We have created another abbreviation database, ADAM, which covers commonly used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronym and non-acronym abbreviations. RESULTS: A model of recognizing abbreviations and their long-forms from titles and abstracts of MEDLINE (2006 baseline) was employed. After grouping morphological variants, 59 405 abbreviation/long-form pairs were identified. ADAM shows high precision (97.4%) and includes most of the frequently used abbreviations contained in the Unified Medical Language System (UMLS) Lexicon and the Stanford Abbreviation Database. Conversely, one-third of abbreviations in ADAM are novel insofar as they are not included in either database. About 19% of the novel abbreviations are non-acronym-type and these cover at least seven different types of short-form/long-form pairs. AVAILABILITY: A free, public query interface to ADAM is available at http://arrowsmith.psych.uic.edu, and the entire database can be downloaded as a text file. (PMID: 16982707)
Availability Online: Y, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined

AliasServer [BIONLP_102] Tool URL
Description Protein aliases handler
Reference Iragne F, Barré A, Goffard N, De Daruvar A. AliasServer: a web server to handle multiple aliases used to refer to proteins. Bioinformatics. 2004 Sep 22;20(14):2331-2. Epub 2004 Apr 1.
Abstract AliasServer: a web server to handle multiple aliases used to refer to proteins. SUMMARY: AliasServer provides services that facilitate the assembly of data or datasets that make use of different identifiers for referring to the same protein. This resource relies on a database which contains, for a given organism, a non-redundant list of protein sequences associated with a set of aliases. AVAILABILITY: AliasServer is available as an interactive Web server at http://cbi.labri.fr/outils/alias/ and as a web service using a SOAP interface. The complete tool, including sources and data, is available for local installations upon request. SUPPLEMENTARY INFORMATION: Technical documentation is available at http://cbi.labri.fr/outils/alias/asdoc.pdf (PMID: 15059813)
Availability Online: Y, Download: Y, Web service: Y
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Gene/protein names; Gene/protein identifiers; Gene/protein lists
Query output Gene/protein names; Gene/protein identifiers
Keywords Gene/protein normalization

Ali Baba [BIONLP_103] Tool URL
Description Tool to search PubMed abstracts for biological objects and their relations and visualize them as graphical networks. It provides information on protein interactions, disease relations and tissue specificity of genes. It allows searching for proteins by UniProt ID instead of typing a long list of synonyms, and it can include pathways from KEGG (new databases are to be featured in the near future). Ali Baba links all information to the underlying literature and databases, which provides detailed information on selected aspects.
Reference Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U. AliBaba: PubMed as a graph. Bioinformatics. 2006 Oct 1;22(19):2444-5. Epub 2006 Jul 26.
Abstract ALIBABA: PubMed as a graph. The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein-protein interactions, gene-disease associations and subcellular locations of proteins. When searching such information using conventional search engines, e.g. PubMed, users see the data only one-abstract at a time and 'hidden' in natural language text. AliBaba is an interactive tool for graphical summarization of search results. It parses the set of abstracts that fit a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. AliBaba extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp complex networks described in various articles at a glance. AVAILABILITY: http://alibaba.informatik.hu-berlin.de/ (PMID: 16870931)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 2, NLP: 2
Query input Keyword; Gene/protein names; Gene/protein identifiers
Query output Ranked list; Nr. documents; Sentences; Abstracts; PMIDs; Keyword; Bio-entity tagged text; Bio-entity association list; Semantically labelled text; Bio-entity network; Gene/Protein labelled text; Gene/Protein normalized text; Bio-entity co-occurrences; Protein Interactions;
Keywords Information extraction; Information retrieval; Text Mining; Gene/protein normalization; Relation extraction; Protein Interaction; Disease; Abstracts

ALICE [BIONLP_100040] Tool URL
Description ALICE: Abbreviation LIfter using Corpus-based Extraction. 1. Background: The rapid growth of the literature in the MEDLINE database is of great benefit to biomedical researchers. On the other hand, the unfettered introduction of new abbreviations in the literature, such as gene or protein names, hinders efficient use of the database, because abbreviations in the biomedical literature are highly ambiguous: one abbreviation may represent multiple expansions. A support system that helps researchers identify abbreviations in the literature is therefore strongly needed. 2. ALICE system: To extract abbreviations and their expansions from biomedical literature, we propose an algorithm called ALICE (Abbreviation LIfter using Corpus-based Extraction). ALICE is composed of three phases: the Inner Search (IS), the Outer Extraction (OE), and the Validity Judgment (VJ). The IS phase searches for a candidate abbreviation and determines whether the candidate is an abbreviation, the OE phase extracts its expansion, and the VJ phase judges the validity of the abbreviation-expansion pair. 3. Evaluation: Our algorithm overcomes various limitations of other algorithms by means of many carefully constructed patterns and rules and many stop-word lists, which are based on repeated examination of a vast amount of biomedical literature. ALICE tries to recognize all patterns of abbreviations and extract their expansions in the literature with high precision and recall. It achieved 95% precision and 96% recall on randomly selected literature from the MEDLINE database. This achievement helps to construct a useful abbreviation dictionary, which in turn leads to a new algorithm for retrieving literature from the MEDLINE database. 4. How to use: ALICE can accept: (a) a MEDLINE format file; (b) a space-separated list of PubMed IDs (e.g., PMID1 PMID2 PMID3 ...); (c) free text.
Reference Ao H, Takagi T. ALICE: an algorithm to extract abbreviations from MEDLINE. J Am Med Inform Assoc. 2005 Sep-Oct;12(5):576-86. Epub 2005 May 19.
Abstract ALICE: an algorithm to extract abbreviations from MEDLINE. OBJECTIVE: To help biomedical researchers recognize dynamically introduced abbreviations in biomedical literature, such as gene and protein names, we have constructed a support system called ALICE (Abbreviation LIfter using Corpus-based Extraction). ALICE aims to extract all types of abbreviations with their expansions from a target paper on the fly. METHODS: ALICE extracts an abbreviation and its expansion from the literature by using heuristic pattern-matching rules. This system consists of three phases and potentially identifies valid 320 abbreviation-expansion patterns as combinations of the rules. RESULTS: It achieved 95% recall and 97% precision on randomly selected titles and abstracts from the MEDLINE database. CONCLUSION: ALICE extracted abbreviations and their expansions from the literature efficiently. The subtly compiled heuristics enabled it to extract abbreviations with high recall without significantly reducing precision. ALICE does not only facilitate recognition of an undefined abbreviation in a paper by constructing an abbreviation database or dictionary, but also makes biomedical literature retrieval more accurate. This system is freely available at http://uvdb3.hgc.jp/ALICE/ALICE_index.html. (PMID: 15905486)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
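
The three-phase structure described for ALICE (Inner Search, Outer Extraction, Validity Judgment) can be sketched in Python as follows. This only illustrates how the phases hand work to each other: the stop-word list, the regular expression, and the letter-order check are hypothetical stand-ins, not ALICE's actual patterns and rules, which are not given in this entry.

```python
import re

# Hypothetical stop-word list; ALICE's real lists are not reproduced here.
STOP_WORDS = {"fig", "table", "see"}

def inner_search(text):
    """IS phase (simplified): find parenthesised candidates that look like abbreviations."""
    for match in re.finditer(r"\(([^()]{2,10})\)", text):
        candidate = match.group(1).strip()
        looks_short = " " not in candidate and any(c.isupper() for c in candidate)
        if looks_short and candidate.lower() not in STOP_WORDS:
            yield candidate, match.start()

def outer_extraction(text, abbreviation, paren_pos):
    """OE phase (simplified): propose expansions of increasing length before the parenthesis."""
    words = text[:paren_pos].split()
    max_words = min(len(words), len(abbreviation) + 2)
    for n in range(1, max_words + 1):
        yield " ".join(words[-n:])

def validity_judgment(abbreviation, expansion):
    """VJ phase (simplified): same first letter, and all letters occur in order."""
    lowered = expansion.lower()
    letters = [c for c in abbreviation.lower() if c.isalnum()]
    if not letters or not lowered.startswith(letters[0]):
        return False
    pos = 0
    for ch in letters:
        pos = lowered.find(ch, pos)
        if pos < 0:
            return False
        pos += 1
    return True

def extract_pairs(text):
    """Run the three phases and keep the shortest expansion judged valid."""
    pairs = []
    for abbreviation, paren_pos in inner_search(text):
        for expansion in outer_extraction(text, abbreviation, paren_pos):
            if validity_judgment(abbreviation, expansion):
                pairs.append((abbreviation, expansion))
                break
    return pairs

# Toy sentence written for this example.
text = ("We used the polymerase chain reaction (PCR) and "
        "mitogen-activated protein kinase (MAPK) assays (see Fig).")
print(extract_pairs(text))
```

Keeping the phases as separate functions mirrors the description above: each phase can be given richer rules without changing how the other two are called.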

ARGH [BIONLP_104] Tool URL
Description ARGH: Biomedical Acronym Resolving General Heuristics database. Given an acronym as input, the tool provides the corresponding list of expansions. One can also search with the expanded term and find the corresponding acronyms derived from PubMed. It also reports several scores: Frequency, Relative Frequency, First Observed Occurrence, and Context Example (a link to a PubMed citation where the acronym is mentioned).
Reference Undefined
Abstract Not available (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Keyword; Gene/protein names; Abbreviations/Acronyms
Query output Ranked list; Nr. documents; Confidence score; Keyword; Acronyms/Abbreviations; Date;
Keywords Acronym/abbreviation extraction; Abstracts

ARROWSMITH [BIONLP_105] Tool URL
Description Extended MEDLINE search tool. This tool allows you to look for terms that are common to two collections of articles, each resulting from a different keyword search. Arrowsmith can be used to find associations between two research fields or keywords. For each keyword it retrieves the associated literature and then the collection of phrases present in both literature sets. Semantic subsets of these phrases can be selected, and the terms are ranked by relevance. This system can be used to discover non-trivial relationships between the two keywords.
Reference Torvik VI, Smalheiser NR. A quantitative model for linking two disparate sets of articles in MEDLINE. Bioinformatics. 2007 Jul 1;23(13):1658-65. Epub 2007 Apr 26.
Abstract A quantitative model for linking two disparate sets of articles in MEDLINE. BACKGROUND: Identifying information that implicitly links two disparate sets of articles is a fundamental and intuitive data mining strategy that can help investigators address real scientific questions. The Arrowsmith two-node search finds title words and phrases (so-called B-terms) that are shared across two sets of articles within MEDLINE and displays them in a manner that facilitates human assessment. A serious stumbling-block has been the lack of a quantitative model for predicting which of the hundreds if not thousands of B-terms computed for a given search are most likely to be relevant to the investigator. METHODOLOGY/PRINCIPAL FINDINGS: Using a public two-node search interface, field testers devised a set of two-node searches under real life conditions and a certain number of B-terms were marked relevant. These were employed as 'gold standards;' each B-term was characterized according to eight complementary features that were strongly correlated with relevance. A logistic regression model was developed that permits one to estimate the probability of relevance for each B-term, to rank B-terms according to their likely relevance, and to estimate the overall number of relevant B-terms inherent in a given two-node search. CONCLUSIONS/SIGNIFICANCE: The model greatly simplifies and streamlines the process of carrying out a two-node search, and may be applicable to a number of other literature-based discovery applications, including the so-called one-node search and related gene-centric strategies that incorporate implicit links to predict how genes may be related to each other and to human diseases. This should encourage much wider exploration of text mining for implicit information among the general scientific community. AVAILABILITY: Two-node searches can be carried out freely at http://arrowsmith.psych.uic.edu. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. (PMID: 17463015)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Keyword; Gene/protein names
Query output Ranked list; Nr. documents; Confidence score; Articles; Abstracts; Keyword; Bio-entity co-occurrences; Ranked articles
Keywords Information retrieval; Text Mining; Relation extraction; Disease; Abstracts; Knowledge Discovery
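
The two-node search described in this entry looks for title words and phrases (B-terms) shared by two article sets. A minimal Python sketch of that intersection step is given below. The stop-word list, the restriction to single words and two-word phrases, and the toy titles are all assumptions for illustration; the relevance ranking with a logistic regression model described in the abstract is not reproduced.

```python
import re

# Hypothetical stop-word list for this illustration.
STOP = {"the", "of", "in", "and", "a", "an", "for", "with", "on", "to", "by"}

def title_terms(titles):
    """Collect single words and two-word phrases from titles, skipping stop words."""
    terms = set()
    for title in titles:
        words = re.findall(r"[a-z0-9-]+", title.lower())
        for word in words:
            if word not in STOP:
                terms.add(word)
        for first, second in zip(words, words[1:]):
            if first not in STOP and second not in STOP:
                terms.add(f"{first} {second}")
    return terms

def b_terms(titles_a, titles_c):
    """Return the terms that occur in the titles of both article sets."""
    return title_terms(titles_a) & title_terms(titles_c)

# Toy titles invented for this example (not real citations).
titles_a = [
    "Fish oil reduces blood viscosity",
    "Dietary fish oil and platelet aggregation",
]
titles_c = [
    "Blood viscosity in Raynaud disease",
    "Platelet aggregation in Raynaud phenomenon",
]
shared = b_terms(titles_a, titles_c)
print(sorted(shared))
```

In practice the shared set contains hundreds or thousands of terms, which is why the reference paper introduces a quantitative model to rank them; this sketch stops at the unranked set.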

arXiv.org [BIONLP_100020004] Tool URL
Description arXiv is an e-print service in the fields of physics, mathematics, non-linear science, computer science, quantitative biology and statistics. The contents of arXiv conform to Cornell University academic standards. arXiv is owned, operated and funded by Cornell University, a private not-for-profit educational institution. arXiv is also partially funded by the National Science Foundation. It provides open access to 463,358 e-prints.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

askMEDLINE [BIONLP_1000517] Tool URL
Description PubMed search tool that translates a question into an efficient search, especially useful for users without knowledge of specialized vocabularies that an expert searcher might use.
Reference Fontelo P, Liu F, Ackerman M. askMEDLINE: a free-text, natural language query tool for MEDLINE/PubMed. BMC Med Inform Decis Mak. 2005 Mar 10;5(1):5
Abstract askMEDLINE: a free-text, natural language query tool for MEDLINE/PubMed. BACKGROUND: Plain language search tools for MEDLINE/PubMed are few. We wanted to develop a search tool that would allow anyone using a free-text, natural language query and without knowing specialized vocabularies that an expert searcher might use, to find relevant citations in MEDLINE/PubMed. This tool would translate a question into an efficient search. RESULTS: The accuracy and relevance of retrieved citations were compared to references cited in BMJ POEMs and CATs (critically appraised topics) questions from the University of Michigan Department of Pediatrics. askMEDLINE correctly matched the cited references 75.8% in POEMs and 89.2 % in CATs questions on first pass. When articles that were deemed to be relevant to the clinical questions were included, the overall efficiency in retrieving journal articles was 96.8% (POEMs) and 96.3% (CATs.) CONCLUSION: askMEDLINE might be a useful search tool for clinicians, researchers, and other information seekers interested in finding current evidence in MEDLINE/PubMed. The text-only format could be convenient for users with wireless handheld devices and those with low-bandwidth connections in remote locations. (PMID: 15760470)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Authoratory [BIONLP_111] Tool URL
Description Authoratory is an online tool that requires registration and allows searching for social networks of scientists derived from authors from PubMed citations. It allows visualization of co-author networks using a graphical display, searching for author-keyword associations, making a co-author chart. This tool should serve to assist finding domain experts for certain scientific topics.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Free text (paste); Keyword; Authors
Query output Ranked list; Nr. documents; Keyword; References; Date; Co-authors
Keywords Information retrieval; Relation extraction; Abstracts; Social network; Co-author

BabelMeSH [BIONLP_100052042] Tool URL
Description Online tool that allows cross-language PubMed searches.
Reference Liu F, Ackerman M, Fontelo P. BabelMeSH: development of a cross-language tool for MEDLINE/PubMed. AMIA Annu Symp Proc. 2006;:1012
Abstract BabelMeSH: development of a cross-language tool for MEDLINE/PubMed. BabelMeSH is a cross-language tool for searching MEDLINE/PubMed. Queries can be submitted as single terms or complex phrases in French, Spanish and Portuguese. Citations will be sent to the user in English. It uses a smart parser interface with a medical terms database in MySQL. Preliminary evaluation using compound key words in foreign language medical journals showed an accuracy of 68%, 60% and 51% for French, Spanish and Portuguese, respectively. Development is continuing. (PMID: 17238631)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BANNER [BIONLP_454565] Tool URL
Description BANNER is a named entity recognition system, primarily intended for biomedical text. It is a machine-learning system based on conditional random fields and contains a wide survey of the best features in recent literature on biomedical named entity recognition (NER). BANNER is portable and is designed to maximize domain independence by not employing semantic features or rule-based processing steps. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. This system has been compared to the official results of the Second BioCreative Challenge Evaluation as well as to other applications such as ABNER, LingPipe and NERBio.
Reference Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput. 2008; 652-663
Abstract BANNER: an executable survey of advances in biomedical named entity recognition. There has been an increasing amount of research on biomedical named entity recognition, the most basic text extraction problem, resulting in significant progress by different research teams around the world. This has created a need for a freely-available, open source system implementing the advances described in the literature. In this paper we present BANNER, an open-source, executable survey of advances in biomedical named entity recognition, intended to serve as a benchmark for the field. BANNER is implemented in Java as a machine-learning system based on conditional random fields and includes a wide survey of the best techniques recently described in the literature. It is designed to maximize domain independence by not employing brittle semantic features or rule-based processing steps, and achieves significantly better performance than existing baseline systems. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. (PMID: 18229723)
Availability Online: Y, Download: Y, Web service: U
User relevance Biologist: 2, Curator: 2, NLP: 3
Query input Free text (paste); Sentences
Query output Confidence score; Gene/protein names; Bio-entity tagged text; Acronyms/Abbreviations
Keywords Information extraction; Text Mining; Entity Recognition; Term extraction; Acronym/abbreviation extraction

BFMed [BIONLP_100052004] Tool URL
Description BFMed: a cross-language French-to-English search engine for MEDLINE
Reference None
Abstract (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BIGNER [BIONLP_11101] Tool URL
Description BIGNER (Background Information driven Gene Named Entity Recognizer) is a system for automatically tagging gene and protein mentions in biomedical literature. The core of the system is a dictionary generated by semi-supervised learning from a huge amount of unlabeled biomedical text. Two models are provided: (a) maximum match based on the dictionary; (b) a combination of the dictionary and a conditional random field (CRF) model.
Reference Li Y, Lin H, Yang Z. Incorporating rich background knowledge for gene named entity classification and recognition. BMC Bioinformatics. 2009 Jul 17;10(1):223.
Abstract Incorporating rich background knowledge for gene named entity classification and recognition. ABSTRACT: BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. RESULTS: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher level features using term frequency and co-occurrence information of highly indicative features in huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries in a large dictionary derived from online resources. The results show that new features generated by FCG outperform lexical features by 5.97 F-score and 10.85 for OOV terms. Also in this framework each extension yields significant improvements and the sparse lexical features can be transformed into both a lower dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on BioCreative 2 GM test set. Then we combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner. (PMID: 19615051)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 1, Curator: 2, NLP: 3
Query input Free text (paste); Free text (local); Sentences
Query output Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text;
Keywords Information extraction; Text Mining; Entity Recognition
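
Model (a), a forward maximum match against a dictionary, can be illustrated with a short Python sketch. The function name, the `max_len` parameter, whitespace tokenization, and the toy dictionary are assumptions for this example; BIGNER's own dictionary and matching details are not given in this entry.

```python
def forward_maximum_match(tokens, dictionary, max_len=5):
    """Scan left to right and take the longest dictionary entry starting at each token.

    Returns a list of (start, end, text) tuples with token offsets (end exclusive).
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest window first so longer names win over their prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in dictionary:
                matches.append((i, i + n, candidate))
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches

# Toy dictionary and sentence written for this example.
dictionary = {"tumor necrosis factor", "tumor necrosis factor alpha", "p53"}
tokens = "Expression of tumor necrosis factor alpha and p53 increased".split()
print(forward_maximum_match(tokens, dictionary))
```

Trying the longest window first is what makes "tumor necrosis factor alpha" win over its prefix "tumor necrosis factor".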

BioContrasts [BIONLP_100020] Tool URL
Description Contrasts are effective conceptual vehicles for learning processes such as correcting, highlighting, contrasting, and grouping central concepts. Thus, they are useful for exploring the unknown and can provide invaluable insights and explanations about observed phenomena. For example, contrasts between proteins in terms of their biological interactions can reveal similarities, divergences, and relations between the proteins, leading to additional useful insights about their underlying functional nature. A protein-protein contrast is a contrast between two proteins A and B, called "focused proteins", which indicates that A but not B is involved in a biological property C, called the "presupposed property", or vice versa. Contrast information is often encoded in the biomedical literature by contrastive negation patterns such as "A but not B". Such a contrast not only explicitly describes a difference between the focused proteins in terms of the presupposed property, but also implicitly indicates that the focused proteins are semantically similar. This combination of difference and similarity between proteins is useful for augmenting proteomics databases and also for discovering novel knowledge. The BioContrasts Database is a database of protein-protein contrastive information. It currently contains 41,471 protein-protein contrasts, which were automatically extracted from MEDLINE abstracts. Proteins in this database are cross-linked to Swiss-Prot for the purpose of effectively enhancing biomedical resources such as KEGG, InterPro, and Gene Ontology. Through the web interface, users can search for contrastive information on proteins of interest by their Swiss-Prot IDs or names. Users can also attempt knowledge discovery with protein-protein contrasts through several user-interface templates.
Reference Undefined
Abstract BioContrasts: extracting and exploiting protein-protein contrastive relations from biomedical literature. MOTIVATION: Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal what similarities, divergences and relations there are of the two proteins, leading to invaluable insights for better understanding about the proteins. Such contrastive information are found to be reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work that systematically extract and present such useful contrastive information from the literature for exploitation. RESULTS: Our BioContrasts system extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. Contrastive information are identified in the text abstracts with contrastive negation patterns such as 'A but not B'. A total of 799 169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41 471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways. AVAILABILITY: BioContrasts can be accessed at http://biocontrasts.i2r.a-star.edu.sg. It is also mirrored at http://biocontrasts.biopathway.org. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online. (PMID: 16368768)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
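
The contrastive negation pattern "A but not B" mentioned in the description can be illustrated with a small regular expression. This is only a minimal sketch, not the BioContrasts implementation: the pattern, the function name find_contrasts and the example sentence are assumptions made for illustration, and real protein names are often multi-word and would need a proper name tagger and grounding to Swiss-Prot.

```python
import re

# Hypothetical simplification: each name is a single token made of letters,
# digits, underscores and hyphens. Real protein mentions are often multi-word.
CONTRAST = re.compile(r"\b([A-Za-z][\w-]*)\s+but\s+not\s+([A-Za-z][\w-]*)\b")


def find_contrasts(sentence):
    """Return (A, B) pairs for every 'A but not B' expression in the sentence."""
    return CONTRAST.findall(sentence)


# Invented example sentence, not taken from MEDLINE.
example = "PKC-alpha but not PKC-beta was translocated to the membrane."
print(find_contrasts(example))
```

Each extracted pair would still have to be checked as a pair of protein names before it counts as a protein-protein contrast.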

BioIE [BIONLP_100022] Tool URL
Description Rule-based system to extract informative sentences.
Reference Undefined
Abstract BioIE: extracting informative sentences from the biomedical literature. SUMMARY: BioIE is a rule-based system that extracts informative sentences relating to protein families, their structures, functions and diseases from the biomedical literature. Based on manual definition of templates and rules, it aims at precise sentence extraction rather than wide recall. After uploading source text or retrieving abstracts from MEDLINE, users can extract sentences based on predefined or user-defined template categories. BioIE also provides a brief insight into the syntactic and semantic context of the source-text by looking at word, N-gram and MeSH-term distributions. Important applications of BioIE are in, for example, annotation of microarray data and of protein databases. AVAILABILITY: http://umber.sbs.man.ac.uk/dbbrowser/bioie/ (PMID: 15691860)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BioLit [BIONLP_18515836] Tool URL
Description BioLit enables an enhanced view of articles that provides semantic data and links to biological databases based on the content of the article. For example, words matching existing biological ontologies are highlighted and database identifiers are linked to their database of origin. For text-miners, BioLit data are provided as machine-readable XML files that contain mark-up for the ontology terms and database identifiers. Ontologies included are Gene Ontology, Pathway Ontology, Human Disease Ontology and the Cell Type Ontology, among others. BioLit uses a custom parser designed in-house. Source code should be released soon. BioLit identifies database IDs from the Protein Data Bank and UniProtKB by finding IDs that have already been marked up during the publication process and by using regular expressions on the full text of the article (PMC). Potential IDs are compared to existing IDs from both databases to remove matching strings that do not correspond to actual database records. Support for other databases, such as Entrez, will be added soon.
Reference Fink JL, Kushch S, Williams PR, Bourne PE. BioLit: integrating biological literature with databases. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W385-9.
Abstract BioLit: integrating biological literature with databases. BioLit is a web server which provides metadata describing the semantic content of all open access, peer-reviewed articles which describe research from the major life sciences literature archive, PubMed Central. Specifically, these metadata include database identifiers and ontology terms found within the full text of the article. BioLit delivers these metadata in the form of XML-based article files and as a custom web-based article viewer that provides context-specific functionality to the metadata. This resource aims to integrate the traditional scientific publication directly into existing biological databases, thus obviating the need for a user to search in multiple locations for information relating to a specific item of interest, for example published experimental results associated with a particular biological database entry. As an example of a possible use of BioLit, we also present an instance of the Protein Data Bank fully integrated with BioLit data. We expect that the community of life scientists in general will be the primary end-users of the web-based viewer, while biocurators will make use of the metadata-containing XML files and the BioLit database of article data. BioLit is available at http://biolit.ucsd.edu. (PMID: 18515836)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Keyword; Gene/protein names; Gene/protein identifiers; Ontology terms
Query output Ranked list; Nr. documents; Articles; PMIDs; Semantically labelled text
Keywords Information retrieval; Disease; Gene Ontology; Full text;
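
The two-step identification described above (regular expressions to propose IDs, then comparison against existing database records) might look roughly like the following. This is a hedged sketch, not BioLit's parser: the regular expression assumes the classic four-character PDB ID format (a digit followed by three alphanumerics), and the names find_pdb_ids and known, as well as the sample text, are hypothetical.

```python
import re

# Assumed format: classic 4-character PDB IDs (one digit + three alphanumerics).
PDB_CANDIDATE = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")


def find_pdb_ids(text, known_ids):
    """Propose ID-like strings, then keep only those present in known_ids."""
    candidates = {m.upper() for m in PDB_CANDIDATE.findall(text)}
    return sorted(candidates & known_ids)


# Stand-in for the full list of existing database records (hypothetical subset).
known = {"4HHB", "1TUP"}
text = "The structure 4hhb was solved long before 2008; see also 1TUP."
print(find_pdb_ids(text, known))
```

The year "2008" matches the pattern but is dropped by the comparison step, which is the kind of false match the validation against existing records is meant to remove.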

BioMail [BIONLP_100025] Tool URL
Description Selective dissemination of information (SDI) service for MEDLINE.
Reference None
Abstract (No PubMed ref.)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BiomedExperts (BME) [BIONLP_100049] Tool URL
Description Literature-based social networking platform
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Biomedical Abbreviation [BIONLP_1] Tool URL
Description Online system that (a) retrieves, for a given input abbreviation, the list of corresponding long forms from a previously compiled database of abbreviation-long form pairs derived from PubMed, ranked by a quality score; and (b) identifies the abbreviations mentioned in a given input text.
Reference Chang JT, Schütze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc. 2002 Nov-Dec;9(6):612-20.
Abstract Creating an online dictionary of abbreviations from MEDLINE. OBJECTIVE: The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. DESIGN: Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. MEASUREMENTS: We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. RESULTS: On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. CONCLUSION: We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/. (PMID: 12386112)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 1, Curator: 2, NLP: 3
Query input Free text (paste); Abbreviations/Acronyms; Keyword; Sentences;
Query output Ranked list; Acronyms/Abbreviations; Acronyms/Abbreviations tagged text; Nr documents; Confidence score;
Keywords Acronym/abbreviation extraction
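
To make the task of pairing abbreviations with their long forms concrete, the sketch below uses a simple first-letter heuristic on "long form (ABBR)" constructions. It is not the method of Chang et al., which scores candidate expansions with logistic regression; the heuristic, the function name find_abbreviations and the example sentence are all assumptions for illustration only.

```python
import re

# A parenthesised run of 2-10 letters is treated as a candidate abbreviation.
PAREN = re.compile(r"\(([A-Za-z]{2,10})\)")


def find_abbreviations(text):
    """Pair each '(ABBR)' with the preceding words whose initials spell ABBR."""
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Take as many preceding words as the abbreviation has letters.
        candidate = words[-len(abbr):]
        initials = "".join(word[0] for word in candidate)
        if len(candidate) == len(abbr) and initials.lower() == abbr.lower():
            pairs.append((abbr, " ".join(candidate)))
    return pairs


# Invented example sentence.
text = "We used natural language processing (NLP) and a conditional random field (CRF)."
print(find_abbreviations(text))
```

A heuristic like this misses abbreviations that skip words or use inner letters, which is one reason a learned scoring function is attractive for this problem.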

BioMinT [BIONLP_100045] Tool URL
Description BioMinT is an easy-to-use information retrieval and extraction tool targeted at the online biomedical literature (web service).
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BioRAT [BIONLP_100023] Tool URL
Description Information extraction tool for biological research
Reference Corney DP, Buxton BF, Langdon WB, Jones DT. BioRAT: extracting biological information from full-length papers. Bioinformatics. 2004 Nov 22;20(17):3206-13. Epub 2004
Abstract BioRAT: extracting biological information from full-length papers. MOTIVATION: Converting the vast quantity of free-format text found in journals into a concise, structured format makes the researcher's quest for information easier. Recently, several information extraction systems have been developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally available through central collections with easy direct access (e.g. PubMed). The full-text papers contain more information, but are distributed across many locations (e.g. publishers' web sites, journal web sites and local repositories), making access more difficult. In this paper, we present BioRAT, a new information extraction (IE) tool, specifically designed to perform biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a Biological Research Assistant for Text mining, and incorporates a document search ability with domain-specific IE. RESULTS: We show first, that BioRAT performs as well as existing systems, when applied to abstracts; and second, that significantly more information is available to BioRAT through the full-length papers than via the abstracts alone. Typically, less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers. (PMID: 15231534)
Availability Online: -, Download: Y, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BioSearch2D [BIONLP_100039] Tool URL
Description BioSearch2D: Realtime visualization and search in massive document collections from the biomedical literature
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BioSpider [BIONLP_237] Tool URL
Description BioSpider is a tool designed to crawl the web for chemical and/or biological information. BioSpider brings together data from a large variety of databases, uses its own set of predictive programs, and integrates chemical (metabolite, ligand, cofactor and drug) data into its biological (sequence, function, pathway, etc.) reporting. BioSpider supports a variety of query types and detects the type of search you are performing. Valid query types include: name (drug name, metabolite name, chemical name), CAS number, SMILES string, PubChem ID (sid=, cid=), InChI identifier, KEGG Compound ID, KEGG Drug ID, SDF file, MOL file. You need to specify the search class: drug, metabolite, or protein. BioSpider was created to aid the human curation of the DrugBank and Human Metabolome databases.
Reference Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS. BioSpider: a web server for automating metabolome annotations. Pac Symp Biocomput. 2007;:145-56.
Abstract BioSpider: a web server for automating metabolome annotations. One of the growing challenges in life science research lies in finding useful, descriptive or quantitative data about newly reported biomolecules (genes, proteins, metabolites and drugs). An even greater challenge is finding information that connects these genes, proteins, drugs or metabolites to each other. Much of this information is scattered through hundreds of different databases, abstracts or books and almost none of it is particularly well integrated. While some efforts are being undertaken at the NCBI and EBI to integrate many different databases together, this still falls short of the goal of having some kind of human-readable synopsis that summarizes the state of knowledge about a given biomolecule - especially small molecules. To address this shortfall, we have developed BioSpider. BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules - both large and small. Specifically, BioSpider allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InCHI string, CAS number, etc.) and it returns an in-depth synoptic report (approximately 3-30 pages in length) about that biomolecule and any other biomolecule it may target. This summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. Because of its breadth, depth and comprehensiveness, we believe BioSpider will prove to be a particularly valuable tool for researchers in metabolomics. BioSpider is available at: www.biospider.ca (PMID: 17990488)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Keyword; Chemical name; Chemical substance identifier
Query output Ranked list; Bio-entity synonyms; Compound description; Chemical name; Chemical substance identifier
Keywords Chemical compound

BioText [BIONLP_100052018] Tool URL
Description The BioText search engine allows searches over full text, abstracts, figure captions and tables.
Reference Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, Wooldridge MA, Ye J. BioText Search Engine: beyond abstract search. Bioinformatics. 2007 Aug 15;23(16):2196-7. Epub 2007 Jun 1
Abstract BioText Search Engine: beyond abstract search. The BioText Search Engine is a freely available Web-based application that provides biologists with new ways to access the scientific literature. One novel feature is the ability to search and browse article figures and their captions. A grid view juxtaposes many different figures associated with the same keywords, providing new insight into the literature. An abstract/title search and list view shows at a glance many of the figures associated with each article. The interface is carefully designed according to usability principles and techniques. The search engine is a work in progress, and more functionality will be added over time. Availability: http://biosearch.berkeley.edu. (PMID: 17545178)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BioThesaurus [BIONLP_100020019] Tool URL
Description BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to UniProt Knowledgebase (UniProtKB) protein entries (Liu et al., 2006a, 2006b). It covers all UniProtKB protein entries, and consists of several millions of names extracted from multiple resources based on database cross-references in iProClass (detailed statistics and data sources). The web site allows the retrieval of synonymous names of given protein entries and the identification of ambiguous names shared by multiple proteins.
Reference Liu H, Hu ZZ, Zhang J, Wu C. BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics. 2006 Jan 1;22(1):103-5. Epub 2005 Nov 2
Abstract BioThesaurus: a web-based thesaurus of protein and gene names. BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus (PMID: 16267085)
Availability Online: Y, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BIOWIZARD [BIONLP_10003] Tool URL
Description A Digg-style site for PubMed/MEDLINE with search functionality through Medical Subject Headings (MeSH).
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

BITOLA [BIONLP_100024] Tool URL
Description BITOLA: interactive literature-based biomedical discovery support system.
Reference Undefined
Abstract Exploiting semantic relations for literature-based discovery. We propose using semantic predications to enhance literature-based discovery (LBD) systems, which currently depend exclusively on co-occurrence of words or concepts in target documents. In this paper, the predications, which are produced by the combined application of two natural language processing systems, BioMedLEE and SemRep, are coupled with an LBD system BITOLA. Initial experiments suggest this approach can uncover new associations that were not possible using previous methods. (PMID: 17238361)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

botXminer [BIONLP_100050] Tool URL
Description botXminer is a literature mining tool for 'bot' related articles. It mines information from the literature files obtained from MEDLINE in the XML format. Currently two types of literature searches are implemented through botXminer: (1) Search based on authors, publication date or words present in either title or abstract and (2) Group Articles search based on the 'Group by' option selected and available with graphical views.
Reference Mudunuri U, Stephens R, Bruining D, Liu D, Lebeda FJ. botXminer: mining biomedical literature with a new web-based application. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W748-52.
Abstract botXminer: mining biomedical literature with a new web-based application. This paper outlines botXminer, a publicly available application to search XML-formatted MEDLINE data in a complete, object-relational schema implemented in Oracle XML DB. An advantage offered by botXminer is that it can generate quantitative results with certain queries that are not feasible through the Entrez-PubMed interface. After retrieving citations associated with user-supplied search terms, MEDLINE fields (title, abstract, journal, MeSH and chemical) and terms (MeSH qualifiers and descriptors, keywords, author, gene symbol and chemical), these citations are grouped and displayed as tabulated or graphic results. This work represents an extension of previous research for integrating these citations with relational systems. botXminer has a user-friendly, intuitive interface that can be freely accessed at http://botdb.abcc.ncifcrf.gov. (PMID: 16845112)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CBioC [BIONLP_100021] Tool URL
Description A tool for extracting binary relationships between biological entities automatically from the biomedical literature and providing a platform that allows community collaboration in the annotation of the extracted relationships.
Reference Undefined
Abstract CBioC: beyond a prototype for collaborative annotation of molecular interactions from the literature. In molecular biology research, looking for information on a particular entity such as a gene or a protein may lead to thousands of articles, making it impossible for a researcher to individually read these articles and even just their abstracts. Thus, there is a need to curate the literature to get various nuggets of knowledge, such as an interaction between two proteins, and store them in a database. However the body of existing biomedical articles is growing at a very fast rate, making it impossible to curate them manually. An alternative approach of using computers for automatic extraction has problem with accuracy. We propose to leverage the advantages of both techniques, extracting binary relationships between biological entities automatically from the biomedical literature and providing a platform that allows community collaboration in the annotation of the extracted relationships. Thus, the community of researchers that writes and reads the biomedical texts can use the server for searching our database of extracted facts, and as an easy-to-use web platform to annotate facts relevant to them. We presented a preliminary prototype as a proof of concept earlier(1). This paper presents the working implementation available for download at http://www.cbioc.org as a browser-plug in for both Internet Explorer and FireFox. This current version has been available since June of 2006, and has over 160 registered users from around the world. Aside from its use as an annotation tool, data from CBioC has also been used in computational methods with encouraging results. (PMID: 17951840)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CGC [BIONLP_100026] Tool URL
Description Candidate Gene Capture, web-based tool for finding rat genes relevant to arthritis.
Reference None
Abstract (No PubMed ref.)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Chilibot [BIONLP_100027] Tool URL
Description Information extraction tool for relationships between genes, proteins and keywords
Reference Chen H, Sharp BM. Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics. 2004 Oct 8;5:147.
Abstract Content-rich biological network constructed by mining PubMed abstracts. BACKGROUND: The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize (e.g., inhibition or stimulation, directionality) biological interactions. Programs with natural language processing (NLP) capability have been created to address these limitations, however, they are in general not readily accessible to the public. RESULTS: We present a NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Amongst its features, suggestions for new hypotheses can be generated. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows the power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses. CONCLUSIONS: Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot http://www.chilibot.net can be accessed free of charge to academic users. (PMID: 15473905)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CINAHL database [BIONLP_100019] Tool URL
Description Originally a print index to the literature of nursing and eventually allied health information, the CINAHL® database has emerged as a comprehensive and versatile guide to an exploding body of knowledge, and now extends beyond the limits of a bibliographic print index. Although a bibliographic database, the CINAHL database continues to include selected original and full-text material. Full text is included for selected state nursing journals and some newsletters, standards of practice, practice acts, government publications, research instruments and patient education material. The abstracts, when available and with publisher permission, are included in the database to assist with the evaluation of journal articles. Currently abstracts for more than 924 journal titles (active and inactive) are available. CINAHL abstracts are available for audiovisuals and educational software.
Reference Undefined
Abstract (PMID: )
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CiteMD [BIONLP_10004] Tool URL
Description Explore PubMed/MEDLINE, create a free account and organize your references into projects, export to spreadsheet and word processor and email medical citations.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CiteSpace II [BIONLP_100017] Tool URL
Description Java application which supports visual exploration with knowledge discovery in bibliographic databases.
Reference Undefined
Abstract CiteSpace II: visualization and knowledge discovery in bibliographic databases. This article presents a description and case study of CiteSpace II, a Java application which supports visual exploration with knowledge discovery in bibliographic databases. Highly cited and pivotal documents, areas of specialization within a knowledge domain, and emergence of research topics are visually mapped through a progressive knowledge domain visualization approach to detecting and visualizing trends and patterns in scientific literature. The test case in this study is progressive knowledge domain visualization of the field of medical informatics. Datasets based on publications from twelve journals in the medical informatics field covering the time period from 1964-2004 were extracted from PubMed and Web of Science (WOS) and developed as testbeds for evaluation of the CiteSpace system. Two resulting document-term co-citation and MeSH term co-occurrence visualizations are qualitatively evaluated for identification of pivotal documents, areas of specialization, and research trends. Practical applications in bio-medical research settings are discussed. (PMID: 16779135)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

CoagMDB [BIONLP_345235] Tool URL
Description CoagMDB is a web database that collates information on coagulation point mutations together with structural analysis of the domain structures within coagulation proteases. At the moment, the database holds mutations within the five vitamin K-dependent proteases: Thrombin (also known as Factor II), Factor VII (FVII), Factor IX (FIX), Factor X (FX) and Protein C. It is hoped that the mutation database can be kept up to date, since text mining methods have been implemented to automatically extract mutation data from published articles. The mutation database will be updated using new PubMed articles on a monthly basis. The database can be searched by codon, protein, domain or mutated residue.
Reference Saunders RE, Perkins SJ. CoagMDB: a database analysis of missense mutations within four conserved domains in five vitamin K-dependent coagulation serine proteases using a text-mining tool. Hum Mutat. 2008 Mar;29(3):333-44.
Abstract CoagMDB: a database analysis of missense mutations within four conserved domains in five vitamin K-dependent coagulation serine proteases using a text-mining tool. Central repositories of mutations that combine structural, sequence, and phenotypic information in related proteins will facilitate the diagnosis and molecular understanding of diseases associated with them. Coagulation involves the sequential activation of serine proteases and regulators in order to yield stable blood clots while maintaining hemostasis. Five coagulation serine proteases-factor VII (F7), factor IX (F9), factor X (F10), protein C (PROC), and thrombin (F2)-exhibit high sequence similarities and all require vitamin K. All five of these were incorporated into an interactive database of mutations named CoagMDB (http://www.coagMDB.org; last accessed: 9 August 2007). The large number of mutations involved (especially for factor IX) and the increasing problem of out-of-date databases required the development of new database management tools. A text mining tool automatically scans full-length references to identify and extract mutations. High recall rates between 96 and 99% and precision rates of 87 to 93% were achieved. Text mining significantly reduces the time and expertise required to maintain the databases and offers a solution to the problem of locus-specific database management and upkeep. A total of 875 mutations were extracted from 1,279 literature sources. Of these, 116 correspond to Gla domains, 86 to the N-terminal EGF domain, 73 to the C-terminal EGF domain, and 477 to the serine protease domain. The combination of text mining and consensus domain structures enables mutations to be correlated with experimentally-measurable phenotypes based on either low protein levels (Type I) or reduced functional activities (Type II), respectively. A tendency for the conservation of phenotype with structural location was identified. (PMID: 18058827)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Gene/protein names
Query output Articles; Mutations
Keywords Information extraction; Mutation Extraction; Text Mining
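
As a rough illustration of what automatic mutation extraction from text involves, the sketch below recognises point mutations written in three-letter amino acid notation (for example "Arg338Leu"). It is not the CoagMDB text mining tool: the regular expression, the function name find_mutations and the example sentence are assumptions, and a real extractor would also need to handle one-letter notation, other phrasings and numbering conventions.

```python
import re

# The 20 standard amino acids in three-letter code.
THREE = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
         "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Wild-type residue, position, mutant residue, e.g. Arg338Leu.
MUTATION = re.compile(rf"\b({THREE})(\d+)({THREE})\b")


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [(wt, int(pos), mut) for wt, pos, mut in MUTATION.findall(text)]


# Invented example sentence, used only to exercise the pattern.
text = "Patients carried the Arg338Leu and Gly12Asp substitutions."
print(find_mutations(text))
```

Storing the position as an integer makes it straightforward to map each mutation onto a domain by residue range, in the spirit of the domain-level counts reported in the abstract.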

CoPub Mapper [BIONLP_238] Tool URL
Description CoPub (description based on the CoPub tool webpage) is a text mining tool that detects co-occurring biomedical concepts in abstracts from the Medline literature database. The biomedical concepts included in CoPub are all human, mouse and rat genes, as well as biological processes, molecular functions and cellular components from Gene Ontology, and also liver pathologies, diseases, drugs and pathways. Altogether more than 250,000 search strings are linked with CoPub. It offers three search interfaces: (1) Gene search: finds associations of genes with pathways, diseases, drugs, GO terms, liver pathology terms and tissues. The search can be tuned by specifying how the search keyword should be matched (exact matches, substring matches), by setting a threshold on the number of occurrences, and by including additional synonyms, full names, etc. of the query gene. (2) BioConcept search: finds terms for pathways, diseases, drugs, GO terms, liver pathology terms and tissues, and then finds co-occurrences between these term classes and genes. (3) Microarray data analysis: upload or paste a set of Affymetrix gene identifiers (probeset IDs) to find biomedical concepts from Medline that are significantly linked to the gene set. From every retrieved biomedical concept it is only one mouse click to co-published genes and another one to the relevant abstracts. From the input list of differentially expressed genes and the output list of over-represented keywords, CoPub calculates and displays a literature-based network in SVG format, in which nodes and edges are hyperlinked to the relevant abstracts.
Reference Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics. 2005 Mar 11;6:51
Abstract CoPub Mapper: mining MEDLINE based on search term co-publication. BACKGROUND: High throughput microarray analyses result in many differentially expressed genes that are potentially responsible for the biological process of interest. In order to identify biological similarities between genes, publications from MEDLINE were identified in which pairs of gene names and combinations of gene name with specific keywords were co-mentioned. RESULTS: MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE and relative probability of co-occurrences of all gene-gene and gene-keyword pairs determined. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, or same cellular localization, etc.) were run through the program. Receiver operator characteristics (ROC) analyses showed that most gene sets were clustered much better than expected by random chance. To test grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchical clustered to reveal a complete grouping of published genes based on co-occurrence. CONCLUSION: The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data. (PMID: 15760478)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Keyword; Gene/protein names; Gene/protein lists
Query output Ranked list; Nr. documents; Confidence score; Abstracts; PMIDs; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity tagged text; Bio-entity association list; Gene/Protein labelled text; Gene/Protein normalized text; Bio-entity co-occurrences; Bio-entity synonyms; Ranked gene/protein lists; GO-protein associations
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Term extraction; Relation extraction; Microarray; Gene/protein function; Disease; Gene Ontology; Abstracts
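
The co-publication idea described in this record can be sketched in a few lines: count the abstracts in which two concepts are mentioned together and compare that count with what would be expected if the concepts were unrelated. The function name `co_publication_scores`, the toy data and the observed/expected ratio are assumptions chosen for illustration; the actual CoPub Mapper statistic may differ.

```python
from collections import Counter
from itertools import combinations


def co_publication_scores(abstract_concepts):
    """Score concept pairs by how often they share an abstract.

    abstract_concepts maps a document ID to the set of concepts found in it.
    The score is the observed co-occurrence count divided by the count
    expected under independence (an illustrative measure, not CoPub's own).
    """
    n_docs = len(abstract_concepts)
    single = Counter()
    pair = Counter()
    for concepts in abstract_concepts.values():
        single.update(concepts)
        # sorted() gives every unordered pair one canonical key
        pair.update(combinations(sorted(concepts), 2))
    scores = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n_docs
        scores[(a, b)] = observed / expected
    return scores


# Toy input: four hypothetical abstracts with pre-tagged concepts.
toy = {
    "doc1": {"TP53", "apoptosis"},
    "doc2": {"TP53", "apoptosis", "MDM2"},
    "doc3": {"MDM2"},
    "doc4": {"insulin"},
}
scores = co_publication_scores(toy)
```

A score above 1 means the pair is co-mentioned more often than chance would suggest in this toy collection. Real use would also need the validated search strings and thresholds that the tool description mentions.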

Dragon Toolkit [BIONLP_35435456] Tool URL
Description The Dragon Toolkit is a Java-based development package for academic use in information retrieval (IR) and text mining (TM, including text classification, text clustering, text summarization, and topic modeling). It is tailored for researchers who work on large-scale IR and TM and prefer Java programming. Moreover, unlike Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The Dragon Toolkit seamlessly integrates a set of NLP tools, which enable the toolkit to index text collections with various representation schemes including words, phrases, ontology-based concepts and relationships. However, to minimize the learning time, we intentionally keep the package small and simple. The toolkit lacks some features, including distributed IR and cross-language IR, which are part of the Lemur toolkit. Another important feature of the toolkit is its scalability. Unlike many text mining tools such as Weka, the Dragon Toolkit is specially designed for large-scale applications. The toolkit uses sparse matrices to implement text representations and does not have to load all data into memory at run time. Therefore, it can handle hundreds of thousands of documents with very limited memory. Main Features: 1. Implemented in Java 2. Sparse matrix representation and computationally efficient 3. Highly scalable to large data sets 4. Well designed programming API and XML-based interface 5. Various document representations including words, multiword phrases, ontology-based concepts, and concept pairs 6. Various text retrieval models 7. Text classification, clustering, summarization and topic modeling
Reference Zhou, X., Zhang, X., and Hu, X. Dragon Toolkit: Incorporating Auto-learned Semantic Knowledge into Large-Scale Text Retrieval and Mining. In proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), October 29-31, 2007
Abstract Dragon Toolkit: Incorporating Auto-learned Semantic Knowledge into Large-Scale Text Retrieval and Mining. The majority of text retrieval and mining techniques are still based on exact feature (e.g. words) matching and unable to incorporate text semantics. Many researchers believe that the extension with semantic knowledge could improve the results and various methods (most of them are heuristic) have been proposed to account for concept hierarchy, synonymy, and other semantic relationships. However, the results with such semantic extension have been mixed, ranging from slight improvements to decreases in effectiveness, most likely due to the lack of a formal framework. Instead, we propose a novel method to address the semantic extension within the framework of language modeling. Our method extracts explicit topic signatures from documents and then statistically maps them into single-word features. The incorporation of semantic knowledge then reduces to the smoothing of unigram language models using semantic knowledge. The dragon toolkit reflects our method and its effectiveness is demonstrated by three tasks, text retrieval, text classification, and text clustering. (PMID: )
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 2, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Information retrieval; Text Mining; Text clustering; Text summarization; Topic modeling
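
The scalability claim in this record rests on storing documents as sparse rows and processing them one at a time. The sketch below illustrates that general idea in Python rather than Java; it is not the Dragon Toolkit API. The function name `sparse_rows`, the whitespace tokenizer and the sample documents are assumptions made for the example.

```python
from collections import Counter


def sparse_rows(documents, vocab):
    """Yield (doc_id, {term_index: count}) one document at a time.

    Only non-zero counts are stored, and documents are consumed lazily,
    so the full collection never has to sit in memory at once.
    vocab is a dict that is filled in as new terms are seen.
    """
    for doc_id, text in documents:
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        row = {vocab.setdefault(term, len(vocab)): n for term, n in counts.items()}
        yield doc_id, row


# Hypothetical two-document collection.
docs = [("d1", "gene protein gene"), ("d2", "protein binding")]
vocab = {}
rows = list(sparse_rows(docs, vocab))
```

Each row holds only the terms that occur in that document, which is what keeps memory use low when the vocabulary is large and most counts are zero.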

E3Miner [BIONLP_3453456] Tool URL
Description E3Miner: a text mining tool specialized for ubiquitin-protein ligases. E3Miner is a web-based text mining tool that extracts and incorporates comprehensive knowledge about E3s with their underlying mechanisms. This tool integrates available E3 data not only from the published literature but also from biological databases, using natural language processing techniques. USAGE: You can insert text or comma (,) separated PubMed IDs (PMIDs), and click the 'Mining' button. When a PMID is given, E3Miner accesses PubMed online and retrieves the corresponding title and abstract. Note that input text or PMIDs cannot begin with a hyphen or comma (,). The mined E3 data are presented in table form, and you can also save the E3 data in text format through the 'Save Result' link. E3 data consist of the items as following. ('Heading' indicates the head in downloadable text data.)
Reference Lee H, Yi GS, Park JC. E3Miner: a text mining tool for ubiquitin-protein ligases. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W416-22.
Abstract E3Miner: a text mining tool for ubiquitin-protein ligases. Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor systematically made available by such protein databases as UniProt alone. E3Miner is a web-based text mining tool that extracts and organizes comprehensive knowledge about E3s from the abstracts of journal articles and the relevant databases, supporting users to have a good grasp of E3s and their related information easily from the available text. The tool analyzes text sentences to identify protein names for E3s, to narrow down target substrates and other ubiquitin-transferring proteins in E3-specific ubiquitination pathways and to extract molecular features of E3s during ubiquitination. E3Miner also retrieves E3 data about protein functions, other E3-interacting partners and E3-related human diseases from the protein databases, in order to help facilitate further investigation. E3Miner is freely available through http://e3miner.biopathway.org. (PMID: 18483079)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input PMID; PMID list; Query (upload); Keyword; Gene/protein names; Gene/protein identifiers; Gene/protein lists
Query output Keyword; Gene/protein names; Gene/protein identifiers; GO-protein associations; Disease-protein associations
Keywords Information extraction; Gene/protein normalization; Term extraction; Relation extraction; Gene/protein function; Disease; Gene Ontology

EAGLi [BIONLP_100052006] Tool URL
Description EAGLi (“eagle eye”): an advanced search engine for MEDLINE with terminology-powered navigation skills (Gene Ontology, Swiss-Prot keywords).
Reference None
Abstract (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

EBIMed [BIONLP_100028] Tool URL
Description EBIMed is a web application that combines Information Retrieval and Extraction from Medline. EBIMed finds Medline abstracts in the same way PubMed does. Then it goes a step beyond and analyses them to offer a complete overview of associations between UniProt protein/gene names, GO annotations, Drugs and Species. The results are shown in a table that displays all the associations and links to the sentences that support them and to the original abstracts. You can type term queries in the text box provided, following the syntax conventions documented on the tool's website. Your terms will be looked up throughout Medline and several abstracts will thus be retrieved and analysed. In the simple interface the upper limit is 500 to make the process quick. You can set a higher limit through the Advanced Search interface. By selecting relevant sentences and highlighting the biomedical terminology, EBIMed enhances your ability to acquire knowledge, relate facts, discover implications and, overall, gain a good overview while reducing the reading effort. Indexed fields: PMID, AbstractText, ArticleTitle, AuthorList, MeshHeadingList, DateCreated, DateCompleted, DateRevised, PubDate, Language, MedlineJournalInfo.
Reference Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P. EBIMed--text crunching to gather facts for proteins from Medline. Bioinformatics. 2007 Jan 15;23(2):e237-44.
Abstract EBIMed--text crunching to gather facts for proteins from Medline. To allow efficient and systematic retrieval of statements from Medline we have developed EBIMed, a service that combines document retrieval with co-occurrence-based analysis of Medline abstracts. Upon keyword query, EBIMed retrieves the abstracts from EMBL-EBI's installation of Medline and filters for sentences that contain biomedical terminology maintained in public bioinformatics resources. The extracted sentences and terminology are used to generate an overview table on proteins, Gene Ontology (GO) annotations, drugs and species used in the same biological context. All terms in retrieved abstracts and extracted sentences are linked to their entries in biomedical databases. We assessed the quality of the identification of terms and relations in the retrieved sentences. More than 90% of the protein names found indeed represented a protein. According to the analysis of four protein-protein pairs from the Wnt pathway we estimated that 37% of the statements containing such a pair mentioned a meaningful interaction and clarified the interaction of Dkk with LRP. We conclude that EBIMed improves access to information where proteins and drugs are involved in the same biological process, e.g. statements with GO annotations of proteins, protein-protein interactions and effects of drugs on proteins. AVAILABILITY: Available at http://www.ebi.ac.uk/Rebholz-srv/ebimed (PMID: 17237098)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

EMBASE [BIONLP_100018] Tool URL
Description More than 18 million validated biomedical and pharmacological records from EMBASE and MEDLINE. EMBASE.com provides you with content that keeps you up to date with the latest scientific developments. It is the Intelligent Gateway to Biomedical and Pharmacological Information.
Reference Undefined
Abstract (PMID: )
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Enju [BIONLP_3434546] Tool URL
Description Enju is a syntactic parser for English. With a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, this parser can effectively analyze syntactic/semantic structures of English sentences and provide a user with phrase structures and predicate-argument structures. Those outputs would be especially useful for high-level NLP applications, including information extraction, automatic summarization, and question answering, where the "meaning" of a sentence plays a central role. The main features of the Enju parser are: (1) accurate deep analysis: the parser can output both phrase structures and predicate-argument structures. The accuracy of predicate-argument relations is around 90% for newswire articles and biomedical papers. (2) high speed: parsing time is less than 500 msec. per sentence by default (faster than most Penn Treebank parsers), and less than 50 msec. when using the high-speed version.
Reference Yusuke Miyao and Jun'ichi Tsujii. 2002. Maximum Entropy Estimation for Feature Forests. In Proceedings of HLT 2002.
Abstract Maximum Entropy Estimation for Feature Forests. An algorithm is proposed for maximum entropy modeling. It enables probabilistic modeling of complete structures, such as transition sequences in Markov models and parse trees, without dividing them into independent sub-events. A probabilistic event is represented by a feature forest, which is a packed representation of features with ambiguities. The parameters are efficiently estimated by traversing each node in a feature forest by dynamic programming. Experiments showed the algorithm worked efficiently even when ambiguities in a feature forest cause an exponential explosion of unpacked structures. (PMID: )
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Free text (paste); Free text (local); Sentences
Query output POS labelled text; Parses; Root form
Keywords Syntactic parser; Probabilistic HPSG grammar

EpiLoc [BIONLP_100033] Tool URL
Description Text-based prediction of subcellular location for proteins and protein sequences (UniProt)
Reference Brady S, Shatkay H. EpiLoc: a (working) text-based system for predicting protein subcellular location. Pac Symp Biocomput. 2008;:604-15.
Abstract EpiLoc: a (working) text-based system for predicting protein subcellular location. MOTIVATION: Predicting the subcellular location of proteins is an active research area, as a protein's location within the cell provides meaningful cues about its function. Several previous experiments in utilizing text for protein subcellular location prediction varied in methods, applicability and performance level. In an earlier work we have used a preliminary text classification system and focused on the integration of text features into a sequence-based classifier to improve location prediction performance. RESULTS: Here the focus shifts to the text-based component itself. We introduce EpiLoc, a comprehensive text-based localization system. We provide an in-depth study of text-feature selection, and study several new ways to associate text with proteins, so that text-based location prediction can be performed for practically any protein. We show that EpiLoc's performance is comparable to (and may even exceed) that of state-of-the-art sequence-based systems. EpiLoc is available at: http://epiloc.cs.queensu.ca. (PMID: 18229719)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Undefined
Query output Undefined
Keywords Undefined

eTBLAST [BIONLP_100030] Tool URL
Description eTBLAST is a unique search engine for searching biomedical literature. Our service is very different from PubMed. While PubMed searches for "keywords", our search engine lets you input an entire paragraph and returns MEDLINE abstracts that are similar to it. This is something like PubMed's "Related Articles" feature, only better because it runs on your unique set of interests. For example, input the abstract of an unpublished paper or a grant proposal into our engine, and with the touch of a button you'll be able to find every abstract in MEDLINE dealing with your topic. No more guessing whether your set of keywords has found all the right papers. No more sorting through hundreds of papers you don't care about to find the handful you were looking for--our search engine does it for you.
Reference Lewis J, Ossowski S, Hicks J, Errami M, Garner HR. Text similarity: an alternative way to search MEDLINE. Bioinformatics. 2006 Sep 15;22(18):2298-304. Epub 2006 Aug 22
Abstract Text similarity: an alternative way to search MEDLINE. MOTIVATION: The most widely used literature search techniques, such as those offered by NCBI's PubMed system, require significant effort on the part of the searcher, and inexperienced searchers do not use these systems as effectively as experienced users. Improved literature search engines can save researchers time and effort by making it easier to locate the most important and relevant literature. RESULTS: We have created and optimized a new, hybrid search system for Medline that takes natural text as input and then delivers results with high precision and recall. The combination of a fast, low-sensitivity weighted keyword-based first pass algorithm to cast a wide net to gather an initial set of literature, followed by a unique sentence-alignment based similarity algorithm to rank order those results was developed that is sensitive, fast and easy to use. Several text similarity search algorithms, both standard and novel, were implemented and tested in order to determine which obtained the best results in information retrieval exercises. AVAILABILITY: Literature searching algorithms are implemented in a system called eTBLAST, freely accessible over the web at http://invention.swmed.edu. A variety of other derivative systems and visualization tools provides the user with an enhanced experience and additional capabilities. CONTACT: Harold.Garner@UTSouthwestern.edu. (PMID: 16926219)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

eUtils [BIONLP_100052037] Tool URL
Description eUtils: Entrez Programming Utilities are tools that provide access to Entrez data outside of the regular web query interface and may be helpful for retrieving search results for future use in another environment. To demonstrate the capabilities of the E-Utilities (ESearch, EFetch, EPost and ESummary), NCBI has written two Perl scripts, which may be downloaded and run on your own machine.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: Y, Web service: -
User relevance Biologist: 1, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined
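
As a rough sketch of how the E-Utilities are typically called, the snippet below builds ESearch and EFetch request URLs for PubMed. It is written in Python rather than Perl and is not one of the NCBI demonstration scripts. The helper names `esearch_url` and `efetch_url` are made up for this example, and the parameter choices (`retmax`, `rettype`, `retmode`) are just one reasonable configuration. No request is sent; the functions only assemble the URLs.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that returns IDs of records matching a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="pubmed", rettype="abstract", retmode="text"):
    """Build an EFetch URL that retrieves the records for a list of IDs."""
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search = esearch_url("text mining")
# PMIDs taken from two records in this compendium.
fetch = efetch_url([18058827, 15760478])
```

The resulting URLs can be passed to any HTTP client. NCBI's usage guidelines on request rates should be checked before running queries in bulk.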

FABLE [BIONLP_100046] Tool URL
Description FABLE mines the biomedical literature for information about human genes and proteins. FABLE v3 allows a user to find articles mentioning a gene of interest (Article Finder) and to generate a list of genes associated with one or more keywords (Gene Lister).
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

FACTA [BIONLP_3434543] Tool URL
Description FACTA: Finding Associated Concepts with Text Analysis. Tool for finding associations of a given query to concepts including Genes/Proteins, Diseases, Symptoms, Drugs and Compounds, based on co-occurrence analysis. Provides a summary display of the concepts co-occurring in the documents, and links to the text fragments where the query and these concepts co-occur (color highlighted).
Reference Tsuruoka Y, Tsujii J, Ananiadou S. FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics. 2008 Nov 1;24(21):2559-60.
Abstract FACTA: a text search engine for finding associated biomedical concepts. FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank. AVAILABILITY: The system is available at http://www.nactem.ac.uk/software/facta/ (PMID: 18772154)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Keyword; Gene/protein names; Compound/drug names; Abbreviations/Acronyms
Query output Ranked list; Nr. documents; Confidence score; Articles; Sentences; Abstracts; PMIDs; Keyword; Gene/protein names; Bio-entity association list; Bio-entity co-occurrences; Ranked gene/protein lists; Drug-target associations; Protein-compound associations
Keywords Information extraction; Information retrieval; Entity Recognition; Gene/protein normalization; Relation extraction; Abstracts

Figurome [BIONLP_11107] Tool URL
Description The Figurome search engine was developed to retrieve tables and figure types to aid computational and experimental research. Figurome labels PubMed Central figures using four label types: Gel, Pathway, Structure, and Time. The labeling is based on the figure's caption, the text embedded within the figure, and image properties.
Reference Rodriguez-Esteban R, Iossifov I. Figure mining for biomedical research. Bioinformatics. 2009; 25(16):2082-2084.
Abstract Figure mining for biomedical research. MOTIVATION: Figures from biomedical articles contain valuable information difficult to reach without specialized tools. Currently, there is no search engine that can retrieve specific figure types. RESULTS: This study describes a retrieval method that takes advantage of principles in image understanding, text mining and optical character recognition (OCR) to retrieve figure types defined conceptually. A search engine was developed to retrieve tables and figure types to aid computational and experimental research. AVAILABILITY: http://iossifovlab.cshl.edu/figurome/. (PMID: 19439564)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Free text (paste); Keyword; Journal; Sentences;
Query output Ranked list; Articles; Sentences; Figures; Tables; Figure legends; Table captions; Images
Keywords Information extraction; Information retrieval; Text Mining; Text classification; Full text; Figure legends

FreePatentsOnline [BIONLP_100012] Tool URL
Description FreePatentsOnline, the best online site for free patent searching, has had a makeover. In addition to the graphical differences, the site has been completely rewritten to allow more users, more powerful functions, and more data.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

GAPSCORE [BIONLP_100031] Tool URL
Description Protein and gene name (PGN) tagger.
Reference Chang JT, Schütze H, Altman RB. GAPSCORE: finding gene and protein names one word at a time. Bioinformatics. 2004 Jan 22;20(2):216-25.
Abstract GAPSCORE: finding gene and protein names one word at a time. MOTIVATION: New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. RESULTS: We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs. AVAILABILITY: GAPSCORE is available at http://bionlp.stanford.edu/gapscore/ (PMID: 14734313)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined

GDep [BIONLP_56456567] Tool URL
Description GDep (GENIA Dependency parser): A dependency parser for biomedical text developed by Kenji Sagae at Tsujii Lab (University of Tokyo) and the Institute for Creative Technologies (University of Southern California). This is a version of the KSDep dependency parser trained on the GENIA Treebank for parsing biomedical text.
Reference Sagae, K., Tsujii, J. 2007. Dependency parsing and domain adaptation with LR models and parser ensembles. Proceedings of the CoNLL 2007 Shared Task. Joint Conferences on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07).
Abstract Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles. We present a data-driven variant of the LR algorithm for dependency parsing, and extend it with a best-first search for probabilistic generalized LR dependency parsing. Parser actions are determined by a classifier, based on features that represent the current state of the parser. We apply this parsing framework to both tracks of the CoNLL 2007 shared task, in each case taking advantage of multiple models trained with different learners. In the multilingual track, we train three LR models for each of the ten languages, and combine the analyses obtained with each individual model with a maximum spanning tree voting scheme. In the domain adaptation track, we use two models to parse unlabeled data in the target domain to supplement the labeled out-of-domain training set, in a scheme similar to one iteration of co-training. (PMID: )
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Free text (local); Sentences
Query output Parses
Keywords Syntactic parser

GenNav [BIONLP_100042] Tool URL
Description Find Gene Ontology terms and gene products in PubMed
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

GOAnnotator [BIONLP_100052024] Tool URL
Description GOAnnotator: Protein-GO annotation extraction from literature.
Reference Couto FM, Silva MJ, Lee V, Dimmer E, Camon E, Apweiler R, Kirsch H, Rebholz-Schuhmann D. GOAnnotator: linking protein GO annotations to evidence text. J Biomed Discov Collab. 2006 Dec 20;1:19
Abstract GOAnnotator: linking protein GO annotations to evidence text. BACKGROUND: Annotation of proteins with gene ontology (GO) terms is ongoing work and a complex task. Manual GO annotation is precise and precious, but it is time-consuming. Therefore, instead of curated annotations most of the proteins come with uncurated annotations, which have been generated automatically. Text-mining systems that use literature for automatic annotation have been proposed but they do not satisfy the high quality expectations of curators. RESULTS: In this paper we describe an approach that links uncurated annotations to text extracted from literature. The selection of the text is based on the similarity of the text to the term from the uncurated annotation. Besides substantiating the uncurated annotations, the extracted texts also lead to novel annotations. In addition, the approach uses the GO hierarchy to achieve high precision. Our approach is integrated into GOAnnotator, a tool that assists the curation process for GO annotation of UniProt proteins. CONCLUSION: The GO curators assessed GOAnnotator with a set of 66 distinct UniProt/SwissProt proteins with uncurated annotations. GOAnnotator provided correct evidence text at 93% precision. This high precision results from using the GO hierarchy to only select GO terms similar to GO terms from uncurated annotations in GOA. Our approach is the first one to achieve high precision, which is crucial for the efficient support of GO curators. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/. (PMID: 17181854)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

GOCat [BIONLP_100052005] Tool URL
Description GOCat (Gene Ontology Categorizer): a text categorizer that automatically annotates proteins with GO categories (molecular functions, subcellular locations and biological processes).
Reference None
Abstract (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

GoGene [BIONLP_11102] Tool URL
Description GoGene takes your query to search for genes and gene-related concepts such as biological processes, molecular functions, diseases, and mutations. It accepts PubMed queries, EntrezGene queries, and nucleotide or amino acid sequences as input and shows the resulting list of genes together with relevant GO and MeSH concepts that can be used to filter the result further. All gene annotations are linked to the relevant literature and to database entries.
Reference Conrad Plake, Loic Royer, Rainer Winnenburg, Joerg Hakenberg, and Michael Schroeder. GoGene: gene annotation in the fast lane. Nucleic Acids Res, 37(Web Server issue):W300-4, 2009.
Abstract GoGene: gene annotation in the fast lane. High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene. (PMID: 19465383)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 3, NLP: 2
Query input PMID; PMID list; Keyword; Gene/protein names; Gene/protein identifiers; Gene/protein lists; Gene/protein sequences; Authors; Abbreviations/Acronyms
Query output Online browsing; Download of results
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Term extraction; Relation extraction; Microarray; Gene/protein function; Disease; Gene Ontology; Knowledge Discovery

Google Patent Search [BIONLP_100014] Tool URL
Description Google also offers a patent search
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Google Scholar [BIONLP_113] Tool URL
Description Google Scholar is a search engine specialized in the retrieval of academic and scholarly texts across a number of scientific disciplines. It searches within scientific articles, books, theses, scientific reports, summaries and publications of academic institutions and universities. The retrieved documents are ordered by relevance; the ranking is based on the full text of each article, the author, the publishing journal and the number of times it has been cited by other specialized sources. Search options include searching with keywords (including exact-phrase matching or matching at least one word of the query), and searching by author, publication (journal name) or date.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Keyword; Authors; Abbreviations/Acronyms; Date
Query output Ranked list; Nr. documents; Articles; References; Ranked articles; Date
Keywords Information retrieval; Full text; Abstracts

GoPubMed [BIONLP_100052012] Tool URL
Description GoPubMed is a tool to explore the biomedical literature (PubMed) according to the Gene Ontology.
Reference Doms A, Schroeder M. GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W783-6
Abstract GoPubMed: exploring PubMed with the Gene Ontology. The biomedical literature grows at a tremendous rate and PubMed comprises already over 15 000 000 abstracts. Finding relevant literature is an important and difficult problem. We introduce GoPubMed, a web server which allows users to explore PubMed search results with the Gene Ontology (GO), a hierarchically structured vocabulary for molecular biology. GoPubMed provides the following benefits: first, it gives an overview of the literature abstracts by categorizing abstracts according to the GO and thus allowing users to quickly navigate through the abstracts by category. Second, it automatically shows general ontology terms related to the original query, which often do not even appear directly in the abstract. Third, it enables users to verify its classification because GO terms are highlighted in the abstracts and as each term is labelled with an accuracy percentage. Fourth, exploring PubMed abstracts with GoPubMed is useful as it shows definitions of GO terms without the need for further look up. GoPubMed is online at www.gopubmed.org. Querying is currently limited to 100 papers per query. (PMID: 15980585)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

HCAD [BIONLP_100020016] Tool URL
Description Online database of literature-associated breakpoints of human genes, generated using automatic literature processing tools. HCAD was designed to facilitate the identification of potential breakpoint genes (see Figure). This is a difficult task even though the complete human genome is now known, because of the sheer number of genes per cytoband (Rabbitts 1999). The HCAD system is based on the hypothesis that genes directly affected by recurrent breakage events will be quoted more often in abstracts about the corresponding breakpoint, even if a direct proof for this association has not yet been described. The statistical analysis in HCAD thus provides probabilities (Z-Score) for genes to be relevant for a certain breakpoint (literature evidence). False positive associations can be eliminated by crosschecking with genomic data (LocPrecision).
Reference Hoffmann R, Dopazo J, Cigudosa JC, Valencia A. HCAD, closing the gap between breakpoints and genes. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D511-3
Abstract HCAD, closing the gap between breakpoints and genes. Recurrent chromosome aberrations are an important resource when associating human pathologies to specific genes. However, for technical reasons a large number of chromosome breakpoints are defined only at the level of cytobands and many of the genes involved remain unidentified. We developed a web-based information system that mines the scientific literature and generates textual and comprehensive information on all human breakpoints. We show that the statistical analysis of this textual information and its combination with genomic data can identify genes directly involved in DNA rearrangements. The Human Chromosome Aberration Database (HCAD) is publicly accessible at http://www.pdg.cnb.uam.es/UniPub/HCAD/. (PMID: 15608250)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
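Example (illustrative sketch, not taken from the HCAD system) The Description above states that genes are scored with a Z-score according to how often they are quoted in abstracts about a breakpoint. One possible way to compute such a score is shown below. It assumes a hypothetical table of per-cytoband mention counts and compares a gene's count for the breakpoint of interest with the mean and standard deviation of its counts across all cytobands. The gene names, counts and function names are invented for illustration.

```python
from statistics import mean, pstdev

def z_score(count, background_counts):
    """Standard score of `count` relative to a list of background counts."""
    mu = mean(background_counts)
    sigma = pstdev(background_counts)
    if sigma == 0:
        return 0.0  # no variation in the background: no evidence either way
    return (count - mu) / sigma

# Hypothetical data: number of abstracts mentioning each gene, per cytoband.
mentions = {
    "GENE_A": {"11q23": 40, "8q24": 2, "9q34": 3, "22q11": 3},
    "GENE_B": {"11q23": 5, "8q24": 4, "9q34": 6, "22q11": 5},
}

def rank_genes(breakpoint, mentions):
    """Return (gene, z) pairs sorted by descending Z-score for one breakpoint."""
    scores = []
    for gene, per_band in mentions.items():
        z = z_score(per_band.get(breakpoint, 0), list(per_band.values()))
        scores.append((gene, z))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_genes("11q23", mentions)
```

In this toy data GENE_A is mentioned far more often for 11q23 than for other cytobands and therefore ranks first, while GENE_B is mentioned about equally often everywhere and scores zero.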

HighWire Press [BIONLP_114] Tool URL
Description HighWire Press represents another complementary resource to PubMed for accessing peer-reviewed articles, providing a search interface to over 1100 journals and 4.6 million full text articles, of which over 1.8 million are available free from HighWire partner publishers. A comparative evaluation of HighWire Press and PubMed in terms of search efficiency showed that both systems, although sharing many search characteristics, also provide features unique to each (Vanhecke et al, 2007). HighWire Press also offers a graphical visualization of the article's citation map and allows the user to further specify where to conduct the search (title, abstract, full text).
Reference HighWire press is 5 years old. J Biol Chem. 2000 Apr 14;275(15):10717.
Abstract HighWire Press Is 5 years old. It has been 5 years since Stanford University's HighWire Press launched a revolution in scientific publishing with the creation of JBC Online in 1995. Now virtually every life science journal, as well as journals of many other disciplines, publish online. From that beginning with JBC Online in 1995, HighWire now produces online editions of more than 180 of the most highly cited journals and has clearly set the standard for innovative, very high quality Internet publication of the world's most important scientific journals (see the 500 most highly cited journals at http://highwire.stanford.edu/top/journals.dtl). During this remarkable 5-year period, HighWire has published more than 600,000 articles, and more than 140,000 articles are available free to anyone with Internet access. This represents, by far, the largest archive of barrier-free, peer-reviewed life science research reports on earth and serves the world research community exceedingly well. The publication of JBC Online has also skyrocketed during this 5-year revolution. As of March 12, 2000, we have published 25,874 articles. As a result of our "free back issues" policy to release all articles free at the end of each calendar year, more than 24,694 are available to everyone at no cost. Also during this time, the usage of JBC online has increased enormously. In 1 week in mid-1995, about 1000 individual readers contacted the JBC Online. For the 1 week ending March 2, 2000, more than 41,000 individuals contacted the JBC site. On March 8, 2000, JBC and HighWire launched a truly innovative feature called JBC Papers in Press. We now publish papers the hour they are accepted for publication and release them free to anyone with internet access (see www.jbc.org). What's next? The HighWire team continues to innovate, and we can expect many exciting new developments in the next 5 years.
It is clear that online publication is vastly superior to print publication with easily searchable archives and very productive linking to other related research information. Given this great superiority of online publication, we are rapidly headed toward a drastic transformation of the print journal. Many journals are already making the online version of their research reports the version of record. These changes will redefine our libraries and how scientists access the exponentially increasing volume of research results. Congratulations HighWire, for revolutionizing science publication. We can't wait to see what's next. (PMID: 10753859)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 2
Query input Keyword; Authors; Date; Journal
Query output Ranked list; Nr. documents; Abstracts; PMIDs; References; Ranked articles; e-mail returned article collection; Date
Keywords Information retrieval; Full text; Abstracts; Literature repository

HubMed [BIONLP_1000512] Tool URL
Description Alternative interface to the PubMed database with additional options (e.g. visualization of links between articles with TouchGraph).
Reference Eaton AD. HubMed: a web-based biomedical literature search interface. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W745-7
Abstract HubMed: a web-based biomedical literature search interface. HubMed is an alternative search interface to the PubMed database of biomedical literature, incorporating external web services and providing functions to improve the efficiency of literature search, browsing and retrieval. Users can create and visualize clusters of related articles, export citation data in multiple formats, receive daily updates of publications in their areas of interest, navigate links to full text and other related resources, retrieve data from formatted bibliography lists, navigate citation links and store annotated metadata for articles of interest. HubMed is freely available at http://www.hubmed.org/. (PMID: 16845111)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

HuGE Navigator [BIONLP_100052039] Tool URL
Description HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology, including information on population prevalence of genetic variants, gene-disease associations, gene-gene and gene-environment interactions, and evaluation of genetic tests. The HuGE tools include: (1) HuGE Literature Finder: to find published articles in human genome epidemiology. (2) HuGE Investigator Browser: to find investigators in a particular field of human genome epidemiology. (3) GeneSelectAssist: a search engine for finding possible candidate genes. (4) HuGE Watch: to track the evolution of published literature in human genome epidemiology.
Reference Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ. A navigator for human genome epidemiology. Nat Genet. 2008 Feb;40(2):124-5.
Abstract A navigator for human genome epidemiology. (PMID: 18227866)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

iHOP [BIONLP_100052021] Tool URL
Description iHOP (Information Hyperlinked over Proteins) is a system to search PubMed using gene and protein names, which are automatically linked to their corresponding database entries and the associated literature. Search options include retrieval of protein interaction sentences and description sentences, as well as ranking based on publication date. A graphical interaction network of proteins can also be constructed based on literature evidence.
Reference Hoffmann R, Valencia A. Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005 Sep 1;21 Suppl 2:ii252-8
Abstract Implementing the iHOP concept for navigation of biomedical literature. MOTIVATION: The World Wide Web has profoundly changed the way in which we access information. Searching the internet is easy and fast, but more importantly, the interconnection of related contents makes it intuitive and closer to the associative organization of human memory. However, the information retrieval tools currently available to researchers in biology and medicine lag far behind the possibilities that the layman has come to expect from the internet. RESULTS: By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource. iHOP (Information Hyperlinked over Proteins) is an online service that provides this gene-guided network as a natural way of accessing millions of PubMed abstracts and brings all the advantages of the internet to scientific literature research. Navigating across interrelated sentences within this network is closer to human intuition than the use of conventional keyword searches and allows for stepwise and controlled acquisition of information. Moreover, this literature network can be superimposed upon experimental interaction data to facilitate the simultaneous analysis of novel and existing knowledge. The network presented in iHOP currently contains 5 million sentences and 40 000 genes from Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Arabidopsis thaliana, Saccharomyces cerevisiae and Escherichia coli. AVAILABILITY: iHOP is freely accessible at http://www.pdg.cnb.uam.es/UniPub/iHOP/ (PMID: 16204114)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Infomap / WORDSPACE [BIONLP_100020017] Tool URL
Description The Infomap NLP software performs automatic indexing of words and documents from free-text corpora, using a variant of latent semantic analysis (LSA) to enable information retrieval and other applications. It was developed by the Infomap Project at Stanford University's CSLI. There is a download page at: http://sourceforge.net/project/showfiles.php?group_id=99109 as well as an online demo version at: http://infomap.stanford.edu/webdemo
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

InfoPubMed [BIONLP_100052030] Tool URL
Description Interaction Network over the sea of MEDLINE. Info-PubMed is an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them. Currently, 14,785,094 MEDLINE articles are indexed. Info-PubMed is based on Natural Language Processing and Text Mining techniques and has been developed by Tsujii Laboratory, University of Tokyo, Japan. Requirements: (1) This system runs only on IE, NN and Firefox, and doesn't run on MacIE or Opera. (2) Set your browser to use JavaScript. (3) High screen resolution or full screen mode is recommended. (4) Firefox is more comfortable than IE with respect to speed. (Information based on the InfoPubMed webpage)
Reference None
Abstract (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

iProLINK [BIONLP_100020018] Tool URL
Description iProLINK (integrated Protein Literature, INformation and Knowledge) has been developed as a resource to facilitate text mining in the area of literature-based database curation, named entity recognition, and protein ontology development. The collection of data sources can be utilized by computational and biological researchers to explore literature information on proteins and their features or properties (Hu et al., 2004). The data sources for bibliography mapping and feature evidence attribution include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include protein name dictionaries, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, and a protein ontology based on PIRSF protein family names. iProLINK also provides tools developed using PIR data sources, e.g. RLIMS-P for text mining of protein phosphorylation and BioThesaurus for mapping protein/gene names to UniProtKB entries.
Reference Hu ZZ, Mani I, Hermoso V, Liu H, Wu CH. iProLINK: an integrated protein resource for literature mining. Comput Biol Chem. 2004 Dec;28(5-6):409-16
Abstract iProLINK: an integrated protein resource for literature mining. The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies, but at the same time, the lack of adequate curated data for training and benchmarking, the Protein Information Resource (PIR) has developed a resource for protein literature mining--iProLINK (integrated Protein Literature INformation and Knowledge). As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links for all downloadable files. (PMID: 15556482)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

KEX [BIONLP_100020025] Tool URL
Description KEX (Knowledge EXtraction) is a protein name annotation tool based on PROPER (PROtein Proper-noun Extraction Rules). The input file should be in plain text format or in "MEDLINE report" format. The output file uses a one_sentence-one_line format, in which protein names are annotated with special mark-ups. KEX can be downloaded and was tested on Solaris, DEC and IRIX. You need a C compiler (preferably gcc) and Perl version 5.
Reference Yoshida M, Fukuda K, Takagi T. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics. 2000 Feb;16(2):169-75
Abstract PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. MOTIVATION: Since their initial development, integration and construction of databases for molecular-level data have progressed. Though biological molecules are related to each other and form a complex system, the information is stored in the vast archives of the literature or in diverse databases. There is no unified naming convention for biological object, and biological terms may be ambiguous or polysemic. This makes the integration and interaction of databases difficult. In order to eliminate these problems, machine-readable natural language resources appear to be quite promising. We have developed a workbench for protein name abbreviation dictionary (PNAD) building. RESULTS: We have developed PNAD Construction Support System (PNAD-CSS), which offers various convenient facilities to decrease the construction costs of a protein name abbreviation dictionary of which entries are collected from abstracts in biomedical papers. The system allows the users to concentrate on higher level interpretation by removing some troublesome tasks, e.g. management of abstracts, extracting protein names and their abbreviations, and so on. To extract a pair of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System can extract the pairs from parenthetical-paraphrases involved in protein names, the PROPER System identified these pairs, with 98.95% precision, 95.56% recall and 97.58% complete precision. AVAILABILITY: PROPER System is freely available from http://www.hgc.inc.u-tokyo.ac.jp/service/tooldoc/KeX/intro.html. The other software are also available on request. Contact the authors. CONTACT: mikio@ims.u-tokyo.ac.jp (PMID: 10842739)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Kfinder [BIONLP_10009] Tool URL
Description Get a list of suggested keywords and referees for your paper. Paste the text of your paper in the window. For better results, paste the complete text of the abstract, introduction and discussion. Kfinder gives you a set of putative keywords from your text file. This is running at OHRI (Ottawa, Canada). The input needed is a text file (please do not upload a Word document; if you have one, use "Select all", "Copy" and "Paste" to transfer its text into the input window).
Reference Undefined
Abstract (PMID: 12074170)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

KinasePathway database [BIONLP_53] Tool URL
Description NLP extraction of protein-protein interactions from text
Reference Asako Koike, Yoshiyuki Kobayashi, and Toshihisa Takagi (2003). Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource. Genome Res. 2003;13 1231-1243
Abstract KinasePathwayDatabase is an integrated database concerning completely sequenced major eukaryotes, which contains the classification of protein kinases and their functional conservation and orthologous tables among species, protein-protein interaction data, domain information, structural information, and an automatic pathway graph image interface. The protein-protein interactions are extracted by natural language processing (NLP) from abstracts using basic word patterns and the protein name dictionary GENA, developed by our group. (PMID: 12799355)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Gene/protein names; Gene/protein identifiers; Gene/protein lists;
Query output Bio-entity network; Protein Interactions;
Keywords Information extraction; Text Mining; Relation extraction; Protein Interaction;

KMedDB [BIONLP_100052040] Tool URL
Description KMedDB enables you to search PubMed abstracts for key words related to enzyme kinetic parameters. To query KMedDB, choose a type of kinetic constant. In addition, you can specify other entities that have to appear in the abstracts (Enzyme: KEGG Identifier (EC-Number), Compound: KEGG Identifier, Species: NCBI Taxonomy Identifier). If you leave these fields open, you may obtain a large number of hits, and processing the query will take a while. You have to submit at least one other field.
Reference Hakenberg J, Schmeier S, Kowald A, Klipp E, Leser U. Finding kinetic parameters using text mining. OMICS. 2004 Summer;8(2):131-52
Abstract Finding kinetic parameters using text mining. The mathematical modeling and description of complex biological processes has become more and more important over the last years. Systems biology aims at the computational simulation of complex systems, up to whole cell simulations. An essential part focuses on solving a large number of parameterized differential equations. However, measuring those parameters is an expensive task, and finding them in the literature is very laborious. We developed a text mining system that supports researchers in their search for experimentally obtained parameters for kinetic models. Our system classifies full text documents regarding the question whether or not they contain appropriate data using a support vector machine. We evaluated our approach on a manually tagged corpus of 800 documents and found that it outperforms keyword searches in abstracts by a factor of five in terms of precision. (PMID: 15268772)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

LingPipe [BIONLP_100020026] Tool URL
Description LingPipe is a suite of NLP tools (in Java) with many features, including a named-entity detector, an approximate dictionary-match named-entity detector, a heuristic sentence boundary detector, a heuristic within-document coreference resolution engine and a set of tools for MEDLINE data.
Reference Carpenter, B. Phrasal queries with LingPipe and Lucene: ad hoc genomics text retrieval; NIST Special Publication: SP 500–261 The Thirteenth Text Retrieval Conference; TREC. 2004.
Abstract Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval. The hypothesis we explored for the Ad Hoc task of the Genomics track for TREC 2004 was that phrase-level queries would increase precision over a baseline of token-level terms. We implemented our approach using two open source tools: the Apache Jakarta Lucene TF/IDF search engine (version 1.3) and the Alias-i LingPipe tokenizer and named entity annotator (version 1.0.6). Contrary to our intuitions, the baseline system provided better performance in terms of recall and precision for almost every query at almost every precision/recall operating point. (No PubMed ref.)
Availability Online: -, Download: Y, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined

LitLinker [BIONLP_100020028] Tool URL
Description The design of LitLinker is based on Swanson's open discovery approach. LitLinker starts with a provided starting concept, which specifies the concept that the researcher wants to investigate. Next, LitLinker goes through a text mining process to find a set of terms (linking concepts) that are correlated with the starting concept. For each of the linking concepts, LitLinker uses the same text-mining process to identify a set of terms (target concepts) that are correlated with the linking concepts. Finally, LitLinker groups and ranks the target concepts by the number of linking concepts that connect the target concept to the starting concept. LitLinker returns a complex set of data with connections between medical concepts that are new to users. Because these connections are new, one of the most important aspects of the LitLinker interface must be the ability to examine how the connections were generated. The interface must help the user understand the text-mining process and allow them to examine how the terms are connected in the scientific literature. To understand how the connections were generated, users need to understand the difference between the three types of terms and how each is involved in the text-mining process. While helping the user form a conceptual model of how the connections were generated, the interface must also remain simple enough that it will not overwhelm the user. This is a challenge and a great opportunity to apply information visualization techniques to the text-mining process.
Reference Yetisgen-Yildiz M, Pratt W. Using statistical and knowledge-based approaches for literature-based discovery. J Biomed Inform. 2006 Dec;39(6):600-11. Epub 2006 Jan 4
Abstract Using statistical and knowledge-based approaches for literature-based discovery. The explosive growth in biomedical literature has made it difficult for researchers to keep up with advancements, even in their own narrow specializations. While researchers formulate new hypotheses to test, it is very important for them to identify connections to their work from other parts of the literature. However, the current volume of information has become a great barrier for this task and new automated tools are needed to help researchers identify new knowledge that bridges gaps across distinct sections of the literature. In this paper, we present a literature-based discovery system called LitLinker that incorporates knowledge-based methodologies with a statistical method to mine the biomedical literature for new, potentially causal connections between biomedical terms. We demonstrate LitLinker's ability to capture novel and interesting connections between diseases and chemicals, drugs, genes, or molecular sequences from the published biomedical literature. We also evaluate LitLinker's performance by using the information retrieval metrics of precision and recall. (PMID: 16442852)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

LitMiner [BIONLP_57] Tool URL
Description Keyword-based tool to predict relationships using statistical co-occurrence analysis.
Reference Demaine J, Martin J, Wei L, de Bruijn B. (2006). LitMiner: integration of library services within a bio-informatics application. Biomed Digit Libr. 2006 Oct 19;3:11
Abstract LitMiner is a literature data mining tool that is based on the annotation of key terms in article abstracts followed by statistical co-citation analysis of annotated key terms in order to predict relationships. Key terms belonging to four different categories are used for the annotation process: * Genes: Names of genes and gene products. Gene name recognition is based on Ensembl. Synonyms and aliases are resolved. * Chemical Compounds: Names of chemical compounds and their respective aliases. * Diseases and Phenotypes: Names of diseases and phenotypes. * Tissues and Organs: Names of tissues and organs. (PMID: 17052341)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 2
Query input Keyword; Gene/protein names; Gene/protein identifiers; Gene/protein lists; Gene/protein sequences;
Query output Bio-entity association list; Bio-entity co-occurrences; Bio-entity clusters/groups; Ranked articles; Phenotypes;
Keywords Information retrieval; Text Mining; Entity Recognition; Relation extraction; Gene/protein function; Knowledge Discovery
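Example (illustrative sketch, not LitMiner's implementation) The two stages named in the Abstract, annotating key terms per abstract and then analysing how often pairs of terms appear together, can be illustrated with a small scoring routine. Pointwise mutual information (PMI) is used here only as one common association measure; the statistic LitMiner actually applies is not specified in this entry. The annotated abstracts and function name are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(annotated_abstracts):
    """Score each co-occurring term pair by pointwise mutual information."""
    n = len(annotated_abstracts)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotated_abstracts:
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
        # estimated as fractions of abstracts.
        scores[(a, b)] = math.log2(n_ab * n / (term_counts[a] * term_counts[b]))
    return scores

# Each abstract is represented by the set of key terms annotated in it.
abstracts = [
    {"BRCA1", "breast cancer"},
    {"BRCA1", "breast cancer"},
    {"TP53", "breast cancer"},
    {"TP53", "liver"},
]
scores = pmi_scores(abstracts)
```

A positive score means the pair appears together more often than expected if the two terms were independent; a negative score means less often. With so few abstracts the values are only indicative.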

MarkerInfoFinder [BIONLP_100052020] Tool URL
Description Tool to extract information related to SNPs and genetic markers from the literature.
Reference Xuan W, Wang P, Watson SJ, Meng F. Medline search engine for finding genetic markers with biological significance. Bioinformatics. 2007 Sep 15;23(18):2477-84. Epub 2007 Sep 6
Abstract Medline search engine for finding genetic markers with biological significance. MOTIVATION: Genome-wide high density SNP association studies are expected to identify various SNP alleles associated with different complex disorders. Understanding the biological significance of these SNP alleles in the context of existing literature is a major challenge since existing search engines are not designed to search literature for SNPs or other genetic markers. The literature mining of gene and protein functions has received significant attention and effort while similar work on genetic markers and their related diseases is still in its infancy. Our goal is to develop a web-based tool that facilitates the mining of Medline literature related to genetic studies and gene/protein function studies. Our solution consists of four main function modules for (1) identification of different types of genetic markers or genetic variations in Medline records (2) distinguishing positive versus negative linkage or association between genetic markers and diseases (3) integrating marker genomic location data from different databases to enable the retrieval of Medline records related to markers in the same linkage disequilibrium region (4) and a web interface called MarkerInfoFinder to search, display, sort and download Medline citation results. Tests using published data suggest MarkerInfoFinder can significantly increase the efficiency of finding genetic disorders and their underlying molecular mechanisms. The functions we developed will also be used to build a knowledge base for genetic markers and diseases. AVAILABILITY: The MarkerInfoFinder is publicly available at: http://brainarray.mbni.med.umich.edu/brainarray/datamining/MarkerInfoFinder. (PMID: 17823133)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

MaSTerClass [BIONLP_58] Tool URL
Description Case-based reasoning system developed for term extraction and classification.
Reference Irena Spasic, Sophia Ananiadou and Junichi Tsujii. (2005) "MaSTerClass: a case-based reasoning system for the classification of biomedical terms," in Bioinformatics, Vol. 21, No. 11, pp. 2748-2758
Abstract MOTIVATION: The sheer volume of textually described biomedical knowledge exerts the need for natural language processing (NLP) applications in order to allow flexible and efficient access to relevant information. Specialized semantic networks (such as biomedical ontologies, terminologies or semantic lexicons) can significantly enhance these applications by supplying the necessary terminological information in a machine-readable form. With the explosive growth of bio-literature, new terms (representing newly identified concepts or variations of the existing terms) may not be explicitly described within the network and hence cannot be fully exploited by NLP applications. Linguistic and statistical clues can be used to extract many new terms from free text. The extracted terms still need to be correctly positioned relative to other terms in the network. Classification as a means of semantic typing represents the first step in updating a semantic network with new terms. RESULTS: The MaSTerClass system implements the case-based reasoning methodology for the classification of biomedical terms. (PMID: 15728115)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Free text (local);
Query output Bio-entity tagged text; Categorized text (classification);
Keywords Entity Recognition; Term extraction;

MaxMatcher [BIONLP_34345345] Tool URL
Description MaxMatcher is a biological concept extraction tool that uses dictionary-based approximate matching. The UMLS 2004AA version is used as the dictionary. Precision and recall on the GENIA 3.02 corpus are 71.60% and 75.18%, respectively.
Reference Zhou, X., Zhang, X., and Hu, X. Dragon Toolkit: Incorporating Auto-learned Semantic Knowledge into Large-Scale Text Retrieval and Mining. In proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), October 29-31, 2007
Abstract Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR. Genomic IR, characterized by its highly specific information need, severe synonym and polysemy problem, long term name and rapid growing literature size, is challenging IR community. In this paper, we are focused on addressing the synonym and polysemy issue within the language model framework. Unlike the ways translation model and traditional query expansion techniques approach this issue, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedicine domain; a set of synonymous terms will share same concept ID. Therefore, the new approach makes the document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows significant improvements are obtained by incorporating concept-based indexing into a basic language model. The MAP (mean average precision) is significantly raised from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of official runs participated in TREC 2004 Genomics Track and is comparable to the performance of the best run (40.75%). Most official runs including the best run extensively use various query expansion and pseudo-relevance feedback techniques while our approach does nothing except for the incorporation of concept-based indexing, which evidences the view that semantic smoothing, i.e. the incorporation of synonym and sense information into the language models, is a more standard approach to achieving the effects traditional query expansion and pseudo-relevance feedback techniques target. (PMID: )
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Term extraction
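Example: the precision and recall figures quoted in the description can be combined into a single balanced F-measure. The following is a minimal sketch, assuming the standard harmonic-mean definition F1 = 2PR/(P+R). The F1 value itself is not reported in this entry, and the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the MaxMatcher description (GENIA 3.02 corpus), in percent.
maxmatcher_f1 = f1_score(71.60, 75.18)
print(round(maxmatcher_f1, 2))
```

Because the harmonic mean is dominated by the lower of the two values, the result lies between precision and recall but closer to the precision figure.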

McSyBi [BIONLP_100052002] Tool URL
Description Multi-document Clustering System for Biomedicine
Reference Yamamoto Y, Takagi T. Biomedical knowledge navigation by literature clustering. J Biomed Inform. 2007 Apr;40(2):114-30. Epub 2006 Aug 5
Abstract Biomedical knowledge navigation by literature clustering. There is an urgent need for a system that facilitates surveys by biomedical researchers and the subsequent formulation of hypotheses based on the knowledge stored in literature. One approach is to cluster papers discussing a topic of interest and reveal its sub-topics that allow researchers to acquire an overview of the topic. We developed such a system called McSyBi. It accepts a set of citation data retrieved with PubMed and hierarchically and non-hierarchically clusters them based on the titles and the abstracts using statistical and natural language processing methods. A novel point is that McSyBi allows its users to change the clustering by entering a MeSH term or UMLS Semantic Type, and therefore they can see a set of citation data from multiple aspects. We evaluated McSyBi quantitatively and qualitatively: clustering of 27 sets of citation data (40643 different papers) and scrutiny of several resultant clusters. While non-hierarchical clustering provides us with an overview of the target topic, hierarchical clustering allows us to see more details and relationships among citation data. McSyBi is freely available at http://textlens.hgc.jp/McSyBi/. (PMID: 16996316)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

MedBlast [BIONLP_100052028] Tool URL
Description NLP-based retrieval system that returns relevant articles for a given query protein sequence.
Reference Tu Q, Tang H, Ding D. MedBlast: searching articles related to a biological sequence. Bioinformatics. 2004 Jan 1;20(1):75-7.
Abstract MedBlast: searching articles related to a biological sequence. In the genomic era, researchers often want to know more information about a biological sequence by retrieving its related articles. However, there is no available tool yet to achieve conveniently this goal. Here we developed a new literature-mining tool MedBlast, which uses natural language processing techniques, to retrieve the related articles of a given sequence. An online server of this program is also provided. AVAILABILITY: Both online server and the program are available freely at http://medblast.sibsnet.org (PMID: 14693811)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

MedEvi [BIONLP_100052031] Tool URL
Description MedEvi: search engine designed to select all sentences from Medline that can be aligned with the positions of terms in the initial multi-term query (permuted concordancer). Users of MedEvi have found the tool useful, for example, to find evidence in the literature on whether candidate chemicals are involved in a given metabolic pathway, to identify which proteins regulate a given set of proteins, and to find whether a given multi-term ontology concept actually appears in the literature even with a high degree of syntactic variation. MedEvi is based on the observation that terms with semantic relations are not found too far from each other. MedEvi further identifies additional keywords of biological and statistical significance from the local context of matching occurrences in order to help users reformulate their queries for better results.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Undefined
Query output Undefined
Keywords Undefined

MedlineRanker [BIONLP_11106] Tool URL
Description MedlineRanker ranks scientific abstracts from recent years of the Medline database using a query topic. It processes thousands of abstracts in a few seconds, or approximately one million abstracts in one minute. The output includes the best abstracts and cross-validation results, and highlights discriminative words.
Reference Fontaine JF, Barbosa-Silva A, Schaefer M, Huska MR, Muro EM, Andrade-Navarro MA. MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res. 2009 Jul 1;37(Web Server issue):W141-6.
Abstract MedlineRanker: flexible ranking of biomedical literature. The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker. (PMID: 19429696)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input PMID; PMID list; Keyword
Query output Ranked articles; Ranked list; Abstracts; PMID list; Confidence score
Keywords Information retrieval; Text Mining; Text classification; Abstracts

MedPost [BIONLP_100052026] Tool URL
Description MedPost: Part-of-speech tagger for biomedical literature (Medline citations).
Reference Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics. 2004 Sep 22;20(14):2320-1. Epub 2004 Apr 8.
Abstract MedPost: a part-of-speech tagger for bioMedical text. SUMMARY: We present a part-of-speech tagger that achieves over 97% accuracy on MEDLINE citations. AVAILABILITY: Software, documentation and a corpus of 5700 manually tagged sentences are available at ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz (PMID: 15073016)
Availability Online: N, Download: Y, Web service: -
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined

MEDSUM [BIONLP_3454566] Tool URL
Description MEDSUM: The MEDLINE/PubMed summary tool tackles a new aspect of bioinformatics: literature summary. Although there are other tools that provide some summary statistics, MEDSUM has a variety of novel features that make profiling authors and journals quicker and more informative, plus features such as "timeline", which allow you to research the growth of any field of literature in PubMed (or even the entire PubMed database) over time. MEDSUM is therefore designed as a general summary tool that can be extended for advanced uses such as analysing research trends for grant applications or review papers.
Reference Undefined
Abstract - (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 2
Query input Free text (paste); Free text (local); PMID; PMID list; Keyword; Authors; Journals; Gene/protein names; Compound/drug names; Abbreviations/Acronyms
Query output Summaries; Graphs; Tables; Categorized text (classification);
Keywords Literature Summary; Literature Statistics; Author Profile; Journal profile; Trends in Research; Journal Preferences; Publication History

MeInfoText [BIONLP_241] Tool URL
Description The MeInfoText database presents comprehensive association information about gene methylation and cancer, the profile of gene methylation among human cancer types and the gene methylation profile of a specific cancer type, based on association mining from large amounts of literature. In addition, MeInfoText offers integrated protein-protein interaction and biological pathway information collected from the Internet. MeInfoText also provides pathway cluster information regarding a set of genes which may contribute to the development of cancer due to aberrant methylation. The extracted evidence with highlighted keywords and the gene names identified from each methylation-related abstract is also retrieved. It will complement existing DNA methylation information and will be useful in epigenetics research and the prevention of cancer. This tool provides an online tutorial and allows the following search types: (1) Search for Associations among Gene, Methylation and Cancer using as input either a gene symbol or related keyword. (2) Multiple Searches for Gene Methylation Associations using as input official gene symbols separated by space. (3) Multiple Searches for the Profile of Gene Methylation across Human Cancer Types using as query input: official gene symbols separated by space and selecting cancer types. (Description based on the MeInfoText web-page). As output, this system matches the query gene symbol to gene databases (NCBI EntrezGene, SwissProt) and provides gene cross-reference information from these databases. It shows a summary of the number of papers where this gene is associated with hypermethylation, hypomethylation, methylation and histone, and methylation and stem cell. It also provides the number of sentences where the query gene is associated with methylation and cancer. To weight the returned output, a probability-based confidence score and a support score are calculated. The system also provides information on protein-protein interactions and a graphical interaction network output. It also allows searching with a list of gene symbols or restricting searches to specific cancer types.
Reference Fang YC, Huang HC, Juan HF. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics. 2008 Jan 14;9(1):22
Abstract MeInfoText: associated gene methylation and cancer information from text mining. ABSTRACT: BACKGROUND: DNA methylation is an important epigenetic modification of the genome. Abnormal DNA methylation may result in silencing of tumor suppressor genes and is common in a variety of human cancer cells. As more epigenetics research is published electronically, it is desirable to extract relevant information from biological literature. To facilitate epigenetics research, we have developed a database called MeInfoText to provide gene methylation information from text mining. Description: MeInfoText presents comprehensive association information about gene methylation and cancer, the profile of gene methylation among human cancer types and the gene methylation profile of a specific cancer type, based on association mining from large amounts of literature. In addition, MeInfoText offers integrated protein-protein interaction and biological pathway information collected from the Internet. MeInfoText also provides pathway cluster information regarding to a set of genes which may contribute the development of cancer due to aberrant methylation. The extracted evidence with highlighted keywords and the gene names identified from each methylation-related abstract is also retrieved. The database is now available at http://mit.lifescience.ntu.edu.tw/. CONCLUSION: MeInfoText is a unique database that provides comprehensive gene methylation and cancer association information. It will complement existing DNA methylation information and will be useful in epigenetics research and the prevention of cancer. (PMID: 18194557)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Gene/protein names; Gene/protein lists; Keyword
Query output Ranked list; Nr. documents; Confidence score; Sentences; Abstracts; PMIDs; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity tagged text; Bio-entity association list; Semantically labelled text; Bio-entity network; Bio-entity co-occurrences; Protein Interactions; Ranked gene/protein lists
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Relation extraction; Sentence extraction; Protein Interaction; Disease; Abstracts; Methylation; Hypermethylation; Hypomethylation

MeSHer [BIONLP_100052013] Tool URL
Description MeSHer is a system which allows analysis of groups of genes (e.g. resulting from a microarray experiment) using overrepresented MeSH terms from the associated literature (PubMed).
Reference Djebbari A, Karamycheva S, Howe E, Quackenbush J. MeSHer: identifying biological concepts in microarray assays based on PubMed references and MeSH terms. Bioinformatics. 2005 Aug 1;21(15):3324-6. Epub 2005 May 26
Abstract MeSHer: identifying biological concepts in microarray assays based on PubMed references and MeSH terms. SUMMARY: MeSHer uses a simple statistical approach to identify biological concepts in the form of Medical Subject Headings (MeSH terms) obtained from the PubMed database that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying DNA microarray platform. As a demonstration, we apply this approach to gene lists acquired from a published study of the effects of angiotensin II (Ang II) treatment on cardiac gene expression and demonstrate that this approach can aid in the interpretation of the resulting 'significant' gene set. AVAILABILITY: The software is available at http://www.tm4.org. SUPPLEMENTARY INFORMATION: Results from the analysis of significant genes from the published Ang II study. (PMID: 15919728)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
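Example: overrepresentation of a MeSH term in a gene set relative to the whole array, as described in the abstract, can be illustrated with a one-sided hypergeometric test. This is a minimal sketch and an assumption: the entry only says that MeSHer uses "a simple statistical approach", so the choice of test, the function name and the toy counts below are illustrative and not taken from MeSHer.

```python
from math import comb


def overrepresentation_pvalue(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    N: genes on the array        K: array genes annotated with the MeSH term
    n: genes in the selected set k: selected genes annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 genes on the array, 5 carry the term, and all 5 selected
# genes carry it.
print(overrepresentation_pvalue(5, 5, 5, 10))
```

In practice one p-value is computed per MeSH term, so some correction for multiple testing would normally be applied afterwards.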

MeshPubMed [BIONLP_10001] Tool URL
Description Explore PubMed/MEDLINE with Medical Subject Headings (MeSH). Together with GoPubMed, MeshPubMed is a knowledge-based search engine for biomedical texts. The Medical Subject Headings (MeSH) serve as a "table of contents" to structure the more than 16 million articles of the MEDLINE database. The search engine allows medical doctors (and biologists) to find relevant search results significantly faster. The technologies used in MeshPubMed are generic and can in general be applied to any kind of text and any kind of knowledge base. MeshPubMed is one of the first Web 2.0 search engines. The system was developed at the Technical University of Dresden by Michael Schroeder and his team and at Transinsight.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 2
Query input Undefined
Query output Undefined
Keywords Undefined

MetaMap [BIONLP_111011] Tool URL
Description MetaMap is a program for mapping biomedical text to concepts in the UMLS Metathesaurus or, equivalently, for discovering Metathesaurus concepts referred to in text. MetaMap uses a knowledge-intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Because of the intensiveness of its computations, it is not appropriate for real-time processing. On the other hand, it is thorough and is particularly adept at constructing partial matches when a phrase cannot be described by a single concept. MetaMap has been used to support tasks such as information retrieval, text mining, literature-based discovery, document indexing, classification and question answering. MetaMap's output normally consists of the best "mappings" for input text phrases, i.e., sets of Metathesaurus concepts which best match the input. Intermediate results, also available for output, consist of ranked lists of concepts (keywords, gene/protein names, ..., i.e., any concept in the UMLS Metathesaurus), a shallow parse of the text and a list of author-defined acronyms/abbreviations. MetaMap has been used by a range of different tools and applications, some of which are listed in the keyword section. MetaMap is one of the primary components of the NLM Medical Text Indexer (MTI), which is in daily use assisting NLM indexers in creating the MeSH indexing for MEDLINE citations.
Reference Aronson AR. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. Proc AMIA Symp. 2001:17-21.
Abstract Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. The UMLS Metathesaurus, the largest thesaurus in the biomedical domain, provides a representation of biomedical knowledge consisting of concepts classified by semantic type and both hierarchical and non-hierarchical relationships among the concepts. This knowledge has proved useful for many applications including decision support systems, management of patient records, information retrieval (IR) and data mining. Gaining effective access to the knowledge is critical to the success of these applications. This paper describes MetaMap, a program developed at the National Library of Medicine (NLM) to map biomedical text to the Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Besides being applied for both IR and data mining applications, MetaMap is one of the foundations of NLM's Indexing Initiative System which is being applied to both semi-automatic and fully automatic indexing of the biomedical literature at the library. (PMID: 11825149)
Availability Online: Y, Download: Y, Web service: Y
User relevance Biologist: 3, Curator: 2, NLP: 3
Query input Free text (paste); Free text (upload); Free text (local)
Query output Ranked list; Nr. documents; Confidence score; Keyword; Gene/protein names; Gene/protein identifiers; POS labelled text; Parses; Acronyms/Abbreviations; Ranked gene/protein lists; Geographical locations
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Term extraction; Acronym/abbreviation extraction; Text classification; Text clustering; Sentence extraction; Full text; Abstracts; Figure legends; Knowledge Discovery

METIS [BIONLP_100052025] Tool URL
Description METIS combines the PRECIS annotation tool with two information extraction systems: a set of Support Vector Machine classifiers and BioIE, which is rule-based. Starting with a single query sequence, PRECIS generates structured reports from sets of related Swiss-Prot entries. The information extraction systems can then be used to extract pertinent sentences from the biomedical literature.
Reference Mitchell AL, Divoli A, Kim JH, Hilario M, Selimas I, Attwood TK. METIS: multiple extraction techniques for informative sentences. Bioinformatics. 2005 Nov 15;21(22):4196-7. Epub 2005 Sep 13
Abstract METIS: multiple extraction techniques for informative sentences. SUMMARY: METIS is a web-based integrated annotation tool. From single query sequences, the PRECIS component allows users to generate structured protein family reports from sets of related Swiss-Prot entries. These reports may then be augmented with pertinent sentences extracted from online biomedical literature via support vector machine and rule-based sentence classification systems. AVAILABILITY: http://umber.sbs.man.ac.uk/dbbrowser/metis/ (PMID: 16159915)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

MimMiner [BIONLP_100043] Tool URL
Description Based on the MimMiner webpage: The relationships between genes/proteins are screened in several large-scale studies. However, the phenotypes involved in these relationships have not been evaluated systematically. These phenotype relationships have proven to be very useful in more specific studies and have potential as a functional genomics tool. We analysed the Online Mendelian Inheritance in Man (OMIM) database with various text mining algorithms and classified the phenotypes herein. The MimMiner page gives access to our results in a fast and easy way. The data can be searched for a specific phenotype. The related phenotypes can be displayed in two ways: (1) Table/Ranking: retrieve the related phenotypes of a specific phenotype. (2) Tree/Clustering: retrieve the related phenotypes of a specific phenotype, while the relations between all phenotypes are taken into account simultaneously.
Reference None
Abstract (PMID: )
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

MScanner [BIONLP_100020005] Tool URL
Description MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. MScanner uses the examples you provided to learn what relevant articles should look like (compared to the rest of Medline). It returns articles that are most likely to be relevant. The MScanner database is updated nightly. If the topic is narrow, MScanner can get a good estimate of its term distribution with as few as a dozen PubMed IDs. For broad topics, where many terms in the article could indicate relevance (database curators have this problem), you may need hundreds or thousands of examples.
Reference Poulter GL, Rubin DL, Altman RB, Seoighe C. MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics. 2008 Feb 19;9(1):108
Abstract MScanner: a classifier for retrieving Medline citations. ABSTRACT: BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. CONCLUSIONS: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu. (PMID: 18284683)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
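Example: the idea of learning from relevant examples and ranking other citations by their MeSH terms can be illustrated with a small naive Bayes style scorer. This is a minimal sketch under stated assumptions, not MScanner's implementation: the add-one smoothing, the choice to sum log-ratios over present terms only, the function names and the toy identifiers "1" to "3" are all hypothetical.

```python
from collections import Counter
from math import log


def train_term_weights(relevant_docs, background_docs):
    """Per-term log ratio of document frequencies, with add-one smoothing."""
    rel = Counter(t for doc in relevant_docs for t in set(doc))
    bg = Counter(t for doc in background_docs for t in set(doc))
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    weights = {}
    for term in set(rel) | set(bg):
        p_rel = (rel[term] + 1) / (n_rel + 2)
        p_bg = (bg[term] + 1) / (n_bg + 2)
        weights[term] = log(p_rel / p_bg)
    return weights


def score(doc_terms, weights):
    """Sum the weights of the terms present; unseen terms contribute zero."""
    return sum(weights.get(t, 0.0) for t in set(doc_terms))


def rank(candidates, weights):
    """Return candidate identifiers sorted by decreasing score."""
    return sorted(candidates, key=lambda doc_id: score(candidates[doc_id], weights),
                  reverse=True)


# Toy data: each document is represented by its set of MeSH terms.
relevant = [{"Methylation", "Neoplasms"}, {"Methylation"}]
background = [{"Mice"}, {"Neoplasms"}, {"Mice", "Humans"}]
candidates = {"1": {"Methylation"}, "2": {"Mice"}, "3": {"Neoplasms"}}
weights = train_term_weights(relevant, background)
print(rank(candidates, weights))
```

Representing each citation by a small set of index terms, as the abstract describes, keeps scoring cheap because only a handful of weights are summed per record.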

mSTRAP [BIONLP_18172931] Tool URL
Description mSTRAP is an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations, tasks which are well known to be tedious, time-consuming, complex, and error-prone. To run this system you need to register (name and e-mail) and install Java (version 1.6 or above), ClustalW, and MODELLER (version 9v2).
Reference Kanagasabai R, Choo KH, Ranganathan S, Baker CJ. A workflow for mutation extraction and structure annotation. J Bioinform Comput Biol. 2007 Dec;5(6):1319-37
Abstract A workflow for mutation extraction and structure annotation. Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations -- tasks which are well known to be tedious, time-consuming, complex, and error-prone. The ontology and visualization tool are available at (http://datam.i2r.a-star.edu.sg/mstrap). (PMID: 18172931)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Information extraction; Full text

MutationFinder [BIONLP_100052035] Tool URL
Description MutationFinder is a tool that automatically extracts mentions of amino acid residue mutations (point mutations) from the literature. It can be downloaded and run locally to extract mutation mentions from a large collection of abstracts. On blind test data, its authors report nearly perfect precision and markedly better recall than a baseline.
Reference Caporaso JG, Baumgartner WA Jr, Randolph DA, Cohen KB, Hunter L. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics. 2007 Jul 15;23(14):1862-5. Epub 2007 May 11
Abstract MutationFinder: a high-performance system for extracting point mutation mentions from text. Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline. AVAILABILITY: MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications. PROJECT URL: http://bionlp.sourceforge.net. (PMID: 17495998)
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 2, Curator: 2, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined
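Example (illustrative sketch only, not MutationFinder's actual rule set) The abstract describes a rule-based extractor for point mutation mentions. The idea can be sketched with a single regular expression for one-letter mentions such as "A123T" (wild-type residue, position, mutant residue). The pattern, the function name `find_point_mutations` and the sample sentence are assumptions made for illustration; a pattern this simple would also match look-alike strings such as cell line names, which a real system has to filter out.

```python
import re

# Hypothetical, simplified rule: one-letter wild-type residue, position,
# one-letter mutant residue (e.g. "A123T"). Not MutationFinder's real rules.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(
    rf"\b([{ONE_LETTER}])([1-9][0-9]*)([{ONE_LETTER}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (wild_type, int(position), mutant)
        for wild_type, position, mutant in POINT_MUTATION.findall(text)
        if wild_type != mutant  # a substitution must change the residue
    ]


mentions = find_point_mutations(
    "The A123T and G12D substitutions reduced activity."
)
```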

OBO-Annotator [BIONLP_4534546] Tool URL
Description The OBO-Annotator is a semantic NLP tool designed to give its end-users a great deal of flexibility to combine any number of OBO ontologies from the OBO Foundry, regardless of their format, and use them to annotate text collections.
Reference Undefined
Abstract (PMID: )
Availability Online: N, Download: Y, Web service: N
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Undefined
Query output Undefined
Keywords Gene Ontology

OSIRISv1.2 [BIONLP_100016] Tool URL
Description Sequence variants, in particular Single Nucleotide Polymorphisms (SNPs), are considered key elements in fields such as genetic epidemiology and pharmacogenomics [Palmer and Cardon, 2005]. Researchers in these areas are interested in finding genes associated with diseases or with drug responses, as well as in selecting the relevant sequence variants on candidate genes for genotyping studies. Several public databases are available containing sequence information on genes and proteins (NCBI Entrez, SwissProt and many others). Data on sequence variants can be found at other public resources such as NCBI dbSNP and HapMap. In contrast, information about phenotypic consequences of the sequence variants of genes is generally found as non-structured text in the biomedical literature. However, the identification of the relevant documents and the extraction of the information from them are often hampered by the lack of widely accepted standard notation for genes, proteins and sequence variants in the biomedical literature, and by the large size of current literature databases. Bearing this in mind, automatic systems for the identification of gene/protein entities and their corresponding sequence variants from biomedical texts are required. Our group have previously reported the development of OSIRIS, a search system that integrates different sources of information and incorporates ad-hoc tools for synonymy generation with the aim of retrieving literature about sequence variation of a gene using PubMed search engine. We have developed a new version of OSIRIS as a first step towards an integrated text mining system for the extraction of information about genes, sequence variants and related phenotypes. The new implementation of OSIRIS (OSIRISv1.2) incorporates a new entity recognition module and is built on top of a local mirror of MEDLINE collection and HgenetInfoDB. 
HgenetInfoDB is a database that integrates data on human genes from the NCBI Gene database and dbSNP. The entity recognition module is based on a corpus of articles annotated with gene identifiers and on the new search algorithm, which uses a pattern-based search strategy and a sequence variant nomenclature dictionary to identify terms denoting SNPs and other sequence variants and to map them to dbSNP entries. The use of OSIRISv1.2 generates a corpus of annotated literature linked to sequence database entries (NCBI Gene and dbSNP). The results of the searches are stored in a database that can be used to query the results and, in the future, for the extraction of relationships among biological entities. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision at 82% recall, and an F-score of 0.9.
Reference Undefined
Abstract (PMID: 18251998)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
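Example (illustrative arithmetic) The description reports 99% precision, 82% recall and an F-score of 0.9. Assuming the balanced F-score (harmonic mean of precision and recall, beta = 1), the figures are consistent, as this short check shows. The function name `f_score` is chosen here for illustration.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values reported in the OSIRISv1.2 description.
osiris_f1 = f_score(0.99, 0.82)
print(round(osiris_f1, 3))
```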

PAKORA [BIONLP_11103] Tool URL
Description PAKORA combines several approaches to mine literature-based information associated with a list of differentially expressed genes (DEG) and to search that information for terms or biological concepts that are significantly over-represented.
Reference Leong, H.S. and Kipling, D. (2009). Text-based over-representation analysis of microarray gene lists with annotation bias.
Abstract Text-based over-representation analysis of microarray gene lists with annotation bias. A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach to address this is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free-text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that the constituents have substantially more annotation associated with them, as they have been researched upon for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for such bias. We have therefore developed three approaches to overcome this bias, and demonstrate their usability in a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be possible by mining pre-defined ontologies alone. (PMID: 19429895)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Gene/protein identifiers; Gene/protein lists; Affymetrix IDs; EntrezGene IDs
Query output Over-represented tokens and their corresponding p-values.
Keywords Text Mining; Microarray; Abstracts; Over-representation analysis
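Example (illustrative sketch of the classical test, not PAKORA's bias-corrected methods) The abstract explains that over-representation analysis (ORA) classically uses the hypergeometric test to ask whether a term occurs in a gene list more often than expected by chance. The sketch below computes that upper-tail p-value with the standard library only. The function name and the toy numbers are assumptions; as the abstract notes, this classical test does not account for annotation bias, which is the problem PAKORA's three approaches were developed to address.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the list, k: list genes annotated with the term.
    """
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: 10 background genes, 3 carry the term,
# and a list of 4 genes contains 2 of them.
p_value = hypergeom_upper_tail(k=2, K=3, n=4, N=10)
```

With these toy numbers the p-value is 70/210, i.e. one third, so the term would not be called over-represented.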

Pat2PDF [BIONLP_100013] Tool URL
Description Pat2PDF is an online service for downloading copies of U.S. patents from the U.S. Patent and Trademark Office as PDF documents.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PathBinderH [BIONLP_100020007] Tool URL
Description PathBinderH is a gateway that provides enhanced access to PubMed allowing: (1) Relevant information retrieval (i.e., only abstracts that contain both search terms in one sentence are retrieved) (2) Filters based on species of interest (3) Ontology-based information retrieval (using terms from the Gene & Plant Ontology, Enzyme Nomenclature & MeSH)
Reference Ding J, Viswanathan K, Berleant D, Hughes L, Wurtele ES, Ashlock D, Dickerson JA, Fulmer A, Schnable PS. Using the biological taxonomy to access biological literature with PathBinderH. Bioinformatics. 2005 May 15;21(10):2560-2. Epub 2005 Mar 15.
Abstract (PMID: 15769838)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
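Example (illustrative sketch, not PathBinderH's implementation) The description states that only abstracts containing both search terms in one sentence are retrieved. A minimal version of that filter is sketched below. The naive sentence splitter, the case-insensitive substring matching and the function name `sentences_with_both` are simplifying assumptions; abbreviations such as "e.g." would break this splitter.

```python
import re


def sentences_with_both(abstract, term_a, term_b):
    """Return the sentences of an abstract that mention both terms."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    a, b = term_a.lower(), term_b.lower()
    return [s for s in sentences if a in s.lower() and b in s.lower()]


def abstract_is_relevant(abstract, term_a, term_b):
    """An abstract qualifies only if one sentence contains both terms."""
    return bool(sentences_with_both(abstract, term_a, term_b))


sample = "CDK5 is a kinase. CDK5 phosphorylates tau in neurons. Tau aggregates."
evidence = sentences_with_both(sample, "CDK5", "tau")
```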

PDQ Wizard [BIONLP_100052000] Tool URL
Description Tool for literature-based gene prioritization
Reference Grimes GR, Wen TQ, Mewissen M, Baxter RM, Moodie S, Beattie JS, Ghazal P. PDQ Wizard: automated prioritization and characterization of gene and protein lists using biomedical literature. Bioinformatics. 2006 Aug 15;22(16):2055-7
Abstract PDQ Wizard: automated prioritization and characterization of gene and protein lists using biomedical literature. SUMMARY: PDQ Wizard automates the process of interrogating biomedical references using large lists of genes, proteins or free text. Using the principle of linkage through co-citation biologists can mine PubMed with these proteins or genes to identify relationships within a biological field of interest. In addition, PDQ Wizard provides novel features to define more specific relationships, highlight key publications describing those activities and relationships, and enhance protein queries. PDQ Wizard also outputs a metric that can be used for prioritization of genes and proteins for further research. AVAILABILITY: PDQ Wizard is freely available from http://www.gti.ed.ac.uk/pdqwizard/. (PMID: 16809392)
Availability Online: -, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PepBank [BIONLP_100051999] Tool URL
Description Text-mining tool for extracting peptide sequence data from MEDLINE abstracts. It has been used to assist in the development of the PepBank database.
Reference Shtatland T, Guettler D, Kossodo M, Pivovarov M, Weissleder R. PepBank--a database of peptides based on sequence text mining and public peptide data sources. BMC Bioinformatics. 2007 Aug 1;8:280.
Abstract PepBank--a database of peptides based on sequence text mining and public peptide data sources. BACKGROUND: Peptides are important molecules with diverse biological functions and biomedical uses. To date, there does not exist a single, searchable archive for peptide sequences or associated biological data. Rather, peptide sequences still have to be mined from abstracts and full-length articles, and/or obtained from the fragmented public sources. DESCRIPTION: We have constructed a new database (PepBank), which at the time of writing contains a total of 19,792 individual peptide entries. The database has a web-based user interface with a simple, Google-like search function, advanced text search, and BLAST and Smith-Waterman search capabilities. The major source of peptide sequence data comes from text mining of MEDLINE abstracts. Another component of the database is the peptide sequence data from public sources (ASPD and UniProt). An additional, smaller part of the database is manually curated from sets of full text articles and text mining results. We show the utility of the database in different examples of affinity ligand discovery. CONCLUSION: We have created and maintain a database of peptide sequences. The database has biological and medical applications, for example, to predict the binding partners of biologically interesting peptides, to develop peptide based therapeutic or diagnostic agents, or to predict molecular targets or binding specificities of peptides resulting from phage display selection. The database is freely available on http://pepbank.mgh.harvard.edu/, and the text mining source code (Peptide::Pubmed) is freely available above as well as on CPAN (http://www.cpan.org/). (PMID: 17678535)
Availability Online: -, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PhenoGO [BIONLP_3435435] Tool URL
Description PhenoGO is a computed database designed for high-throughput mining that adds phenotypic and experimental context, such as the cell type, disease, tissue and organ, to existing annotations between gene products and Gene Ontology (GO) terms as specified in the Gene Ontology Annotations (GOA) for multiple model organisms. Phenotypic and experimental (P&E) contexts are computationally mapped to identifiers from general biological ontologies, including the Cell Ontology (CO), phenotypes from the Unified Medical Language System (UMLS) and species from the National Center for Biotechnology Information (NCBI) Taxonomy, and from specialized ontologies such as the Mammalian Phenotype Ontology (MP) and Mouse Anatomy (MA). PhenoGO is computed using natural language processing (NLP), and thus some mappings are inaccurate. It is possible to download the database as a CSV format file. Allowed query fields include: PMIDs, gene accession numbers, gene names, gene descriptions, GO terms, and phenotypic or experimental context identifiers from the following ontologies: Cell Ontology, UMLS, NCBI Taxonomy, MP or MA. PhenoGO contains data from the following publicly available primary databases: FlyBase, MGI, SGD, WormBase and GO Annotations at EBI.
Reference Lussier Y, Borlawsky T, Rappaport D, Liu Y, Friedman C. PhenoGO: assigning phenotypic context to gene ontology annotations with natural language processing. Pac Symp Biocomput. 2006:64-75.
Abstract PhenoGO: assigning phenotypic context to gene ontology annotations with natural language processing. Natural language processing (NLP) is a high throughput technology because it can process vast quantities of text within a reasonable time period. It has the potential to substantially facilitate biomedical research by extracting, linking, and organizing massive amounts of information that occur in biomedical journal articles as well as in textual fields of biological databases. Until recently, much of the work in biological NLP and text mining has revolved around recognizing the occurrence of biomolecular entities in articles, and in extracting particular relationships among the entities. Now, researchers have recognized a need to link the extracted information to ontologies or knowledge bases, which is a more difficult task. One such knowledge base is Gene Ontology annotations (GOA), which significantly increases semantic computations over the function, cellular components and processes of genes. For multicellular organisms, these annotations can be refined with phenotypic context, such as the cell type, tissue, and organ because establishing phenotypic contexts in which a gene is expressed is a crucial step for understanding the development and the molecular underpinning of the pathophysiology of diseases. In this paper, we propose a system, PhenoGO, which automatically augments annotations in GOA with additional context. PhenoGO utilizes an existing NLP system, called BioMedLEE, an existing knowledge-based phenotype organizer system (PhenOS) in conjunction with MeSH indexing and established biomedical ontologies. More specifically, PhenoGO adds phenotypic contextual information to existing associations between gene products and GO terms as specified in GOA. 
The system also maps the context to identifiers that are associated with different biomedical ontologies, including the UMLS, Cell Ontology, Mouse Anatomy, NCBI taxonomy, GO, and Mammalian Phenotype Ontology. In addition, PhenoGO was evaluated for coding of anatomical and cellular information and assigning the coded phenotypes to the correct GOA; results obtained show that PhenoGO has a precision of 91% and recall of 92%, demonstrating that the PhenoGO NLP system can accurately encode a large number of anatomical and cellular ontologies to GO annotations. The PhenoGO Database may be accessed at the following URL: http://www.phenoGO.org (PMID: 17094228)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Free text (paste); Keyword; Gene/protein names; Gene/protein identifiers
Query output Ranked list; Nr. documents; Confidence score; Articles; Sentences; Abstracts; PMIDs; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity association list; Bio-entity co-occurrences; Phenotypes; GO-protein associations
Keywords Information extraction; Gene/protein normalization; Relation extraction; Gene/protein function; Gene Ontology; Abstracts

Phospho.ELM [BIONLP_100052044] Tool URL
Description Online database of S/T/Y phosphorylation sites which uses a text mining system to extract the phosphorylation relation from the literature.
Reference Narayanaswamy M, Ravikumar KE, Vijay-Shanker K. Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics. 2005 Jun;21 Suppl 1:i319-27
Abstract Beyond the clause: extraction of phosphorylation information from medline abstracts. MOTIVATION: Phosphorylation is an important biochemical reaction that plays a critical role in signal transduction pathways and cell-cycle processes. A text mining system to extract the phosphorylation relation from the literature is reported. The focus of this paper is on the new methods developed and implemented to connect and merge pieces of information about phosphorylation mentioned in different sentences in the text. The effectiveness and accuracy of the system as a whole as well as that of the methods for extraction beyond a clause/sentence is evaluated using an independently annotated dataset, the Phospho.ELM database. The new methods developed to merge pieces of information from different sentences are shown to be effective in significantly raising the recall without much difference in precision. (PMID: 15961474)
Availability Online: Y, Download: -, Web service: Y
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PLAN2L [BIONLP_11108] Tool URL
Description PLAN2L (Plant annotation to literature) is a web tool for integrated text mining and literature-derived bio-entity relation extraction. Although most of the underlying technology could in principle be adapted to other organisms (or bio-topics), PLAN2L aims to provide a literature mining application adapted to the needs of researchers studying the model organism Arabidopsis thaliana. The system extracts biologically relevant information at various levels, covering the detection and ranking of biological descriptions related to cell cycle, protein interactions, gene regulation, cellular locations and developmental processes (flower, root, leaf and seed development). It also provides biological relations between specific bio-entities: gene regulation relations and protein interactions, together with the corresponding textual evidence sentences.
Reference Krallinger M, Rodriguez-Penagos C, Tendulkar A, Valencia A. PLAN2L: a web tool for integrated text mining and literature-based bioentity relation extraction. Nucleic Acids Res. 2009 Jul 1;37(Web Server issue):W160-5.
Abstract PLAN2L: a web tool for integrated text mining and literature-based bioentity relation extraction. There is an increasing interest in using literature mining techniques to complement information extracted from annotation databases or generated by bioinformatics applications. Here we present PLAN2L, a web-based online search system that integrates text mining and information extraction techniques to access systematically information useful for analyzing genetic, cellular and molecular aspects of the plant model organism Arabidopsis thaliana. Our system facilitates a more efficient retrieval of information relevant to heterogeneous biological topics, from implications in biological relationships at the level of protein interactions and gene regulation, to sub-cellular locations of gene products and associations to cellular and developmental processes, i.e. cell cycle, flowering, root, leaf and seed development. Beyond single entities, also predefined pairs of entities can be provided as queries for which literature-derived relations together with textual evidences are returned. PLAN2L does not require registration and is freely accessible at http://zope.bioinfo.cnio.es/plan2l. (PMID: 19520768)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Free text (paste); Keyword; Gene/protein names; Gene/protein identifiers; Compound/drug names; Abbreviations/Acronyms
Query output Ranked list; Sentences; Abstracts; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity tagged text; Bio-entity association list; Gene/Protein labelled text; Gene/Protein normalized text; Bio-entity co-occurrences; Protein Interactions; Gene regulation pairs; Categorized text (classification); Ranked articles; Experimental method mentions; Phenotypes
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Relation extraction; Text classification; Sentence extraction; Protein Interaction; Gene regulation; Gene/protein function; Full text; Abstracts

PMD2HD [BIONLP_11109] Tool URL
Description This service automatically checks whether the articles listed in a PubMed search are held in the central library collection of the DKFZ. The user works with PubMed as usual, but can now check how to obtain each article by marking the complete results page and copy-pasting it into a web form. A script fetches this data, analyses it and checks it with a Z39.50 request to the local library server (Horizon). The data analysed are the year entry and the journal name. As a result, the user gets a list of the articles and, where possible, a full-text link by DOI, the library signature, or a link to the eJournal; if nothing is available, an online link to the library's loan request web form is given, with all data filled in automatically.
Reference Bohne-Lang A, Lang E, Taube A. PMD2HD--a web tool aligning a PubMed search results page with the local German Cancer Research Centre library collection. Biomed Digit Libr. 2005 Jun 27;2:4.
Abstract PMD2HD--a web tool aligning a PubMed search results page with the local German Cancer Research Centre library collection. BACKGROUND: Web-based searching is the accepted contemporary mode of retrieving relevant literature, and retrieving as many full text articles as possible is a typical prerequisite for research success. In most cases only a proportion of references will be directly accessible as digital reprints through displayed links. A large number of references, however, have to be verified in library catalogues and, depending on their availability, are accessible as print holdings or by interlibrary loan request. METHODS: The problem of verifying local print holdings from an initial retrieval set of citations can be solved using Z39.50, an ANSI protocol for interactively querying library information systems. Numerous systems include Z39.50 interfaces and therefore can process Z39.50 interactive requests. However, the programmed query interaction command structure is non-intuitive and inaccessible to the average biomedical researcher. For the typical user, it is necessary to implement the protocol within a tool that hides and handles Z39.50 syntax, presenting a comfortable user interface. RESULTS: PMD2HD is a web tool implementing Z39.50 to provide an appropriately functional and usable interface to integrate into the typical workflow that follows an initial PubMed literature search, providing users with an immediate asset to assist in the most tedious step in literature retrieval, checking for subscription holdings against a local online catalogue. CONCLUSION: PMD2HD can facilitate literature access considerably with respect to the time and cost of manual comparisons of search results with local catalogue holdings. The example presented in this article is related to the library system and collections of the German Cancer Research Centre. 
However, the PMD2HD software architecture and use of common Z39.50 protocol commands allow for transfer to a broad range of scientific libraries using Z39.50-compatible library information systems. (PMID: 15982415)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input PMID; PMID list; Authors
Query output Nr. documents; Articles; Abstracts; PMIDs;
Keywords Literature repository

PPI finder [BIONLP_11104] Tool URL
Description A web-based tool, which mines human PPIs from PubMed abstracts based on their co-occurrences and interaction words, followed by evidences in human PPI databases and shared terms in GO database.
Reference He M, Wang Y, Li W. PPI Finder: A mining tool for human protein-protein interactions. PLoS ONE, 2009; 4:e4554
Abstract PPI finder: a mining tool for human protein-protein interactions. BACKGROUND: The exponential increase of published biomedical literature prompts the use of text mining tools to manage the information overload automatically. One of the most common applications is to mine protein-protein interactions (PPIs) from PubMed abstracts. Currently, most tools in mining PPIs from literature are using co-occurrence-based approaches or rule-based approaches. Hybrid methods (frame-based approaches) by combining these two methods may have better performance in predicting PPIs. However, the predicted PPIs from these methods are rarely evaluated by known PPI databases and co-occurred terms in Gene Ontology (GO) database. METHODOLOGY/PRINCIPAL FINDINGS: We here developed a web-based tool, PPI Finder, to mine human PPIs from PubMed abstracts based on their co-occurrences and interaction words, followed by evidences in human PPI databases and shared terms in GO database. Only 28% of the co-occurred pairs in PubMed abstracts appeared in any of the commonly used human PPI databases (HPRD, BioGRID and BIND). On the other hand, of the known PPIs in HPRD, 69% showed co-occurrences in the literature, and 65% shared GO terms. CONCLUSIONS: PPI Finder provides a useful tool for biologists to uncover potential novel PPIs. It is freely accessible at http://liweilab.genetics.ac.cn/tm/. (PMID: 19234603)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 2
Query input PMID; Keyword; Gene/protein names; Gene/protein identifiers; Abbreviations/Acronyms
Query output Ranked list; Nr. documents; Confidence score; Articles; Sentences; PMIDs; Gene/protein names; Gene/protein identifiers; Bio-entity tagged text; Bio-entity association list; Semantically labelled text; Bio-entity co-occurrences; Protein Interactions; GO-protein associations; Abstract list containing the two designated genes/proteins; General info about the input gene/protein; Protein/gene/relation highlighted in the relevant PubMed abstract with full text link; Evidence in Gene Ontology and other databases, and computed evidence score
Keywords Information extraction; Information retrieval; Entity Recognition; Relation extraction; Sentence extraction; Protein Interaction; Gene/protein function; Gene Ontology; Abstracts; Literature repository

Protein Corral [BIONLP_100052033] Tool URL
Description Protein Corral is a web application to extract associations between UniProt protein/gene names. The associations are ranked in three levels of confidence: (1) Ppi: Pattern matching (natural language processing), being the highest level, this method is precision based. (2) Co3: Tri Co-Occurrence, two protein/gene names are found in conjunction with a verb in a sentence. Overall, the number of times (the number of sentences) the three entities appear together is greater than what can be attributed to chance. This method offers an intermediate confidence level and is a mid step between precision and recall. (3) Co: Co-Occurrence, being the lowest level of confidence, is recall based. The results are shown in a table that displays all the associations and links them to the sentences that support them as well as to the original abstracts. When appropriate, the involved verbs are also displayed. The table is sorted by relevance so that the associations better supported by the evidence are found higher up. Your terms will be looked up throughout Medline and several abstracts will thus be retrieved and analysed. In the simple interface the higher limit is 500 to make the process quick. You can set a higher limit through the Advanced Search interface. (Based on information from the Protein Corral webpage)
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PUBCLUST [BIONLP_100020008] Tool URL
Description PubClust allows the user to organize retrieved PubMed abstracts into a set of topically 'homogeneous' clusters. The user submits a query in the standard NCBI ENTREZ format; terms can be combined with the same logical operators used in ordinary PubMed queries. The system executes the abstract search on the PubMed database. In line with the NCBI restriction rules, the program classifies a maximum of 5000 documents. The system removes empty abstracts from the retrieved set. The classification procedure uses only the abstract text, without MeSH terms or any other kind of dictionary. To access PubClust a user must be registered: an e-mail address is required, and the domain and the IP are checked. After registration the user can submit a query, following the canonical PubMed query structure. The time range for the query can be defined using the ENTREZ date options: 90 days, 180 days, 1 year, 5 years and 10 years. The retrieved set of abstracts is presented in tabular form. The output shows two levels of abstract clusters (blue bars indicate high-level clusters; light brown bars indicate low-level clusters). Low-level sub-clusters can be displayed by clicking the "expand" link on a high-level blue bar. Each bar shows the main relevant terms its abstracts have in common. In some cases the system cannot find common terms, and an informative message appears in the blue bar instead. Each subgroup is also characterised by its own terms. In each subset the PMIDs of the abstracts are shown. The user can view a single abstract by clicking on the corresponding PMID. The original query can be refined interactively using terms from the clusters.
The user can select a term by clicking on the [+] sign adjacent to each word; the combination of selected words creates a new AND query to submit for a subsequent PubMed search. It is not possible, at the moment, to save the original query. The user can save the final search locally as an HTML document by using the browser options. The saved HTML document retains access to the PMIDs of the documents. The figure below exemplifies the output of the PubClust system.
Reference Fattore M, Arrigo P. Knowledge discovery and system biology in molecular medicine: an application on neurodegenerative diseases. In Silico Biol. 2005;5(2):199-208
Abstract Knowledge discovery and system biology in molecular medicine: an application on neurodegenerative diseases. The possibility to study an organism in terms of system theory has been proposed in the past, but only the advancement of molecular biology techniques allows us to investigate the dynamical properties of a biological system in a more quantitative and rational way than before. These new techniques can give only the basic level view of an organism's functionality. The comprehension of its dynamical behaviour depends on the possibility to perform a multiple level analysis. Functional genomics has stimulated the interest in the investigation of the dynamical behaviour of an organism as a whole. These activities are commonly known as System Biology, and its interests range from molecules to organs. One of the more promising applications is the 'disease modeling'. The use of experimental models is a common procedure in pharmacological and clinical researches; today this approach is supported by 'in silico' predictive methods. This investigation can be improved by a combination of experimental and computational tools. The Machine Learning (ML) tools are able to process different heterogeneous data sources; taking into account this peculiarity, they could be fruitfully applied to support a multilevel data processing (molecular, cellular and morphological) that is the prerequisite for the formal model design; these techniques can allow us to extract the knowledge for mathematical model development. The aim of our work is the development and implementation of a system that combines ML and dynamical models simulations. The program is addressed to the virtual analysis of the pathways involved in neurodegenerative diseases. These pathologies are multifactorial diseases and the relevance of the different factors has not yet been well elucidated. This is a very complex task; in order to test the integrative approach our program has been limited to the analysis of the effects of a specific protein, the Cyclin dependent kinase 5 (CDK5) which relies on the induction of neuronal apoptosis. The system has a modular structure centred on a textual knowledge discovery approach. The text mining is the only way to enhance the capability to extract, from multiple data sources, the information required for the dynamical simulator. The user may access the publicly available modules through the following site: http://biocomp.ge.ismac.cnr.it. (PMID: 15972015)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubCrawler [BIONLP_100010] Tool URL
Description PubCrawler is a free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases. PubCrawler helps keep scientists informed of the current contents of Medline and GenBank by listing new database entries that match their research interests.
Reference Undefined
Abstract (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubFinder [BIONLP_100020009] Tool URL
Description PubFinder allows the user to retrieve and rank all relevant publications found for a specific scientific topic. Thus, it is possible to use PubFinder to continuously update databases dealing with a specific subject by automatically adding the latest reference data. The purpose of this web service is to automatically extract PubMed abstracts that deal with a specific scientific subject. The user has to define a set of representative abstracts that delineate a certain scientific topic well. This set of characteristic abstracts is specified by entering the corresponding PubMed IDs. Based on these abstracts, a list of so-called discriminating words is calculated, which is used for scoring all available PubMed abstracts for their probability of belonging to the user-defined topic.
Reference Goetz T, von der Lieth CW. PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W774-8
Abstract PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Since it is becoming increasingly laborious to manually extract useful information embedded in the ever-growing volumes of literature, automated intelligent text analysis tools are becoming more and more essential to assist in this task. PubFinder (www.glycosciences.de/tools/PubFinder) is a publicly available web tool designed to improve the retrieval rate of scientific abstracts relevant for a specific scientific topic. Only the selection of a representative set of abstracts is required, which are central for a scientific topic. No special knowledge concerning the query-syntax is necessary. Based on the selected abstracts, a list of discriminating words is automatically calculated, which is subsequently used for scoring all defined PubMed abstracts for their probability of belonging to the defined scientific topic. This results in a hit-list of references in the descending order of their likelihood score. The algorithms and procedures implemented in PubFinder facilitate the perpetual task for every scientist of staying up-to-date with current publications dealing with a specific subject in biomedicine. (PMID: 15980583)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
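
The scoring idea described for PubFinder (derive discriminating words from a seed set of abstracts, then score other abstracts with them) can be illustrated with a minimal sketch. The exact PubFinder formula is not given in this entry, so the smoothed log-odds weighting below is an assumed stand-in, and the function names (tokenize, discriminating_words, score) and the min_weight threshold are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def discriminating_words(topic_docs, background_docs, min_weight=0.5):
    """Weight words by smoothed log-odds of occurring in topic vs. background documents.

    This weighting is an assumption for illustration, not PubFinder's published method.
    """
    topic_df = Counter(w for doc in topic_docs for w in tokenize(doc))
    back_df = Counter(w for doc in background_docs for w in tokenize(doc))
    weights = {}
    for word, count in topic_df.items():
        # Add-one smoothing keeps the ratio finite for words unseen in the background.
        p_topic = (count + 1) / (len(topic_docs) + 2)
        p_back = (back_df[word] + 1) / (len(background_docs) + 2)
        weight = math.log(p_topic / p_back)
        if weight >= min_weight:
            weights[word] = weight
    return weights


def score(abstract, weights):
    """Sum the weights of the discriminating words present in an abstract."""
    return sum(weights.get(word, 0.0) for word in tokenize(abstract))
```

Abstracts would then be sorted by descending score to obtain a hit list; words that are equally common in the seed set and the background receive a weight near zero and are dropped.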

PubGene [BIONLP_240] Tool URL
Description PubGene is an online system which retrieves information from the literature on genes, proteins, protein sequences, compounds, GO terms and diseases (MeSH). The result is a "literature network" organizing information in a form that is easy to navigate. It provides a graphical visualization of the resulting network and an association summary of the most recent literature, sequence homology and literature neighbors. The sensitivity range can also be changed (document count). It has three main search interfaces: Bio Networks, Bio Associations and sequence Homology. It also allows searching for co-occurrences of protein interaction keywords.
Reference Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001 May;28(1):21-8
Abstract A literature network of human genes for high-throughput analysis of gene expression. We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have collectively been named 'PubGene'. We validated the extracted networks by three large-scale experiments showing that co-occurrence reflects biologically meaningful relationships, thus providing an approach to extract and structure known biology. We validated the applicability of the tools by analyzing two publicly available microarray data sets. (PMID: 11326270)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Gene/protein names; Gene/protein sequences; Keyword
Query output Ranked list; Nr. documents; Confidence score; Abstracts; PMIDs; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity association list; Bio-entity network; Bio-entity co-occurrences; Protein Interactions; Bio-entity synonyms; GO-protein associations; Date
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Relation extraction; Protein Interaction; Gene/protein function; Chemical compound; Disease; Gene Ontology; Abstracts
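
The gene-to-gene co-citation network described in the PubGene abstract can be sketched as pair counting over records. This is a minimal illustration that assumes gene names have already been recognised in each record; the function name cooccurrence_network and the input layout (record identifier mapped to a list of gene symbols) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Count, for every gene pair, the number of records mentioning both genes.

    `records` maps a record identifier to the gene symbols found in its
    title and abstract (entity recognition is assumed to have been done).
    """
    edges = Counter()
    for genes in records.values():
        # Sort and deduplicate so each unordered pair is counted once per record.
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            edges[(gene_a, gene_b)] += 1
    return edges
```

The edge weight is the co-occurrence count, so a document-count threshold such as the adjustable sensitivity range mentioned in the description would amount to filtering edges by weight.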

Pubget [BIONLP_3435345] Tool URL
Description Pubget is an online search application that indexes nearly 20 million life science research documents, including those in PubMed. It can be searched by typing terms into the search field, a lot like you'd search PubMed or Google Scholar. The difference is Pubget gets you the PDF right away.
Reference Undefined
Abstract (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Keyword
Query output Ranked list; Nr. documents; Articles; Abstracts; PMIDs
Keywords Information retrieval; Full text; Abstracts

PubMatrix [BIONLP_100020010] Tool URL
Description PubMatrix is a simple way to rapidly and systematically compare any list of terms against any other list of terms in PubMed. It reports the frequency of co-occurrence for all pairwise comparisons between the two lists as a matrix table. Lists of terms can be anything: gene names, diseases, gene functions, authors, and so on. The user can then quickly sort or browse the frequency matrix table to do individual searches independently. This allows the user to build up tables of word relationships in PubMed in the context of their experiments or scientific interests. This is useful for analyzing combinatorial datasets, as found with multiplex experimental systems, such as cDNA microarrays, genomic, proteomic, or other multiplex comparisons. The PubMatrix database is an archive of previous searches on many topics.
Reference Becker KG, Hosack DA, Dennis G Jr, Lempicki RA, Bright TJ, Cheadle C, Engel J. PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics. 2003 Dec 10;4:61
Abstract PubMatrix: a tool for multiplex literature mining. BACKGROUND: Molecular experiments using multiplex strategies such as cDNA microarrays or proteomic approaches generate large datasets requiring biological interpretation. Text based data mining tools have recently been developed to query large biological datasets of this type of data. PubMatrix is a web-based tool that allows simple text based mining of the NCBI literature search service PubMed using any two lists of keywords terms, resulting in a frequency matrix of term co-occurrence. RESULTS: For example, a simple term selection procedure allows automatic pair-wise comparisons of approximately 1-100 search terms versus approximately 1-10 modifier terms, resulting in up to 1,000 pair wise comparisons. The matrix table of pair-wise comparisons can then be surveyed, queried individually, and archived. Lists of keywords can include any terms currently capable of being searched in PubMed. In the context of cDNA microarray studies, this may be used for the annotation of gene lists from clusters of genes that are expressed coordinately. An associated PubMatrix public archive provides previous searches using common useful lists of keyword terms. CONCLUSIONS: In this way, lists of terms, such as gene names, or functional assignments can be assigned genetic, biological, or clinical relevance in a rapid flexible systematic fashion. http://pubmatrix.grc.nia.nih.gov/ (PMID: 14667255)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
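
The frequency matrix that PubMatrix reports can be sketched on a small local collection. PubMatrix itself queries PubMed; the version below counts matches in an in-memory list of texts instead, and the function names (count_cooccurrences, frequency_matrix) and the substring matching rule are assumptions made for illustration.

```python
def count_cooccurrences(documents, term_a, term_b):
    """Count documents that mention both terms (case-insensitive substring match)."""
    a, b = term_a.lower(), term_b.lower()
    return sum(1 for doc in documents if a in doc.lower() and b in doc.lower())


def frequency_matrix(documents, search_terms, modifier_terms):
    """Return one row per search term and one column per modifier term."""
    return [
        [count_cooccurrences(documents, search, modifier) for modifier in modifier_terms]
        for search in search_terms
    ]
```

With 100 search terms and 10 modifier terms this yields the 1,000 pairwise comparisons mentioned in the abstract.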

PubMed Assistant [BIONLP_100020006] Tool URL
Description PubMed Assistant is: (1) A stand-alone Java program that accesses NLM's (National Library of Medicine) MEDLINE database directly through NCBI's (National Center for Biotechnology Information) Entrez Programming Utilities, by-passing the PubMed web interface. (2) A visual query editor that eases the pain in editing advanced boolean queries, such as balancing multiple levels of parentheses, checking boolean operator precedence, etc. (3) A specialized browser that gets the most out of MEDLINE query hits, for example keyword highlighting, automatic query formulation, MeSH term listing and chemical listing. (4) A collection of utility tools that make easy connections to other frequently used applications, for example, exporting MEDLINE hits to citation managers, one-click Google and Google Scholar search, etc.
Reference Ding J, Hughes LM, Berleant D, Fulmer AW, Wurtele ES. PubMed Assistant: a biologist-friendly interface for enhanced PubMed search. Bioinformatics. 2006 Feb 1;22(3):378-80. Epub 2005 Dec 6
Abstract PubMed Assistant: a biologist-friendly interface for enhanced PubMed search. MEDLINE is one of the most important bibliographical information sources for biologists and medical workers. Its PubMed interface supports Boolean queries, which are potentially expressive and exact. However, PubMed is also designed to support simplicity of use at the expense of query expressiveness and exactness. Many PubMed users have never tried explicit Boolean queries. We developed a Java program, PubMed Assistant, to make literature access easier in several ways. PubMed Assistant provides an interface that efficiently displays information about the citations and includes useful functions such as keyword highlighting, export to citation managers, clickable links to Google Scholar and others that are lacking in PubMed. (PMID: 16332704)
Availability Online: -, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubMed Central [BIONLP_115] Tool URL
Description PubMed Central (PMC) is a free digital archive of biomedical and life sciences journal literature that can be accessed online. It contains an electronic archive of full-text journal articles characterized by a unique identifier (PMC identifier), most of which have a corresponding entry in PubMed. PMC also includes material that is not indexed in PubMed such as book reviews. In addition, articles published prior to 1966 (and added to PMC via the Back Issue Digitization project) generally will be in PMC for several months before appearing in PubMed. Citations and links for these pre-1966 articles are added to PubMed one journal at a time, after all the back issues for that journal have been added to PMC. The content of PMC can be downloaded via FTP but there is also access to the content of PMC using a search engine. Search options include: search by author, by journal, by publication date, by article type (e.g. reviews, open access article, digitized back issues, etc.) or by tag term (e.g. abstract, body, figure/table caption, acknowledgements, etc.).
Reference None
Abstract PubMed Central is a free digital archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health (NIH), developed and managed by NIH's National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM). With PubMed Central, NLM is taking the lead in preserving and maintaining unrestricted access to the electronic literature, just as it has done for decades with the printed biomedical literature. PubMed Central aims to fill the role of a world class library in the digital age. It is not a journal publisher. NLM believes that giving all users free and unrestricted access to the material in PubMed Central is the best way to ensure the durability and utility of the archive as technology changes over time. PubMed Central follows in the footsteps of other highly successful and useful services that NCBI has developed for the worldwide scientific community: GenBank, the genetic sequence data repository, and PubMed, the database of citations and abstracts to biomedical and other life science journal literature. GenBank, and the tools provided by NCBI for searching and manipulating its contents, have been a boon to molecular biologists and have helped advance developments in the field. PubMed (which encompasses Medline) is the database of choice, for researchers and clinicians alike, to locate relevant articles and, in many cases, link directly to a publisher's site for the full text. (No PubMed ref.)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 3
Query input PMID; PMID list; Keyword; Authors
Query output Ranked list; Nr. documents; Articles; Abstracts; PMIDs; References; Ranked articles; Date
Keywords Information retrieval; Full text; Abstracts; Literature repository

PubMed Interact [BIONLP_100020002] Tool URL
Description PubMed Interact is an interactive Web-based search application for MEDLINE/PubMed that offers greater functionality: users can refine search parameters and interact with the search results to retrieve and display relevant information and related articles.
Reference Muin M, Fontelo P. Technical development of PubMed interact: an improved interface for MEDLINE/PubMed searches. BMC Med Inform Decis Mak. 2006 Nov 3;6:36
Abstract Technical development of PubMed interact: an improved interface for MEDLINE/PubMed searches. BACKGROUND: The project aims to create an alternative search interface for MEDLINE/PubMed that may provide assistance to the novice user and added convenience to the advanced user. An earlier version of the project was the 'Slider Interface for MEDLINE/PubMed searches' (SLIM) which provided JavaScript slider bars to control search parameters. In this new version, recent developments in Web-based technologies were implemented. These changes may prove to be even more valuable in enhancing user interactivity through client-side manipulation and management of results. RESULTS: PubMed Interact is a Web-based MEDLINE/PubMed search application built with HTML, JavaScript and PHP. It is implemented on a Windows Server 2003 with Apache 2.0.52, PHP 4.4.1 and MySQL 4.1.18. PHP scripts provide the backend engine that connects with E-Utilities and parses XML files. JavaScript manages client-side functionalities and converts Web pages into interactive platforms using dynamic HTML (DHTML), Document Object Model (DOM) tree manipulation and Ajax methods. With PubMed Interact, users can limit searches with JavaScript slider bars, preview result counts, delete citations from the list, display and add related articles and create relevance lists. Many interactive features occur at client-side, which allow instant feedback without reloading or refreshing the page resulting in a more efficient user experience. CONCLUSION: PubMed Interact is a highly interactive Web-based search application for MEDLINE/PubMed that explores recent trends in Web technologies like DOM tree manipulation and Ajax. It may become a valuable technical development for online medical search applications. (PMID: 17083729)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubMed / Medline / Entrez [BIONLP_112] Tool URL
Description Online literature collections with over 70 million queries every month and over 17 million publications. This centralized literature repository faces double-exponential growth. PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the U.S. National Institutes of Health (NIH). The content of PubMed is accessible through Entrez, a text-based search and retrieval system. PubMed includes citations that are submitted by participating publishers. Where the publishers also offer access to the corresponding full-text articles, PubMed provides links to these. Entrez improves keyword searches by mapping query terms to MeSH terms (the resulting query can be inspected by clicking the ‘Details’ button, and modified by selecting certain sub-nodes in the MeSH term tree). PubMed can also be searched using advanced search options (Limits), which select subsets by criteria such as authors, publication dates and languages. For programmatic access to PubMed, the Entrez Programming Utilities can be used. Specialized libraries for widely used scripting languages (Biopython and BioPerl) also provide modules for PubMed searches. A local copy of PubMed can be obtained under a license from the NLM/NCBI. A periodic e-mail alert (SDI) is also offered for PubMed through the My NCBI service.
Reference Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008 Jan;36(Database issue):D13-21
Abstract Database resources of the National Center for Biotechnology Information. In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data available through NCBI's web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace, Assembly, and Short Read Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Database of Genotype and Phenotype, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting the web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. (PMID: 18045790)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 3
Query input PMID; PMID list; Keyword; Authors
Query output Ranked list; Nr. documents; Confidence score; Abstracts; PMIDs; Keyword; e-mail returned article collection; Date
Keywords Information retrieval; Abstracts; Literature repository; SDI
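
Programmatic access through the Entrez Programming Utilities can be sketched with the Python standard library alone. The example below builds an ESearch request for the pubmed database and parses the JSON reply; the helper names (build_esearch_url, parse_esearch, search_pubmed) are hypothetical, and NCBI usage policies (request rate limits, an optional API key) should be checked before running real queries.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL that asks PubMed for up to `retmax` PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(payload):
    """Extract the total hit count and the list of PMIDs from an ESearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_pubmed(term, retmax=20):
    """Run the query over the network and return (hit count, PMIDs)."""
    with urlopen(build_esearch_url(term, retmax)) as response:
        return parse_esearch(response.read().decode("utf-8"))
```

A call such as search_pubmed("CDK5 AND apoptosis", retmax=5) would return the total number of matching records and the first five PMIDs; Biopython's Bio.Entrez module wraps the same service for users who prefer a library.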

PubMed Reader [BIONLP_10005] Tool URL
Description PubMed Reader - A free web-based alternative interface for PubMed search
Reference Undefined
Abstract (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubMed Sentence Extractor (PSE) [BIONLP_1000515] Tool URL
Description A web-based software tool that parses a large number of PubMed abstracts, extracts and displays the sentences in which gene names and other keywords co-occur, and shows some information from EntrezGene records. The results link to whole abstracts and other resources such as Online Mendelian Inheritance in Man and Reference Sequence.
Reference Yoneya T. PSE: a tool for browsing a large amount of MEDLINE/PubMed abstracts with gene names and common words as the keywords. BMC Bioinformatics. 2005 Dec 10;6:295
Abstract PSE: a tool for browsing a large amount of MEDLINE/PubMed abstracts with gene names and common words as the keywords. BACKGROUND: MEDLINE/PubMed (hereinafter called PubMed) is one of the most important literature databases for the biological and medical sciences, but it is impossible to read all related records due to the sheer size of the repository. We usually have to repeatedly enter keywords in a trial-and-error manner to extract useful records. Software which can reduce such a laborious task is therefore required. RESULTS: We developed a web-based software, the PubMed Sentence Extractor (PSE), which parses large number of PubMed abstracts, extracts and displays the co-occurrence sentences of gene names and other keywords, and some information from EntrezGene records. The result links to whole abstracts and other resources such as the Online Mendelian Inheritance in Men and Reference Sequence. While PSE executes at the sentence-level when evaluating the existence of keywords, the popular PubMed operates at the record-level. Therefore, the relationship between the two keywords, a gene name and a common word, is more accurately captured by PSE than PubMed. In addition, PSE shows the list of keywords and considers the synonyms and variations on gene names. Through these functions, PSE would reduce the task of searching through records for gene information. CONCLUSION: We developed PSE in order to extract useful records efficiently from PubMed. This system has four advantages over a simple PubMed search; the reduction in the amount of collected literatures, the showing of keyword lists, the consideration for synonyms and variations on gene names, and the links to external databases. We believe PSE is helpful in collecting necessary literatures efficiently in order to find research targets. PSE is freely available under the GPL licence as additional files to this manuscript. (PMID: 16336692)
Availability Online: -, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
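
The difference between record-level and sentence-level matching that the PSE abstract emphasises can be shown with a short sketch. It assumes a naive punctuation-based sentence splitter and whole-word, case-insensitive matching; the function names (split_sentences, cooccurrence_sentences) are hypothetical and do not reflect the PSE source code.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrence_sentences(abstract, gene_names, keyword):
    """Return the sentences that mention any of the gene names together with the keyword.

    `gene_names` may list synonyms and spelling variants of one gene.
    """
    gene_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in gene_names) + r")\b",
        re.IGNORECASE,
    )
    keyword_pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [
        sentence
        for sentence in split_sentences(abstract)
        if gene_pattern.search(sentence) and keyword_pattern.search(sentence)
    ]
```

An abstract that mentions the gene in one sentence and the keyword in another would match a record-level search but yields no sentence here, which is the narrowing effect the abstract describes.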

PubMed slicer [BIONLP_10002] Tool URL
Description PubMed slicer - an alternative, clear interface to PubMed
Reference Undefined
Abstract (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubMeth [BIONLP_100052045] Tool URL
Description PubMeth is an annotated and reviewed database of methylation in cancer. It is built by automated text mining of the literature, followed by manual curation and annotation.
Reference Ongenaert M, Van Neste L, De Meyer T, Menschaert G, Bekaert S, Van Criekinge W. PubMeth: a cancer methylation database combining text-mining and expert annotation. Nucleic Acids Res. 2008 Jan;36(Database issue):D842-6
Abstract PubMeth: a cancer methylation database combining text-mining and expert annotation. Epigenetics, and more specifically DNA methylation is a fast evolving research area. In almost every cancer type, each month new publications confirm the differentiated regulation of specific genes due to methylation and mention the discovery of novel methylation markers. Therefore, it would be extremely useful to have an annotated, reviewed, sorted and summarized overview of all available data. PubMeth is a cancer methylation database that includes genes that are reported to be methylated in various cancer types. A query can be based either on genes (to check in which cancer types the genes are reported as being methylated) or on cancer types (which genes are reported to be methylated in the cancer (sub) types of interest). The database is freely accessible at http://www.pubmeth.org. PubMeth is based on text-mining of Medline/PubMed abstracts, combined with manual reading and annotation of preselected abstracts. The text-mining approach results in increased speed and selectivity (as for instance many different aliases of a gene are searched at once), while the manual screening significantly raises the specificity and quality of the database. The summarized overview of the results is very useful in case more genes or cancer types are searched at the same time. (PMID: 17932060)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

PubNet [BIONLP_111013] Tool URL
Description PubNet is a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis. PubNet supports the creation of complex networks derived from the contents of individual citations, such as genes, proteins, Protein Data Bank (PDB) IDs, Medical Subject Headings (MeSH) terms, and authors. This feature allows one to, for example, examine a literature derived network of genes based on functional similarity.
Reference Douglas SM, Montelione GT, Gerstein M. PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2005;6(9):R80.
Abstract PubNet: a flexible system for visualizing literature derived networks. We have developed PubNet, a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis. PubNet supports the creation of complex networks derived from the contents of individual citations, such as genes, proteins, Protein Data Bank (PDB) IDs, Medical Subject Headings (MeSH) terms, and authors. This feature allows one to, for example, examine a literature derived network of genes based on functional similarity. (PMID: 16168087)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 3, NLP: 2
Query input Free text (paste); Free text (local); PMID; PMID list; Keyword; Gene/protein identifiers; Gene/protein lists; Gene/protein sequences; Compound/drug names; Compound/drug identifiers; Authors
Query output Abstracts; PMIDs; Keyword; Gene/protein names; Gene/protein identifiers; Bio-entity co-occurrences;
Keywords Information extraction; Information retrieval; Text Mining; Text classification; Text clustering; Protein Interaction; Co-authors

PubReMiner [BIONLP_10008] Tool URL
Description PubReMiner offers an alternative way to search the PubMed database, with the option to save the results as a txt file. The primary result is a structured summary table providing information on co-occurring terms and their counts as well as publication year. These features can be selected with Boolean operators to refine the search. The number of results per search is limited. How does PubReMiner work? PubReMiner is a front-end for the popular PubMed literature database at the NCBI. When you submit your query (which can be any query that can be processed by PubMed), PubReMiner will process the result of that query and display its results (in the form of selectable "keywords") in frequency tables; these keywords can be added to or excluded from the query to optimize the results. Why would I need such a tool? PubMed is becoming larger and larger. To obtain a workable number of references from your queries, you often need to combine different keywords. But which ones should you use? With this idea in mind, PubReMiner has been developed. The tool allows you to initiate a broad query (which is currently restricted to 7,500 abstracts), after which you can add/exclude words/authors/journals to guide your search. Words/authors/journals are displayed in descending order of frequency, so you can immediately see which words are used the most in combination with your query. In addition, you can see the popular journals in the field of your query, which may help you select a target journal for your own work. Furthermore, experts in the field (the most actively publishing authors) become visible. What is this literature mining about? When you are querying a certain subject or gene, the words that are used most frequently often provide a quick way to gather additional information on the subject. If, for example, we query the gene "PHOX2B", we can immediately see in the table words like transcription, neuron, embryology and homeodomain. These terms indicate that PHOX2B is probably a transcription factor containing a homeodomain, which is involved in neurons and probably important during development. Obtaining this insight took only about 3 seconds.
Reference Undefined
Abstract (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
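
The frequency tables described for PubReMiner can be sketched as simple counting over retrieved records. The record layout (dictionaries with title, authors and journal keys), the stopword list and the function name frequency_tables are assumptions made for illustration; the real tool works on the results of a PubMed query.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def frequency_tables(records):
    """Count in how many records each word, author and journal appears."""
    words, authors, journals = Counter(), Counter(), Counter()
    for record in records:
        tokens = set(re.findall(r"[a-z0-9]+", record["title"].lower()))
        words.update(tokens - STOPWORDS)
        authors.update(set(record["authors"]))
        journals[record["journal"]] += 1
    return {"words": words, "authors": authors, "journals": journals}
```

Calling most_common() on each table gives the descending order described above, from which terms can be picked to add to or exclude from the next query.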

PubViz [BIONLP_100020001] Tool URL
Description PubViz: An Interactive Medline Search Engine Utilizing External Knowledge. Motivation: Although there has been tremendous progress in the development of Medline search functions, existing solutions are still less than ideal in providing highly efficient Medline exploration. PubViz is designed to overcome some of the limitations by taking advantage of external biomedical conceptual relationships for better information retrieval. It also provides flexible visualization functions for efficient visual queries. In short, PubViz is developed to provide the capability of utilizing external knowledge as well as interactive visual query functions for more efficient exploration of the Medline database. The current version can utilize protein-protein interaction data during Medline search, enabling researchers to identify functionally related Medline records not retrievable in existing search engines. It can also utilize the structural relationships of different types of genetic markers, including cytobands, microsatellite/STS markers, SNPs and genes derived from the human genome assembly and HapMap data, for deep search of genetically related Medline records. We include many visualization functions in PubViz, such as interactive PMID, MeSH and Gene views, the transition between different views, selection of node description display on the network graph, as well as details of abstracts and sorting/filtering functions. The combination of these novel capabilities will make PubViz a powerful tool for Medline exploration.
Reference Xuan W, Dai M, Mirel B, Wilson J, Athey B, Watson SJ, Meng F. An active visual search interface for Medline. Comput Syst Bioinformatics Conf. 2007;6:359-69
Abstract An active visual search interface for Medline. Searching the Medline database is almost a daily necessity for many biomedical researchers. However, available Medline search solutions are mainly designed for the quick retrieval of a small set of most relevant documents. Because of this search model, they are not suitable for the large-scale exploration of literature and the underlying biomedical conceptual relationships, which are common tasks in the age of high throughput experimental data analysis and cross-discipline research. We try to develop a new Medline exploration approach by incorporating interactive visualization together with powerful grouping, summary, sorting and active external content retrieval functions. Our solution, PubViz, is based on the FLEX platform designed for interactive web applications and its prototype is publicly available at: http://brainarray.mbni.med.umich.edu/Brainarray/DataMining/PubViz. (PMID: 17951838)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

ReleMed [BIONLP_100037] Tool URL
Description ReleMed is a search engine that allows you to find the most relevant answers to your questions. You enter one or a few query words, and then ReleMed searches for articles containing those query words. Unlike other search engines, ReleMed displays your results so that the most relevant articles (the ones most closely matching your query) are shown first. Currently ReleMed is focused on biomedical findings published in scientific journals. It searches 17 million articles indexed in MEDLINE, the National Library of Medicine's electronic database.
Reference Siadaty MS, Shu J, Knaus WA. Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med Inform Decis Mak. 2007 Jan 10;7:1.
Abstract Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BACKGROUND: Receiving extraneous articles in response to a query submitted to MEDLINE/PubMed is common. When submitting a multi-word query (which is the majority of queries submitted), the presence of all query words within each article may be a necessary condition for retrieving relevant articles, but not sufficient. Ideally a relationship between the query words in the article is also required. We propose that if two words occur within an article, the probability that a relation between them is explained is higher when the words occur within adjacent sentences versus remote sentences. Therefore, sentence-level concurrence can be used as a surrogate for existence of the relationship between the words. In order to avoid the irrelevant articles, one solution would be to increase the search specificity. Another solution is to estimate a relevance score to sort the retrieved articles. However among the >30 retrieval services available for MEDLINE, only a few estimate a relevance score, and none detects and incorporates the relation between the query words as part of the relevance score. RESULTS: We have developed "Relemed", a search engine for MEDLINE. Relemed increases specificity and precision of retrieval by searching for query words within sentences rather than the whole article. It uses sentence-level concurrence as a statistical surrogate for the existence of relationship between the words. It also estimates a relevance score and sorts the results on this basis, thus shifting irrelevant articles lower down the list. In two case studies, we demonstrate that the most relevant articles appear at the top of the Relemed results, while this is not necessarily the case with a PubMed search. 
We have also shown that a Relemed search includes not only all the articles retrieved by PubMed, but potentially additional relevant articles, due to the extended 'automatic term mapping' and text-word searching features implemented in Relemed. CONCLUSION: By using sentence-level matching, Relemed can deliver higher specificity, thus eliminating more false-positive articles. By introducing an appropriate relevance metric, the most relevant articles on which the user wishes to focus are listed first. Relemed also shrinks the displayed text, and hence the time spent scanning the articles. (PMID: 17214888)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
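
Illustration (not part of the original record): the sentence-level co-occurrence idea in the Relemed abstract can be sketched as a coarse scoring function. This is only a sketch under stated assumptions. The four relevance levels, the naive sentence splitter and the function name relevance_level are hypothetical and are not Relemed's published relevance metric.

```python
import re

def relevance_level(text, query_words):
    """Return a coarse relevance level for one article.

    3: all query words occur in a single sentence
    2: all query words occur within two adjacent sentences
    1: all query words occur somewhere in the article
    0: at least one query word is missing
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [set(re.findall(r"[a-z0-9]+", s.lower()))
                 for s in re.split(r"(?<=[.!?])\s+", text) if s]
    query = {w.lower() for w in query_words}
    all_tokens = set().union(*sentences)
    if not query <= all_tokens:
        return 0
    if any(query <= s for s in sentences):
        return 3
    if any(query <= (a | b) for a, b in zip(sentences, sentences[1:])):
        return 2
    return 1

same_sentence = relevance_level("Aspirin reduces fever. It is cheap.",
                                ["aspirin", "fever"])
adjacent = relevance_level("Aspirin is cheap. Fever is common.",
                           ["aspirin", "fever"])
remote = relevance_level("Aspirin is cheap. Cats are nice. Fever is common.",
                         ["aspirin", "fever"])
missing = relevance_level("Aspirin is cheap.", ["aspirin", "fever"])
```

Sorting retrieved articles by such a level in descending order would push articles in which the query words merely share a document, without sharing a sentence, further down the list.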

RLIMS-P [BIONLP_100020020] Tool URL
Description RLIMS-P is a rule-based text-mining program specifically designed to extract protein phosphorylation information (protein kinase, substrate and phosphorylation sites) from abstracts (Hu et al., 2005). The program was originally developed by Narayanaswamy, Ravikumar, and Vijay-Shanker (2005), and was tested and benchmarked by PIR using iProLINK annotated datasets (Hu et al., 2004). The RLIMS-P program has now been adopted at PIR and is being developed into an online text mining tool for extracting protein phosphorylation information from PubMed literature. The online RLIMS-P (Yuan et al., 2006) currently provides functions to: 1) determine whether the MEDLINE abstract contains protein phosphorylation information and extract the protein kinase, protein substrate and phosphorylation site/residue when available; 2) tag extracted phosphorylation objects in the abstract in different colors; 3) map the protein substrate to UniProtKB protein entries based on PMID; 4) map protein names to UniProtKB protein entries based on BioThesaurus.
Reference Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005 Jun 1;21(11):2759-65. Epub 2005 Apr 6.
Abstract Literature mining and database annotation of protein phosphorylation using a rule-based system. MOTIVATION: A large volume of experimental data on protein phosphorylation is buried in the fast-growing PubMed literature. While of great value, such information is limited in databases owing to the laborious process of literature-based curation. Computational literature mining holds promise to facilitate database curation. RESULTS: A rule-based system, RLIMS-P (Rule-based LIterature Mining System for Protein Phosphorylation), was used to extract protein phosphorylation information from MEDLINE abstracts. An annotation-tagged literature corpus developed at PIR was used to evaluate the system for finding phosphorylation papers and extracting phosphorylation objects (kinases, substrates and sites) from abstracts. RLIMS-P achieved a precision and recall of 91.4 and 96.4% for paper retrieval, and of 97.9 and 88.0% for extraction of substrates and sites. Coupling the high recall for paper retrieval and high precision for information extraction, RLIMS-P facilitates literature mining and database annotation of protein phosphorylation. (PMID: 15814565)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
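
Illustration (not part of the original record): a rule-based extractor of the kind described for RLIMS-P can be pictured as a set of textual patterns with slots for kinase, substrate and site. The sketch below uses a single regular expression and is purely hypothetical. It is not one of RLIMS-P's actual rules, which operate on richer linguistic analysis. The names RULE and extract_phosphorylation are assumptions made for this example.

```python
import re

# One hypothetical pattern: "<kinase> phosphorylates <substrate> at|on <site>".
# Sites are assumed to be written like Ser15, Thr-308 or Y705.
RULE = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+) phosphorylates (?P<substrate>[A-Za-z0-9-]+)"
    r"(?: (?:at|on) (?P<site>(?:Ser|Thr|Tyr|S|T|Y)-?\d+))?"
)

def extract_phosphorylation(sentence):
    """Return one dict per match with kinase, substrate and site (or None)."""
    return [m.groupdict() for m in RULE.finditer(sentence)]

with_site = extract_phosphorylation(
    "ATM phosphorylates p53 at Ser15 in response to damage.")
without_site = extract_phosphorylation("AKT phosphorylates GSK3")
no_match = extract_phosphorylation("Nothing relevant here.")
```

A realistic system would need many such patterns (passive voice, nominalizations such as "phosphorylation of X by Y") plus entity recognition for the slot fillers.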

SciMiner [BIONLP_11105] Tool URL
Description SciMiner is a web-based literature mining and functional analysis tool that identifies genes and proteins using a context specific analysis of MEDLINE abstracts and full texts. SciMiner identifies gene/protein names in the literature and allows over-represented biological functions of the identified genes/proteins to be determined. SciMiner accepts PMID lists and keywords. Keyword-based searches are run through Entrez, so users get exactly the same documents as from PubMed. A keyword can therefore be an author, gene name, disease, or anything else users would enter in a PubMed search.
Reference Hur J, Schuyler AD, States DJ, Feldman EL: SciMiner: web-based literature mining tool for target identification and functional enrichment analysis. Bioinformatics 2009, 25(6):838-840.
Abstract SciMiner: web-based literature mining tool for target identification and functional enrichment analysis. SciMiner is a web-based literature mining and functional analysis tool that identifies genes and proteins using a context specific analysis of MEDLINE abstracts and full texts. SciMiner accepts a free text query (PubMed Entrez search) or a list of PubMed identifiers as input. SciMiner uses both regular expression patterns and dictionaries of gene symbols and names compiled from multiple sources. Ambiguous acronyms are resolved by a scoring scheme based on the co-occurrence of acronyms and corresponding description terms, which incorporates optional user-defined filters. Functional enrichment analyses are used to identify highly relevant targets (genes and proteins), GO (Gene Ontology) terms, MeSH (Medical Subject Headings) terms, pathways and protein-protein interaction networks by comparing identified targets from one search result with those from other searches or to the full HGNC [HUGO (Human Genome Organization) Gene Nomenclature Committee] gene set. The performance of gene/protein name identification was evaluated using the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) version 2 (Year 2006) Gene Normalization Task as a gold standard. SciMiner achieved 87.1% recall, 71.3% precision and 75.8% F-measure. SciMiner's literature mining performance coupled with functional enrichment analyses provides an efficient platform for retrieval and summary of rich biological information from corpora of users' interests. AVAILABILITY: http://jdrf.neurology.med.umich.edu/SciMiner/. A server version of the SciMiner is also available for download and enables users to utilize their institution's journal subscriptions. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. (PMID: 19188191)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input PMID; PMID list; Keyword
Query output Gene/protein names; Gene/protein identifiers; Bio-entity co-occurrences; Gene/Protein normalized text; Gene Ontology associations; MeSH associations; Pathway (KEGG and Reactome) associations
Keywords Information extraction; Information retrieval; Text Mining; Entity Recognition; Gene/protein normalization; Gene Ontology; Full text; Abstracts

SherLoc [BIONLP_100035] Tool URL
Description Prediction of sub-cellular location for proteins, based on text and sequence features.
Reference Shatkay H, Höglund A, Brady S, Blum T, Dönnes P, Kohlbacher O. SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics. 2007 Jun 1;23(11):1410-7. Epub 2007 Mar 28
Abstract SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. MOTIVATION: Knowing the localization of a protein within the cell helps elucidate its role in biological processes, its function and its potential as a drug target. Thus, subcellular localization prediction is an active research area. Numerous localization prediction systems are described in the literature; some focus on specific localizations or organisms, while others attempt to cover a wide range of localizations. RESULTS: We introduce SherLoc, a new comprehensive system for predicting the localization of eukaryotic proteins. It integrates several types of sequence and text-based features. While applying the widely used support vector machines (SVMs), SherLoc's main novelty lies in the way in which it selects its text sources and features, and integrates those with sequence-based features. We test SherLoc on previously used datasets, as well as on a new set devised specifically to test its predictive power, and show that SherLoc consistently improves on previous reported results. We also report the results of applying SherLoc to a large set of yet-unlocalized proteins. AVAILABILITY: SherLoc, along with Supplementary Information, is available at: http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc/ (PMID: 17392328)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 2, Curator: 2, NLP: 2
Query input Undefined
Query output Undefined
Keywords Undefined

ShrubMed [BIONLP_10007] Tool URL
Description ShrubMed - A PubMed index focusing on herbal and alternative medicine.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Springer [BIONLP_100015] Tool URL
Description Advanced search against articles from this publisher. Allows full text search, selection of topics, author search, language selection, and specification of the type of media (e.g. books, journal articles, webpages). Returns hits structured by the type of media.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

STITCH [BIONLP_111012] Tool URL
Description STITCH (‘search tool for interactions of chemicals’) integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug–target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. It further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68 000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes.
Reference Kuhn M, von Mering C, Campillos M, Jensen LJ, Bork P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 2008 Jan;36(Database issue): D684-8.
Abstract STITCH: interaction networks of chemicals and proteins. The knowledge about interactions between proteins and small molecules is essential for the understanding of molecular and cellular functions. However, information on such interactions is widely dispersed across numerous databases and the literature. To facilitate access to this data, STITCH ('search tool for interactions of chemicals') integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68,000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes and their interactions contained in the STRING database. STITCH is available at http://stitch.embl.de/. (PMID: 18084021)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 3, NLP: 1
Query input Gene/protein names; Gene/protein identifiers; Gene/protein lists; Gene/protein sequences; Compound/drug names; Compound/drug identifiers
Query output Ranked list; Confidence score; Abstracts; PMIDs; Gene/protein names; Gene/protein identifiers; Bio-entity association list; Bio-entity network; Protein Interactions; Drug-target associations; Protein-compound associations
Keywords Information extraction; Text Mining; Entity Recognition; Gene/protein normalization; Term extraction; Acronym/abbreviation extraction; Relation extraction; Microarray; Protein Interaction; Gene regulation; Gene/protein function; Chemical compound; Transcription factor; Gene Ontology; Abstracts

SuperPred [BIONLP_56456545] Tool URL
Description SuperPred is a web-server that allows extracting drug-target associations. It translates a user-defined molecule into a structural fingerprint that is compared to about 6300 drugs, which are enriched by 7300 links to molecular targets of the drugs, derived through text mining followed by manual curation. These putative drugs are most likely candidates for having the same mode of action, binding to the same target/enzyme and being assigned to the same medical indication as the WHO-classified drugs. In order to allow the examination of the drug effect on a molecular level, information about the target proteins was extracted from literature and was provided for half of the drugs. [Description adapted from the original article]
Reference Dunkel M, Günther S, Ahmed J, Wittig B, Preissner R. SuperPred: drug classification and target prediction. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W55-9
Abstract SuperPred: drug classification and target prediction. The drug classification scheme of the World Health Organization (WHO) [Anatomical Therapeutic Chemical (ATC)-code] connects chemical classification and therapeutic approach. It is generally accepted that compounds with similar physicochemical properties exhibit similar biological activity. If this hypothesis holds true for drugs, then the ATC-code, the putative medical indication area and potentially the medical target should be predictable on the basis of structural similarity. We have validated that the prediction of the drug class is reliable for WHO-classified drugs. The reliability of the predicted medical effects of the compounds increases with a rising number of (physico-) chemical properties similar to a drug with known function. The web-server translates a user-defined molecule into a structural fingerprint that is compared to about 6300 drugs, which are enriched by 7300 links to molecular targets of the drugs, derived through text mining followed by manual curation. Links to the affected pathways are provided. The similarity to the medical compounds is expressed by the Tanimoto coefficient that gives the structural similarity of two compounds. A similarity score higher than 0.85 results in correct ATC prediction for 81% of all cases. As the biological effect is well predictable, if the structural similarity is sufficient, the web-server allows prognoses about the medical indication area of novel compounds and to find new leads for known targets. Availability: the system is freely accessible at http://bioinformatics.charite.de/superpred. SuperPred can be obtained via a Creative Commons Attribution Noncommercial-Share Alike 3.0 License. (PMID: 18499712)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 3, Curator: 1, NLP: 1
Query input Gene/protein names; Gene/protein identifiers; Compound/drug names; Compound/drug identifiers
Query output Bio-entity association list; Bio-entity co-occurrences; Drug-target associations;
Keywords Information extraction; Text Mining; Gene/protein normalization; Relation extraction; Chemical compound; Abstracts; Protein-compound association
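
Illustration (not part of the original record): the Tanimoto coefficient mentioned in the SuperPred abstract compares two structural fingerprints as the size of their intersection divided by the size of their union. The sketch below assumes fingerprints are represented as sets of "on" bit positions. The library contents, the names tanimoto and similar_drugs, and the toy fingerprints are hypothetical. Only the 0.85 cut-off is taken from the abstract.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Similarity threshold quoted in the SuperPred abstract.
ATC_THRESHOLD = 0.85

def similar_drugs(query_fp, library, threshold=ATC_THRESHOLD):
    """Return (name, score) pairs scoring above the threshold, best first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] > threshold),
                  key=lambda h: h[1], reverse=True)

# Toy fingerprints, made up purely for illustration.
library = {"drug_x": {1, 2, 3, 4, 5, 6, 7}, "drug_y": {1, 2}}
query = {1, 2, 3, 4, 5, 6, 7}
hits = similar_drugs(query, library)
half = tanimoto({1, 2, 3}, {2, 3, 4})
```

Here half is 2 shared bits over 4 distinct bits, and only drug_x passes the threshold for the query.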

T2K Gene Tagger [BIONLP_3434544] Tool URL
Description T2K Gene Tagger is a web tool that takes a medical text file or a list of gene names and tags genes with a <'gene'> tag carrying taxonomy and sequence information. Note that tagging long paragraphs or a full paper will take quite a long time.
Reference None
Abstract None (No PubMed ref.)
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 2
Query input Free text (paste);
Query output Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text
Keywords Information extraction; Entity Recognition

TaxonGrab [BIONLP_100052010] Tool URL
Description TaxonGrab is a tool written in PHP for the purpose of taxonomic name extraction.
Reference Koning D, Sarkar IN, Moritz TD. TaxonGrab: Extracting Taxon Names from Text. Journal of Biodiversity Informatics 2:79-82. 2005
Abstract TaxonGrab: extracting taxonomic names from text. Identification of organism names in biological texts is essential for the management of archival resources to facilitate comparative biological investigation. Because organism nomenclature conforms closely to prescribed rules, automated techniques may be useful for identifying organism names from existing documents, and may also support the completion of comprehensive indices of taxonomic names; such comprehensive lists are not yet available. Using a combination of contextual rules and a language lexicon, we have developed a set of simple computational techniques for extracting taxonomic names from biological text. Our proposed method consistently performs at greater than 96% Precision and 94% Recall, and at a much higher speed than manual extraction techniques. An implementation of the described method is available as a Web based tool written in PHP. Additionally, the PHP source code is available from SourceForge: http://sourceforge.net/projects/taxongrab, and the project website is http://research.amnh.org/informatics/taxlit/apps/ (No PubMed ref.)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

TerMine [BIONLP_100020012] Tool URL
Description TerMine is a terminological management system that integrates C-value term extraction and AcroMine acronym recognition. C-value is a domain-independent method for automatic term recognition (ATR) which combines linguistic and statistical analyses, with the emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text by applying part-of-speech tagging, extracting word sequences based on adjectives/nouns, and applying a stop-list. The statistical analysis assigns a termhood to a candidate term by using the following four characteristics: (1) the occurrence frequency of the candidate term, (2) the frequency of the candidate term as part of other longer candidate terms, (3) the number of these longer candidate terms, and (4) the length of the candidate term. The TerMine implementation of the C-value method is optimized for scalability and processing speed: given a set of 1.3 million MEDLINE abstracts (2GB text), the implementation extracts 9.8 million term candidates and their termhood scores in about ten minutes. This demonstration system highlights multi-word terms found in the text presented by a user.
Reference Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word terms. International Journal of Digital Libraries 3(2), pp.117-132.
Abstract Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. Technical terms (henceforth called terms), are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms, from machine- readable special language corpora. The method, (C-value/NC-value), combines linguistic and statistical information. The first part, C-value enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms), 2) the incorporation of information from term context words to the extraction of terms. (No PubMed ref.)
Availability Online: Y, Download: N, Web service: Y
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
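
Illustration (not part of the original record): the four characteristics listed in the TerMine description combine in the C-value measure of Frantzi et al. (2000) as follows. For a candidate term a with length |a| in words and frequency f(a), C-value(a) = log2|a| * f(a) if a is not nested in any longer candidate, and otherwise C-value(a) = log2|a| * (f(a) - (1/P(Ta)) * sum of f(b) over b in Ta), where Ta is the set of longer candidates containing a and P(Ta) is their number. The sketch below applies this formula directly to precomputed counts. It omits the linguistic filter, the stop-list and any thresholds, and it is not TerMine's optimized implementation. The function names and the toy frequencies are assumptions.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter
                         for i in range(n - m + 1))

def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` maps candidate terms to their total corpus frequency.
    Single-word candidates score 0 because log2(1) == 0; the measure
    is aimed at multi-word terms.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        longer = [f_b for b, f_b in freq.items() if contains(b, term)]
        if longer:
            f = f - sum(longer) / len(longer)
        scores[term] = math.log2(len(term)) * f
    return scores

# Toy counts, made up purely for illustration.
freq = {
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 2,
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
}
scores = c_values(freq)
```

In this toy input, "cell carcinoma" occurs 8 times but is nested in two longer candidates with an average frequency of 3.5, so its adjusted frequency drops to 4.5 before the length weighting is applied.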

Textpresso [BIONLP_100020011] Tool URL
Description Textpresso is an information extracting and processing package for biological literature and is part of WormBase.
Reference Mueller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2004 Nov;2(11):e309. Epub 2004 Sep 21
Abstract Textpresso: an ontology-based information retrieval and extraction system for biological literature. We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org. (PMID: 15383839)
Availability Online: Y, Download: Y, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Tokenizer/stemmer (for biomedical literature) [BIONLP_100052016] Tool URL
Description Perl script tokenizer and stemmer developed for biomedical text.
Reference Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006 Nov 15;22(22):2813-8. Epub 2006 Sep 18
Abstract A simple Perl tokenizer and stemmer for biomedical text. We present a simple Perl script tokenizer and stemmer which we developed in order to process biomedical text so that the resulting words and phrases remain meaningful and can be linked to slight variants in other titles. (technical report) (No PubMed ref.)
Availability Online: Y, Download: Y, Web service: N
User relevance Biologist: 1, Curator: 1, NLP: 3
Query input Undefined
Query output Undefined
Keywords Undefined

uBio [BIONLP_100052009] Tool URL
Description Universal Biological Indexer and Organizer. A useful resource for finding and tagging organism names in electronic texts.
Reference Leary PR, Remsen DP, Norton CN, Patterson DJ, Sarkar IN. uBioRSS: tracking taxonomic literature using RSS. Bioinformatics. 2007 Jun 1;23(11):1434-6. Epub 2007 Mar 28
Abstract uBioRSS: tracking taxonomic literature using RSS. Web content syndication through standard formats such as RSS and ATOM has become an increasingly popular mechanism for publishers, news sources and blogs to disseminate regularly updated content. These standardized syndication formats deliver content directly to the subscriber, allowing them to locally aggregate content from a variety of sources instead of having to find the information on multiple websites. The uBioRSS application is a 'taxonomically intelligent' service customized for the biological sciences. It aggregates syndicated content from academic publishers and science news feeds, and then uses a taxonomic Named Entity Recognition algorithm to identify and index taxonomic names within those data streams. The resulting name index is cross-referenced to current global taxonomic datasets to provide context for browsing the publications by taxonomic group. This process, called taxonomic indexing, draws upon services developed specifically for biological sciences, collectively referred to as 'taxonomic intelligence'. Such value-added enhancements can provide biologists with accelerated and improved access to current biological content. AVAILABILITY: http://names.ubio.org/rss/ (PMID: 17392332)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Unbound MEDLINE [BIONLP_10006] Tool URL
Description Unbound MEDLINE - Clinician-friendly access to PubMed searching via PDA, wireless devices and the Web.
Reference Undefined
Abstract (PMID: )
Availability Online: Y, Download: N, Web service: N
User relevance Biologist: 2, Curator: 1, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined

Whatizit [BIONLP_100020013] Tool URL
Description Whatizit internally employs a pipeline of filters to enrich plain text with annotation. Interesting pieces of text are combined into XML elements. Filters further down the pipeline use elements identified by upstream filters to aggregate larger structures. Some of the XML tags applied carry link information to biological databases.
Reference Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. Text processing through Web services: calling Whatizit. Bioinformatics. 2008 Jan 15;24(2):296-8. Epub 2007 Nov 15.
Abstract Text processing through Web services: calling Whatizit. MOTIVATION: Text-mining (TM) solutions are developing into efficient services to researchers in the biomedical research community. Such solutions have to scale with the growing number and size of resources (e.g. available controlled vocabularies), with the amount of literature to be processed (e.g. about 17 million documents in PubMed) and with the demands of the user community (e.g. different methods for fact extraction). These demands motivated the development of a server-based solution for literature analysis. Whatizit is a suite of modules that analyse text for contained information, e.g. any scientific publication or Medline abstracts. Special modules identify terms and then link them to the corresponding entries in bioinformatics databases such as UniProtKb/Swiss-Prot data entries and gene ontology concepts. Other modules identify a set of selected annotation types like the set produced by the EBIMed analysis pipeline for proteins. In the case of Medline abstracts, Whatizit offers access to EBI's in-house installation via PMID or term query. For large quantities of the user's own text, the server can be operated in a streaming mode (http://www.ebi.ac.uk/webservices/whatizit). (PMID: 18006544)
Availability Online: Y, Download: -, Web service: Y
User relevance Biologist: 2, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
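
Illustration (not part of the original record): the filter pipeline described for Whatizit, in which downstream filters build on elements produced by upstream filters, can be sketched as a chain of text-to-text functions. Everything below is hypothetical: the tag names, the two-entry dictionary, the placeholder identifiers (EX:0001, EX:0002) and the function names are invented for this example and do not reflect Whatizit's actual modules or output schema.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mini-dictionary: surface form -> placeholder identifier.
PROTEINS = {"p53": "EX:0001", "MDM2": "EX:0002"}

def tag_proteins(text):
    """Filter 1: wrap dictionary matches in <protein> elements."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, PROTEINS)) + r")\b")
    return pattern.sub(
        lambda m: '<protein id="%s">%s</protein>'
                  % (PROTEINS[m.group(1)], m.group(1)),
        text)

def tag_pairs(text):
    """Filter 2: wrap sentences holding two or more <protein> elements."""
    out = []
    for sentence in re.split(r"(?<=\.)\s+", text):
        # Relies on the elements added by the upstream filter.
        if sentence.count("<protein ") >= 2:
            sentence = "<pair_sentence>" + sentence + "</pair_sentence>"
        out.append(sentence)
    return " ".join(out)

def run_pipeline(text, filters=(tag_proteins, tag_pairs)):
    """Escape the plain text, then apply each filter in order."""
    result = escape(text)
    for apply_filter in filters:
        result = apply_filter(result)
    return result

annotated = run_pipeline("MDM2 binds p53. It is studied.")
```

Because each filter only adds markup around existing text, further filters (for example one that attaches database links to the id attributes) could be appended to the tuple without changing the earlier stages.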

WikiGene [BIONLP_100020014] Tool URL
Description WikiGene is a scientific project that follows a community-based approach to collect data about genes and gene regulatory events. As in the Wikipedia project, everyone who is interested can take part and contribute to the project. As part of the Wiki concept, the WikiGene pages can be edited by everyone. Data collected by this collaborative effort will in turn be freely available to everyone. In WikiGene, every gene has its own page that contains a short description of the gene, its aliases and a gene-specific link to the Ensembl site, which provides further information. To access your gene of interest in WikiGene, you can either (1) go directly to its page if you know the name of your gene (for the human gene CCR5, the name of the WikiGene page is "Homo_sapiens_gene:_CCR5"), or (2), preferably, use LitMiner's built-in GeneFinder to identify your gene of interest and follow the WikiGene link there to be directed to the correct page.
Reference Maier H, Doehr S, Grote K, O'Keeffe S, Werner T, Hrabe de Angelis M, Schneider R. LitMiner and WikiGene: identifying problem-related key players of gene regulation using publication abstracts. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W779-82
Abstract LitMiner and WikiGene: identifying problem-related key players of gene regulation using publication abstracts. The LitMiner software is a literature data-mining tool that facilitates the identification of major gene regulation key players related to a user-defined field of interest in PubMed abstracts. The prediction of gene-regulatory relationships is based on co-occurrence analysis of key terms within the abstracts. LitMiner predicts relationships between key terms from the biomedical domain in four categories (genes, chemical compounds, diseases and tissues). Owing to the limitations (no direction, unverified automatic prediction) of the co-occurrence approach, the primary data in the LitMiner database represent postulated basic gene-gene relationships. The usefulness of the LitMiner system has been demonstrated recently in a study that reconstructed disease-related regulatory networks by promoter modelling that was initiated by a LitMiner generated primary gene list. To overcome the limitations and to verify and improve the data, we developed WikiGene, a Wiki-based curation tool that allows revision of the data by expert users over the Internet. LitMiner (http://andromeda.gsf.de/litminer) and WikiGene (http://andromeda.gsf.de/wiki) can be used unrestricted with any Internet browser. (PMID: 15980584)
Availability Online: Y, Download: -, Web service: -
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined
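Note The LitMiner abstract above states that gene-regulatory relationships are predicted by co-occurrence analysis of key terms within abstracts. The following is only a minimal sketch of that general idea, not LitMiner's actual implementation: the function name, the plain substring matching and the toy abstracts are all assumptions made for illustration. As the abstract itself points out, co-occurrence yields undirected, unverified relationships.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, key_terms):
    """Count, for each unordered pair of key terms, how many abstracts mention both.

    Hypothetical helper for illustration only. Matching is a naive
    case-insensitive substring test, so "gene1" would also match "gene10".
    """
    terms = [t.lower() for t in key_terms]
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Terms found in this abstract, sorted so each pair has one canonical order.
        present = sorted({t for t in terms if t in lowered})
        counts.update(combinations(present, 2))
    return counts


# Invented toy abstracts with placeholder term names (not real data).
toy_abstracts = [
    "GeneA expression is altered in diseaseX.",
    "GeneA and GeneB are regulated together.",
    "GeneB in diseaseX, with a role for GeneA.",
]
counts = cooccurrence_counts(toy_abstracts, ["GeneA", "GeneB", "diseaseX"])
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with higher counts would be candidate relationships for later expert review, which is the role the abstract assigns to WikiGene.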

XplorMed [BIONLP_100020015] Tool URL
Description The XplorMed server allows you to explore a set of abstracts derived from a MEDLINE search. The system gives you the main associations between the words in groups of abstracts. You can then select a subset of your abstracts based on selected groups of related words and iterate your analysis on them. XplorMed is recommended for cases in which you do not know exactly what you are expecting to find. Your interests may be modified by the results obtained, or you may want to ask new questions as the analysis develops. The results may also suggest additional words that should be used to expand your query in MEDLINE (e.g., unexpected abbreviations of a protein name, or synonyms of a disease).
Reference Perez-Iratxeta C, Perez AJ, Bork P, Andrade MA. Update on XplorMed: A web server for exploring scientific literature. Nucleic Acids Res. 2003 Jul 1;31(13):3866-8
Abstract Update on XplorMed: A web server for exploring scientific literature. As scientific literature databases like MEDLINE increase in size, so does the time required to search them. Scientists must frequently inspect long lists of references manually, often just reading the titles. XplorMed is a web tool that aids MEDLINE searching by summarizing the subjects contained in the results, thus allowing users to focus on subjects of interest. Here we describe new features added to XplorMed during the last 2 years (http://www.bork.embl-heidelberg.de/xplormed/). (PMID: 12824439)
Availability Online: Y, Download: -, Web service: Y
User relevance Biologist: 3, Curator: 2, NLP: 1
Query input Undefined
Query output Undefined
Keywords Undefined