PLAN2L : PLant ANnotation to Literature: a text mining and information extraction system for Arabidopsis thaliana





SYSTEM DESCRIPTION, OVERVIEW AND EVALUATION:

General system description
PLAN2L is an online text mining application that improves the integration of knowledge by retrieving and scoring textual hits for multiple biological topics simultaneously, making it possible to explore the use of literature data for describing the developmental interactomes of Arabidopsis. Our system incorporates information extraction of individual entities, together with retrieval of protein interaction relations and gene regulatory associations, ranking each of these biological objects according to their relevance for central developmental processes studied in higher plants, namely flowering, leaf, root and seed development. At the cellular level, we prioritize each of the Arabidopsis genes for their implication in the cell cycle process through ranked links to their corresponding evidence texts together with co-mentioned cell cycle terms. Spatial information in terms of the sub-cellular location of proteins can be useful to understand the functional properties and interaction network of a particular protein; therefore PLAN2L integrates a localization retrieval module for finding location evidence descriptions. The figure below provides an overview of the PLAN2L information extraction and text mining system for Arabidopsis thaliana. This information resource aims to improve the efficiency of searching the available literature for the plant model organism Arabidopsis thaliana in general, and especially for biological relations (protein interactions and gene regulation), as well as for developmental processes studied in higher plants.



Preprocessing and article retrieval
In order to extract more fine-grained information at the level of bio-entities, the literature collection relevant for the studied model organism needs to be gathered first. This was accomplished using a document retrieval pipeline that takes into account several sources of evidence for determining whether a given article is associated with A. thaliana: (1) External references derived from multiple databases providing annotations and literature references for A. thaliana genes. (2) Organism and taxonomic name tagging using dictionary look-up based on a species lexicon derived from the NCBI Taxonomy that was automatically extended using a rule-based approach to account for typographical variants and abbreviations of species names. (3) Keyword-based retrieval from PubMed and PubMed Central. The fraction of Arabidopsis mentions among all tagged organism mentions co-occurring in the article is used to score how specific the article is for this plant model organism. Additionally, a full text collection of Arabidopsis-related articles was constructed from a local repository of open access full text articles as well as by using an in-house retrieval system to collect articles. Plain text conversion was carried out through a combination of systems including pdftotext. Both abstracts and full text articles were then further processed using a rule-based sentence boundary detection module implemented in Python, specifically adapted to handle biomedical articles.
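The article specificity score described above can be illustrated with a minimal sketch. It assumes that organism mentions have already been tagged and normalized to canonical species names; the function name and the example mentions are hypothetical and are not taken from the PLAN2L code.

```python
from collections import Counter


def arabidopsis_specificity(organism_mentions, target="Arabidopsis thaliana"):
    """Fraction of the tagged organism mentions in one article that refer to the target species."""
    counts = Counter(organism_mentions)
    total = sum(counts.values())
    if total == 0:
        # No organism was tagged at all, so the article gets the lowest score.
        return 0.0
    return counts[target] / total


# Hypothetical article with four tagged organism mentions.
mentions = [
    "Arabidopsis thaliana",
    "Arabidopsis thaliana",
    "Oryza sativa",
    "Arabidopsis thaliana",
]
score = arabidopsis_specificity(mentions)
```

An article that mentions only Arabidopsis would score 1.0 under this sketch, while an article that mainly discusses other species would score close to 0.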

Gene/protein mention normalization
An important step for the extraction of protein and gene annotations is the detection of links between the literature and concrete biological entities, for instance as provided in annotation databases; this is often referred to as protein or gene mention normalization. Our protein normalization approach is based on the construction and look-up of a gene and protein lexicon, followed by a protein normalization scoring/disambiguation approach. The gene dictionary integrated A. thaliana gene names and symbols derived from multiple databases, including TAIR and SwissProt, from a collection of gene and protein names identified by a machine learning named entity recognition program (ABNER), and from a rule-based approach considering morphological cues and name length to identify potential Arabidopsis gene symbols (e.g. using organism source gene prefixes and suffixes like 'At' or 'AT'). The lexicon was then expanded using manually crafted rules. For disambiguation and for scoring the reliability of a given entity normalization, we calculated the document similarity between the context of the mention and the corresponding database record. Additionally, co-mentioned entity attributes (mutations, sequence length and molecular weight) were used as disambiguation qualifiers. The figure below provides a general flowchart of the PLAN2L protein normalization process.
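The disambiguation step compares the textual context of a mention with candidate database records. The source does not state which similarity measure was used; the sketch below assumes cosine similarity over simple bag-of-words vectors, and the record identifiers and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context, candidate_records):
    """Return (similarity, record_id) pairs, best match first."""
    ctx = bag_of_words(context)
    scored = [
        (cosine_similarity(ctx, bag_of_words(description)), record_id)
        for record_id, description in candidate_records.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical mention context and two candidate records sharing the same symbol.
context = "The mutant shows delayed flowering under long days"
records = {
    "record_A": "regulator of flowering time under long days",
    "record_B": "root hair elongation protein",
}
ranked = rank_candidates(context, records)
```

In this toy case the record describing flowering ranks first because it shares several tokens with the mention context, while the root-related record shares none.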



Gene regulation
Regulation of gene expression is a fundamental cellular control process that involves complex interactions between genes, transcription factors (proteins) and other biological entities. To extract such complex relations, where the correct identification of the directionality of the event (i.e. regulator and regulated gene) plays an important role, we adapted an Information Extraction (IE) architecture relying on a pipeline of semantic/syntactic rules. We applied part-of-speech (POS) tagging to each word using a GENIA-trained version of TreeTagger (Schmid et al, 1994). A subsequent module substituted some of the POS tags with more semantically oriented labels, such as org (organism), nnpg (protein/gene name), actv (activation verb), etc. For this named entity recognition task we used dictionaries, including the previously described gene lexicon. The text with mixed syntactic and semantic tags was fed into a SCOL parser (Abney et al 1996) that generated a tree-like structure by applying a modified CASS grammar originally developed for the STRING-IE system (Saric et al. 2004). These rules constitute cascades of finite-state automata, and use patterns that combine both grammatical and biological-meaning features in the linguistic structure. The initial cascades group all tokens referring to a single entity (like multiple word terms), while the later ones are triggered by active or passive forms of regulation-related verbs or nominal phrases, e.g. "the activation of gene X by protein Y". We implemented extensions of the rules to handle frequent phrase coordination and prepositional anaphora, which the original system did not attempt because of self-imposed restrictions. For example, an "activation" relationship between PROTEIN1 and GENE1 can be inferred from a sentence such as (1), but "repressing" relationships with the rest of the entities should be extracted as well.
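To make the role of directionality concrete, the sketch below uses three regular expressions for nominal, active and passive constructions. It is a strong simplification and not the SCOL/CASS grammar used by the system: it works on plain text in which entities are single tokens, and the pattern names, trigger words and output format are assumptions made for illustration only.

```python
import re

# Nominal form: "<relation noun> of <target> by <regulator>"
NOMINAL = re.compile(
    r"\b(activation|repression) of (?:gene )?(\w+) by (?:protein )?(\w+)",
    re.IGNORECASE,
)
# Active form: "<regulator> activates <target>"
ACTIVE = re.compile(r"\b(\w+) (activates|represses) (\w+)", re.IGNORECASE)
# Passive form: "<target> is activated by <regulator>"
PASSIVE = re.compile(r"\b(\w+) is (activated|repressed) by (\w+)", re.IGNORECASE)

RELATION_TYPE = {
    "activation": "activation", "activates": "activation", "activated": "activation",
    "repression": "repression", "represses": "repression", "repressed": "repression",
}


def extract_regulation(sentence):
    """Return (regulator, relation_type, target) triples found in the sentence."""
    relations = []
    for m in NOMINAL.finditer(sentence):
        relations.append((m.group(3), RELATION_TYPE[m.group(1).lower()], m.group(2)))
    for m in ACTIVE.finditer(sentence):
        relations.append((m.group(1), RELATION_TYPE[m.group(2).lower()], m.group(3)))
    for m in PASSIVE.finditer(sentence):
        relations.append((m.group(3), RELATION_TYPE[m.group(2).lower()], m.group(1)))
    return relations


nominal = extract_regulation("the activation of gene X by protein Y")
active = extract_regulation("PROTEIN1 represses GENE2")
passive = extract_regulation("GENE3 is repressed by PROTEIN1")
```

Note how the regulator appears after "by" in the nominal and passive forms but before the verb in the active form; a rule set has to encode this explicitly to assign the regulator and the regulated gene correctly. Coordination and anaphora are not handled in this sketch.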
Additionally, we have constructed a high recall system for ranking sentences related to transcription, gene regulation and expression. This system is based on an SVM (radial basis kernel) approach that uses a collection of gene regulation relevant and non-relevant sentences as training set and follows the bag-of-words approach. The initial feature word dictionary was filtered to remove stop words (uninformative words), and words are weighted using term frequency. To facilitate the practical interpretation of the sentence scores obtained from this classifier, we have evaluated a random subset of sentences for each score interval through comparison to manual classification. The obtained result is shown in the figure below, where for each score interval the manual classification result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as a dotted grey vertical line.
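The feature representation and kernel named above can be sketched as follows. The stop word list, the tokenization and the gamma value are illustrative assumptions, raw counts are used as term frequency, and the sketch covers only feature extraction and the kernel function; training an actual SVM would require a dedicated library.

```python
import math
import re
from collections import Counter

# Illustrative stop word list; the list used by the system is not given in the text.
STOP_WORDS = {"the", "of", "a", "an", "is", "by", "in", "and", "to"}


def term_frequency_vector(sentence):
    """Bag-of-words vector: token -> raw term frequency, with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    terms = set(x) | set(y)
    squared_distance = sum((x[t] - y[t]) ** 2 for t in terms)
    return math.exp(-gamma * squared_distance)


a = term_frequency_vector("The expression of the gene is induced")
b = term_frequency_vector("Transcription of the gene is induced by light")
similarity_same = rbf_kernel(a, a)
similarity_other = rbf_kernel(a, b)
```

The kernel value is 1.0 for identical vectors and decreases towards 0 as the sentences share fewer feature words.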


Protein Interaction
There is an increasing interest in the characterization of the Arabidopsis thaliana protein interactome from a systems biology perspective (Cui et al. 2008). The extraction of protein interaction evidence associations was addressed using a machine learning sentence classifier approach relying on manually selected interaction evidence sentences (Krallinger et al. 2008). The sentence classifier relies on a Support Vector Machine algorithm trained on a set of manually classified interaction evidence passages derived from a collection used at the second BioCreative challenge. The resulting classifier used a set of 9,970 feature words, and obtained a precision of 89.75 and a recall of 92.62 using a radial basis kernel function on a balanced test set. Finally, experimental keywords have been automatically tagged to account for experimental interaction detection methods described in the literature.
To facilitate the practical interpretation of the sentence scores obtained from this classifier, we have evaluated a random subset of sentences for each score interval through comparison to manual classification. The obtained result is shown in the figure below, where for each score interval the manual classification result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as a dotted grey vertical line. Note that these results are based on sentences regardless of whether they mention at least two proteins or an experimental interaction detection keyword. In this particular case it would make sense to use a more stringent sentence score cut-off.
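The per-interval evaluation can be summarized with a short sketch that groups manually labeled sentences by classifier score interval and reports the fraction judged relevant in each interval. The interval width and the example scores are hypothetical; the actual intervals are those shown in the figure.

```python
import math
from collections import defaultdict


def relevant_fraction_by_interval(scored_labels, width=0.5):
    """Group (score, is_relevant) pairs into score intervals of the given width.

    Returns a dict mapping the lower bound of each interval to the fraction
    of manually labeled sentences in that interval that were judged relevant.
    """
    bins = defaultdict(lambda: [0, 0])  # interval start -> [relevant, non_relevant]
    for score, is_relevant in scored_labels:
        start = math.floor(score / width) * width
        bins[start][0 if is_relevant else 1] += 1
    return {
        start: relevant / (relevant + non_relevant)
        for start, (relevant, non_relevant) in sorted(bins.items())
    }


# Hypothetical manually checked sample: (classifier score, judged relevant?)
sample = [
    (-0.7, False), (-0.6, False),
    (0.1, True), (0.3, False),
    (1.2, True), (1.4, True),
]
fractions = relevant_fraction_by_interval(sample)
```

A table of this kind helps a user pick a cut-off: raising the threshold to an interval with a higher relevant fraction trades recall for precision.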


Sub-cellular location evidence
To retrieve protein localization description sentences, we explored both the use of semantic-syntactic frames for extracting a fine-grained association between proteins and subcellular location mentions and a machine learning sentence classifier for retrieving protein localization description sentences in general. The initial step consisted of the construction of a sub-cellular location dictionary that integrates location keywords and synonyms derived from SwissProt together with Cellular Component terms from Gene Ontology. After detecting protein names co-occurring with the location terms, a total of 1,288 sentences were manually inspected to derive hand-crafted location frames. This resulted in a total of 396 location frames, covering mainly binary relations between a single protein and a single location term, although a subset also corresponded to protein associations to multiple (alternative) locations. We then applied an approach to learn locative expressions using automatic expansion of an initial seed set of 220 manually defined location and motion-relevant verb roots. As localization expressions might be sensitive to inflectional properties, we decided to apply verb root extension rather than morphological normalization. Automatically generated variants were then filtered based on their occurrence in the whole PubMed database, leaving a total of 6,436 location words. The sentence classifier was constructed using a collection of 2,264 protein location descriptions.
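The verb root extension and filtering step can be sketched as below. The suffix list, the seed roots and the small vocabulary standing in for PubMed are all hypothetical; the source does not list the extension rules that were actually used.

```python
# Illustrative inflectional and derivational endings.
SUFFIXES = ["", "s", "es", "ed", "ing", "ation"]


def expand_roots(roots, suffixes=SUFFIXES):
    """Generate candidate location words by attaching each suffix to each verb root."""
    return {root + suffix for root in roots for suffix in suffixes}


def filter_by_corpus(candidates, corpus_vocabulary):
    """Keep only the generated variants that actually occur in the reference corpus."""
    return sorted(c for c in candidates if c in corpus_vocabulary)


# Two hypothetical seed roots and a toy vocabulary standing in for the corpus.
seed_roots = ["localiz", "transport"]
vocabulary = {"localizes", "localized", "localization", "transport", "transported", "nucleus"}

candidates = expand_roots(seed_roots)
location_words = filter_by_corpus(candidates, vocabulary)
```

Over-generated forms such as "localizs" are discarded by the corpus filter, which is why the expansion rules can afford to be permissive.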

Cellular and developmental processes
A central component of PLAN2L is the scoring of each evidence sentence according to its relevance for complex temporal biological events (topics), at the cellular level (cell cycle) as well as at the level of developmental processes. We therefore implemented a classifier for scoring cell cycle relevant abstracts and document passages. The SVM text classifier was trained on a collection of cell cycle relevant abstracts and non-relevant abstracts and then applied to a literature collection of abstracts and full text articles mentioning A. thaliana genes. The full text passage classifier models were applied to classify and score each of the Arabidopsis full text sentence passages using a sliding window approach, resulting in collections of 2,987,342 (5-sentence windows) and 2,971,840 (7-sentence windows) cell cycle-scored passages. Additionally, four specific sentence classifiers have been developed for the most relevant developmental processes in higher plants, namely (a) flowering, (b) leaf development, (c) root development and (d) seed development/germination. The tool provides a comprehensive approach to assist in the selection and ranking of genes, proteins, documents and terms relevant to a specific biological process for this model organism.
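The sliding window approach over full text sentences can be sketched as follows. The step size of one sentence and the decision to skip documents shorter than the window are assumptions; the source only states the window sizes of 5 and 7 sentences.

```python
def sliding_windows(sentences, size):
    """Return all passages of `size` consecutive sentences, moving one sentence at a time."""
    if len(sentences) < size:
        # Assumption: documents shorter than the window yield no passage.
        return []
    return [sentences[i:i + size] for i in range(len(sentences) - size + 1)]


# Hypothetical document with seven sentences.
document = [f"Sentence {i}." for i in range(1, 8)]
windows_of_5 = sliding_windows(document, 5)
windows_of_7 = sliding_windows(document, 7)
```

Each resulting passage would then be scored by the passage classifier, so a single sentence contributes to several overlapping scored windows.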

Similarly to the approach followed for the gene regulation and interaction classifiers, we also integrated a single sentence classifier for the cell cycle topic using a balanced training set of 5,840 sentences. We have evaluated a random subset of sentences for each sentence score interval through comparison to manual classification. The obtained result for the cell cycle single sentence classifier is shown in the figure below, where for each score interval the manual classification result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as a dotted grey vertical line. Note that these results do not depend on whether the sentence contains a gene or protein mention, but only on whether the sentence is relevant to cell cycle, cell division or related biological processes. From this sample evaluation we can see that scores above 2 show a very high precision and that the default cut-off is still suitable to recover cell cycle relevant sentences at an acceptable performance.

For each of the four main developmental processes studied in Arabidopsis, a specific sentence classifier has been trained. In the case of the flowering process (i.e. flower-related topic), a balanced collection of 10,000 sentences was used as training set. The negative (non-relevant) sentences were derived by random sentence selection from the Arabidopsis bibliome. Therefore the developmental sentence classifiers are actually based on a semi-supervised learning approach, under the assumption that most of the randomly selected instances correspond to non-relevant sentences. The same strategy was also followed for the other developmental processes, namely leaf development (i.e. leaf-related topic), root development (root-related topic) and seed development (seed, seedling and germination related topic). In the case of the leaf topic, we used a balanced collection of 2,344 sentences for training the system, while for the root and seed topics we relied on 2,458 and 11,083 sentences respectively for the classifier construction. The figure below shows the evaluation against manually labeled sentences for randomly selected sentences for predefined score intervals.
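The construction of a balanced training set with randomly drawn negatives can be sketched as follows. The function name, the fixed random seed and the toy sentences are assumptions for illustration; in line with the strategy described above, the background sample is not filtered, so a few drawn "negatives" may in fact be relevant.

```python
import random


def build_balanced_training_set(positive_sentences, background_sentences, seed=0):
    """Pair each positive sentence with one randomly drawn background sentence.

    Returns a list of (sentence, label) tuples with label 1 for topic-relevant
    sentences and label 0 for the randomly selected, assumed non-relevant ones.
    """
    rng = random.Random(seed)
    # random.sample raises ValueError if the background is smaller than the positive set.
    negatives = rng.sample(background_sentences, len(positive_sentences))
    return [(s, 1) for s in positive_sentences] + [(s, 0) for s in negatives]


# Hypothetical data: two flowering-related sentences and a toy background collection.
positives = ["Flowering time is delayed in the mutant.", "Petal identity requires this gene."]
background = [f"Background sentence {i}." for i in range(10)]
training = build_balanced_training_set(positives, background)
```

The label noise introduced by unfiltered random negatives is tolerable only as long as relevant sentences are rare in the background collection.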



Frequently asked questions (FAQ):

1) What is PLAN2L?
PLAN2L is an automatic bio-text mining system developed for the plant model organism Arabidopsis thaliana, with the aim of enabling more efficient retrieval of biologically relevant information related to protein interactions, regulatory events and some of the prominent biological processes.

2) What kind of searches can be carried out using PLAN2L?
Supported searches include genes/proteins, keywords and pairs of bio-entities.

3) What do the sentence and article scores mean and how have they been generated?
The scores reflect the relevance for the given biological topic. Positive scores mean that the sentence is relevant for the topic; negative scores mean that it is not relevant. These scores have been generated using a machine learning approach based on Support Vector Machines (SVMs), trained on a collection of sentences known to be relevant for the topic in order to ‘detect’ which terms are relevant for the given topic.

4) How have the gene regulation relations been extracted?
They have been generated using a rule-based information extraction system that exploits both syntactic and semantic information to determine whether two co-mentioned genes or proteins have a regulatory association.

5) How have the protein interaction relations been extracted?
They have been extracted using a sentence classifier based on SVMs together with the analysis of co-occurring bio-entities and experimental interaction detection method terms.

6) What is the BioCreative Metaserver (BCMS)?
The BCMS is a meta-server that integrates text annotations from various systems.

7) Does PLAN2L contain the whole PubMed database?
No. In the online version of PLAN2L we restrict the data collection to articles that are associated with Arabidopsis thaliana, because we wanted to provide a system that initially offers better literature-mining support specifically for the Arabidopsis user community. Without this restriction, end users would have to face additional inter-species gene symbol ambiguity and adapt their queries so that only Arabidopsis relevant articles are retrieved. This would in general be similar to some of the problems encountered when carrying out baseline PubMed searches. The technology used by PLAN2L could nevertheless in principle be adapted to handle other model organisms.

8) Are there Arabidopsis relevant articles not contained in PLAN2L?
We adopted a high recall strategy, integrating Arabidopsis relevant articles through a pipeline that takes into account references contained in annotated resources (TAIR, SwissProt) as well as articles detected through Arabidopsis species mention look-up. This implies that most of the Arabidopsis relevant articles contained in PubMed or PubMed Central should be covered by PLAN2L. Nevertheless, we did not include articles from journals that are not contained in PubMed (e.g. specialized conference proceedings or articles only contained in the AGRICOLA database - agricola.nal.usda.gov, but not covered in PubMed).

9) How many articles are contained in PLAN2L?
Currently there are a total of 73,622 articles (titles or titles with abstracts, corresponding to 332,839 sentences) in PLAN2L and a total of 11,637 full text articles.

10) When was the last PLAN2L data update?
The last update of articles was on December 27th, 2008. We intend to update the system every 6 months.

11) How are gene/protein mentions identified in the text?
Mainly using a dictionary look-up approach. For more details, refer to the system description section.

12) I was searching with a gene and did not find any hit, why?
There are several potential explanations: either there is no literature description in the underlying article collection, or the gene/protein could not be identified because of a lexical or typographical variant of the gene mention that is currently not contained in the PLAN2L lexicon. If this is the case, please send us the query gene information (TAIR locus id together with the query name used).

13) Why are there cases in which no query term is highlighted in the sentence?
Because the text highlighting is case sensitive, which makes it easier to spot the cases that match the query term exactly.


COMPARISON TO RELATED RESOURCES:

1. Comparison to basic literature search systems
There is a range of online literature search engines; some of the most popular ones include PubMed, AGRICOLA, HighWire Press and Google Scholar. A more exhaustive analysis of existing applications is described in Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9 Suppl 2:S8. These systems offer a general infrastructure to retrieve articles of biological interest through Boolean search queries. Although some tools like PubMed improve user queries through consideration of specialized text indexing initiatives (e.g. using MeSH terms), they are usually not suitable for carrying out more complex searches at the level of evidence sentences, relevant not only for a particular organism of interest (e.g. A. thaliana) but also for a certain biological topic like interaction relations or developmental processes. In this context PLAN2L represents a complementary approach to improve literature searches for a set of biological topics of interest for plant sciences in general and for the Arabidopsis community in particular.

2. Comparison to manually annotated resources.
Databases like TAIR (The Arabidopsis Information Resource) or SwissProt provide plant biologists with valuable infrastructures of manually curated information. Database annotations are based on manual revision by domain experts (database curators) who extract relevant information on genes and gene products from the literature, often encoding the resulting information in the form of structured database records that associate these bio-entities to controlled vocabulary terms (keywords, ontology terms). PLAN2L provides complementary information to databases by directly pointing to relevant literature descriptions, rather than offering formal associations between bio-entities and controlled vocabulary terms. Interpretation and validation of database annotations by the end user biologist is sometimes challenging, whereas the retrieval of evidence sentences as offered by PLAN2L makes direct interpretation of the returned descriptions by the human end user feasible.

3. Comparison to other text mining applications for Arabidopsis.
Despite the considerable number of newly published text mining methods over the past years, only a small fraction is actually available as online applications, and among these only a few are able to provide specific information for A. thaliana. Previously published systems include the Dragon Plant Biology Explorer (DPBE), PubSearch and Textpresso.



POTENTIAL SYSTEM LIMITATIONS:

PLAN2L is a text mining system, which implies that the obtained results are primarily generated automatically. Therefore it is important to keep in mind when interpreting obtained results, especially taking into account the complexity of scientific language, that such a system has several potential limitations. Common difficulties encountered by PLAN2L are actually similar to problems that most text mining systems currently experience. These include:

(a) General difficulty in correctly detecting associations of articles as well as specific bio-entity mentions to the relevant organism/species of interest (in this case A. thaliana). This may be explained partially by several factors, such as multiple species mentions in a particular article (species disambiguation), difficulties in handling implicit species association through gene symbols without mentioning the organism source explicitly (in abstracts) or the case of gene and protein inter-species ambiguity (e.g. detecting whether the mentioned gene corresponds to Arabidopsis or another species like rice).
(b) Difficulties in correctly detecting gene and protein names and their associations to database records due to gene and protein name/symbol ambiguity as well as variability in referring to a given bio-entity in the literature.
(c) Retrieval and preprocessing of abstracts and especially full text articles, due to limitations in terms of accessibility of the documents as well as errors in plain text conversion of PDF formatted articles.
(d) In the case of the extraction of relations, the correct handling of multiple gene mentions, coordination and negation is also a common challenge.






by PLAN2L team 2009, last page update 13th May