Preprocessing and article retrieval
In order to extract more fine-grained information at the level of bio-entities, the
literature collection relevant to the studied model organism needs to be gathered first.
This was accomplished using a document retrieval pipeline that takes into account several sources
of evidence for determining whether a given article is associated with A. thaliana: (1)
external references derived from multiple databases providing annotations and literature
references for A. thaliana genes; (2) organism and taxonomic name tagging using dictionary
look-up based on a species lexicon derived from the NCBI Taxonomy, which was automatically
extended using a rule-based approach to account for typographical variants and abbreviations of
species names; and (3) keyword-based retrieval from PubMed and PubMed Central. The fraction of
Arabidopsis mentions among all tagged organism mentions co-occurring in the article is
used to score how specific the article is for this plant model organism. Additionally, a full-text
collection of Arabidopsis-related articles was constructed from a local repository of open access
full-text articles and from articles collected with an in-house retrieval system. Plain text
conversion was carried out through a combination of systems including pdftotext. Both abstracts and
full-text articles were then further processed using a rule-based sentence boundary detection
module implemented in Python, specifically adapted to handle biomedical articles.
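The organism specificity score described above can be illustrated with a minimal sketch. It assumes that the species tagger returns one normalized species name per tagged mention; the function name and the input format are hypothetical and are not taken from the PLAN2L code.

```python
from collections import Counter

def arabidopsis_specificity(tagged_species):
    """Fraction of organism mentions in one article that refer to Arabidopsis.

    `tagged_species` is a list of normalized species names, one per tagged
    mention (hypothetical output format of the dictionary look-up tagger).
    """
    if not tagged_species:
        return 0.0  # no organism tagged at all: no evidence either way
    counts = Counter(tagged_species)
    return counts["Arabidopsis thaliana"] / len(tagged_species)

# Toy article with four tagged organism mentions, three of them Arabidopsis.
mentions = ["Arabidopsis thaliana", "Arabidopsis thaliana",
            "Oryza sativa", "Arabidopsis thaliana"]
score = arabidopsis_specificity(mentions)
print(f"{score:.2f}")  # prints 0.75
```

An article that mentions many other species alongside Arabidopsis would receive a lower score under this definition.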
Gene/protein mention normalization
An important step for the extraction of protein and gene annotations is the detection of links between the
literature and concrete biological entities, for instance as provided in annotation databases; this step is often
referred to as protein or gene mention normalization. Our protein normalization approach is based on the construction
and look-up of a gene and protein lexicon, followed by a protein normalization scoring/disambiguation step.
The gene dictionary integrates A. thaliana gene names and symbols derived from multiple databases, including
TAIR and SwissProt, from a collection of gene and protein names identified by a machine learning named entity
recognition program (ABNER), and from a rule-based approach that considers morphological cues and name
length to identify potential Arabidopsis gene symbols (e.g. using organism source gene prefixes and suffixes
like 'At' or 'AT'). The lexicon was further expanded using manually crafted rules. For disambiguation and
for scoring the reliability of a given entity normalization, we calculated the document similarity between the
context of the mention and the corresponding database record. Additionally, co-mentioned entity attributes
(mutations, sequence length and molecular weight) were used as disambiguation qualifiers. The figure below
provides a general flowchart of the PLAN2L protein normalization process.
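The similarity-based disambiguation step can be sketched as follows. The text does not specify which document similarity measure was used, so cosine similarity over simple bag-of-words vectors is an assumption here; the record identifiers and descriptions are invented toy data.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased token counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_candidates(context, records):
    """Rank candidate database records by similarity to the mention context."""
    ctx = bag_of_words(context)
    scored = [(cosine_similarity(ctx, bag_of_words(description)), record_id)
              for record_id, description in records.items()]
    return sorted(scored, reverse=True)

# Toy example: one ambiguous mention, two candidate records (invented data).
context = "the mutant shows delayed flowering under long days"
records = {
    "GENE_A": "regulator of flowering time under long days",
    "GENE_B": "root hair elongation protein",
}
ranking = rank_candidates(context, records)
print(ranking[0][1])  # best-matching candidate record
```

Co-mentioned attributes such as mutations or sequence length could be added as extra tokens or as separate filters; that part is not shown.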
Gene regulation
Regulation of gene expression is a fundamental cellular control process that involves complex interactions
between genes, transcription factors (proteins) and other biological entities. To extract such complex
relations, where the correct identification of the directionality of the event (i.e. regulator and regulated gene)
plays an important role, we adapted an Information Extraction (IE) architecture relying on a pipeline of
semantic/syntactic rules. We applied part-of-speech (POS) tagging of each word using a GENIA-trained version of
TreeTagger (Schmid et al, 1994). A subsequent module substituted some of the POS tags with more
semantically oriented labels, such as org (organism), nnpg (protein/gene name), actv
(activation verb), etc. For this Named-Entity Recognition task we used dictionaries, including the previously
described gene lexicon. The text with mixed syntactic and semantic tags was fed into a SCOL parser
(Abney et al. 1996) that generated a tree-like structure by applying a modified CASS grammar originally
developed for the STRING-IE system (Saric et al. 2004). These rules constitute cascades of finite-state
automata, and use patterns that combine both grammatical and biological-meaning features in the linguistic
structure. The initial cascades group all tokens referring to a single entity (like multiple word terms),
while the later ones are triggered by active or passive forms of regulation-related verbs or nominal phrases,
e.g. "the activation of gene X by protein Y". We implemented extensions of the rules to
handle frequent phrase coordination and prepositional anaphora, which the original system did not attempt to handle
due to self-imposed restraints. For example, an "activation" relationship between PROTEIN1 and GENE1 can be
inferred from a sentence such as (1), but "repressing" relationships with the rest of the entities should
be extracted as well.
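To make the idea of mixed syntactic/semantic tags and direction-aware patterns more concrete, here is a deliberately simplified sketch. It uses Python regular expressions instead of the SCOL parser and CASS grammar, the lexicons and POS tags are toy assumptions, and it covers only one active and one passive verb pattern.

```python
import re

# Toy lexicons (hypothetical); the real system relies on the gene lexicon
# and on curated lists of regulation-related verbs.
GENE_LEXICON = {"PROTEIN1", "GENE1"}
ACTIVATION_VERBS = {"activates", "activated"}

def to_tagged_string(tokens):
    """Replace POS tags by semantic labels (nnpg, actv) where a lexicon matches."""
    parts = []
    for word, pos in tokens:
        if word in GENE_LEXICON:
            pos = "nnpg"
        elif word.lower() in ACTIVATION_VERBS:
            pos = "actv"
        parts.append(f"{word}/{pos}")
    return " ".join(parts)

# Active form: "X activates Y"; passive form: "Y is activated by X".
ACTIVE = re.compile(r"(\S+)/nnpg (\S+)/actv (\S+)/nnpg")
PASSIVE = re.compile(r"(\S+)/nnpg \S+/vbz (\S+)/actv by/in (\S+)/nnpg")

def extract_activation(tokens):
    """Return regulator and target of an activation event, or None."""
    text = to_tagged_string(tokens)
    match = PASSIVE.search(text)
    if match:
        return {"regulator": match.group(3), "target": match.group(1)}
    match = ACTIVE.search(text)
    if match:
        return {"regulator": match.group(1), "target": match.group(3)}
    return None

passive = [("GENE1", "nn"), ("is", "vbz"), ("activated", "vbn"),
           ("by", "in"), ("PROTEIN1", "nn")]
active = [("PROTEIN1", "nn"), ("activates", "vbz"), ("GENE1", "nn")]
print(extract_activation(passive))
print(extract_activation(active))
```

Both toy sentences yield PROTEIN1 as regulator and GENE1 as target, which is the directionality information that plain co-occurrence cannot provide.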
Additionally, we constructed a high-recall system for ranking sentences related to transcription, gene regulation and expression. This system is based on an SVM (radial basis kernel) that uses a collection of gene regulation relevant and non-relevant sentences as training set and follows the bag-of-words approach. The initial feature word dictionary was filtered to remove stop words (uninformative words), and words are weighted using term frequency. To facilitate the practical interpretation of the sentence scores obtained from this classifier, we evaluated, for each score interval, a random subset of sentences by comparison to manual classification. The result is shown in the figure below, where for each score interval the manual classification result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as a dotted grey vertical line.
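The feature representation and kernel named above can be sketched in a few lines. This is not the trained classifier: it only shows term-frequency bag-of-words vectors after stop-word removal and the radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2), that an SVM would evaluate on them. The stop-word list and the gamma value are toy assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an", "is", "by", "in", "and"}  # toy list

def tf_vector(sentence):
    """Term-frequency bag-of-words vector with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel on two sparse term-frequency vectors."""
    terms = x.keys() | y.keys()
    squared_distance = sum((x[t] - y[t]) ** 2 for t in terms)
    return math.exp(-gamma * squared_distance)

v1 = tf_vector("The expression of the gene is repressed")
v2 = tf_vector("Expression of the gene is induced")
v3 = tf_vector("Plants were grown in soil")

k12 = rbf_kernel(v1, v2)  # two regulation-related sentences
k13 = rbf_kernel(v1, v3)  # regulation-related vs. unrelated sentence
print(k12 > k13)  # prints True
```

Sentences sharing many feature words get a kernel value close to 1, while sentences with little vocabulary overlap get a value close to 0.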
Protein Interaction
There is an increasing interest in the characterization of the Arabidopsis thaliana protein interactome
from a systems biology perspective (Cui et al. 2008). The extraction of protein interaction evidence
was addressed using a machine learning sentence classifier relying on manually
selected interaction evidence sentences (Krallinger et al. 2008). The sentence classifier relies on a
Support Vector Machine algorithm trained on a set of manually classified interaction evidence passages derived
from a collection used at the second BioCreative challenge. The resulting classifier
used a set of 9,970 feature words and obtained a precision of 89.75 and a recall of 92.62
using a radial basis kernel function on a balanced test set. Finally, experimental keywords were
automatically tagged to account for experimental interaction detection methods described in the literature.
To facilitate the practical interpretation of the sentence scores obtained from this classifier, we
evaluated, for each score interval, a random subset of sentences by comparison to manual
classification. The result is shown in the figure below, where for each score interval the manual
classification result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as
a dotted grey vertical line. Note that these results are based on all sentences, regardless of whether they mention at
least two proteins or an experimental interaction detection keyword. In this particular case it would make
sense to use a more stringent sentence score cut-off.
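The per-interval evaluation described above amounts to binning manually judged sentences by classifier score and computing the fraction judged relevant in each bin. A minimal sketch follows; the interval width of 1.0 and the sample data are made up for illustration and do not reproduce the evaluation shown in the figure.

```python
import math
from collections import defaultdict

def precision_by_interval(scored, width=1.0):
    """Fraction of manually relevant sentences per score interval.

    `scored` is an iterable of (classifier score, manually judged relevant?)
    pairs; intervals are [start, start + width).
    """
    bins = defaultdict(lambda: [0, 0])  # interval start -> [relevant, total]
    for score, relevant in scored:
        start = math.floor(score / width) * width
        bins[start][1] += 1
        if relevant:
            bins[start][0] += 1
    return {start: rel / total for start, (rel, total) in sorted(bins.items())}

# Invented sample of manually checked sentences.
sample = [(-1.5, False), (-0.2, False), (0.3, True),
          (0.8, False), (1.4, True), (1.9, True)]
result = precision_by_interval(sample)
for start, precision in result.items():
    print(f"[{start:+.1f}, {start + 1.0:+.1f}): {precision:.2f}")
```

Reading such a table from high to low scores helps a user decide how far below a given score the results remain useful, and whether a cut-off stricter than the default is appropriate.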
Sub-cellular location evidence
To retrieve sentences describing protein localization, we explored both the use of semantic-syntactic
frames for extracting fine-grained associations between proteins and subcellular location mentions
and a machine learning sentence classifier for retrieving protein localization description
sentences in general. The initial step consisted of the construction of a sub-cellular location
dictionary that integrates location keywords and synonyms derived from SwissProt together with Cellular
Component terms from Gene Ontology. After detecting protein names co-occurring with the location terms,
a total of 1,288 sentences were manually inspected to derive hand-crafted location frames. This
resulted in a total of 396 location frames, covering mainly binary relations between a single protein and a
single location term, although a subset also corresponded to associations of a protein with multiple (alternative)
locations. We then applied an approach to learn locative expressions using automatic expansion of an initial
seed set of 220 manually defined location and motion-relevant verb roots. As localization expressions might
be sensitive to inflectional properties, we decided to apply verb root extension rather than morphological
normalization. Automatically generated variants were then filtered based on their occurrence in the whole
PubMed database (leaving a total of 6,436 location words). The sentence classifier was constructed using a
collection of 2,264 protein location descriptions.
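The verb root extension and corpus-based filtering can be sketched as follows. The suffix list, the example root and the tiny vocabulary are hypothetical; the actual expansion rules and the PubMed-wide vocabulary used for filtering are not described in this section.

```python
# Hypothetical suffix list; not the rule set used to build the PLAN2L lexicon.
SUFFIXES = ["e", "es", "ed", "ing", "ation", "ment"]

def expand_root(root):
    """Generate candidate inflected and derived forms of a verb root."""
    return {root + suffix for suffix in SUFFIXES}

def filter_by_corpus(variants, corpus_vocabulary):
    """Keep only generated variants that actually occur in the corpus."""
    return sorted(v for v in variants if v in corpus_vocabulary)

# Stand-in for the set of word forms observed in the literature collection.
corpus_vocabulary = {"localize", "localized", "localizing",
                     "localization", "nucleus"}

candidates = expand_root("localiz")
kept = filter_by_corpus(candidates, corpus_vocabulary)
print(kept)
```

Over-generated forms such as "localizment" are discarded by the corpus filter, which is what makes a permissive expansion step safe.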
Cellular and developmental processes
A central component of PLAN2L is the scoring of each evidence sentence according to its relevance
for complex temporal biological events (topics), at the cellular level (cell cycle) as well as at
the level of developmental processes. We therefore implemented a classifier for scoring cell cycle
relevant abstracts and document passages. The full-text passage classifier models were applied to
classify and score each of the Arabidopsis full-text sentence passages using a sliding window approach,
resulting in collections of 2,987,342 (5-sentence windows) and 2,971,840 (7-sentence windows) cell cycle-scored
passages. The SVM text classifier was trained on a collection of cell cycle relevant and non-relevant
abstracts and then applied to a literature collection of abstracts and full-text articles mentioning
A. thaliana genes. Additionally, four specific sentence classifiers for the most relevant developmental
processes in higher plants, namely (a) flowering, (b) leaf development, (c) root development and (d) seed
development/germination, have been developed. The tool provides a comprehensive approach to assist in the
selection and ranking of genes, proteins, documents and terms relevant to a specific biological process for
this model organism.
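The sliding window construction of full-text passages can be sketched as below. A step of one sentence between consecutive windows and the handling of documents shorter than the window are assumptions; the text only states the window sizes (5 and 7 sentences).

```python
def sliding_windows(sentences, size):
    """Return overlapping passages of `size` consecutive sentences.

    Assumes a step of one sentence; a document shorter than the window
    yields a single passage containing all of its sentences.
    """
    if not sentences:
        return []
    if len(sentences) <= size:
        return [list(sentences)]
    return [sentences[i:i + size] for i in range(len(sentences) - size + 1)]

document = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
windows = sliding_windows(document, 5)
for window in windows:
    print(" ".join(window))
```

Each resulting passage would then be scored by the passage classifier, so a sentence in the middle of a document contributes to several overlapping passages.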
Similarly to the approach followed for the gene regulation and interaction classifiers, we also
integrated a single-sentence classifier for the cell cycle topic using a balanced training set of
5,840 sentences. We evaluated, for each sentence score interval, a random subset of sentences
by comparison to manual classification. The result for the cell cycle single-sentence
classifier is shown in the figure below, where for each score interval the manual classification
result is shown in blue (relevant) and red (non-relevant). The default cut-off is shown as a dotted
grey vertical line. Note that these results do not depend on whether the sentence contains
a gene or protein mention; they are based only on whether the sentence is relevant to cell cycle,
cell division or related biological processes. From this sample evaluation we can see that scores above
2 show a very high precision and that the default cut-off is still suitable for recovering cell cycle
relevant sentences at an acceptable performance.
For each of the four main developmental processes studied in Arabidopsis, a topic-specific sentence
classifier was trained. In the case of the flowering process (i.e. flower-related topic), a balanced
collection of 10,000 sentences was used as training set. The negative (non-relevant) sentences
were derived by random selection of sentences from the Arabidopsis bibliome. Therefore the developmental
sentence classifiers are actually based on a semi-supervised learning approach, under the assumption that
most of the randomly selected instances correspond to non-relevant sentences. The same strategy was also
followed for the other developmental processes, namely leaf development (i.e. leaf-related topic), root
development (root-related topic) and seed development (seed, seedling and germination related topic). In
the case of the leaf topic, we used a balanced collection of 2,344 sentences for training the system, while
in the case of the root and seed topics we relied on 2,458 and 11,083 sentences respectively for the classifier
construction. The figure below shows the evaluation against manually labeled sentences, randomly
selected for predefined score intervals.
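The construction of a balanced training set with randomly sampled negatives can be sketched as follows. Excluding known positive sentences from the random pool and using a fixed random seed are assumptions made for this illustration; the sentences are invented.

```python
import random

def build_balanced_training_set(positives, background, seed=0):
    """Pair topic-relevant sentences with an equal number of random sentences.

    Random background sentences are labelled 0 under the assumption that
    most of them are not relevant to the topic.
    """
    rng = random.Random(seed)
    known_positives = set(positives)
    pool = [s for s in background if s not in known_positives]
    negatives = rng.sample(pool, k=min(len(positives), len(pool)))
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]

positives = ["flowering is delayed in the mutant",
             "floral organ identity is altered"]
background = ["flowering is delayed in the mutant",
              "roots were stained overnight",
              "seedlings were grown on agar plates",
              "samples were frozen in liquid nitrogen",
              "the protein was purified"]
training = build_balanced_training_set(positives, background)
print(len(training), sum(label for _, label in training))  # prints 4 2
```

Some randomly drawn sentences will in fact be topic-relevant, so the negative class is noisy; this is the trade-off accepted by the strategy described above.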
2) What kind of searches can be carried out using PLAN2L?
Supported searches include genes/proteins, keywords and pairs of bio-entities.
3) What do the sentence and article scores mean and how have they been generated?
The scores reflect the relevance for the given biological topic. Positive scores mean that the sentence is relevant
for the topic; negative scores mean that it is not relevant. These scores have been generated using a machine learning
approach based on Support Vector Machines (SVMs),
trained on a collection of sentences known to be relevant for the topic in order to ‘detect’ which terms are relevant for
the given topic.
4) How have the gene regulation relations been extracted?
They have been generated using a rule-based information extraction system that exploits both syntactic and semantic
information to determine whether two co-mentioned gene and protein pairs have a regulatory association.
5) How have the protein interaction relations been extracted?
They have been extracted using a sentence classifier based on SVMs together with the analysis of co-occurring
bio-entities and experimental interaction detection method terms.
6) What is the BioCreative Metaserver (BCMS)?
The BCMS is a meta-server that integrates text annotations from various systems.
7) Does PLAN2L contain the whole PubMed database?
In the online version of PLAN2L we restrict the data collection to articles that are associated with Arabidopsis thaliana, because we wanted to provide a system that initially offers better literature-mining support specifically for the Arabidopsis user community. Without this restriction, end users would have to face additional inter-species gene symbol ambiguity and adapt their queries so that only Arabidopsis relevant articles are retrieved. This would in general be similar to some of the problems encountered when carrying out baseline PubMed searches. Nevertheless, the technology used by PLAN2L could in principle be adapted to handle other model organisms.
8) Are there Arabidopsis relevant articles not contained in PLAN2L?
We adopted a high-recall
strategy, integrating Arabidopsis relevant articles through a pipeline that takes into account
references contained in annotated resources (TAIR, SwissProt) as well as articles detected through
Arabidopsis species mention look-up. This implies that most of the Arabidopsis relevant articles
contained in PubMed or PubMed Central should be covered by PLAN2L. Nevertheless, we did not include
articles from journals that are not contained in PubMed (e.g. specialized conference proceedings
or articles only contained in the AGRICOLA database, agricola.nal.usda.gov, but not covered in PubMed).
9) How many articles are contained in PLAN2L?
Currently there are a total of 73,622 articles (titles or titles with abstracts, corresponding to 332,839 sentences)
in PLAN2L and a total of 11,637 full text articles.
10) When was the last PLAN2L data update?
The last update of articles was on December 27th, 2008. We intend to update the system every 6 months.
11) How are gene/protein mentions identified in the text?
Mainly using a dictionary look-up approach; for more details refer to the system description section.
12) I was searching with a gene and did not find any hit, why?
There are several potential explanations: either there is no literature description in the
underlying article collection, or the gene/protein could not be identified due to lexical or
typographical variability of the gene mention that is currently not covered by the PLAN2L lexicon. If this
is the case, please send us the query gene information (TAIR locus id together with the query name used).
13) Why are there cases in which no query term is highlighted in the sentence?
Because the text highlighting is case sensitive, which makes it easier to spot those cases that exactly match the query term.
2. Comparison to manually annotated resources.
Databases like TAIR (The Arabidopsis Information Resource) or SwissProt provide plant biologists with valuable infrastructures of manually curated information. Database annotations are based on manual revision by domain experts (database curators) who extract relevant information on genes and gene products from the literature, often encoding the resulting information in the form of structured database records that associate these bio-entities with controlled vocabulary terms (keywords, ontology terms). PLAN2L provides information complementary to databases by directly pointing to relevant literature descriptions, rather than offering formal associations between bio-entities and controlled vocabulary terms. The retrieval of evidence sentences as offered by PLAN2L makes direct interpretation of the returned descriptions by the human end user feasible. In contrast, interpretation and validation of database annotations by the end user biologist is sometimes challenging.
3. Comparison to other text mining applications for Arabidopsis.
Despite the considerable number of newly published text mining methods over the past years, only a small fraction is
actually available as online applications, and among these only a few are able to provide specific information
for A. thaliana. Previously published systems include the Dragon Plant Biology Explorer (DPBE),
PubSearch and Textpresso.
PLAN2L and DPBE.
The Dragon Plant Biology Explorer (DPBE) was an online text mining application for plant biology
based on the integration of co-occurrence analysis of collections of vocabularies compiled for several biological topics.
Although DPBE offered the useful possibility of generating and visualizing networks of co-occurring terms and genes, one
of its limitations was that it was based on user-provided abstracts, implying that the obtained results depended heavily
on the proper selection of this initial document set (especially considering term co-occurrence statistics). Under certain scenarios, e.g. the analysis of larger gene collections,
the selection of this initial document set may be especially challenging. Similarly to PLAN2L, DPBE based its gene/protein text
indexing on records contained in the TAIR database. In the case of PLAN2L, a rule-based dictionary expansion was carried out together
with the integration of a gene dictionary derived from SwissProt and of names obtained through bio-entity tagging and through gene
symbol morphology rules. In the case of DPBE, it is not clear from the provided description how the actual protein/gene normalization
is carried out. DPBE makes use of collections of manually curated vocabularies. This implies both advantages and disadvantages.
An obvious advantage of using controlled vocabularies is that they might allow structuring the literature according to predefined
relations existing between these terms. On the other hand, in order to detect links between the terms and the articles,
the vocabulary used should in general be suitable for text indexing. Resources such as Gene Ontology or Plant Ontology
(as used by DPBE) were primarily constructed as a framework for consistent manual functional annotation of gene products,
so considerable difficulty is encountered when using them as lexical resources for text indexing. Most of these terms do not
resemble the kind of expressions that are found in the scientific literature
(see Blaschke et al. BMC Bioinformatics. 2005;6 Suppl 1:S16). Therefore
we used an alternative approach in PLAN2L that is based on retrieving and scoring each sentence according to predefined topics of
biological relevance for this organism. The approach followed by DPBE also does not focus on ranking evidence sentences, but is
centered on information summarization rather than retrieval. The textual units of interest in the case of DPBE are documents
(co-occurrence in the same document), while in the case of PLAN2L the basic textual units are sentences, which often show a more
specific association to the query term than the loose relation between co-occurring terms at the level of whole documents.
One topic that was covered by DPBE but is currently not addressed by PLAN2L is metabolite associations.
Currently the DPBE system is no longer maintained at the corresponding URL: http://research.i2r.a-star.edu.sg/DRAGON/ME2.
PLAN2L and PubSearch.
PubSearch was built primarily to facilitate the literature annotation of genes and proteins with controlled vocabularies (especially GO terms). It integrates a relational database and uses indexing strategies to detect gene mentions and keywords. It also requires articles to be loaded and integrates a protocol for generating indices of the articles using the gene and keyword names. Installation of this system requires several steps that are often cumbersome for end users who are not familiar with relational databases, XML and command line environments (see the original PubSearch paper), while PLAN2L does not require any local installation and is easy to use. Also, PubSearch is well adapted to handle GO terms but, compared to PLAN2L, does not cover other biological topics such as interactions and gene regulation events well. Unfortunately, the PubSearch online demo is no longer working: http://tesuque.stanford.edu:9999/pubdemo
PLAN2L and Textpresso.
Textpresso is one of the text mining systems most widely used by model organism databases. It has been integrated at the TAIR website. It has mainly been used to assist model organism database experts in consulting literature resources, especially in the context of Gene Ontology term annotation of gene products. Similarly to the underlying approach of PLAN2L, it relies on a gene/protein dictionary derived from TAIR, but it seems that it does not integrate additional gene lexicons as PLAN2L does. The query interface of Textpresso is rather complex and mainly structured around literature curation needs, while PLAN2L provides a more general framework that allows ranking of sentences according to multiple topics and re-ranking based on user demands for each of the sentence topic classes currently covered by PLAN2L. In the case of Textpresso the end user first needs to select biological categories, some of which are actually not relevant to this plant (e.g. human disease). Regarding the developmental processes covered by PLAN2L, Textpresso offers a more general view on temporal processes, mainly concerned with the plant ontology, rather than using the type of sentence classification and scoring approach followed by PLAN2L. On the other hand, Textpresso highlights the co-occurring ontology term in the text. The association and regulation option of Textpresso seems to be mainly based on interaction trigger word co-occurrence, a strategy complementary to the one currently integrated in the PLAN2L interaction extraction pipeline, which is mainly based on interaction sentence classification. In summary, Textpresso may serve as a complementary resource to PLAN2L for search strategies based more on ontology term co-occurrence.