The Level 3 project to be introduced in 2021 is a multi-disciplinary project involving Machine Learning, Bioinformatics and AWS DevOps.
This project involves the replication and extension of the work done in ‘Learning the Structure of Biomedical Relationships from Unstructured Text’, a research paper by Percha and Altman published in 2015. While all teams will use this paper as a starting point, each Mentor Chains® Project Lead will define and drive the work done by their team. Mentor Chains® Leads will be encouraged to focus on inter-team collaboration as the way to achieve publication-worthy results.
The Percha and Altman paper leverages unstructured text from Medline publication abstracts, a freely available repository of biomedical text that contains vast literature involving drug-target relationships.
Drug-target relationships (the relationships of interest in this project) can be described in a huge array of ways, so extracting the ‘consensus’ relationships from unstructured text with fixed patterns such as regular expressions is brittle. Synthesizing the available body of scientific data by hand requires immense effort and domain knowledge, so consensus knowledge is slow to become widely available in structured form.
The paper employed machine learning to automatically learn not only specific relationships between drugs and protein targets but also wider clusters of global thematic relationships such as ‘inhibition’, ‘coadministration’, and ‘raised levels’. The fundamental approach used in the paper, which we will extend, is the ‘distributional semantic approach’: finding semantic relations by discovering words or phrases that are used in similar contexts.
Medline abstracts will be scraped and combined using an AWS instance that can accommodate the large memory and compute requirements of such a task.
Abstracts will then be filtered to search for sentences that contain both a drug and protein product of interest. These terms or entities of interest will come from open source bioinformatics databases that describe communal knowledge.
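The filtering step can be sketched as follows. This is a minimal illustration only: the entity sets DRUGS and GENES, the naive sentence splitter, and the token matching are all assumptions made for the example, and real entity lists would be far larger and would need synonym handling.

```python
import re

# Hypothetical entity lists; in practice these would come from public databases.
DRUGS = {"warfarin", "aspirin"}
GENES = {"cyp2c9", "vkorc1"}

def split_sentences(abstract):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]

def find_pairs(sentence):
    """Return every (drug, gene) pair whose names both occur as tokens."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return sorted((d, g) for d in DRUGS & tokens for g in GENES & tokens)

def filter_abstract(abstract):
    """Keep only sentences that mention at least one drug and one gene."""
    hits = []
    for sentence in split_sentences(abstract):
        pairs = find_pairs(sentence)
        if pairs:
            hits.append((sentence, pairs))
    return hits

abstract = ("Warfarin is metabolized by CYP2C9. "
            "Aspirin is widely used. "
            "Variants in VKORC1 alter the response to warfarin.")
filtered = filter_abstract(abstract)
for sentence, pairs in filtered:
    print(pairs, "|", sentence)
```

Matching on whole tokens rather than substrings avoids some false hits, but multi-word drug names would need a different matching strategy.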
Filtered sentences will be parsed to find their dependency path trees. The dependency trees will formalize the semantic relationships of the entities of interest.
In the original paper, this parsing was done using a neural network-based transition parser called the Stanford parser. We have the opportunity to extend this approach with state-of-the-art parsers that have been released since the paper was published. In particular, we can leverage newer graph-based parsers with deep biaffine attention (a new gold standard on several benchmarks) to conduct more efficient and potentially more accurate parsing. Such a parser can potentially handle longer-range dependencies more accurately than the original; we will test this using example Medline sentences.
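To make the idea of a dependency path concrete, here is a small sketch that finds the shortest path between two entities in a parse. The edges are hand-written for one example sentence and the relation labels are illustrative; a real pipeline would take them from whichever parser the team chooses. The function names are hypothetical.

```python
from collections import deque

# Hand-written (head, relation, dependent) edges for
# "Warfarin is metabolized by CYP2C9"; a real parser would supply these.
# Nodes are keyed by word form, which assumes each word occurs only once;
# production code would key nodes by token index instead.
EDGES = [
    ("metabolized", "nsubjpass", "Warfarin"),
    ("metabolized", "auxpass", "is"),
    ("metabolized", "nmod", "CYP2C9"),
    ("CYP2C9", "case", "by"),
]

def build_graph(edges):
    """Undirected adjacency list; each hop remembers relation and direction."""
    graph = {}
    for head, rel, dep in edges:
        graph.setdefault(head, []).append((dep, rel, "down"))
        graph.setdefault(dep, []).append((head, rel, "up"))
    return graph

def dependency_path(edges, start, end):
    """Breadth-first search for the shortest path from start to end."""
    graph = build_graph(edges)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for neighbor, rel, direction in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(rel, direction, neighbor)]))
    return None  # the two tokens are not connected

path = dependency_path(EDGES, "Warfarin", "CYP2C9")
print(path)
```

Because a dependency parse is a tree, the path between two tokens is unique, so breadth-first search is enough here.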
After the parsing, a sparse matrix will be built where each row is a drug-gene relation and each column a potential dependency path.
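One way to hold such a matrix is a mapping from (row, column) indices to counts, as in the sketch below. The triples, the path strings, and the function name are made up for illustration, and storing counts rather than 0/1 indicators is an assumption of this example, not a statement about the original method.

```python
from collections import Counter

# Hypothetical extraction output: one (drug, gene, path) triple per sentence.
observations = [
    ("warfarin", "cyp2c9", "nsubjpass<-metabolized->nmod"),
    ("warfarin", "cyp2c9", "nsubjpass<-metabolized->nmod"),
    ("warfarin", "vkorc1", "nmod<-response<-dobj<-alter->nsubj"),
    ("aspirin", "ptgs1", "nsubj<-inhibits->dobj"),
]

def build_sparse_matrix(observations):
    """Return row labels, column labels and a {(row, col): count} mapping."""
    rows, cols = {}, {}
    counts = Counter()
    for drug, gene, path in observations:
        # setdefault assigns the next free index the first time a key is seen
        r = rows.setdefault((drug, gene), len(rows))
        c = cols.setdefault(path, len(cols))
        counts[(r, c)] += 1
    return list(rows), list(cols), dict(counts)

row_labels, col_labels, matrix = build_sparse_matrix(observations)
print(len(row_labels), "rows x", len(col_labels), "columns,",
      len(matrix), "non-zero cells")
```

Only the non-zero cells are stored, which matters because most drug-gene pairs will appear with only a handful of the possible paths. At scale, the same triples could be loaded into a dedicated sparse-matrix library instead.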
Ensemble Biclustering for Classification (EBC) will be used to bicluster the matrix to yield information on which dependency paths frequently cluster together. The assumption is that those paths are likely semantically similar, since their linguistic contexts are similar. The algorithm also yields information on which drug-gene pairs cluster with each other, and this creates an overall ‘landscape’ of interpretable ways that drugs and genes can interact.
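The ‘ensemble’ idea can be illustrated with a toy example: given the cluster labels from several independent clustering runs, count how often each pair of dependency paths lands in the same cluster. This is not the EBC implementation; the biclustering routine itself is not shown, and the labels, path names, and function name below are invented for the example.

```python
from itertools import combinations

# Hypothetical cluster labels for four dependency paths across three runs
# of some biclustering routine (the routine itself is not shown).
paths = ["inhibits", "blocks", "metabolized by", "substrate of"]
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

def cocluster_frequency(labels_per_run, n_items):
    """Fraction of runs in which each pair of items shares a cluster."""
    freq = {}
    for i, j in combinations(range(n_items), 2):
        shared = sum(1 for labels in labels_per_run if labels[i] == labels[j])
        freq[(i, j)] = shared / len(labels_per_run)
    return freq

freq = cocluster_frequency(runs, len(paths))
for (i, j), value in sorted(freq.items()):
    print(f"{paths[i]!r} / {paths[j]!r}: {value:.2f}")
```

Cluster label values are arbitrary within a run (cluster 0 in one run need not correspond to cluster 0 in another), which is why the comparison is made pairwise within each run rather than across runs.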
The original paper also reports how well DrugBank, a publicly available knowledgebase of consensus drug-target relationships, covers the relationships discovered via EBC. The algorithm can output a certainty score for each predicted drug-gene relationship, so relationships with high certainty scores in the original paper but no coverage in DrugBank were likely consensus relationships that had yet to make their way into DrugBank at the time of publication.
As a sanity check, we can look at some of the candidates for new relationships proposed at the time of publication and see whether they eventually did wind up in DrugBank (roughly five years have elapsed).
The EBC algorithm itself has components that can be extended, including other ways of potentially performing the biclustering as well as other methods of actually scoring the ability of the biclustering to recover known relationships from DrugBank.
This will conclude the direct replication and extension of the original paper, but many ideas can be explored to utilize the extracted dependencies for other bioinformatic analyses.
Quantify the probability of entities of interest appearing in Medline.
It is generally useful to know how commonly a term of interest appears in Medline.
As a first step, we will search for the first instance of each term and calculate the proportion of Medline abstracts using that term within the window of the first instance date to the most recent publication data we have (through the end of 2019). We can then visualize how certain terms of interest fall in or out of favor by plotting their usage rates year by year.
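A minimal sketch of this calculation is shown below. The records, the function names, and the plain substring matching are assumptions made for the example; a real run would read publication years and abstract text from the Medline data and would need more careful term matching.

```python
from collections import Counter

# Hypothetical records: (publication year, abstract text).
abstracts = [
    (2015, "Warfarin dosing depends on CYP2C9 genotype."),
    (2016, "A study of aspirin in primary prevention."),
    (2016, "Warfarin and VKORC1 variants."),
    (2017, "Aspirin resistance revisited."),
    (2014, "Unrelated work on statins."),
]

def yearly_usage(abstracts, term):
    """Per-year share of abstracts mentioning term, from its first use onward."""
    term = term.lower()
    totals, hits = Counter(), Counter()
    for year, text in abstracts:
        totals[year] += 1
        if term in text.lower():  # naive substring match
            hits[year] += 1
    if not hits:
        return {}
    first = min(hits)
    return {y: hits[y] / totals[y] for y in sorted(totals) if y >= first}

def overall_usage(abstracts, term):
    """Share of all abstracts published since the term's first appearance."""
    term = term.lower()
    years = [y for y, text in abstracts if term in text.lower()]
    if not years:
        return 0.0
    window = [text for y, text in abstracts if y >= min(years)]
    return sum(term in text.lower() for text in window) / len(window)

rates = yearly_usage(abstracts, "warfarin")
print(rates, overall_usage(abstracts, "warfarin"))
```

The per-year dictionary is the series one would plot to see a term rise or fall in favor; the overall figure is the single proportion described above.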
An extremely robust relationship might not be such a hot topic after the initial flurry of research. Conversely, a robust relationship might continuously be discussed in new contexts. Having a global timeline of usage will help inform the weighting of extracted dependencies from our replication and extension of the Percha and Altman paper for further bioinformatic analyses.
This extension will involve expert usage of the STEM-Away® AWS Data Science Portal.
Teams are encouraged to come up with their own ideas.
Here are some more suggestions to get you thinking!
Multiple weighting schemes can be devised that favor the total volume of usage, the timing and momentum of the usage, the specific journals they were used in and their associated citations, etc.
Additionally, there is low-hanging fruit to be plucked with simple co-occurrence statistics from Medline abstracts (completely ignoring nuanced and potentially critical context). String DB, another public database that pools protein-protein interactions, uses text mining as a component of its database weightings but goes no further than co-occurrence of entities within abstracts.
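A co-occurrence baseline of this kind can be sketched in a few lines. The entity sets and function names below are invented for the example, and pointwise mutual information (PMI) is used only as a familiar example statistic; it is not a description of how String DB computes its text-mining scores.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical input: the set of entities recognized in each abstract.
abstract_entities = [
    {"TP53", "MDM2"},
    {"TP53", "MDM2", "BRCA1"},
    {"BRCA1", "BARD1"},
    {"TP53"},
]

def cooccurrence_counts(abstract_entities):
    """Count abstracts per entity and per unordered entity pair."""
    single, pair = Counter(), Counter()
    for entities in abstract_entities:
        single.update(entities)
        # sorting gives each unordered pair one canonical key
        pair.update(combinations(sorted(entities), 2))
    return single, pair

def pmi(a, b, single, pair, n_abstracts):
    """Pointwise mutual information of two entities; None if never together."""
    joint = pair[tuple(sorted((a, b)))]
    if joint == 0:
        return None
    return math.log2(joint * n_abstracts / (single[a] * single[b]))

single, pair = cooccurrence_counts(abstract_entities)
score = pmi("TP53", "MDM2", single, pair, len(abstract_entities))
print(pair.most_common(3), score)
```

Note that nothing here looks at how the two entities are related within the text, which is exactly the limitation that context-aware weightings would address.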
We can propose and test ways of generating weightings that explicitly leverage context and, by assumption, biomedical meaning. This could create more accurate overall weightings in String DB as well as in other public knowledge sources that rely on raw co-occurrence statistics for their textual components.
This is another potential extension of the original paper, in that the core algorithms make no assumption as to the entities being related or the types of relation. Thus, we could use this framework on protein-protein interactions as well.