Machine Learning, NLP and Network Analysis-Guided Medical Research : A Case Study

Can Machine Learning help us in identifying the origin of several Medical Syndromes?

In previous posts we have seen how approximately 8 Million PubMed abstracts were collected and analyzed using Natural Language Processing (NLP) techniques. This NLP Processing is the basis for generating Data that may then be used as Input to several Machine Learning algorithms.

In this Case Study our Goal is to identify relevant Medical Topics (Topics include Genes, Biological Pathways, etc) that are most likely to direct Medical Researchers towards the origin(s) of the following Syndromes :

-Post-Finasteride Syndrome
-Post-Accutane Syndrome
-Chronic Fatigue Syndrome
-Gulf-War Syndrome
-Post-Treatment Lyme disease Syndrome

Before continuing, please read the following post for important disclaimers here

Note that the results shown below originate strictly from the output of Machine Learning Algorithms and Network Analysis. The only Human intervention was the continual addition of Candidate Topics for evaluation by a software system that combines ML Algorithms, NLP and Network Analysis to identify the most promising Medical Topics.

Some of the most relevant candidate Medical Topics are shown in Table 1 (the list is not exhaustive) :

1) Sulfation

2) Bile Acid Homeostasis

3) Vitamin K Metabolism

4) Carboxylation

5) Urea Cycle

6) Adrenal Insufficiency

Table 1  - Candidate Medical Topics

Note that Topics listed in Table 1 (and in subsequent posts of this Blog) may also be associated with each other (e.g. Carboxylation and Vitamin K Metabolism).

Some of the Topics considered as input to the Algorithms originate from the Research and hypotheses of other Authors. This work will also be discussed in future posts.

As an example of how Machine Learning and Network Analysis have generated a Hypothesis, we will look at one of the proposed Topics listed in Table 1, namely “Vitamin K Metabolism”.

The results of the Network Analysis can be seen below in Figure 1. Using certain analytical techniques, associations between Medical Topics have been generated and serve as input to Network Analysis. We then use several Network Analysis metrics to identify “important” or “central” nodes to our Research subject.

The process generated the following network, using the Fruchterman-Reingold algorithm for Graph Layout purposes :

We can identify some areas where more connections between Medical Topics occur. Upon closer inspection we see the following nodes (Red Nodes are considered most important and Blue Nodes least important; the size of a node also indicates its Importance) :

We can clearly see that Vitamin K and Urea Cycle appear to have a "central" role among the Topics considered in this Research. We can also see Topics like LXR (Liver X Receptor), CoA (Coenzyme A) and others (some Topics are purposely not disclosed at present).

At this point, a Researcher must first try to identify why these Topics appear to be central and whether these results are expected, given the Research Subject.
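To make the idea of a "central" node more concrete, here is a minimal sketch of one common Network Analysis metric, degree centrality, computed on a toy graph. The post does not say which metrics were actually used, so the choice of degree centrality is an assumption, and the edges below are hypothetical illustrations that merely reuse Topic names from this post. They are not the associations produced by the analysis.

```python
from collections import defaultdict

# Hypothetical topic associations (toy data for illustration only,
# NOT the associations generated by the analysis in this post).
EDGES = [
    ("Vitamin K", "Carboxylation"),
    ("Vitamin K", "Bile Acid Homeostasis"),
    ("Vitamin K", "LXR"),
    ("Vitamin K", "Urea Cycle"),
    ("Urea Cycle", "CoA"),
    ("Urea Cycle", "Sulfation"),
    ("LXR", "Bile Acid Homeostasis"),
]


def build_graph(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def rank_nodes(scores):
    """Sort nodes from most to least central (ties broken by name)."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


graph = build_graph(EDGES)
ranking = rank_nodes(degree_centrality(graph))
for node, score in ranking:
    print(f"{node:24s} {score:.2f}")
```

In practice a graph library would normally be used instead of hand-written code. For example, NetworkX provides `degree_centrality` and a `spring_layout` function that positions nodes with the Fruchterman-Reingold force-directed algorithm.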

We now move on to an example that shows how a Machine Learning Algorithm could be used for Research purposes. I will not get into the details of the pre-processing steps that are necessary to ensure that proper input features are used. One of the most important things to consider in this analysis, however, is that many correlated features exist in our input data.
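Correlated features matter because tree-based importance scores tend to be shared out among features that carry the same information, which can make each of them look less important than it is. The pre-processing used for this Research is not described, so the following is only a sketch of one possible screening step: flag feature pairs whose Pearson correlation exceeds a threshold. The feature names, values and the 0.9 threshold are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical feature columns (toy values, not data from this Research).
features = {
    "topic_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "topic_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * topic_a
    "topic_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to the others
}

flagged = correlated_pairs(features)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```

Pairs flagged this way can then be reviewed by hand, merged, or reduced to a single representative before a model is trained.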

An example run from a Machine Learning Algorithm (in this case it is Random Forests from H2O) along with variable importances can be found below :

Again, since this Research is ongoing, some Topics are not disclosed. However, by looking at the scaled importance on the example chart, this particular algorithm suggests that cysteine desulfurase, cysteine dioxygenase and, generally speaking, Medical Topics associated with Carboxylation, Sulfation and Cysteine Metabolism appear to be relevant to our Research subject. Note also the presence of Oxidative Protein Folding.
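For readers unfamiliar with the term, "scaled importance" usually means that each raw importance score is divided by the largest score, so that the top feature gets 1.0 and the others are expressed relative to it. The sketch below assumes that convention (H2O's variable-importance table reports a column of this kind alongside a percentage column). The feature names and raw scores are hypothetical and are not results from this Research.

```python
def scale_importances(raw):
    """Return rows of (name, raw, scaled, percentage), most important first.

    scaled     = raw / largest raw value (top feature becomes 1.0)
    percentage = raw / sum of all raw values (rows add up to 1.0)
    """
    top = max(raw.values())
    total = sum(raw.values())
    if top <= 0 or total <= 0:
        raise ValueError("importances must contain a positive value")
    rows = [(name, value, value / top, value / total)
            for name, value in raw.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical raw importances (illustrative numbers only).
raw_importance = {"feature_x": 120.0, "feature_y": 60.0, "feature_z": 20.0}

table = scale_importances(raw_importance)
for name, raw, scaled, pct in table:
    print(f"{name:10s} raw={raw:6.1f} scaled={scaled:.3f} pct={pct:.3f}")
```

Because the scale is relative, a scaled importance of 0.5 only says that a feature scored half as much as the top feature in that particular run. It is not an absolute measure of biological relevance.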

Hypothesis Generation may then be driven by the output of these Algorithms : after looking at the results above, a Researcher may choose to consider CDO1 (Cysteine Dioxygenase) as a Gene that could be important for the Research Topic under investigation. He/she may also choose to consider Biotin metabolism, since Propionyl-CoA Carboxylase is a Biotin-dependent enzyme [4] (more on the role of Biotin and the BTD Gene to follow shortly).

As discussed, Vitamin K metabolism is not the only Topic that was selected, and many other Biological Pathways also appear to be important (as scored by the Algorithms).

For example, SULT2A1 and SULT1A2 also appear to be important. These are Genes involved in “Sulfation”, which was listed in Table 1. We will revisit Sulfation in full detail in the next Post.

I now propose the following Hypotheses :

Hypothesis #1

Based on results from a particular type of Network Analysis and output from several Machine Learning Algorithms, it is hypothesized that Vitamin K-related Genes play a central role in the Syndromes discussed in this post (and possibly in other syndromes with similar symptoms) :

Suspected Genes are any combination of  the following Genes that are either directly or indirectly associated with Vitamin K. These are :


Hypothesis No 1 - Discussion

Regarding MERTK :

"It is known that MERTK/TAM deficient animals show signs of autoimmunity with features resembling certain human autoimmune pathologies including serum autoantibodies against DNA, collagen and antiphospholipid antibodies (e.g anticardiolipin antibodies) and lymphocyte activation and hyperproliferation" [1]

Note also that anticardiolipin antibodies have been found in patients with CFS [2]

Apart from MERTK, VKORC1 is important for Protein Disulfide Bond formation within the Endoplasmic Reticulum [3]

Note also that Vitamin K needs Bile Salts for proper absorption ([1], p. 268). Recall also that Bile Acid Homeostasis was selected by the Algorithms as a candidate Topic (Table 1).

We now continue to Hypothesis No 2 :

Specific interventions according to an individual's DNA may significantly ameliorate or even reverse the Symptoms that are associated with the Syndromes discussed in this post

This post has been forwarded to the following foundations (and undisclosed Researchers) in the hope that the Hypotheses discussed may be properly evaluated :

-Open Medicine Foundation
-Solve ME/CFS Initiative
-The Post-Finasteride Syndrome Foundation

Important Note For Researchers : I would kindly ask that you cite the information found in this post (and all subsequent posts in this Blog) if you find it in any way helpful for your Research.


[1] : Vitamin K-Academic Press,  Elsevier (2008) - ISBN 9780123741134

