As previously discussed, it appeared to me that Post-Finasteride Syndrome, Chronic Fatigue, Post-Accutane Syndrome, Gulf-War Syndrome and Post-Treatment Lyme Disease Syndrome all share several overlapping symptoms.
The goal, then, was to identify the symptoms these Syndromes have in common, collect PubMed articles mentioning those symptoms, and then have a Machine Learning Algorithm generate several hypotheses as to the most probable Topics and Genes relevant to all of these Syndromes.
Of course, this is easier said than done.
Consider depression as a symptom. It may be present on its own, or it may be the result of other symptoms such as anxiety or erectile dysfunction.
The problems do not end there. Far too many symptoms are strongly correlated with one another. On top of this, the Analysis involved a large number of Symptoms, since we are looking at several Syndromes and not just one.
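To make the correlation problem concrete, here is a minimal sketch that measures how strongly two symptoms co-occur across a set of articles, using the phi coefficient (the Pearson correlation for two yes/no variables). The `articles` data and the `phi_coefficient` helper are hypothetical and exist only for illustration; they are not the data or the method used in this analysis.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each set holds the symptoms mentioned in one article.
articles = [
    {"depression", "anxiety", "fatigue"},
    {"depression", "anxiety"},
    {"fatigue", "brain fog"},
    {"depression", "erectile dysfunction"},
    {"brain fog", "fatigue", "anxiety"},
    {"insomnia"},
]

def phi_coefficient(a, b, docs):
    """Phi coefficient between the presence of symptom a and symptom b."""
    n11 = sum(1 for d in docs if a in d and b in d)          # both mentioned
    n10 = sum(1 for d in docs if a in d and b not in d)      # only a
    n01 = sum(1 for d in docs if a not in d and b in d)      # only b
    n00 = len(docs) - n11 - n10 - n01                        # neither
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If a symptom appears in every article or in none, phi is undefined;
    # this sketch returns 0.0 in that case.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

# Rank every pair of symptoms from most to least correlated.
symptoms = sorted(set().union(*articles))
pairs = sorted(
    ((phi_coefficient(a, b, articles), a, b) for a, b in combinations(symptoms, 2)),
    reverse=True,
)
for phi, a, b in pairs[:3]:
    print(f"{a} / {b}: phi = {phi:.2f}")
```

Pairs with a high phi value are candidates for being merged or treated as one feature, since they carry largely the same information.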
This is a (non-exhaustive) list of Symptoms that individuals with Chronic Fatigue experience, as found in Wikipedia:
- brain fog (feeling like one is in a mental fog)
- difficulty maintaining an upright position, dizziness, balance problems or fainting
- allergies or sensitivities to foods, odors, chemicals, medications, or noise
- irritable bowel syndrome-like symptoms such as bloating, stomach pain, constipation, diarrhoea and nausea
- chills and night sweats
- visual disturbances (sensitivity to light, blurring, eye pain)
- depression or mood problems (irritability, mood swings, anxiety, panic attacks)
Of course, any Symptom may have multiple causes, which means that we are dealing with multiple Symptoms, each with multiple possible causes.
Yet another problem is that Chronic Fatigue Syndrome, for example, may be a form of a well-known disease such as Addison's.
I could not find a better opportunity to remind new Data Scientists of the importance of Data Representation, also known as "Feature Engineering", for a successful data analysis outcome.
Not only do we have to fine-tune our Algorithms and choose to start our Analysis with the best-performing ones (e.g. for this reason you may decide to start with, say, XGBoost), but we also have to spend enough time on the problem representation.
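As an illustration of what problem representation can mean here, the sketch below turns free text (such as an article abstract) into a fixed-length vector of symptom indicators, folding different wordings of the same symptom into one feature. The synonym map, the feature names and the `symptom_features` function are all assumptions made up for this example; a real analysis would need a far larger, curated vocabulary.

```python
import re

# Hypothetical synonym map: surface phrase -> canonical symptom name.
SYMPTOM_SYNONYMS = {
    "depression": "depression",
    "depressive": "depression",
    "low mood": "depression",
    "anxiety": "anxiety",
    "panic attacks": "anxiety",
    "brain fog": "brain fog",
    "mental fog": "brain fog",
    "fatigue": "fatigue",
    "exhaustion": "fatigue",
}

# One feature per canonical symptom, in a fixed (alphabetical) order.
FEATURES = sorted(set(SYMPTOM_SYNONYMS.values()))

def symptom_features(text):
    """Return a binary vector with one entry per canonical symptom in FEATURES."""
    lowered = text.lower()
    found = {
        canonical
        for phrase, canonical in SYMPTOM_SYNONYMS.items()
        # Word boundaries avoid matching a phrase inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    }
    return [1 if name in found else 0 for name in FEATURES]

print(FEATURES)
print(symptom_features("Patients reported exhaustion and mental fog."))
```

The design choice worth noting is that "exhaustion" and "fatigue" end up in the same column: whichever Algorithm comes next sees one signal instead of two weaker ones.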
With all of these limitations in mind, the next step was to use several Machine Learning Algorithms to analyze and "Learn" from the data and come up with the Topics and Genes relevant to the Syndromes discussed.
Once we have these "Machine-Learning Generated" Topics and Genes, we can move on to forming hypotheses as to what lies behind all of these Syndromes.
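To give a feel for what "relevant to all Syndromes" could mean in practice, here is a deliberately simple counting baseline, not one of the Machine Learning Algorithms used in this analysis. It scores each term by its lowest document frequency across the Syndromes, so a term only scores well if it shows up for every one of them. The corpus, the placeholder names `GENE_A` to `GENE_D` and the `shared_term_scores` function are hypothetical and do not represent real findings.

```python
from collections import Counter

# Hypothetical toy corpus: syndrome -> list of term sets, one set per article.
# GENE_A .. GENE_D are placeholders, not real genes or real results.
corpus = {
    "Chronic Fatigue": [{"GENE_A", "GENE_B"}, {"GENE_A"}, {"GENE_C"}],
    "Post-Finasteride Syndrome": [{"GENE_A", "GENE_C"}, {"GENE_B"}],
    "Gulf-War Syndrome": [{"GENE_A"}, {"GENE_A", "GENE_D"}],
}

def shared_term_scores(corpus):
    """Score each term by its lowest per-syndrome document frequency.

    Assumes every syndrome has at least one article.
    """
    per_syndrome = {}
    for syndrome, docs in corpus.items():
        counts = Counter(term for doc in docs for term in doc)
        # Fraction of this syndrome's articles that mention each term.
        per_syndrome[syndrome] = {t: c / len(docs) for t, c in counts.items()}

    all_terms = set().union(*(freqs.keys() for freqs in per_syndrome.values()))
    scores = {
        # A term missing from any one syndrome gets a score of 0.0.
        term: min(freqs.get(term, 0.0) for freqs in per_syndrome.values())
        for term in all_terms
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = shared_term_scores(corpus)
for term, score in ranking:
    print(f"{term}: {score:.2f}")
```

A baseline like this is useful mainly as a sanity check: if a more sophisticated Algorithm ranks a term highly that barely appears for one of the Syndromes, that is worth a second look.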