Computational prediction of patients diagnosis and feature selection applied to a mesothelioma dataset
To run the scripts, you need to have installed:
- R (version 3.3.2)
- R packages rgl, clusterSim and randomForest
- Python 3
- Python package xlsx2csv
- git (version 1.8.3.1)
- Torch (version 7)
- LuaRocks (version 2.3.0)
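Before installing anything, it can help to see which of the command-line tools listed above are already available on your PATH. The following is a minimal sketch and is not part of the project's scripts. The executable names (`Rscript`, `python3`, `xlsx2csv`, `git`, `th`, `luarocks`) are assumptions about how each program is exposed on a typical installation, and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names for the programs listed above;
# adjust them if your installation exposes different names.
REQUIRED_TOOLS = ["Rscript", "python3", "xlsx2csv", "git", "th", "luarocks"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

This check only looks for executables. It does not verify program versions or whether the R packages rgl, clusterSim, and randomForest are installed.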
You need root privileges, an internet connection, and at least 1 GB of free space on your hard disk. Here we provide the instructions to install all the needed programs and dependencies on Linux CentOS, Linux Ubuntu, and Mac OS. Our scripts were originally developed on a Linux Ubuntu computer.
Here are the instructions to install all the programs and libraries needed by our scripts on a Linux Ubuntu computer, from a shell terminal. We tested these instructions in February 2017 on a Dell Latitude 3540 laptop running the Linux Ubuntu 16.10 operating system with a 64-bit kernel. If you are using another operating system version, some instructions might be slightly different.
(Optional) First of all, update:
sudo apt-get update
Install R and its rgl, clusterSim, randomForest packages:
sudo apt-get -y install r-base-core
sudo apt-get -y install r-cran-rgl
sudo Rscript -e 'install.packages(c("rgl", "clusterSim", "randomForest"), repos="https://cran.rstudio.com")'
Install xlsx2csv and git:
sudo apt-get -y install xlsx2csv
sudo apt-get -y install git
Install Torch and luarocks:
# in a terminal, run the commands WITHOUT sudo
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
source ~/.bashrc
cd ~
sudo apt-get -y install luarocks
sudo luarocks install csv
Here are the instructions for a Linux CentOS computer.
(Optional) First of all, update:
sudo yum -y update
Install R, its dependencies, and its rgl, clusterSim, randomForest packages:
sudo yum -y install R
sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel
sudo yum -y install mesa-libGLU
sudo yum -y install mesa-libGLU-devel
sudo yum -y install libpng-devel
sudo Rscript -e 'install.packages(c("rgl", "clusterSim", "randomForest"), repos="https://cran.rstudio.com")'
Install Python, its dependencies, and its packages pip and xlsx2csv:
sudo yum -y install python
sudo yum -y install epel-release
sudo yum -y install python-pip
sudo pip install xlsx2csv
Install git, Torch and luarocks:
sudo yum -y install git
# in a terminal, run the commands WITHOUT sudo
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
source ~/.bashrc
cd ~
sudo yum -y install luarocks
sudo luarocks install csv
Here are the instructions for a Mac OS computer.
(Optional) First of all, update:
sudo softwareupdate -iva
Manually download and install XQuartz from https://www.xquartz.org
Install R and its packages:
brew install r
sudo Rscript -e 'install.packages(c("rgl", "clusterSim", "randomForest"), repos="https://cran.rstudio.com")'
Install rudix:
curl -O https://raw.githubusercontent.com/rudix-mac/rpm/2016.12.13/rudix.py
sudo python rudix.py install rudix
Install the development tools (such as gcc):
xcode-select --install
Install xlsx2csv:
sudo easy_install xlsx2csv
Install Torch and luarocks:
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps
./install.sh
cd ~
brew install lua
source ~/.profile
sudo luarocks install csv
Move to the main directory of the project, then run the following script to download the mesothelioma dataset file, normalize the columns, and remove the "diagnosis method" feature (which is a duplicate of the target feature "class of diagnosis"):
cd /mesothelioma-diagnosis-predictions/
./script_prepare_dataset.sh
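To illustrate what this preparation step does conceptually, here is a minimal sketch of column normalization followed by the removal of one feature. It is not the code of script_prepare_dataset.sh. Min-max scaling to the range [0, 1] is an assumption about the kind of normalization, the function name `normalize_and_drop` is hypothetical, and the values in the toy example are made up.

```python
def normalize_and_drop(header, rows, drop_column):
    """Min-max scale every column to [0, 1], then remove `drop_column`.

    `header` is a list of column names and `rows` a list of equally long
    lists of numbers. Constant columns are mapped to 0.0.
    """
    columns = list(zip(*rows))  # transpose rows into columns
    scaled_columns = []
    for values in columns:
        low, high = min(values), max(values)
        span = high - low
        scaled_columns.append(
            [(v - low) / span if span else 0.0 for v in values]
        )
    # Indices of the columns that are kept after dropping one feature
    keep = [i for i, name in enumerate(header) if name != drop_column]
    new_header = [header[i] for i in keep]
    new_rows = [
        [scaled_columns[i][r] for i in keep] for r in range(len(rows))
    ]
    return new_header, new_rows


# Toy example with made-up values (not taken from the mesothelioma dataset).
header = ["age", "diagnosis method", "class of diagnosis"]
rows = [[40, 1, 1], [60, 0, 0], [80, 1, 1]]
new_header, new_rows = normalize_and_drop(header, rows, "diagnosis method")
print(new_header)  # ['age', 'class of diagnosis']
print(new_rows)    # [[0.0, 1.0], [0.5, 0.0], [1.0, 1.0]]
```

Dropping a feature that duplicates the target matters because a classifier could otherwise read the label directly from its input.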
To run the Torch software of the perceptron-based artificial neural network:
th mesothelioma_ann_script_val.lua
To run the Python 3 software of the probabilistic neural network:
python3 pnn_mesothelioma_initial_py3.py
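For readers unfamiliar with probabilistic neural networks, the basic idea can be sketched in a few lines: each class is scored by the average of a Gaussian kernel centred on its training points, and the class with the highest score is predicted. This is a generic illustration, not the implementation in pnn_mesothelioma_initial_py3.py. The function name `pnn_predict`, the smoothing parameter `sigma` and its default value, and the toy data are all assumptions.

```python
import math


def pnn_predict(train_x, train_y, sample, sigma=0.5):
    """Classify `sample` with a basic probabilistic neural network.

    For each class, average a Gaussian kernel centred on every training
    point of that class, then return the class with the largest average.
    """
    sums, counts = {}, {}
    for point, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, sample))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get)


# Toy data with made-up values: two well-separated classes.
train_x = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
train_y = [0, 0, 1, 1]
print(pnn_predict(train_x, train_y, [0.05, 0.05]))  # 0
print(pnn_predict(train_x, train_y, [0.95, 0.95]))  # 1
```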
To run the R software of the random forest classifier:
Rscript random_forests_class.r
To run the R software of the CART classifier:
Rscript cart.r
To run the R software of the one rule classifier:
Rscript oner_class.r
To run the random forest R code for feature selection:
Rscript random_forests.r Mesothelioma_data_set_COL_NORM.csv
More information about this project can be found in this paper:
Davide Chicco, and Cristina Rovelli, "Computational prediction of diagnosis and feature selection on mesothelioma patient health records", PLoS ONE 14(1): e0208737, 2019. https://doi.org/10.1371/journal.pone.0208737
All the software code is licensed under the GNU General Public License, version 2 (GPLv2).
The mesothelioma dataset is publicly available on the website of the University of California Irvine Machine Learning Repository, under its copyright license.
This software was developed by Davide Chicco at the Princess Margaret Cancer Centre and at the Peter Munk Cardiac Centre (Toronto, Ontario, Canada).
For questions or help, please write to davidechicco(AT)davidechicco.it