GitHub - yuxiaohui78/PubMed-Text-Mining-Tool at v0.1

forked from lushl9301/PubMed-Text-Mining-Tool

A Simple Text Mining Tool for Analyzing Research Paper Abstracts

GPL-2.0 license

Project Name:
A Simple Text Mining Tool for Analyzing Research Paper Abstracts

Description:
This project is a text mining tool that works on search results from the
National Center for Biotechnology Information's PubMed database
(http://www.ncbi.nlm.nih.gov/pubmed).
It uses Perl and Python for text processing and statistical analysis.

Modules and files (not all):
pubmed_result.txt     -- results downloaded from NCBI PubMed
preProcess.pl         -- takes pubmed_result.txt as input;
                         reformats it for easier later processing
myFormat.txt          -- generated by preProcess.pl
stem.pl               -- takes myFormat.txt as input;
                         stems each word in every sentence
stemDict.txt          -- stemmed words and their corresponding original words;
                         generated by stem.pl
stemmedSentence.txt   -- sentences of stemmed words; generated by stem.pl
selectSentence.pl     -- takes stemmedSentence.txt as input;
                         uses stemKeyword.pl as a sub-module;
                         scans all stemmed sentences and selects those that
                         contain the given keywords; if no keywords are
                         provided, uses myFormat.txt as the result instead
stemKeyword.pl        -- takes keywords.txt as input; stems the keywords
keywords.txt          -- keywords provided by the user
stemFunction.pl       -- core stem function; Porter stemmer
dict.py               -- takes stemDict.txt as input; eliminates stop words and
                         computes simple statistics
static_words.txt      -- stemmed words and their frequencies; generated by
                         dict.py
pmidList.txt          -- PMID list file; generated by selectSentence.pl
htmlGenerator.py      -- uses pmidList.txt to generate a simple webpage for easy
                         database access
PMIDList.html         -- simple webpage containing PMIDs, hyperlinks and titles
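
Sketch of the selection and statistics stages:
The keyword selection (selectSentence.pl) and frequency counting (dict.py)
described above can be illustrated roughly as follows. This is a minimal
sketch and not the project's code. It works on in-memory strings because the
file formats are not documented here, toy_stem is a crude stand-in for the
Porter stemmer in stemFunction.pl, STOP_WORDS is a hypothetical list, and
matching a sentence on any one keyword is an assumption.

```python
from collections import Counter

# Hypothetical stop-word list; the list used by dict.py is not shown in the README.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "are", "by", "to"}


def toy_stem(word):
    """Crude suffix stripper. A stand-in only, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation."""
    tokens = (w.strip(".,;:()") for w in sentence.split())
    return [t for t in tokens if t]


def select_sentences(sentences, keywords):
    """Keep sentences that contain at least one stemmed keyword.

    With no keywords, every sentence is kept, mirroring the fallback
    described for selectSentence.pl.
    """
    if not keywords:
        return list(sentences)
    wanted = {toy_stem(k) for k in keywords}
    return [s for s in sentences if wanted & {toy_stem(w) for w in tokenize(s)}]


def stem_frequencies(sentences):
    """Count stems, skipping stop words (assumed to be filtered before stemming)."""
    counts = Counter()
    for sentence in sentences:
        for word in tokenize(sentence):
            if word.lower() not in STOP_WORDS:
                counts[toy_stem(word)] += 1
    return counts


# Made-up example sentences, not taken from any PubMed record.
sentences = [
    "Mutations are linked to tumors.",
    "The protein binds DNA.",
    "Tumor growth is inhibited by the drug.",
]
selected = select_sentences(sentences, ["tumor"])
freq = stem_frequencies(selected)
```

Stemming both the keywords and the sentence words before comparing them is
what lets the keyword "tumor" match "tumors" in the first sentence.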

HOWTO:
1. Run a search on http://www.ncbi.nlm.nih.gov/pubmed.
2. Press "Send to" at the top right of the page, select "File" and "MEDLINE",
   then press "Create File".
3. Put the downloaded file "pubmed_result.txt" into the same directory as
   the scripts.
4. cd to that directory and type make<RETURN> on the command line.
5. Type make<SPACE>html<RETURN> on the command line to generate PMIDList.html.
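
Sketch of reading the MEDLINE file and building the PMID page:
To make the steps above more concrete, the following shows one way a
MEDLINE-format export could be read and turned into a list of links. It is
a minimal sketch, not what preProcess.pl or htmlGenerator.py actually do.
The function names are hypothetical, the sample records are placeholders,
and the link pattern (base URL followed by the PMID) is an assumption. It
relies only on the usual MEDLINE layout: a tag padded to four characters,
then "- ", with wrapped values continued on indented lines and records
separated by blank lines.

```python
import html


def parse_medline(text):
    """Return one {tag: [values]} dict per record in a MEDLINE-format string."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            # New field, e.g. "PMID- ..." or "TI  - ...".
            tag = line[:4].strip()
            current.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Indented continuation of the previous field.
            current[tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


def render_pmid_list(records, base_url="http://www.ncbi.nlm.nih.gov/pubmed/"):
    """Build a simple HTML page with one linked PMID and title per record."""
    items = []
    for rec in records:
        pmid = rec.get("PMID", [""])[0]
        title = rec.get("TI", ["(no title)"])[0]
        items.append(
            '<li><a href="{0}{1}">{1}</a> {2}</li>'.format(
                base_url, html.escape(pmid), html.escape(title)
            )
        )
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


# Placeholder records, not real PubMed entries.
sample = """PMID- 00000001
TI  - A placeholder title that wraps
      onto a second line.
AB  - Placeholder abstract.

PMID- 00000002
TI  - Another placeholder title.
"""
records = parse_medline(sample)
page = render_pmid_list(records)
```

Keeping each tag's values in a list allows for tags that repeat within one
record, such as author fields.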

Installation (Ubuntu as example):
#install perl, python and make.
#you can install build-essential too.
$sudo apt-get install perl python make

#install CPAN for perl modules
$sudo perl -MCPAN -e shell
#press <RETURN> until the installation is finished
$sudo cpan
cpan[1]> install Lingua::EN::Sentence
cpan[2]> install Unicode::Normalize
#quit cpan shell
cpan[3]> exit
#DONE
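
Sketch of checking the installation:
After the steps above, a quick check that the tools and Perl modules can be
found might look like the following. This is a minimal sketch and not part
of the project. check_environment is a hypothetical helper, and on some
systems the interpreter is installed as python3 rather than python, so a
False entry for python may only reflect the name used.

```python
import shutil
import subprocess

# Perl modules named in the installation steps.
PERL_MODULES = ["Lingua::EN::Sentence", "Unicode::Normalize"]


def check_environment(tools=("perl", "python", "make"), modules=PERL_MODULES):
    """Report which tools are on PATH and which Perl modules can be loaded."""
    report = {tool: shutil.which(tool) is not None for tool in tools}
    perl = shutil.which("perl")
    for module in modules:
        if perl is None:
            report[module] = False
            continue
        # "perl -MModule -e 1" exits with status 0 only if the module loads.
        result = subprocess.run(
            [perl, "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        report[module] = result.returncode == 0
    return report


report = check_environment()
for name, ok in report.items():
    print("{:<24} {}".format(name, "ok" if ok else "missing"))
```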

LICENSE:
See LICENSE