GitHub - yuxiaohui78/PubMed-Text-Mining-Tool at 6cc5bff04f1d5b2a88cfd221da8264a4a81123d0

yuxiaohui78 / PubMed-Text-Mining-Tool Public

forked from lushl9301/PubMed-Text-Mining-Tool

A Simple Text Mining Tool for Analyzing Research Paper Abstracts

GPL-2.0 license

Repository files:
backup
.gitignore
LICENSE
Makefile
README
dict.py
htmlGenerator.py
jsonParser.pl
keywords.txt
nextStep.py
preProcess.pl
pubmed_result.txt
raw_data.json
selectSentence.pl
splitFunction.pl
stem.pl
stemFunction.pl
stemKeyword.pl

Project Name:
A Simple Text Mining Tool for Analyzing Research Paper Abstracts

Description:
This project is a text mining tool that works on search results from the
National Center for Biotechnology Information's PubMed database
(http://www.ncbi.nlm.nih.gov/pubmed).
It uses Perl and Python for text processing and statistical analysis.

Modules and files (not all):
pubmed_result.txt     -- results downloaded from NCBI PubMed
preProcess.pl         -- take pubmed_result.txt as input;
                         reformat it for easier later processing
myFormat.txt          -- generated by preProcess.pl
stem.pl               -- take myFormat.txt as input;
                         stem each word in every sentence
stemDict.txt          -- stemmed words and their corresponding original words
                         generated by stem.pl
stemmedSentence.txt   -- stemmed words in sentences; generated by stem.pl
selectSentence.pl     -- take stemmedSentence.txt as input;
                         use stemKeyword.pl as a sub-module;
                         go through all stemmed sentences and select those that
                         contain the given keywords; if no keywords are
                         provided, use myFormat.txt as the result instead.
stemKeyword.pl        -- take keywords.txt as input; stem the keywords
keywords.txt          -- keywords provided by user
stemFunction.pl       -- core stem function; Porter stemmer
dict.py               -- take stemDict.txt as input; eliminate stop words and
                         compute simple statistics
static_words.txt      -- stemmed words and their frequencies; generated by
                         dict.py
pmidList.txt          -- PMID list file; generated by selectSentence.pl
htmlGenerator.py      -- use pmidList.txt to generate a simple webpage for easy
                         database access
PMIDList.html         -- simple webpage containing PMIDs, hyperlinks and titles
nextStep.py           -- access the original raw data; extract the original
                         entries listed in pmidList.txt and the html file
new_pubmed_result.txt -- new pubmed_result.txt selected by nextStep.py
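
Example (illustrative only):
The counting step described for dict.py (drop stop words, then compute word
frequencies) could look roughly like the Python sketch below. This is not the
code from the repository. The function name count_stems, the stop-word set and
the sample input are all assumptions made for the example; the real script
reads stemDict.txt, whose exact format is not described here.

```python
from collections import Counter

# Hypothetical stop-word set; the list actually used by dict.py is not
# given in this README.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def count_stems(stems, stop_words=STOP_WORDS):
    """Count how often each stemmed word occurs, ignoring stop words."""
    counts = Counter()
    for stem in stems:
        word = stem.strip().lower()
        # Skip empty strings and anything in the stop-word set.
        if word and word not in stop_words:
            counts[word] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample of already-stemmed words.
    sample = ["gene", "the", "express", "gene", "of", "protein"]
    for word, freq in count_stems(sample).most_common():
        print(word, freq)
```

Using a Counter keeps the sketch short and gives most_common() for a
frequency-sorted listing similar in spirit to static_words.txt.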

HOWTO:
1. Run a search on http://www.ncbi.nlm.nih.gov/pubmed.
2. Press "Send to" at the top right of the page and select "File" & "MEDLINE".
   Press "Create File".
3. Put the resulting file "pubmed_result.txt" into the same directory as the
   scripts.
4. Type make<RETURN> in the command line; this may take several minutes,
   depending on the size of pubmed_result.txt.
5. Type make<SPACE>html<RETURN> in the command line to generate PMIDList.html.
6. Type make<SPACE>next<RETURN> in the command line to back up the current raw
   data and create a new pubmed_result.txt for another round of make.
7. Change the keywords in keywords.txt and go to step 4.
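
Example (illustrative only):
The file saved in step 2 is in MEDLINE format: each field starts with a tag of
up to four characters followed by "- ", long values continue on lines indented
with six spaces, and records are separated by blank lines. The Python sketch
below shows one way such a file could be read into records. It is not the
logic of preProcess.pl; the function name parse_medline and the sample record
(including its PMIDs) are made up for the example.

```python
def parse_medline(text):
    """Split MEDLINE-format text into a list of {tag: [values]} records."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
                last_tag = None
        elif line.startswith("      ") and last_tag is not None:
            # Indented continuation line: append to the previous field value.
            current[last_tag][-1] += " " + line.strip()
        elif len(line) >= 5 and line[4] == "-":
            # New field: tag in columns 1-4, value after the "- " separator.
            last_tag = line[:4].strip()
            current.setdefault(last_tag, []).append(line[6:].strip())
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    # Made-up sample; the PMIDs and titles are placeholders.
    sample = (
        "PMID- 12345\n"
        "TI  - An example title that\n"
        "      wraps onto a second line.\n"
        "AB  - Example abstract.\n"
        "\n"
        "PMID- 67890\n"
        "TI  - Another title.\n"
    )
    for record in parse_medline(sample):
        print(record["PMID"][0], record["TI"][0])
```

Values are kept in lists because some MEDLINE tags can appear more than once
within a single record.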

Installation (Ubuntu as example):
#install perl, python and make.
#you can install build-essential too.
$sudo apt-get install perl python make

#install CPAN for perl modules
$sudo perl -MCPAN -e shell
#press <RETURN> until the installation is finished
$sudo cpan
cpan[1]> install Lingua::EN::Sentence
cpan[2]> install Unicode::Normalize
#quit cpan shell
cpan[3]> exit
#DONE

LICENSE:
See LICENSE