forked from lushl9301/PubMed-Text-Mining-Tool
A Simple Text Mining Tool for Analyzing Research Paper Abstracts
mingtao13595/PubMed-Text-Mining-Tool
Project Name:
    A Simple Text Mining Tool for Analyzing Research Paper Abstracts

Description:
    This project is a text mining tool that works on search results from the
    National Center for Biotechnology Information's PubMed database
    (http://www.ncbi.nlm.nih.gov/pubmed). It uses Perl and Python for text
    processing and statistical analysis.

Modules and files (not all):
    pubmed_result.txt -- results downloaded from NCBI PubMed
    preProcess.pl -- takes pubmed_result.txt as input; reformats it for later processing
    myFormat.txt -- generated by preProcess.pl
    stem.pl -- takes myFormat.txt as input; stems each word in every sentence
    stemDict.txt -- stemmed words and their corresponding original words; generated by stem.pl
    stemmedSentence.txt -- sentences of stemmed words; generated by stem.pl
    selectSentence.pl -- takes stemmedSentence.txt as input and uses stemKeyword.pl as a sub-module; goes through all stemmed sentences and selects those that contain the given keywords; if no keywords are provided, myFormat.txt is used as the result instead
    stemKeyword.pl -- takes keywords.txt as input; stems the keywords
    keywords.txt -- keywords provided by the user
    stemFunction.pl -- core stem function; Porter stemmer
    dict.py -- takes stemDict.txt as input; eliminates stop words and computes simple statistics
    static_words.txt -- stemmed words and their frequencies; generated by dict.py
    pmidList.txt -- PMID list file; generated by selectSentence.pl
    htmlGenerator.py -- uses pmidList.txt to generate a simple webpage for easy database access
    PMIDList.html -- simple webpage containing PMIDs, hyperlinks and titles
    nextStep.py -- reads the original raw data; extracts the original entries listed in pmidList.txt and the html file
    new_pubmed_result.txt -- new pubmed_result.txt selected by nextStep.py

HOWTO:
    1. Run a search on http://www.ncbi.nlm.nih.gov/pubmed.
    2. Press "Send to" at the top right of the page and select "File" & "MEDLINE". Press "Create File".
    3. Put the resulting file "pubmed_result.txt" into the same directory as this code.
    4. Type make<RETURN> in the command line; this may take several minutes, depending on the size of pubmed_result.txt.
    5. Type make<SPACE>html<RETURN> in the command line to generate PMIDList.html.
    6. Type make<SPACE>next<RETURN> in the command line to back up the current raw data and create a new pubmed_result.txt for a new round of make.
    7. Change the keywords in keywords.txt and go to step 4.

Installation (Ubuntu as example):
    # install perl, python and make (you can install build-essential too)
    $ sudo apt-get install perl python make
    # install CPAN for perl modules
    $ sudo perl -MCPAN -e shell
    # press <RETURN> until the installation is finished
    $ sudo cpan
    cpan[1]> install Lingua::EN::Sentence
    cpan[2]> install Unicode::Normalize
    # quit cpan shell
    cpan[3]> exit
    # DONE

LICENSE:
    See LICENSE
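
Example (illustrative, not the project's code):
    The MEDLINE export from HOWTO step 2 is a tagged text format: each field starts with a tag of up to four characters followed by "- ", wrapped values continue on lines indented by six spaces, and records are separated by blank lines. The sketch below shows one way such a file could be turned into a list of PMID links like the one in PMIDList.html. It is not htmlGenerator.py itself. The function names, the sample records and the per-article URL pattern (the README's base URL plus the PMID) are assumptions made for illustration.

```python
import html

# Base URL taken from this README; appending the PMID is an assumption.
PUBMED_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"


def parse_medline(text):
    """Return a list of {tag: value} dicts, one per MEDLINE record."""
    records = []
    record, tag = {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if record:
                records.append(record)
            record, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            # New field, e.g. "PMID- 123" or "TI  - Some title".
            tag = line[:4].strip()
            if tag in record:
                tag = None  # keep only the first value of repeated tags such as AU
            else:
                record[tag] = line[6:].strip()
        elif tag and line.startswith("      "):
            # Continuation of a wrapped value.
            record[tag] += " " + line.strip()
    if record:
        records.append(record)
    return records


def render_links(records):
    """Build a simple HTML list of PMID hyperlinks and titles."""
    items = []
    for rec in records:
        pmid = rec.get("PMID")
        if not pmid:
            continue
        title = html.escape(rec.get("TI", "(no title)"))
        items.append(f'<li><a href="{PUBMED_URL}{pmid}">{pmid}</a> {title}</li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Made-up records in MEDLINE layout, for demonstration only.
sample = (
    "PMID- 11111111\n"
    "TI  - A made-up title that wraps\n"
    "      onto a second line.\n"
    "AU  - Doe J\n"
    "AU  - Roe R\n"
    "\n"
    "PMID- 22222222\n"
    "TI  - Another <made-up> title.\n"
)
records = parse_medline(sample)
page = render_links(records)
print(page)
```

    To try it on real data, read pubmed_result.txt into a string and pass it to parse_medline in place of the sample.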
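
Example (illustrative, not the project's code):
    The keyword selection and word statistics described under "Modules and files" can be pictured with a small Python sketch. It is only a rough analogue of selectSentence.pl and dict.py, not their actual logic. The stop-word list, the function names and the rule that a sentence is kept when it contains at least one keyword are assumptions. The input is assumed to be already stemmed, as in stemmedSentence.txt.

```python
from collections import Counter

# Illustrative stop-word list; the list used by dict.py is not shown in this README.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "with"}


def select_sentences(sentences, keywords):
    """Keep sentences containing at least one keyword; keep all if none are given."""
    if not keywords:
        return list(sentences)
    wanted = set(keywords)
    return [s for s in sentences if wanted & set(s.split())]


def word_frequencies(sentences, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(w for w in sentence.split() if w not in stop_words)
    return counts


# Made-up stemmed sentences, for demonstration only.
stemmed = [
    "the protein bind to the receptor",
    "receptor activ is reduc",
    "gene express in the cell",
]
selected = select_sentences(stemmed, ["receptor"])
freq = word_frequencies(selected)
for word, count in freq.most_common():
    print(word, count)
```

    Passing an empty keyword list returns every sentence, which mirrors the fallback described for selectSentence.pl when no keywords are provided.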
Languages
- Perl 54.4%
- Python 29.7%
- Makefile 15.9%