learn_python/find_answers at master · clp/learn_python

History

Name		Name	Last commit message	Last commit date
parent directory ..
data		data
indir		indir
tmp		tmp
util		util
.gitignore		.gitignore
README.rst		README.rst
README_2.rst		README_2.rst
config.py		config.py
draw_plots.py		draw_plots.py
fga_find_good_answers.py		fga_find_good_answers.py
fga_find_good_answers.sh		fga_find_good_answers.sh
find_answers.md		find_answers.md
grade_each_answer.py		grade_each_answer.py
nltk_ex22.py		nltk_ex22.py
nltk_ex25.py		nltk_ex25.py
save_output.py		save_output.py
search_for_keyword.py		search_for_keyword.py
sga_show_good_answers.py		sga_show_good_answers.py
sga_show_good_answers.sh		sga_show_good_answers.sh
text_ui.py		text_ui.py

README.rst

README

Basic data about the project and code.

See README_2 for more details.

Project: Find Good Answers

Introduction

One goal of this project is to learn Python by building tools that analyze text data. The first data sources are questions and answers from stackoverflow.com about Python. The software tries to identify good but 'hidden' answers to the questions. Such answers have useful data but a low stackoverflow score.

It includes the following programs.

fga_find_good_answers.py (fga), main control program.

nltk_ex25.py, analyze text with natural language toolkit.

sga_show_good_answers.py, search for text in the data.

grade_each_answer.py, a user can read & grade answers, from A to F.

Program: fga_find_good_answers.py

Introduction

The fga_find_good_answers program formats the input data from stackoverflow.com into question-and-answer groups, Q&A groups: one question followed by its related answers, followed by the next Q&A group.

It then analyzes the data to find good answers, using statistical techniques or natural language processing, NLP.

Usage

Start the program with one of these commands:

./fga_find_good_answers.sh

python fga_find_good_answers.py

Respond to the prompt with 'm' to see the menu. Enter 'h' to see help. Enter 'q' to quit the program.

Notes

The main o/p data is saved at data/popular_qa.csv.
One way to examine the data: open the file with LibreOffice Calc or other spreadsheet tool that can show the full content of each cell.
The program writes some data to the log file, fl_fga.log.

Module: sga_show_good_answers.py

Introduction

The sga_show_good_answers module searches the input data and collects all questions and answers that contain the search terms, and includes related Q's and A's.

Usage

Import the module to use it. Do not run the module directly.

See the fga_find_good_answers.py program for an example that uses this module.

To perform a search, start fga_find_good_answers.py, and wait for it to prompt for a command. Respond to the prompt with 'lek' (look for exact keywords) to enter a search term, which is then handled by this module.

The fga*py program also has a command-line option to enter a search term, eg,

python fga_find_good_answers.py -s term1 t2 t3

It returns a dataframe containing the Q's and A's that it found, in an order based on the results of the NLP analysis.

Notes

One way to examine the output: open the data/search_result_full.html file with a web browser or other tool that renders HTML properly.

If you perform another search, re-load the browser to show the latest output file.

Module: nltk_ex25.py

Introduction

The nltk_ex25 program is an experiment to learn about text processing using the Natural Language Toolkit, NLTK. It processes Q & A data from stackoverflow (and the use of NLTK by this program needs more than one answer for a question).

One way that it can identify useful answers is based on the terms that each answer contains, compared to terms that are found in high-score answers.

The o/p is a csv file containing a question followed by the selected answers.

Usage

At this time, the natural language processing routines included in the nltk_ex25.py program are called from the main fga*.py program. Do not run the nltk*.py module directly.

Notes

The o/p data of answers that contain high score terms (hst) is saved at data/popular_qa.csv
The program writes some summary data to the screen, to help with debugging.
The program writes some data to the log file, fl_fga.log.

Program: grade_each_answer.py

Introduction

The grade_each_answer program is a utility tool to add new fields and their content to each record of its input file.

The o/p is a csv file containing two new fields for each i/p record: Grade and Notes.

Usage

python grade_each_answer.py

See README_2 for more details.

Notes

Consider opening the i/p csv data file in a separate window for easier viewing of the Q&A being graded. Eg, using the surf web browser:

surf data/graded_popular_qa.csv
If you finish handling all records in the i/p file, the program saves data and stops. If some answers were ignored and are graded 'N', they will be shown for grading when you next run the program.

FAQ

What is stackoverflow.com?

SO is a question-and-answer web site. Registered users can enter data and vote on questions and answers, so that higher-quality contributions might be identified.

What is kaggle.com?

Kaggle is a web site for learning about data science by using documentation and participating in competitions. You can download data sets from the site. One data set that is used in this project is a collection of questions and answers from stackoverflow about python.

How to read the stackoverflow data?

Use the Python pandas module, read_csv().

ans_df = pandas.read_csv('Answers.csv', encoding='latin-1', warn_bad_lines=False, error_bad_lines=False)

How to find answers with low scores that are high quality?

That's one goal of this project. One way might be to identify some unique properties of high score answers, and find low score answers with the same or similar properties.

What is the Natural Language Toolkit, NLTK?

NLTK is a platform (code, documents, data sets, and more) for building s/w to work with human language data. For documentation, please visit nltk.org.

What are some other useful sites and resources to check?

https://github.com/gleitz/howdoi A CLI tool that gets answers from stackoverflow.
https://worksheets.codalab.org/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

find_answers

find_answers

README.rst

Project: Find Good Answers

Introduction

Program: fga_find_good_answers.py

Introduction

Usage

Notes

Module: sga_show_good_answers.py

Introduction

Usage

Notes

Module: nltk_ex25.py

Introduction

Usage

Notes

Program: grade_each_answer.py

Introduction

Usage

Notes

FAQ

Files

find_answers

Directory actions

More options

Directory actions

More options

Latest commit

History

find_answers

Folders and files

parent directory

README.rst

Project: Find Good Answers

Introduction

Program: fga_find_good_answers.py

Introduction

Usage

Notes

Module: sga_show_good_answers.py

Introduction

Usage

Notes

Module: nltk_ex25.py

Introduction

Usage

Notes

Program: grade_each_answer.py

Introduction

Usage

Notes

FAQ