This guide provides instructions for using Python on research projects. It is intended for collaborators and research assistants, with the goal of making code consistent, easier to read, transparent, and reproducible.
Also see my R Guide and Stata Guide.
For coding style practices, follow the PEP 8 style guide.
- While you should read the style guide and do your best to follow it, there are tools to help you.
  - In JupyterLab, first install `flake8`, `pycodestyle`, and `pycodestyle_magic`. Then include `%load_ext pycodestyle_magic` and `%flake8_on` in a blank cell at the top of your script, and each cell afterwards will be checked for styling errors upon running.
  - In Spyder, go to Tools > Preferences > Editor > Code Introspection/Analysis and activate the option called "Real-time code style analysis". After doing so, Spyder will show bad formatting warnings directly in the editor.
- Use `pandas` for wrangling data.
- Use `datetime` for working with dates.
- Never use `os.chdir()` or absolute file paths. Instead use relative file paths with the `pyprojroot` package.
  - `pyprojroot` looks for the following files to determine which folder is your root folder for the project: .git, .here, *.Rproj, requirements.txt, setup.py, .dvc, *.spyproject, pyproject.toml, .idea, or .vscode. If you don't have any of them, either create a project in Spyder using Projects > New Project, or create a blank file with one of these names (e.g., .here) in your project root directory.
- Use `assert` frequently to add programmatic sanity checks in the code.
- `pandas.DataFrame.describe()` can be useful to print a "codebook" of the data, i.e. some summary stats about each variable in a data set.
- Use `pipconflictchecker` to make sure there are no dependency conflicts after mass installing packages through pip.
- Use `fastreg` for fast sparse regressions, particularly good for high-dimensional fixed effects.
- Use `pandas_tab.tab()` for one-way and two-way tabulations similar to Stata's `tabulate`.
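As a rough illustration of the `assert` and `describe()` advice above, the sketch below builds a small made-up data frame (the column names `id`, `wage`, and `state` are hypothetical), checks a few properties that later code would rely on, and prints a quick codebook.

```python
import pandas as pd

# Hypothetical toy data; in a real script this would be read from a file
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "wage": [12.5, 15.0, 9.75, 20.0],
    "state": ["IL", "IL", "WI", "MI"],
})

# Sanity checks: fail loudly if an assumption about the data is violated
assert df["id"].is_unique, "id should uniquely identify rows"
assert df["wage"].notna().all(), "wage should have no missing values"
assert (df["wage"] > 0).all(), "wage should be strictly positive"

# Quick "codebook": summary statistics for every column
codebook = df.describe(include="all")
print(codebook)
```

Giving each `assert` a message makes it easier to tell which assumption failed when a script stops partway through.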
Generally, within a project folder, we have a subfolder called `analysis` where we are doing data analysis (and other subfolders like `paper` where the paper draft is saved). Within the `analysis` subfolder, we have:
- A .spyproject file for the project. (This can be created in Spyder, with Projects > New Project.)
  - If you always open the project within Spyder before working (see "Project" in the left of Spyder), then the `pyprojroot` package will work for relative file paths. More details can be found here.
- data - only raw data go in this folder
- documentation - documentation about the data go in this folder
- proc - processed data sets go in this folder
- results - results go in this folder
  - figures - subfolder for figures
  - tables - subfolder for tables
- scripts - code goes in this folder
  - Number scripts in the order in which they should be run
  - programs - a subfolder containing functions called by the analysis scripts (if applicable)
  - old - a subfolder where scripts from previous versions are stored, for cleanliness, if there are major changes to the structure of the project
Because we often work with large data sets and efficiency is important, I advocate (nearly) always separating the following three actions into different scripts:
- Data preparation (cleaning and wrangling)
- Analysis (e.g. regressions)
- Production of figures and tables
The analysis and figure/table scripts should not change the data sets at all (no pivoting from wide to long or adding new variables); all changes to the data should be made in the data cleaning scripts. The figure/table scripts should not run the regressions or perform other analysis; that should be done in the analysis scripts. This way, if you need to add a robustness check, you don't necessarily have to rerun all the data cleaning code (unless the robustness check requires defining a new variable). If you need to make a formatting change to a figure, you don't have to rerun all the analysis code (which can take a while to run on large data sets).
- Include a 00_run.py script (described below).
- Because a project often uses multiple data sources, I usually include a brief description of the data source being used as the first part of the script name (in the example below, `ex` describes the data source), followed by a description of the action being done (e.g. `dataprep`, `reg`, etc.), with each component of the script name separated by an underscore (`_`).
- Number scripts in the order in which they should be run, starting with 01.
Keep a script that lists each script that should be run to go from raw data to final results. Under the name of each script should be a brief description of the purpose of the script, as well as all the input data sets and output data sets that it uses. Ideally, a user could run the master script to run the entire analysis from raw data to final results (although this may be infeasible for some projects, e.g. one with multiple confidential data sets that can only be accessed on separate servers).
Also, consider adding a `print_performance` function to your `00_run.py`. You can write a `print_performance.py` file and store it in a separate folder along with other self-written programs, so that you can use it in multiple scripts. We included an example of `print_performance.py` in this repo. The `time_stamp.py` program is designed to be used along with `print_performance.py` or for logging purposes. It appends a time stamp in the format "YYYYMMDD_hhmmss" to the file name, e.g. "01_ex_dataprep_performance_20220101_200000.txt" (the performance of "01_ex_dataprep", which started running at 20:00:00 on 1/1/2022).
# Run script for example project
# PACKAGES ------------------------------------------------------------------
import os
import subprocess
import sys
import time

from pyprojroot import here

# Add to the system path the directory for your self-written functions
program_path = str(here('scripts/programs')) # the folder where you store additional functions
sys.path.append(program_path)
from time_stamp import time_stamp
from print_performance import print_performance
from is_ipython import is_ipython
# PRELIMINARIES -------------------------------------------------------------
# Control which scripts run
run_01_ex_dataprep = 1
run_02_ex_reg = 1
run_03_ex_table = 1
run_04_ex_graph = 1
program_list = []
program_name_list = []
# RUN SCRIPTS ---------------------------------------------------------------
if run_01_ex_dataprep:
    program_list.append(here("./scripts/01_ex_dataprep.py"))
    program_name_list.append("01_ex_dataprep")
    # INPUTS
    #  here("./data/example.csv") # raw data from XYZ source
    # OUTPUTS
    #  here("./proc/example_cleaned.csv") # cleaned

if run_02_ex_reg:
    program_list.append(here("./scripts/02_ex_reg.py"))
    program_name_list.append("02_ex_reg")
    # INPUTS
    #  here("./proc/example_cleaned.csv") # 01_ex_dataprep.py
    # OUTPUTS
    #  here("./proc/ex_results.csv") # regression results

if run_03_ex_table:
    program_list.append(here("./scripts/03_ex_table.py"))
    program_name_list.append("03_ex_table")
    # Create table of regression results
    # INPUTS
    #  here("./proc/ex_results.csv") # 02_ex_reg.py
    # OUTPUTS
    #  here("./results/tables/ex_table.tex") # tex of table for paper

if run_04_ex_graph:
    program_list.append(here("./scripts/04_ex_graph.py"))
    program_name_list.append("04_ex_graph")
    # Create scatterplot of Y and X with local polynomial fit
    # INPUTS
    #  here("./proc/example_cleaned.csv") # 01_ex_dataprep.py
    # OUTPUTS
    #  here("./results/figures/ex_scatter.eps") # figure

ts = time_stamp()
for program, name in zip(program_list, program_name_list):
    init_time = time.perf_counter()
    # The advantage of check_call(...) is that it prints messages to the
    # terminal synchronously while the program runs. However, the IPython
    # interpreter is not compatible with it, so is_ipython() checks whether
    # this script is running in an IPython environment. The code for
    # is_ipython() is in this repository.
    if is_ipython():
        # This won't print stdout of the subprograms as they run
        subprocess.run(["python", str(program)], check=True)
    else:
        # The command is passed as a list, so no shell is needed
        subprocess.check_call(["python", str(program)],
                              stdout=sys.stdout, stderr=subprocess.STDOUT)
    print("Finished:" + str(program))
    print_performance(init_time, file=os.path.join(here("./results/performance"),
                                                   f"{name}_performance_{ts}.csv"))
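The run script imports three helpers from `scripts/programs`. The versions stored in the repo are not reproduced here; the block below is only a minimal sketch of what such helpers could look like, assuming `print_performance` just needs to report elapsed wall-clock time. The function bodies, the message format, and the handling of the `file` argument are all assumptions.

```python
import os
import time
from datetime import datetime


def time_stamp(now=None):
    """Return a time stamp formatted as YYYYMMDD_hhmmss."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


def print_performance(init_time, file=None):
    """Print the seconds elapsed since init_time and optionally write them to file."""
    elapsed = time.perf_counter() - init_time
    message = f"Elapsed time: {elapsed:.2f} seconds"
    print(message)
    if file is not None:
        # Create the output folder if it does not exist yet
        os.makedirs(os.path.dirname(os.path.abspath(file)), exist_ok=True)
        with open(file, "w") as f:
            f.write(message + "\n")
    return elapsed


def is_ipython():
    """Return True when running inside an IPython or Jupyter session."""
    try:
        # get_ipython is only defined inside IPython sessions
        get_ipython  # noqa: F821
    except NameError:
        return False
    return True


# Example: a fixed date gives a predictable stamp for a performance file name
ts_example = time_stamp(datetime(2022, 1, 1, 20, 0, 0))
print(f"01_ex_dataprep_performance_{ts_example}.txt")
print(is_ipython())
```

Saving each function in its own file under `scripts/programs` (for example `time_stamp.py`) would match the imports at the top of the run script.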
If your scripts are .ipynb rather than .py files, replace the `subprocess` loop that runs the programs in `program_list` with the following:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

for program in program_list:
    with open(program) as f:
        # ExecutePreprocessor expects notebooks in the version 4 format
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=-1, kernel_name="python3")
    ep.preprocess(nb, {"metadata": {"path": str(here("./scripts"))}})
    print("Finished:" + str(program))
- Use `matplotlib` or `seaborn` for graphing. For graphs with colors, use `cubehelix` for a colorblind-friendly palette.
- For reproducible graphs, always set the figure's width and height explicitly (for example, with the `figsize` argument of `plt.subplots()`) before calling `savefig`.
- To see what the final graph looks like, open the file that you save, since its appearance will differ from what you see in JupyterLab or the Spyder plots pane.
- For higher (in fact, infinite) resolution, save graphs as .eps files. (This is better than .pdf given that .eps are editable images, which is sometimes required by journals.)
  - I've written a Python function `crop_eps` to crop (post-process) .eps files when you can't get the cropping just right in Stata.
- For maps (and working with geospatial data more broadly), use `GeoPandas`.
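The graphing advice above can be sketched as follows. This is only an illustration: the data are made up, the 6 x 4 inch size is an arbitrary choice, and the file is written to a temporary folder rather than to `results/figures`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration
x = np.linspace(0, 10, 50)
n_lines = 4

# Sample evenly spaced colors from the cubehelix colormap
cmap = plt.get_cmap("cubehelix")
colors = cmap(np.linspace(0.1, 0.8, n_lines))

# Fix the figure size (in inches) so the saved file has the same dimensions everywhere
fig, ax = plt.subplots(figsize=(6, 4))
for i, color in enumerate(colors):
    ax.plot(x, np.sin(x + i), color=color, label=f"Series {i + 1}")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend(framealpha=1)  # opaque legend, since .eps does not support transparency

# Save to a temporary folder here; in a project, build the path with here()
out_path = os.path.join(tempfile.mkdtemp(), "ex_lines.eps")
fig.savefig(out_path, format="eps")
plt.close(fig)
print(out_path)
```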
- For small data sets, save as .csv with `DataFrame.to_csv()` and read with `pandas.read_csv()`.
- For larger data sets, save with `DataFrame.to_pickle()` using a .pkl file extension, and read with `pandas.read_pickle()`.
- For truly big data sets (hundreds of millions or billions of observations), use `write.parquet()` and `read.parquet()` from `pyspark.sql`.
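A short round trip shows the practical difference between the first two formats. The data frame and file names below are made up, and a temporary folder stands in for the project's `proc` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small data set
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
})

tmp_dir = tempfile.mkdtemp()  # in a project, use the proc folder via here()
csv_path = os.path.join(tmp_dir, "example_cleaned.csv")
pkl_path = os.path.join(tmp_dir, "example_cleaned.pkl")

# .csv: readable anywhere, but column types must be restored when reading
df.to_csv(csv_path, index=False)
df_csv = pd.read_csv(csv_path, parse_dates=["date"])

# .pkl: keeps column types as they were, but is specific to Python
df.to_pickle(pkl_path)
df_pkl = pd.read_pickle(pkl_path)

# The pickled copy is identical to the original
assert df_pkl.equals(df)
```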
When randomizing assignment in a randomized control trial (RCT):
- Seed: Use a seed from https://www.random.org/: put Min 1 and Max 100000000, then click Generate, and copy the result into your script at the appropriate place. Towards the top of the script, assign the seed with the lines `seed = ... # from random.org` and `random.seed(seed)`, where `...` is replaced with the number that you got from random.org.
- Use the `stochatreat` package to assign treatment and control groups.
- Build a randomization check: create the assignment variable a second time with a new name, repeating `random.seed(seed)` immediately before creating the second variable. Then check that the randomization is identical using `assert (df.var1 == df.var2).all()`.
- It is also good to do a more manual check where you run the full script once, save the resulting data with a different name, then restart Python (see instructions below), and run it a second time. Then read in both data sets with the random assignment and assert that they are identical.
Above I described how data preparation scripts should be separate from analysis scripts. Randomization scripts should also be separate from data preparation scripts, i.e. any data preparation needed as an input to the randomization should be done in one script and the randomization script itself should read in the input data, create a variable with random assignments, and save a data set with the random assignments.
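The randomization check described above can be sketched with the standard library alone. To stay self-contained, this example swaps `stochatreat` for plain `random.sample`; the seed is a placeholder rather than a number from random.org, and `assign_treatment` and the unit IDs are hypothetical.

```python
import random

seed = 12345  # placeholder; replace with a number drawn from random.org

units = list(range(1, 11))  # hypothetical unit IDs


def assign_treatment(units, seed):
    """Assign half of the units to treatment (1) and the rest to control (0)."""
    random.seed(seed)  # re-seed immediately before drawing
    treated = set(random.sample(units, k=len(units) // 2))
    return [1 if u in treated else 0 for u in units]


# Draw the assignment twice, re-seeding before each draw
treat1 = assign_treatment(units, seed)
treat2 = assign_treatment(units, seed)

# Randomization check: the two draws must be identical
assert treat1 == treat2
assert sum(treat1) == len(units) // 2
```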
Once you complete a script or Jupyter notebook, which you might be running line by line, make sure it runs on a fresh Python session.
- To do this in Jupyter, use the menus and select Kernel > Restart and run all to ensure that the script runs in its entirety.
- To do this in Spyder,
Create a virtual environment to run your project. Use a virtual environment through `venv` (instead of `pyenv`) to manage the packages in a project and avoid conflicts related to package versioning.
- If you are using Anaconda, navigate to the directory of the project in the command line, and type `conda create -n yourenvname python=x.x anaconda`. Activate the environment using `conda activate yourenvname`, and `conda deactivate` will exit the environment.
  - First run `conda install pip` to install pip to your directory.
  - As the final step in Anaconda, install the packages. Find your Anaconda directory; it should be something like `/anaconda/envs/venv_name/`. Install new packages using `/anaconda/envs/venv_name/bin/pip install package_name`; the same pip can also be used to install the packages listed in a requirements.txt file. To create a `requirements.txt` file, use `pip freeze -l > requirements.txt`.
- If you are only using Python 3, `python3 -m venv yourenvname` will create your environment. Activate the environment using `source yourenvname/bin/activate`, and `deactivate` will exit the environment.
  - After activating your virtual environment in Python 3, running `pip freeze > requirements.txt` in the command line will create a text document of the packages in the environment to include in your project directory.
  - `pip install -r requirements.txt` in a virtual environment will install all the required packages for the project in Python 3.
- Do not ignore `SettingWithCopyWarning`. See more detailed explanations here. There are two main reasons for this warning: chained assignment and hidden chaining.
  - Chained assignment: when you want to assign a value to a column in a dataframe, do not type
    country_df[country_df.country == "USA"]["state"] = "Illinois"
    Instead, you should type
    country_df.loc[country_df.country == "USA", "state"] = "Illinois"
  - Hidden chaining: if you want to create a dataframe from another dataframe based on some conditions, and you are sure that you do not want to change the values of the original dataframe (`country_df`) in future analysis, you should not type
    new_country_df = country_df[country_df.country == "USA"]
    Instead, you should type
    new_country_df = country_df[country_df.country == "USA"].copy()
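A small runnable version of the two recommended patterns, using a made-up `country_df`, shows what each one guarantees: `.loc` modifies the original data frame, and `.copy()` protects it from later edits.

```python
import pandas as pd

# Hypothetical data for illustration
country_df = pd.DataFrame({
    "country": ["USA", "USA", "Canada"],
    "state": ["Texas", "Ohio", "Ontario"],
})

# Assign through a single .loc call so the original data frame is modified
country_df.loc[country_df.country == "USA", "state"] = "Illinois"
assert country_df["state"].tolist() == ["Illinois", "Illinois", "Ontario"]

# Take an explicit copy so later edits cannot affect country_df
new_country_df = country_df[country_df.country == "USA"].copy()
new_country_df["state"] = "Indiana"

# The original is unchanged by the edit to the copy
assert country_df["state"].tolist() == ["Illinois", "Illinois", "Ontario"]
```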
- General idea: All files that are read or written in any script should have a full relative file path using `here()`, and never an absolute file path.
- Using `here()` (note that the syntax is different from `rprojroot.here`):
  - Make sure there's a file flagging the root directory. The easiest way is to create a `.here` file in the root directory. Other available formats for the "flag file" include `.git`, `*.Rproj`, etc. See details in the master file of `pyprojroot.here()`.
  - The syntax:
import os
import pandas as pd
from pyprojroot import here

# Read a file
data = pd.read_csv(here("./proc/your_data.csv"))
# Write a file
data.to_csv(os.path.join(here("./proc"), "your_data.csv"))
  - Note that the `here()` function will report a "Path doesn't exist" error if you put `"your_data.csv"` inside `here()`, since this file does not exist at the moment when `here()` tries to find it.
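To make the idea of a "flag file" concrete, here is a minimal sketch of how a root-finding helper could work. It is not the `pyprojroot` implementation: `find_project_root` is a hypothetical function that only looks for a `.here` file, and the throwaway folder stands in for a real project.

```python
import tempfile
from pathlib import Path


def find_project_root(start, flag=".here"):
    """Walk up from start until a folder containing the flag file is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / flag).exists():
            return folder
    raise FileNotFoundError(f"No '{flag}' file found at or above {start}")


# Build a throwaway project: <root>/.here and <root>/scripts/
root = Path(tempfile.mkdtemp()).resolve()
(root / ".here").touch()
(root / "scripts").mkdir()

# Starting from the scripts folder, the search climbs to the project root
found = find_project_root(root / "scripts")

# Paths are then written relative to the root, never as absolute paths
out_path = found / "proc" / "your_data.csv"
print(out_path.relative_to(found))
```

Because the root is discovered at run time, the same script works on any machine where the project folder contains the flag file.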