Stata Guide

This guide provides instructions for using Stata on research projects. It is intended for use with collaborators and research assistants to make code consistent, easier to read, transparent, and reproducible.

Also see my R Guide and Python Guide.

Style

For coding style practices, follow the DIME Analytics Coding Guide. There are a few places where I recommend deviating from this style guide:

  • Use the boilerplate described below in the 00_run.do script to ensure a fresh Stata session when running scripts, rather than using ieboilstart.
  • #delimit ; can be used when a command takes up many lines, such as a long local macro whose elements are easier to read listed vertically rather than horizontally. However, this should only be used for the command that takes up many lines, and immediately afterwards #delimit cr should be included to go back to not needing ; at the end of each line. (See the example below.)
  • Use // for both single-line comments and in-line comments. Using the same characters for both types of comments more closely matches what other statistical programming languages do (e.g. # for both types of comments in R and Python), and it ensures that various text editors' syntax highlighting can identify comments. (The problem with using * for single-line comments is that * is also used for multiplication and this can confuse some text editors' syntax highlighting.)
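
For example, a long local macro could be written as follows (the macro and variable names are illustrative):

    #delimit ;
    local controls
        age
        education
        income
        household_size
    ;
    #delimit cr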

Packages

Most user-written Stata packages are hosted on the Boston College Statistical Software Components (SSC) archive. It is easy to download packages from SSC: simply run ssc install package, where package is replaced with the name of the package you want to install.

  • Use reghdfe for fixed effects regressions.
  • Use ftools and gtools for working with large datasets.
  • Use randtreat for randomization.
  • Use regsave and texsave for generating tables with multiple panels.
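
For example, to install all of the packages above:

    ssc install reghdfe
    ssc install ftools
    ssc install gtools
    ssc install randtreat
    ssc install regsave
    ssc install texsave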

Folder structure

Generally, within a project folder, we have a subfolder called analysis where we are doing data analysis (and other sub-folders like paper where the paper draft is saved). Within the analysis subfolder, we have:

  • data - only raw data go in this folder
  • documentation - documentation about the data go in this folder
  • logs - log files go in this folder
  • proc - processed data sets go in this folder
  • results - results go in this folder
    • figures - subfolder for figures
    • tables - subfolder for tables
  • scripts - code goes in this folder. The scripts needed to go from raw data to final results are stored directly in the scripts folder.
    • programs - a subfolder containing functions called by the analysis scripts. All user-written ado files should be contained in this directory.
    • old - a subfolder where old scripts are stored if there are major changes to the structure of the project. Scripts in the old subfolder are not used to go from raw data to final results, but are kept while the project is ongoing in case they need to be used or referred back to. The old subfolder is not included in the replication package, since its scripts are not part of the process of going from raw data to final results.

Filepaths

  • Use forward slashes for filepath names ($results/tables not $results\tables). This ensures that the code works across multiple operating systems, and avoids issues that arise due to the backslash being used as an escape character.
  • Avoid spaces and capital letters in file and folder names.
  • Never use cd to manually change the directory. Unfortunately, Stata does not have a package for working with relative filepaths (like here in R or pyprojroot in Python). Instead, the 00_run.do script (described below) should define a global macro for the project's root directory and (optionally) global macros for its immediate subdirectories. Then, since scripts should always be run through 00_run.do, all other do files should build paths from the root-directory global defined in 00_run.do rather than hard-coding absolute paths.
    • This ensures that when others run the project code, they only need to change the file path in one place.
    • Within project teams, you can include a chunk of code in 00_run.do that automatically determines which team member's computer or which server is running the code, using if conditions with "`c(username)'". This is described in more detail in the example 00_run.do script below.
    • However, for the replication package a user outside the team would still need to manually edit the file path of the project's root directory. This should require editing only one line of code in 00_run.do and not editing any code in any other do files.
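
For example, a script run through 00_run.do can reference files like this (the file names are illustrative):

    // Build all paths from the globals defined in 00_run.do
    use "$data/example.dta", clear
    save "$proc/example_cleaned.dta", replace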

Scripts structure

Separating scripts

Because we often work with large data sets and efficiency is important, I advocate (nearly) always separating the following three actions into different scripts:

  1. Data preparation (cleaning and wrangling)
  2. Analysis (e.g. regressions)
  3. Production of figures and tables

The analysis and figure/table scripts should not change the data sets at all (no pivoting from wide to long or adding new variables); all changes to the data should be made in the data cleaning scripts. The figure/table scripts should not run regressions or perform other analysis; that should be done in the analysis scripts. This way, if you need to add a robustness check, you don't necessarily have to rerun all the data cleaning code (unless the robustness check requires defining a new variable). And if you need to make a formatting change to a figure, you don't have to rerun all the analysis code (which can take a while on large data sets).

Naming scripts

  • Include a 00_run.do script (described below).
  • Number scripts in the order in which they should be run, starting with 01.
  • Because a project often uses multiple data sources, I usually include a brief description of the data source being used as the first part of the script name (in the example below, ex describes the data source), followed by a description of the action being done (e.g. dataprep, reg, etc.), with each component of the script name separated by an underscore (_).

00_run.do script

Keep a "run" script, 00_run.do, that lists each script in the order it should be run to go from raw data to final results. Under the name of each script, include a brief description of its purpose, as well as all the input and output data sets it uses.

The 00_run.do script accomplishes three objectives:

  1. Define the global macro for the project's root directory. A code chunk that automatically identifies which user on the team is running the code can also be included so that no code needs to be edited for different team members to run 00_run.do. Nevertheless, one line of code in 00_run.do will need to be edited when someone outside the research team wants to run the replication package.
  2. Include boilerplate to mimic a fresh Stata session (e.g. clearing any data sets and locals in memory).
  3. Run particular scripts for the analysis. Which scripts are run is controlled with local macros. In the final replication package, these macros should all be set to 1.

Ideally, a user could run 00_run.do to run the entire analysis from raw data to final results (although this may be infeasible for some projects, e.g. one with multiple confidential data sets that can only be accessed on separate servers).

Below is a brief example of a 00_run.do script.

// Run script for example project

// BOILERPLATE ---------------------------------------------------------- 
// For nearly-fresh Stata session and reproducibility
set more off
set varabbrev off
clear all
macro drop _all
version 14.2

// DIRECTORIES ---------------------------------------------------------------
// To replicate on another computer, simply uncomment the following line
//  (by removing //) and change the path:
// global main "/path/to/replication/folder"

if "$main"=="" { // Note this will only be untrue if line above uncommented
                 // due to the `macro drop _all` in the boilerplate
    if "`c(username)'" == "John" { // John's Windows computer
        global main "C:/Dropbox/MyProject/analysis" // Ensure no spaces 
    }
    else if "`c(username)'" == "janedoe" { // Jane's Mac laptop 
        global main "/Users/janedoe/Dropbox/MyProject/analysis"
    }
    else { // display an error 
        display as error "User not recognized."
        display as error "Specify global main in 00_run.do."
        exit 198 // stop the code so the user sees the error
    }
}

// Also create globals for each subdirectory
local subdirectories ///
    data             ///
    documentation    ///
    logs             ///
    proc             ///
    results          ///
    scripts
foreach folder of local subdirectories {
    cap mkdir "$main/`folder'" // Create folder if it doesn't exist already
    global `folder' "$main/`folder'"
}
// Create results subfolders if they don't exist already
cap mkdir "$results/figures"
cap mkdir "$results/tables"

// The following code ensures that all user-written ado files needed for
//  the project are saved within the project directory, not elsewhere.
tokenize `"$S_ADO"', parse(";")
while `"`1'"' != "" {
    if `"`1'"'!="BASE" cap adopath - `"`1'"'
    macro shift
}
adopath ++ "$scripts/programs"

// PRELIMINARIES -------------------------------------------------------------
// Control which scripts run
local 01_ex_dataprep = 1
local 02_ex_reg      = 1
local 03_ex_table    = 1
local 04_ex_graph    = 1

// RUN SCRIPTS ---------------------------------------------------------------

// Read and clean example data
if (`01_ex_dataprep' == 1) do "$scripts/01_ex_dataprep.do"
// INPUTS
//  "$data/example.csv"
// OUTPUTS
//  "$proc/example.dta"

// Regress Y on X in example data
if (`02_ex_reg' == 1) do "$scripts/02_ex_reg.do"
// INPUTS
//  "$proc/example.dta" // 01_ex_dataprep.do
// OUTPUTS 
//  "$proc/ex_reg_results.dta" // results stored as a data set

// Create table of regression results
if (`03_ex_table' == 1) do "$scripts/03_ex_table.do"
// INPUTS 
//  "$proc/ex_reg_results.dta" // 02_ex_reg.do
// OUTPUTS
//  "$results/tables/ex_reg_table.tex" // tex of table for paper

// Create scatterplot of Y and X with local polynomial fit
if (`04_ex_graph' == 1) do "$scripts/04_ex_graph.do"
// INPUTS
//  "$proc/example.dta" // 01_ex_dataprep.R
// OUTPUTS
//  "$results/figures/ex_scatter.eps" # figure

Graphing

  • I wrote an ado file graph_options.ado to standardize and facilitate graph formatting. It creates local macros that can be used with graph twoway to control formatting. Use the locals generated by graph_options to ensure clean formatting of your graphs, including a white background, sufficiently large text, axis lines only at x=0 and y=0, and no gridlines or tickmarks. For example:

    // Load data
    sysuse "auto2.dta", clear
    
    // Generate local macros for graph formatting (use defaults)
    graph_options
    
    // Create the graph
    #delimit ;
    graph twoway scatter mpg trunk, 
        title("Cars with more trunk space have worse mpg", `title_options')
        ytitle("Miles per gallon (mpg)", `ytitle_options')
        ylabel(, `ylabel_options') 
        xtitle("Trunk space (cubic inches)", `xtitle_options')
        xlabel(, `xlabel_options') 
        xscale(noline)
        yscale(noline)
        `marker_options'
        `plotregion' `graphregion'
        legend(off)
    ;
    #delimit cr
    • The defaults of graph_options can also be changed while keeping the same graph twoway code, for example to add more margin on the right of the graph (which can be useful if the x-axis numbers have more digits) and to increase the size of the points in the scatterplot:
    // Generate local macros for graph formatting
    graph_options, ///
        graph_margin(l=0 t=0 b=0 r=5) /// right margin
        marker_size(medlarge)          // larger points

    See graph_options_reprex.do for more examples of its use with changes to its defaults, and look at the graph_options.ado function itself to see which arguments and graph formatting settings it can change, as I have not yet written a help file. (Pull requests welcome to expand it to more use cases.)

  • For graphs with colors, use the cblind palette from the palettes package (which also requires colrspace) for a colorblind-friendly palette. For convenience, my graph_options function also creates locals cblind1 to cblind9 that correspond to the colors in this palette.

  • For reproducible graphs, always use the width() and height() options when exporting graphs.

  • I recommend using R for creating maps given the ease and computational efficiency of working with shapefiles using the sf package in R and plotting maps using ggplot2::geom_sf(). See my R Guide. If you are creating a map in Stata, use shp2dta to convert shapefiles to .dta and spmap for creating maps and plotting spatial data, as in the sketch below.
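
A minimal sketch of a map in Stata (the file and variable names are illustrative):

    // Convert a shapefile to .dta files for the attributes and coordinates
    shp2dta using "$data/municipalities.shp", ///
        database("$proc/muni_db") coordinates("$proc/muni_coord") genid(id)

    // Plot a choropleth map of a variable in the attribute data
    use "$proc/muni_db.dta", clear
    spmap population using "$proc/muni_coord.dta", id(id) fcolor(Blues)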

Saving files

Data sets

  • Save data sets as .dta with save and read .dta data sets with use.

    • To write over a file that already exists when saving, use the replace option.
    • If a different data set is already loaded in memory, when reading another data set, use the clear option.
  • When dealing with large datasets, there are a couple of things that can make your life easier:

    • Stata reads faster from its native .dta format.
    • You can read a subset of the observations or variables using the following syntax:
    use [varlist] [if] [in] using filename
  • When you just want to save files temporarily for later use within the same script, use tempfile. Note that tempfiles are automatically deleted after the program ends. (See the sketch below.)
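
A minimal sketch of both patterns (the file and variable names are illustrative):

    // Read only some variables and a subset of observations from a .dta file
    use mpg weight if foreign == 1 using "$proc/example.dta", clear

    // Save an intermediate data set temporarily; it is deleted when the program ends
    tempfile subset
    save `subset'

    // Later in the same script, read it back in
    use `subset', clear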

Graphs

  • Save graphs with graph export.
    • For reproducible graphs, always specify the width and height dimensions in pixels using the width and height options (e.g. width(2400) height(1600)), as in the example below.
  • To see what the final graph looks like, open the saved file, since its appearance will differ from what you see in Stata's graph window when you specify the width and height arguments in graph export.
  • For higher (in fact, infinite) resolution, save graphs as .eps files. (This is better than .pdf since .eps files are editable images, which journals sometimes require.)
    • I've written a Python function crop_eps to crop (post-process) .eps files when you can't get the cropping just right in Stata.
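
For example, to export a bitmap version of a graph at a fixed size in pixels (the file name is illustrative):

    graph export "$results/figures/ex_scatter.png", width(2400) height(1600) replace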

Randomization

When randomizing assignment in a randomized control trial (RCT):

  • Seed: Use a seed from https://www.random.org/: put Min 1 and Max 100000000, then click Generate, and copy the result into your script. Towards the top of the script, assign the seed with the line
    local seed ... // from random.org
    where ... is replaced with the number that you got from random.org.
  • Make sure the Stata version is set in the 00_run.do script, as described above. This ensures that the randomization algorithm is the same, since the randomization algorithm sometimes changes between Stata versions.
  • Use the randtreat package.
  • Immediately before the line using a randomization function, include set seed `seed'.
  • Build in a randomization check: create the random assignment variable a second time with a new name, repeating set seed `seed' immediately before creating it. Then check that the two assignments are identical using assert (see the sketch after this list).
  • As a second randomization check, create a separate script that runs the randomization script once (using do) and saves the resulting data set under a different name, then runs it again (with do), reads in the two differently-named data sets from the two runs, and checks that they are identical.
  • Note: if creating two cross-randomized variables, you would not want to repeat set seed `seed' before creating the second one, otherwise it would use the same assignment as the first.
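
A minimal sketch of the first check (the variable names and the misfits() option are illustrative; see help randtreat for details):

    // `seed' is the local defined near the top of the script, as described above;
    //  this assumes the data to be randomized are already in memory
    set seed `seed'
    randtreat, generate(treat) misfits(global)

    // Randomization check: repeat with the same seed under a new variable name
    set seed `seed'
    randtreat, generate(treat_check) misfits(global)
    assert treat == treat_check // stops with an error if the assignments differ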

Above I described how data preparation scripts should be separate from analysis scripts. Randomization scripts should also be separate from data preparation scripts, i.e. any data preparation needed as an input to the randomization should be done in one script and the randomization script itself should read in the input data, create a variable with random assignments, and save a data set with the random assignments.

Running scripts

Once you complete a script, which you might be running line by line while you work on it, make sure the script works on a fresh Stata session. To do this, adjust the local macros in 00_run.do to run the appropriate scripts (i.e., set the local macros for the scripts you want to run to 1, and the local macros for the scripts you do not want to run to 0), and run the entire 00_run.do file. The boilerplate code in 00_run.do will ensure that you are running the code in a nearly-fresh Stata session.

Reproducibility

  • As shown above, include a version statement in the 00_run.do script. For example, writing version 16.1 makes all future versions of Stata run your code the same way Stata 16.1 did.

  • Start your do files by opening a log file; this records all commands and output in a session and can be helpful to look back at the output from a particular script.

    • Name the log files using the same naming convention as you use for your scripts. For example, the log file for 01_ex_dataprep.do should be 01_ex_dataprep.log.
    • Use the text option so that log files are saved as plain text rather than in Stata Markdown Command Language (SMCL). This ensures that they can be easily viewed in any text editor.
    • Start a log with log using, for example:
    log using "$logs/01_ex_dataprep.log", text replace
    • Close a log file at the end of the script with log close.
  • All user-written ado files that are used by your scripts should be kept in the $scripts/programs folder.

    • The 00_run.do script should include the following code, which will lead to an error if any of the user-written ado files needed for the project are not saved in $scripts/programs. After running the code below (h/t Julian Reif), if you ssc install any programs during the same Stata session, they will correctly install in the project's $scripts/programs folder. If you switch to working on a different project, you should close and reopen Stata.
    tokenize `"$S_ADO"', parse(";")
    while `"`1'"' != "" {
        if `"`1'"'!="BASE" cap adopath - `"`1'"'
        macro shift
    }
    adopath ++ "$scripts/programs"

Misc.

Some additional tips:

  • For debugging, use set trace on before running the script. This will show you how Stata is interpreting your code and can help you find the bug.
    • set tracedepth is also useful to control how deep into each command's code the trace feature will go. The default when you set trace on is set tracedepth 32000. If you don't want to print so much of the code Stata is interpreting as it goes through your script, you can use for example set tracedepth 2.
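    For example (the script name is illustrative):
    // Show how Stata interprets each command, without going deep into ado code
    set tracedepth 2
    set trace on
    do "$scripts/01_ex_dataprep.do"
    set trace off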
  • To run portions of code while you are programming, you can set local macros at the top of the do file and then use if conditions to only run some of the chunks of code. This is preferable to highlighting sections of code in the do file and running just those lines. For example:
    // Set local macros
    local cleaning  = 0
    local reshape   = 1
    
    // Read in ex data
    use "$data/ex_data.dta", clear
    
    // Wrangle the data
    if (`cleaning' == 1) {
        // Clean the data
    }
    if (`reshape' == 1) {
        // Reshape the data
    }
    • In the final replication package, all of these local macros and if conditions should either be removed or all set to 1 so that all of the code runs in the replication package without the user needing to adjust the local macros.