This repository has been archived by the owner on Oct 30, 2018. It is now read-only.


classifyintents

This is a Python module which prepares and cleans GOV.UK survey data in preparation for classification by a machine learning algorithm. Training the algorithm and predicting on new data are handled in the alphagov/classifyintentspipe repo. The module is built around the classifyintents.survey class and its associated methods.

To install this module using pip:

pip install git+git://github.com/alphagov/classifyintents.git

Alternatively, place the following line in your requirements.txt file:

git+git://github.com/alphagov/classifyintents.git

and run pip install -r requirements.txt as usual.

Requirements

  • Python >= 3.5

See requirements.txt for additional requirements.

Usage

Loading data

To begin, instantiate the class with:

from classifyintents import survey

intent = survey()

Load some raw data. The class expects an unedited CSV file downloaded from SurveyMonkey. Note that the load() method also cleans the column names and drops a sub-heading row from the CSV generated by SurveyMonkey.

intent.load('data.csv')

The data are stored in the class as a pandas DataFrame named intent.raw.
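The load() step above can be sketched with plain pandas. Both the column names and the row layout below are illustrative, not the real survey schema; the module's own cleaning rules live in its source.

```python
import io
import pandas as pd

# A toy SurveyMonkey-style export: a header row followed by a
# sub-heading row that load() would drop (layout is an assumption).
raw_csv = io.StringIO(
    "Respondent ID,Start Date,What Did You Come To Do?\n"
    ",,Open-Ended Response\n"
    "1001,2016-01-04,renew my passport\n"
    "1002,2016-01-04,pay car tax\n"
)

raw = pd.read_csv(raw_csv, skiprows=[1])          # drop the sub-heading row
raw.columns = (raw.columns.str.strip()
                          .str.lower()
                          .str.replace(r"[^a-z0-9]+", "_", regex=True)
                          .str.strip("_"))        # normalise column names

print(list(raw.columns))
print(len(raw))
```

After this, raw holds the two data rows with snake_case column names, analogous to intent.raw.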

Cleaning the raw data

The next step is to perform some cleaning of the raw data. This is accomplished in the clean_raw() method. The method does a number of things:

  • Creates a copy of the intent.raw dataframe, and calls this new dataframe intent.data.
  • The messy column names inherited from the CSV are cleaned up using a dictionary called intent.raw_mapping.
    • Note that if the format of the survey or the names of questions are changed, breaking the class, a quick fix may be to update the intent.raw_mapping dictionary.
  • A number of new features are added to the data:
    • Time taken to complete the survey.
    • Some simple features based on the free text:
      • Number of characters in the string.
      • Ratios of capital letters and exclamation marks to the total number of characters.
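The free-text features listed above can be sketched as follows. The column name and example responses are illustrative, and the exact formulas used by clean_raw() may differ in detail.

```python
import pandas as pd

# Illustrative free-text responses (not real survey data)
data = pd.DataFrame({"comment": ["I NEED MY PASSPORT NOW!!", "renew car tax"]})

# Number of characters in each response
data["comment_len"] = data["comment"].str.len()

# Ratio of capital letters to total characters
data["capital_ratio"] = data["comment"].str.count(r"[A-Z]") / data["comment_len"]

# Ratio of exclamation marks to total characters
data["exclamation_ratio"] = data["comment"].str.count("!") / data["comment_len"]
```

Shouty, punctuation-heavy responses score high on both ratios, which is the signal these features are meant to capture.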

Determining the org and section

The page the user was visiting when asked to complete the survey is recorded in a cleaned field called full_url. In this step the URLs are cleaned according to a number of rules; the unique URLs are then extracted and queried using the GOV.UK content API, which returns an organisation (org) and a section (section).

These data are then merged back into the intent.data dataframe. This step is completed with:

intent.api_lookup()

This step is verbose, and can take a while if there are a large number of URLs to look up.
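As a rough sketch, a lookup for one page might build a query against the public GOV.UK content API like this. The endpoint shape is an assumption based on the public API; the module's own lookup logic (and the fields it reads from the response) may differ.

```python
from urllib.parse import urlparse

def content_api_url(full_url: str) -> str:
    """Build a GOV.UK content API query from a page URL.

    Assumes the public https://www.gov.uk/api/content/<path> shape;
    the module may use a different endpoint.
    """
    path = urlparse(full_url).path
    return "https://www.gov.uk/api/content" + path

print(content_api_url("https://www.gov.uk/renew-adult-passport"))
```

Deduplicating URLs before querying, as api_lookup() does, keeps the number of HTTP requests down.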

Preparing the data for training or prediction

Assuming all has gone well so far, the next step is to prepare the data for training or prediction with a machine learning algorithm. This is done with the intent.trainer() and intent.predictor() methods respectively.

When calling intent.trainer() a list of classes must be passed as an argument. As part of the method, all classes that are not specified in the list are concatenated into one, enabling one-versus-all (OVA) classification.
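The one-versus-all relabelling can be sketched with a small helper. The class names and the catch-all label below are illustrative; trainer() may use different labels internally.

```python
import pandas as pd

def collapse_classes(labels: pd.Series, keep: list, other: str = "other") -> pd.Series:
    """Collapse any label not in `keep` into a single catch-all class,
    enabling one-versus-all classification (a sketch, not the module's code)."""
    return labels.where(labels.isin(keep), other)

# Illustrative outcome labels
labels = pd.Series(["ok", "complaint", "spam", "ok", "broken-link"])
print(collapse_classes(labels, keep=["ok"]).tolist())
```

After collapsing, a binary classifier can be trained on "ok" versus everything else.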

The predictor() method removes the outcome class, if one is present.

The data are now ready for the application of a machine learning algorithm.