Data preprocessing transforms raw data into useful information for knowledge gain through classifying, sorting, merging, retrieving, transmitting, or recording. It can be done manually or by computer, and it can also be automated.
One form of data preprocessing is data cleaning. Here, the following steps are applied to produce the preprocessed data:
- Remove square brackets
- Remove non-ASCII characters from the list of tokenized words
- Convert all characters in the list of tokenized words to lowercase
- Remove stopwords
- Remove punctuation from the list of tokenized words
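
The cleaning steps above can be sketched in Python using only the standard library. This is a minimal illustration, not necessarily how the project implements them. Several details are assumptions: the function names, whitespace tokenization, the tiny `STOPWORDS` set (a real run would likely use a fuller list), and the reading of "remove square brackets" as dropping the brackets together with the text inside them.

```python
import re
import string
import unicodedata

# Illustrative stopword list (assumption); swap in a fuller list as needed.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def remove_square_brackets(text):
    """Drop bracketed spans such as "[sic]" (assumed interpretation)."""
    return re.sub(r"\[[^\]]*\]", "", text)


def remove_non_ascii(words):
    """Decompose accented characters, then discard anything outside ASCII."""
    cleaned = []
    for word in words:
        normalized = unicodedata.normalize("NFKD", word)
        cleaned.append(normalized.encode("ascii", "ignore").decode("ascii"))
    return cleaned


def to_lowercase(words):
    return [word.lower() for word in words]


def remove_stopwords(words):
    return [word for word in words if word not in STOPWORDS]


def remove_punctuation(words):
    """Strip ASCII punctuation and drop tokens that become empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (word.translate(table) for word in words)
    return [word for word in stripped if word]


def clean_text(text):
    """Apply the steps in the order listed above."""
    words = remove_square_brackets(text).split()  # whitespace tokenization
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_stopwords(words)
    return remove_punctuation(words)
```

Because stopwords are removed before punctuation in this ordering, a token such as `the,` would survive the stopword filter; swapping the last two steps is one way to avoid that if it matters for your data.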
These steps are applied to the files in the polled directory: poll the directory for files and preprocess the contents of each file.
After preprocessing, write the content to a new file in the same or a different directory.
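
The polling and writing steps might look roughly like the sketch below. The names `poll_once` and `poll_forever`, the `_clean` filename suffix, and the five-second interval are all hypothetical choices for illustration, and the default `clean` callable simply lowercases text as a stand-in for the real cleaning routine.

```python
import time
from pathlib import Path

CLEAN_SUFFIX = "_clean"  # hypothetical marker for already-processed output


def poll_once(in_dir, out_dir, clean=lambda text: text.lower()):
    """Preprocess every new file in in_dir and write results to out_dir.

    Returns the list of output paths written during this pass.
    """
    in_path, out_path = Path(in_dir), Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_path.iterdir()):
        # Skip subdirectories and our own outputs (matters when both
        # directories are the same).
        if not src.is_file() or src.stem.endswith(CLEAN_SUFFIX):
            continue
        dst = out_path / f"{src.stem}{CLEAN_SUFFIX}{src.suffix}"
        if dst.exists():
            continue  # already processed on an earlier poll
        text = src.read_text(encoding="utf-8", errors="replace")
        dst.write_text(clean(text), encoding="utf-8")
        written.append(dst)
    return written


def poll_forever(in_dir, out_dir, interval_seconds=5.0):
    """Repeat poll_once at a fixed interval until interrupted."""
    while True:
        poll_once(in_dir, out_dir)
        time.sleep(interval_seconds)
```

Skipping files that already have an output lets the same directory be polled repeatedly without redoing work; a production setup might instead track modification times or move processed files aside.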