Modeling with Unstructured Data

Data Ingestion

> res <- XML::readHTMLTable(paste0('http://cran.r-project.org/',
+     'web/packages/available_packages_by_name.html'), which = 1)
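
Before going further, it is worth a quick look at what readHTMLTable returned. The snippet below is only a sanity check and assumes the table was parsed without a header row, so the package name lands in column V1 and the short description in V2 (the column used later in this section):

## Quick sanity check on the downloaded table; the V1/V2 column
## names are an assumption about how the HTML table was parsed
dim(res)           # one row per CRAN package
head(res[, 1:2])   # first few package names and descriptions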

R comes with a number of functions for reading different types of files. In this tutorial, we are going to use the tm and XML packages. The snippet above relies on XML::readHTMLTable to download the list of available CRAN packages, so if you do not have the XML package installed, install it with the following command:

install.packages("XML")

To see the supported text file formats, we can use the getReaders function of the tm package.

> getReaders()
 [1] "readDataframe" "readDOC"
 [3] "readPDF" "readPlain"
 [5] "readRCV1" "readRCV1asPlain"
 [7] "readReut21578XML" "readReut21578XMLasPlain"
 [9] "readTagged" "readXML"

At the time of writing this book, the snippet downloaded 12,658 package names and their short descriptions. A new term to get familiar with is corpus, which is basically a collection of text documents that we can include in the analysis. We can use the getSources function to see the available options for importing a corpus with the tm package.

> library(tm)
Loading required package: NLP
> getSources()
[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

Building a corpus from the package descriptions downloaded above can be done with any of these sources; since the descriptions sit in a plain character vector, we can go ahead and use VectorSource.

> v <- Corpus(VectorSource(res$V2))
> v
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 12658

This step created an in-memory corpus (a SimpleCorpus) that currently holds 12,658 package descriptions. We can use the inspect and head functions to view its contents; to see the first five documents in the corpus, we can run the following command:

> inspect(head(v, 5))
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5
[1] Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
[2] Access to Abbyy Optical Character Recognition (OCR) API
[3] Tools for Approximate Bayesian Computation (ABC)
[4] Data Only: Tools for Approximate Bayesian Computation (ABC)
[5] Array Based CpG Region Analysis Pipeline
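
Beyond inspect, single documents can be pulled out of the corpus with the [[ operator, and as.character returns their raw text regardless of whether the corpus is stored as a SimpleCorpus or a VCorpus:

## Extract the raw text of individual documents
as.character(v[[1]])
## or the first few descriptions at once
vapply(1:3, function(i) as.character(v[[i]]), character(1))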