-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathREADME.Rmd
396 lines (224 loc) · 11.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message=FALSE,
warning=FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%")
options(tibble.print_min = 5, tibble.print_max = 5)
```
# bulkreadr <a><img src='man/figures/logo.png' align="right" height="138.5" /></a>
> The Ultimate Tool for Reading Data in Bulk
<!-- badges: start -->
[](https://github.com/gbganalyst/bulkreadr/actions/workflows/R-CMD-check.yaml)
[](https://cran.r-project.org/package=bulkreadr)
[](https://cran.r-project.org/package=bulkreadr)
[](https://app.codecov.io/gh/gbganalyst/bulkreadr?branch=main)
<!-- badges: end -->
## About the package
`bulkreadr` is an R package designed to simplify and streamline the process of reading and processing large volumes of data. With a collection of functions tailored for bulk data operations, the package allows users to efficiently read multiple sheets from Microsoft Excel/Google Sheets workbooks and multiple CSV files from a directory. It returns the data as organized data frames, making it convenient for further analysis and manipulation.
Whether dealing with extensive data sets or batch processing tasks, "bulkreadr" empowers users to effortlessly handle data in bulk, saving time and effort in data preparation workflows.
Additionally, the package seamlessly works with labelled data from SPSS and Stata. For a quick video tutorial, I gave a talk at the International Association of Statistical Computing webinar. The recorded session is available [here](https://isi-web.org/webinar/iasc-bulkreadr-ultimate-tool-reading-data-bulk) and the webinar resources [here](https://github.com/gbganalyst/bulkreadr-webinar).
## Installation
You can install `bulkreadr` package from [CRAN](https://cran.r-project.org/) with:
```{r eval=FALSE}
install.packages("bulkreadr")
```
or the development version from [GitHub](https://github.com/) with
``` r
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/bulkreadr")
```
## How to load the package
Now that you have installed `bulkreadr` package, you can simply load it by using:
```{r pkgload}
library(bulkreadr)
library(dplyr)
```
## Functions in bulkreadr package
This section provides a concise overview of the different functions available in the `bulkreadr` package. These functions serve various purposes and are designed to handle importing of data in bulk.
- [`read_excel_workbook()`](#read_excel_workbook)
- [`read_excel_files_from_dir()`](#read_csv_files_from_dir)
- [`read_csv_files_from_dir()`](#read_csv_files_from_dir)
- [`read_gsheets()`](#read_gsheets)
- [`read_spss_data()`](#read_spss_data)
- [`read_stata_data()`](#read_stata_data)
## Other functions in `bulkreadr` package:
- [`generate_dictionary()`](#generate_dictionary)
- [`look_for()`](#look_for)
- [`pull_out()`](#pull_out)
- [`convert_to_date()`](#convert_to_date)
- [`inspect_na()`](#inspect_na)
- [`fill_missing_values()`](#fill_missing_values)
**Note:**
> For the majority of functions within this package, we will utilize data stored in the system file by the `bulkreadr`, which can be accessed using the `system.file()` function. If you wish to utilize your own data stored in your local directory, please ensure that you have set the appropriate file path prior to using any functions provided by the bulkreadr package.
## `read_excel_workbook()`
`read_excel_workbook()` reads all the data from the sheets of an Excel workbook and return an appended dataframe.
```{r example1}
# path to the xls/xlsx file.
path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE)
# read the sheets
read_excel_workbook(path = path)
```
## `read_excel_files_from_dir()`
`read_excel_files_from_dir()` reads all Excel workbooks in the `"~/data"` directory and returns an appended dataframe.
```{r example1a}
# path to the directory containing the xls/xlsx files.
directory <- system.file("xlsxfolder", package = "bulkreadr")
# import the workbooks
read_excel_files_from_dir(dir_path = directory)
```
## `read_csv_files_from_dir()`
`read_csv_files_from_dir()` reads all csv files from the `"~/data"` directory and returns an appended dataframe. The resulting dataframe will be in the same order as the CSV files in the directory.
```{r example2}
# path to the directory containing the CSV files.
directory <- system.file("csvfolder", package = "bulkreadr")
# import the csv files
read_csv_files_from_dir(dir_path = directory)
```
## `read_gsheets()`
The `read_gsheets()` function imports data from multiple sheets in a Google Sheets spreadsheet and appends the resulting dataframes from each sheet together to create a single dataframe. This function is a powerful tool for data analysis, as it allows you to easily combine data from multiple sheets into a single dataset.
```{r, include=FALSE}
googlesheets4::gs4_deauth()
```
```{r example3}
# Google Sheet ID or the link to the sheet
sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"
# read all the sheets
read_gsheets(ss = sheet_id)
```
## `read_spss_data()`
`read_spss_data()` is designed to seamlessly import data from an SPSS data (`.sav` or `.zsav`) files. It converts labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment.
```{r spssdata1}
# Read an SPSS data file without converting variable labels as column names
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
data <- read_spss_data(file = file_path)
data
```
```{r spssdata2}
# Read an SPSS data file and convert variable labels as column names
data <- read_spss_data(file = file_path, label = TRUE)
data
```
## read_stata_data()
`read_stata_data()` reads Stata data file (`.dta`) into an R data frame, converting labeled variables into factors.
**Read the Stata data file without converting variable labels as column names**
```{r statadata1}
file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")
data <- read_stata_data(file = file_path)
data
```
**Read the Stata data file and convert variable labels as column names**
```{r statadata2}
data <- read_stata_data(file = file_path, label = TRUE)
data
```
## `generate_dictionary()`
`generate_dictionary()` creates a data dictionary from a specified data frame. This function is particularly useful for understanding and documenting the structure of your dataset, similar to data dictionaries in Stata or SPSS.
```{r}
# Creating a data dictionary from an SPSS file
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
wage_data <- read_spss_data(file = file_path)
generate_dictionary(wage_data)
```
## `look_for()`
The `look_for()` function is designed to emulate the functionality of the Stata `lookfor` command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest.
```{r}
# Look for a single keyword.
look_for(wage_data, "south")
```
## `pull_out()`
`pull_out()` is similar to `[`. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (`%>%`) and base(`|>`) operators.
```{r example4}
top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")
top_10_richest_nig %>%
pull_out(c(1, 5, 2))
```
```{r}
top_10_richest_nig %>%
pull_out(-c(1, 5, 2))
```
## `convert_to_date()`
`convert_to_date()` parses an input vector into POSIXct date-time object. It is also powerful to convert from excel date number like `42370` into date value like `2016-01-01`.
```{r example 5}
## ** heterogeneous dates **
dates <- c(
44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
"13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
"11 07 2023", "2023-4"
)
# Convert to POSIXct or Date object
convert_to_date(dates)
# It can also convert date time object to date object
convert_to_date(lubridate::now())
```
```{r example5}
# With dataframe
file_path <- system.file("extdata", "OGD.xlsx", package = "bulkreadr")
ogd_data <- read_excel_workbook(path = file_path)
ogd_data %>% head()
# Convert to POSIXct or Date object
modified_ogd_data <- ogd_data %>%
mutate(Date_format = convert_to_date(Date))
modified_ogd_data %>% head()
```
## `inspect_na()`
`inspect_na()` summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.
```{r example 6a}
# dataframe summary
inspect_na(airquality)
# grouped dataframe summary
airquality %>%
group_by(Month) %>%
inspect_na()
```
## `fill_missing_values()`
`fill_missing_values()` in an efficient function that addresses missing values in a dataframe. It uses imputation by function, meaning it replaces missing data in numeric variables with either the mean or the median, and in non-numeric variables with the mode. The function takes a column-based imputation approach, ensuring that replacement values are derived from the respective columns, resulting in accurate and consistent data. This method enhances the integrity of the dataset and promotes sound decision-making and analysis in data processing workflows.
```{r example 6}
df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
NA, "virginica", "setosa"
)
)
df
# Using mean to fill missing values for numeric variables
result_df_mean <- fill_missing_values(df, use_mean = TRUE)
result_df_mean
# Using median to fill missing values for numeric variables
result_df_median <- fill_missing_values(df, use_mean = FALSE)
result_df_median
```
### Impute missing values (NAs) in a grouped data frame
You can use the `fill_missing_values()` in a grouped data frame by using other grouping and map functions. Here is an example of how to do this:
```{r}
sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
"virginica", "virginica", "setosa")
)
sample_iris
sample_iris %>%
group_by(Species) %>%
group_split() %>%
map_df(fill_missing_values)
```
## Context
bulkreadr draws on and complements / emulates other packages such as readxl, readr, and googlesheets4 to read bulk data in R.
* [readxl](https://readxl.tidyverse.org) is the tidyverse package for reading Excel files (xls or xlsx) into an R data frame.
* [readr](https://readr.tidyverse.org) is the tidyverse package for reading delimited files (e.g., csv or tsv) into an R data frame.
* [googlesheets4](https://cran.r-project.org/package=googlesheets) is the package to interact with Google Sheets through the Sheets API v4 <https://developers.google.com/sheets/api>.