# Exploring the game sales data set
This chapter explores a video game sales data set. It covers game sales in North America, Europe, Japan and other regions, which together make up the global sales. The data also include the critic score, the user score, and the number of critics or users who gave each score. The data were downloaded from [https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings#Video_Games_Sales_as_at_22_Dec_2016.csv](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings#Video_Games_Sales_as_at_22_Dec_2016.csv).
The variables are as follows:
Name: Name of the game
Platform: Console on which the game runs
Year_of_Release: Year in which the game was released
Genre: Game's category
Publisher: Publisher
NA_Sales: Game sales in North America (in millions of units)
EU_Sales: Game sales in the European Union (in millions of units)
JP_Sales: Game sales in Japan (in millions of units)
Other_Sales: Game sales in the rest of the world, i.e. Africa, Asia excluding Japan, Australia, Europe excluding the E.U. and South America (in millions of units)
Global_Sales: Total sales in the world (in millions of units)
Critic_Score: Aggregate score compiled by Metacritic staff
Critic_Count: Number of critics used in coming up with the Critic_Score
User_Score: Score by Metacritic's subscribers
User_Count: Number of users who gave the User_Score
Developer: Party responsible for creating the game
Rating: The ESRB ratings:
E for "Everyone"; E10+ for "Everyone 10+"; T for "Teen"; M for "Mature"; AO for "Adults Only"; RP for "Rating Pending"; K-A for kids to adults.
After downloading the data, we first replace N/A with NA in Excel, save the file as CSV and read it into R. We then remove rows with missing values and observations with an empty Rating string, which leaves 6825 observations in our data set "game". We notice that there are many zero sales values. To prepare the sales variables for a log transformation, which brings them closer to a normal distribution, we convert the sales into single units and add 1. We also divide the critic score by 10 to match the scale of the user score.
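
The effect of the shift by one unit can be seen in a minimal sketch. The vector `sales_millions` is a made-up illustration, not data taken from the file.

```r
# Hypothetical sales figures in millions of units; zeros are common in the data
sales_millions <- c(0, 0.01, 0.5, 82.53)

# log(0) is -Inf, which breaks histograms, correlations and linear models
raw_log <- log(sales_millions)

# Convert to single units and add 1 so that zero sales map to log(1) = 0
shifted_log <- log(sales_millions * 1e6 + 1)

raw_log[1]      # -Inf
shifted_log[1]  # 0
```

Base R's `log1p(x)` computes `log(1 + x)` directly and could be used in place of the explicit `+ 1`.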
## Reading in and managing the data
```{r message=FALSE}
tem <- read.csv("datasets/video-game-sales-at-22-Dec-2016.csv")
tem <- na.omit(tem) #remove NA
library(dplyr)
game <- tem %>% filter(Rating != "") %>% droplevels() #remove empty rating observations
# multiplying by 1,000,000 converts the sales from millions of units to single units;
# adding 1 makes every value positive so that the log can be taken later
game$Year_of_Release <- as.factor(as.character(game$Year_of_Release))
game$NA_Sales <- game$NA_Sales * 1000000 + 1
game$EU_Sales <- game$EU_Sales * 1000000 + 1
game$JP_Sales <- game$JP_Sales * 1000000 + 1
game$Other_Sales <- game$Other_Sales * 1000000 + 1
game$Global_Sales <- game$Global_Sales * 1000000 + 1
# divide by 10 to put Critic Score on the same scale as User Score
game$Critic_Score <- as.numeric(as.character(game$Critic_Score)) / 10
game$User_Score <- as.numeric(as.character(game$User_Score))
game$Critic_Count <- as.numeric(game$Critic_Count)
game$User_Count <- as.numeric(game$User_Count)
# format column names
colnames(game) <- c("Name", "Platform", "Year.Release", "Genre", "Publisher", "NA.Sales", "EU.Sales", "JP.Sales", "Other.Sales", "Global.Sales", "Critic.Score", "Critic.Count", "User.Score", "User.Count", "Developer", "Rating")
head(game)
str(game)
summary(game)
```
The summary of these variables tells us that some games were published under the same name; PS2 is the most popular platform; Action is the most popular genre; Electronic Arts is the most frequent publisher; and T and E are the two most common ratings. For the sales variables, the minimums, quartiles and medians are small but the maximums are high, which means that a few games sold extremely well. The very large maximum of User.Count indicates that a few games were scored by a very large number of users.
Our pre-analysis shows that these variables are not normally distributed, especially the sales and score count variables. We therefore take logs to transform these variables.
```{r}
NA.Sales.Log <- log(game$NA.Sales)
EU.Sales.Log <- log(game$EU.Sales)
JP.Sales.Log <- log(game$JP.Sales)
Other.Sales.Log <- log(game$Other.Sales)
Global.Sales.Log <- log(game$Global.Sales)
Critic.Count.Log <- log(game$Critic.Count)
User.Count.Log <- log(game$User.Count)
```
Then we combine the log variables with the original variables.
```{r}
game.log <- cbind.data.frame(NA.Sales.Log, EU.Sales.Log, JP.Sales.Log, Other.Sales.Log,
Global.Sales.Log, Critic.Count.Log, User.Count.Log)
game <- cbind.data.frame(game, game.log) # the data we use for analysis
head(game)
str(game)
```
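
The two steps above (computing the logs, then binding them to the data) can also be written as a single assignment. This is only a sketch on a toy data frame; `toy` and `cols` are illustrative names, not objects from the chapter.

```r
# Toy stand-in for the game data frame (values are made up)
toy <- data.frame(NA.Sales = c(1, 10001, 500001),
                  EU.Sales = c(1, 20001, 300001))

# Columns to transform; in the chapter these would be the sales and count columns
cols <- c("NA.Sales", "EU.Sales")

# Add one ".Log" column per source column in a single assignment
toy[paste0(cols, ".Log")] <- lapply(toy[cols], log)

names(toy)
```

The new columns keep the same naming pattern as in the chapter, so later code that refers to `NA.Sales.Log` and similar names would not need to change.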
Now we plot a histogram and a Q-Q plot for each variable of the transformed data set.
```{r message=FALSE, fig.width=8, fig.height=10}
name <- colnames(game)[c(11, 13, 17:23)] # pick up the numeric columns according to the names
par(mfrow = c(5, 4)) # layout in 5 rows and 4 columns
for (i in 1:length(name)){
sub <- sample(game[name[i]][, 1], 5000)
submean <- mean(sub)
hist(sub, main = paste("Hist. of", name[i], sep = " "), xlab = name[i])
abline(v = submean, col = "blue", lwd = 1)
qqnorm(sub, main = paste("Q-Q Plot of", name[i], sep = " "))
qqline(sub)
if (i == 1) {s.t <- shapiro.test(sub)
} else {s.t <- rbind(s.t, shapiro.test(sub))
}
}
s.t <- s.t[, 1:2] # take first two column of shapiro.test result
s.t <- cbind(name, s.t) # add variable name for the result
s.t
```
From the histograms and Q-Q plots we can see that the two scores, the log values of their counts, and the log of global sales are close to a normal distribution. The Shapiro-Wilk test still rejects normality for these variables, but we assume they are approximately normally distributed in our analysis.
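
With several thousand observations, the Shapiro-Wilk test can reject normality even for small departures, so it helps to read the W statistic next to the p-value. The sketch below collects both into a data frame. It uses simulated values rather than the game data, and the names `vars` and `normality` are illustrative.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-ins for two columns (not the real data)
vars <- list(
  Normal.Like = rnorm(5000),  # drawn from a normal distribution
  Skewed      = rexp(5000)    # clearly right-skewed
)

# Run the Shapiro-Wilk test on each variable and keep W and the p-value
normality <- data.frame(
  variable = names(vars),
  W        = sapply(vars, function(x) unname(shapiro.test(x)$statistic)),
  p.value  = sapply(vars, function(x) shapiro.test(x)$p.value),
  row.names = NULL
)
normality
```

A W value close to 1 indicates a shape close to normal, whatever the p-value says at this sample size.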
There are many points of interest in this data set, such as the distribution of global and regional sales and their relationship, the correlation between critic score and user score and their counts, whether these scores are the main drivers of sales, and the effect of other factors such as genre, rating, platform and even publisher on sales. First let's visualize the data.
## Visualization of categorical variables
To simplify the platform analysis, we regroup Platform into Platform.type.
```{r}
#regroup platform as Platform.type
pc <- c("PC")
xbox <- c("X360", "XB", "XOne")
nintendo <- c("Wii", "WiiU", "N64", "GC", "NES", "3DS", "DS")
playstation <- c("PS", "PS2", "PS3", "PS4", "PSP", "PSV")
game <- game %>%
mutate(Platform.type = ifelse(Platform %in% pc, "PC",
ifelse(Platform %in% xbox, "Xbox",
ifelse(Platform %in% nintendo, "Nintendo",
ifelse(Platform %in% playstation, "Playstation", "Others")))))
```
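
The nested `ifelse()` calls above can also be expressed as a lookup table, which may be easier to extend. This is a sketch of an alternative, not the method used in the chapter; `platform_family` and the example vector `platform` are hypothetical names.

```r
# Named lookup vector: platform code -> platform family
platform_family <- c(
  PC = "PC",
  X360 = "Xbox", XB = "Xbox", XOne = "Xbox",
  Wii = "Nintendo", WiiU = "Nintendo", N64 = "Nintendo", GC = "Nintendo",
  NES = "Nintendo", `3DS` = "Nintendo", DS = "Nintendo",
  PS = "Playstation", PS2 = "Playstation", PS3 = "Playstation",
  PS4 = "Playstation", PSP = "Playstation", PSV = "Playstation"
)

# Hypothetical platform column; "DC" is left out of the lookup on purpose
platform <- c("PS2", "Wii", "PC", "X360", "DC")

# as.character() guards against factor input; unknown codes come back as NA
platform_type <- unname(platform_family[as.character(platform)])
platform_type[is.na(platform_type)] <- "Others"
platform_type
```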
(ref:game-PlatformType) Bar plot of platform type
```{r game-PlatformType, message=FALSE, fig.cap='(ref:game-PlatformType)', fig.align='center'}
library(ggplot2)
ggplot(game, aes(x = Platform.type)) + geom_bar(fill = "blue")
```
As the bar plot shows, Playstation is the biggest group, followed by Xbox and Nintendo, while Others is the smallest.
```{r}
dat <- data.frame(table(game$Genre))
dat$fraction = dat$Freq / sum(dat$Freq)
dat = dat[order(dat$fraction), ]
dat$ymax = cumsum(dat$fraction)
dat$ymin = c(0, head(dat$ymax, n = -1))
names(dat)[1] <- "Genre"
library(ggplot2)
ggplot(dat, aes(fill = Genre, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
geom_rect(colour = "grey30") + # background color
coord_polar(theta = "y") + # coordinate system to polar
xlim(c(0, 4)) +
labs(title = "Ring plot for Genre", fill = "Genre") +
theme(plot.title = element_text(hjust = 0.5))
```
Action, Sports and Shooter are the three biggest genres. Action accounts for almost 25% of the games, and the three
together make up about half of them. Puzzle, Adventure and Strategy have relatively few games.
We regroup the ratings AO, RP and K-A as "Others" because there are only a few observations with these ratings.
```{r}
#regroup Rating as Rating.type
rating <- c("E", "T", "M", "E10+")
game <- game %>% mutate(Rating.type = ifelse(Rating %in% rating, as.character(Rating), "Others"))
```
```{r}
counts <- sort(table(game$Rating.type), decreasing = TRUE)
names(counts)[1] <- "T - Teen" # rename the names of counts for detail information
names(counts)[2] <- "E - Everyone"
names(counts)[3] <- "M - Mature"
names(counts)[4] <- "E10+ - Everyone 10+"
pct <- paste(round(counts/sum(counts) * 100), "%", sep = " ")
lbls <- paste(names(counts), "\n", pct, sep = " ") # labels with percentages
pie(counts, labels = lbls, col = rainbow(length(lbls)),
main="Pie Chart of Ratings with percentages")
```
In decreasing order, the most common ratings are T, E, M and E10+. The other ratings account for only a tiny share of all games.
(ref:game-mosaic) Mosaic plot between platform type and rating type
```{r game-mosaic, fig.cap='(ref:game-mosaic)', fig.align='center'}
library(ggmosaic)
library(plotly)
p <- ggplot(game) +
geom_mosaic(aes(x = product(Rating.type), fill = Platform.type), na.rm=TRUE) +
labs(x="Rating.type", y = "Platform Type", title="Mosaic Plot") +
theme(axis.text.y = element_blank())
ggplotly(p)
```
Across all platform and rating combinations, Playstation has the most releases for every rating type except Everyone 10+. Nintendo has the most games rated Everyone 10+ and is the second most common platform for games rated Everyone. Xbox is the second most common platform for Mature and Teen games, and the third for Everyone and Everyone 10+. Most games on other platforms are rated Everyone.
```{exercise}
Download the game sales data set and clean it in a similar way to that described at the beginning of this chapter. Produce a mosaic plot between genre and rating. Interpret your plot briefly.
```
## Correlation among numeric variables
(ref:game-corrplot) Corrplot among numeric variables
```{r game-corrplot, fig.width=8, fig.height=8, fig.cap='(ref:game-corrplot)', fig.align='center'}
st <- game[, c(11, 13, 17:23)] # take numeric variables as goal matrix
st <- na.omit(st)
library(ellipse) # install.packages("ellipse")
library(corrplot)
corMatrix <- cor(as.matrix(st)) # correlation matrix
col <- colorRampPalette(c("#7F0000", "red", "#FF7F00", "yellow", "#7FFF7F",
"cyan", "#007FFF", "blue", "#00007F"))
corrplot.mixed(corMatrix, order = "AOE", lower = "number", lower.col = "black",
number.cex = .8, upper = "ellipse", upper.col = col(10),
diag = "u", tl.pos = "lt", tl.col = "black")
```
The log of Global.Sales is strongly correlated with the logs of the regional sales, with r values of 0.75, 0.65, 0.52 and 0.42, so we will use Global.Sales.Log as the target variable when we analyze the relationship between sales and other variables later. The regional sales are also positively correlated with each other. User.Score is positively correlated with Critic.Score, with r of 0.58. There is little correlation between User.Count.Log and User.Score.
(ref:game-dendrogram) Cluster dendrogram for numeric variables
```{r game-dendrogram, fig.cap='(ref:game-dendrogram)', fig.align='center'}
plot(hclust(as.dist(1 - cor(as.matrix(st))))) # hierarchical clustering
```
The log values of all sales except JP.Sales.Log form one cluster, the scores form a second cluster, and the log values of the counts together with JP.Sales.Log form a third. Within the sales cluster, Other.Sales.Log is the closest to Global.Sales.Log, followed by NA.Sales.Log and then EU.Sales.Log.
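
As a rough sketch of how such a dendrogram is built, the example below turns correlations into distances with `1 - r` and cuts the resulting tree into groups. The data are simulated and the variable names are made up for illustration.

```r
set.seed(42)  # reproducible toy data
n <- 200

# Two made-up "sales" variables that move together, and two "score" variables
# that move together but are unrelated to the sales pair
base_sales <- rnorm(n)
base_score <- rnorm(n)
toy <- data.frame(
  Sales.A = base_sales + rnorm(n, sd = 0.1),
  Sales.B = base_sales + rnorm(n, sd = 0.1),
  Score.A = base_score + rnorm(n, sd = 0.1),
  Score.B = base_score + rnorm(n, sd = 0.1)
)

# 1 - r is small for strongly correlated variables and near 1 for unrelated ones
d <- as.dist(1 - cor(toy))

# Hierarchical clustering on that distance, then cut the tree into two groups
hc <- hclust(d)
groups <- cutree(hc, k = 2)
groups
```

Variables that end up with the same group number sit on the same branch of the dendrogram.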
## Analysis of score and count
(ref:game-score) Scatter and density plot for critic score and user score
```{r game-score, message=FALSE, warning=FALSE, fig.width=10, fig.height=3, fig.cap='(ref:game-score)', fig.align='center'}
library(ggpmisc) #package for function stat_poly_eq
formula <- y ~ x
p1 <- ggplot(game, aes(x = User.Score, y = Critic.Score)) +
geom_point(aes(color = Platform), alpha = .8) +
geom_smooth(method = 'lm', se = FALSE, formula = formula) + #add regression line
theme(legend.position = "none") +
stat_poly_eq(formula = formula, #add regression equation and R square value
eq.with.lhs = "italic(hat(y))~`=`~", # add ^ on y
aes(label = paste(..eq.label.., ..rr.label.., sep = "*plain(\",\")~")),
label.x.npc = "left", label.y.npc = 0.9, # position of the equation label
parse = TRUE)
p2 <- ggplot() +
geom_density(data = game, aes(x = Critic.Score), color = "darkblue", fill = "lightblue") +
geom_density(data = game, aes(x = User.Score), color = "darkgreen", fill = "lightgreen", alpha=.5) +
labs(x = "Critic.Score-blue, User.Score-green") +
theme(plot.title = element_text(hjust = 0.5))
library(gridExtra)
grid.arrange(p1, p2, nrow = 1, ncol = 2)
```
There is a positive correlation between Critic.Score and User.Score. Overall, the critic score is lower than the user score.
```{r}
t.test(game$Critic.Score, game$User.Score)
```
The t-test gives a p-value much smaller than 0.05, so we reject the null hypothesis of equal means at the 5% significance level: there is a significant difference between the means of the critic score and the user score. The mean critic score is 7.03 and the mean user score is 7.19.
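
By default `t.test(x, y)` runs a Welch two-sample test, which treats the two vectors as independent samples. Since every game carries both a critic and a user score, a paired test is a possible alternative. That is our assumption about what may be appropriate, not part of the analysis above. The sketch uses made-up scores for six games.

```r
# Made-up scores for six games, both on a 0-10 scale
critic <- c(7.0, 6.5, 8.0, 5.5, 9.0, 7.5)
user   <- critic + c(0.2, 0.1, 0.3, 0.2, 0.1, 0.3)

# Welch two-sample test: treats the two score vectors as independent samples
welch <- t.test(critic, user)

# Paired test: compares the two scores game by game
paired <- t.test(critic, user, paired = TRUE)

c(welch = welch$p.value, paired = paired$p.value)
```

In this toy example the small but consistent per-game difference is picked up only by the paired test, because the variation between games hides it in the two-sample comparison.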
(ref:game-count) Binhex plot for critic count and user count
```{r game-count, fig.cap='(ref:game-count)', fig.align='center'}
p1 <- ggplot(game, aes(x = Critic.Count.Log, y = Critic.Score)) +
stat_binhex() + # Bin 2d plane into hexagons
scale_fill_gradientn(colours = c("black", "red"),
name = "Frequency") # Adding a custom continuous color palette
p2 <- ggplot(game, aes(x = User.Count.Log, y = User.Score)) +
stat_binhex() +
scale_fill_gradientn(colours = c("black", "red"), name = "Frequency") # color legend
grid.arrange(p1, p2, nrow = 1, ncol = 2)
```
Critic.Score is moderately correlated with Critic.Count.Log, with an r value of 0.41 in the correlation analysis above, although this does not mean that Critic.Count.Log has a direct effect on Critic.Score. User.Score, in contrast, appears to be independent of User.Count.Log.
```{exercise}
Use the ggplot2 package to draw a scatter plot with a smooth line between Global_Sales and NA_Sales. Explain in plain sentences what you find in the plot.
```
```{exercise}
Use density plots of Global_Sales, NA_Sales, EU_Sales, JP_Sales and Other_Sales to illustrate the relationship among these sales. Interpret your plot.
```
## Analysis of sales
### By Year.Release
(ref:game-SalesYear) Total sales by year
```{r game-SalesYear, fig.width=8, fig.height=4, fig.cap='(ref:game-SalesYear)', fig.align='center'}
Year.Release <- game$Year.Release
counts <- data.frame(table(Year.Release))
p <- game %>%
select(Year.Release, Global.Sales) %>%
group_by(Year.Release) %>%
summarise(Total.Sales = sum(Global.Sales))
q <- cbind.data.frame(p, counts[2])
names(q)[3] <- "count"
q$count <- as.numeric(q$count)
ggplot(q, aes(x = Year.Release, y = Total.Sales, label = q$count)) +
geom_col(fill = "green") +
geom_point(y = q$count * 500000, size = 3, shape = 21, fill = "Yellow" ) +
geom_text(y = (q$count + 50) * 500000) + # position of the text: count of games each year
theme(axis.text.x = element_text(angle = 90),
panel.background = element_rect(fill = "purple"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_x_discrete("Year.Release", labels = as.character(Year.Release), breaks = Year.Release) +
labs(title = "Global Sales Each Year", x = "Year Release", y = "Global Sales")
```
The bar chart of total sales shows that there were very few sales before 1996, with only one game released in each of those years. Between 1996 and 2000 the sales and the number of games grew slowly. After that, both total sales and the number of released games climbed steeply. Sales peaked in 2008, which is also the year with the most releases. After that, both total sales and the number of games declined.
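
The yearly totals and game counts behind this plot are combined by position in the chunk above. A sketch of an alternative is to join them on the year, so that the two cannot get out of step. The data frame `toy` is made up for illustration.

```r
# Toy stand-in for the game data (values are made up)
toy <- data.frame(
  Year.Release = c("2006", "2006", "2007", "2008", "2008", "2008"),
  Global.Sales = c(5, 3, 4, 2, 6, 1)
)

# Total sales per release year
totals <- aggregate(Global.Sales ~ Year.Release, data = toy, FUN = sum)

# Number of games per release year
counts <- aggregate(Global.Sales ~ Year.Release, data = toy, FUN = length)
names(counts)[2] <- "count"

# Merge on the year so totals and counts are matched by key, not by row order
yearly <- merge(totals, counts, by = "Year.Release")
yearly
```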
### By Region
(ref:game-SalesRegion) Total sales by region
```{r game-SalesRegion, fig.width=9, fig.height=4, fig.cap='(ref:game-SalesRegion)', fig.align='center'}
library(reshape2)
game %>%
select(Year.Release, NA.Sales.Log, EU.Sales.Log, JP.Sales.Log,
Other.Sales.Log, Global.Sales.Log) %>%
melt(id.vars = "Year.Release") %>%
group_by(Year.Release, variable) %>%
summarise(total.sales = sum(value)) %>%
ggplot(aes(x = Year.Release, y = total.sales, color = variable, group = variable)) +
geom_point() + geom_line() +
labs(title = "Regional Global Sales Log Distribution Each Year",
x = "Year Release", y = "Total Sales Log Value", color = "Region") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90),
panel.background = element_rect(fill="pink"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
```
The yearly patterns of the log sales are similar for the global market, North America, Europe and Others, while Japan is quite different. This agrees with the cluster analysis.
### By Rating
(ref:game-SalesRating) Sales by rating type
```{r game-SalesRating, fig.cap='(ref:game-SalesRating)', fig.align='center'}
game$Rating.type <- as.factor(game$Rating.type)
x <- game[, c(6:10)]
matplot(t(x), type = "l", col = rainbow(5)[game$Rating.type])
legend("center", levels(game$Rating.type), fill = rainbow(5), cex = 0.8, pt.cex = 1)
text(c(1.2, 2, 3, 3.9, 4.8), 80000000, colnames(x))
```
The figure shows one E (Everyone) game, sold mainly in North America and Europe, with global sales of over 80 million units, of which North America contributed about half. Checking the data shows that it is Wii Sports, released in 2006. We also notice that Mature games (green) are popular in North America, which contributes a large part of their global sales; Everyone games (red) sell well in Europe; and Japanese players favor Teen (purple) and Everyone (red) games. The ratings are balanced in the "other" region.
### By Genre
(ref:game-SalesGenre) Year wise log global sales by Genre
```{r game-SalesGenre, fig.cap='(ref:game-SalesGenre)', fig.align='center'}
game %>%
select(Year.Release, Global.Sales.Log, Genre) %>%
group_by(Year.Release, Genre) %>%
summarise(Total.Sales.Log = sum(Global.Sales.Log)) %>%
ggplot(aes(x = Year.Release, y = Total.Sales.Log, group = Genre, fill = Genre)) +
geom_area() +
theme(legend.position = "right", axis.text.x = element_text(angle = 90),
panel.background = element_rect(fill = "blue"),
panel.grid.major = element_blank(),
panel.grid.minor=element_blank()) +
theme(plot.title = element_text(hjust = 0.5))
```
The figure shows that the golden years for games were 2007 to 2009, when the total log sales exceeded 7000 in each year. Action and Sports stay at the top for almost all of those 20 years, taking the biggest share of the total global log sales. Adventure, Puzzle and Strategy are at the bottom of the list.
### By Score
(ref:game-SalesScore) Global sales by critic and user score
```{r game-SalesScore, warning=FALSE, message=FALSE, fig.width=12, fig.height=4, fig.cap='(ref:game-SalesScore)', fig.align='center'}
p1 <- ggplot(game, aes(x = Critic.Score, y = Global.Sales.Log)) +
geom_point(aes(color = Genre)) +
geom_smooth()
p2 <- ggplot(game, aes(x = User.Score, y = Global.Sales.Log)) +
geom_point(aes(color = Rating)) +
geom_smooth()
grid.arrange(p1, p2, nrow = 1,ncol = 2)
```
Regardless of Genre and Rating, the higher the score, the higher Global.Sales.Log. In particular, for Critic.Score above 9, Global.Sales rises steeply. Global.Sales rises only slowly with User.Score.
```{r }
game$Name <- gsub("Brain Age: Train Your Brain in Minutes a Day", #shorten the game name
"Brain Age: Train Your Brain", game$Name)
p1 <- game %>%
select(Name, User.Score, Critic.Score, Global.Sales) %>%
group_by(Name) %>%
summarise(Total.Sales = sum(Global.Sales), Avg.User.Score = mean(User.Score),
Avg.Critic.Score = mean(Critic.Score)) %>%
arrange(desc(Total.Sales)) %>%
head(20)
ggplot(p1, aes(x = factor(Name, levels = Name))) +
geom_bar(aes(y = Total.Sales/10000000), stat = "identity", fill = "green") +
geom_line(aes(y = Avg.User.Score, group = 1, colour = "Avg.User.Score"), size = 1.5) +
geom_point( aes(y = Avg.User.Score), size = 3, shape = 21, fill = "Yellow" ) +
geom_line(aes(y = Avg.Critic.Score, group = 1, colour = "Avg.Critic.Score"), size = 1.5) +
geom_point(aes(y = Avg.Critic.Score), size = 3, shape = 21, fill = "white") +
theme(axis.text.x = element_text(angle = 90, size = 8)) +
labs(title = "Top Global Sales Game with Score", x = "Name of the top games" ) +
theme(plot.title = element_text(hjust = 0.5))
```
Among these top 20 games, the first two, Wii Sports and Grand Theft Auto V, sold much better than the others. For most of these games, the average critic score is higher than the average user score, which agrees with our density plot in Figure \@ref(fig:game-score). Two Call of Duty games received much lower average user scores than the other top-selling games.
### By Rating & Genre & Critic score
(ref:game-Sales) Total sales for genre and rating with critic score
```{r game-Sales, fig.width=12, fig.height=8, fig.cap='(ref:game-Sales)', fig.align='center'}
p1 <- game %>%
select(Rating.type, Global.Sales, Genre, Critic.Score) %>%
group_by(Rating.type, Genre) %>%
summarise(Total.Sales = sum(Global.Sales)/10^8, Avg.Score = mean(Critic.Score))
p2 <- p1 %>% group_by(Genre) %>% summarise(Avg.Critic.Score = mean(Avg.Score))
ggplot() +
geom_bar(data = p1,
aes(x = Genre, y = Total.Sales, fill = Rating.type), stat = "Identity", position = "dodge") +
geom_line(data = p2,
aes(x = Genre, y = Avg.Critic.Score, group = 1, color = "Avg.Critic.Score"), size = 2) +
geom_point(data = p2,
aes(x = Genre, y = Avg.Critic.Score, shape = "Avg.Critic.Score"), size = 3, color = "Blue") +
scale_colour_manual("Score", breaks = "Avg.Critic.Score", values = "yellow") +
scale_shape_manual("Score", values = 19) +
theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5),
legend.position="bottom",
panel.background = element_rect(fill = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
```
For the combination of genre, rating and global sales, Sports games rated Everyone are so popular that they rank first in global sales. Mature-rated games contribute a large share of the global sales of both Action and Shooter games. On average, these three top-selling genres have relatively high critic scores. Fighting, Adventure and Racing games have relatively low average critic scores. The figure also shows that Adventure, Puzzle and Strategy games sell less than the other genres.
### By Platform
(ref:game-SalesPlatform) Yearly market share by platform type
```{r game-SalesPlatform, fig.cap='(ref:game-SalesPlatform)', fig.align='center'}
library(viridis)
library(scales)
p <- game %>%
group_by(Platform.type, Year.Release) %>%
summarise(total = sum(Global.Sales))
p$Year.Release. <- as.numeric(as.character(p$Year.Release))
ggplot(p, aes(x = Year.Release., fill = Platform.type)) +
geom_density(position = "fill") +
labs(y = "Market Share") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_viridis(discrete = TRUE) +
scale_y_continuous(labels = percent_format())
```
Nintendo and Xbox appeared after 1990. Before that, PC and Playstation made up the game market, with PC as the main platform. After 1995, the shares of PC and Playstation shrank, while Nintendo and Xbox grew fast and took a larger share of the market than Playstation and PC. Alongside Nintendo and Xbox, other game platforms emerged in the early 1990s, but they lasted for about 20 years and then disappeared. From around 2010, the shares of the four remaining platform types have stayed relatively even and stable.
```{r}
#compute 1-way ANOVA test for log value of global sales by Platform Type
model <- aov(Global.Sales.Log ~ Platform.type, data = game)
summary(model)
tukey <- TukeyHSD(model)
par(mar = c(4, 10, 2, 1))
plot(tukey, las = 1)
```
The ANOVA test shows that at least one platform type is significantly different from the others. In detail, the plot of the Tukey tests tells us that all pairs of platform types differ significantly except Xbox and Nintendo, and Others and Nintendo.
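
The same conclusion can be read from the numeric Tukey table instead of the plot. The sketch below uses made-up data with three hypothetical groups `A`, `B` and `C`; only `aov()` and `TukeyHSD()` are taken from the chapter.

```r
# Made-up log sales for three platform types; group C is clearly higher
toy <- data.frame(
  Platform.type = factor(rep(c("A", "B", "C"), each = 4)),
  Sales.Log = c(1, 2, 3, 2,  1.1, 2.1, 2.9, 2,  10, 11, 12, 11)
)

# One-way ANOVA followed by Tukey's honest significant difference test
fit <- aov(Sales.Log ~ Platform.type, data = toy)
tk <- TukeyHSD(fit)$Platform.type

# Each row is one pair of groups; "p adj" is the adjusted p-value, and
# an interval [lwr, upr] that contains 0 means no significant difference
tk[, c("diff", "lwr", "upr", "p adj")]
```

In the plot produced by `plot(tukey)`, the pairs whose interval crosses the vertical zero line are the ones that do not differ significantly.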
(ref:game-PlatformRating) Global sales log by platform and rating type
```{r game-PlatformRating, fig.width=10, fig.height=4, fig.cap='(ref:game-PlatformRating)', fig.align='center'}
game$Platform.type <- as.factor(game$Platform.type)
ggplot(game, aes(x = Platform.type, y = Global.Sales.Log, fill = Rating.type)) +
geom_boxplot()
```
Overall, PC has lower global sales log values than the other platform types, while Playstation and Xbox have higher sales medians across the rating types. Games rated Everyone sold well on all platform types, while Mature games sold better on PC, Playstation and Xbox.
(ref:game-PlatformScore) Global sales log by critic score for different platform type and genre
```{r game-PlatformScore, fig.width=10, fig.height=6, fig.cap='(ref:game-PlatformScore)', fig.align='center'}
ggplot(game, aes(Critic.Score, Global.Sales.Log, color = Platform.type)) +
geom_point() + facet_wrap(~ Genre) +
theme(plot.title = element_text(hjust = 0.5))
```
Most genre panels in Figure \@ref(fig:game-PlatformScore) show a positive correlation between Global.Sales.Log and Critic.Score: the higher the critic score, the higher the global sales log value. Most puzzle games are from Nintendo, while many strategy games are on PC. For the other genres, the platforms share the market relatively evenly. Many PC games (green) have a lower market share across genres, while some Nintendo (red) games in sports, racing, platform and misc sold really well. At the same time, Playstation action and racing games, and Xbox misc, action and shooter games show high global sales log values too.
## Effect of platform type on principal components
```{r}
st <- game[, c(11, 13, 17:23)]
pca <- prcomp(st, scale. = TRUE) # scale. = TRUE standardizes the variables
percentVar <- round(100 * summary(pca)$importance[2, 1:3], 0) # compute % variances
```
(ref:game-pca) PCA plot colored with platform type
```{r game-pca, fig.cap='(ref:game-pca)', fig.align='center'}
#head(pca$x) # the new coordinate values for each observation
pcaData <- as.data.frame(pca$x[, 1:2]) #First and Second principal component value
pcaData <- cbind(pcaData, game$Platform.type) #add platform type as third col. for cluster purpose
colnames(pcaData) <- c("PC1", "PC2", "Platform")
ggplot(pcaData, aes(PC1, PC2, color = Platform, shape = Platform)) +
geom_point(size = 0.8) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) + # x label
ylab(paste0("PC2: ", percentVar[2], "% variance")) + # y label
theme(aspect.ratio = 1) # width and height ratio
```
PC, Xbox, Playstation and Nintendo occupy their own regions in the PCA figure, which illustrates that the platform types differ in how they contribute to the variance captured by PC1 and PC2.
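
The percentages on the PCA axes come from the standard deviations of the components. A minimal sketch of that computation follows, on simulated data with made-up values; `toy` and `var_share` are illustrative names.

```r
set.seed(7)  # reproducible toy data
n <- 100

# Made-up numeric variables on different scales
toy <- data.frame(
  Critic.Score = rnorm(n, mean = 7, sd = 1),
  User.Count.Log = rnorm(n, mean = 4, sd = 2),
  Global.Sales.Log = rnorm(n, mean = 13, sd = 1.5)
)

# scale. = TRUE standardises each column so no variable dominates by its units
pca <- prcomp(toy, scale. = TRUE)

# Share of variance per component: squared standard deviations, normalised
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_share, 1)
```

These shares match the "Proportion of Variance" row of `summary(pca)$importance` up to rounding.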
(ref:game-Kmeans) Kmeans PCA figure using ggfortify
```{r game-Kmeans, fig.cap='(ref:game-Kmeans)', fig.align='center'}
library(ggfortify)
set.seed(1)
autoplot(kmeans(st, 3), data = st, label = FALSE, label.size = 0.1)
```
Comparing with the PCA plot in Figure \@ref(fig:game-pca), we find that the first cluster consists mainly of PC games. The second cluster consists mainly of Xbox and Playstation games. Xbox, Nintendo and Playstation together form the third cluster.
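
One way to check statements like these numerically is to cross-tabulate the cluster labels against the platform type. The sketch below does this on simulated data. It also standardises the variables before clustering, which is our assumption and differs from the chunk above, where `kmeans()` is run on the unscaled columns.

```r
set.seed(1)  # k-means starts from random centres

# Made-up data: two numeric variables and a platform label per game
toy <- data.frame(
  Score = c(rnorm(30, 3), rnorm(30, 6), rnorm(30, 9)),
  Sales.Log = c(rnorm(30, 10), rnorm(30, 13), rnorm(30, 16)),
  Platform.type = rep(c("PC", "Xbox", "Nintendo"), each = 30)
)

# Cluster on the standardised numeric columns; nstart reruns with several starts
km <- kmeans(scale(toy[, c("Score", "Sales.Log")]), centers = 3, nstart = 25)

# Rows are clusters, columns are platform types
tab <- table(cluster = km$cluster, platform = toy$Platform.type)
tab
```

A row dominated by one column means that the cluster consists mainly of that platform type.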
## Models for global sales
Because Publisher and Developer have too many levels, and the two are clearly correlated, we keep only the top 12 levels of Publisher and classify the other publishers as "Others". Because of the good correlation between Critic.Score and User.Score, we use only the critic score. We also use only the log of the user score count because of its closer correlation with the global sales log. We do not put the other sales log variables in our model because of their obvious correlation with the global sales log.
```{r}
#re-categorize publisher into 13 groups
Publisher. <- head(names(sort(table(game$Publisher), decreasing = TRUE)), 12)
game <- game %>%
mutate(Publisher.type = ifelse(Publisher %in% Publisher., as.character(Publisher), "Others"))
game.lm <- game[, c(3:4, 11, 21, 23:26)]
#game.log$Genre.type <- as.factor(game.log$Genre.type)
#game.lm$Publisher.type <- as.factor(game.lm$Publisher.type)
```
```{r}
model <- lm(Global.Sales.Log ~ ., data = game.lm)
summary(model)
model <- aov(Global.Sales.Log ~ ., data = game.lm)
summary(model)
```
In the linear model, the global sales log is affected mostly by critic score, user count log, platform type, publisher type and genre. ANOVA shows that every factor contributes to the global sales log, and that critic score and user count log are the most important ones.
Critic.Score and User.Count.Log positively affect the global sales log, while other factors such as platform type and genre either raise or lower the global sales depending on their levels. This model explains the global sales log with an R-squared of 0.57.
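
The R-squared value and the ANOVA table can be pulled out of a fitted model directly. A small sketch on made-up data follows; the data frame `toy` and its values are hypothetical.

```r
# Made-up data: log sales driven by a score and a platform effect
toy <- data.frame(
  Critic.Score = c(5, 6, 7, 8, 9, 5, 6, 7, 8, 9),
  Platform.type = factor(rep(c("PC", "Xbox"), each = 5)),
  Global.Sales.Log = c(10.1, 10.9, 12.2, 12.8, 14.1,
                       11.0, 12.1, 12.9, 14.2, 14.8)
)

fit <- lm(Global.Sales.Log ~ Critic.Score + Platform.type, data = toy)

# Proportion of variance explained, and its penalised version
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared

# Sequential (type I) sums of squares: terms are tested in formula order
anova(fit)
```

Because the sums of squares are sequential, the contribution attributed to each term can change when the order of the terms in the formula changes.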
Because the smooth line in our earlier plot of global sales against critic score is curved, and because critic score contributes strongly in the linear model, we try a polynomial fit using critic score only. The first- and second-degree terms are not statistically significant according to our pre-analysis, so we use only the third- and fourth-degree terms here.
```{r }
model <- lm(Global.Sales.Log ~ I(Critic.Score^3) + I(Critic.Score^4), data = game.lm)
summary(model)
```
Overall, the coefficients are statistically significant, and the model with these two critic score terms alone explains the data with an R-squared of 0.16.
```{r}
ModelFunc <- function(x) {model$coefficients[1] + x^3*model$coefficients[2] + x^4*model$coefficients[3]}
ggplot(data = game.lm, aes(x = Critic.Score, y = Global.Sales.Log)) +
geom_point() + stat_function(fun = ModelFunc, color = 'blue', size = 1)
```
Here is the scatter plot of Global.Sales.Log against Critic.Score, with the model line that predicts the global sales log from the critic score.
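
The model line can also be obtained with `predict()` instead of a hand-written function of the coefficients. The sketch below fits the same kind of model to made-up data and checks that both routes give the same values; `toy`, `grid` and `by_hand` are illustrative names.

```r
# Made-up data with a curved relation between score and log sales
toy <- data.frame(Critic.Score = seq(2, 9.5, by = 0.5))
toy$Global.Sales.Log <- 11 + 0.004 * toy$Critic.Score^3 + 0.1 * sin(toy$Critic.Score)

fit <- lm(Global.Sales.Log ~ I(Critic.Score^3) + I(Critic.Score^4), data = toy)

# predict() applies the fitted formula to new scores, so the coefficients
# never have to be copied into a separate function
grid <- data.frame(Critic.Score = seq(2, 9.5, length.out = 50))
grid$fit <- predict(fit, newdata = grid)

# Same values computed by hand from the coefficients, for comparison
b <- coef(fit)
by_hand <- b[1] + b[2] * grid$Critic.Score^3 + b[3] * grid$Critic.Score^4
all.equal(unname(grid$fit), unname(by_hand))
```

The `grid` data frame can then be drawn as a line layer over the scatter plot, and it stays correct if terms are later added to or removed from the formula.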
```{exercise}
Use different plots to visualize the distribution of NA_Sales by Year_of_Release, Genre, Rating and Platform, individually or in combination. Explain the relationship between NA_Sales and these factors. Hint: Take the log of NA_Sales first. It is better to regroup the platforms first.
```
```{exercise}
What is the correlation between NA_Sales and Critic_Score? Use a scatter plot with a smooth line or a polynomial model line to show the trend. Give your interpretation.
```
```{exercise}
Use a linear model and ANOVA to analyze NA_Sales with all the factors that contribute to its variance. Interpret your result briefly. Hint: Check the corrplot in Figure \@ref(fig:game-corrplot) and pay attention to the high correlation among the sales variables and between the scores.
```
## Conclusion
Global and regional sales are not normally distributed, while their log values are close to a normal distribution. Most regional sales follow a pattern similar to that of global sales.
There is a positive correlation between critic score and user score. Overall, the critic score is lower than the user score. No apparent correlation was found between the user score and its count, while the critic score is moderately correlated with its count.
Critic score, user score count log, genre, rating, platform and publisher together affect the global sales log. Critic score is the most important contributor.