[BugFix] fix problems in data split (dmlc#3082)
* [BugFix] fix problems in data split

* fix format problems in docstring

* modify statistics to fit in dgl nature

Co-authored-by: Quan (Andy) Gan <[email protected]>
Co-authored-by: zhjwy9343 <[email protected]>
3 people authored Jul 7, 2021
1 parent 0d1dcdc commit bb89dee
Showing 2 changed files with 65 additions and 48 deletions.
54 changes: 32 additions & 22 deletions python/dgl/data/fakenews.py
@@ -19,15 +19,19 @@ class FakeNewsDataset(DGLBuiltinDataset):
  the root node represents the news, the leaf nodes are Twitter users
  who retweeted the root news. Besides, the node features are encoded
  user historical tweets using different pretrained language models:
- bert: the 768-dimensional node feature composed of Twitter user
- historical tweets encoded by the bert-as-service
- content: the 310-dimensional node feature composed of a
- 300-dimensional “spacy” vector plus a 10-dimensional
- “profile” vector
- profile: the 10-dimensional node feature composed of ten Twitter
- user profile attributes.
- spacy: the 300-dimensional node feature composed of Twitter user
- historical tweets encoded by the spaCy word2vec encoder.
+ - bert: the 768-dimensional node feature composed of Twitter user
+ historical tweets encoded by the bert-as-service
+ - content: the 310-dimensional node feature composed of a
+ 300-dimensional “spacy” vector plus a 10-dimensional
+ “profile” vector
+ - profile: the 10-dimensional node feature composed of ten Twitter
+ user profile attributes.
+ - spacy: the 300-dimensional node feature composed of Twitter user
+ historical tweets encoded by the spaCy word2vec encoder.
  Note: this dataset is for academic use only, and commercial use is prohibited.
@@ -39,27 +43,33 @@ class FakeNewsDataset(DGLBuiltinDataset):
  - Nodes: 41,054
  - Edges: 40,740
  - Classes:
- Fake: 157
- Real: 157
+ - Fake: 157
+ - Real: 157
  - Node feature size:
- bert: 768
- content: 310
- profile: 10
- spacy: 300
+ - bert: 768
+ - content: 310
+ - profile: 10
+ - spacy: 300
  Gossipcop:
- - Graphs: 5464
+ - Graphs: 5,464
  - Nodes: 314,262
  - Edges: 308,798
  - Classes:
- Fake: 2732
- Real: 2732
+ - Fake: 2,732
+ - Real: 2,732
  - Node feature size:
- bert: 768
- content: 310
- profile: 10
- spacy: 300
+ - bert: 768
+ - content: 310
+ - profile: 10
+ - spacy: 300
  Parameters
  ----------
59 changes: 33 additions & 26 deletions python/dgl/data/fraud.py
@@ -177,15 +177,15 @@ def _random_split(self, x, seed=717, train_size=0.7, val_size=0.1):
  "must between 0 and 1 (inclusive)."

  N = x.shape[0]
- index = list(range(N))
+ index = np.arange(N)
  if self.name == 'amazon':
  # 0-3304 are unlabeled nodes
- index = list(range(3305, N))
+ index = np.arange(3305, N)

- np.random.RandomState(seed).permutation(index)
- train_idx = index[:int(train_size * N)]
- val_idx = index[int(N - val_size * N):]
- test_idx = index[int(train_size * N):int(N - val_size * N)]
+ index = np.random.RandomState(seed).permutation(index)
+ train_idx = index[:int(train_size * len(index))]
+ val_idx = index[len(index) - int(val_size * len(index)):]
+ test_idx = index[int(train_size * len(index)):len(index) - int(val_size * len(index))]
  train_mask = np.zeros(N, dtype=np.bool)
  val_mask = np.zeros(N, dtype=np.bool)
  test_mask = np.zeros(N, dtype=np.bool)
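
The split fix in this hunk can be sketched in isolation. The following is a minimal illustration, not the DGL implementation: `random_split` and its `first_labeled` parameter are hypothetical names introduced here, and `dtype=bool` stands in for `np.bool`, which recent NumPy releases no longer provide. It shows the two corrections: the result of `permutation` is assigned (the call returns a shuffled copy rather than shuffling in place), and slice boundaries are computed from `len(index)` rather than `N`, so skipping the unlabeled Amazon nodes no longer skews the split sizes.

```python
import numpy as np


def random_split(num_nodes, first_labeled=0, seed=717,
                 train_size=0.7, val_size=0.1):
    """Hypothetical standalone version of the corrected split logic.

    Nodes with ids below `first_labeled` are treated as unlabeled and
    are excluded from every mask.
    """
    index = np.arange(first_labeled, num_nodes)
    # permutation() returns a shuffled copy, so the result must be kept.
    index = np.random.RandomState(seed).permutation(index)

    # Boundaries are based on the number of labeled nodes, not num_nodes.
    n = len(index)
    n_train = int(train_size * n)
    n_val = int(val_size * n)

    splits = {
        "train": index[:n_train],
        "test": index[n_train:n - n_val],
        "val": index[n - n_val:],
    }

    # Masks still span all nodes so they can be stored as node data.
    masks = {}
    for name, idx in splits.items():
        mask = np.zeros(num_nodes, dtype=bool)
        mask[idx] = True
        masks[name] = mask
    return masks


# Amazon-like sizes taken from the docstring in this commit:
# 11,944 nodes, of which ids 0-3304 are unlabeled.
masks = random_split(11944, first_labeled=3305)
```

With these sizes the three masks are disjoint, together cover exactly the 8,639 labeled nodes, and never select an unlabeled node.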
@@ -202,9 +202,9 @@ class FraudYelpDataset(FraudDataset):
  The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended
  (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary
- classification task. 32 handcrafted features from
- <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews
- are nodes in the graph, and three relations are:
+ classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>
+ are taken as the raw node features. Reviews are nodes in the graph, and three relations are:
  1. R-U-R: it connects reviews posted by the same user
  2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
  3. R-T-R: it connects two reviews under the same product posted in the same month.
@@ -213,13 +213,16 @@ class FraudYelpDataset(FraudDataset):
  - Nodes: 45,954
  - Edges:
- R-U-R: 49,315
- R-T-R: 573,616
- R-S-R: 3,402,743
- ALL: 3,846,979
+ - R-U-R: 98,630
+ - R-T-R: 1,147,232
+ - R-S-R: 6,805,486
  - Classes:
- Positive (spam): 6,677
- Negative (legitimate): 39,277
+ - Positive (spam): 6,677
+ - Negative (legitimate): 39,277
  - Positive-Negative ratio: 1 : 5.9
  - Node feature size: 32
@@ -269,23 +272,27 @@ class FraudAmazonDataset(FraudDataset):
  the raw node features .
  Users are nodes in the graph, and three relations are:
- 1. U-P-U : it connects users reviewing at least one same product
- 2. U-S-U : it connects users having at least one same star rating within one week
- 3. U-V-U : it connects users with top 5% mutual review text similarities (measured by
- TF-IDF) among all users.
+ 1. U-P-U : it connects users reviewing at least one same product
+ 2. U-S-U : it connects users having at least one same star rating within one week
+ 3. U-V-U : it connects users with top 5% mutual review text similarities (measured by
+ TF-IDF) among all users.
  Statistics:
  - Nodes: 11,944
  - Edges:
- U-P-U: 175,608
- U-S-U: 3,566,479
- U-V-U: 1,036,737
- ALL: 4,398,392
+ - U-P-U: 351,216
+ - U-S-U: 7,132,958
+ - U-V-U: 2,073,474
  - Classes:
- Positive (fraudulent): 821
- Negative (benign): 11,123
- - Positive-Negative ratio: 1 : 13.5
+ - Positive (fraudulent): 821
+ - Negative (benign): 7,818
+ - Unlabeled: 3,305
+ - Positive-Negative ratio: 1 : 10.5
  - Node feature size: 25
  Parameters