Title: Home Depot Product Search Relevance

Group members:

Kewen Zhang,
Pengfei Wang,
Xiaoci Xing,
Ziyue Wu

Quick Summary

Predict the relevance score of a search query for a Home Depot product, based on the product description and attributes.

Glance at the training dataset

Dataset Summary:

74,067 observations.
Relevance ranges from 1 (less related) to 3 (highly relevant).

Data Cleaning

Fixed typos and fused words: "helloWorld" -> "hello World"
Removed stop words such as "the" and "and"
Replaced synonyms with a common term
Removed insignificant punctuation: "#", ".", ","
Normalized plurals: "feet" -> "foot"
Reduced words to their stems

Before	After
BEHR Premium Textured DeckOver 1-gal. #SC-141 Tugboat Wood and Concrete Coating	behr premium textur deck 1-ga sc-141 tugboat wood concret coat
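The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word set and the plural map are tiny illustrative stand-ins, the function names are hypothetical, and synonym replacement and stemming are left out (a stemmer from a library such as NLTK could be applied to each token at the end).

```python
import re

# Illustrative subsets only; the real lists used in the project are not shown in this README.
STOP_WORDS = {"the", "and", "a", "an", "of", "for", "with", "in", "to"}
IRREGULAR_PLURALS = {"feet": "foot"}


def split_fused_words(text):
    """Insert a space at each lowercase-to-uppercase boundary ("helloWorld" -> "hello World")."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


def clean(text):
    """Apply the cleaning steps in order and return a normalized string."""
    text = split_fused_words(text)
    text = text.lower()
    # Replace the punctuation marks named above with spaces.
    text = re.sub(r"[#.,]", " ", text)
    # Map irregular plurals to their singular form.
    tokens = [IRREGULAR_PLURALS.get(tok, tok) for tok in text.split()]
    # Drop stop words.
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)
```

For example, `clean("helloWorld")` returns `"hello world"`.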

Exploration

relevance distribution.

word clouds:

Search term:

Product title:

Product Description:

Feature Extraction

1. Distance

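We used the Jaccard coefficient

JaccardCoef(A, B) = |A ∩ B| / |A ∪ B|

where A and B are the sets of terms in the two texts being compared, to measure the overlap between the "Search term" and each of "Product Description" and "Product title". A higher coefficient means the two texts are closer.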
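As a small illustration of the formula, the coefficient can be computed on word sets like this. The function name is hypothetical, and returning 0.0 when both texts are empty is an assumption made here to avoid division by zero.

```python
def jaccard(text_a, text_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the sets of words in two strings."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    if not union:
        # Assumption: two empty texts are treated as having no overlap.
        return 0.0
    return len(set_a & set_b) / len(union)
```

For a search term `"deck coat"` and a title `"wood deck coat"`, the intersection has 2 words and the union has 3, so the coefficient is 2/3.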

N-grams, n selection

By comparing the variance of each extracted feature for n-grams with n from 1 to 10, we obtained the following analysis:
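One way such a comparison could be set up is sketched below. It assumes word-level n-grams (the README does not say whether word or character n-grams were used) and uses the population variance of the Jaccard feature over a list of (search term, text) pairs; all names are hypothetical.

```python
from statistics import pvariance


def ngrams(text, n):
    """Set of word-level n-grams in a string (empty if the text has fewer than n words)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngrams(text_a, text_b, n):
    """Jaccard coefficient computed over n-gram sets instead of single words."""
    grams_a, grams_b = ngrams(text_a, n), ngrams(text_b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def variance_by_n(pairs, max_n=10):
    """Variance of the n-gram Jaccard feature across all pairs, for each n in 1..max_n."""
    return {
        n: pvariance([jaccard_ngrams(a, b, n) for a, b in pairs])
        for n in range(1, max_n + 1)
    }
```

A value of n whose feature has almost no variance across the dataset carries little information for the regressor, which is one reason to compare variances when choosing n.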

2. Cosine Similarity with tf-idf (term frequency - inverse document frequency)

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document. The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

For the n-gram selection with tf-idf, we obtained a result similar to that for the Jaccard coefficient:

We then used cosine similarity to measure the similarity between the "Search term" and each of "Product Description" and "Product title". Cosine similarity is a measure of similarity between two vectors of an inner product space, given by the cosine of the angle between them.
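The sketch below shows tf-idf weighting followed by cosine similarity, using only the standard library. It assumes raw term counts for tf and the unsmoothed idf log(N / df); the project may have used a library vectorizer with a different idf variant, and the function names here are hypothetical.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Inverse document frequency log(N / df) for every word in a list of documents."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector (dict) using raw counts as tf; unknown words are ignored."""
    counts = Counter(text.split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this idf definition, a word that appears in every document gets weight 0, so very common words contribute nothing to the similarity.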

3. Customized features

Since most items have attributes in the data, we chose the three most common attributes among products: brand, color and material.

It is reasonable to compare the search term against these three product attributes. Here we used the same Jaccard coefficient method, with n = 1, since many items have only a single word in their color and material descriptions.
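A rough sketch of these attribute features is given below. The attribute keys (`"brand"`, `"color"`, `"material"`) and the treatment of a missing attribute as an empty string are assumptions for illustration; the actual attribute names in the dataset may differ.

```python
def jaccard(text_a, text_b):
    """Jaccard coefficient over the sets of single words (n = 1) in two strings."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def attribute_features(search_term, attributes):
    """One Jaccard feature per attribute; a missing attribute is compared as ''."""
    return {
        name: jaccard(search_term, attributes.get(name, ""))
        for name in ("brand", "color", "material")
    }
```

A product that lacks an attribute simply gets a score of 0 for that feature, so every row has the same three columns.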

4. Combined features (distance + tf-idf + customized features)

Combined the feature groups above into a single feature set.

Regressor

General Linear Model
- collinearity me
Machine Learning Regressor
- Cross Validation
Neural Network

Layers: 4
Nodes: 50 per layer
Steps: 5000

Conclusion

We applied 5-fold cross-validation to each regressor to generate the following table:

5-fold cross-validated RMSE comparison:

Feature\Regressor	Linear Regression	Ridge Regression	RF	XGB
Customized	0.5329	0.5329	0.5322	0.5323
Count	0.5211	0.5211	0.5143	0.5154
Distance	0.5133	0.5132	0.5100	0.5232
Tf-idf	0.5006	0.5006	0.4968	0.4983
All	0.4944	0.4929	0.4829	0.4829
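For reference, a cross-validated RMSE like the ones in the table can be computed along the following lines. This is a standard-library sketch, not the project's pipeline: it uses a one-feature least-squares fit as a stand-in regressor and assigns rows to folds by index without shuffling, both of which are assumptions made to keep the example short.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_simple_linear(xs, ys):
    """Least-squares line through (xs, ys); returns a function that predicts y from x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_rmse(xs, ys, fit, k=5):
    """Average held-out RMSE over k folds (row i goes to fold i % k, no shuffling)."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        scores.append(rmse(test_y, [model(x) for x in test_x]))
    return sum(scores) / k
```

Any regressor can be plugged in through the `fit` argument, as long as it takes training data and returns a prediction function.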

Future work

Find better features
Increase the number of training steps for the neural network regressor
