Check out my Data Science Internship Experience here
Project: Using scikit-learn, I modeled an Airbnb dataset to estimate listing prices for guests based on features such as neighborhood, zip code, and apartment type.
The Jupyter notebook in this repo contains the code to run Exploratory Data Analysis and Regression estimators on the Inside Airbnb listings dataset for Denver.
The target variable is the price of the listing.
Using this dataset, I tried to answer questions such as:
- What are the most important characteristics of a listing in Denver, and how do they influence the price?
- Which neighborhoods in Denver have the highest rental prices?
- What distinguishes hosts that have Superhost status? Do all Superhosts actually meet the criteria that Airbnb has set for them?
- Does reducing the dimensionality of the dataset lead to a loss of information?
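As an illustration of the neighborhood question above, here is a minimal sketch that ranks neighborhoods by median listing price. It uses only the Python standard library rather than the notebook's actual code, and the `neighborhood` and `price` keys, the function name, and the toy rows are all assumptions made for this example, not values from the Inside Airbnb data.

```python
from collections import defaultdict
from statistics import median


def median_price_by_neighborhood(listings):
    """Return (neighborhood, median price) pairs, highest median first."""
    prices = defaultdict(list)
    for row in listings:
        # "neighborhood" and "price" are hypothetical column names.
        prices[row["neighborhood"]].append(row["price"])
    ranked = [(name, median(values)) for name, values in prices.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy rows invented for illustration; not taken from the real dataset.
toy_listings = [
    {"neighborhood": "A", "price": 100.0},
    {"neighborhood": "A", "price": 140.0},
    {"neighbourhood": "unused", "neighborhood": "B", "price": 90.0},
    {"neighborhood": "B", "price": 300.0},
    {"neighborhood": "B", "price": 110.0},
]

ranking = median_price_by_neighborhood(toy_listings)
```

The median is used instead of the mean so that a single very expensive listing (the 300.0 row in neighborhood B) does not push that neighborhood to the top of the ranking.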
Home Depot Product Search Relevance
This is a challenge to predict the search relevance of search results on homedepot.com. More than 73% of the products in the dataset were unique items, which presented a challenge in training the model. This dataset required text cleaning and feature extraction.
I used NLTK, a natural language processing library, to derive word stems from the product titles, descriptions, and search terms. I then created features based on cosine distance, shared words, edit distance, and the lengths of the search query, product title, and description. I used scikit-learn models to predict the relevance scores and evaluated them by root mean squared error (RMSE).
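The features and metric described above can be sketched roughly as follows. This is a simplified standard-library version, not the project's code: it tokenizes by whitespace instead of stemming with NLTK, computes cosine distance on raw word counts, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (the project stemmed words with NLTK instead)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def shared_words(query, text):
    """Number of distinct words that appear in both strings."""
    return len(set(tokens(query)) & set(tokens(text)))


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def relevance_features(query, title):
    """Build one feature row for a (search query, product title) pair."""
    return {
        "cosine_distance": 1.0 - cosine_similarity(query, title),
        "shared_words": shared_words(query, title),
        "edit_distance": edit_distance(query.lower(), title.lower()),
        "query_length": len(tokens(query)),
    }


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


features = relevance_features("angle bracket", "Steel Angle Bracket")
```

A row like `features` would be built for each query and product pair and passed to a regressor, with `rmse` comparing predicted and labeled relevance scores.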
Google Analytics Customer Revenue (R)
Dataset: Google Analytics data from the Google Merchandise Store website. It is visit-level data, including userid, time, geo_info, pageviews, hits, referrer, ad_click, ...
Link for the competition and dataset (here)
Objective: Predict the total purchase a user has made during the visits in the test set.
The customer traffic dataset was analyzed and pre-processed in RStudio to predict the natural log of the sum of all transactions per user. I used ensemble learning techniques from H2O.ai, an open-source machine learning platform, to train on the processed data, which considerably lowered the prediction error.
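To make the prediction target concrete, here is a small illustrative sketch in Python (the project itself was done in R, and this is not its code). It sums revenue per user across visits and takes the natural log. Because the log of zero is undefined for users who bought nothing, the sketch assumes the target is ln(1 + total); that offset, the function name, and the toy visits are assumptions for illustration.

```python
import math
from collections import defaultdict


def log_revenue_per_user(visits):
    """Map each user id to ln(1 + total revenue across that user's visits)."""
    totals = defaultdict(float)
    for user_id, revenue in visits:
        totals[user_id] += revenue
    # log1p(x) computes ln(1 + x); the +1 is an assumed offset for zero-revenue users.
    return {user_id: math.log1p(total) for user_id, total in totals.items()}


# Toy (user id, visit revenue) pairs invented for illustration.
toy_visits = [("u1", 0.0), ("u1", 20.0), ("u2", 0.0)]
targets = log_revenue_per_user(toy_visits)
```

With this definition, a user with no purchases gets a target of exactly zero, and the log compresses the long tail of large spenders.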
Please note that this is my code for the competition before its relaunch in early November. (A data leak was identified in late October, so everything about the competition was modified, including the rules, dataset, and prediction objective.)
Overall, it was a good learning experience. I have been working with Google Analytics data at work lately to understand user behavior, so it was great to have a chance to try predicting sales from that web data.
Check out my Tableau Public profile for some of my Tableau projects.
- AirBnb Price Prediction
- Home Depot Product Search Relevance
- Mobile Phones Price Classification
- Google Analytics Customer Revenue (R)
- Web Scraping Projects - YouTube and Gmail
- Neural Networks - Projects
Page template forked from evanca