GitHub - yesufeng/tedtalks: Data analysis on the key features of different types of TED talks

Topic Modeling on TED Talks: talk recommendation and rating prediction

This is a data analysis of more than 1800 TED talks (with transcripts). TED has published rich data on its talks (transcripts, speaker information, talk topics, etc.) through its API. In addition, TED asks its audience to rate talks using 14 rating words (e.g., Beautiful, Convincing, Inspiring, Long-winded).

This project analyzes the topic distribution of the more than 1800 TED talks that had been put online by May 2015. Similar talks can be recommended based on similarities in talk topics and speaker backgrounds. It is also interesting to look at correlations between the topic distribution of a talk and the ratings it received. Combined with other aspects of the talks (e.g., delivery skills), these insights can not only help predict the ratings of a newly published TED talk but, more importantly, serve as guidelines for speakers preparing a tailored TED talk.

Summary of the IPython Notebooks

Topic modeling: the topics of the more than 1500 talks in the training set were analyzed using both LSI and LDA. A summary visualization of the topic distribution among these talks is also available.
- Latent Semantic Indexing (LSI, Ted_9_topic_modeling_LSI)
- Latent Dirichlet Allocation (LDA, Ted_9_topic_modeling_LDA)
- Visualization of topic-word distribution (LDA, Ted_12_LDA_topic_distribution)
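
As a rough illustration of the LSI step, the sketch below builds a TF-IDF matrix and projects documents onto the top singular directions with a plain truncated SVD. It is not the code from the notebooks: the toy documents, the function names (tfidf_matrix, lsi, cosine) and the use of NumPy are all assumptions made for this example, and LDA is not covered here.

```python
import numpy as np

# Toy stand-ins for tokenized transcripts (hypothetical data, not from the repository).
docs = [
    "brain neuron memory brain science",
    "neuron brain learning memory",
    "climate ocean carbon energy",
    "energy carbon climate solar",
]

def tfidf_matrix(texts):
    """Return (matrix, vocabulary): one row per document, one column per term."""
    tokenized = [t.split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(texts), len(vocab)))
    for row, doc in enumerate(tokenized):
        for w in doc:
            counts[row, index[w]] += 1
    doc_freq = (counts > 0).sum(axis=0)
    # The +1 keeps terms that occur in every document from being zeroed out.
    idf = np.log(len(texts) / doc_freq) + 1.0
    return counts * idf, vocab

def lsi(matrix, num_topics):
    """Project documents onto the top `num_topics` singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    doc_topics = u[:, :num_topics] * s[:num_topics]  # document coordinates in topic space
    topic_terms = vt[:num_topics]                    # term loadings for each topic
    return doc_topics, topic_terms

def cosine(a, b):
    """Cosine similarity between two topic-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

matrix, vocab = tfidf_matrix(docs)
doc_topics, topic_terms = lsi(matrix, num_topics=2)
```

In this toy corpus the two neuroscience documents end up close together in topic space and far from the two climate documents, which is the property the recommendation step relies on.
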
Talk rating prediction/recommendation based on topic similarity:
- Construct a K-nearest neighbors model and tune hyperparameters using cross-validation (Ted_10_predict_ratings)
- Predict the ratings received by the test-set talks using the topic similarity model above (Ted_11_rating prediction of the test set)
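
The idea of a K-nearest neighbors model over topic vectors, with k chosen by cross-validation, can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the topic vectors and labels are made up, cosine similarity and majority voting are choices made for this example, and the helper names (nearest_talks, predict_rating_class, loo_accuracy) are hypothetical rather than taken from the notebooks.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_talks(query, vectors, k):
    """Indices of the k talks whose topic vectors are most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

def predict_rating_class(query, vectors, labels, k=3):
    """Majority vote over the rating classes of the k most similar talks."""
    votes = Counter(labels[i] for i in nearest_talks(query, vectors, k))
    return votes.most_common(1)[0][0]

def loo_accuracy(vectors, labels, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict_rating_class(vectors[i], rest_vectors, rest_labels, k) == labels[i]:
            correct += 1
    return correct / len(vectors)

# Made-up topic distributions and rating classes for six training talks.
train_vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.7, 0.1, 0.2],
    [0.1, 0.9, 0.0], [0.0, 0.8, 0.2], [0.2, 0.7, 0.1],
]
train_labels = ["rational", "rational", "rational", "funny", "funny", "funny"]

# Pick k by cross-validated accuracy, then classify a new talk.
best_k = max((1, 3, 5), key=lambda k: loo_accuracy(train_vectors, train_labels, k))
prediction = predict_rating_class([0.85, 0.1, 0.05], train_vectors, train_labels, k=best_k)
```

The same nearest_talks helper doubles as a simple recommender: the neighbors of a talk are the talks to suggest next.
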
Other aspects of a talk (talk length, talking speed, use of interjections, etc.) and how they affect the ratings talks received:
- Talks were clustered by the rating words they received, which turns rating prediction into a classification problem. Five classes (emotional, rational, stunning, negative and funny) are derived from the rating words. For example, talks in the "rational" class receive more "persuasive" and "informative" ratings than the other talks (Ted_3_response)
- Summary of discoveries from fitting these characteristics of talks against their rating class (Ted_7_How to make a tailored TED talk?)
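
Delivery features such as talk length, talking speed and interjection use could be derived from a transcript roughly as sketched below. The interjection list, the feature names and the delivery_features helper are assumptions made for this example; the notebooks may define these features differently.

```python
import re

# Hypothetical interjection list; the notebooks may use a different one.
INTERJECTIONS = {"um", "uh", "wow", "oh", "hmm", "ah"}

def delivery_features(transcript, duration_seconds):
    """Compute simple delivery features from a transcript and the talk duration."""
    words = re.findall(r"[a-z']+", transcript.lower())
    minutes = duration_seconds / 60.0
    n_interjections = sum(1 for w in words if w in INTERJECTIONS)
    return {
        "word_count": len(words),
        # Talking speed in words per minute (0.0 if the duration is missing).
        "words_per_minute": len(words) / minutes if minutes > 0 else 0.0,
        # Share of all words that are interjections.
        "interjection_rate": n_interjections / len(words) if words else 0.0,
    }

sample = "Wow, so, um, this is a talk. Oh, and it is short."
features = delivery_features(sample, duration_seconds=30)
```

Features like these can then be fitted against the rating class of each talk to see which delivery traits go with which kind of audience response.
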
Data processing notebooks:
- Raw text features from the transcripts of the training set and test set (Ted_2_NLP_caption and Ted_2_NLP_test_caption)
- Features constructed from the title, description and tags of the training-set and test-set talks (Ted_5_NLP_others and Ted_5_NLP_test_others)

The main data files are:

response5.csv (training set response) and test_response.csv (test set response)
captions_f3.json (training set caption data) and captions_test_f3.json (test set caption data)
train3.json and test3.json contain the other talk information for the training and test sets, respectively
talks_other_text.json and validset_other_text.json contain the text data (description, title, tags) for the training and test sets, respectively
The ./data folder stores all topic modeling results from the LSI models
The ./r2py folder stores all topic modeling results from the LDA models

Folders and files (history: 26 commits):

.ipynb_checkpoints
data
r2py
.Rhistory
README.md
TED talk detailed analysis.ipynb
Ted_10_predict_ratings.ipynb
Ted_11_rating prediction of the test set.ipynb
Ted_12 LDA topic visualization.ipynb
Ted_1_raw.ipynb
Ted_2_NLP_caption.ipynb
Ted_2_NLP_test_caption.ipynb
Ted_3_response.ipynb
Ted_3_response_visualization.ipynb
Ted_4_Feature testing.ipynb
Ted_5_NLP_others.ipynb
Ted_5_NLP_test_others.ipynb
Ted_6_Final.ipynb
Ted_7_How to make a tailored TED talk?.ipynb
Ted_9_topic_modeling_LDA.ipynb
Ted_9_topic_modeling_LSI.ipynb
captions_f.json
captions_f3.json
captions_test_f.json
captions_test_f3.json
logit_cf.R
response5.csv
talk_df_small.json
talks_other_text.json
tedrating.png
test.json
test2.json
test3.json
test_response.csv
train.json
train2.json
train3.json
training_feature.csv
training_label.csv
validset_other_text.json
