forked from ottsion/JData

JD.com JData Algorithm Competition: High-Potential User Purchase Intent Prediction


danny-design/JData

 
 


load_data.py

  • Load the March and April data and save it to 'data/month_34_all_data.csv':

       load_data(month_3_data_path, month_3_extra_data_path, month_4_data_path,month_34_all_data_path)
    
  • Find abnormal users and return them as the list need_delete_user:

       find_unNormal_user(month_34_all_data_path)
    
  • Remove the abnormal users and save all remaining behavior data to 'data/all_dataSet.csv':

       clean_unNormal_data(all_dataSet_path, month_34_all_data_path)
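
The loading and cleaning steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the repository's implementation: the helper names `concat_csv` and `drop_users` and the `type` column are hypothetical, and the sketch takes the list of abnormal users as given because the README does not say how they are detected.

```python
import csv
import io

def concat_csv(sources, dest):
    """Append the rows of several CSV streams that share a header into dest."""
    writer = None
    for src in sources:
        reader = csv.DictReader(src)
        for row in reader:
            if writer is None:
                # Write the header once, taken from the first non-empty source.
                writer = csv.DictWriter(dest, fieldnames=reader.fieldnames)
                writer.writeheader()
            writer.writerow(row)

def drop_users(rows, need_delete_user):
    """Keep only rows whose user_id is not in the deletion list."""
    banned = set(need_delete_user)
    return [row for row in rows if row["user_id"] not in banned]

# Tiny in-memory example; the column layout is an assumption.
march = io.StringIO("user_id,sku_id,type\n1,10,1\n2,11,4\n")
april = io.StringIO("user_id,sku_id,type\n3,12,1\n")
merged = io.StringIO()
concat_csv([march, april], merged)
merged.seek(0)
kept = drop_users(csv.DictReader(merged), ["2"])
```

With real files, the `io.StringIO` objects would be replaced by files opened with `newline=""`, as the `csv` module recommends.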
    

generate_dataSet.py

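  • Use data from 2016-03-17 to 2016-04-05 to predict whether a user orders a given product between 2016-04-06 and 2016-04-10

  • Use data from 2016-03-22 to 2016-04-10 to predict whether a user orders a given product between 2016-04-11 and 2016-04-15

  • Use data from 2016-03-27 to 2016-04-15 to predict whether a user orders a given product between 2016-04-16 and 2016-04-20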

      all_dataSet_path = 'data/all_dataSet.csv'
      one_dataSet_train_path = 'data/one_dataSet_train.csv'
      one_dataSet_test_path = 'data/one_dataSet_test.csv'
      two_dataSet_train_path = 'data/two_dataSet_train.csv'
      two_dataSet_test_path = 'data/two_dataSet_test.csv'
      three_dataSet_train_path = 'data/three_dataSet_train.csv'
    

    generate_dataSet() slices the date ranges above out of the full table to form each dataset. The slicing function is:

      cut_data_as_time(dataSet_path, new_dataSet_path , begin_day, end_day)
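
The three date ranges listed above follow a regular pattern: a 20-day feature window, followed by a 5-day label window, shifted forward 5 days each time. The sketch below reproduces that pattern and shows one way to slice rows by date. The names `sliding_windows` and `cut_rows_by_time`, and the `time` column, are assumptions for illustration and may differ from what `cut_data_as_time` actually does.

```python
from datetime import date, timedelta

def sliding_windows(first_start, count, feature_days=20, label_days=5, step_days=5):
    """Return (feature_begin, feature_end, label_begin, label_end) tuples, all inclusive."""
    windows = []
    for i in range(count):
        feat_begin = first_start + timedelta(days=i * step_days)
        feat_end = feat_begin + timedelta(days=feature_days - 1)
        label_begin = feat_end + timedelta(days=1)
        label_end = label_begin + timedelta(days=label_days - 1)
        windows.append((feat_begin, feat_end, label_begin, label_end))
    return windows

def cut_rows_by_time(rows, begin_day, end_day):
    """Keep rows whose 'time' field (ISO date prefix) lies in [begin_day, end_day]."""
    return [
        r for r in rows
        if begin_day <= date.fromisoformat(r["time"][:10]) <= end_day
    ]

windows = sliding_windows(date(2016, 3, 17), 3)

# Hypothetical rows: only the first falls inside the first feature window.
rows = [{"time": "2016-03-20 10:00:00"}, {"time": "2016-04-07 09:30:00"}]
first_slice = cut_rows_by_time(rows, windows[0][0], windows[0][1])
```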
    

generate_all_feature.py

train_one_train_feature_path = 'data/train_one_train_feature.csv'
train_two_train_feature_path = 'data/train_two_train_feature.csv'
train_three_train_feature_path = 'data/train_three_train_feature.csv'

generate_all_feature() merges all of the features below and saves the final results to the paths above.

  1. generate_feature.py

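     fetch_feature(sample_filename, feature_filename, item_brand, code)
     - sample_filename: path to the sample data
     - feature_filename: path where the extracted features are written
     - code: tag that gives each sample set its own feature names
     - this step produces 48 feature dimensions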
    
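  2. generate_feature_1.py: deal_with_user_data() processes the user-profile features and deal_with_comment_data() processes the product-review features. The following call adds them to the current feature set: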

      fetch_feature_1(train_feature_path, finnal_feature_data_path)
    
  3. generate_feature_2.py automatically generates features over the most recent 2, 4, 6, and 8 days of the current dataset:

    split_dataSet_and_generate_feature()
    

    The following call adds these features to the current feature set:

    fetch_feature_2(train_feature_path, finnal_feature_data_path, index)
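
As an illustration of recent-window features, the sketch below counts how many actions each (user_id, sku_id) pair has in the last 2, 4, 6, and 8 days of a dataset. The function name `recent_window_counts`, the `time` column, the `cnt_last_Nd` feature names, and the choice of a plain action count are all assumptions; the README does not list which statistics generate_feature_2.py computes.

```python
from collections import Counter
from datetime import date, timedelta

def recent_window_counts(rows, end_day, spans=(2, 4, 6, 8)):
    """Count actions per (user_id, sku_id) in the last k days, for each k in spans.

    rows must be a list, since it is scanned once per span.
    """
    features = {}
    for k in spans:
        begin = end_day - timedelta(days=k - 1)  # window is inclusive on both ends
        counts = Counter(
            (r["user_id"], r["sku_id"])
            for r in rows
            if begin <= date.fromisoformat(r["time"][:10]) <= end_day
        )
        for pair, n in counts.items():
            # Start every pair with zeros so all spans are always present.
            slot = features.setdefault(pair, {f"cnt_last_{j}d": 0 for j in spans})
            slot[f"cnt_last_{k}d"] = n
    return features

rows = [
    {"user_id": "1", "sku_id": "10", "time": "2016-04-05"},
    {"user_id": "1", "sku_id": "10", "time": "2016-04-03"},
    {"user_id": "1", "sku_id": "10", "time": "2016-03-30"},
    {"user_id": "2", "sku_id": "11", "time": "2016-04-01"},
]
features = recent_window_counts(rows, date(2016, 4, 5))
```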
    

combine_feature_dataSet.py

Paths of the previously generated features:

train_one_train_feature_path = 'data/train_one_train_feature.csv'
train_two_train_feature_path = 'data/train_two_train_feature.csv'
train_three_train_feature_path = 'data/train_three_train_feature.csv' 

Source datasets:

one_dataSet_test_path = 'data/one_dataSet_test.csv'
two_dataSet_test_path = 'data/two_dataSet_test.csv'

Output paths after merging:

one_train_dataSet_final_path = 'data/one_train_dataSet_final.csv'
two_train_dataSet_final_path = 'data/two_train_dataSet_final.csv'

The core function is main_combine(), which runs these steps:

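  1. Find the positive samples in the originally split time-window datasets, then label the matching rows of the feature dataset as positive or negative: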

     fetch_sample(test_data_path, feature_data_path, negative_data_path,positive_data_path)
    
  2. There are too many negative samples, so only a subset is drawn as negatives:

     fetch_negative_sample(negative_data_path, new_negative_data_path)
    
  3. Merge the labeled positive and negative samples into the training and test sets used for final testing:

     combine_neg_and_posi(negative_data_path, positive_data_path, train_dataSet_path)
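
The three steps above (label, downsample negatives, merge) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the 10:1 negative-to-positive ratio and fixed random seed are arbitrary choices, since the README does not say how many negatives are kept.

```python
import random

def label_samples(feature_rows, purchased_pairs):
    """Split feature rows into labeled positives and negatives by (user_id, sku_id)."""
    positives, negatives = [], []
    for row in feature_rows:
        pair = (row["user_id"], row["sku_id"])
        labeled = dict(row, label=1 if pair in purchased_pairs else 0)
        (positives if labeled["label"] else negatives).append(labeled)
    return positives, negatives

def downsample_negatives(negatives, positives, ratio=10, seed=0):
    """Randomly keep at most ratio * len(positives) negatives (ratio is an assumption)."""
    keep = min(len(negatives), ratio * len(positives))
    return random.Random(seed).sample(negatives, keep)

def combine(positives, negatives, seed=0):
    """Merge both classes and shuffle so the labels are interleaved."""
    merged = positives + negatives
    random.Random(seed).shuffle(merged)
    return merged

# One purchased pair among 31 candidate pairs.
feature_rows = [{"user_id": str(u), "sku_id": "10"} for u in range(31)]
positives, negatives = label_samples(feature_rows, {("0", "10")})
sampled = downsample_negatives(negatives, positives)
train_set = combine(positives, sampled)
```

Downsampling changes the class balance seen by the model, so predicted probabilities from a model trained this way are not calibrated to the original purchase rate.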
    

ceshiyanzheng.py

  1. Computes the A/B values; the input is in (user_id, sku_id) format:

     evl(two_real_path, two_answer_path)
    
  2. Writes out the answers after testing:

     output_answer(dataSet_path, proba_path, before_answer_path, answer_path)
     - dataSet_path: the training set
     - proba_path: the prediction results
     - before_answer_path: the results normalized to (user_id, sku_id) format
     - answer_path: the final result, with the duplicates in before_answer_path removed
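
The two-stage output described above (normalize to (user_id, sku_id) pairs, then remove duplicates) might look like the sketch below. The function names and the 0.5 probability threshold are assumptions made for illustration; the README does not state how predictions are turned into answers.

```python
def select_pairs(pairs, probabilities, threshold=0.5):
    """Stage 1: keep the (user_id, sku_id) pairs predicted as purchases."""
    return [pair for pair, p in zip(pairs, probabilities) if p >= threshold]

def deduplicate(pairs):
    """Stage 2: drop repeated pairs while preserving first-seen order."""
    return list(dict.fromkeys(pairs))

pairs = [("1", "10"), ("2", "11"), ("1", "10"), ("3", "12")]
probabilities = [0.9, 0.2, 0.7, 0.6]

before_answer = select_pairs(pairs, probabilities)
answer = deduplicate(before_answer)
```

`dict.fromkeys` is used because dictionaries preserve insertion order, which keeps the output stable across runs, unlike converting to a `set`.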
    

## load_data.py ##

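At this point the dataset is generally ready. However, the task requires every predicted product to be in the product set P, so rows for products outside P must be removed from the dataset.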

 clean_data(dataSet_path, after_dataSet_path)
 - dataSet_path: the input feature dataset used for training or prediction, not yet cleaned
 - after_dataSet_path: the cleaned feature dataset that is fed directly to the model
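
A minimal sketch of this filtering step, assuming the product set P is available as a collection of sku_id values and each row carries a `sku_id` field. The helper name `keep_rows_in_p` is hypothetical and is not the repository's `clean_data`.

```python
def keep_rows_in_p(rows, p_sku_ids):
    """Keep only the rows whose sku_id belongs to the product set P."""
    allowed = set(p_sku_ids)  # set lookup keeps the filter linear in len(rows)
    return [r for r in rows if r["sku_id"] in allowed]

rows = [
    {"user_id": "1", "sku_id": "10"},
    {"user_id": "2", "sku_id": "99"},  # not in P, so it is dropped
    {"user_id": "3", "sku_id": "12"},
]
cleaned = keep_rows_in_p(rows, ["10", "12"])
```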

## main_test.py ##

A random forest (RF) model is used for testing here:

clf = model_rf.classify_user_item(train_feature_dataSet, test_feature_dataSet, proba_path)

model_rf.classify(clf, test_feature_dataSet, proba_path)
model_rf.output_answer(feature_path, proba_path, two_before_answer_path, two_answer_path)

ceshiyanzheng.evl(two_real_path, two_answer_path)
