In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.
For this competition, we have provided 11 days worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
- C1 -- anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21 -- anonymized categorical variables
## Loading the data

```python
import pandas as pd
import numpy as np

# The data is too large to keep in memory,
# so we will process it line by line instead.
# data = pd.read_csv("/Users/RahulAgarwal/kaggle_cpr/train")
```
Since the data is too large (around 6 GB), we will proceed by analyzing it line by line. We will try Vowpal Wabbit first of all: it is an online model, it minimizes log loss by default, and it is very fast to run, which should give us a good intuition for how accurate our predictions can be.

- I will use all the variables in the first implementation and we will revisit things as we move on
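For reference, a single converted line in Vowpal Wabbit's input format looks like `label 'tag |namespace features...`. The example below is illustrative (the id and feature values are made up, not taken from the actual data):

```
1 '10000123 |hr 23 |day Thursday |c3 1005 |c4 0 |c5 1fbe01fe
```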
## Creating data in Vowpal Wabbit format (one time only)
```python
from datetime import datetime

def csv_to_vw(loc_csv, loc_output, train=True):
    start = datetime.now()
    print("\nTurning %s into %s. Is_train_set? %s" % (loc_csv, loc_output, train))
    counter = 0
    with open(loc_csv, "r") as infile, open(loc_output, "w") as outfile:
        header_skipped = False
        for line in infile:
            # skip the header row
            if not header_skipped:
                header_skipped = True
                continue
            counter += 1
            line = line.strip().split(",")
            # the data has only categorical features
            categorical_features = ""
            # work on the date column (format YYMMDDHH): take day and hour.
            # It sits at index 2 in train and index 1 in test (no click column).
            a = line[2] if train else line[1]
            new_date = datetime(int("20" + a[0:2]), int(a[2:4]), int(a[4:6]))
            day = new_date.strftime("%A")
            hour = a[6:8]
            categorical_features += " |hr %s" % hour
            categorical_features += " |day %s" % day
            if train:
                # 24 columns in the train data; features start at index 3
                for i in range(3, 24):
                    if line[i] != "":
                        categorical_features += " |c%s %s" % (str(i), line[i])
            else:
                # test is shifted left by one column; keep namespaces aligned
                for i in range(2, 23):
                    if line[i] != "":
                        categorical_features += " |c%s %s" % (str(i + 1), line[i])
            # creating the labels
            if train:  # we care about labels
                if line[1] == "1":
                    label = 1
                else:
                    label = -1  # we set the negative label to -1 for logistic loss
                outfile.write("%s '%s%s\n" % (label, line[0], categorical_features))
            else:  # we don't care about labels
                outfile.write("1 '%s%s\n" % (line[0], categorical_features))
            # reporting progress
            if counter % 1000000 == 0:
                print("%s\t%s" % (counter, str(datetime.now() - start)))
    print("\n%s Task execution time:\n\t%s" % (counter, str(datetime.now() - start)))

# csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/train", "/Users/RahulAgarwal/kaggle_cpr/click.train_original_data.vw", train=True)
# csv_to_vw("/Users/RahulAgarwal/kaggle_cpr/test", "/Users/RahulAgarwal/kaggle_cpr/click.test_original_data.vw", train=False)
```
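The trickiest step above is parsing the YYMMDDHH hour column. It can be sanity-checked in isolation (the helper name is mine, not part of the original script) against the example from the data dictionary, 14091123 = 23:00 on Sept. 11, 2014:

```python
from datetime import datetime

def parse_hour_field(a):
    """Split a YYMMDDHH string into (weekday name, hour string)."""
    new_date = datetime(int("20" + a[0:2]), int(a[2:4]), int(a[4:6]))
    return new_date.strftime("%A"), a[6:8]

day, hour = parse_hour_field("14091123")
print(day, hour)  # Sept. 11, 2014 was a Thursday
```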
Vowpal Wabbit is run from the command line itself:
```
vw click.train_original_data.vw -f click.model.vw --loss_function logistic
vw click.test_original_data.vw -t -i click.model.vw -p click.preds.txt
```
## Creating the Kaggle submission file
```python
import math

def zygmoid(x):
    # map VW's raw logistic score to a [0, 1] click probability
    return 1 / (1 + math.exp(-x))

with open("kaggle.click.submission.csv", "w") as outfile:
    outfile.write("id,click\n")
    for line in open("click.preds.txt"):
        # each prediction line is "<score> <tag>"; the tag is the ad id
        row = line.strip().split(" ")
        try:
            outfile.write("%s,%f\n" % (row[1], zygmoid(float(row[0]))))
        except (IndexError, ValueError):
            pass
```
This solution ranked 211/371 submissions at the time. Its leaderboard (log loss) score was 0.4031825, while the best leaderboard score was 0.3901120. Some next steps:
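The leaderboard metric is the logarithmic loss over the predicted click probabilities. A minimal sketch of the metric (my own helper, not the competition's official scorer):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average logarithmic loss; predictions are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.1446
```

Lower is better, and confident wrong answers are punished heavily, which is why the submission maps raw VW scores through a sigmoid instead of submitting hard 0/1 labels.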
- Create a better VW model.
- Shuffle the data before training, since VW is an online learner and might have given more weight to the latest data.
- Give higher importance weights to clicks, as the data is skewed. How much?
- Tune the VW algorithm using vw-hypersearch. What should be tuned?
- Use named categorical features like `|C1 C1&1`.
- Create an XGBoost model.
- Create a Sofia-ML model and see how it works on this data.
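One of the ideas above, up-weighting clicks, maps directly onto VW's importance-weight syntax: an optional weight follows the label on each line. A sketch of re-weighting an existing VW file line by line (the weight of 2 is an arbitrary starting point, not a tuned value):

```python
def reweight_vw_line(line, positive_weight=2):
    """Insert an importance weight after the label for positive (clicked) examples.

    VW line format: "<label> ['tag] |namespace features..."; an optional
    importance weight may follow the label: "<label> <weight> ['tag] |...".
    """
    label, rest = line.split(" ", 1)
    if label == "1":
        return "%s %s %s" % (label, positive_weight, rest)
    return line

print(reweight_vw_line("1 'ad123 |hr 23 |day Thursday"))
# negative examples are left untouched
print(reweight_vw_line("-1 'ad124 |hr 09 |day Monday"))
```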
Link to original article: Exploring Vowpal Wabbit with the Avazu Clickthrough Prediction Challenge