Decision Tree

Rumor & Susceptible User Detection

Georgetown University DSAN ANLY-501 Project

A full data science life cycle



Introduction

In this tab, a classification method called Decision Tree is used to build the Susceptible User Detection model. We will use “cleaned_followers.csv”.

Theory

A decision tree repeatedly asks yes-or-no questions about the input (e.g., “is followers_count ≤ 50?”) so that it can gradually divide the data into different parts.



But how do we decide what to ask? The trick is to use a mathematical formula to quantify one of the following metrics and pick the question that optimizes it (the standard formulas are shown below).

  • The amount of new information we gain from an answer (by calculating entropy)
  • The probability that we incorrectly classify a sample (by calculating the Gini index)
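For example, if the samples at a node fall into classes $1,\dots,k$ with proportions $p_1,\dots,p_k$, the two standard measures are:

$$\text{Entropy} = -\sum_{i=1}^{k} p_i \log_2 p_i \qquad\qquad \text{Gini} = 1 - \sum_{i=1}^{k} p_i^{2}$$

At each node, the tree chooses the question (feature split) that most reduces the chosen impurity; for entropy, this reduction is called the information gain.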

By using this method, we can easily measure how important an attribute is, and understand what makes our research target so different.



Method

This part shows the workflow of training the optimal model with a Decision Tree.

Class Distribution

Code
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
#load the data
df=pd.read_csv("../../data/01-modified-data/cleaned_followers.csv")
sns.set_theme()
#plot the distribution
plt.hist(df.label.astype("string"))
plt.title("The distribution of the class",fontsize=18)
plt.xlabel("Class",fontsize=16)
plt.ylabel("Counts",fontsize=16)
#show the data
df.head()
user_id screen_name followers_count friends_count listed_count favourites_count tweet_num protected verfied label
0 2198516225 _Banzi_ 27 199 0 601 122 0 0 1
1 1504258025804210176 DelorbeTori 81 2771 0 0 6 0 0 1
2 1581474148571877377 omkarVyawahare2 0 56 0 24 0 0 0 1
3 2967681610 L1BERTE_S 50 68 1 8735 2238 0 0 1
4 1400890434 mistamomo_ 356 1104 19 18347 20160 0 0 1

The counts of the two classes are the same, 350 each. This was designed when gathering the data (“details of followers”). With this kind of data, we can avoid the problems brought by imbalanced data sets.
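As a quick sanity check, the balance can also be read directly off the data frame loaded above:

Code
# count the samples per class (expect 350 for each of the two labels)
df.label.value_counts()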



Baseline Model for Comparison

Code
#define a baseline model which randomly assigns labels
def random_classifier(y_data):
    ypred=[]
    max_label=np.max(y_data)
    for i in range(0,len(y_data)):
        ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))
    print("-----RANDOM CLASSIFIER-----")
    print("accuracy",accuracy_score(y_data, ypred))
    print("precision, recall, fscore,",precision_recall_fscore_support(y_data,ypred))

random_classifier(df.label)
-----RANDOM CLASSIFIER-----
accuracy 0.5
precision, recall, fscore, (array([0.5, 0.5]), array([0.50857143, 0.49142857]), array([0.50424929, 0.49567723]), array([350, 350], dtype=int64))



What the baseline model does here is guess the class at random, and we can see that every metric is around 50%. So if a model performs better than this baseline, we can say it does make some sense.
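As a side note, the same sanity check can be done with scikit-learn's built-in dummy classifier instead of a hand-written function. A minimal sketch (the all-zero feature matrix is just a placeholder, since the "uniform" strategy ignores the features):

Code
# equivalent baseline using scikit-learn's DummyClassifier
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="uniform", random_state=42)  # uniform random guessing
dummy.fit(np.zeros((len(df), 1)), df.label)                   # features are ignored
print("accuracy", accuracy_score(df.label, dummy.predict(np.zeros((len(df), 1)))))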

Data Selection

Code
# user_id reflects how long the account has existed, so we normalize it
df["user_id"]=(df.user_id-df.user_id.mean())/df.user_id.std()
# drop the feature we won't consider
df.drop(columns=["screen_name"],inplace=True)
# split features and label
X=df.drop(columns=["label"])
y=df["label"]
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
Feature             Meaning
user_id             the ID of the user
followers_count     the number of followers this account currently has
friends_count       the number of users this account is following
listed_count        the number of public lists that this user is a member of
favourites_count    the number of Tweets this user has liked in the account’s lifetime
tweet_num           the number of Tweets (including retweets) issued by the user
protected           whether the user has chosen to protect their Tweets
verified            whether the user has a verified account
(The names and meanings of features)

Eight features were selected to train the model. These are all attributes that we are allowed to collect, and they reflect the character of an account, so they give us the best chance of finding differences between the two classes. Note that “user_id” is also selected because it reflects when an account was created: the larger a user ID is, the newer the account.

Model Tuning

Code
# try different numbers of layers to find the best one
test_results=[]
train_results=[]

for num_layer in range(1,20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(x_train,y_train)

    yp_train=model.predict(x_train)
    yp_test=model.predict(x_test)

    # print(y_pred.shape)
    test_results.append([num_layer,accuracy_score(y_test, yp_test),recall_score(y_test, yp_test,pos_label=0),recall_score(y_test, yp_test,pos_label=1)])
    train_results.append([num_layer,accuracy_score(y_train, yp_train),recall_score(y_train, yp_train,pos_label=0),recall_score(y_train, yp_train,pos_label=1)])
test_results=np.array(test_results)
train_results=np.array(train_results)
#generate plots of the performance of different layers 
def metric_plot(ylabel,layer,yptrain,yptest):
    fig=plt.figure()
    plt.plot(layer,yptrain,'o-',color="b")
    plt.plot(layer,yptest,'o-',color="r")
    plt.ylabel(ylabel+" Training (blue) and Test (red)",fontsize=16)
    plt.xlabel("Number of layers in decision tree(max_depth)",fontsize=16)
metric_plot("ACCURACY(Y=0)",test_results[:,0],train_results[:,1],test_results[:,1])
metric_plot("RECALL(Y=0)",test_results[:,0],train_results[:,2],test_results[:,2])
metric_plot("RECALL(Y=1)",test_results[:,0],train_results[:,3],test_results[:,3])



To find the most suitable number of layers, several plots were produced comparing training and test performance at each depth. Based on where the test curves peak before the gap to the training curves widens (overfitting), we set max_depth to 4.
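The same choice can also be made programmatically rather than by eye. A minimal sketch that reads the best depth out of the test_results array built above (column 0 holds the depth, column 1 the test accuracy):

Code
# pick the depth whose test accuracy is highest
best_row = test_results[np.argmax(test_results[:, 1])]
print("best max_depth:", int(best_row[0]), "test accuracy:", best_row[1])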

Final Results

Code
#fit the tree model with the best layer
model = tree.DecisionTreeClassifier(max_depth=4)
model = model.fit(x_train,y_train)

yp_train=model.predict(x_train)
yp_test=model.predict(x_test)

#write a function to visualize the confusion matrix
def confusion_plot(y_data,y_pred):
    print(
        "ACCURACY: "+str(accuracy_score(y_data,y_pred))+"\n"+
        "NEGATIVE RECALL (Y=0): "+str(recall_score(y_data,y_pred,pos_label=0))+"\n"+
        "NEGATIVE PRECISION (Y=0): "+str(precision_score(y_data,y_pred,pos_label=0))+"\n"+
        "POSITIVE RECALL (Y=1): "+str(recall_score(y_data,y_pred,pos_label=1))+"\n"+
        "POSITIVE PRECISION (Y=1): "+str(precision_score(y_data,y_pred,pos_label=1))+"\n"
    )
    cf=confusion_matrix(y_data, y_pred)
    # customize the anno
    group_names = ["True Neg","False Pos","False Neg","True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cf.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    #plot the heatmap
    fig=sns.heatmap(cf, annot=labels, fmt="", cmap='Blues')
    plt.title("Confusion Matrix of Texts - Decision Tree",fontsize=18)
    fig.set_xticklabels(["Easily affected","Not easily affected"],fontsize=13)
    fig.set_yticklabels(["Easily affected","Not easily affected"],fontsize=13)
    fig.set_xlabel("Predicted Labels",fontsize=14)
    fig.set_ylabel("True Labels",fontsize=14)
    plt.show()
confusion_plot(y_test,yp_test)

#write a function to visualize the tree
def plot_tree(model,X,Y):
    fig = plt.figure(figsize=(10,8))
    # class_names must be a list of strings, one per class, in the order of model.classes_
    tree_vis = tree.plot_tree(model, feature_names=X.columns, class_names=[str(c) for c in model.classes_], filled=True)
plot_tree(model,x_test,y_test)
ACCURACY: 0.6357142857142857
NEGATIVE RECALL (Y=0): 0.4262295081967213
NEGATIVE PRECISION (Y=0): 0.6190476190476191
POSITIVE RECALL (Y=1): 0.7974683544303798
POSITIVE PRECISION (Y=1): 0.6428571428571429

With 4 layers, the leaf nodes seem quite reasonable. The model is noticeably more reliable on the positive class (recall ≈ 0.80) than on the negative class (recall ≈ 0.43). And favourites_count seems to be a remarkable feature for grouping users.
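This impression can be checked against the fitted tree's impurity-based feature importances. A minimal sketch using the model and feature matrix defined above:

Code
# rank features by their impurity-based importance in the fitted tree
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))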



Conclusion

The model is not bad: it can correctly distinguish most samples, and the accuracy is about 64%. Moreover, since our target is to find users who are easily affected by rumors, a false positive is more acceptable than a false negative; in other words, it is more acceptable to warn a person who is not easily affected than to fail to notify a person who may trust a rumor and pass it along! So we can consider applying this model.

Decision trees are unstable, so in the future we can also use bagging (random forest) or boosting (XGBoost, GBDT, LightGBM, etc.) to fit the data.
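As one illustration, a bagging variant of the same pipeline only takes a few lines. A minimal sketch reusing the train/test split from above (the hyperparameters shown are placeholders, not tuned values):

Code
# bagging variant: a random forest trained on the same split
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
rf.fit(x_train, y_train)
print("test accuracy:", accuracy_score(y_test, rf.predict(x_test)))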
