ARM

Rumor & Susceptible User Detection

Georgetown University DSAN Anly-501 Project

A full data science life cycle



Introduction

In this tab, we apply association rule mining (ARM). The analysis uses the account descriptions written by users who sent rumors (the “description” column in the “cleaned_rumor_manually.csv” data set) to find features that indicate which users tend to spread rumors.

Theory

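ARM searches for relationships between items by recording which items tend to occur together in the same record.

Three metrics are used to evaluate a candidate rule: support, confidence, and lift. Support measures how many instances back the rule, calculated as how often the items appear together. Confidence measures how reliable the observed pattern is, calculated as the conditional probability of the consequent given the antecedent. Lift measures how strongly the items are related, calculated as the confidence divided by the support of the consequent; a lift above 1 means the items appear together more often than they would if they were independent.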
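The three metrics can be computed by hand on a small example. The sketch below uses made-up transactions (not the project's data), and the function names support, confidence, and lift are chosen here for illustration only.

```python
# Toy transactions (hypothetical data, not taken from the project's data set)
transactions = [
    {"family", "freedom", "life"},
    {"family", "freedom"},
    {"freedom", "news"},
    {"family", "life"},
    {"news", "media"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Metrics for the rule family -> freedom
supp = support({"family", "freedom"}, transactions)
conf = confidence({"family"}, {"freedom"}, transactions)
lft = lift({"family"}, {"freedom"}, transactions)
print(supp, conf, lft)
```

In this toy data, "family" and "freedom" appear together in 2 of 5 transactions, and the lift is slightly above 1, which indicates a weak positive association.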

Method

This part shows the workflow for mining rules with the method introduced above.

Data Selection

The feature we use is shown below. As mentioned before, only the descriptions of the positive instances (accounts that sent rumors) are selected.

Code
import pandas as pd
from nltk.tokenize import word_tokenize
from apyori import apriori
import networkx as nx
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# read the data and keep only the positive (rumor) instances
df = pd.read_csv("../../data/01-modified-data/cleaned_supervised_data.csv")
df = df[df["label"] == 1].copy()
df["description"] = df["description"].fillna("")

# remove the fragments left over from 's, 't and 're, the fragment "w",
# and "us" (which the stopword list does not include)
# note: these are plain substring replacements, so they also strip the
# leading letters of any word that starts with s, t, re, w or us
for fragment in [" s", " t", " re", " w", " us"]:
    df["description"] = df["description"].str.replace(fragment, "", regex=False)

# drop the stopwords (the set is built once instead of once per word)
stop_words = set(stopwords.words("english"))

def drop_stop(word_list):
    return [word for word in word_list if word not in stop_words]

df["description"] = df["description"].apply(word_tokenize)
# df is now a Series of token lists, one list per account
df = df["description"].apply(drop_stop)
df.head()
0    [make, america, florida, let, haveome, fun, ad...
1    [libertarian, free, markets, freepeech, elfish...
2    [urvivor, yazidigenocide, human, rights, activ...
3                                         [cope, fest]
4    [feminist, cymraes, european, fbpe, fbppr, fbeie]
Name: description, dtype: object
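Tokens in the output above such as "haveome", "freepeech", and "urvivor" suggest that the substring replacements also remove the first letter of ordinary words. A minimal sketch of one alternative, assuming the fragments should be removed only when they stand alone as whole words, uses a regular expression with word boundaries. The helper name remove_fragments is hypothetical, and this is not the cleaning that produced the results on this page.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches s, t, re, w and us only when they form a whole token.
FRAGMENTS = re.compile(r"\b(?:s|t|re|w|us)\b")

def remove_fragments(text):
    """Delete the fragments as whole tokens and collapse repeated spaces."""
    cleaned = FRAGMENTS.sub(" ", text)
    return " ".join(cleaned.split())

example = remove_fragments("let s have some fun with us")
print(example)
```

With this pattern, "some" and "with" are left intact because the matched letters are not surrounded by word boundaries on both sides.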



Model Building

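In this part, three functions are defined first: one collects the metrics of each rule into a dataframe, one transforms that dataframe into a structure suitable for building a network, and one visualizes the network. The Apriori algorithm is then run on the tokenized descriptions.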

Code
def reformat_results(results):
    # flatten apyori's RelationRecord objects into one row per rule
    keep = []
    for record in results:
        supp = record.support
        for stat in record.ordered_statistics:
            # skip rules with an empty antecedent
            if len(stat.items_base) != 0:
                lhs = list(stat.items_base)  # antecedent
                rhs = list(stat.items_add)   # consequent
                conf = float(stat.confidence)
                lift = float(stat.lift)
                keep.append([lhs, rhs, supp, conf, supp * conf, lift])

    return pd.DataFrame(keep, columns=["lhs", "rhs", "supp", "conf", "supp x conf", "lift"])

def convert_to_network(df):
    print(df)

    # build a directed graph with one edge per rule (antecedent -> consequent)
    G = nx.DiGraph()
    for _, row in df.iterrows():
        lhs = "_".join(row["lhs"])
        rhs = "_".join(row["rhs"])
        conf = row["conf"]
        # add_edge creates missing nodes; keep the first weight if a rule repeats
        if not G.has_edge(lhs, rhs):
            G.add_edge(lhs, rhs, weight=conf)
    return G

def plot_network(G):
    # specify x-y positions for plotting
    pos = nx.random_layout(G)

    # generate plot
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)

    # edge widths based on the confidence of each rule
    weights_e = [G[u][v]["weight"] for u, v in G.edges()]

    # sample the colormap for edge colors; confidence already lies in [0, 1],
    # so it is passed to the colormap without rescaling
    cmap = plt.get_cmap("Blues")
    colors_e = [cmap(G[u][v]["weight"]) for u, v in G.edges()]

    # plot
    nx.draw(
        G,
        edgecolors="black",
        edge_color=colors_e,
        node_size=2000,
        linewidths=2,
        font_size=8,
        font_color="white",
        font_weight="bold",
        width=weights_e,
        with_labels=True,
        pos=pos,
        ax=ax,
    )
    ax.set(title="Account Descriptions of Users Sending Rumors")
    plt.show()


# min_length is not one of apyori's keyword arguments, so it is dropped here;
# max_length=2 limits the rules to pairs of words
results = list(apriori(df, min_support=0.02, min_confidence=0.2, max_length=2))
pd_results = reformat_results(results)
G = convert_to_network(pd_results)
plot_network(G)
               lhs             rhs  supp      conf  supp x conf       lift
0           [aime]          [host]  0.02  1.000000     0.020000  10.000000
1        [america]       [freedom]  0.02  0.400000     0.008000   5.000000
2        [freedom]       [america]  0.02  0.250000     0.005000   5.000000
3         [author]   [bestselling]  0.02  0.500000     0.010000  25.000000
4    [bestselling]        [author]  0.02  1.000000     0.020000  25.000000
5         [author]          [host]  0.02  0.500000     0.010000   5.000000
6         [christ]         [jesus]  0.02  1.000000     0.020000  50.000000
7          [jesus]        [christ]  0.02  1.000000     0.020000  50.000000
8      [christian]  [conservative]  0.02  0.666667     0.013333  11.111111
9   [conservative]     [christian]  0.02  0.333333     0.006667  11.111111
10     [christian]       [freedom]  0.02  0.666667     0.013333   8.333333
11       [freedom]     [christian]  0.02  0.250000     0.005000   8.333333
12     [christian]          [life]  0.02  0.666667     0.013333  13.333333
13          [life]     [christian]  0.02  0.400000     0.008000  13.333333
14     [christian]           [pro]  0.02  0.666667     0.013333  11.111111
15           [pro]     [christian]  0.02  0.333333     0.006667  11.111111
16  [conservative]           [pro]  0.02  0.333333     0.006667   5.555556
17           [pro]  [conservative]  0.02  0.333333     0.006667   5.555556
18         [daily]        [mirror]  0.02  0.500000     0.010000  25.000000
19        [mirror]         [daily]  0.02  1.000000     0.020000  25.000000
20         [faith]        [family]  0.02  1.000000     0.020000  16.666667
21        [family]         [faith]  0.02  0.333333     0.006667  16.666667
22         [faith]       [freedom]  0.02  1.000000     0.020000  12.500000
23       [freedom]         [faith]  0.02  0.250000     0.005000  12.500000
24        [family]       [freedom]  0.03  0.500000     0.015000   6.250000
25       [freedom]        [family]  0.03  0.375000     0.011250   6.250000
26        [family]       [friends]  0.02  0.333333     0.006667  16.666667
27       [friends]        [family]  0.02  1.000000     0.020000  16.666667
28        [family]          [life]  0.02  0.333333     0.006667   6.666667
29          [life]        [family]  0.02  0.400000     0.008000   6.666667
30       [founder]          [host]  0.02  0.500000     0.010000   5.000000
31       [founder]     [president]  0.02  0.500000     0.010000  12.500000
32     [president]       [founder]  0.02  0.500000     0.010000  12.500000
33       [freedom]          [life]  0.02  0.250000     0.005000   5.000000
34          [life]       [freedom]  0.02  0.400000     0.008000   5.000000
35       [freedom]           [pro]  0.03  0.375000     0.011250   6.250000
36           [pro]       [freedom]  0.03  0.500000     0.015000   6.250000
37           [get]          [life]  0.02  0.666667     0.013333  13.333333
38          [life]           [get]  0.02  0.400000     0.008000  13.333333
39          [ofhe]          [host]  0.02  0.666667     0.013333   6.666667
40       [podcast]          [host]  0.02  1.000000     0.020000  10.000000
41         [human]        [rights]  0.02  1.000000     0.020000  50.000000
42        [rights]         [human]  0.02  1.000000     0.020000  50.000000
43          [life]           [pro]  0.02  0.400000     0.008000   6.666667
44           [pro]          [life]  0.02  0.333333     0.006667   6.666667
45         [media]          [news]  0.02  0.400000     0.008000   6.666667
46          [news]         [media]  0.02  0.333333     0.006667   6.666667
47         [media]  [organization]  0.02  0.400000     0.008000  13.333333
48  [organization]         [media]  0.02  0.666667     0.013333  13.333333
49        [mother]          [wife]  0.02  1.000000     0.020000  50.000000
50          [wife]        [mother]  0.02  1.000000     0.020000  50.000000
51  [organization]     [political]  0.02  0.666667     0.013333  22.222222
52     [political]  [organization]  0.02  0.666667     0.013333  22.222222
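The lift values in the table can be cross-checked against the confidence values, because lift equals a rule's confidence divided by the support of the item in the second column. The sketch below does this for three rows whose second item is "host"; the function name implied_support is an assumption made for illustration.

```python
# (first item, second item, confidence, lift) copied from rows 0, 5 and 30 of the table
rows = [
    ("aime", "host", 1.0, 10.0),
    ("author", "host", 0.5, 5.0),
    ("founder", "host", 0.5, 5.0),
]

def implied_support(conf, lift):
    """Support of the second item implied by lift = conf / support(second item)."""
    return conf / lift

implied = {(a, b): implied_support(conf, lift) for a, b, conf, lift in rows}
for pair, value in implied.items():
    print(pair, round(value, 3))
```

All three rows imply the same support of 0.1 for "host", which is what a consistent table should show: roughly one in ten of the descriptions contains that word.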



Results

The results include some unsurprising pairs, such as “christ” appearing with “jesus” and “conservative” appearing with “christian”.

However, there are also some more informative findings. “Political” appears with “organization”, meaning that these accounts tend to describe themselves as political organizations. Reading the original data shows that organizations of this kind are the ones sending political information. This aligns with the results of the EDA, which show that politics is the main topic of the rumors.

In addition, “family”, “life”, and “freedom” appear together in pairwise rules. Unfortunately, reading the original data suggests that these words represent the core values of many users who send rumors.
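Because the rules were mined with max_length=2, the table contains only pairs of words, so a claim about three words appearing together has to be checked separately. A minimal sketch of such a check follows, using made-up token lists in place of the cleaned descriptions and a hypothetical helper named joint_support.

```python
# Hypothetical token lists standing in for the cleaned descriptions
descriptions = [
    ["family", "freedom", "life", "pro"],
    ["family", "freedom"],
    ["life", "pro"],
    ["news", "media"],
]

def joint_support(items, token_lists):
    """Share of token lists that contain every item in items."""
    items = set(items)
    return sum(items <= set(tokens) for tokens in token_lists) / len(token_lists)

# Support of each pair, and of all three words in the same description
pairs = [("family", "freedom"), ("family", "life"), ("freedom", "life")]
pairwise = {pair: joint_support(pair, descriptions) for pair in pairs}
triple = joint_support(("family", "life", "freedom"), descriptions)
print(pairwise, triple)
```

The support of the triple can never exceed the smallest pairwise support, so frequent pairs alone do not guarantee that all three words share a description.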

Conclusion

In this tab, the Apriori algorithm was applied to the account descriptions written by users who have sent rumors in order to mine association rules.

The findings are that such users may be organizations sending political information, and that they may emphasize a combination of family, life, and freedom.

Reference

[1] Remanan, S. (2018, November 2). Association rule mining. Medium. Retrieved December 2, 2022, from https://medium.com/towards-data-science/association-rule-mining-be4122fc1793