Entry Name:  "JNU-Liu-MC1"

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Rui Liu, Jinan University (Guangzhou), liuruijnu@qq.com PRIMARY

Qian Liu, Jinan University (Guangzhou, China), University at Albany SUNY (USA), tsusanliu@jnu.edu.cn PRIMARY Contact

Yong Liu, China Shanxi Water Resources Bureau, digitip@163.com

Student Team:  YES

 

Tools Used:

Gephi

Python

Tableau

 

Approximately how many hours were spent working on this submission in total?

250 hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES

 

Video

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

Questions

1 –– Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

  a. Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

Answer to a:

Our result is that Graph 1 matches the template best. We first produced a general visualization of each graph, then compared key features of each graph, such as betweenness centrality and average degree, and finally examined some local features.

a.1 First, we produced a general visualization of each graph and of the template. The five graphs and the template are plotted in Figures 1-6 below (each can also be retrieved online from the link above its figure caption). Edge color represents the eType of the connection, node color represents the cluster [1] the node belongs to, and larger node size represents higher betweenness centrality. As shown, graph1, graph2, and graph3 are partially similar to the provided template graph, in which two major subgraphs are linked with eType 5 and eType 6 edges.

https://susannusas.github.io/graph1/index.html

Figure 1: graph1 visualization

Comparing graph1 with the template, both contain two major clusters linked with eType 5 and eType 6 edges. In both figures, the connecting nodes communicate with each other by email and phone calls; they have relatively high betweenness centrality and connect to each other via green and blue edges (eType = 0, 1).

 

 

https://susannusas.github.io/graph2/index.html

Figure 2: graph2 visualization

Comparing graph2 with the template, both contain two major clusters linked with eType 5 and eType 6 edges, but in the template the two clusters are linked through fewer nodes than in Figure 2.

 

 

 

 

https://susannusas.github.io/graph3/index.html

Figure 3: graph3 visualization

Comparing Figure 3 with Figure 6, both contain two major clusters linked with eType 5 and eType 6 edges, but in Figure 3 the two clusters are linked through fewer nodes than in Figure 6.

https://susannusas.github.io/graph4/index.html

Figure 4: graph4 visualization

Comparing Figure 4 with Figure 6, both contain two major clusters linked with eType 5 and eType 6 edges, but in Figure 4 the two clusters are linked through more nodes than in Figure 6.

 

https://susannusas.github.io/graph5/index.html

Figure 5: graph5 visualization

Comparing graph5 with the template, both contain two major clusters linked with eType 5 and eType 6 edges, but the template (Figure 6) has more small clusters than Figure 5.

 

 

https://susannusas.github.io/template/index.html

Figure 6: Template graph visualization

a.2 Comparing the key features of the subgraphs and the template also reveals similarities and differences. Graph 3 is the most similar considering modularity and average path length. Graphs 1 and 2 are the most similar considering average degree and density.

Table 1: Key features of subgraphs and the template

Feature               subgraph1  subgraph2  subgraph3  subgraph4  subgraph5  template
Average degree        13.075     14.943     9.228      8.414      3.314      15.057
Density               0.142      0.174      0.118      0.098      12669.711  0.173
Modularity            0.07       0.04       0.124      0.188      0.039      0.214
Average path length   2.083      2.086      2.026      2.43       0.109      1.875
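Metrics like those in Table 1 can be cross-checked outside Gephi. The following is a minimal sketch in plain Python, assuming each graph is treated as a simple undirected graph given as (source, target) pairs and that a community assignment is already available. The names `build_adjacency`, `bfs_distances`, and `key_features` are illustrative and not part of our submission, and Gephi's own settings (directed edges, parallel edges) may produce different numbers.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops and parallel edges are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, start):
    """Hop distance from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def key_features(edges, communities):
    """Average degree, density, modularity, and average path length."""
    adj = build_adjacency(edges)
    if not adj:
        raise ValueError("the graph has no edges")
    n = len(adj)
    m = sum(len(nbs) for nbs in adj.values()) // 2
    # Average shortest-path length over all reachable ordered pairs.
    total = pairs = 0
    for node in adj:
        for other, d in bfs_distances(adj, node).items():
            if other != node:
                total += d
                pairs += 1
    # Modularity of the given partition: sum of L_c/m - (d_c/(2m))^2.
    modularity = 0.0
    for community in communities:
        members = set(community)
        inside = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        modularity += inside / m - (degree_sum / (2 * m)) ** 2
    return {
        "average_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
        "modularity": modularity,
        "average_path_length": total / pairs,
    }


# Toy graph: two triangles joined by one edge (not challenge data).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
features = key_features(edges, communities=[{1, 2, 3}, {4, 5, 6}])
print(features)
```

The all-pairs breadth-first search is quadratic in the number of nodes, so this is only practical for graphs the size of the candidate subgraphs, not for the very large graph.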

a.3 We also discovered some local features. When exploring the least frequent edges (eType 4, 3, and 2), we found that graph 1 and the template have the similar local features shown below.

1. The eType 2 and eType 3 edges are linked in a triangle with an eType 0 edge, and the eType 2 and eType 3 edges share the same time.

2. The source nodes of the eType 4, 3, and 2 edges are linked within four steps by either email or telephone links (eType 0 or 1), and these nodes share some neighbors with high betweenness centrality that are linked by email or telephone edges (eType 0 or 1). These nodes form a group of “email and telephone connectors” in the graph, connecting the two major clusters.
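The triangle feature can be searched for programmatically. The sketch below is a hedged illustration, not the procedure we used: it assumes edges are available as (source, target, eType, time) tuples, and it assumes that the eType 2 edge and the eType 3 edge of a triangle point at the same target node while their two source nodes are joined by an eType 0 edge. The function name `find_triangles` is hypothetical.

```python
from collections import defaultdict


def find_triangles(edges):
    """Return (a, b, shared_target, time) for every matching triangle.

    a is the source of an eType 2 edge, b the source of an eType 3 edge,
    both edges hit the same target at the same time, and a and b are
    connected by an eType 0 edge in either direction.
    """
    type2_sources = defaultdict(set)  # (target, time) -> sources of eType 2
    type3_sources = defaultdict(set)  # (target, time) -> sources of eType 3
    email_pairs = set()               # unordered node pairs with an eType 0 edge
    for source, target, etype, time in edges:
        if etype == 2:
            type2_sources[(target, time)].add(source)
        elif etype == 3:
            type3_sources[(target, time)].add(source)
        elif etype == 0:
            email_pairs.add(frozenset((source, target)))

    triangles = []
    for (target, time), sources2 in type2_sources.items():
        for a in sources2:
            for b in type3_sources.get((target, time), ()):
                if a != b and frozenset((a, b)) in email_pairs:
                    triangles.append((a, b, target, time))
    return sorted(triangles)


# Toy edges (not challenge data): only nodes 1 and 2 close a triangle.
toy_edges = [
    (1, 100, 2, 50), (2, 100, 3, 50), (1, 2, 0, 10),  # full triangle
    (3, 100, 3, 50),                                  # no email to node 1
    (4, 101, 2, 60), (5, 101, 3, 61), (4, 5, 0, 5),   # times differ
]
triangles = find_triangles(toy_edges)
print(triangles)
```

Grouping by (target, time) first keeps the search close to linear in the number of eType 2 and eType 3 edges, which are the rarest edges in the data.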

After comparing the general visualizations, the key features of the six graphs in Table 1, and these local features, we conclude that Graph 1 is the best match.

 

  b. Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

Answer to b:

b.1 The template graph has two major clusters, formed by eType 5 and eType 6 edges. Edge color represents the eType of the connection, and node color represents the cluster. The purple node cluster and the green node cluster are the two largest clusters. Click for more details:

Figure b-1: two clusters detail for template

 

Figure b-2: Financial(eType 5) connection cluster detail

b.2 The biggest cluster, shown in Figure b-2, contains the node with the highest degree, which connects to many others via eType 5 (purple) edges; this indicates many financial connections within this group. The label of each node gives its id, degree, and betweenness centrality.

 

Figure b-3: Travel connection (eType = 6) cluster detail

The green node cluster shown in Figure b-3 is mostly connected by orange edges (eType = 6), which indicates a shared travel history among this group of people.

Figure b-4: Email/phone connecting nodes (eType = 0, 1) detail

b.3 As shown above, a few nodes with high closeness/betweenness centrality connect the purple node cluster and the green node cluster. We call them email/phone connecting nodes. They connect to each other via green and blue edges (eType = 0, 1), indicating many email and phone connections among these people.
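Betweenness centrality is what singles out these connecting nodes. As a hedged illustration of the measure, here is a compact plain-Python version of Brandes' algorithm for an unweighted, undirected graph; it is a sketch for small graphs, not the Gephi computation behind our figures, and the function name `betweenness` is illustrative.

```python
from collections import deque


def betweenness(edges):
    """Unnormalized betweenness centrality (Brandes), undirected, unweighted."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    centrality = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Toy graph: two triangles joined through a single connector, node 7.
toy = [(1, 2), (2, 3), (1, 3), (3, 7), (7, 4), (4, 5), (5, 6), (4, 6)]
scores = betweenness(toy)
top_connector = max(scores, key=scores.get)
print(top_connector, scores[top_connector])
```

In the toy graph every shortest path between the two triangles runs through node 7, which mirrors how the email/phone connecting nodes sit between the purple and green clusters.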

Figure 13: Call sample and travel sample data’s different geo locations

2 –– CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that match the template? Describe your process and findings in no more than ten images and 500 words.

Having identified the features of the template, we start from the 3 seeds, try to find the email/phone connecting nodes (eType 1 and 0), and then try to locate the two major clusters with eType 5 and 6.

We tried to use Python to dig out networks with similar features.

Some of the Python code is listed below, but our machine never successfully produced an answer.

import pandas as pd
import numpy as np
import networkx as nx
import re

path = './GraphData.csv'
bigdata = pd.read_csv(path)
alldata = bigdata[['Source','Target','eType','Weight','Time']]

# build the graph, keeping eType, Weight and Time as edge attributes
G = nx.from_pandas_edgelist(alldata,source = 'Source',target='Target',edge_attr=True)

# seed linked nodes
First_Seed_Nodes = [600971,579269,538771,473043,574136,657187]

 

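def from_list_create_txt(node_list,n):
    # write the neighbors of every node in node_list to "<n>.txt", one record per line
    txt_path = '{}.txt'.format(n)
    with open(txt_path, 'a') as file_handle:  # the .txt file does not need to exist; it is created automatically
        for i in node_list:
            file_handle.write(str(G[i]))  # write the node's neighbors and edge attributes
            file_handle.write('\n')  # the newline separates the records, to reduce mistakes
    return txt_path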

 

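def from_txt_create_df(txt_path,n):
    # read the text back
    with open(txt_path) as f0:
        m = f0.read().strip()
        #print(m)
    # add double quotes around the numbers
    data1 = re.sub(r'-?\d+((/?\d+)|((\.)?\d+))',
               lambda x: '"{}"'.format(x.group()),m)
    # when the pattern contained a colon it always matched one colon too many (or had other problems);
    # dropping the colon from '-?\d+(|(/?\d+)|((\.)?\d+)):' fixed that, but single-digit eType values are still not matched
    # group(1) returned only part of the match, so group() is used instead
    #print(data1)

    # handle "nan" values that lack double quotes
    data2 = re.sub(r'(nan)',
                  lambda x: '"{}"'.format(x.group()),
                  data1)
    #print(data2)

    # replace single quotes with double quotes
    data3 = re.sub(r"\'",r'"',data2)
    #print(data3)

    # remove duplicated double quotes
    data4 = re.sub(r'""',r'"',data3)
    #print(data4)

    # remove the characters that separate the records, merging them into one dictionary
    data5 = re.sub(r'}}\n{',r'},',data4)
    # print(data5)

    # the resulting dataframe
    df = pd.read_json(data5).T
    print(df.head())
    df.to_csv('df_{}.csv'.format(n))
    return df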

 

 

def from_txt_create_nodelist(df,last_node,n):
    # four layers are expected to be enough: layer 1 goes from the seeds (eType 4 and 2) to their neighbors;
    # layers 2 and 3 follow eType 0 and 1 edges; layer 4 follows the eType 2 and 3 edges that form the network
    # keep the eType 0/1 edges up to layer 3, and the eType 2/3 edges afterwards
    etypes = [0, 1] if n <= 3 else [2, 3]
    data_second_third = df.loc[df['eType'].isin(etypes)]
    # from_txt_create_df indexes its rows by neighbor id, so the new nodes are the index values;
    # the set difference removes duplicates and drops the nodes already visited (lists do not support "-")
    node_list = list(set(data_second_third.index.tolist()) - set(last_node))
    return node_list

 

if __name__ == '__main__':
    # layer 1
    First_txt_path = from_list_create_txt(First_Seed_Nodes,1)
    First_df = from_txt_create_df(First_txt_path,1)
    # layer 2
    Second_node_list = from_txt_create_nodelist(First_df, First_Seed_Nodes,2)
    Second_txt_path = from_list_create_txt(Second_node_list,2)
    Second_df = from_txt_create_df(Second_txt_path,2)
    Second_nodes = First_Seed_Nodes + Second_node_list
    # layer 3
    Third_node_list = from_txt_create_nodelist(Second_df, Second_nodes,3)
    Third_txt_path = from_list_create_txt(Third_node_list,3)
    Third_df = from_txt_create_df(Third_txt_path,3)
    Third_nodes = Second_nodes+ Third_node_list
    # layer 4
    Forth_node_list = from_txt_create_nodelist(Third_df,Third_nodes,4)
    Forth_txt_path = from_list_create_txt(Forth_node_list,4)
    Forth_df = from_txt_create_df(Forth_txt_path,4)
    # Forth_nodes = Third_nodes + Forth_node_list

However, we ran into difficulties when coding against the large dataset.
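The layered expansion attempted above can also be expressed as a hop-limited breadth-first search, which avoids the round trip through text files. The sketch below is a simplified alternative under stated assumptions, not a fix that we ran on the challenge data: edges are assumed to be (source, target, eType) tuples, every hop follows only email/phone edges (eType 0 or 1), and the name `expand_from_seeds` and its parameters are illustrative.

```python
from collections import defaultdict, deque


def expand_from_seeds(edges, seeds, allowed_etypes=(0, 1), max_hops=3):
    """Map each node reachable from the seeds to its hop distance.

    Only edges whose eType is in allowed_etypes are followed, in either
    direction, and the search stops after max_hops steps.
    """
    allowed = set(allowed_etypes)
    adj = defaultdict(set)
    for source, target, etype in edges:
        if etype in allowed:
            adj[source].add(target)
            adj[target].add(source)

    hops = {seed: 0 for seed in seeds}
    queue = deque(hops)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Toy edges (not challenge data): node 9 is only linked by an eType 5 edge,
# and node 5 is four hops away from the seed.
toy_edges = [(1, 2, 0), (2, 3, 1), (3, 4, 0), (4, 5, 1), (1, 9, 5)]
reached = expand_from_seeds(toy_edges, seeds=[1], max_hops=3)
print(reached)
```

Because each node is visited once, the cost grows with the number of email/phone edges rather than with the number of layers, which matters on the very large graph.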

3 –– Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

Having identified the features of the template, we try to start with the least frequent edges, eType 2 and 3, that share the same time. We then try to find the email/phone connecting nodes (eType 1 and 0) linked with them, and finally to locate the two major clusters linked by eType 5 and 6 edges.

However, we did not finish this analysis.
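The last step of this plan, locating the eType 5 and eType 6 clusters around a set of connecting nodes, could look roughly like the sketch below. It only collects nodes one edge away from the connectors, which is an assumption made for brevity; edges are assumed to be (source, target, eType) tuples and the name `attached_clusters` is hypothetical.

```python
def attached_clusters(edges, connectors, cluster_etypes=(5, 6)):
    """For each cluster eType, collect the nodes that share an edge of
    that type with at least one connector node."""
    connectors = set(connectors)
    clusters = {etype: set() for etype in cluster_etypes}
    for source, target, etype in edges:
        if etype not in clusters:
            continue
        if source in connectors:
            clusters[etype].add(target)
        if target in connectors:
            clusters[etype].add(source)
    return clusters


# Toy edges (not challenge data): nodes 1 and 2 play the connector role.
toy_edges = [(1, 10, 5), (11, 1, 5), (2, 20, 6), (3, 30, 5), (1, 2, 0)]
clusters = attached_clusters(toy_edges, connectors=[1, 2])
print(clusters)
```

A candidate would resemble the template only if both sets are sizeable, since the template has one large financial cluster and one large travel cluster.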

 

4 –– Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

We were unable to reach an answer.

5 –– What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

 

Answer to 5:

Thank you for the VAST Challenge and this wonderful opportunity. Although we did not finish the whole challenge, we enjoyed discussing and trying to solve the problems together, and we look forward to seeing the best solutions.

Data format treatment takes a lot of time.

Exploring a large amount of data takes a lot of effort: reading, storing, transforming, and computing on the data all take hours when dealing with a large dataset with limited computing resources.

Some software cannot deal with very large datasets. Tableau and Gephi can both deal with large datasets.
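One way to reduce the memory pressure described above is to stream the edge file row by row instead of loading it whole. The following is a minimal sketch using only the Python standard library; it assumes the CSV has the Source, Target, and eType columns used in our code, with integer values, and the helper names `count_edge_types` and `iter_edges_of_types` are illustrative.

```python
import csv
import io
from collections import Counter


def count_edge_types(csv_file):
    """Count edges per eType while holding only one row in memory."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[int(row["eType"])] += 1
    return counts


def iter_edges_of_types(csv_file, keep_etypes):
    """Yield (source, target, eType) only for the requested edge types."""
    keep = set(keep_etypes)
    for row in csv.DictReader(csv_file):
        etype = int(row["eType"])
        if etype in keep:
            yield int(row["Source"]), int(row["Target"]), etype


# Toy data in the same column layout (not challenge data).
sample = (
    "Source,Target,eType,Weight,Time\n"
    "1,2,0,1,100\n"
    "2,3,1,1,200\n"
    "3,9,2,1,300\n"
    "4,9,3,1,300\n"
    "1,3,0,1,400\n"
)
type_counts = count_edge_types(io.StringIO(sample))
rare_edges = list(iter_edges_of_types(io.StringIO(sample), keep_etypes=(2, 3)))
print(type_counts, rare_edges)
```

On the real file, the same functions would take a handle such as `open('./GraphData.csv', newline='')`; filtering down to the rare edge types first leaves a much smaller edge list to explore.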



[1] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.