Entry Name: "SDU-WANG-MC1"

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Luyu Cheng, Shandong University, chengluyu@gmail.com

Bairui Su, Shandong University, 1347054683@qq.com

Yumeng Xue, Shandong University, xym6336@gmail.com

Xiaoyu Liu, Shandong University, 1003871575@qq.com

Yunhai Wang, Shandong University, cloudseawang@gmail.com, PRIMARY

Student Team: Yes

 

Tools Used:

 

Approximately how many hours were spent working on this submission in total?

70

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? Yes

 

Video

https://vimeo.com/438452536

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups.One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile

Questions

1 Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

  1. Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

Time-person scatter chart is a scatter chart drawn with time as the horizontal axis and personnel ID as the vertical axis. Different colored dots indicate different types of activity records. The sorting method of the person ID in the vertical axis is ascending order according to the total number of activities.

By observing the flowering time-person scatter charts, it can be found that compared with template, g3 lacks the "cross" distribution in the lower right corner; the activity of g4 is evenly distributed over the whole time range, but the activities in the template are regularly gathered in a certain period of time; the activities of g5 is also evenly distributed, and the number of activities is too small. Therefore, graph3, graph4 and graph5 can be excluded directly. We only need to find the most matching graph in g1 and g2.

This following time-frequency chart is a line chart with relative time as the horizontal axis and the number of activities as the vertical axis. It can visually show the change of the number of occurrences of certain types of activities with time in a group. The figure below only shows the visualization results of the template, g1 and g2.

We compare g1 and g2 with template respectively, and draw out the differences we found with circles. In the place where the red circle is drawn, g1 has more deviations from template than g2, but in the place where the black circle is drawn, g1 is more similar. G1 has a lower peak on the right side of the Phone chart and a higher peak on the left side, and abnormally higher peaks in the Transaction chart. G2 has a small peak on the right side of the Phone chart that is slightly lower than the template. It can be seen that compared with g1, g2 is more similar to the template.

In summary, we think g2 is the best matching graph.

  1. Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

Firstly, we try to find some key people in the time-person scatter diagrams to help us determine the similarities and differences between the graphs. We call the purchaser the "buyer" and the other one "seller" for those who trade with each other; and for those who often communicate by telephone and email and do not do other activities, we call them "correspondents"; we find out a pair of people who keep in touch and interact for a long time in the figure, and call them "partners". Then we analyze the content and object of their interaction.

We can find two partners in template, g1 and g2 that keep communicating for a long time and make transactions. Each of the three pairs consists of a buyer and a seller. Their IDs are shown in the following table:

Graph Name

Seller

Buyer

template

67

39

g1

550287

512397

g2

644830

585212

In order to analyze the difference between g1, g2 and template, we observe the activity records of buyer 39 in template firs. We hover the mouse over a person ID on the vertical axis, and the nodes representing the activities of the personnel in the scatter chart will be connected with the nodes in the graph that represent the same activities. As can be seen from the following figure, the buyer 39 and the seller 67 maintain frequent telephone interaction and purchase transactions at certain times. In addition, buyer 39 often call 66, 47, 67, 65, 40, 41 and sometimes email others.

Next, we observed the activity record of buyer 585212 in g2. As can be seen from the following figure, the buyer 585212 has frequent telephone interactions with the seller 644830, and has made purchase transactions at certain times. He/She often calls people like 488928, 644830, 534034, 527597, 599441, 563211, 541017 and 629627, and sometimes email others. These records are similar to those of buyer 39 in the template.

Let's take a look at the activity record of buyer 512397 in g1. As can be seen from the following figure, he only calls 550287 and 635665. Obviously, this is not the case in template.

In addition, we also found that: in the template, the seller 67 communicates frequently with people 47 except the buyer 39. Therefore, we observed 47's activity record. We infer that 47 should be a "correspondent" because he/she often communicated by phone and email, and did not do other activities.

We can also find a "correspondent" similar to 47 in g2: 488928.

But in g1, the seller 550287 does not communicate frequently with other people except the buyer 512397, so we can not find a "correspondent" similar to 47 in the template.

2 CGCS has a set of seed IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.

Seed1 and Seed2s eType are both 4, and the only one who has Co-authorship edge in the template graph data is 0.

The publication in seed 1 is 579269. Firstly we queried the edges whose target is 579269, and there are 6 nodes related to publication 579269:

Then we query the edges whose source is 0:

Except for one edge with eType 4, the rest are all edges with eType 5.

Among the 6 nodes we previously found, only two nodes can satisfy this condition:

Using node 0 in the template graph and node 600971, node 611692 in the large graph as targets respectively to find the edges with eType 5, and the query results are as follows:

Because there is a big difference from the value of node 0's edge Weight, neither 600971 nor 611692 can match the node 0 in the template graph. Seed1 cannot find a suitable subgraph matching to template graph.

Using similar way to analyze Seed2 and Seed3, we did not find that they may lead to other subgraphs that may match the template, but the characteristics of the seeds inspired us to start from the more obvious and easily distinguishable edges in the template. Then use characteristics of these edges to find matching edges in the large graph to get key nodes and from these key nodes, wo could find the matching subgraph from the large graph.

3 Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

Firstly analyzed the entire template graph. 67 and 39 in the template graph communicate frequently, and there are transaction-related edges pointing to 657187. First, observe the edges between 67 and 39:

Then we searched the entire large graph and found the only pair of nodes that are consistent with the frequency of communication between the node 67 and 39 are 595104 and 570284. These two nodes are completely in accordance with the pattern of edges between 67 and 39:

So 595104 and 570284 correspond to 67 and 39 in template graph, respectively.

Then we can determine the transaction point they both point to:

Success! This node is 482264, which corresponds to 657187 in the template graph. Having determined these three basic nodes, we will start with these nodes and for each node in the template graph, we will find its match in the large graph.

We added some new functions to the existing visual analysis system to complete the task: find a subgraph matching the template graph from the large graph. At the same time, a matching algorithm is designed to find the best matching subgraph. The brief steps of the algorithm are as follows, which will be explained in combination with the visual analysis system functions.

First of all, we create a matched set and put the matching nodes in the template into this set; the remaining unmatched nodes are regarded as another set. Our visual analysis system will visualize the nodes in the matching set and the edges between these nodes in the template graph, and call this subgraph "template subgraph".

The second step is to establish a mapping relationship table to map the successful matching nodes in the template and the corresponding nodes in the large image. Our visual analysis system will visualize the corresponding nodes and the edges between these successfully matched nodes in the large graph, and call this subgraph "large graph subgraph". In the analysis, we can compare it with the template subgraph.

The third step is to expand the matched set. For an unmatched adjacent node B of node A in the template matching set, we add the edge between node B and the matched nodes in the template graph to the "template subgraph". Then, we find the corresponding node α of node A in the large graph through the mapping table, traverse some unmatched adjacent node β of node α, and add the edge between the node β and the matched nodes in the large graph into the "large graph subgraph". By comparing the current "template subgraph" and "large graph subgraph", we can judge whether node β can be used as the matching node of node B. If possible, node β is added to the set of possible matching nodes of node B; if it is impossible and node β is already in the set of possible matching nodes of node B, then β is deleted and β is no longer added to the set of possible matching nodes of node B.

Step 4, repeat step 3 to traverse all adjacent nodes of all nodes in the "template subgraph". In this way, we have completed a node expansion.

Step 5, if all the nodes in the template graph are not matched, jump to step 3; otherwise, exit directly.

The final matching subgraph is as follows:

4 Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

According to the exploration results of questions 3, we think the people who casued the outage are:

584811,572500,615605,638752,570284,643925,649553,604223,594766,546999,636990,581406,519424,551810,593266,570847,595104

We used our visual analysis system to compare the subgraph of large graph with the template graph, and we could easily found that they are similar in graph structure, activity time and activity space. (Note: the order of nodes in the y axies of activity time chart is not exactly the same as their matching node in the template graph, so the activity time chart may looks not very similar, but if the order is adjusted, it is exactly the same. We are sorry for having no time to modify the code due to time)

5 What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

Due to the large number of edges in the large graph, it cannot be drawn directly because no personal computer has enough performance to handle this kind of work. Even if this picture can be drawn on a very high-configuration computer, the screen cannot display the results effectively, and only a large number of overlaps will appear, which is confusing. In this case, it is impossible to match template graph through direct rendering and interaction. In order to solve the problem, we started from the data itself. The biggest challenge in dealing with such large-scale graph data is that there are too many edges and nodes in the graph, which makes data query become a very difficult problem. Directly operating on the original CSV file makes the query time unacceptable. It is a feasible solution to put all the data in memory and establish a hash table index for the nodes to improve efficiency, but this requires higher memory and cannot be implemented on a common configuration machine. And matching template graph from such large-scale graph is a very difficult task. Using a simple search method to match will make the time complexity unacceptable. For the problem of excessively large data size and high requirements for retrieval efficiency, we use a database for storage and use SQL for retrieval. By indexing key keys, the retrieval efficiency is improved, and this approach is convenient for the front end invoking. For the problem of matching difficulty caused by the large size of the graph, we manually analyze the features of the template graph to find out the rules and characteristics, and determine the initial matching nodes with obvious features, narrow the matching range, and then by the summarized rules with the features of the edges in the template graph to match the full graph from these initial matching nodes. In the fourth question we summarized a set of matching methods. Thanks to the Pandas CSV processing library, efficient database and SQL, which makes it easier for us to process these data.