Entry Name: PKU-Shao-MC1

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Hanning Shao, Peking University, hanning.shao@pku.edu.cn  PRIMARY

Yuchu Luo, Peking University, luoyuchu1999@qq.com

Wenqi Wang, Peking University, wangwenqi@pku.edu.cn

Xiaoru Yuan, Peking University, xiaoru.yuan@pku.edu.cn      ADVISOR



Student Team: YES

 

Tools Used:

Excel

Python

JavaScript

 

Approximately how many hours were spent working on this submission in total?

200 hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete?

YES

 

Video  

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

Questions

1 –– Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

  1. Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

To compare the five candidate subgraphs with the provided template, we use a force-directed graph to show where the graphs agree and disagree. The force-directed graph includes four types of nodes and five types of edges. Because the demographic channel describes personal habits rather than relations between people, we do not include it. We also find that the co-authorship channel accounts for only a very small proportion of the template and candidate graphs, so we exclude it from the force-directed graph as well.
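The channel selection described above can be sketched as a simple edge filter. This is only an illustration: the channel codes and the dictionary layout of an edge are assumptions, not the actual encoding of the challenge data, and the node IDs in the example are made up.

```python
# Hypothetical channel codes (assumption); check the real encoding in the data description.
EMAIL, PHONE, SELL, BUY, COAUTHOR, DEMOGRAPHIC, TRAVEL = range(7)

# Channels left out of the force-directed graph, as described in the text.
EXCLUDED_CHANNELS = {COAUTHOR, DEMOGRAPHIC}


def layout_edges(edges):
    """Return only the edges whose channel is drawn in the force-directed graph."""
    return [edge for edge in edges if edge["etype"] not in EXCLUDED_CHANNELS]


# Toy edges with made-up IDs, one per channel.
example_edges = [
    {"source": 1, "target": 2, "etype": EMAIL},
    {"source": 2, "target": 3, "etype": PHONE},
    {"source": 3, "target": 10, "etype": SELL},
    {"source": 4, "target": 10, "etype": BUY},
    {"source": 1, "target": 20, "etype": COAUTHOR},
    {"source": 1, "target": 30, "etype": DEMOGRAPHIC},
    {"source": 2, "target": 40, "etype": TRAVEL},
]

kept_edges = layout_edges(example_edges)
kept_channels = {edge["etype"] for edge in kept_edges}
```

With seven assumed channels and two excluded, five edge types remain, which agrees with the five edge types mentioned above.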

First, we compare the products involved in the five subgraphs and the template. The template and Graph1 each contain only one product, whose ID is 657187. Graph2 also has a single product, with ID 487668, and in Graph3 the only product is 476813. Graph4 and Graph5 are quite different: both involve many products.

Second, the core communication center in each graph reveals more detail. The ID of the core person is 635665 in Graph1, 629672 in Graph2, and 528892 in Graph3; in the template, the ID is 41. Graph4 and Graph5 have no such node.

Then, for each graph that has a core communication center, we can find a derived group whose members communicate closely. In the template, this group involves 11 people. In the candidate subgraphs, it involves 11 (Graph1), 9 (Graph2), and 6 (Graph3) people.

Additionally, only four places are involved in Graph2, while all six places are involved in the other graphs.

In conclusion, the force-directed graph shows that Graph4 and Graph5 are quite different from the template, since they involve a lot of purchase information. Graph1 matches the provided template best, followed by Graph2 and then Graph3.

The distributions of phone calls and emails provide further evidence for our conclusion. The following figures show the number of communications grouped by hour of the day and by week of the year. In each candidate graph, the bars represent the numbers of emails and phone calls, and the line shows the corresponding numbers for the template graph for easy comparison. Graph1 and Graph2 have distributions similar to the template, while the other three graphs share fewer similarities.
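The grouping behind these distributions can be sketched as follows. This is a minimal sketch that assumes each communication carries a timestamp in seconds counted from the start of the year; the real time encoding may differ, and the timestamps in the example are made up.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def hour_of_day(timestamp):
    """Hour of the day (0-23), assuming the timestamp is in seconds from midnight of day 0."""
    return (timestamp % SECONDS_PER_DAY) // SECONDS_PER_HOUR


def week_of_year(timestamp):
    """Zero-based week index, assuming the timestamp counts seconds from the start of the year."""
    return timestamp // SECONDS_PER_WEEK


def communication_distributions(timestamps):
    """Count communications per hour of day and per week of year."""
    by_hour = Counter(hour_of_day(t) for t in timestamps)
    by_week = Counter(week_of_year(t) for t in timestamps)
    return by_hour, by_week


# Made-up timestamps of emails and phone calls.
example_times = [0, 3600, 3700, 90000, 700000]
by_hour, by_week = communication_distributions(example_times)
```

Computing the same two counters for the template and for each candidate gives the bars and the comparison line shown in the figures.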

  2. Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

The template graph has some distinct patterns, which are marked in the figure below. First, one person plays an important role in the communication network: #41 in the graph. To his right is a group of 11 people (including #41) who email or call each other frequently.

In addition, only one product, #657187, is involved in the graph, with one seller #67 and one customer #39. This pattern also helps to match graphs.

Globally, people in the template graph can be divided into three groups: some communicate with others frequently but never travel anywhere, some travel often but have no email or phone records, and the rest both communicate and travel. These three groups are clearly noticeable in the force-directed graph.
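The three groups can also be derived directly from the edge list. The sketch below assumes edges are given as (source, target, channel) tuples with string channel names and that travel edges point from a person to a place; these conventions and the IDs in the example are illustrative assumptions.

```python
COMMUNICATION_CHANNELS = {"email", "phone"}


def classify_people(edges):
    """Split people into communicate-only, travel-only, and both.

    edges: iterable of (source, target, channel) tuples (assumed layout).
    """
    communicators = set()
    travelers = set()
    for source, target, channel in edges:
        if channel in COMMUNICATION_CHANNELS:
            # Both endpoints of an email or phone edge are people.
            communicators.update((source, target))
        elif channel == "travel":
            # Only the source is a person; the target is a place.
            travelers.add(source)
    return {
        "communicate_only": communicators - travelers,
        "travel_only": travelers - communicators,
        "both": communicators & travelers,
    }


# Toy edges with made-up IDs; 900 stands for a place node.
example_edges = [
    (1, 2, "email"),
    (2, 3, "phone"),
    (3, 900, "travel"),
    (4, 900, "travel"),
]
groups = classify_people(example_edges)
```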

We can use the patterns mentioned above to rule out Q1-Graph4 and Q1-Graph5 easily. Both graphs contain patterns that do not appear in the template. For example, person #636721 in Graph4 and person #524153 in Graph5 each sell a great many products, whereas only one product is involved in the template. Graph4 and Graph5 also contain complicated networks of selling and buying. As a result, we rule out these two graphs.

As for Graph3, we cannot find a key person who leads a communication group like #41 in the template graph. We can only find a group of six whose members communicate with each other at a relatively high frequency, but it looks quite different from the group in the template graph.

2 – CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.

To find out whether these seed nodes lead to potential networks that match the template network, we first generate a subgraph centered on each seed node as a fundamental network. We then explore the fundamental subgraph to find characteristics that reveal where it agrees and disagrees with the template network. Finally, we adjust the fundamental subgraph to obtain the network led by the seed that is most likely to match the template graph.

The first step is to build a fundamental subgraph from each seed. We discuss each seed separately.

In Q2-Seed1.csv, the only edge represents an authorship relationship: person #600971 is a co-author of document #579269. Because the document node carries little information, we focus on the person node #600971. Node #600971 has 3839 neighbors; the numbers of neighbors of types 1 to 5 are 3768, 0, 4, 22, and 45, respectively. Because we regard travel and financial-flow information as local personal data rather than relationships, we only consider communication (email and phone) and co-authorship when constructing the fundamental subgraph. We set a threshold T and take all nodes that satisfy any one of the following conditions: being a document related to #600971, being a person who has co-authored a document with #600971, or being a person who has communicated with #600971 at least T times. To best fit the size of the template graph, we set T to 3.

Q2-Seed2.csv also contains only one co-authorship edge. The person in the relationship is #538771. However, apart from co-authorship, we did not find any other relationships involving #538771 in the large graph. This means #538771 is so marginal in the network that we cannot extract enough information from it to find a matching subgraph. We therefore conclude that seed2 does not lead to a potential matching subgraph.
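The selection rule with threshold T can be sketched as below. This is a minimal sketch rather than our actual tooling: it assumes edges are (source, target, channel) tuples, that co-authorship edges point from a person to a document, and it uses made-up node IDs.

```python
from collections import Counter

COMMUNICATION_CHANNELS = {"email", "phone"}


def fundamental_subgraph(edges, seed, threshold):
    """Return the node set of the fundamental subgraph around a seed person.

    edges: list of (source, target, channel) tuples (assumed layout);
    co-authorship edges are assumed to point from person to document.
    """
    contact_counts = Counter()
    documents = set()
    for source, target, channel in edges:
        if channel in COMMUNICATION_CHANNELS:
            # Count every email or call between the seed and another person.
            if source == seed:
                contact_counts[target] += 1
            elif target == seed:
                contact_counts[source] += 1
        elif channel == "coauthor" and source == seed:
            documents.add(target)

    # People who co-authored one of the seed's documents.
    coauthors = {
        source
        for source, target, channel in edges
        if channel == "coauthor" and target in documents and source != seed
    }
    # People who communicated with the seed at least `threshold` times.
    frequent_contacts = {
        person for person, count in contact_counts.items() if count >= threshold
    }
    return {seed} | documents | coauthors | frequent_contacts


# Toy edges with made-up IDs; node 1 is the seed and node 100 is a document.
example_edges = [
    (1, 100, "coauthor"),
    (2, 100, "coauthor"),
    (1, 3, "email"),
    (3, 1, "phone"),
    (1, 3, "email"),
    (1, 4, "email"),
    (5, 6, "email"),
]
subgraph_nodes = fundamental_subgraph(example_edges, seed=1, threshold=3)
```

In this toy example, person 3 reaches the threshold of 3 communications, person 4 does not, and person 2 is included through the shared document. Raising the threshold shrinks the subgraph, which is how the size can be tuned toward the size of the template graph.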

In Q2-Seed3.csv, unlike the previous two seeds, the only edge is a product-selling edge from #574136. We also found the buyer node, #620791, according to the sale volume. Both #574136 and #620791 have a large number of trade records as well as much more communication than #600971 in seed1, which makes the analysis relatively more difficult.

#574136 has more trade records than seed1 but far fewer communications.

#620791 is a typical businessman with a huge amount of selling and purchasing. The first graph below shows the products related to him; the edges are very thick because of the large trade weights. The second graph below shows the communication related to #620791 under the threshold T=4, revealing his high-frequency communication.

Because of the significant trade relationships in seed3, the possibility that seed3 leads to a graph matching the template is relatively low.

In conclusion, seed1 is the most likely to lead to a subgraph that matches the template.

3 – Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

We use different filters to find special nodes in the large graph. For example, we set filters on personal characteristics such as those in the demographic channel. We find some categories in which a small number of people spend differently from the large majority. Only a few people do not pay for tobacco or electricity, and only a very small number of people spend money on natural gas. These individual patterns can help to select special persons who, like the seeds in Q2, serve as starting points when searching for subgraphs in the large graph.
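Such a demographic filter can be sketched as follows. The nested-dictionary layout, the category names, and the person IDs are assumptions made for illustration and do not come from the challenge data.

```python
def people_without_spending(spending, category):
    """People who have no recorded spending in the given category."""
    return {
        person
        for person, by_category in spending.items()
        if by_category.get(category, 0) == 0
    }


def people_with_spending(spending, category):
    """People who have some recorded spending in the given category."""
    return {
        person
        for person, by_category in spending.items()
        if by_category.get(category, 0) > 0
    }


# Made-up demographic records: person -> {category: amount}.
example_spending = {
    1: {"tobacco": 120, "electricity": 300},
    2: {"tobacco": 80, "electricity": 250},
    3: {"electricity": 310, "natural gas": 40},
    4: {"tobacco": 95},
}

no_tobacco = people_without_spending(example_spending, "tobacco")
no_electricity = people_without_spending(example_spending, "electricity")
uses_natural_gas = people_with_spending(example_spending, "natural gas")
```

People who appear in several of these rare sets are natural starting points for a subgraph search.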

4 –– Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

According to the preceding analyses, Graph1 shares the most similarities with the template graph. We therefore try to match each node in Graph1 to a node in the template graph.

Based on the structural comparison in Q1, we can match some key nodes of the two graphs. For the selling and purchasing part, the product is #657187; the seller should be #512397, which matches #39 in the template, and the buyer should be #550287, which matches #67 in the template. In addition, the special travel pattern, marked by the green circle in the figure, indicates that place #69 represents #509607 and that the three travelers #82, #83, and #84 represent person nodes #538892, #542965, and #572391.

We also suppose that #41 represents person #635665, because both 'lead' a communication group: they communicate with all the members of the group marked in blue. Both groups have eleven members. Further evidence can be found in their personal characteristics, such as the demographic information. According to the figure shown below, some persons in the group do not match very well, and further searches in the large graph are needed to find the nodes they represent.

5 –– What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

When working with the large graph, the sheer amount of data made querying and filtering the greatest challenge we had. Because of the data volume, filtering is time-consuming and inefficient, and it took us a lot of time to solve this problem.

It also proved difficult to find a suitable subgraph from a given seed. Nodes in the large graph usually have so many neighbors that it is almost impossible to add them all to the subgraph; if we did, the subgraph would show little useful information when compared with the template graph, which involves only about thirty people.

Our countermeasure was to avoid graph visualization of the large-scale data as much as possible. Using various filtering conditions, we filter data from the large graph and visualize the resulting small-scale data to find more details. For the large-scale data, we divided and classified the data into small files to accelerate the filtering process.

When we find a great number of candidates, we simply add more filter criteria until the number of candidates drops to a reasonable amount.
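This step-by-step narrowing can be sketched as a loop over filter predicates. The candidate records, the predicates, and the limit below are hypothetical and only illustrate the idea.

```python
def narrow_candidates(candidates, filters, max_candidates):
    """Apply filters one by one until few enough candidates remain.

    candidates: list of candidate records.
    filters: list of predicates; each returns True for candidates to keep.
    Returns the remaining candidates and the number of filters applied.
    """
    remaining = list(candidates)
    applied = 0
    for keep in filters:
        if len(remaining) <= max_candidates:
            break  # already a manageable amount, no further criteria needed
        remaining = [candidate for candidate in remaining if keep(candidate)]
        applied += 1
    return remaining, applied


# Made-up candidate summaries: contact count and number of products traded.
example_candidates = [
    {"id": 1, "contacts": 12, "products": 1},
    {"id": 2, "contacts": 3, "products": 1},
    {"id": 3, "contacts": 15, "products": 40},
    {"id": 4, "contacts": 11, "products": 1},
    {"id": 5, "contacts": 2, "products": 25},
]

example_filters = [
    lambda c: c["contacts"] >= 10,  # keeps active communicators
    lambda c: c["products"] <= 1,   # keeps people with at most one product
    lambda c: c["id"] != 4,         # not reached in this example
]

remaining, applied = narrow_candidates(example_candidates, example_filters, max_candidates=2)
remaining_ids = [c["id"] for c in remaining]
```

Ordering the filters from cheapest to most expensive keeps the costly checks for the smallest candidate sets.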

To make it easier to work with this kind of data, more efficient data management tools could be used. If a better data server were designed to store and manage the data, efficiency would be greatly improved. Also, if algorithms were developed to perform automatic matching and rule out candidates that are too different from the template, we would be able to explore a larger proportion of the large graph.