Hanning Shao, Peking University, hanning.shao@pku.edu.cn PRIMARY

Yuchu Luo, Peking University, luoyuchu1999@qq.com

Wenqi Wang, Peking University, wangwenqi@pku.edu.cn

Xiaoru Yuan, Peking University, xiaoru.yuan@pku.edu.cn ADVISOR

**Student Team:** YES

Excel

Python

JavaScript

**Approximately how many hours were spent working on
this submission in total?**

**200 hours**

**May we post your submission in the Visual Analytics
Benchmark Repository after VAST Challenge 2020 is complete?**

YES

**Video**

Center for Global Cyber Strategy (CGCS)
researchers have used the data donated by the white hat groups to create anonymized
profiles of the groups. One such profile has been identified by CGCS
sociopsychologists as most likely to resemble the structure of the group who
accidentally caused this internet outage. You have been asked to examine CGCS
records and identify those groups who most closely resemble the identified
profile.

**Questions**

**1** – Using visual analytics, compare the template subgraph with the potential
matches provided. Show where the two graphs agree and disagree. Use your tool
to answer the following questions:

- Compare the five candidate
subgraphs to the provided template. Show where the two graphs agree and
disagree. Which subgraph matches the template the best? Please limit your
answer to seven images and 500 words.

To compare the five candidate subgraphs with the provided template, we use a force-directed graph to show where the two graphs agree and disagree. The force-directed graph includes four types of nodes and five types of edges. We leave out the demographic channel because it describes personal habits rather than relations between people. We also exclude the co-authorship channel because it accounts for only a very small proportion of the template and the candidate graphs.

First, we compare the products involved in the five subgraphs and the template. The template and Graph1 each contain only one product, with ID 657187. Graph2 also contains only one product, with ID 487668, and in Graph3 the only product is 476813. Graph4 and Graph5 are quite different: both involve many products.

Second, the core communication center in each graph reveals more detail. The ID of the core person is 635665 in Graph1, 629672 in Graph2, and 528892 in Graph3; in the template it is 41. Graph4 and Graph5 have no such node.

Third, wherever a core communication center exists, we can find a derived group of people who communicate closely. In the template, this group contains 11 people. In the candidate subgraphs, it contains 11 (Graph1), 9 (Graph2), and 6 (Graph3) people.
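
The search for a core communication center can be sketched in Python as follows. This is a minimal illustration rather than the tool we used: the edge tuples `(source, target, channel)`, the channel labels `"email"` and `"phone"`, and the rule "the center is the person with the most distinct contacts" are all assumptions made for the example.

```python
from collections import defaultdict

# Illustrative channel labels (assumption); the challenge files encode
# channels differently.
COMMUNICATION_CHANNELS = {"email", "phone"}


def communication_neighbors(edges):
    """Map each person to the set of people they emailed or phoned."""
    neighbors = defaultdict(set)
    for source, target, channel in edges:
        if channel in COMMUNICATION_CHANNELS and source != target:
            neighbors[source].add(target)
            neighbors[target].add(source)
    return neighbors


def core_center(edges):
    """Return (center, group): the person with the most distinct contacts,
    and that person together with those contacts."""
    neighbors = communication_neighbors(edges)
    if not neighbors:
        return None, set()
    # sorted() makes ties deterministic: the smallest ID wins.
    center = max(sorted(neighbors), key=lambda person: len(neighbors[person]))
    return center, {center} | neighbors[center]


# Toy data: person 41 contacts three people; the travel edge is ignored.
edges = [
    (41, 1, "email"), (41, 2, "phone"), (41, 3, "email"),
    (1, 2, "email"), (5, 6, "travel"),
]
center, group = core_center(edges)
print(center, sorted(group))
```

Counting the center together with its contacts mirrors how the group sizes above are reported, where the group of 11 in the template includes #41.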

Additionally, Graph2 involves only four places, while each of the other graphs involves all six.

In conclusion, the force-directed graph shows that Graph4 and Graph5 are quite different from the template because they involve a lot of purchase information. Graph1 matches the provided template best, followed by Graph2 and then Graph3.

The distributions of phone calls and emails provide further evidence for this conclusion. The following figures show the number of communications grouped by hour of the day and by week of the year. In the chart for each candidate graph, the bars represent the number of emails and phone calls in that candidate, and the line shows the corresponding number in the template graph for easy comparison. Graph1 and Graph2 have distributions similar to the template's, while the other three graphs share fewer similarities.
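
A minimal Python sketch of this binning is given below. It assumes that every communication record carries an integer timestamp in seconds counted from the start of the year; the function names are hypothetical. The `l1_distance` helper is an addition for illustration only: it is one possible way to put a number on the similarity, whereas our comparison was made visually.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def hourly_counts(timestamps):
    """Count communications per hour of the day (24 bins)."""
    counts = Counter((t % SECONDS_PER_DAY) // SECONDS_PER_HOUR for t in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def weekly_counts(timestamps, weeks=53):
    """Count communications per week of the year; late records fall in the last bin."""
    counts = Counter(min(t // SECONDS_PER_WEEK, weeks - 1) for t in timestamps)
    return [counts.get(week, 0) for week in range(weeks)]


def normalized(counts):
    """Turn raw counts into shares so graphs of different sizes are comparable."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def l1_distance(counts_a, counts_b):
    """Sum of absolute differences between two normalized distributions."""
    return sum(abs(a - b) for a, b in zip(normalized(counts_a), normalized(counts_b)))


# Toy timestamps (assumption): two messages in hour 0 and two in hour 1.
template_hours = hourly_counts([0, 3600, 3601, 86400])
candidate_hours = hourly_counts([10, 20, 4000, 5000])
print(l1_distance(template_hours, candidate_hours))
```

Normalizing before comparing keeps a candidate with more messages overall from looking different only because of its size.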

- Which key parts of the best
match help discriminate it from the other potential matches? Please limit
your answer to five images and 300 words.

The template graph has some distinct patterns, which are marked in the figure below. First, one person plays an important role in the communication network: #41 in the graph. To his right is a group of 11 people (including #41) who email or call each other frequently.

In addition, the graph involves only one product, #657187, with one seller, #67, and one customer, #39. This pattern also helps to match graphs.

Globally, people in the template graph can be divided into three groups: some communicate with others frequently but never travel anywhere, some travel a lot but have no email or phone records, and the rest both communicate and travel. These three groups are clearly visible in the force-directed layout.
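
The three groups described above can also be separated with simple set operations. The sketch below is only an illustration under assumed conventions: edges are `(source, target, channel)` tuples, the channel labels are hypothetical, and a travel edge is assumed to point from a person to a place.

```python
def classify_people(edges):
    """Split people into communicate-only, travel-only, and both."""
    communicators, travelers = set(), set()
    for source, target, channel in edges:
        if channel in ("email", "phone"):
            communicators.update((source, target))
        elif channel == "travel":
            # Assumption: the source is the traveler and the target is a place.
            travelers.add(source)
    return {
        "communicate_only": communicators - travelers,
        "travel_only": travelers - communicators,
        "both": communicators & travelers,
    }


# Toy data: person 1 only communicates, person 3 only travels, person 2 does both.
toy_edges = [
    (1, 2, "email"),
    (2, 1, "phone"),
    (2, 900, "travel"),
    (3, 901, "travel"),
]
people_groups = classify_people(toy_edges)
print(people_groups)
```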

We can use the patterns mentioned above to rule out Q1-Graph4 and Q1-Graph5 easily. Both contain special patterns that do not appear in the template. For example, person #636721 in Graph4 and person #524153 in Graph5 each sell a great many products, but the template involves only one product. Graph4 and Graph5 also contain complicated networks of selling and purchasing. As a result, we rule out these two graphs.

In Graph3, we cannot find a key person who leads a communication group the way #41 does in the template graph. We can only find a group of six whose members communicate with each other at a relatively high frequency, and this group looks quite different from the group in the template graph.

**2** – CGCS has a set of “seed” IDs that may be
members of other potential networks that could have been involved. Take a look
at the very large graph. Can you determine if those IDs lead to other networks
that matches the template? Describe your process and findings in no more than
ten images and 500 words.

To find out whether these seed nodes lead to networks that match the template, we first generate a subgraph centered at each seed node as a fundamental network. We then explore the fundamental subgraph to find characteristics that reveal where it agrees and disagrees with the template network. Finally, we adjust the fundamental subgraph to obtain the network led by the seed that is most likely to match the template graph.

The first step is to build a fundamental subgraph from each seed. We discuss each seed separately.

In Q2-Seed1.csv, the only edge represents an authorship relationship: person #600971 is a co-author of document #579269. Because the document node carries little information, we focus on the person node #600971. Node #600971 has 3839 neighbors; the numbers of neighbors of types 1 to 5 are 3768, 0, 4, 22, and 45, respectively. We regard travel and financial-flow information as local personal data rather than relationships, so we only consider communication (email and phone) and co-authorship when constructing the fundamental subgraph. We set a threshold T and take all nodes that satisfy any one of the following conditions: being a document related to #600971, being a person who has co-authored a document with #600971, or being a person who has communicated with #600971 at least T times. To best fit the size of the template graph, we set T to 3.
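
The three selection conditions can be sketched in Python as shown below. This is a minimal illustration, not our actual script: the edge tuples `(source, target, channel)`, the channel labels, the person-to-document direction of co-authorship edges, and the function name `fundamental_subgraph` are assumptions made for the example.

```python
from collections import Counter


def fundamental_subgraph(edges, seed, threshold):
    """Return the node set of the fundamental subgraph around `seed`.

    A node is kept if it is a document authored by the seed, a co-author of
    such a document, or a person with at least `threshold` emails or phone
    calls exchanged with the seed (in either direction).
    """
    communication_count = Counter()
    authors_by_document = {}
    for source, target, channel in edges:
        if channel in ("email", "phone"):
            if source == seed:
                communication_count[target] += 1
            elif target == seed:
                communication_count[source] += 1
        elif channel == "coauthor":
            # Assumption: co-authorship edges point from person to document.
            authors_by_document.setdefault(target, set()).add(source)

    seed_documents = {d for d, authors in authors_by_document.items() if seed in authors}
    coauthors = set().union(*(authors_by_document[d] for d in seed_documents)) - {seed}
    frequent_contacts = {p for p, n in communication_count.items() if n >= threshold}
    return {seed} | seed_documents | coauthors | frequent_contacts


# Toy data: seed 1 co-authors document 100 with person 2, talks to
# person 4 three times and to person 5 once; travel edges are ignored.
toy_edges = [
    (1, 100, "coauthor"), (2, 100, "coauthor"), (3, 101, "coauthor"),
    (1, 4, "email"), (4, 1, "phone"), (1, 4, "email"),
    (1, 5, "email"), (1, 6, "travel"),
]
print(sorted(fundamental_subgraph(toy_edges, seed=1, threshold=3)))
```

Raising the threshold shrinks the subgraph and lowering it grows the subgraph, which is how the size can be tuned toward that of the template.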

Q2-Seed2.csv also contains only one co-authorship edge. The person in the relationship is #538771. However, apart from co-authorship, we did not find any other relationships involving #538771 in the large graph. #538771 is therefore so marginal in the network that we cannot extract enough information to find a matching subgraph, and we conclude that seed2 does not lead to a potential matching subgraph.

In Q2-Seed3.csv, unlike the previous two seeds, the only edge is a product-selling edge from #574136. We also identified the buyer node, #620791, from the sales volume. Both #574136 and #620791 have a large number of trade records as well as much more communication than #600971 in seed1, which makes the analysis more difficult.

#574136 has more trade records than seed1 but far fewer communications.

#620791 is a typical businessman with a huge volume of sales and purchases. The first graph below shows the products related to him; the edges are very thick because of the large trade weights. The second graph below shows the communications related to #620791 under the threshold T=4, revealing the high-frequency communication related to #620791.

Because of the significant trade relationships in seed3, it is relatively unlikely that seed3 leads to a subgraph that matches the template.

In conclusion, seed1 is the most likely to lead to a subgraph that matches the template.

**3** – Optional: Take a look at the very large graph.
Can you find other subgraphs that match the template provided? Describe your
process and your findings in no more than ten images and 500 words.

We use different filters to find special nodes in the large graph. For example, we set filters on personal characteristics from the demographic channel. We find some categories in which a small group of people spends differently from the large majority. A few people do not pay for tobacco or electricity, and only a very small group spends money on natural gas. These individual patterns can help to pick out special persons who, like the seeds in Q2, serve as starting points when searching for subgraphs in the large graph.
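
A filter of this kind can be sketched in Python as follows. The sketch is an illustration under assumptions: the demographic channel is assumed to have been loaded into a dictionary that maps each person to their spending per category, and the function name `minority_groups`, the category names, and the `max_share` cutoff are hypothetical.

```python
def minority_groups(spending, max_share=0.05):
    """Find categories where a small minority spends differently.

    `spending` maps person -> {category: amount}. For each category, return
    the minority side ("spenders" or "non_spenders") when that side is
    non-empty and covers at most `max_share` of all people.
    """
    people = set(spending)
    categories = {c for record in spending.values() for c in record}
    result = {}
    for category in sorted(categories):
        spenders = {p for p in people if spending[p].get(category, 0) > 0}
        non_spenders = people - spenders
        for label, group in (("spenders", spenders), ("non_spenders", non_spenders)):
            if group and len(group) <= max_share * len(people):
                result[category] = (label, group)
    return result


# Toy data (assumption): 20 people; person 3 pays no electricity and
# person 7 is the only one who buys natural gas.
toy_spending = {p: {"electricity": 50, "food": 200} for p in range(20)}
toy_spending[3] = {"food": 180}
toy_spending[7]["natural_gas"] = 30
special = minority_groups(toy_spending, max_share=0.1)
print(special)
```

The people returned for each category are the candidates that could then be used as starting points for a subgraph search.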

**4** – Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

Based on the preceding analyses, we consider Graph1 the most similar to the template graph. We therefore try to match each node in Graph1 to a node in the template graph.

From the structural comparison in Q1, we can match some key nodes of the two graphs. For the selling and purchasing part, the product is #657187; the seller should be #512397, which matches #39 in the template, and the buyer should be #550287, which matches #67 in the template. In addition, the special travel pattern, marked by the green circle in the figure, indicates that place #69 represents #509607 and that the three travelers #82, #83, and #84 represent person nodes #538892, #542965, and #572391.

We also suppose that #41 represents person #635665, because both 'lead' a communication group: each communicates with all the members of the group marked in blue, and both groups have eleven members. Further evidence comes from personal characteristics such as the demographic information. According to the figure below, some persons in the group do not match very well, and we should search further in the large graph to find the nodes they represent.

**5** – What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

The large graph contains so much data that querying and filtering it was our greatest challenge. Because of the sheer volume, filtering the data was time-consuming and inefficient, and it took us a lot of time to work around the problem.

It is also difficult to find a suitable subgraph from a given seed. Nodes in the large graph usually have so many neighbors that it is almost impossible to add them all to the subgraph; if we did, the subgraph would show little useful information when compared with the template graph, which involves only about thirty people.

Our countermeasure was to avoid visualizing the large-scale data as a graph as much as possible. Using various filtering conditions, we extracted data from the large graph and visualized the resulting small-scale data to find more details. We also divided and classified the large-scale data into small files to speed up filtering.

When we find that there are a great number of
candidates, we simply add more filter criteria until the number of candidates
drops to a reasonable amount.
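
This tightening loop can be sketched in Python as follows. It is a minimal illustration: the function name `narrow_candidates`, the idea of expressing each criterion as a predicate function, and the toy criteria are all assumptions made for the example.

```python
def narrow_candidates(candidates, filters, max_count):
    """Apply filters one at a time until at most `max_count` candidates remain.

    Returns the remaining candidates and the number of filters applied.
    """
    remaining = list(candidates)
    applied = 0
    for predicate in filters:
        if len(remaining) <= max_count:
            break
        remaining = [c for c in remaining if predicate(c)]
        applied += 1
    return remaining, applied


# Toy criteria (assumption): stand-ins for real conditions on node attributes.
toy_filters = [
    lambda n: n % 2 == 0,
    lambda n: n % 3 == 0,
    lambda n: n % 5 == 0,
]
remaining, applied = narrow_candidates(range(100), toy_filters, max_count=20)
print(len(remaining), applied)
```

Ordering the criteria from cheapest to most expensive keeps the costly checks for the point where few candidates are left.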

To make this kind of data easier to work with, more efficient data management tools are needed. A better data server designed to store and manage the data would greatly improve efficiency. In addition, algorithms that automatically match subgraphs and rule out candidates that are too different from the template would let us explore a larger proportion of the large graph.