Entry Name: UKON-Englert-MC1

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members

David Englert – david.englert@uni-konstanz.de – University of Konstanz – Konstanz – Germany PRIMARY

Pablo Martinez Blasco – University of Konstanz – Konstanz – Germany

Eren Cakmak – University of Konstanz – Konstanz – Germany

Udo Schlegel – University of Konstanz – Konstanz – Germany

Juri Buchmüller – University of Konstanz – Konstanz – Germany

Prof. Dr. Daniel A. Keim – University of Konstanz – Konstanz – Germany

Student Team

Yes

Tools Used

yEd

Excel

Python3: matplotlib, seaborn, pandas, networkX, graph-tool, karateclub, flask

somewraps: as existing tools fell short in analyzing the dataset, we developed our own tool Something with Graphs based on D3 and Angular

Approximately how many hours were spent working on this submission in total?

250

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete?

Yes

Video

https://www.youtube.com/watch?v=tLoSlRUezQA

Our Tool

https://vast20.dbvis.de/vast-mc1/

 

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

Questions

1 Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

a Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

The node-link diagram (Figure 1) shows the template as well as each subgraph with similar nodes placed in similar areas. The persons got divided into 5 groups: demographics, communication, travel, main communicators and outlier. The representation allows identifying the subgraphs 1, 2 and 3 as being similar regarding network structure, while subgraph 4 and 5 differs from the template.

Figure 1 -Interactive exploration view of our tool that allows a comparison of the given graphs as a node-link diagram on a global network and detailed node level. Each sub-image contains a badge with the cosine similarity compared to the template graph by representing each graph with Graph2Vec embedding without using node or edge labels.

The activitystream (Figure 2) shows the absolute aggregated amount of edges per day in each channel over the time range of the dataset. This reveals significant patterns and transactions which happen in each of the graphs.

Figure 2 - The activity of a channel is quantified by aggregating the respective edge counts, on a daily level.

An even more detailed view on the previously seen patterns of the communication channel (Figure 3) facilitates comparing the template with subgraphs 1 and 2 and among themselves.

Figure 3 -Comparing patterns across the graphs at time ranges with similar communication activity patterns.

The travel timeline plot (Figure 4) allows us to distinguish between groups of people that share common travel routes and times. The main communicator group traveling together gets less discernible in the subgraphs 2 to 5. Noticeable as well are the negative travel times in the subgraphs 4 and 5, colored in black. Considering start and destination locations and overlapping travels with a different location, both completeness and believability of this data need to be brought into question.

Figure 4 -Travel activity with start, destination and duration information of a travel sequence represented as a rectangle with the respective length and location gradient. This view reveals travel clusters over time and shows suspicious and potentially purposely faked travel data.

The clustering of similarly acting persons is achieved by taking a closer look at the financial spendings. Hereby, the amount spent per category and person was taken into account. Applying PCA as a dimensionality reduction yields visible clusters, as illustrated in the heatmap (Figure 5).

Figure 5 -Financial spendings of each person ordered by graph and cluster label.

The communication frequency (Figure 6) reveals even more structural information of the networks. The communication group can be further divided into 2 groups that both have an equal amount of communication activity with the 3 main communicators based on the activity times.

Figure 6 -Node-link diagrams with the encoded frequency of communication and activity time. Temporally parallel communications get visually clustered which facilitates interpretations of the underlying patterns, as depicted.

Based on the above findings, subgraph 1 has been identified as the best match for the given template.

b Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

Figure 7 discloses that subgraph 1 comprises a different embedding for the triangle in the graph. Yet, the respective transaction takes place within the same communication time range. Contrarily, subgraph 2 does not contain this information at all, which strongly supports the already stated hypothesis of subgraph 1 being the most similar structure. As already mentioned, the product in subgraph 2 is a different one than in the template, whereas in subgraph 1 we have a match.

Figure 7 -An In-depth look at the procurement channel with the financial information of the buyers.

Comparing the structural embeddings of the demographic categories was realized by firstly joining the two graphs and secondly clustering the derived graph. Subgraph 1 reveals a better match to the structure that arises when we combine the template with itself.

Figure 8 -Structural embedding of the demographic categories.

Contrasting once again the demographic spendings of the persons occurring in the three graphs, more similarities with spending in subgraph 1 than in subgraph 2 can be seen in Figure 9.

Figure 9 - Financial spendings of each person across 3 datasets divided into 8 clusters.

Even if the travel routes in subgraph 2 follow the first half of the template closely, those travels happen around 100 days later. However, subgraph 1 contains almost all of the travel clusters and shows a better match considering the time offset (Figure10).

Figure 10 -Detailed travel data comparison of the template against subgraph 1 and subgraph 2.

Figure 11 completes the inspection of the network's structure. The discernible differences here are the number of documents and the communication structure.

Figure 11 -Comparison of suspicious persons in the relevant graphs.

After the analysis of the underlying structures of the networks from different perspectives and angles, it can be stated that even the Graph2Vec embedding without any labeling already reveals the similarity of the graphs.

2 CGCS has a set of seed IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.

A minimum spanning tree representation of the graph is used in order to show the relative position of the seed nodes in the figures. This representation is used because it limits the number of edges shown in the graph while still maintaining a force-directed layout. We can clearly distinguish three main clusters, which will be looked into in answer 3, and a lot of nodes outside of them which main edge type is communication.

The position of the seed nodes provided in the graph can be found in Figure 12. We can clearly see that seed 2 stands completely in the outside part of the graph where the main (and usually only) edge type is communication (emails and calls), since these nodes do not have financial information we can exclude them of forming part of a subgraph similar to the template provided.

Seed 1 conforms a document ownership, seed 3 is a product sell transaction. They both lie inside the main cluster of the graph (and also inside the biggest component which makes up 70% of the graph). We tried computing the embeddings of the big graph in order to find similarities with the template provided, but the computation took more resources than the ones we had available to us. Therefore, we tried to infer the knowledge gained from question 1 in looking for a similar subgraph.

At this point since we were manually going to look for a similar subgraph we needed to make an assumption, we either recognize the seeds as not part of this template subgraph or we consider them part of it. Due to limited computational power, we assumed the second option, and identify the seed edges as the only document and sell edge that appears in the template provided.

The main characteristic of the template graph we found is that the buyer of the product travels to the same location as the other two people, communicates with them, and all three of them connect to a node that directly communicates to the writer of the given document. Following this path would allow us to perform one sequence of operations to get from product to document, and we would only have to reverse it to get the product from the document. A very important issue we found with this approach is that the nodes to check scale very rapidly, as shown in Figure 13, which once again reach outside of our computing power.

With our previous assumption and the knowledge obtained from manually looking into the seeds, we can tell that both seeds 1 and 3 lead to subgraphs that could be similar to the template provided.

Figure 12 -MST of the big graph with annotated position of seed nodes.

Figure 13 -Manual exploration of seed nodes (and scalability issues). a) Possible buyers of seed 3 product. b) Common travel persons in locations of the most likely buyer. c) Communication pairs of neighbor nodes of the buyer in one shared location. d) Document-Seller communication paths

3 Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

In the previous question, it was mentioned that the graph had 3 main clusters, with the seeds being part of the biggest one. We wanted to explore the other two in order to see if there were any possible similarities with the template.

Upon a closer look, we can see that one of the clusters (left one in Figure 14) takes a similar structure to the outside of the network where it is predominantly formed of communication edges, from what we can start to assume that it probably is not similar to the template. In order to confirm this, we applied k-cores to the graph, in order to see which nodes take more importance in the network (Figures 16 and 17). We can observe that the outside of the graph has a low core number (green color), and the two main clusters contain values close to the maximum k-core (1447), as seen with the wider arrange of colors of their nodes in Figure 16. Our previous hypothesis about the third cluster is also confirmed as it shares a low k-core number further indicating the dissimilarity with the other clusters and its similarity to the outside part of the graph.

From this observations, we understand that the cluster that does not contain the seeds may also contain subgraphs that can be similar to the template, but we have not done further investigation on the matter.

Figure 14 - Cluster zoom of MST with edge type coloring.

Figure 15 - K-core decomposition of the complete graph

Figure 16 - Cluster zoom of k-core decomposition

4 Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

While we have not located the intended group, we have our suspicions that seed 1 will lead to a group similar to the template given, at least with more confidence than for seed 3, since we have only 3 travel locations in its subgraph. Seed 1 on the other hand has 5 locations in the neighborhood, bringing it closer to that of the template graph.

Another important discovery we made is that the subgraphs provided for question 1 were subgraphs of the big graph where edges where cut. Based on this discovery we can make assumptions that any subgraph similar to the template needs to have a higher number of edges than the template itself (e.g. if the product is sold/bought a certain amount of times, 9 in this case, all products with a lower transaction amount can be not taking into account). Based on this, we started to obtain the products that fulfill those criteria (Figure 17) but had no further time to follow this hypothesis.

Figure 17 - JSON containing each product in the graph and the persons that have more than 9 transactions with each.

5 What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

The biggest challenge of finding the suspects was the processing of the graph size. The use of graph databases such as Neo4j proved to be impractical due to a limited selection of algorithms. Consequently, it became necessary to resort to other approaches such as networkX. This approach turns out to require more than 80 GB of memory which leads to the usage of graph-tool which is C/C++ based and reduced the memory requirements to 5 GB. However, the reduction of memory requirements was weakened by a strong limitation of available libraries, e.g. in the area of calculation of embeddings. The extraction of connected components reduced the nodes of interest to 70% of the original graph, which was however still a lot amount of data needed to be processed.

Good graph cuts would help to analyze this kind of data. Also, experience with different graph structures, in general, may play a major role in getting access to the data.