Entry Name: "TUK-Kinger-MC1"

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Krishna Kinger, TU Kaiserslautern, k_kinger18@cs.uni-kl.de PRIMARY

 

Jan-Tobias Sohns, TU Kaiserslautern, j_sohns12@cs.uni-kl.de

Heike Leitte, TU Kaiserslautern, leitte@cs.uni-kl.de

 

Student Team: YES

 

Tools Used:

Python - Plotly Python Open Source Graphing Library, Dash

Visual Analytics tool created for the analysis of spatio-temporal graph data

Neo4j

 

Approximately how many hours were spent working on this submission in total?

400 Hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES

 

Video

https://youtu.be/Fl7KnPsomf4

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile

Questions

1 –– Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

a.       Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

Ans. The temporal transactions from communication, procurement and travel channel are plotted for the template graph (shown in Figure 1) and five subgraphs (shown in Figure 2, 3, 4, 5, 6). The figure contains timeline in the horizontal axis and person nodes as categorical vertical axis. Call, email and procurement transactions are represented via color coded vertical line and travel transactions are shown via horizontal line. The lines contain distinct markers on both ends for representing the source and destination. The nodes color represents the spatial location of the person at that instance of time.   The person nodes forming clusters are arranged together and the node with highest centrality is kept near the center to remove the clutter and for better visibility of patterns. Further, the nodes with only participation in the travel channel are kept together towards the upper end of plot window. The relative position of nodes in the subgraph’s is kept relative to their similarities with the nodes of template graph. These similarities are computed on the basis of centrality measure, participation in different phases of the timeline, participation in the different channels and the demographic attributes. 

Figure 1: Transactions in template graph with patterns subdivided into phases

 

Figure 2: Transactions in subgraph-1 with patterns subdivided into phases

Figure 3: Transactions in subgraph-2 with patterns subdivided into phases

Figure 4: Transactions in subgraph-3 with patterns subdivided into phases

Figure 5: Transactions in subgraph-4 with patterns subdivided into phases

Figure 6: Transactions in subgraph-5 with patterns subdivided into phases

 

Similarities with Template: On comparison of subgraphs with the template graph, the subgraphs 1 and 2 show maximum similarities in the temporal patterns in all the six phases. The major similarity is in the communication channel and includes similar cluster in all the phases. The procurement channel transactions in both the subgraphs involve a single seller and buyer which is similar to template. Further, in the subgraph-1, the product (nodeid-657187) involved in the procurement is the same as in the template. The subgraph-1 also shows similarity in the travel channel which also includes the similar source and destination locations.

The subgraph 3 shows similarity for the communication channel in phase 2, 3 and 5 but the clusters amongst the person nodes is not similar. The travel patterns in phase 4 of subgraph 3 shows similarity with the phase 3 travel transactions of the template graph.

 

Disagreements with Template: The timespan of phase 2 in subgraph1 is less when compared to that of template graph. The number of call transactions in phase 2, 3 and 4 are less when compared to that of template graph. The communication transactions are also sparse in these phases when compared to template. The subgraph-2 differs in terms of missing clusters in phase 3 and phase 5. The procurement channel also involves sale-purchase of product different than the template.

The subgraph-3 does not include clusters similar to template and also differs due to the absence of communication transactions in phase 6.

The subgraphs 4 and 5 contains very sparse transactions and does not form any clear patterns that resembles the template graph. Further, the procurement channel in these subgraphs does not include complete seller-buyer transaction. The individual sale and purchase records also do not match the template in terms of product-id, timeline and price.

 

Overall, the subgraph-1 and subgraph-2 are potential good matches for the template with subgraph-2 showing more similarity.

 

b.      Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

Ans:  Following are the evidence from individual channel supporting the claim that subgraph-2 is best match with the template:

1.       Communication Channel: Subgraph-1 provides overall better match in terms of clusters to template than Subgraph-2. However, there are certain aspects where the similarities of subgraph-2 make it closer to template. All the below points can be observed in Figure 1, 2 and 3.

o   Timespan: The timespan of phase-2 in subgraph-2 is much closer to the template than that of subgraph-1.

o   Degree of person nodes: Indegree and outdegree of nodes in the Phase 1 of subgraph-1 is not matching. In the template graph, the indegree and outdegree of all the 4 person nodes (41, 37, 27 and 34) is balanced whereas in subgraph 1, the nodes 635665 and 589639 have high indegree whereas nodes 490041 and 599956 have high outdegrees. The nodes in subgraph-2 have much closer similarity to the template in this aspect.

o   Cumulative frequency: The communication transactions in the subgraph-1 for all the phases are very sparse in comparison to template. The subgraph-2 again shows better similarity with template.

o   Bias in frequency of transaction type: The subgraph-1 includes less call transactions than email transactions in phase-1, 2, 3 and 5 in comparison to template whereas phase 6 shows more call transactions than email transactions. The subgraph-2 again shows better similarity to template.

 

2.       Procurement Channel: The subgraph-1 shows better similarity in the aspect of the product involved but subgraph-2 is more similar to template (shown in Figure 7) for the following reason:

o   The procurement transactions for the subgraph-1 becomes denser towards the end whereas in the template the transactions are sparse in beginning and ending. The subgraph-2 shows better match to template in this aspect.

 

3.      Demographics: The cumulative demographics for template, subgraph 1 and subgraph 2 are shown in Figure 8, 9 and 10. The range of spending and earnings for the nodes in subgraph 2 is closer to template than that of subgraph 1.

Figure 7: Procurement channel comparison of Subgraph-1 and Subgraph-2 with the Template graph. The price for each transaction is annotated

 

Figure 8: Demographics for template graph

 

 

Figure 9: Demographics for Subgraph 1

 

Figure 10: Demographics for Subgraph- 2

 

2 CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.

Ans: The analysis of seed is performed using an iterative approach where related nodes are added iteratively to the seed node and non-important nodes are removed.

Seed 1:  Given the co-authorship record of person 600971 for the document node 579269.  Following are the details of each iteration for creating a subgraph from this seed:

a. For the first iteration our person list contains nodes:-> related_nodes= [600971]

b. Increasing the related person by analyzing the document records and adding the other co-authors.

related_nodes = [600971, 611692, 470085, 484189, 600971, 540660, 609913]

In this list only 611692 node has transactions available in the graph. Therefore, after the 2nd iteration, related_nodes = [600971, 611692]

c. List all the other document nodes published by 600971 and include the coauthors from those publications.

Documents published: [463021, 607498, 530665]

Below table shows all such co-authors:

Document

Coauthors

579269

470085

484189

540660

600971

609913

611692

463021

600971

609913

611692

561356

607498

600971

611692

475588

530665

540660

600971

609913

611692

Figure 11: Related documents and coauthors for seed1. The cell in green are the one provided in the seed 1.

An interesting observation from the above iteration is that the publication time for each of the above document has discrepancy on the publication time of record for the node 600971. Also, the node 600971 has always had 611692 as coauthor for all the documents. “

d. The transaction graph obtained with the nodes connected to related_nodes obtained previously is shown in Figure 12. The earning and spending heatmap are shown in Figure 13 and 14.

Figure 12: Subgraph generated by related nodes for seed 1

Figure 13: Earning (%) Heatmap for seed 1 subgraph

 

Figure 14: Spending (%) Heatmap for Seed 1 subgraph

Seed 2: Similar to Seed 1, the coauthors for the given document are taken in the first iteration. This gives us 11 nodes including the provided person node – 538771. Next, we check the other documents authored by 538771 which give us 57 documents. A common feature in all these documents is that the weight parameter is very less.

Using these nodes, the most connected nodes are identified. The transaction graph for seed 2 is shown in Figure 15. The spending and earning heatmap are shown in Figure 16 and 17.

Figure 15: Subgraph generated by related nodes for seed 2

 

Figure 16: Earning (%) Heatmap for Seed 2 subgraph

Figure 17: Spending (%) Heatmap for Seed 2 subgraph

Seed 3: This contains a sell transaction by Person node 574136 for the item node 657187.

Iteration 1: Including the person who bought this product- [574136, 620791]

Iteration 2: Including people who are involved in the procurement of this product –

[574136, 620791, 620235, 550287, 567987, 592099, 465195, 654556, 552765, 623468, 491509, 623922, 480601, 512397]

The subgraph generated from the related nodes of above list is shown in Figure 18. The spending and earning heatmap for that graph is shown in Figure 19 and 20.

 

Figure 18: Subgraph generated by related nodes for seed 3

 

Figure 19: Earning (%) Heatmap for Seed 3 subgraph

Figure 20: Spending (%) Heatmap for Seed 3 subgraph

3 Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

4 Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

 Ans: Amongst all the 5 subgraphs discussed in question 1 and subgraphs resulting from 3 seed transactions in question 2, the subgraph 2 matches the template closest in temporal activities.  Considering the relative times between transactions to be an important aspect for matching, it can be easily observed in the Figure 21 that subgraph 2 is better aligned to the template graph.

Further analysis of the overall graph similarity is performed using the eigen vector centrality measure of person nodes in the template (Figure 22), subgraph1 (Figure 23) and subgraph2 (Figure 24) for the calls and emails transaction as they form majority of relationship in the given graph.

Figure 21: Comparison of transaction patterns of Subgraph - 1 and Subgraph - 2 with the template

Figure 22: Eigen Vector centrality measure for person nodes of template graph (call and email relationships)

Figure 23: Eigen Vector centrality measure for person nodes of Subgraph - 1 (call and email relationships)

Figure 24: Eigen Vector centrality measure for person nodes of Subgraph - 2 (call and email relationships)

All these factors indicate that subgraph-2 is the group responsible for the outage.

 

5 What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

Ans: Following are the major difficulties faced and steps taken to overcome those issues with the large graph data are:

1.       Preprocessing: Since it was difficult to load the entire dataset, it was not possible to understand the possible erroneous values present in the different channels. After visualization of subgraphs, possible errors were encountered which required re-processing of the data.

The data was divided into individual channels for easier access and updates to the data.

 

2.       Storage: Due to its large size, first it was not feasible to interact with the data through the developed tool and apply filters. Slow data retrieval operations made the interactions slower.

The data was stored in the Neo4j for easy filtering and faster access for the parts of large graph based on the requirements. This also provided easy access to different graphs algorithms for analysis of graph structure.

 

3.       Dealing with temporal data: Since the challenge required to harness temporal patterns, the cumulative graph properties (e.g. node centrality, degree) cannot be considered as a good metric and a more dynamic measure was required to see change in such properties over time.

To overcome this issue, more focus was made on the visualization aspect where change in such patterns can be observed easily. This provided insight related to change in clusters, centrality in different phases and dynamicity of different transactions in the timeline. The large graph was thus analyzed over different time period for retrieving the related nodes for seed transaction.

                Things that could have made it easier to work with this kind of data:

1.       Access to faster machine with higher memory capacity.