Entry Name:  "TTU-Dang-MC1"

VAST Challenge 2020
Mini-Challenge 1

Team Members:

Hiep Vo, Luther College, PRIMARYvohi01@luther.edu

Vung Pham, IDV Lab, Texas Tech University, vung.pham@ttu.edu
Tommy Dang, IDV Lab, Texas Tech University, tommy.dang@ttu.edu

Student Team:  YES

Tools Used:

A software developed using HTML, CSS, JS, d3.js, Python, Networkx and Node2Vec – Git: https://idatavisualizationlab.github.io/V/MC1/

Approximately how many hours were spent working on this submission in total?

280

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES

Video

Video

Or youtube link here

 

Main features of our system

In this work, we developed a web application following DualNetView to capture both overall structures and temporal patterns within the graphs. The compact force-directed view gives an overview of the graph with communication links aggregated over time. This view provides information about communities and nodes with high centrality. On the other hand, the TimeArc view on the right offers temporal communications in the graph that help communicate the dynamics of the underlying network. In this view, horizontal orientation represents the timeline, and horizontal lines represent nodes, and edges represent communications. The edges are placed at the time the communications happened. The linked supplemental views utilize the strengths of both visualization techniques to provide useful insights into the given graphs. This tool is not merely used to report the results. It also offers interactions such as linking, brushing, and filtering to aid users in exploring the results further. Besides visual, qualitative analysis, we also adopt graph alignment evaluations to give a quantitative, statistical assessment of the matches between graphs. The symmetric graph alignment evaluation symmetric score. Specifically, given two graphs. G_1 = (V_1, E_1), G_2 = (V_2, E_2), find a mapping function f: V_1 \rightarrow V_2 to map nodes in the first graph to the second one. The induced graph \hat G_2 = (\hat V_2, \hat E_2) with nodes f(V_1) = \hat V_2 \in V_2, assuming that |V_1| \leq |V_2|, and edges are defined as \hat E_2 = \{(u, v) | (u, v) \in E_2; u, v \in f(V_1)\}. The edges of the first graph mapped into the induced graph are f(E_1) = \{(f(u), f(v))\}. Finally the alignment score is defined as:

\frac{|f(E_1)|}{|E_1| + |\hat E_2| - |f(E_1)|}

Questions

  1. Using visual analytics, compare the template subgraph with the potential match provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

Here are the visual notes on the subgraphs comparing to the template graph. For the figures and visual analysis below we added a few credentials to limit and filter data such as removing data with financial type and limiting the number of maximum neighbors for each node as explained in question 5:

Within the D3.JS generated graphs of the sub graphs we have specified the edges that were not able to be matched to the template graph as red


Here is the graph for template data to compare to other sub graphs

Template Graph Figure Template.


Figure 1. Template Graph visualized with DualView approach with the force-directed layout on the left to provide aggregated patterns and the TimeArc on the right to give temporal communication patterns over time.

Graph 1 Figure 1.


Graph 1 is expected to be the most representative of template graph. After filtering out nodes with financial eType, graph 1 and the template graph have the same number of nodes: 38. Also, they have one similar product node 657137. As you can see graph 1 below, many of the edges before February and between April and May were not able to match with template but overall Graph 1 represent the template the best with score 0.52


Figure 2. Graph 1 match with template (red arcs represent those who do not match with the template).

Graph 2 Figure 2.


Graph 2 is expected to be the second-best representative of the template graph with a score of 0.34. However, it still seems to have many edges that are not able to be mapped to the template graph throughout its whole time period. The edges that seem to be the most representative are the ones before February and around September


Figure 3. Graph 2 match with template. There are more red lines than graph 1.

Graph 3 Figure 4.

Graph 4 Figure 5.

Graph 5 Figure 6.


Overall judgment from observing the visualization is that graphs 3, 4, and 5 are much less representative of the template graph. Although, graph 3 also has a unique product node and is mapped quite well on the template. However, its structure on the right does not seem to follow the top pattern of the template graph. Graphs 4 and 5 clearly do not represent the template because both have many products nodes and do not look much similar to the template


Figure 4. Graph 3 match with template (although very few red lines but they still make a large portion of the overall number of edges so the matching score is lower than the first 2).

Figure 5. Graph 4 match with template. There are too many products nodes and mismatched edges.

Figure 6. Graph 5 match with template (Same Reason as Graph 4)

For this question, we also try to map the template graph on graph 1 and graph 2


Graph 6 Figure 7.

Graph 7 Figure 8.


Figure 7. Template match with Graph 1

Figure 8. Template match with Graph 2. There are also more mismatched edges than template mapping on graph 1.

As we proceed, we need more specific qualities to decide if Graph 1 or Graph 2 is the best candidate. Here, we use both visual qualitative and statistical quantitative assessments to decide graph 1 is the best potential match


For Visual qualitative aspect, we see that both graph 1, graph 2, and template have a person node with very high in and out degrees (41 in template graph, 635665 in graph 1 and 629627 in graph 2). As well as a single unique product node (657187 in template graph, 657187 in graph 1 and 487668 in graph 2). However, Graph 1 is better than Graph 2 because there are less red unmatched edges appear when the template is mapped with graph 1


For the Statistical Quantitative aspect, we utilize symmetric structure scoring from graph similarity scoring and matching formulas paper. After calculating, we decide that graph 1 is the better match with the better score of 0.52


  1. CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.

Graph 8 Seed 1 and Seed 2.

Graph 9 Seed 3.

Graph 10 Seep 3 match template.

Graph 11 Template match Seed 3.


To find the subgraphs according to the seeds, we first find the diameter of the template graph using Networkx. This diameter is then used as the number of hops to expand to further levels of neighbors. To do this, we adopt Breadth First Search algorithm to expand the connections with a limit of maximum degrees of each neighbor to be included (further details in question 5).


Figure 9. Seed 1 (A) and Seed 2 (B). These 2 mainly contain author and document edges - not comparable to template graph.

As we expand path from seed 1 and seed 2, we notice that both of them have an extremely high density of document and author edges, which all happened in the past. Therefore, we think these connections are insignificant in terms of time. Also, as we calculated the similarity scores for these two graphs, the scores are near 0% similarity. Thus, we only provide static force-directed layout node-link network views for the subgraphs generated from these two seeds.

Figure 10. Seed 3 (more variety of edges types, same product node (657137) as template and graph 1)

Figure 11. Seed 3 map Template (more matching edges towards template with score of 0.12).

Figure 12. Template map Seed 3. The template also matches well to Seed 3, comparable to Graph 2.

As we expand path on seed 3 we can see a clear variety of edges, and we can also see the signified product node 657187 in Seed 3 graph, which is also apparent in Graph 1 and Template Graph. Also, we see how seed 3 is being mapped to the template, and vice versa, seed 3 in even quite comparable to graph 2 in terms of the number of edges mapped similarity with a quite similar density of red lines. Seed 3 graph also has quite a good score for the similarity of 0.12. Thus, we believe that Seed 3 leads to a subgraph connection that is most similar to the template.


From the graph for Seed 3 which contains 14 Person nodes and 1 Product node, we decide that the criminal group is from Seed 3 connection and maybe not all, but some of the criminals have to be within these 14 people whose IDs are appearing black in the timeline graph


  1. Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

We combined the methods used in questions 1 and 2 to tackle this task. First, we already reduce the big graph into more compact and fast-to-access adjacency list representation. Second, we already have the algorithm to construct the sub-graphs from initial nodes as in question 2. Thirds, to reduce the number of initial seed nodes, we only consider nodes with degrees from 2 up to 100 degrees. After having the subgraphs, we will run our visual, qualitative graph mapping tools and the statistical, quantitative graph evaluation formulas to find the best graphs that match the template.


  1. Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

As mentioned in Question 2, we have concluded that Seed 3 will lead to the most similar connections to the template. According to the graph of seed 3 expanding neighbors, we have decided the group of 14 people most likely to contain the criminal group. Their IDs are: 552765, 620791, 623922, 512397, 654556, 574136, 480601, 491509, 620235, 465195, 550287, 567987, 623468, 592099


  1. What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

Here are the steps we took to deal with large graph data:

First as we initially observe the data in the given csv file, we see that the matrix is sparse so we decide to convert the dataframe to adjacency matrix because it would be more efficient to find paths between nodes.

Second, we have noticed in the big data some nodes with extremely high density (many neighbors) such as 660971 with 3559 neighbors. Because of these nodes, when we expand the graph through this node, it would lead to a very large number of insignificant neighbors. Also, we expect the criminals not to have that many connections because the group is only a selected few. Thus, we only filter and choose the nodes with Maximum 100 neighbors.

Third, we notice from all the data files; the connections are overcrowded with financial type connection. We believe this type will overshadow significant paths and would generate too many paths between the nodes, which would make the visualization very messy, and the found paths between the nodes become less significant. For this reason, for every data file, we filter out all nodes with financial eType.

Fourth, on a small note, we read data frames on python pandas and write results such as created adjacency matrixes and connected nodes between seeds to pickle files for easier and faster data access.