Entry Name:  ANSER – Carroll – MC1

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Tim Carroll, Analytic Services, timothy.carroll@anser.org PRIMARY

Tessa Karakurt, Analytic Services, tessa.karakurt@anser.org

Dominique Malloy, Analytic Services, dominique.malloy@anser.org

Sean Quan, Analytic Services, sean.quan@anser.org

Jim Bieszka, Analytic Services, james.bieszka@anser.org

Student Team:  No

 

Tools Used:

Microsoft Excel, PowerPivot

Gephi

Tableau

Jupyter (Python)

SQL

 

Approximately how many hours were spent working on this submission in total?

Roughly 320 team hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? No

 

Video

 

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

Questions

1 –– Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

  1. Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

Figure 1: Gephi – Subgraph 2

Figure 2: Gephi – Template subgraph

Figure 3: Gephi – Template subgraph – Modularity Analysis

Figure 4: Gephi – Subgraph 2 – Modularity Analysis

The template subgraph, known as the CGCS Template, represents a profile of activities that are likely similar to the cyber incident that impacted the global Internet. Five candidate subgraphs represent potentially analogous social networks, derived from data extracted from the large data graph. Our analysis of closeness (a measure based on a node's average distance to all other nodes; nodes with high closeness have the shortest distances to other nodes), betweenness (the number of times a node lies on the shortest path between other nodes), and modularity (the strength of the detected communities, as calculated by Gephi's algorithm) shows that subgraph two bears a close relation to the template graph. Moreover, visual analysis in Gephi shows that subgraph two correlates most closely with the template graph because of the similar breakdown of communication patterns (edge types, eTypes) as percentages of the total. In the figures, there is a tight clustering in Figure 1 and the same amount of overlap in call (eType 0) and email (eType 1) records in Figure 2. Figures 3 and 4 use a circular layout in which the size of each node depends on its degree (higher degree, larger node) and each node is color coded by its modularity class (colors are not the same across all subgraphs). All edge weights have been set to “1” and edges are color coded by their e-type (colors are the same across all subgraphs). The modularity of the template graph is 0.327 and the modularity of Subgraph Two is 0.293. Coupled with the density of e-types, this modularity makes Subgraph Two the closest match to the template graph.
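The three measures named above can also be reproduced outside Gephi. The following is a minimal sketch in Python using NetworkX, not the team's actual workflow. It assumes an undirected graph with every edge weighted 1, and it uses NetworkX's greedy modularity communities as a stand-in for Gephi's community detection, so the modularity values it produces may differ from the 0.327 and 0.293 reported here. The `summarize` helper and the toy edge list are hypothetical.

```python
import networkx as nx
from networkx.algorithms import community


def summarize(edges):
    """Return mean closeness, mean betweenness, and modularity for an edge list."""
    g = nx.Graph()
    g.add_edges_from(edges)  # unweighted, i.e. every edge counts as 1
    closeness = nx.closeness_centrality(g)
    betweenness = nx.betweenness_centrality(g)
    # Assumption: greedy modularity maximization stands in for Gephi's algorithm.
    parts = community.greedy_modularity_communities(g)
    return {
        "mean_closeness": sum(closeness.values()) / len(closeness),
        "mean_betweenness": sum(betweenness.values()) / len(betweenness),
        "modularity": community.modularity(g, parts),
        "communities": len(parts),
    }


# Toy graph (not challenge data): two triangles joined by one bridge edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_summary = summarize(toy_edges)
print(toy_summary)
```

Running `summarize` once on the template edge list and once on each candidate would give a small set of numbers to compare side by side, complementing the visual comparison.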

Figure 1 and the data from Figure 2 align closely based on an analysis of the e-types of the communications involved. In the template data graph, there is a tight clustering of financial data (eType 5), with a smaller number of correlated email (eType 1) and phone (eType 0) communications. While the clustering of a financial network is not itself noteworthy, the fact that the template data graph included overlapping email and phone communications suggests more connectivity among the nodes in question. In our analysis of Figures 2 and 4, we found a similar correlation and assessed this group to be closely related to the type of network laid out by the template in Figures 1 and 3.

Edge Type Count Analysis: An Early Discriminator

While the team was looking for the best techniques to analyze the initial dataset, we discovered that a simple count analysis of the edge types of each subgraph provided a snapshot of the patterns of communications present within the network. This Tableau data shows the findings for the template and subgraph 2, which aided in focusing the final analytic position.
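The edge type count analysis was done in Tableau; an equivalent calculation can be sketched in a few lines of pandas. This is an illustration only. The column names `Source`, `Target`, and `eType` are assumptions about the edge list layout, the `etype_shares` helper is hypothetical, and the toy rows are not challenge data.

```python
import pandas as pd


def etype_shares(edges: pd.DataFrame) -> pd.Series:
    """Percentage of edges belonging to each eType, ordered by eType."""
    shares = edges["eType"].value_counts(normalize=True) * 100
    return shares.sort_index().round(2)


# Toy edge list (hypothetical values, not challenge data).
toy = pd.DataFrame({
    "Source": [1, 1, 2, 3, 4, 5, 6, 7],
    "Target": [2, 3, 3, 4, 5, 6, 7, 8],
    "eType":  [0, 1, 1, 5, 5, 5, 5, 5],
})
toy_shares = etype_shares(toy)
print(toy_shares)
```

Applying the same function to the template and to each candidate subgraph yields one percentage profile per graph, which is the kind of snapshot the count analysis relies on.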

  2. Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

The team identified similarities between the template and subgraph 2 in three of the seven edge types as the primary discriminator. The percentages of edge types represent the density of a given communication type within each graph; the team identified eTypes 0 (call records), 2 (procurement sales), and 3 (procurement purchase) as definitive. Comparison of eTypes 2 and 3 in the template and subgraph 2 shows highly similar levels of activity. EType 0 accounts for 7.1% of the template graph and 7.04% of subgraph 2. The other subgraphs (one, three, four, and five) displayed percentages that diverged from the template subgraph. The table below shows a detailed comparison of all subgraphs to the template graph. The data strongly suggest that subgraph two agrees most closely with the template subgraph, specifically on eTypes 0, 2, and 3.

 

 

            Template Subgraph   Graph 1   Graph 2   Graph 3   Graph 4   Graph 5
E-type 0    7.1%                5.41%     7.04%     4.95%     6.02%     5.61%
E-type 2    0.12%               0.10%     0.11%     0.17%     0.33%     1.40%
E-type 3    0.12%               0.10%     0.11%     0.17%     1.79%     13.68%
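One simple way to turn the table above into a single ranking is to sum, for each candidate, the absolute gaps to the template across the three eTypes. The sketch below does this with pandas using only the percentages reported in the table. The choice of a summed absolute difference is our illustrative assumption, not a metric the team reported using, and because eType 0 has the largest values it dominates the total.

```python
import pandas as pd

# Percentages copied from the table above.
shares = pd.DataFrame(
    {
        "Template": [7.10, 0.12, 0.12],
        "Graph 1": [5.41, 0.10, 0.10],
        "Graph 2": [7.04, 0.11, 0.11],
        "Graph 3": [4.95, 0.17, 0.17],
        "Graph 4": [6.02, 0.33, 1.79],
        "Graph 5": [5.61, 1.40, 13.68],
    },
    index=["eType 0", "eType 2", "eType 3"],
)

# Assumed metric: total absolute gap to the template, in percentage points.
candidates = shares.drop(columns="Template")
gaps = candidates.sub(shares["Template"], axis=0).abs().sum()
ranking = gaps.sort_values()
print(ranking.round(2))
```

On these values Graph 2 has the smallest total gap (0.08 percentage points), which is consistent with the assessment that subgraph two is the best match on eTypes 0, 2, and 3.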

2 –– CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that match the template? Describe your process and findings in no more than ten images and 500 words.

The team was unable to determine conclusively whether any of the seed IDs lead to networks that resemble the template data graph, but was able to conclude that the Seed 2 data graph is unlikely to resemble the template.

 

The team used a Microsoft PowerPivot table within Excel to load all 123.8 million records from the large data graph. From this data, we extracted all of the data points one degree removed from the originating source and target of each seed data graph. Extending the extraction one step beyond that initial condition proved difficult, largely because of technical limitations. Proceeding with a one-step analysis, the team developed the following three visual analytics based upon the seed data.
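The one-degree extraction was done in PowerPivot. For readers working in Python, a comparable filter can be sketched with pandas by streaming the file in chunks so that the full set of records never has to fit in memory. The column names `Source` and `Target`, the `one_hop_edges` helper, and the file name in the comment are all assumptions for illustration.

```python
from typing import Iterable

import pandas as pd


def one_hop_edges(chunks: Iterable[pd.DataFrame], seed_ids: set) -> pd.DataFrame:
    """Keep every record whose Source or Target is one of the seed IDs."""
    kept = []
    for chunk in chunks:
        mask = chunk["Source"].isin(seed_ids) | chunk["Target"].isin(seed_ids)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


# With a real file, the chunks could come from something like
#   pd.read_csv("large_graph.csv", chunksize=1_000_000)  # hypothetical file name
# Toy chunks (not challenge data) are used here so the sketch runs as written.
toy_chunks = [
    pd.DataFrame({"Source": [1, 2, 3], "Target": [2, 3, 4], "eType": [0, 1, 5]}),
    pd.DataFrame({"Source": [5, 6, 9], "Target": [1, 7, 1], "eType": [0, 4, 1]}),
]
hop1 = one_hop_edges(toy_chunks, seed_ids={1})
print(hop1)
```

Because each chunk is filtered before anything is accumulated, memory use scales with the number of matching records rather than with the size of the large graph.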

 

The team ruled out Seed 2 as a likely match for the template data. The Seed 2 graph appears to represent a principally academic network, given the preponderance of eType 4 data within it, which is not characteristic of the template data.

 

The team was unable to reach a conclusion regarding the Seed 1 and Seed 3 data because of these technical limitations. Our intended process was to extract every unique source and target that communicated with the initial seed source and target (the other nodes represented on the visual graph) and then search the large data graph for communications records in which both the source and the target appear in that extracted list. In this manner, the team hoped to create a network in which we could compare the seed graphs to the template by examining the interconnectedness of the networks.
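The intended two-step process described above can be sketched as follows. This is a hedged illustration of the idea, not code the team ran: it assumes `Source` and `Target` columns, uses hypothetical helper names, and makes two passes over the data (one to collect the seed's neighbors, one to keep the records among them).

```python
import pandas as pd


def seed_neighborhood(chunks, seed_ids):
    """IDs that share at least one record with a seed, plus the seeds themselves."""
    ids = set(seed_ids)
    for chunk in chunks:
        touching = chunk[chunk["Source"].isin(seed_ids) | chunk["Target"].isin(seed_ids)]
        ids.update(touching["Source"].tolist())
        ids.update(touching["Target"].tolist())
    return ids


def induced_edges(chunks, node_ids):
    """Records whose Source and Target are both inside node_ids."""
    kept = [c[c["Source"].isin(node_ids) & c["Target"].isin(node_ids)] for c in chunks]
    return pd.concat(kept, ignore_index=True)


# Toy chunks (not challenge data). With a real CSV reader, the file would need
# to be reopened for the second pass, since a chunk iterator is consumed once.
toy_chunks = [
    pd.DataFrame({"Source": [1, 2, 3], "Target": [2, 3, 4], "eType": [0, 1, 5]}),
    pd.DataFrame({"Source": [5, 6, 5], "Target": [1, 7, 2], "eType": [0, 4, 1]}),
]
nodes = seed_neighborhood(toy_chunks, {1})
network = induced_edges(toy_chunks, nodes)
print(sorted(nodes))
print(network)
```

The resulting edge list contains the seed's own records plus the records among its neighbors, which is what would be needed to compare interconnectedness against the template.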

5 –– What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

The greatest challenge was pinpointing which attributes of the data were most valuable and choosing the optimal technologies. Initially, we attempted to create network graphs using Tableau 10.3 and Python’s NumPy and pandas packages, but we found it difficult to carry out the necessary calculations and to format our graphs in a way that would allow easy data manipulation throughout the analysis. In addition, the volume of data was extremely challenging to process given the available technology and overall time constraints. Off-the-shelf and open-source software was limited in its ability to ingest, process, and display the data; we needed open-source software that could run without expensive hardware. The team settled on the open-source software Gephi, which was not only capable of performing the calculations we needed and applying them to the visual network graph, but also allowed easy interaction with, and changes to, the graph layouts.

 

Beyond choosing suitable software, we had to decide which attributes were worth focusing on. One example was the team's discussion of the importance of “weight” when creating the visualizations. Each weight was measured on a different scale according to its data type. We decided to keep all the weights consistent, setting every edge to a weight of “one” and redirecting the focus to the density of e-types. More generally, we took an analytical approach, creating multiple graphs to visualize disparate attributes and then considering which were most beneficial in aggregate.

One shortfall in conducting social network analysis at the scale required in this challenge is the dearth of open-source software able to handle this number of records and support visual analytics on them. We found that one way to shortcut that process was to extract data before displaying it visually, and we think that future machine learning algorithms that can identify clustered networks, such as the one in the template, could enhance the ability of analysts to examine the data.