Nicholas Spyrison, Monash University, Nicholas.firstname.lastname@example.org PRIMARY
Miji Kim, Monash University, email@example.com
Ha Nam Anh Pham, Monash University, firstname.lastname@example.org
Student Team: YES (PhD candidate and 2 masters students respectively. Department of Human-Centred Computing, Monash University, Australia)
- R (via RStudio)
o Especially the packages: dplyr, tidyr, ggplot2, gganimate, ggraph, Rtsne
Approximately how many hours were spent working on this submission in total?
~200 hours between 3 people
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES
Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.
How the visual analytics software helped our analysis
a. Heatmap: `ggplot2` was used to quickly grasp the distributions across the Edge Types within each Data Source.
b. Network layouts: The R packages `ggraph` and `igraph` were used to format the data into a graph object and to apply layout positions according to various algorithms (especially `igraph`'s Large Graph Layout), which facilitated rapid iteration in R.
c. tSNE: t-distributed stochastic neighbour embedding (“tSNE”, van der Maaten & Hinton, 2008) is a form of non-linear dimension reduction. Within each Data Source, we apply tSNE to embed 5 attribute dimensions (3 factors: Node ID, Node Direction, and Edge Type; 2 quantitative: Time [seconds] and Weight) of all cleaned rows into their own, potentially highly non-linear, 2-dimensional projection space. We use the same hyperparameters for every Data Source, one of which is a function of sample size: perplexity = ⅓ × sqrt(number of rows in the dataset). Viewing these spaces side by side, we tried to identify features of the projection spaces to better compare and contrast the networks.
d. Visuals of weight animated across time: the `gganimate` package was used to generate animated plots. The animated bar chart shows bars racing to the top according to their rank within each frame; it was developed to present the flow of procurement transactions over time. The animated scatter plot, which includes a timeline element, was then developed to identify and visualize similarities between each suspect and the template over time.
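The perplexity rule from item c can be sketched in a few lines. Our analysis was done in R; the snippet below is only an illustrative Python sketch using the standard library. The function name `perplexity_for` and the row counts are hypothetical and are not taken from the challenge data.

```python
import math


def perplexity_for(n_rows: int) -> float:
    """Perplexity heuristic from the write-up: one third of sqrt(n_rows)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return math.sqrt(n_rows) / 3


# Hypothetical row counts per Data Source (illustrative only).
row_counts = {"Template": 900, "Suspect 1": 3600}

for source, n_rows in row_counts.items():
    # The resulting value would be passed as the perplexity hyperparameter
    # of whichever tSNE implementation is used.
    print(source, perplexity_for(n_rows))
```

Because the value grows with the square root of the number of rows, larger Data Sources get a larger neighbourhood size while the rule itself stays the same across Data Sources.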
a. Heatmap: fast, light distribution of observations across 2 discrete variables.
b. Network layouts and tSNE: To identify and contrast particular features in the different networks.
c. Visuals of weight animated across time:
a. With visual data exploration, Suspects 4 & 5 were removed as they presented fewer similarities to the template than Suspects 1-3. We then narrowed our analysis to procurement transactions, as meaningful findings were identified in the financial category during the exploration stage. The dataset was therefore filtered by edge type, selecting sell and purchase data.
b. A discrete transformation was applied across time: we created a frame variable by slicing time, which we use to aggregate and animate. We are currently revisiting this transformation to see whether we can adopt an agnostic approach instead of subjectively selecting durations from integer grains of time (i.e. year and month) based on the respective counts of observations. The top candidates are uniform slices of time and slices of time containing a uniform number of observations.
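The two candidate slicing schemes named in item b can be contrasted with a small sketch. This is an illustrative Python example rather than our R code; the function names and the toy timestamps are hypothetical.

```python
def equal_width_frames(times, n_frames):
    """Assign each timestamp to one of n_frames slices of equal duration."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / n_frames
    if width == 0:
        return [0 for _ in times]
    # The latest timestamp sits on the upper edge, so clamp it into the last frame.
    return [min(int((t - lo) / width), n_frames - 1) for t in times]


def equal_count_frames(times, n_frames):
    """Assign timestamps to n_frames slices holding (nearly) equal numbers of observations."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    frames = [0] * len(times)
    for rank, i in enumerate(order):
        frames[i] = rank * n_frames // len(times)
    return frames


# Toy timestamps in seconds (illustrative only): most activity happens early.
times = [0, 1, 2, 3, 50, 100]
print(equal_width_frames(times, 2))  # [0, 0, 0, 0, 1, 1]
print(equal_count_frames(times, 2))  # [0, 0, 0, 1, 1, 1]
```

With bursty data, equal-width slices keep the time axis honest but leave some frames nearly empty, while equal-count slices balance the number of observations per frame at the cost of frames that span very different durations.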
a. Based on the previous visualizations, we chose to rule out Suspects 4 & 5 as candidates and to proceed to animations across time with the subset of the Template and Suspects 1-3.
a. In this write-up we tried to articulate nuisance terms including:
i. tSNE projection space: tSNE is non-linear and stochastic in nature. By this we mean that the precise transformations used to embed the 5D data space into a 2D projection space are obscured and, in particular, that the projections are not global solutions but rather local extrema that are hard to reproduce. Despite these shortcomings, we find meaningful interpretations in them that corroborate our other findings. It is also worth noting that a single suspect was not clearly identified.
ii. Selection of time duration for each “Frame” slice of time: the frame durations for the animations in the video were selected subjectively, based on the distribution of observations in all Data Sources across time, and fall on nice, whole units of time. We are going to revisit this, as described above.
a. The accuracy and precision of the data
b. The suspect networks include the most suspect behavior
c. Data cleaning:
d. The animated bar chart:
i. Randomized tie-breaking within each rank of a given frame
e. The animated scatter plot:
i. Disregarded direction, i.e. whether a node was the originator or the recipient of the transaction
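The randomized tie-breaking noted for the animated bar chart can be sketched as follows. This is an illustrative Python example, not our R code (base R's `rank()` offers a comparable `ties.method = "random"` option); the function name, node labels, weights, and seed are all hypothetical.

```python
import random


def rank_with_random_ties(weights, rng):
    """Rank names by descending weight; equal weights are ordered at random."""
    # One random key per item, so ties are broken at random rather than by input order.
    keyed = [(-weight, rng.random(), name) for name, weight in weights.items()]
    keyed.sort()
    return {name: rank for rank, (_, _, name) in enumerate(keyed, start=1)}


rng = random.Random(2020)  # fixed seed so the same animation can be regenerated
frame = {"A": 5.0, "B": 3.0, "C": 3.0, "D": 1.0}  # toy weights within one frame
ranks = rank_with_random_ties(frame, rng)
print(ranks)  # "A" is ranked 1 and "D" 4; "B" and "C" share ranks 2 and 3 in random order
```

Every bar receives a distinct rank, so no two bars overlap in a frame, but the order of tied bars carries no meaning and may flip between frames.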
1. Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:
The heat map shows the number of transactions for each edge type within each suspect and the template. In the heat map, suspects 1-3 have values similar to those of the template, although the template does not include ‘co-authorship’.
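The counts behind such a heat map amount to a cross-tabulation of Data Source against Edge Type. The following is an illustrative Python sketch with made-up records, not our `ggplot2` code; the function name and the records are hypothetical.

```python
from collections import Counter


def count_edges(records):
    """Count edges for every (data_source, edge_type) pair."""
    return Counter(records)


# Hypothetical edge records as (data_source, edge_type) pairs (illustrative only).
edges = [
    ("Template", "sell"),
    ("Template", "purchase"),
    ("Suspect 1", "sell"),
    ("Suspect 1", "sell"),
    ("Suspect 1", "co-authorship"),
]

counts = count_edges(edges)
sources = sorted({source for source, _ in edges})
edge_types = sorted({edge_type for _, edge_type in edges})

# One row per Data Source, one column per Edge Type; absent pairs count as 0.
print(edge_types)
for source in sources:
    print(source, [counts[(source, edge_type)] for edge_type in edge_types])
```

Each printed row corresponds to one row of heat map tiles, and a zero marks an edge type that a Data Source lacks entirely, as with ‘co-authorship’ in the template.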
To further identify the suspect subgraphs that match the template, we used tSNE on the edges. tSNE is a technique for visualizing high-dimensional data. It enabled us to generate better visualizations by reducing the tendency, which linear projections suffer from, to crowd points together in the center of the map. In the visualization, the template graph has more rounded, but unconnected, splotches. Suspects 4 and 5 contain relatively shorter strings than the other suspects and the template.
The splotches of the template data are quite distinctive, while the short, choppy strings in suspects 4 and 5 corroborate the findings of the earlier visualizations. We continue our search within suspects 1, 2, and 3.
2. CGCS has a set of IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template? Describe your process and findings in no more than ten images and 500 words.
This visualization was created with the `igraph` package. Comparing the direction, clustering, and node types of the 5 suspects and the template, we find that the template for identifying malicious attacks contains more travel than the 5 suspects. The template also has dense arrows pointing inward to a few nodes in a tight group. Among the suspects, suspects 4 and 5 exhibit these properties of the template graph.
3. Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.
4. Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.
Based on our analysis and the given constraints, we believe the full-network behaviour of suspects 4 & 5 is quite unlike that of the template network. Among suspect networks 1, 2, and 3, we have not been able to positively identify one or more networks that look most like the template. Furthermore, the remaining candidates seem to have more in common with one another than with the template network.
We advise an immediate meeting with CGCS socio-psychologists to discuss exactly how closely we expect the network behaviour to adhere to the template. The search may need to broaden to include other networks outside of the suspects, or perhaps further explore precise behavioural differences with domain experts.
5. What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?
b. The number of levels when all discrete variables are considered. This created a constant need to ask “am I within the correct dataset for the correct Node Type and Edge Type?”.