SAS-SCHULZ-MC1

VAST Challenge 2020

Mini-Challenge 1

Team Information

Team members:

Student team?

No

Analytics tools used:

Approximately how many hours were spent working on this submission in total?

80 hours

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete?

Yes

Video

Watch Video on YouTube

Questions

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS socio-psychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

1Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best?

Please limit your answer to seven images and 500 words.

The template and candidate graphs were visualized to support initial comparison across their topology, composition, and temporal patterns. The highly connected nature of the financial segments of the network made it difficult to see some structures and so were visualized separately.

Resultant diagrams from plotting financial transactions separately.

From visual inspection of the template graph several noteworthy features were identified:

With these features in mind the composition of the candidate graphs and their topologies were visualized to facilitate searching for some of these same features.

Comparison of topologies of the templates and candidates graphs with financial edges shown separately.

A visual comparison of the topological structure shows the most similarly between the template and candidate graphs 1, 2, and 3. Candidate graph 4 is missing a purchase between two individuals like the one present in the template. Candidate graph 5 is missing the cluster of non-traveling communicators.

Comparison of node, edge, and temporal frequencies of network composition for templates and candidates.

Comparing the node and edge composition of the networks does not reveal much more that the network structure does not already, but the temporal comparison does provide more helpful differences. In particular the communication peaks found in the template are also found in candidate graph 1 and, to a lesser extent, candidate graph 2. The summertime window of purchasing is found in the template and candidate graphs 1 and 2 and also in a shorter window in candidate graph 3.

From this initial comparison candidate graphs 4 and 5 seem to be quite different from the template. Candidate graph 3 also has several differences in the finance related areas of the network. Candidate graphs 1 and 2 both seem reasonably similar to the temporal patterns of the template with candidate graph 1 appearing to align the best after only a visual assessment.

Which key parts of the best match help discriminate it from the other potential matches?

Please limit your answer to five images and 300 words.

Many of the visualized features of the graph helped narrow focus to candidate graphs 1 and 2. By establishing topologically relevant subgraph structures and attributes we were able to check for pattern matches in the template and candidate graphs for a more analytical approach to evaluating similarity. Searching for patterns in the full graph as well provides context for their potential discriminatory power.

Table showing subgraph patterns and their match frequency in the template, candidate graphs 1 and 2, and the full graph, highlighting patterns that illustrate similar matches between candidates.
Pattern Description Template Candidate 1 Candidate 2 Full Graph
Pairs of co-authors 0 0 0 1,000,000+
Buy/sell events 9 7 7 389,491
Multiple purchases in a single day 1 1 0 3,812
Pairs who engage in buy/sell activities at least 5 times 1 1 1 36
Pairs who travel to the same country with at least 1 day overlap 14 13 5 1,000,000+
People with more than $200k in personal income before taxes 2 6 0 2,262
People with more than $100k personal income before taxes 4 8 1 8,482
Groups of three that communicate at least 5 times each 30 4 11 8,299
Unique individual in a group of three communicating at least 5 times 16 7 12 1,426
Groups of three that communicate at least 10 times each 7 2 5 286
Unique individual in a group of three communicating at least 10 times 9 6 6 167
A buy/sell and communication happens within the span of a day 1 0 1 368,511
Pair that sequentially communicate, buy/sell, then communicate within the span of a week 1 0 1 43
Pair that travels with at least 1 day overlap and communicate with each other 3 1 3 738,393
People that communicate during travel 0 0 0 144
People who communicate with 2 others, travel, and buy/sell 1 0 1 1,885

A few of the finance and travel related patterns showed a higher incidence of matches between structures in the template graph and candidate graph 1. But many of the more communication centric patterns showed a higher number of matches between the template and candidate graph 2. Given that the communication connections are more informative about the how an event between people occurred we are inclined to bias toward candidate 2 being a stronger match.

Additionally two people communicating, buying/selling, and communicating within a week is a relatively rare pattern in the full graph that is only shared between the template and candidate graph 2.

2CGCS has a set of “seed” IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that matches the template?

Describe your process and findings in no more than ten images and 500 words.

Our method for locating potential template matches is based on searching for the rarest of our defined patterns in the full graph and investigating the nodes that match those patterns and are found to be in close proximity to each other.

Our search technique involved looking for isomorphic subgraphs with fuzzy conditions on attributes such as time or weight.

Code example of one of the pattern filters that finds communicate then buy/sell then communicate that happens within a week.
function linkFilter(fromQ, toQ, eType);
  if (fromQ in (0,1) and toQ in (0,1)) then return (eType in (0,1));
  if (fromQ in (0,1) and toQ=2) then return (eType in (2,3));
  return (1);
endsub;

function linkPairFilter(toQ[*], weight[*], time[*]);
  if (toQ[1]=2 and toQ[2]=2) then return (weight[1]=weight[2] and time[1]=time[2]);
  if (toQ[1]=1 and toQ[2]=2) then return (abs(time[1]-time[2])/3600/24 <= &maxDays);
  return (1);
endsub;

function matchFilter(fromQ[*], toQ[*], time[*]);
  comm1Time = .;
  comm2Time = .;
  buyTime = 0;
  do i=1 to dim(time);
    if (fromQ[i]=0 and toQ[i]=2) then buyTime = time[i];
    if (fromQ[i]=0 and toQ[i]=1 and comm1Time=.) then comm1Time = time[i]; 
    if (fromQ[i]=0 and toQ[i]=1) then comm2Time = time[i];
  end;
  return (comm1Time < buyTime and buyTime < comm2Time);
endsub;

data sascas1.linksq;
input source target;
datalines;
0 1
0 1
0 2
1 2
;

proc network
  direction = directed
  links = sascas1.&graph
  linksQuery = sascas1.linksq;
  linksVar
    from = source
    to = target
    vars = (eType time);
  linksQueryVar
    from = source
    to = target;
  patternMatch
    linkFilter = linkFilter(lq.source, lq.target, l.eType)
    linkPairFilter = linkPairFilter(lq.target, l.weight, l.time)
    matchFilter = matchFilter(lq.source, lq.target, l.time)
    outMatchNodes = sascas1.outMatchNodes
    outMatchLinks = sascas1.outMatchLinks
  ;
run;
	

The result is a count of how many times a matching subgraph was found in both the template and the full graph as well as the ids of all of the nodes that were involved in a match to that pattern.

Table showing subgraph patterns and their match frequency in the template, candidate graphs 1 and 2, and the full graph, highlighting the rarest patterns in the full graph.
Pattern Description Template Full Graph
Pairs of co-authors 0 1,000,000+
Buy/sell events 9 389,491
Multiple purchases in a single day 1 3,812
Pairs who engage in buy/sell activities at least 5 times 1 36
Pairs who travel to the same country with at least 1 day overlap 14 1,000,000+
People with more than $200k in personal income before taxes 2 2,262
People with more than $100k personal income before taxes 4 8,482
Groups of three that communicate at least 5 times each 30 8,299
Unique individual in a group of three communicating at least 5 times 16 1,426
Groups of three that communicate at least 10 times each 7 286
Unique individual in a group of three communicating at least 10 times 9 167
A buy/sell and communication happens within the span of a day 1 368,511
Pair that sequentially communicate, buy/sell, then communicate within the span of a week 1 43
Pair that travels with at least 1 day overlap and communicate with each other 3 738,393
People that communicate during travel 0 144
People who communicate with 2 others, travel, and buy/sell 1 1,885

By checking for the presence of multiple of these rare patterns the number of matches to evaluate is reduced to a number that is reasonable to visualize and analyze.

Visualization of progressive narrowing of nodes indicative of template presence against three rarest patterns.

Each pattern match was analyzed individually to see the subgraphs formed by only keeping the nodes matching the pattern and any of the edges between them. They are shown below in order of decreasing rarity. Additionally any seed nodes that were directly connected to a node that matched one of the patterns are included to see which of the seeds could be involved in a template match.

Subgraph resulting from requiring a match to "Pairs who engage in buy/sell activities at least 5 times" pattern and proximity to a seed node.
Subgraph resulting from requiring a match to "Pair that sequentially communicate, buy/sell, then communicate within the span of a week" pattern and proximity to a seed node.
Subgraph resulting from requiring a match to "Unique individual in a group of three communicating at least 10 times" pattern and proximity to a seed node.

By filtering the nodes to only the set that were found to match all three patterns a much smaller set of nodes is identified. And from those nodes we found that 561428, 620791,and 462278 have direct connections to the seed nodes 574136 and 600971. This leads us to believe that those seed nodes are very likely to be connected to subgraphs that match the template.

Subgraph resulting from requiring a match to all three patterns and proximity to a seed node.

3Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided?

Describe your process and your findings in no more than ten images and 500 words.

To rule out the possibility we attempted to pattern match the topology of the template graph against the full graph without accounting for attributes such as timestamps or weights at all. Unsurprisingly there were no exact matches to the template topology in the full graph.

In testing for subgraphs adjacent to seeds for answer 2 we subset the matches in the graphs based on proximity to the seeds. However, as those patterns are able to detect the presence of template features generically we can identify more nodes that are likely members of subgraphs similar to the template using the same technique.

Additionally, we can identify roughly where in the full graph these pattern matches occur by first aggregating into communities. However, with this much aggregation there is not much that can be discerned from resultant structure.

Visualization of communities within the full graph, highlighting those containing matches for patterns found in the template.

As the patterns are all matched against people of interest the communities in which our matches were found are all highly connected, likely due to a large number of communication interconnections. It also is the case that many of the matched patterns span these communities so it is not accurate to assume for example that there may be four distinct template matches.

The more general testing for template patterns reveals that there are nodes that may be parts of subgraphs not connected to the seed. Attempting to expand the network too far beyond the nodes in a pattern can quickly result in a visualization of a significant portion of the full graph due to its interconnectedness. But the risk of not extending to reveal context is that parts of the template that were not directly searched for cannot be visualized.

Table showing the rarest pattern in the full graph that was found in the template.
Pattern Description Template Full Graph
Pairs who engage in buy/sell activities at least 5 times 1 36

Starting from the rarest pattern returns 36 matching nodes. We then added a select set of adjacent nodes with the hope that patterns similar to the template will be visually apparent. Extending one step to include travel, purchasing, and financial structures provides a graph for visual comparison against the template. It is likely that a subset of this expansion matches to one of the hacker groups being searched for.

Subgraph resulting from requiring a match to the "Pairs who engage in buy/sell activities at least 5 times" pattern and expanding to include adjacent nodes.

4Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale?

Please limit your response to 5 images and 300 words.

To make our best guess at finding a match for the responsible party we started from the six people that were matches for all of the three rarest patterns we devised for identifying the presence of the template. When looking a their procurement channel events some of these nodes have similar temporal patterns over the course of the year, which further suggests they were involved in the same group activity.

Visualization of 6 nodes that match all three rare patterns and time series of the purchasing activity for those nodes.

By extending the subgraph to include local context including travel and purchase activity we were able to identify pairs that had purchases between them, similar to one of the distinctive patterns found the in template graph.

Local area subgraph for a selected set of potential template matching nodes from the full graph.

In particular 647740 is a potential match for template node 67 due to both being the member of the purchasing pair that does not have any travel activity. Additionally 570191 and 561428 are potential matches for template node 39, which is the member of the purchasing pair with travel activity. Between them, 561428 is connected to one of the seed nodes potentially indicating a higher chance of involvement.

Template graph showing potential matches for nodes 647740, 570191, and 561428.

5What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

Our greatest challenge was the difficulty in being able to query even a small region of interest to visually inspect the local structure and patterns. We overcame this by relying on pattern matching across a set of small, specific patterns to detect a region that matched multiple characteristics of the template.

Our pattern matching methods are able to do fuzzy matches on attributes of the nodes and edges, but currently cannot perform fuzzy matches on the topology. Having that ability would have made it easier to find matches using more complex patterns from the template. Matches on more complex patterns would have meant being able to more accurately extract the full subgraph that comprises a template match from the large graph.