Ramanand J (email@example.com) [PRIMARY CONTACT], Shishir Mane (firstname.lastname@example.org), Niranjan Pedanekar (email@example.com), Harsh Nene (firstname.lastname@example.org), Sandeep Kulkarni (email@example.com), Mayur Bodakhe (firstname.lastname@example.org)
Affiliated to: BFS Innovations, Cognizant Technology Solutions, Pune, India
GATE (http://gate.ac.uk/): an open-source text extraction & analysis tool; used to identify ‘named entities’ (i.e. people, places, organizations, etc.) from text.
OpenCalais (http://www.opencalais.com/): a free API for named entity identification; also used to ease the task of identifying people, places, etc. from unstructured text.
GeoMap from Google Chart Tools (http://code.google.com/apis/visualization/documentation/gallery/geomap.html): visualization tool to represent geographical data.
GraphML Reader from Prefuse (http://prefuse.org/, http://flare.prefuse.org/): We have built an in-house tool to represent organisational social networks using Prefuse. This is reused in this submission.
Wordle (http://wordle.net/): a popular word cloud generator.
A link to our video. (Our video is in the form of a PowerPoint file with embedded narration. Please play the slideshow to hear the narration.)
MC1.1: Summarize the activities that happened in each country with respect to illegal arms deals based on a synthesis of the information from the different report types and sources. State the situation in each country at the end of the period (i.e. the end of the information you have been given) with respect to illegal arms deals being pursued. Present a hypothesis about the next activities you expect to take place, with respect to the people, groups, and countries.
Solution Analysis Sequence:
1. Document Perusal: we read samples from the 5 source files to decide what to extract and how. This took half a day.
2. Event Extraction: we began by trying to visually map the people, places, and relations between them, first sketching on paper and then using PowerPoint as a canvas. We soon realized this approach did not work: the reports were not chronologically ordered and made references across sources, so the initial sketches needed constant re-organisation. After struggling for a couple of days, we switched to an event-based approach and began extracting individual events from the given data. Each news item yielded one or more events, either in the past or in the future (such as a planned meeting). Each event had an associated date, usually one or more actors and locations, a type (meeting, police action, etc.), a source (news, blogs, etc.), and the event description itself. This took us 3 days to complete.
The entire set of events thus identified is listed in this Excel file. The first sheet orders the events chronologically, i.e. sorted by date. The second sheet contains the original extraction, in order of news source. The date-wise ordering helps in understanding the overall sequence of events, in filling some gaps, and in linking seemingly unrelated people. An example is the use of ‘drilling equipment’ to refer to the arms cargo of the IL-76. The phrase is used in a phone conversation, and the testimonies of the IL-76 crew use the same phrase. The plane is owned by one Arkadi Borodinski, who hails from Kiev, which is where the phone conversation originated, making it probable that he was the caller.
This set of events sets the stage for visualizing the various players and entities in the reports, summarizing their relations and their relative importance.
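The event record described above can be sketched as a small data structure. This is only an illustration: the field names, the `Event` class, and the `chronological` helper are assumptions made here, not the actual columns of the spreadsheet, and the sample values are placeholders rather than extracted data.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Event:
    """One extracted event (field names are hypothetical)."""
    when: date                 # date associated with the event
    event_type: str            # e.g. "meeting", "phone", "police action"
    source: str                # e.g. "news", "blog"
    description: str           # free-text summary of what happened
    actors: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


def chronological(events: List[Event]) -> List[Event]:
    """Return a new list sorted by date.

    sorted() is stable, so events sharing a date keep their
    original extraction order.
    """
    return sorted(events, key=lambda e: e.when)


if __name__ == "__main__":
    # Placeholder records, not taken from the challenge data.
    sample = [
        Event(date(2009, 4, 20), "meeting", "news", "planned meeting",
              ["Person A", "Person B"], ["City X"]),
        Event(date(2009, 1, 5), "phone", "intercept", "phone call",
              ["Person A"], []),
    ]
    for e in chronological(sample):
        print(e.when.isoformat(), e.event_type, ", ".join(e.actors))
```

Keeping both the extraction order and a date-sorted view mirrors the two sheets of the spreadsheet.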
In this task, we used tools to identify special types of words such as people, places, and organizations (referred to as Named Entities in the Natural Language Processing community) in the text. This served as an aid to the manual reading of the text. GATE is an open-source library, while OpenCalais provides a web API. Both try to highlight candidate entities. In this case, we chose phrases that seemed to be names of people, places (including countries), and organizations.
Recognition of these entities is limited to explicit names, which means that indirect references to people (say, by pronouns) are not covered. Identifying specific events was even harder. These remained manual tasks. We chose not to implement a fully automated extraction system as the input document set was limited. However, a full-fledged system could use a text extraction system to identify not only entities, but also relations & events.
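To illustrate what "highlighting candidate entities" means, and why only explicit names are caught, here is a deliberately naive stand-in written for this illustration. It is not GATE or OpenCalais and does not use their APIs: it simply treats runs of two or more capitalized words as candidate names. The function name `candidate_names` and the sample sentence are hypothetical.

```python
import re
from typing import List

# Two or more consecutive capitalized words, e.g. "Alice Example".
_NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)+")


def candidate_names(text: str) -> List[str]:
    """Return distinct multi-word capitalized phrases in order of first appearance.

    Pronouns and other indirect references are never matched, which is
    the limitation noted in the text above.
    """
    matches = _NAME_PATTERN.findall(text)
    return list(dict.fromkeys(matches))  # de-duplicate, keep order


if __name__ == "__main__":
    sentence = "Alice Example met Bob Sample in Dubai. He left the next day."
    print(candidate_names(sentence))
```

A heuristic like this also produces false positives (any capitalized phrase) and misses single-word names, which is why the candidates were reviewed by hand.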
3. Visualization: To represent the relative importance of various countries in the given subject, we used the GeoMap charting tool to show a world map on which the countries mentioned in the documents are marked. Each country is associated with a bubble whose size and degree of redness are proportional to the number of mentions in the documents. (The source table is in the attached Excel file.) Pakistan, UAE, and Kenya are the top three such nations. GeoMap creates a Flash chart and is very easy to use. The graph (shown in Fig 1) was created in less than an hour.
Figure 1 Countries by Mentions in various sources
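The mention counts behind a map like Figure 1 can be produced with a simple tally. The sketch below is an assumption about how such a table could be built, not the procedure actually used (the source table was prepared in Excel). The function names, the plain substring matching, and the sample texts are all illustrative.

```python
from collections import Counter
from typing import Iterable, List


def count_mentions(texts: Iterable[str], countries: Iterable[str]) -> Counter:
    """Count case-insensitive substring occurrences of each country name."""
    names = list(countries)
    counts: Counter = Counter()
    for text in texts:
        lowered = text.lower()
        for name in names:
            counts[name] += lowered.count(name.lower())
    return counts


def to_table_rows(counts: Counter) -> List[list]:
    """Build a two-column table (header first), most-mentioned country first."""
    rows = [[name, n] for name, n in counts.most_common() if n > 0]
    return [["Country", "Mentions"]] + rows


if __name__ == "__main__":
    # Placeholder texts, not taken from the challenge data.
    texts = ["Shipment left Kenya for Yemen.", "Kenya police made an arrest."]
    for row in to_table_rows(count_mentions(texts, ["Kenya", "Yemen", "Thailand"])):
        print(row)
```

Plain substring matching over-counts short names that occur inside other words, so a real tally would want word-boundary matching or the entity annotations from the extraction step.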
(these answers are based on a reading of the event set that we generated during our analysis)
Pakistan: Though there is no confirmation that the planned meetings in Dubai in Apr 2009 took place, we assume they happened. It is likely that the Pakistanis are sourcing more arms from arms dealers that they met in Dubai. It is difficult to guess what specific operations these will be used in.
UAE: Dubai in the UAE becomes a hotspot for meetings between various dealers (particularly from Russia and Ukraine), buyers, and members associated with terrorist organizations in Pakistan and in the Middle East.
Kenya: The death of Thabiti Otieno and his wife is suspicious (no cause of death is given), given the dealings they have had. Clearly, Kenya was a source of arms for dealers. It is likely that after its release from hijackers, the ship MV Tanya containing arms cargo reached Mombasa.
Yemen: The notorious arms dealer Saleh Ahmed is reported to be in a near-death state, suggesting an attempt was made on his life, perhaps as fallout from arms deals going bad. This would have an impact on the strife in Yemen and perhaps neighbouring Saudi Arabia, where Saleh Ahmed was a key provider of arms to rebels. Ahmed was to have a meeting with Mikhail Dombrovski to discuss the problems arising out of MV Tanya’s hijacking.
Thailand: The IL-76 crew remain in captivity pending investigations. The likes of Boonmee Khemkhaengare continue their wheeling-dealing.
Russia & Ukraine: Arms dealers from these former Soviet nations seem prominent in the illegal arms trade. Mikhail Dombrovski emerges as a key figure, connecting various nefarious characters (see the next task for more details). Like him, Nicolai Kuryakin also has scheduled meetings in Dubai in Apr 2009. Task 3 suggests that he contracted an illness, which could be related to deals gone bad.
MC1.2: Illustrate the associations among the players in the arms dealing through a social network. If there are linkages among countries, please highlight these as well in the social network. Our analysts are interested in seeing different views of the social network that might help them in counterintelligence activities (people, places, activities, communication patterns that are key to the network).
Solution Analysis Sequence:
Using our list of events, we could aggregate information about people and places. An example social network of people was derived using this list. We selected (by filtering in Excel) all the phone, email, and meeting events. We created a list of people (nodes in the social network) and assigned ids to them. We then created edges for each conversation or meeting between pairs of people (when there was a meeting of more than 3, we created edges for each pair of people present). This was done manually and is summarized in this Excel sheet. By grouping identical edges, we obtained a frequency count for each pair’s conversations (irrespective of connection type). The frequency served to indicate the strength of the relationship.
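The edge construction described above was done by hand, but the same logic can be sketched in a few lines. The function name `pair_counts` and the placeholder participants are assumptions for illustration. Each contact event contributes one edge per unordered pair of participants, and identical edges are grouped into a frequency count.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def pair_counts(contacts: Iterable[List[str]]) -> Counter:
    """Count how often each unordered pair of people shares a contact event.

    Every phone call, email, or meeting is given as its list of
    participants; a meeting of several people yields one edge per pair.
    """
    counts: Counter = Counter()
    for participants in contacts:
        people = sorted(set(participants))   # canonical order, no self-pairs
        for a, b in combinations(people, 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder participants, not taken from the challenge data.
    contacts = [["A", "B"], ["B", "A"], ["A", "B", "C"]]
    for pair, weight in pair_counts(contacts).most_common():
        print(pair, weight)
```

Sorting the participants before pairing ensures that ("A", "B") and ("B", "A") are counted as the same edge, which matches the grouping of identical edges irrespective of connection type.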
It emerged that there were six components in the graph, which were independent of each other (i.e. their members seemed to have no contact with members of other components). The biggest one is shown in Fig 2. Mikhail Dombrovski emerges as a key figure in this graph, having direct connections to 6 out of the 10 people in this sub-graph. The thickness of the edge between him and the likes of George Ngoki indicates a large number of exchanges. This serves as a crude approximation for the strength of the ties between these people. Similar graphs are sketched for the other components, one of which is shown in Fig 3.
Such graphs quickly help identify important players (like Dombrovski, Akram Basra, Maulana Bukhari et al.).
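Two of the observations above, the split into independent components and the number of direct connections a person has, can be computed from an edge list. The sketch below is illustrative only: the function names and placeholder edges are assumptions, and the original analysis read these properties off the drawn graphs.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn undirected edges into a neighbour map."""
    adj: Dict[str, Set[str]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def connected_components(adj: Dict[str, Set[str]]) -> List[Set[str]]:
    """Find groups with no contact between them, largest first (breadth-first search)."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component: Set[str] = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def degree(adj: Dict[str, Set[str]], person: str) -> int:
    """Number of distinct people this person is directly connected to."""
    return len(adj.get(person, set()))


if __name__ == "__main__":
    # Placeholder edges, not taken from the challenge data.
    adj = build_adjacency([("A", "B"), ("B", "C"), ("D", "E")])
    print(connected_components(adj))
    print(degree(adj, "B"))
```

Degree is only one rough indicator of importance; combining it with the edge frequencies gives the "crude approximation" of tie strength mentioned above.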
This social network was created using an in-house tool which had been developed to represent communication networks within an organization, and was reused for this task. The tool was built using the Prefuse Flare project, and its input consisted of a GraphML XML file that encodes the nodes and edges to be graphed. The file was generated automatically from the event database, and the entire visualization was put together in a couple of hours.
Figure 2 Social Network of People (e.g. 1)
Figure 3 Social Network of People (e.g. 2)
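For readers unfamiliar with GraphML, the kind of file the social network tool consumes can be generated along the following lines. This is a minimal sketch using Python's standard library. The attribute keys (`name`, `weight`), the node id scheme, and the function name `to_graphml` are assumptions made here; the exact schema expected by the in-house tool is not given in this submission.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(names: List[str], weighted_edges: Dict[Tuple[str, str], int]) -> str:
    """Serialize people (nodes) and weighted ties (edges) as a GraphML string."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare the data attributes used on nodes and edges (assumed keys).
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", {"edgedefault": "undirected"})

    ids = {}
    for index, name in enumerate(names):
        ids[name] = f"n{index}"
        node = ET.SubElement(graph, "node", {"id": ids[name]})
        ET.SubElement(node, "data", {"key": "name"}).text = name

    for (a, b), weight in weighted_edges.items():
        edge = ET.SubElement(graph, "edge", {"source": ids[a], "target": ids[b]})
        ET.SubElement(edge, "data", {"key": "weight"}).text = str(weight)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder people and tie strength, not taken from the challenge data.
    print(to_graphml(["A", "B"], {("A", "B"): 3}))
```

The edge weight carries the conversation frequency, which a viewer can map to edge thickness as in Figures 2 and 3.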
This word cloud, generated using Wordle, shows an overall view of the most frequently appearing (and thus possibly important) people in the documents.
Figure 4 Mentions of people in the texts