Big Data Net Mined Discourse Study : Campaign 2012 : Pipeline Step : 2012-06-01 through 2012-06-13 : Focus=JECON
This example relates to claims of determining 'tone' or "sentiment" and the sometimes questionable conclusions seen in the wild. During the same time period as our study, 06/01 through 06/13, there were reports of negative "tone", specifically some variety of gauge of 'what people are saying' about issues and a candidate. Much of the questionable-looking work we see in the wild appears to be based on simple counts over data sources in which noise is either not filtered out or not filtered successfully.
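For illustration, here is a minimal sketch of the kind of simple-count "tone" gauge described above. The keyword list and sample items are assumptions for demonstration only, not the method used by any particular producer; the point is that every matching item is counted, with no attempt to separate noise from genuine public discourse.

```python
def naive_tone(items, negative_words=("fail", "weak", "worse", "slump")):
    """Return the fraction of items containing any 'negative' keyword."""
    if not items:
        return 0.0
    hits = sum(
        1 for text in items
        if any(word in text.lower() for word in negative_words)
    )
    return hits / len(items)

sample = [
    "Economy worse under this plan",   # opposition messaging, not the public
    "Jobs report shows steady gains",
    "Another weak month for hiring",
]
print(f"naive negative 'tone': {naive_tone(sample):.0%}")  # 67%
```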
A wise sanity check on findings is to compare against one or more baselines, and to use alternate methods or tools to gain a different perspective, along with comparison to other methodologies or techniques (e.g. comparison against multiple traditional polling results).
In this case, we observed a negative "tone" determination being publicly portrayed. Even without access to the inner workings of the producer's analysis of the "tone", we can ask questions like: Did the producer of the analysis use an inappropriate problem setup or model? Was the input data clean and relevant? How do I know whether their conclusion is right? If I apply another tool or two, or take a different view of the problem, do I discover different information? If my conclusion differs, have they picked up on something I missed?
For this example we applied some know-how in mining a sample, and used a specific tool to get a quick look at our own sample and eyeball it.
The finding is that there is some support for insufficient problem setup, data cleaning, and/or analysis behind the observed claim of a negative "tone" percentage significantly above 50%. Looking at a separately collected sample from the same source over the same time period, we found that one could conclude "people are saying" negatively "toned" things about the candidate on the issue only if noise is left in. Without noise filtering, one reaches the same conclusion as the observed tone claim; with noise filtering, the conclusion is completely different and not over 50% negative.
The data source in question, like almost all public discourse data sources, does not contain only what actual real people are saying in discourse; it also contains numerous items of discourse carrying messaging from the opposing candidate, surrogates, known biased media, and opposition astroturf. When those components of noise are filtered out, the results change.
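For concreteness, here is a minimal sketch of that filtering step, assuming each discourse item carries a source label. The source categories and the toy items are illustrative assumptions; in practice the noise list would be built from identified surrogate accounts, known biased outlets, and detected astroturf operations. Note how removing the noise flips the ">50% negative" conclusion.

```python
NOISE_SOURCES = {"opponent_campaign", "surrogate", "biased_media", "astroturf"}

def filter_noise(items):
    """Keep only items attributed to the general public."""
    return [item for item in items if item["source"] not in NOISE_SOURCES]

def pct_negative(items):
    if not items:
        return 0.0
    return sum(item["negative"] for item in items) / len(items)

items = [
    {"text": "his plan is a jobs disaster",  "source": "opponent_campaign", "negative": True},
    {"text": "another failed stimulus",      "source": "astroturf",         "negative": True},
    {"text": "hiring at our shop picked up", "source": "public",            "negative": False},
    {"text": "worried about gas prices",     "source": "public",            "negative": True},
]

print(f"unfiltered: {pct_negative(items):.0%} negative")                # 75%
print(f"filtered:   {pct_negative(filter_noise(items)):.0%} negative")  # 50%
```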
In the cluster map, our interpretation is that a significant negative bias was probably introduced by those producing the claim of negative tone, by failing to filter out the noise. That this is likely, or at least bears further investigation, can be seen in the cluster maps: they include the source of each item of discourse as one of the more significant environmental features, and they show that a very large number of the negative items of discourse were created by known sources of noise rather than by the public.
Visuals created by the tool we used allowed us to quickly and easily gain perspective on the claimed significant (> 50%) negative "tone" for the candidate in the issue area.
Baseline Pipeline Step A Discourse Study : Obama "Raise Revenue" : April 11, 2011
In this particular example, the graphs are snapshots of the most abundant and highest-ranked discourse on the Web for a date or range of dates, relative to a particular query, and visualize one of the initial single steps in a multi-step pipeline carried out to produce Big Knowledge. This step is performed as part of developing a baseline. In this step the input is generic, drawn from a variety of source search engines (though that is not always the case; in other cases the input is specifically mined data).
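To make the shape of this step concrete, here is a hedged sketch assuming a hypothetical fetch_top_results() adapter per source engine; it is not the actual pipeline code, only an illustration of pooling ranked hits from several generic engines for a query and date before clustering.

```python
from datetime import date

def fetch_top_results(engine, query, day):
    """Hypothetical adapter; a real one would call the engine's search API."""
    return [(f"https://{engine}.example/result1", f"snippet about {query}")]

def snapshot(query, day, engines=("engine_a", "engine_b")):
    pool = []
    for engine in engines:
        for rank, (url, snippet) in enumerate(fetch_top_results(engine, query, day), 1):
            pool.append({"url": url, "snippet": snippet, "rank": rank,
                         "engine": engine, "date": day})
    return pool  # this pool feeds the clustering/visualization step

print(len(snapshot('Obama "raise revenue"', date(2011, 4, 11))))  # 2 items
```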
Each label (term or phrase) on this graph represents a cluster of Web sources related to a discourse concept.
A single bubble contains one or more smaller bubbles, which represent a variety of related Web sites or other data sources on the Web. The maps depict a summary view of the discourse occurring on the highest-ranked Web sites for the specific topic of discourse we are interested in; they are not a map of the Internet. The structure of the graph shows which phrases and concepts are connected: in general, the connections between concept bubbles reflect which phrases or concepts are associated within the overall discourse topic at that snapshot in time. As single-point-in-time snapshots, the graphs do not reflect an average or a trend based on a series over time.
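A minimal sketch of the structure the cluster maps depict, under the assumptions stated above: each labeled bubble is a cluster of related Web sources, and links mark concepts that co-occur in the discourse. This illustrates the data model only, not the tool's internal representation; the URLs and concept phrases are made up.

```python
from collections import defaultdict
from itertools import combinations

# Each Web source tagged with the concept phrases found on it.
sources = [
    {"url": "siteA.example", "concepts": {"raise revenue", "deficit"}},
    {"url": "siteB.example", "concepts": {"deficit", "spending cuts"}},
    {"url": "siteC.example", "concepts": {"raise revenue"}},
]

clusters = defaultdict(set)  # concept label -> member sources (inner bubbles)
links = defaultdict(int)     # concept pair  -> co-occurrence weight (connections)

for src in sources:
    for concept in src["concepts"]:
        clusters[concept].add(src["url"])
    for a, b in combinations(sorted(src["concepts"]), 2):
        links[(a, b)] += 1

print(dict(clusters))  # each bubble and the smaller bubbles inside it
print(dict(links))     # which concept bubbles are connected
```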
It is possible to include other simple non-discourse features, such as geographical location, date/time, and type of source (e.g. video versus bloggers), as 'environmental' features. By including environmental features we can, for example, identify where discourse surrounding a particular campaign slogan is located (e.g. Indianapolis newspaper sites), or whether a concept is associated with a campaign's own Web sites versus a TV station, media personality, etc.
We can include "environmental" features such as the associated geographical locations of Web sources, or media type, and compare those to the locations a candidate has visited. In this example only text sources were searched and clustered; however, it is also possible to search multiple media types and the major languages in use globally.
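A hedged sketch of attaching such environmental features to discourse items and slicing by them, e.g. finding where a slogan's discourse is concentrated. The field names, slogan, and locations here are illustrative assumptions, not data from the study.

```python
from collections import Counter

items = [
    {"concept": "campaign slogan X", "location": "Indianapolis", "media": "newspaper"},
    {"concept": "campaign slogan X", "location": "Indianapolis", "media": "blog"},
    {"concept": "campaign slogan X", "location": "Chicago",      "media": "tv_station"},
]

def locations_for(concept, items):
    """Count the locations associated with a concept's discourse."""
    return Counter(i["location"] for i in items if i["concept"] == concept)

print(locations_for("campaign slogan X", items))
# Counter({'Indianapolis': 2, 'Chicago': 1})
```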
In this example, we have included the association with several dates in April near the April 14th 'deficit speech' by President Obama.