CAVANAUGH: This is KPBS Midday Edition. I'm Maureen Cavanaugh. About a week ago, the news was saturated with political polls. We were told almost every day what likely voters would do. As it turns out, many of those polls were quite accurate. But as it also turns out, polling is becoming a rather old-fashioned way of predicting outcomes. The newest idea in forecasting everything from politics to armed conflict to disease outbreaks is amassing big data. A project underway at San Diego State is aimed at tracking and analyzing websites for clusters of words and phrases that characterize developing events, including epidemics and hostile social movements. My guests are Dipak Gupta, distinguished professor of political science at SDSU.
GUPTA: Thank you very much.
CAVANAUGH: And Ming Tsou, professor of geography. Welcome to the program.
TSOU: Thank you for having us.
CAVANAUGH: Professor Tsou, let's talk first about your data from social media trying to predict last week's election. You write that your model using tweets came pretty close to the national polling data.
TSOU: Yes, I think we are very much utilizing Twitter. We used a search to gather over 2 million tweets in the days before the election. And we're analyzing the tweets not just all gathered together, but city by city. We selected the top thirty US cities, compared the key words related to Obama versus Mitt Romney, and then compared the ratio. And the outcome is surprisingly similar to the polls at the time, and even to the election result afterward.
CAVANAUGH: But you also say that this method, even though it produced some very good results, is not quite ready to be used today. Why is that?
TSOU: I think my explanation is I was surprised to see that they are so related to the outcome of elections. However, we cannot explain in detail why the data is so related. We do know that social media, especially Twitter, has a lot of bias. 50% of the data in the tweets is nonsense. 30% of the tweets are retweets.
CAVANAUGH: I see. Right.
TSOU: But what is astounding as a result is really the prediction capability. So I think in our research we really need to focus on a deeper understanding of the reason behind those correlations. Then we can say, yes, we could do the prediction.
CAVANAUGH: Right. So as it stands right now, because you don't have that background information, you don't have the scientific reasoning behind why it works, it could be just a coincidence.
TSOU: It could be. But I think we have some similar research on disease outbreaks. And even the movie industry, the box office, all these domains have a high correlation coefficient between the outcome in the real world and the number of tweets we've collected by the key words.
CAVANAUGH: Doctor Gupta, could you tell us about this overall project? I understand the idea is, as Professor Tsou has been explaining, to use social media and websites to map all kinds of things. Is that right?
GUPTA: That's right. Our project is called Mapping Ideas. Literally by listening to people, listening to their conversation on publicly available information, whether it's tweets, blogs, or websites, we can get a very good idea of what they're talking about and what is their -- what is the idea. For instance, we can see ideas, good or bad, from extremism to all sorts of things, spreading. We did a search when there was the news report that a church in Florida was trying to burn the Qur'an, and when we started plotting where those things started coming up, all over the world, in the United States, we saw some incredibly interesting patterns. So it has a wide variety of applications in many fields, from epidemiology, how diseases get spread, to who is talking about, for instance, whooping cough. People these days, when their children have a cough, a persistent cough, they start tweeting, they start writing blogs, they inquire.
CAVANAUGH: How do you get access to what people are writing in their Twitter accounts, in their blogs? Is this all just public information that you use?
GUPTA: This is all public information. Twitter. Every time you tweet something, you have to remember that you don't own that. There is no privacy for your writing, no protection for your thoughts. In fact, it's the Twitter corporation that owns it. And they sell it for premium prices.
CAVANAUGH: As you know! [ LAUGHTER ]
GUPTA: Absolutely, we do!
CAVANAUGH: Okay. Now, what are you looking for? Is that where the twist in this research comes in? To know the right words or the right combination of phrases to start looking at? Professor Tsou?
TSOU: Yeah, I think in general our project tries to track what we call the digital footprint of human beings. Because we are engaged with a lot of smartphones, your iPod, your iPhone. When you type in those tweets, you are leaving a digital footprint in different times, different spaces. When we gather together those open sources, web pages, social media, Twitter, we can aggregate and understand the patterns and human movement in a different way. And we analyze the key words, and when we're analyzing the data, actually we are also learning that this is a new community. Twitter users, we are learning their new key words. We were struggling to analyze -- this one key word was sent out, it's called NT. And we had no idea. I am the computer geek. There is a Windows system called NT. But one student told me that on Twitter, NT is no thanks. [ LAUGHTER ]
TSOU: So we are learning new key words and new vocabulary.
CAVANAUGH: Are you developing software in order to go through this aggregate information that you get?
TSOU: Yes. Right now we are developing a series of key words, or a vocabulary, specific to interpreting the concepts in tweets and social media. I think social media, especially Twitter, is different from the traditional web blog. Social media has a very small character count.
Only 140 characters. So it's easier to recognize the meaningful concept behind that compared to the big web blogs.
CAVANAUGH: Right, because on a blog, people can just go off and ponder about this and that; they don't need to necessarily stick to the subject. But if they're going to tweet something, they better stick to the subject because they don't have that much room. Doctor Gupta, I read about some software called Condor, developed by an MIT professor, which can be used to analyze social protest. And it does something called sentiment analysis. Tell us about that.
GUPTA: Well, sentiment analysis is a subject that's been around for some time. We have in our group linguists who are focusing on understanding the grammatical structure of what we write and how we write, to understand whether it is for or against or neutral. And we can understand a lot about people's conversation in a very scientific way by their usage of language. Specific words. And imagine if we could eavesdrop on other people's conversation, what we would learn. We would learn a lot, right? And that's exactly what's going on in the digital area. So it's a brand-new area where we don't need to have, say, 300 people surveyed. And you know that surveys can be extremely biased. Focus groups can be biased. And here we are talking about a million tweets, you know, millions of web pages and blogs. So we are getting perhaps a truer picture. There is variation in every methodology. Nothing is foolproof. But here it's the law of numbers that is working for us.
CAVANAUGH: Can you give us an example of the way language might be used to indicate a certain -- whether a certain social protest might be on the rise or whether it's -- the people involved with it are getting a little demoralized?
GUPTA: Well, one of the things that we studied was after President Obama got elected. There is a thing called a word cloud.
That means we can literally find out and visualize all the words that are coming up by their relative size and position on a piece of paper. There you can see what people were saying. There are racial slurs embedded in those, so we can understand some of the antipathies that some people are feeling toward him based on his race or politics. So we can figure those things out. And also we can find out what people are saying that is very nice about the president. And one other interesting thing we found out: when people are using the term global warming, there are a lot more skeptics than when people are using the term climate change. So if you look for climate change, you're likely to find a lot more scientific information, and people are much more -- I wouldn't say favorable, I mean, they are not skeptics. But when you talk about global warming, then people are using it to express their skepticism.
CAVANAUGH: Right. Professor Tsou, it seems to me that the CIA, intelligence agencies, have been trying to use this kind of big data to figure out where the hot spots are around the world. And one of the criticisms that I heard about this kind of research was that it seems like -- there was no luck in predicting the Arab Spring, this huge revolt in many Arab nations that caused the downfall of the Egyptian president.
TSOU: I think there is a challenge in the linguistic aspect. Before the Arab Spring, there was no such key word as Arab Spring. Every movement has created a new key word. But there is some way we'll be able to detect the culture of a people. When we analyze the social media, we can do a micro-level analysis of the individual. So there are some users where we can predict the individual's use of tweets. They are gathering in a specific area, over 20 or 30%. In that movement, the individuals' past movement, we can detect there is some social event.
I think there is research in Japan analyzing things like fireworks or the holiday season, and they are able to detect those events. So in the future we can utilize a similar measure. But that triggers a privacy concern. This is another big issue when we are analyzing social media. Most people, when they type in their opinion, don't realize that every single key word is monitored or trackable in the database, in the digital world. So there's a lot of argument about the legal issues and the privacy concerns. I think in our research we do respect the privacy of the people. So while we do the analysis, we try to do it using -- an aggregate rather than pinpointing an individual one.
CAVANAUGH: You were pinpointing, one of the burgeoning aspects of this kind of research is what it can do for epidemiology, to find out how a disease is spreading. And you have some fascinating examples where you've tracked the spread of whooping cough across the United States this year.
GUPTA: Yes. And I think Ming is perhaps better placed to address that question.
TSOU: I think when we tracked the whooping cough, there was an outbreak in Washington state. So we're analyzing the number of tweets that contained two key words, whooping cough and its technical name. We found out the whooping cough key word makes sense. The other key word doesn't because it's too technical. And the weekly percentage increase corresponds to the cases we found in the CDC records. So it does give a good indication of the disease outbreak.
CAVANAUGH: It sounds like it's going to be very, very useful for people, instead of having to phone up every medical facility in a particular area, every doctor, etc.
GUPTA: Absolutely. It lends itself not only to the understanding of something happening but also to its prediction. And I'm involved with another project where we are trying to do just that. We are trying to predict what may happen.
For instance, the spread of diseases: if people start talking about having a fever, and then we notice that there has been a sudden increase in temperature and rainfall, we know this is the time when the mosquitoes are going to breed. So we can put all this information together and come up with a strategic plan before it actually does happen. And we can help the policy makers with that. But we do have to be very careful about privacy issues and larger social and political questions about this.
CAVANAUGH: In my last 30 seconds, Professor Tsou, it does sound, the way that you're describing this, like it's almost as much of an art as it is a science.
TSOU: Yeah, I would agree with you. I'm a geographer, and also a cartographer. We're using spatial information to recreate the representation of our digital world. It really depends on what kind of aspects, and you are the artist using a different way to represent how the people are connected together, how the people are talking, communicating together. So it is an art form.
Ming-Hsiang Tsou is sure Twitter can be used to accurately predict who will win a presidential election, just not quite yet.
Tsou and Dipak Gupta, both San Diego State University professors, are part of a project called Mapping Ideas. The idea is to search publicly accessible websites, as well as social media like Twitter and Facebook, for key words and phrases and then track their spread.
The project tracked the frequency of mentions of "Obama" and "Romney" on Twitter the week before the election. Remarkably, its results closely resemble the polls taken during that last week.
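As a rough illustration of that kind of count, the sketch below tallies candidate-name mentions city by city and converts them to shares, in the spirit of the ratio comparison Tsou describes in the interview. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the function name `mention_share`, the `(city, text)` input format, the simple substring match, and the sample tweets are all hypothetical.

```python
from collections import Counter


def mention_share(tweets, candidates=("obama", "romney")):
    """Return, for each city, every candidate's share of keyword mentions.

    `tweets` is an iterable of (city, text) pairs. This input format is an
    assumption made for illustration; the source does not describe one.
    """
    counts = {}
    for city, text in tweets:
        lowered = text.lower()
        city_counts = counts.setdefault(city, Counter())
        for name in candidates:
            # Naive case-insensitive substring match (an assumption).
            if name in lowered:
                city_counts[name] += 1

    shares = {}
    for city, city_counts in counts.items():
        total = sum(city_counts.values())
        if total == 0:
            continue  # no candidate mentions in this city, so no ratio
        shares[city] = {name: city_counts[name] / total for name in candidates}
    return shares


# Made-up sample tweets, for illustration only.
sample = [
    ("San Diego", "Obama speaks tonight"),
    ("San Diego", "Romney rally downtown"),
    ("San Diego", "voting for obama"),
    ("Topeka", "Romney ahead?"),
    ("Topeka", "nt"),
]
shares = mention_share(sample)
```

A real analysis would also have to deal with the noise Tsou mentions, such as retweets and off-topic tweets, which this sketch ignores.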
This kind of mapping, which uses various technologies to gather and analyze data, can also track the spread of hate groups or terrorism and of diseases or illnesses, like the flu or whooping cough.
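For the disease case, Tsou says in the interview that the weekly percentage increase in tweets mentioning whooping cough corresponded to the case counts in CDC records. A minimal sketch of such a comparison follows. The choice of Pearson correlation is an assumption, since the interview does not name the statistic used, and the function names and all numbers are hypothetical placeholders rather than project data.

```python
from math import sqrt


def weekly_pct_change(counts):
    """Percentage change from each week to the next (assumes counts > 0)."""
    return [100.0 * (b - a) / a for a, b in zip(counts, counts[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical weekly totals, invented for illustration only.
tweet_counts = [100, 150, 300, 240]  # tweets mentioning the key word
case_counts = [10, 14, 30, 25]       # reported cases in the same weeks

tweet_change = weekly_pct_change(tweet_counts)  # [50.0, 100.0, -20.0]
case_change = weekly_pct_change(case_counts)
r = pearson(tweet_change, case_change)
```

A value of `r` near 1 would mean the two weekly series rise and fall together. As Tsou cautions, a high correlation alone does not explain why the data are related.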
"Ideas spread through the world, and they always take definite paths, from fashion to political extremism," Gupta said. "This project is designed to capture why and how they spread, and what does it mean for the rest of us."
Tsou said the project "aims to test the hypothesis that the spread of ideas is not random, that there are places that are more prone to host these sites (and accept and spread an idea) than others."
For example, when Florida pastor Terry Jones said he planned to burn a copy of the Koran, "immediately it spread like wildfire," Gupta said.
By tracking who was most interested in the news, they found a "hot spot" in Topeka, Kansas.
"We wanted to know why, and we found out that there was another small church that were saying they were going to do the same thing," he said.