Do young people post less on Facebook? The pitfalls of social big data
The talk will introduce some of the opportunities and challenges of social media data analysis by reviewing key methods and issues in data retrieval and mining. The primary focus of the talk will be the study of 1,000 Danes' private Facebook data from 2007-2014 and the development of the software Digital Footprints (www.digitalfootprints.dk). Anja Bechmann will present and discuss preliminary findings on questions such as: do young people post less on Facebook, does the Pareto principle apply to Facebook posting patterns, and what kinds of secret, open and closed groups are Danes and South Koreans members of?
Transcript
Certain patterns in these pictures. Otherwise we do it in reverse: we put in a lot of data, a lot of textual data, we process it, and we try to find patterns. And we're trying to make the patterns as close to the world of the user as possible. So we're trying to understand the user in context, which is quite difficult. It's also quite difficult to put in the data: first to retrieve the data, and then to get the data into such a form that it makes sense to feed it to the algorithm. And sometimes we also evaluate what the algorithm is actually doing, because what we do with the data involves a lot of machine learning, a lot of algorithms that process the data. But often these machine learning techniques and algorithms are black boxes. We don't really know what the basis is. We don't know what the basic assumptions behind these algorithms and machine learning principles are. We know roughly what they're doing, but we're not really sure how. So what we also do is test how we perform: how do we tweak the algorithm or the machine learning principles so that we get the most accurate patterns of the user, the most accurate picture of the user? And we're actually algorithm controllers, just to make that clear. We're trying to analyze the input, the black box of the machine learning algorithms, and the output: does it make sense? I know some of you are colleagues, and that some of you, in an industrial context, are working with social media analytics yourselves. And I think that most of you have worked with Facebook analytics, right? Hands up if you have worked with Facebook analytics. Yes? So Facebook analytics is also a kind of black box, right? Because we really don't know exactly what the actual output is. We don't know exactly what the age of these people is. We don't know exactly whether it is a female or a male.
Because we assume that what they portray themselves as on Facebook is what they are in reality. But our studies, our quantitative studies, show that this is not exactly the case. Some users are actually just writing that they are male because they do not want to be profiled, or they do not want advertising that says: I'm a female in my 40s, so presumably I would like something about weight loss. If I'm single, I would like a lot of dating sites offered. So users are tweaking the algorithm a bit by adjusting their data. The basic assumption behind Facebook analytics is therefore one we actually cannot use in our industries, in our companies, right? We cannot assume that a male is actually a male without asking him or her. So that becomes a black box as well. Okay. So what is the vision in our group? Our vision is a world where we can make decisions based on data with full knowledge of the individuals behind them, so that we do not make decisions on false, error-prone, and manipulated grounds. We do not think we can stop the use of data altogether, and because we cannot, we want to contribute to the quality of decisions in order to create as transparent and informed a decision-making ground as possible. So who is the user in this case? First, we started out with segments, either based on demographics, trying to understand age, gender, education, etc., or as colors: the blue people, the green people, the pink people, the purple people. So we are trying to make segments in different ways. Then we focus on user behavior patterns, for instance in Facebook analytics, trying to describe the web stats on log-file data. And my mission has been to try to describe the individual user and to target this user by understanding the user's patterns in depth, what I call the "around me" method.
But now we are trying to move away from the ethnomining or "around me" approach and towards a combination of the "around me" idea and segmentation. For instance: how are collective content patterns grouped according to traditional segmentation variables, such as demographics? And secondly, are these content patterns grouped by other segmentation variables, such as interests or page-like patterns? So why is this necessary? As LunarStorm also showed, we see growing commercial and marketing bubbles. To optimize product sales, services, and advertising, we need a deep understanding of the users: their incentives, behavior, and needs. But social media is also used for recruitment, risk assessment, and political decisions. So I see a need to focus on how social media, as input data for decisions, potentially threatens societal values such as inclusion and equality. This is what I call the human factor in big data: the seemingly objective processing of data is filled with human interpretations, adjustments, error-prone data, and false basic assumptions about human behavior. I'm not working alone on this; I'm not able to do everything it takes to work with this data in the best possible way. We work as a team, supplementing each other with different competencies and clear goals. We also make software that helps 140 social media researchers in 13 countries around the world. And these are some of the competencies we draw on when we do this kind of analysis: we have programmers, we have people focusing on data models, we have natural language processing people, we have statisticians, we have people with field knowledge, for instance a focus on social media or mobile media, and we have performance testers, because we want to know how well this performs when we compare it with manual coding.
Normally our social media data analysis works in different steps, each of which has black boxes connected to that stage that can affect the outcome and the result negatively. So I will give some examples from the data collection and pre-processing phases, and lastly from the data mining phase. About one and a half years ago, we collected a lot of data from the Danish population in collaboration with the internet panel company Userneeds: a thousand Danes' total Facebook data, news feeds, Facebook groups, and walls, back from 2006 to 2014. In doing so we got quite a lot of data, and afterwards we collected the same kind of data for South Korea in order to compare the data patterns. So, when we look at the pros and cons of doing this kind of thing, the pros of working with API data are that we have this naturally occurring setting and private data collected through our software. We also recruited outside Facebook, which means that Userneeds guaranteed that the demographics of the participants mirror the Danish population; we did not assume that Facebook could give us the right demographics. We also have detailed semantic knowledge; we know a little about the context of the user. Still, we say that we are not studying usage, we are studying data, because usage happens in context, and we only know a little about this context: we know what app has been used, the GPS location, the time codes, and we know who is in the room if they are tagged in pictures or in the status. But the black box also has some cons. We don't know exactly whether what we get through the API is a sample rather than the total data set. And we do not know if there has been a server breakdown; suddenly the time codes don't match, the time code sequence doesn't match, and we need to reverse engineer what has happened.
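A minimal sketch of how such retrieval problems might be flagged. This is illustrative only: the function name, the sample timestamps, and the 30-day gap threshold are assumptions, not the group's actual pipeline.

```python
from datetime import datetime, timedelta

def find_retrieval_gaps(timestamps, max_gap=timedelta(days=30)):
    """Flag out-of-order time codes and suspiciously large holes in an
    API-retrieved feed, which may hint at a server breakdown or a
    partial sample rather than the total data set."""
    ordered = sorted(timestamps)
    out_of_order = ordered != list(timestamps)
    gaps = [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
    return out_of_order, gaps

# Hypothetical feed with one out-of-sequence post and a two-month hole.
feed = [datetime(2013, 1, 1), datetime(2013, 1, 5),
        datetime(2013, 1, 3), datetime(2013, 3, 10)]
out_of_order, gaps = find_retrieval_gaps(feed)
```

When either flag fires, the practical response described in the talk is simply to restart the retrieval process for that user.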
So we have to start the retrieval process again. We are also, as I said, not sure that the Facebook demographics, if we use them, are actually the demographics. We also lack the complete data context: what is actually the Facebook population in Denmark? We do not have any government statistics on who is actually on Facebook. We can make some assumptions and some qualified guesses, but it is a guess. So our statistical analysis has quite big limitations when we study the population or try to draw significant conclusions. Yes. The data preprocessing also has some black box issues. How do we process personalized language, emoticons, different languages? In our Facebook groups we have Pakistani groups, we have American groups, we have Italian groups, and we have Danish groups. We also have a lot of status updates saying "I loooove this and that", and machine learning does not recognize "love" with a lot of o's. It also has difficulties processing emoticons if we do not translate the emoticons first. So there are a lot of obstacles here that I hope you can learn from when you are doing your own analysis. How do we make sure it is the right data? We have a lot of data, but what is the right data? That is an obstacle as well. And how do we make sure that we have the right data model? That sounds like a small problem when I say it out loud, but in reality it is quite a big problem in our daily work, because we make a lot of data models, we have perhaps a week to evaluate them, and then we suddenly find out that this analysis will take a year to process, so we need to do it again. So in reality, that is a big problem for us as well. One example I have brought with me: this is the wall posts in our 1,000 Danes data set.
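The elongation and emoticon problems mentioned above could be handled with a small normalization pass before the text reaches the machine learning step. This is a hypothetical sketch: the emoticon mapping and the collapse-to-two-letters rule are my assumptions, not the group's actual preprocessing.

```python
import re

# Assumed token mapping; a real pipeline would use a much larger table.
EMOTICONS = {":)": "<smile>", ":(": "<frown>", "<3": "<heart>"}

def normalize(text):
    # Collapse any letter repeated 3+ times down to two occurrences,
    # so "looooove" becomes "loove" and matches its shorter variants.
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    # Translate emoticons into tokens the model can see as words.
    for emo, token in EMOTICONS.items():
        text = text.replace(emo, token)
    return text

print(normalize("I looooove this :)"))  # → I loove this <smile>
```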
And what we see is that we have some big outliers that push the post counts very high. We looked at the data, because that is the quality of API data: we can actually look at what is happening. And we see that this is apps posting on users' behalf. In the API we can filter out apps, but we cannot filter out apps that are acting like the user. So when we look at this, it is FarmVille: one user had 8,000 FarmVille posts in the data. That drew our findings in a very different direction. So we tried to sort that out and normalize the data, so that we had more even curves. And we see here that there are quite a few: FarmVille, FarmVille 2, Facebook for iPhone, Facebook for Android, and so on. This is a list of the apps posting on behalf of the user. So what we did, instead of deleting the apps that were not allowed in our data set, was to whitelist the apps that we wanted to be there. We took all the native apps, Facebook for iPhone, Facebook for Android, et cetera, and included those in our data set instead. That is just to illustrate some of the black boxing in the preprocessing phase. Okay, so on to the data mining and machine learning phase. There are also quite a few things that are black-boxed there, and that is actually the phase we have looked at the least in the years from 2011 until now. So we are trying to get better at this at the moment, recruiting a lot of new people. Okay, so I promised this: do young people post less on Facebook? What do you think? Do they? Do they not? Okay. So let's try to count posts; that is one way to approach the problem, right? So we just took the database and counted the posts to see how it looked. What we see, if you look at the green, is the post counts from 2007 to 2014.
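The whitelisting step just described might look like this in code. The post records and the idea of representing website posts with a missing app attribution are illustrative assumptions; only the app names come from the talk.

```python
# Whitelisting sketch: instead of deleting known offenders (FarmVille
# and the like), keep only posts made manually or through Facebook's
# own native apps.
NATIVE_APPS = {None, "Facebook for iPhone", "Facebook for Android"}

posts = [
    {"user": 1, "app": "FarmVille"},             # app posting as the user
    {"user": 1, "app": None},                    # assumed: posted via website
    {"user": 2, "app": "Facebook for iPhone"},
]

# Keep only whitelisted posts; everything else is treated as app noise.
clean = [p for p in posts if p["app"] in NATIVE_APPS]
```

The design choice mirrors the talk: a whitelist fails safe, because an unanticipated game or bot is excluded by default rather than silently inflating the counts.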
If we didn't know what we know now, this is Facebook, or the FarmVille apps, et cetera, driving the post counts up. So if we did not know that, we would assume that there is actually a decrease in posts, right? There is a major decrease in posts at the end of the period. If we take that out, it evens out, but we still have a little dip at the end. But these are not significant patterns; this is just taking data out of our database and looking at how it developed. Okay, so now we try to establish significant findings. When we put the status updates in the graph, after having corrected for app inflation by whitelisting the native apps, a small decrease in status updates appears around 2011, but there is still a significant increase in the monthly wall posts over the entire period, and no significant development overall in the last year. And when we control for demographics: if we look at age, we see a low Y intercept for the 60-plus category but a high rate of increase. We see a higher Y intercept for the blue line, the young people, but it evens out compared to, for instance, the yellow one. We also see a lower intercept for males with a higher rate of increase, and a higher Y intercept for the red, the females, but a lower rate of increase. The short education group is not statistically significant, so we leave that out and only look at the medium and long educated. We also looked at rural versus urban, and that is actually very similar to male versus female, which we had expected. So that is for the whole period. It does not answer the question, because what we are interested in is whether we see a dip in the last period, from 2013 to 2014. What we do see is a significant increase in posts in the 60-plus category.
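The intercept-and-slope comparison above can be illustrated with a plain least-squares line fit per demographic group. The monthly counts below are invented for illustration, and this sketch omits the significance testing the real analysis relies on.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit of a line; returns (intercept, slope).
    The intercept is the starting level, the slope the rate of increase."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

months = list(range(12))
groups = {
    # Invented monthly post counts mimicking the patterns in the talk:
    "60+":   [5 + 1.0 * m for m in months],   # low intercept, steep increase
    "young": [40 + 0.2 * m for m in months],  # high intercept, flat trend
}

for name, counts in groups.items():
    intercept, slope = linear_trend(months, counts)
    print(name, intercept, slope)
```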
And we can measure a tiny decrease, a statistically significant tiny decrease, for the younger people, but nothing that I would point to and say: yes, this is a decrease, this is a very strong finding. The only statistically significant result showing a decrease among young people comes from a negative binomial regression analysis. So it is there, it is significant, but it is not big. That is the question, and that is the answer. Then there is the black boxing of the regression analysis itself. Are we looking at the right data and the right period? Do we assume a linearity that is not there? Are we overfitting, adjusting our analysis to this particular data set without enough data to say something on a broader scale? Another example is the Pareto principle: do 20% post and 80% listen? No, not in our findings. In the 60-plus group, yes, but in the younger age groups it is almost the reverse at some points in the period. And then again, there are black boxes there as well: the users in the panel may post more than average Danish users, and the size of the network may influence the willingness to post more or less. Yes. So, to sum up, we are now trying to understand the patterns in Facebook groups in Denmark and Korea, and in order to do so we are looking at different machine learning algorithms to cluster these data. But there are black boxes there as well that make it difficult. One of them is that the clustering algorithms assume co-occurrence: if people are talking about the same things, it is a cluster. But that may not be the case when we think of it in a semantic way; when we manually code it, we may disagree with the algorithm. So we are struggling with these black-box basic assumptions to solve that problem. That may be something we can talk about at Internet Week in 2016. Yes.
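The Pareto check reduces to a simple computation: what share of all posts comes from the most active 20% of users? The per-user counts below are invented for illustration.

```python
def top20_share(post_counts):
    """Share of all posts produced by the most active 20% of users."""
    counts = sorted(post_counts, reverse=True)
    k = max(1, round(len(counts) * 0.2))
    return sum(counts[:k]) / sum(counts)

# Hypothetical panel of ten users: one heavy poster plus a long tail.
counts = [80, 40, 10, 10, 10, 10, 10, 10, 10, 10]
share = top20_share(counts)  # top 2 users: (80 + 40) / 200 = 0.6
```

Under a strict Pareto pattern this share would be around 0.8; values well below that, as the talk reports for the younger age groups, mean posting is spread more evenly than the 80/20 rule predicts.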
So the concluding remark is: we not only look at the input; we look at the black box, and we look at the output as well. Thanks. Thank you very much, Anja, for providing an insight into how research is being done on social media here in Aarhus. Any questions or comments this time around? Well, I'll kick-start with one. Have you looked at the comparison between the Danish population and the South Korean population in this regard? We have started looking at it, and we have Chi Young, who is from Yeungnam University, here in November and December. And we have already looked at how our data sets differ. We can see that we have almost twice as many groups in the Danish data set as in the Korean, so it looks like Danes use groups more than South Koreans do. We can also see that the South Koreans use open groups more, whereas we tend to use closed groups with more restrictive privacy settings than in the Korean data set. There are some tendencies that the Korean data set contains more business-oriented communication in the groups than the Danish data set. Interesting. Any other questions? I will add one more that I was thinking of. How important is this research field for Aarhus University? How big an interest are you seeing from your colleagues in what you're doing here? I think one of my colleagues is sitting in the back, and maybe he can confirm the level of difficulty in doing this: we need to work together with a lot of different people in order to do this kind of research. So the transdisciplinarity can be a challenge for us, because we spend a lot of time just understanding each other. Bootstrapping in computer science is not the same as bootstrapping in statistics. So there are a lot of difficulties just in communicating about this and knowing what the basic assumptions and definitions are in the different fields.
But yes, there is obviously a need for this, because we get more and more data in media studies as well. Thank you once again. And I think this proves the beauty of what Internet Week Denmark is all about: bringing the forefront of research together with businesses working within this field. And we definitely look forward to the next step in your research, which you can hopefully share with us in May. Thank you, Anja. Thank you.