Projects - Schools in Computational Social Science and Data Science

During our summer schools, participants and experts work together through the entire research process. Each project addresses a specific research topic using methods of computational social science.

1. Ethnic and sexual minority representation in parliamentary speeches: A comparative analysis of the United States and Germany from 1945 to present

Project Leaders: Christian Czymara & Maximilian Weber

Short title: Minority representation

Research questions:

How are ethnic and sexual minorities addressed in parliamentary speeches?
How does this change over time?
Are there different trends for the US and Germany?
Do patterns of polarization across parties differ between the two countries?

Data: Parliamentary speeches from Germany (Bundestag) and the US (Congress) from 1945 onwards. Data are available at https://data.stanford.edu/congress_text and https://www.bundestag.de/en/documents/textarchive.

Methods: We use natural language processing to train a classifier on hand-labeled data. We compare different approaches, including Naive Bayes, Support Vector Machines, and large language models (deep learning). In addition, we might use topic models to explore differences in content over time.
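
As an illustration of the simplest of these approaches, the following is a minimal sketch of a multinomial Naive Bayes classifier written with the Python standard library only. The toy sentences, the label names ("minority", "other"), and the function names are assumptions made for this example; they are not the project's data or code, and a real analysis would more likely rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter(labels)        # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # ignore words never seen during training
            score += math.log((counts[token] + model["alpha"])
                              / (total_words + model["alpha"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy sentences, not real parliamentary data.
texts = [
    "we must protect the rights of ethnic minorities",
    "same sex marriage is a question of equal rights",
    "the budget for road construction will increase",
    "we debate the new tax on fuel imports",
]
labels = ["minority", "minority", "other", "other"]
model = train_naive_bayes(texts, labels)

print(predict(model, "equal rights for minorities"))
print(predict(model, "the fuel tax will increase"))
```

With hand-labeled speeches in place of the toy sentences, the same two functions would give a baseline against which the other approaches could be compared.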

Description: The aim of this research project is to investigate how ethnic and sexual minorities are addressed in parliamentary speeches and how this representation changes over time in the United States and Germany. The project will also examine whether there are differences in trends between the two countries and whether patterns of polarization across parties differ.
To answer these research questions, we will collect and analyze parliamentary speeches from the United States and Germany since the end of World War II. We will use natural language processing and content analysis techniques to identify and categorize references to ethnic and sexual minorities in the speeches. We will also use sentiment analysis to examine the tone and attitude of the speeches toward these groups.
To determine changes over time, we will conduct a longitudinal analysis that tracks the representation of ethnic and sexual minorities across the full period. We will also compare the two countries to identify any differences in trends. To examine polarization across parties, we will analyze the speeches of different political parties in each country and compare their attitudes toward, and representation of, ethnic and sexual minorities.

This research project will shed light on the representation of ethnic and sexual minorities in parliamentary speeches, which can have significant implications for policymaking and social justice. The findings of this research project can be used to inform policymakers and civil society organizations on how to improve the representation and inclusion of these groups in political discourse.

2. Effects of self-labeling on the democratic debate and group polarization

Project Leaders: Ivan Puga-Gonzalez & André Martins

Short title: Self-labeling

Research question: People assign themselves labels to signal political preferences. Does the existence of those labels contribute to a more polarized debate?

Method: We will use agent-based models to implement a computational scenario where agents debate and use labels to identify themselves in order to investigate how the existence of those labels might alter perceptions and the division of society into polarized groups.
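
To make the idea concrete, here is a minimal sketch of such an agent-based model. All rules in it are assumptions chosen for illustration, not the project's model: opinions lie in [-1, 1], an agent's self-assigned label is simply the sign of its opinion, agents with the same label move toward each other, and agents with different labels move apart. Switching `use_labels` off lets every pair attract, which gives a baseline without labels.

```python
import random


def clip(value, low=-1.0, high=1.0):
    """Keep an opinion inside the allowed interval."""
    return max(low, min(high, value))


def simulate(n_agents=50, steps=5000, use_labels=True, mu=0.2, seed=0):
    """Run the toy model and return (initial opinions, final opinions)."""
    rng = random.Random(seed)
    opinions = [rng.uniform(-1.0, 1.0) for _ in range(n_agents)]
    initial = list(opinions)
    for _ in range(steps):
        i, j = rng.sample(range(n_agents), 2)
        # Assumed labeling rule: the label is the sign of the current opinion.
        same_label = (opinions[i] >= 0) == (opinions[j] >= 0)
        # Assumed interaction rule: attraction within a label group,
        # repulsion across groups (only when labels are switched on).
        direction = 1.0 if (same_label or not use_labels) else -1.0
        diff = opinions[j] - opinions[i]
        opinions[i] = clip(opinions[i] + direction * mu * diff)
        opinions[j] = clip(opinions[j] - direction * mu * diff)
    return initial, opinions


def spread(opinions):
    """Distance between the most extreme opinions in the population."""
    return max(opinions) - min(opinions)


start, end_without_labels = simulate(use_labels=False)
_, end_with_labels = simulate(use_labels=True)

print("initial spread:       ", round(spread(start), 3))
print("spread without labels:", round(spread(end_without_labels), 3))
print("spread with labels:   ", round(spread(end_with_labels), 3))
```

Without labels every interaction pulls two opinions together, so the spread cannot grow. Comparing that baseline with the labeled run is one simple way to ask whether labels alone change the outcome.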

Data: To be decided.

Description: Labels are a central part of our social lives. Some of them are assigned to us, while other labels are products of our own choice. In particular, when discussing ideas we prefer or defend, we usually choose to belong to, or at least identify ourselves with, groups that share similar ideas. We pick labels, and we defend those ideas. Experiments show this might lead to motivated reasoning, biases, and distrust between competing groups. It is natural to ask what the societal effects of this labeling process are, from a possible increase in polarization of groups to a more significant propensity to commit extreme actions.

Our project will explore how the emergence of self-labels and their adoption by other agents (i.e., ingroup identification) affects agents’ willingness to engage in constructive debates or to resort to groupthink and echo chambers. By analyzing the interactions between agents, we aim to explore the mechanisms underlying the polarization of opinions and the emergence of extremist beliefs. We will explore whether and how self-labels produce groups of agents with correlated sets of opinions, and how this affects the strength of their opinions and the formation of echo chambers. We believe this project will contribute to a better understanding of the role of self-assigned labels and group identification in opinion formation and polarization. Additionally, the insights gained from this study may have implications for designing strategies that promote constructive democratic debate.

3. Cooperation and defection in the drafting of resolutions at the UN General Assembly

Project leaders: Rafael Mesquita & Marion Hoffman

Short title: UN General Assembly

Research question: Which factors lead states to cooperate or refrain from cooperating in resolution proposals at the United Nations General Assembly?

Data: The UN General Assembly Sponsorship Dataset [1] monitors the co-sponsorship of draft resolutions at the UN General Assembly (UNGA) for all member states from 2000 to 2020. It has information on over 5,000 drafts, with approximately 30 variables related to drafts’ metadata, including date, title, theme, participating groups, outcome, etc. More importantly, it contains over 190 variables on sponsorship by each UNGA member state, which allows identifying participating countries for every draft. Drafts can undergo multiple revisions, meaning that it is also possible to pinpoint the precise moment when each country joined or withdrew from a given proposal.
Several exogenous factors can explain countries’ decisions on whether or not to cooperate on a draft (at the country level: common regional group, same regime type, commercial ties, etc.; at the document level: topic salience, textual content, edits between revisions, etc.). Additional data can be consulted and/or collected to build the independent variables and controls for the model. Promising options will be pursued in the course of the research project.

Methods: We aim to adapt a recently proposed relational event model, the DyNAM-i model [2], initially developed for group interactions. This model follows the tradition of DyNAM models [3] and Stochastic Actor-Oriented Models [4]. Our model represents a two-step process: first, a state becomes active to make a decision; second, it decides whether or not to sponsor an existing draft. Both steps depend on attributes of the drafts and the states, as well as on previous decisions. By modeling all these dependencies, we can capture a multitude of mechanisms related to actors’ preferences, contextual factors, and relational dynamics. Finally, models are estimated using the goldfish R package to statistically test hypotheses related to our research question.
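
The two-step logic can be illustrated with a short sketch. It assumes, as is common in actor-oriented models, that activation follows exponential waiting times (so the next active state is drawn in proportion to its rate) and that the choice among drafts follows a multinomial logit. The state names, draft names, and numeric values are hypothetical, and the code is written in Python for illustration only; it does not use or reproduce the goldfish package.

```python
import math


def activation_probabilities(log_rates):
    """Step 1: probability that each state is the next one to become active.

    With exponential waiting times, this is rate_i / (sum of all rates).
    """
    rates = {state: math.exp(value) for state, value in log_rates.items()}
    total = sum(rates.values())
    return {state: rate / total for state, rate in rates.items()}


def choice_probabilities(utilities):
    """Step 2: multinomial logit over the drafts the active state can sponsor."""
    highest = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {draft: math.exp(u - highest) for draft, u in utilities.items()}
    total = sum(exp_u.values())
    return {draft: value / total for draft, value in exp_u.items()}


# Hypothetical values: state B is three times as active as state A.
step_one = activation_probabilities({"state_A": 0.0, "state_B": math.log(3)})

# Hypothetical utilities: the active state is indifferent between two drafts.
step_two = choice_probabilities({"draft_1": 1.0, "draft_2": 1.0})

print(step_one)
print(step_two)
```

In an estimated model, the log rates and utilities would be linear combinations of statistics (attributes of states and drafts, previous decisions) weighted by the parameters being tested.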

Description: The past decade has been particularly challenging for multilateral cooperation among states. For many observers, events such as the rise of nationalist movements, the COVID-19 pandemic, and renewed great power competition herald a new phase in international politics characterized by diminished cross-border collaboration, a departure from global forums, and retrenchment into more limited forms of cooperation [5].
Such readings of the situation of multilateral politics invite, nonetheless, deeper analysis. Although policymakers worldwide share a concern with preventing a “crisis in multilateralism” [6], we know very little about the factors that predispose countries to be multilateral cooperators in the first place, and the topics that are more conciliatory or more divisive. The drivers of cooperation and defection are foundational questions for International Relations [7], and Computational Social Sciences can deliver unique contributions to this agenda. To understand the dynamics that lead countries to cooperate or withhold cooperation, we require an approach that can account for causal factors across several levels of observation (unit, dyadic, group), the inherently relational aspect of cooperation, the chain of decisions, and the institutional features of the organization where members interact.
This research project aims to fill that gap by analyzing the determinants of country-to-country cooperation in international institutions. We plan to analyze the co-sponsorship of draft resolutions at the UN General Assembly (UNGA) from 2000 to 2020. We focus on the UNGA due to its universality, both in membership (all sovereign states participate) and mandate (virtually all topics are addressed through its resolutions). Records of co-sponsorship at the UNGA provide a rich and precise source of information on countries’ preferences for partners and topics. Using the DyNAM-i model, we can model and statistically test the determinants of support or defection in this multilateral norm-making process and ask questions such as: Which countries are more collaborative? What groups tend to form? Which themes are more divisive? How do past decisions influence new sponsoring events? Our goal is eventually to characterize the patterns of global cooperation over the last decades.

Literature:

[1] Seabra, Pedro, and Rafael Mesquita. (2022). “UN General Assembly Sponsorship Dataset”, https://doi.org/10.7910/DVN/MPQUE2, Harvard Dataverse, V3. Read more at http://penewsletter.org/dataprofiles/seabra_mesquita_2022/
[2] Hoffman, Marion, Per Block, Timon Elmer, and Christoph Stadtfeld. (2020). “A model for the dynamics of face-to-face interactions in social groups”. Network Science 1-22. doi:10.1017/nws.2020.3
[3] Stadtfeld, C., & Block, P. (2017). “Interactions, actors, and time: Dynamic network actor models for relational events”. Sociological Science, 4, 318-352.
[4] Snijders, T. A., Van de Bunt, G. G., & Steglich, C. E. (2010). “Introduction to stochastic actor-based models for network dynamics”. Social networks, 32(1), 44-60.
[5] Lake, David A., Lisa L. Martin, and Thomas Risse. (2021). “Challenges to the Liberal Order: Reflections on International Organization.” International Organization 75(2): 225–57. doi:10.1017/S0020818320000636.
[6] See for instance the Franco-German “Alliance for Multilateralism” launched in 2019 (https://www.dw.com/en/germany-launches-alliance-for-multilateralism/a-50600084).
[7] Axelrod, Robert, and Robert O. Keohane. (1985). “Achieving cooperation under anarchy: Strategies and institutions.” World Politics 38(1): 226-254. doi:10.2307/2010357.

4. Editorial control? National newspaper characteristics, substantive news coverage, and the 1960s civil rights movement

Project leaders: Weijun Yuan & Neal Caren

Short title: Editorial control

Research questions: What factors shape how social movement issues and demands enter the public sphere? How do newspaper characteristics shape their coverage of social movement actors?

Data: Political Organizations in the News (PONs) dataset, which encompasses media coverage of U.S. social movement organizations in the four most prominent national newspapers: the New York Times, Washington Post, Los Angeles Times, and Wall Street Journal. This will be supplemented with additional local and ethnic media articles.

Methods: content analysis, topic modeling, sentiment analysis, regression analysis, semantic networks, network clustering, large language models

Description: Do characteristics of professional newspapers, including their opinion pages, reporter assignments, or types of articles published, influence how social movement actors are treated in the news? News coverage plays a crucial role in shaping public opinion and can have a significant cultural impact on social movements. Previous research has shown that social movement organizations are often the subject of media attention, particularly in the case of recent movements such as Occupy, Tea Party, Black Lives Matter, and the alt-right/white supremacy movements. However, not all movements receive substantive coverage, that is, coverage that discusses their demands and issues. While recent scholarship has examined how the institutions that movements target mediate the treatment of social movements in the news, less attention has been paid to how differences among mainstream, professional news organizations influence their coverage of social movement actors. We argue that newspaper and article characteristics, such as editorial orientation, reporter characteristics, and article type, may play a crucial role in shaping the quality of coverage. In this study, we aim to identify newspaper and article features that may influence the quality of coverage and to investigate their impact. We have identified and manually coded more than 1,000 articles in extensive runs of coverage of five prominent African American rights organizations across four national newspapers in the 1960s. We intend to use this training set to estimate broader patterns, across all newspaper articles that mentioned African American rights organizations, in the coverage of movement demands and issues. Machine learning and regression analyses will allow us to investigate the impact of newspaper and article characteristics on substantive coverage.
Furthermore, we will employ structural topic models to examine how newspapers with varying characteristics prioritize different movement issues. We will also test the ability of large language models, such as ChatGPT, to augment or replace human coders or traditional machine learning methods for this type of analysis. We may also analyze the dynamic organization-issue network. Treating newspaper and article characteristics as edge attributes, we will explore how these attributes contribute to the salience of certain issues.

5. Using attitude networks to explore polarization and its implications for cooperation

Project leaders: Dino Carpentras & Philip Warncke

Short title: Attitude networks

Research Questions: How do we identify individual political beliefs that people subjectively find most polarizing? Which cognitive phenomena are sufficient or necessary for the emergence of a structured (and so, polarized) attitude network? How do structures in the attitude networks relate to affective polarization and possibilities for cooperation and collective intelligence? How can we discover political issue beliefs that might enable collective action among polarized political groups?

Data: American National Election Studies (ANES), European Social Surveys (ESS), European Election Studies (EES)

Methods: ResIN (short for Response Item Networks) combines the statistical prowess of item-response theory (IRT) with the computational efficiency and ease of interpretation of belief network analysis (BNA). ResIN provides a flexible and easy-to-implement framework to visualize and make statistical inferences about complex inter-relationships between survey item responses. More so than either IRT or BNA, the method grants straightforward insight into the often hidden complexities of political attitude data. PhaFNA (short for Physics-aided Factor Network Analysis) combines detailed insights from ResIN networks with the flexibility and wide applicability of Confirmatory Factor Analysis (CFA). By simulating item responses as charged particles in a latent attitude space, PhaFNA utilizes the location of political attitude nodes in a latent ideological space.

PhaFNA models thus benefit from additional information about the spatial intersections of major attitudinal cleavage lines, the relative strengths and correlations between different latent attitude factors, and the approximate position of attitude nodes that bridge different ideological camps.
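
As a rough illustration of the network idea behind ResIN, the sketch below treats every item response (for example, "supports gun control") as a node and links two nodes from different items when respondents who give one response also tend to give the other. Using the phi coefficient as the edge weight and keeping only positive weights are assumptions of this sketch; the item names and answers are invented, and the code is not the ResIN implementation.

```python
import math
from itertools import combinations


def phi(u, v):
    """Pearson correlation of two 0/1 vectors (the phi coefficient)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # a response given by everyone or no one carries no signal
    return cov / math.sqrt(var_u * var_v)


def response_item_network(responses):
    """Build edges between item responses; responses is a list of dicts
    mapping item name -> chosen answer, one dict per respondent."""
    nodes = sorted({(item, answer) for r in responses for item, answer in r.items()})
    # One 0/1 indicator per node: did the respondent give this answer?
    indicator = {node: [1 if r.get(node[0]) == node[1] else 0 for r in responses]
                 for node in nodes}
    edges = {}
    for a, b in combinations(nodes, 2):
        if a[0] == b[0]:
            continue  # answers to the same item are mutually exclusive
        weight = phi(indicator[a], indicator[b])
        if weight > 0:  # assumed rule: keep positive associations only
            edges[(a, b)] = weight
    return edges


# Invented survey: four respondents whose two attitudes are perfectly aligned.
survey = [
    {"gun_control": "support", "gay_rights": "support"},
    {"gun_control": "support", "gay_rights": "support"},
    {"gun_control": "oppose", "gay_rights": "oppose"},
    {"gun_control": "oppose", "gay_rights": "oppose"},
]
network = response_item_network(survey)
for edge, weight in network.items():
    print(edge, weight)
```

In this toy survey the two "support" responses and the two "oppose" responses form separate, strongly connected pairs, which is the kind of structure a polarized attitude network would show at larger scale.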

Agent-based models are computational models used to simulate how individual agents within a system interact and influence each other over time. These models are dynamic in nature, so they can show which behavioral rules are sufficient for passing from unpolarized conditions (where cooperation can thrive), to situations of high polarization. In the attitude network, these two conditions would respectively correspond to the case of low and high structure.

Workshop participants will also help design a simple survey experiment that can test predictions about the spatial location of polarizing and de-polarizing attitude nodes in ResIN/PhaFNA models. The goal is to assess the extent to which the spatial location of attitude-bridging nodes corresponds to the real-world ability of such attitudes to enable compromise between different ideological groups.

Description: The advent of the internet has enabled people to connect, access information, and engage in free debate, leading to remarkable collective projects such as Wikipedia and the Polymath project. However, we are also witnessing a contrasting phenomenon, where individuals are becoming increasingly segregated into ideological camps often engaged in an ideological war against each other. For example, in the United States only 4% of couples are currently composed of a mix of Republicans and Democrats. This environment has been shown to be heavily detrimental to the potential for collective action and thus to thriving societies. In the past, research on collective intelligence has focused on the importance of diversity. In this workshop, we will use methods such as ResIN to study the distinction between positive and negative diversity among people’s socio-political attitudes. Under positive diversity, people hold numerous, often diffuse opinion combinations on different topics. Under negative diversity, opinions on different topics tend to be strongly correlated (for example, in the United States, people who support gay rights also tend to be in favor of women’s rights and gun control). Therefore, positive attitude diversity harbors greater potential for two or more groups to find a common denominator and enable cross-group collaboration.

ResIN is an effective method to distinguish between positive and negative diversity. During the workshop, we will examine longitudinal datasets such as the ANES and ESS to observe how attitude networks have evolved over time, and how these structures are connected to variables such as affective polarization (i.e., the dislike between people of different ideological groups), which can limit the potential for cooperation and collective action. Using ABMs, we will also explore which conditions are sufficient for producing highly structured networks and the foundations of affective polarization. For example, some preliminary studies suggest that simply being able to recognize the association between two attitudes in other people, together with the ability to learn this pattern, may be enough to produce polarization. Finally, we will combine ABM and data analysis to study possible interventions. Specifically, we will focus on which attitudes may be responsible for most polarization and which types of online interaction could increase or decrease polarization. The participants of this project will receive a detailed introduction to ResIN and PhaFNA, as well as to ABMs that can integrate these methods, and will become familiar with the computational implementation of these methods in R and/or Python. The goal of the workshop will be to identify a small number of candidate attitudes for a simple experiment on polarization reduction in the US and in one European country. Participants are highly encouraged to bring in their own ideas about additional applications of these research methods and how to test them in an experimental context.

For more information about the ResIN method, please visit: https://www.resinmethod.net/

6. Predicting lock-in in social influence processes

Project leaders: Alexandros Gelastopoulo & Pantelis P. Analytis

Short title: Predicting lock-in

Research Questions: Can social influence lead a suboptimal option (e.g. a piece of fake news) to become extremely popular (widely believed)? How can we extract the degree of social influence from behavioral data?

Data: Depending on the interests of the students, we will use a subset of the following datasets: data from a behavioral experiment on the unpredictability of political opinion polarization by Macy et al. [7], a study on misinformation by Stein et al. [9], a study on the persistence of information cascades by Goeree et al. [3], data on self-correcting collective dynamics by van de Rijt [10], and data on the wisdom of the crowds in collective decision-making by Frey and van de Rijt [2] and Hardy et al. [4].

Methods:

Estimation of stimulus (popularity) response curves for binary choice data (e.g. using isotonic regression)
Out-of-sample prediction (e.g. using cross validation)
Evaluation of model predictions and model comparison (root mean squared errors, contingency table statistics)
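
The first and third steps above can be sketched as follows. The code fits a non-decreasing response curve to binary choices with the pool-adjacent-violators algorithm, a standard way of computing an isotonic regression, and then reports the root mean squared error of the fit. The toy data and function names are assumptions for illustration, and the error is computed in-sample for brevity; an out-of-sample evaluation would fit the curve on one part of the data and score it on another.

```python
import math


def isotonic_fit(x, y):
    """Pool-adjacent-violators: non-decreasing least-squares fit of y on x.

    Returns the fitted values in the original order of the observations.
    """
    order = sorted(range(len(x)), key=lambda k: x[k])
    blocks = []  # each block holds [sum of y values, number of observations]
    for k in order:
        blocks.append([y[k], 1])
        # Merge backwards while a block's mean exceeds that of the next block.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * count)
    fitted = [0.0] * len(x)
    for position, k in enumerate(order):
        fitted[k] = fitted_sorted[position]
    return fitted


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Invented data: popularity share seen by a participant and whether the
# participant then chose the option (1) or not (0).
popularity = [0.1, 0.2, 0.3, 0.4, 0.5]
chose_option = [0, 1, 0, 1, 1]

curve = isotonic_fit(popularity, chose_option)
print(curve)
print(rmse(chose_option, curve))
```

The second and third observations violate monotonicity (a choice at lower popularity, no choice at higher popularity), so the algorithm pools them into a single step with the average response of 0.5.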

Description: Whether it is about choosing a party to vote for, selecting music to listen to, or deciding whether to adopt an innovation or get vaccinated during a pandemic, people rely on the popularity of the available options to inform and guide their choices, and they are more likely to choose options that have been selected by many other individuals. When social influence is strong enough, it can trigger a positive feedback loop that drives popular options to become even more popular, a phenomenon known as cumulative advantage or the Matthew effect. This can potentially lead to unwanted lock-in, where options of lesser quality dominate in the long term, despite better options being available. In the context of democratic debate, it can lead to the viral spread of fake news [5, 6] or extreme political opinion polarization [7].
Although there is plenty of anecdotal evidence of lock-ins across disciplines in the social sciences, experimental evidence varies. In an influential recent study, Van de Rijt [10] suggested a simple condition for predicting whether lock-in is possible, and found that several widely cited social influence experiments, including the famous Music Lab experiment [8], failed to satisfy this condition, suggesting that lock-ins are much less common than previously thought. His model is implicitly based on a formal framework that uses stimulus response curves to describe sequential choice systems [1], where the stimulus is an option’s popularity. This framework can be used to derive precise predictions about the long-term popularity of the various alternatives and in particular the possibility of lock-in. The connection between theory and experimental evidence, however, has not been established. In this project, we will test the predictions of the theory using data from various multiple-world experiments (social dynamics experiments run with multiple independent trials) in the social and behavioral sciences.
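
To show how a stimulus response curve can yield a prediction about lock-in, the sketch below searches for the stable long-run popularity shares of a binary choice system. It assumes that an option's share tends to rise where the response curve lies above the diagonal and to fall where it lies below, so stable shares are points where the curve crosses the diagonal from above, plus the boundaries 0 and 1 when the curve pushes toward them. The grid search and both example curves are illustrative assumptions, not the condition as formulated in [10].

```python
def stable_shares(response, n=1000):
    """Approximate the stable long-run shares implied by a response curve.

    response(x) is the probability of choosing the option when its current
    popularity share is x. The search uses a grid of n - 1 interior points.
    """
    def drift(x):
        return response(x) - x  # positive: share tends to grow

    grid = [k / n for k in range(1, n)]
    stable = []
    if drift(grid[0]) < 0:
        stable.append(0.0)  # shares near 0 are pushed down to 0
    for a, b in zip(grid, grid[1:]):
        if drift(a) > 0 and drift(b) <= 0:
            stable.append((a + b) / 2)  # curve crosses the diagonal from above
    if drift(grid[-1]) > 0:
        stable.append(1.0)  # shares near 1 are pushed up to 1
    return stable


def weak_influence(x):
    """Hypothetical curve: choices respond only mildly to popularity."""
    return 0.3 + 0.4 * x


def strong_influence(x):
    """Hypothetical S-shaped curve: popular options attract most choices."""
    return x ** 2 / (x ** 2 + (1 - x) ** 2)


weak = stable_shares(weak_influence)
strong = stable_shares(strong_influence)
print("weak influence:  ", weak)
print("strong influence:", strong)
print("lock-in possible under strong influence:", len(strong) > 1)
```

Under the weak curve there is a single stable share near 0.5, so the outcome does not depend on early choices. Under the S-shaped curve both 0 and 1 are stable, meaning either option can end up dominating depending on how the process starts.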

Literature:

[1] W Brian Arthur. Competing technologies, increasing returns, and lock-in by historical events. The Economic Journal, 99(394):116–131, 1989.
[2] Vincenz Frey and Arnout van de Rijt. Social influence undermines the wisdom of the crowd in sequential decision making. Management science, 67(7):4273–4286, 2021.
[3] Jacob K Goeree, Thomas R Palfrey, Brian W Rogers, and Richard D McKelvey. Self-correcting information cascades. The Review of Economic Studies, 74(3):733–762, 2007.
[4] Mathew D Hardy, Bill D Thompson, PM Krafft, and Thomas L Griffiths. Bias amplification in experimental social networks is reduced by resampling. arXiv:2208.07261, 2022.
[5] David MJ Lazer, Matthew A Baum, Yochai Benkler, Adam J Berinsky, Kelly M Greenhill, Filippo Menczer, Miriam J Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, et al. The science of fake news. Science, 359(6380):1094–1096, 2018.
[6] Stephan Lewandowsky, Ullrich KH Ecker, Colleen M Seifert, Norbert Schwarz, and John Cook. Misinformation and its correction: Continued influence and successful debiasing. Psychological science in the public interest, 13(3):106–131, 2012.
[7] Michael Macy, Sebastian Deri, Alexander Ruch, and Natalie Tong. Opinion cascades and the unpredictability of partisan polarization. Science advances, 5(8):eaax0754, 2019.
[8] Matthew J Salganik, Peter Sheridan Dodds, and Duncan J Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.
[9] Jonas Stein, Marc Keuschnigg, and Arnout van de Rijt. Network segregation and the propagation of misinformation. Scientific Reports, 13(1):917, 2023.
[10] Arnout van de Rijt. Self-correcting dynamics in social influence processes. American journal of sociology, 124(5):1468–1495, 2019.

7. Simulating conspiracy beliefs through large language models and evolutionary psychology

Project leader: Veronika Batzdorfer

Short title: Conspiracy beliefs

Research Questions: Under which conditions can LLMs be used to annotate conspiracy features on social forums? How are results contingent on prompt design and confounders such as topicality?

Data: TBD (e.g., Reddit, Twitter)

Methods: Natural language processing, machine learning

Description: Large language models, such as OpenAI’s GPT-3, are trained on vast amounts of text data. Their use has brought about a significant paradigm shift in the scientific community, as they can produce consistent response distributions, for example for moral values or decision-making heuristics (see Horton, 2023). The research project has two main objectives: leveraging large language models for annotation, and assessing prompt sensitivity.

Leveraging Large Language Models to annotate features of conspiracy beliefs. Generic and operationalisable conceptions of conspiracy beliefs are missing. The current project aims to evaluate whether viewing the origins of conspiracy beliefs through an evolutionary perspective provides a valid framework for identifying conspiracy statements. Specifically, the project examines the ability of five generic features (actor, action, threat, pattern, and secrecy) to distinguish generic conspiracy beliefs from non-conspiracy beliefs. Psychological mechanisms like agency detection were crucial for survival from an evolutionary standpoint. However, if overexpressed, they can result in over-attributing malicious intentions and underestimating coincidences (see also van Prooijen and Van Vugt, 2018). A working hypothesis would be that voicing all five generic features is sufficient to discern conspiracy mindsets in social media postings. In this context, a significant implication is exploring how LLMs can generate ground truth to train machine learning algorithms.

The second objective concerns the sensitivity of LLM output to perturbations of the prompt, which must be accounted for. Variability in the output stems from stochastic generation, from implicit and explicit aspects of the prompt, and from how instructions are formulated (Liu et al., 2023).

More specifically, we will work with an open-access dataset of discussions about conspiracy theories and control for confounders such as topic. We will set up a prompting strategy with ChatGPT (GPT-3.5). One strategy may be to use the five generic conspiracy features as the basis for prompts with a demonstration example (i.e., providing the answer to the first prompt); another is to look into automated prompt engineering (Zhou et al., 2022) to annotate different text documents. We will compare the performance with other LLMs, such as FLAN-T5 (Li et al., 2022).
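
A prompting strategy of this kind could look like the following sketch. It builds a prompt from the five generic features with one demonstration example, parses a reply in the same "feature: yes/no" format, and applies the working hypothesis that a post voicing all five features expresses a conspiracy belief. The prompt wording, the demonstration post and its answers, and the function names are invented for illustration. No model is called: the reply string would come from whichever LLM is being evaluated.

```python
FEATURES = ["actor", "action", "threat", "pattern", "secrecy"]

# Invented demonstration example; not taken from any dataset.
DEMO_POST = "They are secretly poisoning our water to control us."
DEMO_ANSWER = {"actor": "yes", "action": "yes", "threat": "yes",
               "pattern": "no", "secrecy": "yes"}


def build_prompt(post):
    """Return a one-shot prompt asking for a yes/no judgment per feature."""
    lines = [
        "For each post, state whether it contains each of these features "
        "(answer yes or no): " + ", ".join(FEATURES) + ".",
        "",
        "Post: " + DEMO_POST,
    ]
    lines += [f"{feature}: {DEMO_ANSWER[feature]}" for feature in FEATURES]
    lines += ["", "Post: " + post, FEATURES[0] + ":"]
    return "\n".join(lines)


def parse_annotation(reply):
    """Turn 'feature: yes/no' lines from a model reply into a dict of booleans."""
    annotation = {}
    for line in reply.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in FEATURES:
            annotation[key] = value.strip().lower() == "yes"
    return annotation


def has_all_features(annotation):
    """Working hypothesis: all five features present marks a conspiracy belief."""
    return all(annotation.get(feature, False) for feature in FEATURES)


prompt = build_prompt("The moon landing was staged.")
print(prompt)

# Hypothetical model reply, written by hand for this example.
reply = "actor: yes\naction: yes\nthreat: no\npattern: yes\nsecrecy: yes"
print(has_all_features(parse_annotation(reply)))
```

Keeping prompt construction and reply parsing in separate functions makes it easier to vary the wording or the demonstration example and then measure how much the annotations change, which is the prompt-sensitivity question raised above.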

Literature:

Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?. arXiv preprint arXiv:2301.07543.

Li, X., Li, Y., Liu, L., Bing, L., & Joty, S. (2022). Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective. arXiv preprint arXiv:2212.10529.

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1-35.

van Prooijen, J. W., & Van Vugt, M. (2018). Conspiracy theories: Evolved functions and psychological mechanisms. Perspectives on psychological science, 13(6), 770-788.

Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.