Publications

  • Fake news on Twitter during the 2016 U.S. presidential election
    Nir Grinberg, Kenneth Joseph, Lisa Friedland, Briony Swire-Thompson, and David Lazer (2019). Science, 363(6425):374–378. ( media coverage, abstract, paper, supplementary material, replication package )
    The spread of fake news on social media became a public concern in the United States after the 2016 presidential election. We examined exposure to and sharing of fake news by registered voters on Twitter and found that engagement with fake news sources was extremely concentrated. Only 1% of individuals accounted for 80% of fake news source exposures, and 0.1% accounted for nearly 80% of fake news sources shared. Individuals most likely to engage with fake news sources were conservative leaning, older, and highly engaged with political news. A cluster of fake news sources shared overlapping audiences on the extreme right, but for people across the political spectrum, most political news exposure still came from mainstream media outlets.
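    The headline concentration figures are a simple top-share computation. A toy Python sketch of the calculation on synthetic, heavy-tailed data (the actual analysis used the linked voter panel and careful labeling of fake news sources):

      import numpy as np

      rng = np.random.default_rng(0)
      # Hypothetical per-user counts of fake news shares; a heavy-tailed
      # draw stands in for the real Twitter panel.
      shares = rng.pareto(a=1.1, size=100_000)

      def top_share(counts, fraction):
          """Share of all activity accounted for by the top `fraction` of users."""
          counts = np.sort(counts)[::-1]
          k = max(1, int(len(counts) * fraction))
          return counts[:k].sum() / counts.sum()

      print(f"top 1%   of users -> {top_share(shares, 0.01):.0%} of shares")
      print(f"top 0.1% of users -> {top_share(shares, 0.001):.0%} of shares")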
  • Auditing Partisan Audience Bias within Google Search
    Ronald Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson (2018). In Proceedings of the ACM on Human-Computer Interaction (CSCW), Volume 2, Article 148. ( abstract, pdf, info )
    There is a growing consensus that online platforms have a systematic influence on the democratic process. However, research beyond social media is limited. In this paper, we report the results of a mixed-methods algorithm audit of partisan audience bias and personalization within Google Search. Following Donald Trump's inauguration, we recruited 187 participants to complete a survey and install a browser extension that enabled us to collect Search Engine Results Pages (SERPs) from their computers. To quantify partisan audience bias, we developed a domain-level score by leveraging the sharing propensities of registered voters on a large Twitter panel. We found little evidence for the "filter bubble" hypothesis. Instead, we found that results positioned toward the bottom of Google SERPs were more left-leaning than results positioned toward the top, and that the direction and magnitude of overall lean varied by search query, component type (e.g. "answer boxes"), and other factors. Utilizing rank-weighted metrics that we adapted from prior work, we also found that Google's rankings shifted the average lean of SERPs to the right of their unweighted average.
    Accompanying blog post (11/2018): Is it the Algorithms or Us?
    Honorable mention for Best Paper at CSCW
    25.5% acceptance rate
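    The domain-level bias score can be illustrated with a toy computation. The sketch below assumes the score is a normalized difference in sharing rates between registered Democrats and Republicans; the paper's exact estimator and data differ, and the domain names are made up:

      import pandas as pd

      # Hypothetical counts of registered Democrats / Republicans sharing each domain.
      counts = pd.DataFrame(
          {"dem_sharers": [900, 150, 500], "rep_sharers": [100, 850, 500]},
          index=["leftnews.example", "rightnews.example", "wire.example"],
      )

      # Normalize within each party, then take a symmetric difference:
      # -1 = exclusively Democratic audience, +1 = exclusively Republican.
      d = counts["dem_sharers"] / counts["dem_sharers"].sum()
      r = counts["rep_sharers"] / counts["rep_sharers"].sum()
      print(((r - d) / (r + d)).round(2))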
  • ConStance: Modeling Annotation Contexts to Improve Stance Classification
    Kenneth Joseph, Lisa Friedland, William Hobbs, David Lazer, and Oren Tsur (2017). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1115–1124. ( abstract, paper, supplementary material, code and replication data, info )
    Manual annotations are a prerequisite for many applications of machine learning. However, weaknesses in the annotation process itself are easy to overlook. In particular, scholars often choose what information to give to annotators without examining these decisions empirically. For subjective tasks such as sentiment analysis, sarcasm, and stance detection, such choices can impact results. Here, for the task of political stance detection on Twitter, we show that providing too little context can result in noisy and uncertain annotations, whereas providing too strong a context may cause it to outweigh other signals. To characterize and reduce these biases, we develop ConStance, a general model for reasoning about annotations across information conditions. Given conflicting labels produced by multiple annotators seeing the same instances with different contexts, ConStance simultaneously estimates gold standard labels and also learns a classifier for new instances. We show that the classifier learned by ConStance outperforms a variety of baselines at predicting political stance, while the model's interpretable parameters shed light on the effects of each context.
    26% acceptance rate
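    ConStance jointly infers gold labels and trains a classifier. The sketch below shows only the simpler core of the first half: estimating gold labels from annotators who saw different contexts, using a classical Dawid-Skene-style EM on synthetic data (a standard baseline in this family, not the paper's model):

      import numpy as np

      rng = np.random.default_rng(1)
      n_items, n_conds, n_classes = 200, 3, 2

      # Synthetic gold stances, plus noisy labels from three context
      # conditions with different (unknown) accuracies.
      gold = rng.integers(n_classes, size=n_items)
      acc = np.array([0.9, 0.7, 0.6])
      labels = np.where(rng.random((n_items, n_conds)) < acc,
                        gold[:, None],
                        rng.integers(n_classes, size=(n_items, n_conds)))

      # Initialize label posteriors from raw vote shares.
      post = np.column_stack([(labels == v).mean(axis=1) for v in range(n_classes)])

      for _ in range(50):
          # M-step: per-condition confusion matrices conf[c, true, observed].
          conf = np.full((n_conds, n_classes, n_classes), 1e-6)
          for c in range(n_conds):
              for obs in range(n_classes):
                  conf[c, :, obs] += post[labels[:, c] == obs].sum(axis=0)
          conf /= conf.sum(axis=2, keepdims=True)
          # E-step: recompute label posteriors (uniform class prior assumed).
          log_post = np.zeros((n_items, n_classes))
          for c in range(n_conds):
              log_post += np.log(conf[c].T[labels[:, c]])
          post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
          post /= post.sum(axis=1, keepdims=True)

      print("accuracy of inferred labels:", (post.argmax(axis=1) == gold).mean())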
  • "Voters of the Year": 19 Voters Who Were Unintentional Election Poll Sensors on Twitter
    William Hobbs, Lisa Friedland, Kenneth Joseph, Oren Tsur, Stefan Wojcik, and David Lazer (2017). International AAAI Conference on Web and Social Media (ICWSM), pp. 544–547. ( abstract, pdf, info )
    Public opinion and election prediction models based on social media typically aggregate, weight, and average signals from a massive number of users. Here, we analyze political attention and poll movements to identify a small number of social "sensors" — individuals whose levels of social media discussion of the major parties' candidates characterized the candidates' ups and downs over the 2016 U.S. presidential election campaign. Starting with a sample of approximately 22,000 accounts on Twitter that we linked to voter registration records, we used penalized regressions to identify a set of 19 accounts (sensors) that were predictive of the candidates' poll numbers (5 for Hillary Clinton, 13 for Donald Trump, and 1 for both). The predictions based on the activity of these handfuls of sensors accurately tracked later movements in poll margins. Despite the regressions allowing both supportive and opposition sensors, our separate models for Trump and Clinton poll support identified sensors for Hillary Clinton who were disproportionately women and for Donald Trump who were disproportionately white. The method did not predict changes in levels of undecideds, and it underestimated support for Donald Trump in September 2016, when the errors were correlated with discussions of protests of police shootings.
    32% acceptance rate
    © AAAI, 2017.
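    The sensor-selection step is essentially a penalized regression of poll numbers on per-account activity, where the L1 penalty zeroes out most accounts. A toy sketch with scikit-learn's Lasso on synthetic data (the paper's time alignment and validation are omitted):

      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(2)
      n_days, n_accounts = 120, 500

      # Daily candidate-mention counts for each account (synthetic).
      activity = rng.poisson(2.0, size=(n_days, n_accounts)).astype(float)

      # Poll numbers driven by a handful of "sensor" accounts plus noise.
      true_sensors = rng.choice(n_accounts, size=10, replace=False)
      polls = activity[:, true_sensors] @ rng.normal(0.5, 0.1, size=10)
      polls += rng.normal(0, 1.0, size=n_days)

      # The L1 penalty drives most coefficients to exactly zero,
      # leaving a small set of predictive accounts.
      model = Lasso(alpha=0.5, max_iter=50_000).fit(activity, polls)
      selected = np.flatnonzero(model.coef_)
      print(f"{len(selected)} accounts selected;",
            f"{len(set(selected) & set(true_sensors))} of 10 true sensors recovered")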
  • Combating Fake News: An Agenda for Research and Action
    David Lazer, Matthew Baum, Nir Grinberg, Lisa Friedland, Kenneth Joseph, Will Hobbs, and Carolina Mattsson (May 2, 2017). Final report from conference held Feb 17–18, 2017 at Harvard and Northeastern Universities. ( abstract, pdf )

    Recent shifts in the media ecosystem raise new concerns about the vulnerability of democratic societies to fake news and the public's limited ability to contain it. Fake news as a form of misinformation benefits from the fast pace at which information travels in today's media ecosystem, in particular across social media platforms. An abundance of information sources online leads individuals to rely heavily on heuristics and social cues in order to determine the credibility of information and to shape their beliefs, which are in turn extremely difficult to correct or change. The relatively small, but constantly changing, number of sources that produce misinformation on social media offers both a challenge for real-time detection algorithms and a promise for more targeted socio-technical interventions.

    There are several possible pathways for reducing fake news, including:
    (1) offering feedback to users that particular news may be fake (which seems to depress overall sharing by those individuals);
    (2) providing ideologically compatible sources that confirm that particular news is fake;
    (3) detecting information that is being promoted by bots and "cyborg" accounts and tuning algorithms not to respond to those manipulations; and
    (4) because a few sources may be the origin of most fake news, identifying those sources and reducing promotion (by the platforms) of information from those sources.
    As a research community, we identified three courses of action that can be taken in the immediate future: involving more conservatives in the discussion of misinformation in politics, collaborating more closely with journalists in order to make the truth "louder," and developing multidisciplinary community-wide shared resources for conducting academic research on the presence and dissemination of misinformation on social media platforms.
    Moving forward, we must expand the study of social and cognitive interventions that minimize the effects of misinformation on individuals and communities, as well as of how socio-technical systems such as Google, YouTube, Facebook, and Twitter currently facilitate the spread of misinformation and what internal policies might reduce those effects. More broadly, we must investigate what the necessary ingredients are for information systems that encourage a culture of truth.

  • Detecting Anomalously Similar Entities in Unlabeled Data
    Lisa Friedland (2016). Ph.D. thesis, University of Massachusetts Amherst. ( abstract, pdf )

    In this work, the goal is to detect closely-linked entities within a data set. The entities of interest have a tie causing them to be similar, such as a shared origin or a channel of influence. Given a collection of people or other entities with their attributes or behavior, we identify unusually similar pairs, and we pose the question: Are these two people linked, or can their similarity be explained by chance?

    Computing similarities is a core operation in many domains, but two constraints differentiate our version of the problem. First, the score assigned to a pair should account for the probability of a coincidental match. Second, no training data is provided; we must learn about the system from the unlabeled data and make reasonable assumptions about the linked pairs. This problem has applications to social network analysis, where it can be valuable to identify implicit relationships among people from indicators of coordinated activity. It also arises in situations where we must decide whether two similar observations correspond to two different entities or to the same entity observed twice.

    This dissertation explores how to assess such ties and, in particular, how the similarity scores should depend on not only the two entities in question but also properties of the entire data set. We develop scoring functions that incorporate both the similarity and rarity of a pair. Then, using these functions, we investigate the statistical power of a data set to reveal (or conceal) such pairs.

    In the dissertation, we develop generative models of linked pairs and independent entities and use them to derive scoring functions for pairs in three different domains: people with job histories, Gaussian-distributed points in Euclidean space, and people (or entities) in a bipartite affiliation graph. For the first, we present a case study in fraud detection that highlights the potential, as well as the complexities, of using these methods to address real-world problems. In the latter two domains, we develop an inference framework to estimate whether two entities were more likely generated independently or as a pair. In these settings, we analyze how the scoring function works in terms of similarity and rarity; how well it can detect pairs as a function of the data set; and how it differs from existing similarity functions when applied to real data.

  • Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification
    Lisa Friedland, Amanda Gentzel, and David Jensen (2014). In Proceedings of the 2014 SIAM International Conference on Data Mining (SDM), pp. 578–586. ( abstract, paper, supplementary material, poster, info )
    Density estimation methods are often regarded as unsuitable for anomaly detection in high-dimensional data due to the difficulty of estimating multivariate probability distributions. Instead, the scores from popular distance- and local-density-based methods, such as local outlier factor (LOF), are used as surrogates for probability densities. We question this infeasibility assumption and explore a family of simple statistically-based density estimates constructed by combining a probabilistic classifier with a naive density estimate. Across a number of semi-supervised and unsupervised problems formed from real-world data sets, we show that these methods are competitive with LOF and that even simple density estimates that assume attribute independence can perform strongly. We show that these density estimation methods scale well to data with high dimensionality and that they are robust to the problem of irrelevant attributes that plagues methods based on local estimates.
    31% acceptance rate

    © Society for Industrial and Applied Mathematics, 2014.
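    The recipe in this paper is compact: fit a naive density estimate, sample "artificial" points from it, train a classifier to separate real from artificial, and correct the naive density by the classifier's odds. A minimal sketch using a uniform naive density and a random forest (the paper evaluates several such combinations; this particular pairing is illustrative):

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      rng = np.random.default_rng(3)

      # "Real" data: a 2-D Gaussian blob; anomalies fall in low-density regions.
      real = rng.normal(0, 1, size=(2000, 2))

      # Naive density p0: uniform over the data's bounding box.
      lo, hi = real.min(axis=0), real.max(axis=0)
      artificial = rng.uniform(lo, hi, size=real.shape)
      log_p0 = -np.log(hi - lo).sum()  # constant log-density of the uniform p0

      # Classifier separates real points from samples of p0.
      X = np.vstack([real, artificial])
      y = np.r_[np.ones(len(real)), np.zeros(len(artificial))]
      clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

      def log_density(points):
          """Classifier-adjusted estimate: log p(x) ~ log p0(x) + log odds(real|x)."""
          c = clf.predict_proba(points)[:, 1].clip(1e-6, 1 - 1e-6)
          return log_p0 + np.log(c / (1 - c))

      print("center:", log_density(np.array([[0.0, 0.0]])))
      print("corner:", log_density(np.array([[3.0, 3.0]])))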

  • Copy or coincidence? A model for detecting social influence and duplication events
    Lisa Friedland, David Jensen, and Michael Lavine (2013). In JMLR Workshop and Conference Proceedings 28(3), Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1175–1183. ( abstract, paper / supplementary material, poster )
    In this paper, we analyze the task of inferring rare links between pairs of entities that seem too similar to have occurred by chance. Variations of this task appear in such diverse areas as social network analysis, security, fraud detection, and entity resolution. To address the task in a general form, we propose a simple, flexible mixture model in which most entities are generated independently from a distribution but a small number of pairs are constrained to be similar. We predict the true pairs using a likelihood ratio that trades off the entities' similarity with their rarity. This method always outperforms using only similarity; however, with certain parameter settings, similarity turns out to be surprisingly competitive. Using real data, we apply the model to detect twins given their birth weights and to re-identify cell phone users based on distinctive usage patterns.
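    For the Gaussian case the likelihood ratio has a closed form. A toy version of the twins example, under an assumed parameterization in which independent weights are i.i.d. normal and paired weights scatter around a shared latent mean (parameter values are made up):

      import numpy as np
      from scipy.stats import multivariate_normal, norm

      mu, sigma, tau = 3400.0, 500.0, 200.0  # grams; illustrative values

      # Pair model: both weights scatter (sd tau) around a shared latent mean,
      # so the marginals match the independent model but the pair is correlated.
      cov_pair = np.array([[sigma**2, sigma**2 - tau**2],
                           [sigma**2 - tau**2, sigma**2]])

      def log_lr(x1, x2):
          """log P(pair) - log P(independent): similarity traded off against rarity."""
          lp_pair = multivariate_normal([mu, mu], cov_pair).logpdf([x1, x2])
          lp_ind = norm(mu, sigma).logpdf(x1) + norm(mu, sigma).logpdf(x2)
          return lp_pair - lp_ind

      print("similar, rare  :", log_lr(2100, 2150))  # close match far from the mean
      print("similar, common:", log_lr(3400, 3450))  # close match near the mean
      print("dissimilar     :", log_lr(2800, 4000))

    The printed scores show the tradeoff: a close match at rare values is far stronger evidence of a link than an equally close match at common values.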
  • Detecting insider threats in a real corporate database of computer usage activities
    Ted Senator, Henry Goldberg, et al. (2013). In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1393–1401. ( abstract, paper )
    This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations' information systems. Our system combines structural and semantic information from a real corporate database of monitored activity on their users' computers to detect independently developed red team inserts of malicious insider activities. We have developed and applied multiple algorithms for anomaly detection based on suspected scenarios of malicious insider behavior, indicators of unusual activities, high-dimensional statistical patterns, temporal sequences, and normal graph evolution. Algorithms and representations for dynamic graph processing provide the ability to scale as needed for enterprise-level deployments on real-time data streams. We have also developed a visual language for specifying combinations of features, baselines, peer groups, time periods, and algorithms to detect anomalies suggestive of instances of insider threat behavior. We defined over 100 data features in seven categories based on approximately 5.5 million actions per day from approximately 5,500 users. We have achieved area under the ROC curve values of up to 0.979 and lift values of 65 on the top 50 user-days identified on two months of real data.
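    The lift metric quoted above is precision among the top-ranked user-days divided by the base rate of malicious user-days. A small sketch of the computation on synthetic scores and labels:

      import numpy as np

      def lift_at_k(scores, labels, k):
          """Precision among the k highest-scored instances, divided by the base rate."""
          top = np.argsort(scores)[::-1][:k]
          return labels[top].mean() / labels.mean()

      rng = np.random.default_rng(4)
      labels = rng.random(300_000) < 0.0005               # rare malicious user-days
      scores = rng.normal(size=labels.size) + 4.0 * labels  # detector favors positives
      print(f"lift@50 = {lift_at_k(scores, labels, 50):.0f}")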
  • Feature extraction and machine learning on symbolic music using the music21 toolkit
    Michael Cuthbert, Chris Ariza, and Lisa Friedland (2011). In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), pp. 387–392. ( abstract, paper )
    Machine learning and artificial intelligence have great potential to help researchers understand and classify musical scores and other symbolic musical data, but the difficulty of preparing and extracting characteristics (features) from symbolic scores has hindered musicologists (and others who examine scores closely) from using these techniques. This paper describes the "feature" capabilities of music21, a general-purpose, open source toolkit for analyzing, searching, and transforming symbolic music data. The features module of music21 integrates standard feature-extraction tools provided by other toolkits, includes new tools, and also allows researchers to write new and powerful extraction methods quickly. These developments take advantage of the system's built-in capacities to parse diverse data formats and to manipulate complex scores (e.g., by reducing them to a series of chords, determining key or metrical strength automatically, or integrating audio data). This paper's demonstrations combine music21 with the data mining toolkits Orange and Weka to distinguish works by Monteverdi from works by Bach and German folk music from Chinese folk music.
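    A minimal sketch of the feature-extraction workflow described here, following the music21 documentation (the feature IDs 'ql1'-'ql3' are examples; check names against the installed version):

      from music21 import corpus, features

      # Parse a score from music21's built-in corpus.
      s = corpus.parse('bach/bwv66.6')

      # Build a feature table using the toolkit's feature-extractor registry.
      ds = features.DataSet(classLabel='Composer')
      ds.addFeatureExtractors(features.extractorsById(['ql1', 'ql2', 'ql3']))
      ds.addData(s, classValue='Bach')
      ds.process()
      ds.write('bach_features.csv')  # importable into Orange, Weka, etc.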
  • Detecting Social Ties and Copying Events from Affiliation Data
    Lisa Friedland (2010). In Proceedings of the 24th AAAI Conference on Artificial Intelligence, pp. 1982–1983. (Fifteenth AAAI/SIGART Doctoral Consortium.) ( abstract, pdf, info )
    The goal of my work is to detect implicit social ties or closely-linked entities within a data set. In data consisting of people (or other entities) and their affiliations or discrete attributes, we identify unusually similar pairs of people, and we pose the question: Can their similarity be explained by chance, or is it due to a direct ("copying") relationship between the people? The thesis will explore how to assess this question, and in particular how one's judgments and confidence depend not only on the two people in question but also on properties of the entire data set. I will provide a framework for solving this problem and experiment with it across multiple synthetic and real-world data sets. My approach requires a model of the copying relationship, a model of independent people, and a method for distinguishing between them. I will focus on two aspects of the problem: (1) choosing background models to fit arbitrary, correlated affiliation data, and (2) understanding how the ability to detect copies is affected by factors like data sparsity and the numbers of people and affiliations, independent of the fit of the models.
    © AAAI, 2010.
  • Joke retrieval: Recognizing the same joke told differently
    Lisa Friedland and James Allan (2008). In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), pp. 883–892. ( abstract, info )
    In a corpus of jokes, a human might judge two documents to be the "same joke" even if characters, locations, and other details are varied. A given joke could be retold with an entirely different vocabulary while still maintaining its identity. Since most retrieval systems consider documents to be related only when their word content is similar, we propose joke retrieval as a domain where standard language models may fail. Other meaning-centric domains include logic puzzles, proverbs and recipes; in such domains, new techniques may be required to enable us to search effectively. For jokes, a necessary component of any retrieval system will be the ability to identify the "same joke," so we examine this task in both ranking and classification settings. We exploit the structure of jokes to develop two domain-specific alternatives to the "bag of words" document model. In one, only the punch lines, or final sentences, are compared; in the second, certain categories of words (e.g., professions and countries) are tagged and treated as interchangeable. Each technique works well for certain jokes. By combining the methods using machine learning, we create a hybrid that achieves higher performance than any individual approach.
    17% acceptance rate
    © ACM, 2008. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. This is a minor revision of the work published at http://doi.acm.org/10.1145/1458082.1458199
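    The punch-line technique is easy to illustrate: compare only each joke's final sentence rather than the full text. A toy sketch with TF-IDF and cosine similarity (crude sentence splitting; the paper's implementation differs):

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      jokes = [
          "A horse walks into a bar. The bartender asks, why the long face?",
          "A pony trots into a pub. The barman asks, why the long face?",
          "Why did the chicken cross the road? To get to the other side.",
      ]

      # Compare only the punch lines (final sentences), not the full texts.
      punch_lines = [j.rstrip('.?!').rsplit('. ', 1)[-1] for j in jokes]
      tfidf = TfidfVectorizer().fit_transform(punch_lines)
      print(np.round(cosine_similarity(tfidf), 2))

    The first two retellings share almost no setup vocabulary but match closely on the punch line, which is exactly the case a whole-document model misses.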
  • Anomaly detection for inferring social structure
    Lisa Friedland (2008). In Encyclopedia of Data Warehousing and Mining, Second Edition (John Wang, Ed.), pp. 39–44. IGI Global. ( abstract, sample pages )
    In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and views constructed of the data. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals that intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure, we use an anomaly detection approach: develop a model to describe the data, then identify outliers.
  • Finding tribes: Identifying close-knit individuals from employment patterns
    Lisa Friedland and David Jensen (2007). In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 290–299. ( abstract, info, paper, poster )
    We present a family of algorithms to uncover tribes—groups of individuals who share unusual sequences of affiliations. While much work on inferring community structure describes large-scale trends, we instead search for small groups of tightly linked individuals who behave anomalously with respect to those trends. We apply the algorithms to a large temporal and relational data set consisting of millions of employment records from the National Association of Securities Dealers. The resulting tribes contain individuals at higher risk for fraud, are homogeneous with respect to risk scores, and are geographically mobile, all at significant levels compared to random or to other sets of individuals who share affiliations.
    <20% acceptance rate
    © ACM, 2007. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available at doi.acm.org/10.1145/1281192.1281226
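    One way to read the scoring idea: shared affiliations count for more when each one is rare in the population. A toy sketch with illustrative data (not the paper's actual family of algorithms, which scores shared sequences of affiliations):

      from itertools import combinations
      from math import log

      # Hypothetical employment histories: person -> set of firms.
      history = {
          "ann":  {"acme", "zenith", "nadir"},
          "bob":  {"acme", "zenith", "nadir"},
          "carl": {"acme", "megacorp"},
          "dana": {"megacorp", "acme"},
      }

      # Firm frequencies across the population (rarity weights).
      n = len(history)
      freq = {}
      for firms in history.values():
          for f in firms:
              freq[f] = freq.get(f, 0) + 1

      def pair_score(a, b):
          """Sum of -log(frequency) over shared firms: many rare overlaps score high."""
          return sum(-log(freq[f] / n) for f in history[a] & history[b])

      for a, b in combinations(history, 2):
          print(f"{a}-{b}: {pair_score(a, b):.2f}")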
  • Relational data pre-processing techniques for improved securities fraud detection
    Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, David Jensen, Henry G. Goldberg, and John Komoroske (2007). In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 941–949. ( abstract, info )
    Commercial datasets are often large, relational and dynamic. They contain many records of people, places, things and events and their interactions over time. Such datasets are rarely structured appropriately for important knowledge discovery tasks, and they often contain variables whose meaning changes across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data pre-processing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.
    20% acceptance rate
    © ACM, 2007. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available at doi.acm.org/10.1145/1281192.1281293
  • Exploiting relational structure to understand publication patterns in high-energy physics
    Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (2003). SIGKDD Explorations, 5(2):165–172. ( abstract, info )
    We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus our analyses on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics that could not have been achieved without examining each area in detail.
    First-place winner, task 4 of KDD Cup 2003.
  • Learning relational probability trees
    Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay (2003). In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 625–630. ( abstract, info, paper, poster )
    Classification trees are widely used in the machine learning and data mining communities for modeling propositional data. Recent work has extended this basic paradigm to probability estimation trees. Traditional tree learning algorithms assume that instances in the training data are homogeneous and independently distributed. Relational probability trees (RPTs) extend standard probability estimation trees to a relational setting in which data instances are heterogeneous and interdependent. Our algorithm for learning the structure and parameters of an RPT searches over a space of relational features that use aggregation functions (e.g. AVERAGE, MODE, COUNT) to dynamically propositionalize relational data and create binary splits within the RPT. Previous work has identified a number of statistical biases due to characteristics of relational data such as autocorrelation and degree disparity. The RPT algorithm uses a novel form of randomization test to adjust for these biases. On a variety of relational learning tasks, RPTs built using randomization tests are significantly smaller than other models and achieve equivalent, or better, performance.
    27% acceptance rate for research-track papers and posters
    © ACM, 2003. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available at http://doi.acm.org/10.1145/956750.956830
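    The dynamic propositionalization step can be shown in a few lines: aggregate each instance's related records with AVERAGE, MODE, and COUNT, then fit an ordinary tree on the flattened table. This sketches the representation only, not the RPT search or its randomization tests (data are made up):

      import pandas as pd
      from sklearn.tree import DecisionTreeClassifier

      # Relational data: papers, each linked to a variable number of references.
      refs = pd.DataFrame({
          "paper":     [1, 1, 1, 2, 2, 3],
          "ref_year":  [1998, 2001, 2002, 1995, 1996, 2002],
          "ref_venue": ["kdd", "kdd", "icml", "jmlr", "jmlr", "kdd"],
      })

      # Propositionalize with aggregation functions over each paper's references.
      flat = refs.groupby("paper").agg(
          avg_ref_year=("ref_year", "mean"),                     # AVERAGE
          mode_venue=("ref_venue", lambda s: s.mode().iloc[0]),  # MODE
          n_refs=("ref_year", "count"),                          # COUNT
      )
      flat["mode_venue"] = flat["mode_venue"].astype("category").cat.codes

      labels = pd.Series([1, 0, 1], index=[1, 2, 3])  # e.g., accepted or not
      tree = DecisionTreeClassifier(max_depth=2).fit(flat, labels.loc[flat.index])
      print(flat)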
  • Characterization of single-nucleotide polymorphisms in coding regions of human genes
    Michele Cargill, David Altshuler, et al. (1999). Nature Genetics, 22(3):231–238. ( abstract, info, pdf )
    Michele Cargill, David Altshuler, James Ireland, Pamela Sklar, Kristin Ardlie, Nila Patil, Nila Shaw, Charles R. Lane, Esther P. Lim, Nilesh Kalyanaraman, James Nemesh, Liuda Ziaugra, Lisa Friedland, Alex Rolfe, Janet Warrington, Robert Lipshutz, George Q. Daley, and Eric S. Lander
    A major goal in human genetics is to understand the role of common genetic variants in susceptibility to common diseases. This will require characterizing the nature of gene variation in human populations, assembling an extensive catalogue of single-nucleotide polymorphisms (SNPs) in candidate genes and performing association studies for particular diseases. At present, our knowledge of human gene variation remains rudimentary. Here we describe a systematic survey of SNPs in the coding regions of human genes. We identified SNPs in 106 genes relevant to cardiovascular disease, endocrinology and neuropsychiatry by screening an average of 114 independent alleles using 2 independent screening methods. To ensure high accuracy, all reported SNPs were confirmed by DNA sequencing. We identified 560 SNPs, including 392 coding-region SNPs (cSNPs) divided roughly equally between those causing synonymous and non-synonymous changes. We observed different rates of polymorphism among classes of sites within genes (non-coding, degenerate and non-degenerate) as well as between genes. The cSNPs most likely to influence disease, those that alter the amino acid sequence of the encoded protein, are found at a lower rate and with lower allele frequencies than silent substitutions. This likely reflects selection acting against deleterious alleles during human evolution. The lower allele frequency of missense cSNPs has implications for the compilation of a comprehensive catalogue, as well as for the subsequent application to disease association.