Crowdsourcing salient information from news articles and tweets


This repository contains preliminary results on identifying linguistic features for novelty detection in news articles and tweets. We report the results of a crowdsourcing experimental pipeline for assessing the relevance of tweets and news article snippets, together with the sentiments they express and their intensities. The main focus of this dataset is to gather initial insights into relevant and novel information with regard to the event of "whaling".

All the crowdsourcing experiments were run on the CrowdTruth platform, and the results were processed and analyzed using the CrowdTruth methodology and metrics. For more information, check the CrowdTruth website. The annotated data was gathered through the CrowdFlower marketplace.

Check the Results & Download the Data: Salience-In-News-And-Tweets

Table of Contents:

Dataset Files:

|--/aggregate

Various aggregated datasets collected as part of the workflow for identifying salient features in news articles and tweets. We describe here the most important files:

|--/aggregate/aggregatedResults_newsArticles.csv

This file contains the processed ground truth for the news articles related to the whaling event, in comma-separated format. It contains the aggregated relevance results of the snippets, together with the sentiment and intensity of both the snippets and the relevant event mentions identified in them. The columns are:

  • Dataset: reference to the dataset, news - DS1
  • Unit Id: unique ID of the data entry
  • Title Id: news article unique title ID
  • Title: news article title
  • Snippet Id: news article unique snippet ID
  • Snippet: news article snippet
  • Overlapping Snippet: binary value describing whether the snippet contains overlapping tokens with the title (1) or not (0)
  • Snippet Relevance Score: the snippet relevance score; computed using the cosine similarity measure, shows the likelihood that the given snippet is relevant for the news article title
  • Number of Relevant Mentions: total number of relevant event mentions identified by the crowd in the given snippet
  • Overall Sentiment-Intensity: binary value describing whether the following columns contain sentiment and intensity scores for the snippet (1) or for the relevant event mentions identified in the given snippet (0)
  • Relevant Mention: relevant event mention
  • Relevant Mention Score: the event mention relevance score; computed using the cosine similarity measure, shows the likelihood that the given mention in the snippet is relevant for the news article title
  • Positive Sentiment, Negative Sentiment, Neutral Sentiment: the sentiment scores of the snippets and event mentions; computed using the cosine similarity measure, shows the likelihood that the given snippet or mention expresses the given sentiment
  • High Intensity, Low Intensity, Medium Intensity: the intensity scores of the snippets and event mentions; computed using the cosine similarity measure, shows the likelihood that the given snippet or mention expresses a sentiment with the given intensity
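Since each row of this file is either a snippet-level record or a mention-level record, the "Overall Sentiment-Intensity" flag described above is the key to reading it. The following sketch shows how the two kinds of rows can be separated with pandas; the toy rows are illustrative values mimicking the column layout, not taken from the actual dataset:

```python
import pandas as pd

# Toy rows mimicking the columns described above (illustrative values only).
df = pd.DataFrame([
    {'Snippet Id': '0-25940', 'Overall Sentiment-Intensity': 1,
     'Relevant Mention': None, 'Positive Sentiment': 0.10},
    {'Snippet Id': '0-25940', 'Overall Sentiment-Intensity': 0,
     'Relevant Mention': 'whaling ban', 'Positive Sentiment': 0.63},
])

# Rows flagged with 1 carry the overall sentiment/intensity of the snippet...
snippet_rows = df[df['Overall Sentiment-Intensity'] == 1]
# ...while rows flagged with 0 carry the scores of individual event mentions.
mention_rows = df[df['Overall Sentiment-Intensity'] == 0]

print(len(snippet_rows), len(mention_rows))
```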
|--/aggregate/aggregatedResults_tweets2014&2015.csv

This file contains the processed ground truth for the tweets related to the whaling event (from 2014 and 2015), in comma-separated format. It contains the aggregated relevance results of the tweets and their relevant event mentions, together with the sentiment and intensity of both the overall tweets and the event mentions. The columns are:

  • Dataset: reference to the dataset, tweets 2014 - DS2, tweets 2015 - DS3
  • Tweet Id: unique ID of the tweet data entry
  • Tweet Author: tweet author
  • Tweet Date: tweet date
  • Tweet Seed Index: unique tweet-event ID
  • Tweet Content: tweet content
  • Tweet Event Relevance Score: the tweet relevance score with regard to the whaling event; computed using the cosine similarity measure, shows the likelihood that the given tweet is relevant for the whaling event
  • Number of Relevant Mentions: total number of relevant event mentions identified by the crowd in the given tweet
  • Overall Sentiment-Intensity: binary value describing whether the following columns contain sentiment and intensity scores for the tweet (1) or for the relevant event mentions identified in the given tweet (0)
  • Relevant Mention: relevant event mention
  • Relevant Mention Score: the event mention relevance score; computed using the cosine similarity measure, shows the likelihood that the given mention in the tweet is relevant for the whaling event
  • Positive Sentiment, Negative Sentiment, Neutral Sentiment: the sentiment scores of the tweet and event mentions; computed using the cosine similarity measure, shows the likelihood that the given tweet or mention expresses the given sentiment
  • High Intensity, Low Intensity, Medium Intensity: the intensity scores of the tweet and event mentions; computed using the cosine similarity measure, shows the likelihood that the given tweet or mention expresses a sentiment with the given intensity
|--aggregate/orderedSnippetsByRelevance.csv

The file contains the relevant news snippets ordered by their relevance score: overlapping news snippets in descending order, non-overlapping news snippets in ascending order.

|--aggregate/snippetsPositionInArticle.csv

The file contains measures for snippets relevance with regard to their position in the news articles.

|--aggregate/orderedNewsSnippetsBySentiments.csv

The file contains the relevant news snippets ordered by their sentiment scores: positive sentiment descending, negative sentiment ascending.

|--aggregate/orderedTweetsByRelevance.csv

The file contains the relevant tweets ordered by their relevance score.

|--aggregate/orderedTweetsByMentions.csv

The file contains the relevant tweets ordered by their total number of relevant event mentions.

|--aggregate/histogramRelevantTweets.csv

The file contains the number of relevant tweets in each relevance score interval.
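Such a histogram can be reproduced by binning the relevance scores into fixed-width intervals and counting the tweets per interval. A minimal sketch with pandas, using synthetic scores and assumed bin edges of width 0.2 (the published counts are in histogramRelevantTweets.csv):

```python
import pandas as pd

# Synthetic relevance scores (illustrative; the real scores come from
# aggregate/orderedTweetsByRelevance.csv).
scores = pd.Series([0.05, 0.12, 0.35, 0.41, 0.58, 0.77, 0.93])

# Bin the scores into intervals of width 0.2 and count tweets per interval,
# which is the shape of the data stored in histogramRelevantTweets.csv.
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
counts = pd.cut(scores, bins=bins, include_lowest=True).value_counts().sort_index()
print(counts.tolist())  # → [2, 1, 2, 1, 1]
```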

|--aggregate/orderedTweetsBySentiments.csv

The file contains the relevant tweets ordered by their sentiment scores: positive sentiment descending, negative sentiment ascending.

|--aggregate/tweetsChangeInSentiment.csv

The file contains relevant event mentions in tweets that refer to the "whaling ban". Each such relevant event mention has associated sentiment and intensity scores.

|--/input
|  |--/seedWords_domainExperts.csv

The file contains relevant seed words for the whaling event, obtained from social sciences domain experts. Each column of the file represents a type: Event, Location, Actor/Organization, Other.

|--/raw
|  |--/Relevance Analysis
|  |  |--/News
|  |  |--/Tweets
|  |--/Sentiment Analysis
|  |  |--/News
|  |  |--/Tweets

The raw data collected from crowdsourcing for each of the two tasks.

Crowdsourcing Experiments:

The overall workflow consists of two crowdsourcing tasks for each dataset:

  1. Relevance Analysis: to identify the relevant news snippets and tweets;
  2. Sentiment Analysis: to identify (1) the sentiment of each relevant event mention in all the relevant news snippets and tweets and (2) the overall sentiment of all the relevant news snippets and tweets.

During the "Relevance Analysis" task on the news articles dataset, the crowd is first asked to select all the snippets that are relevant with regard to the article title (where the title is considered an expression of the event), and then to highlight in them all the relevant event mentions. For the tweets dataset, the crowd is asked to assign relevant events (from a list of predefined events) to each tweet and, similarly, to highlight all the relevant event mentions in it. This results in a set of relevant snippets and tweets, and a set of relevant event mentions in those. Using the CrowdTruth cosine similarity metric, we compute relevance scores with regard to the "whaling event" for each snippet, tweet, and event mention.
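The cosine-similarity-based scoring can be sketched as follows: each worker's annotation on a unit (snippet or tweet) is a vector over the answer options, the unit vector sums all worker vectors, and the score of an option is the cosine between the unit vector and a one-hot vector for that option. This is a minimal illustration with made-up judgments; see the CrowdTruth metrics documentation for the exact definitions:

```python
import numpy as np

# Worker vectors for one snippet: each row is one worker's annotation over
# the options [relevant, not relevant]. Illustrative judgments, not real data.
worker_vectors = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
])

# The unit vector aggregates all worker judgments on the snippet.
unit_vector = worker_vectors.sum(axis=0)

# The score of the "relevant" option is the cosine similarity between the
# unit vector and a one-hot vector for that option.
relevant = np.array([1, 0])
score = unit_vector @ relevant / (np.linalg.norm(unit_vector) * np.linalg.norm(relevant))
print(round(float(score), 2))  # 2/sqrt(5) ≈ 0.89
```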

During the "Sentiment Analysis" task we gather from the crowd the sentiment (positive, neutral, or negative) and its intensity (high, medium, or low) for (1) all the event mentions identified in the "Relevance Analysis" task and (2) each relevant snippet and tweet as a whole. Here again, we use the CrowdTruth cosine similarity metric to compute sentiment and intensity scores for each event mention, snippet, and tweet.

Check the crowdsourcing templates below.

Relevance Analysis Task on News articles (DS1) (click here to enlarge the picture and read the crowdsourcing task instructions)

Fig.1: CrowdTruth Workflow for Identifying Salient Features in News - DS1.

The relevant News Articles to the Whaling Event are used as input for the Sentiment Analysis Task.

Sentiment Analysis Task on News articles (DS1) (click here to enlarge the picture and read the crowdsourcing task instructions)

Fig.1: CrowdTruth Workflow for Identifying Salient Features in News - DS1.

Relevance Analysis Task on Tweets (DS2&DS3) (click here to enlarge the picture and read the crowdsourcing task instructions)

Fig.2: CrowdTruth Workflow for Identifying Salient Features in Tweets 2014 - DS2 & Tweets 2015 - DS3.

The relevant Tweets to the Whaling Event are used as input for the Sentiment Analysis Task on Tweets

Sentiment Analysis Task on Tweets (DS2&DS3) (click here to enlarge the picture and read the crowdsourcing task instructions)

Fig.2: CrowdTruth Workflow for Identifying Salient Features in Tweets 2014 - DS2 & Tweets 2015 - DS3.

Experiments Results:

Relevance Analysis Task on News Articles (DS1)

Extract the relevant text snippets, both with overlapping tokens with the article title, and without overlapping tokens with the title:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
%matplotlib inline

orderedDF = pd.read_csv('aggregate/orderedSnippetsByRelevance.csv', sep='\t')

The following two plots show the distribution of the relevance scores of the text snippets (overlapping with the title - blue, non-overlapping with the title - red) and the distribution of the number of relevant mentions identified by the crowd, across the relevant text snippets. As we can observe, the distribution of relevant mentions in a text snippet follows the same trend as the relevance score of the text snippets: the more relevant the text snippet, the more relevant mentions found by the crowd.

In [2]:
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 2, 1)
plt.title('Snippets Relevance Distribution', size=20)
orderedDF['subset'] = np.select([orderedDF['Overlapping Snippet'] == 1, orderedDF['Overlapping Snippet'] == 0], ['relevance score - overlapping snippets', 'relevance score - nonoverlapping snippets'], -1)
for color, label in zip(['cornflowerblue', 'indianred'], ['relevance score - overlapping snippets', 'relevance score - nonoverlapping snippets']):
    subset = orderedDF[orderedDF['subset'] == label]
    plt.plot(subset['index'], subset['Snippet Relevance Score'], c=color, label=str(label))
    plt.fill_between(subset['index'], subset['Snippet Relevance Score'], 0, where=subset['index'], color=color)
plt.xlim(0, len(orderedDF['index']))
plt.ylabel('snippet relevance score', fontsize=16)
plt.xlabel('snippet index', fontsize=16)
plt.legend(prop={'size':14})

plt.subplot(1, 2, 2)
plt.title('# Relevant Mentions Distribution', size=20)
orderedDF['subset'] = np.select([orderedDF['Overlapping Snippet'] == 1, orderedDF['Overlapping Snippet'] == 0], ['#mentions - overlapping snippets', '#mentions - nonoverlapping snippets'], -1)
for color, label in zip(['cornflowerblue', 'indianred'], ['#mentions - overlapping snippets', '#mentions - nonoverlapping snippets']):
    subset = orderedDF[orderedDF['subset'] == label]
    plt.plot(subset['index'], subset['Number of Relevant Mentions'], c=color, label=str(label))
    plt.fill_between(subset['index'], subset['Number of Relevant Mentions'], 0, where=subset['index'], color=color)
    z = np.polyfit(subset['index'],subset['Number of Relevant Mentions'],2)
    p = np.poly1d(z)
    plt.plot(subset['index'],subset['Number of Relevant Mentions'], color, subset['index'], p(subset['index']),'k-')
plt.xlim(0, len(orderedDF['index']))
plt.ylabel('# relevant mentions in snippet', fontsize=16)
plt.xlabel('snippet index', fontsize=16)
plt.legend(prop={'size':14})
plt.show()

Correlation between the text snippet relevance score and the text snippet position in the article

  • we take here as an example the article "25940", where the number represents the 'Title Id'.
In [3]:
dfRelevantSnippets = pd.read_csv('aggregate/aggregatedResults_newsArticles.csv')
dfRelevantSnippets = dfRelevantSnippets.loc[dfRelevantSnippets['Overall Sentiment-Intensity'] != 0]

dfRelevantSnippets.sort_values(['Title Id'], ascending=True, inplace=True)
dfRelevantSnippets['Title Snippet Index'] = [int(i.split('-')[0]) for i in dfRelevantSnippets['Snippet Id']]
dfRelevantSnippets.sort_values(['Title Id', 'Title Snippet Index'], ascending=[True, True], inplace=True)
chosenTitle = dfRelevantSnippets.loc[dfRelevantSnippets['Title Id'] == 25940]

plt.rcParams['figure.figsize'] = 12, 4
plt.title('Relevant Snippets Scores Distribution in a News Article', size=20)
chosenTitle['subset'] = np.select([chosenTitle['Overlapping Snippet'] == 1, chosenTitle['Overlapping Snippet'] == 0], ['overlapping snippet', 'nonoverlapping snippet'], -1)
for color, label in zip(['cornflowerblue', 'indianred'], ['overlapping snippet', 'nonoverlapping snippet']):
    subset = chosenTitle[chosenTitle['subset'] == label]
    plt.bar(subset['Title Snippet Index'], subset['Snippet Relevance Score'], color=color, label=str(label))
z = np.polyfit(chosenTitle['Title Snippet Index'],chosenTitle['Snippet Relevance Score'],2)
p = np.poly1d(z)
plt.plot(chosenTitle['Title Snippet Index'], p(chosenTitle['Title Snippet Index']),'k-')
plt.xlim(3)
plt.ylabel('snippet relevance score', fontsize=16)
plt.xlabel('snippet position in article', fontsize=16)
plt.legend(bbox_to_anchor=(1.01, 1), loc=2, prop={'size':14})
Out[3]:
<matplotlib.legend.Legend at 0x42cb210>

Next, we generalize this across all the news articles by splitting each article into 3 parts:

  • news snippets located at the beginning of the article
  • news snippets located in the middle of the article
  • news snippets that are located at the end of the article.
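The three-way split above can be sketched by cutting each article's ordered snippets into three equal parts and averaging the relevance per part. The scores below are synthetic and purely illustrative; the published aggregates are in snippetsPositionInArticle.csv:

```python
import numpy as np
import pandas as pd

# Synthetic relevance scores of one article's snippets, in article order.
scores = pd.Series([0.9, 0.7, 0.4, 0.3, 0.5, 0.2, 0.3, 0.1, 0.2])

# Split the snippet positions into three parts: beginning, middle, end,
# then compute the average relevance score per part.
parts = np.array_split(scores, 3)
avg = [round(float(p.mean()), 2) for p in parts]
print(avg)  # → [0.67, 0.33, 0.2]
```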
In [4]:
dfSnippetsPosition = pd.read_csv('aggregate/snippetsPositionInArticle.csv')
dfSnippetsPosition
Out[4]:
   Snippets Position  # snippets with maximum relevance score  average maximum relevance score  average relevance score
0  Beginning                                               12                             0.79                     0.34
1  Middle                                                   9                             0.73                     0.26
2  End                                                      9                             0.80                     0.20

3 rows × 4 columns

In [5]:
plt.rcParams['figure.figsize'] = [18.0, 4.0]

labels = ['Beginning','Middle','End']
x = range(3)
plt.subplot(1, 2, 1)
plt.text(-0.1, 1.25, "Avg. Snippets Relevance wrt Snippets Position in Article",fontsize=18)
plt.xticks(x, labels, fontsize=16)
plt.plot(x, dfSnippetsPosition['average maximum relevance score'], '-o', color = 'g', lw = 2, label = "avg. maximum relevance score")
plt.plot(x, dfSnippetsPosition['average relevance score'], '-s', color = 'b', lw = 2, label = "avg. relevance score")
plt.ylim(0.0, 1.0)
plt.ylabel('scores', fontsize=16)
plt.xlabel('snippets position in article', fontsize=16)
plt.legend(loc=9, bbox_to_anchor=(0.5, 1.15), ncol=2, prop={'size':12})

plt.subplot(1, 2, 2)
plt.text(-0.1, 16.0, "Max Snippets Relevance wrt Snippets Position in Article",fontsize=18)
plt.xticks(x, labels, fontsize=16)
plt.plot(x, dfSnippetsPosition['# snippets with maximum relevance score '], '-o', color = 'g', lw = 2, label = "# snippets with max relevance score")
plt.ylim(1.0, 13.0)
plt.ylabel('# snippets', fontsize=16)
plt.xlabel('snippets position in article', fontsize=16)
plt.legend(loc=9, bbox_to_anchor=(0.5, 1.15), ncol=2, prop={'size':12})
Out[5]:
<matplotlib.legend.Legend at 0x4795a10>

Sentiment Analysis Task on News Articles (DS1)

Overview of the sentiment distribution for the relevant news snippets:

In [6]:
overallSentiment = pd.read_csv('aggregate/orderedNewsSnippetsBySentiments.csv', sep='\t')
In [7]:
plt.rcParams['figure.figsize'] = 12, 3
plt.text(60, 1.25, "Sentiment Distribution of Relevant Snippets",fontsize=18)
plt.plot(overallSentiment['index'], overallSentiment['Neutral Sentiment'], color = 'y', lw = 1, label = "neutral sentiment")
plt.fill_between(overallSentiment['index'], overallSentiment['Neutral Sentiment'], 0, where=overallSentiment['index'], color='y')
plt.plot(overallSentiment['index'], overallSentiment['Negative Sentiment'], color = 'r', lw = 1, label = "negative sentiment", alpha=0.5)
plt.fill_between(overallSentiment['index'], overallSentiment['Negative Sentiment'], 0, where=overallSentiment['index'], color='r', alpha=0.5)
plt.plot(overallSentiment['index'], overallSentiment['Positive Sentiment'], color = 'b', lw = 1, label = "positive sentiment", alpha=0.5)
plt.fill_between(overallSentiment['index'], overallSentiment['Positive Sentiment'], 0, where=overallSentiment['index'], color='b', alpha=0.5)
plt.xlim(0, len(overallSentiment['index']))
plt.ylabel('snippets sentiment scores', fontsize=14)
plt.xlabel('snippet index', fontsize=14)
plt.legend(loc=9, bbox_to_anchor=(0.5, 1.2), ncol=3, prop={'size':12})
Out[7]:
<matplotlib.legend.Legend at 0x49a1bd0>

Relevance Analysis Task on Tweets (DS2&DS3)

We merge the results of the two tweet datasets (from 2014 and 2015) and use their aggregation for the rest of the analysis.
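The merge itself is a plain concatenation of the per-year result frames. A minimal sketch with toy frames standing in for the 2014 (DS2) and 2015 (DS3) results; the real files live under aggregate/:

```python
import pandas as pd

# Toy frames standing in for the 2014 (DS2) and 2015 (DS3) tweet results
# (illustrative rows, not real data).
ds2 = pd.DataFrame({'Dataset': ['DS2'], 'Tweet Event Relevance Score': [0.8]})
ds3 = pd.DataFrame({'Dataset': ['DS3'], 'Tweet Event Relevance Score': [0.6]})

# Concatenate the two years and re-index so the merged frame can be
# analyzed as a single dataset.
merged = pd.concat([ds2, ds3], ignore_index=True)
print(merged['Dataset'].tolist())  # → ['DS2', 'DS3']
```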

In [8]:
orderedDF = pd.read_csv('aggregate/orderedTweetsByRelevance.csv', sep='\t')
orderedMentions = pd.read_csv('aggregate/orderedTweetsByMentions.csv', sep='\t')
histDF = pd.read_csv('aggregate/histogramRelevantTweets.csv', sep='\t')

Overview of:

  • tweet-event relevance score distribution
  • number of relevant event mentions identified by the crowd
  • histogram of tweet-event relevance scores
In [9]:
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 3, 1)
plt.title('Tweet-Event Relevance Score Distribution')
plt.plot(orderedDF['index'], orderedDF['Tweet Event Relevance Score'], c='cornflowerblue', label='tweet-event relevance score')
plt.fill_between(orderedDF['index'], orderedDF['Tweet Event Relevance Score'], 0, where=orderedDF['index'], color='cornflowerblue')
plt.xlim(0, len(orderedDF['index']))
plt.ylabel('tweet event relevance score')
plt.xlabel('tweet index')
plt.legend()

plt.subplot(1, 3, 2)
plt.title('# Relevant Mentions Distribution')
plt.plot(orderedMentions['index'], orderedMentions['Number of Relevant Mentions'], c='cornflowerblue', label='# relevant mentions')
plt.fill_between(orderedMentions['index'], orderedMentions['Number of Relevant Mentions'], 0, where=orderedMentions['index'], color='cornflowerblue')
z = np.polyfit(orderedMentions['index'],orderedMentions['Number of Relevant Mentions'],2)
p = np.poly1d(z)
plt.plot(orderedMentions['index'], p(orderedMentions['index']),'k-')
plt.xlim(0, len(orderedMentions['index']))
plt.ylabel('# relevant mentions in tweet')
plt.xlabel('tweet index')
plt.legend()

plt.subplot(1, 3, 3)
plt.bar(histDF['Relevance Interval'], histDF['Tweets Count'], width = 0.08, color='cornflowerblue',align='center')
plt.title("Histogram of Tweet-Event Relevance")
plt.xlabel("Tweet-Event Relevance Interval")
plt.ylabel("Tweets Count")
plt.xticks(histDF['Relevance Interval'])
plt.show()

Sentiment Analysis Task on Tweets (DS2&DS3)

Overview of the sentiment distribution for the relevant tweets:

In [10]:
overallSentiment = pd.read_csv('aggregate/orderedTweetsBySentiments.csv', sep='\t')

plt.rcParams['figure.figsize'] = 12, 3
plt.text(170, 1.25, "Sentiment Distribution of Relevant Tweets",fontsize=18)
plt.plot(overallSentiment['index'], overallSentiment['Neutral Sentiment'], color = 'y', lw = 1, label = "neutral sentiment")
plt.fill_between(overallSentiment['index'], overallSentiment['Neutral Sentiment'], 0, where=overallSentiment['index'], color='y')
plt.plot(overallSentiment['index'], overallSentiment['Negative Sentiment'], color = 'r', lw = 1, label = "negative sentiment", alpha=0.5)
plt.fill_between(overallSentiment['index'], overallSentiment['Negative Sentiment'], 0, where=overallSentiment['index'], color='r', alpha=0.5)
plt.plot(overallSentiment['index'], overallSentiment['Positive Sentiment'], color = 'b', lw = 1, label = "positive sentiment", alpha=0.5)
plt.fill_between(overallSentiment['index'], overallSentiment['Positive Sentiment'], 0, where=overallSentiment['index'], color='b', alpha=0.5)
plt.xlim(0, len(overallSentiment['index']))
plt.ylabel('tweets sentiment scores', fontsize=14)
plt.xlabel('tweet index', fontsize=14)
plt.legend(loc=9, bbox_to_anchor=(0.5, 1.2), ncol=3, prop={'size':12})
Out[10]:
<matplotlib.legend.Legend at 0x582eed0>

Change in sentiment for relevant event mentions of "whaling ban"

We extracted from the tweet datasets DS2 and DS3 a subset of the tweets that the crowd identified as relevant and that contain relevant event mentions of the "whaling ban". There is a strong positive sentiment about the decision to ban whaling in Japan. However, this quickly turns into a negative sentiment as soon as facts such as Japan's plans to continue whaling are published.

In [11]:
dfTweetsSentiment = pd.read_csv('aggregate/tweetsChangeInSentiment.csv')
dfTweetsSentiment = dfTweetsSentiment.reset_index(drop=True)
dfTweetsSentiment.reset_index(inplace=True)
dfTweetsSentiment
Out[11]:
    index  Relevant Mention                                    Positive Sentiment  Negative Sentiment  Neutral Sentiment  High Intensity  Low Intensity  Medium Intensity
0       0  Finds Way Around Ban                                              0.00                0.99               0.14            0.20           0.78              0.59
1       1  Japan changes its mind about Antarctic whaling...                 0.00                0.37               0.93            0.00           0.93              0.37
2       2  Japan finds way around ban on whaling                             0.08                0.78               0.45            0.19           0.77              0.49
3       3  Japan changes its mind about Antarctic whaling...                 0.15                0.88               0.44            0.14           0.95              0.27
4       4  whaling fleet leaves Japan since UN hunting ba...                 0.22                0.44               0.87            0.23           0.69              0.69
5       5  Japanese whaling fleet leaves port weeks after...                 0.27                0.80               0.53            0.00           0.71              0.71
6       6  International court delivers ban verdict                          0.45                0.00               0.89            0.00           0.98              0.20
7       7  Japan Accepted Court ban on Antarctic Whaling                     0.55                0.53               0.39            0.00           0.73              0.59
8       8  Japan accepts court ban                                           0.55                0.00               0.83            0.00           0.83              0.55
9       9  Antarctic whaling ban                                             0.63                0.22               0.70            0.04           0.87              0.40
10     10  ban on Antarctic whaling                                          0.70                0.17               0.70            0.00           0.99              0.12
11     11  Japan accepts court ban on Antarctic whaling                      0.72                0.20               0.61            0.26           0.88              0.26
12     12  UN hunting ban in Antarctic                                       0.78                0.59               0.20            0.18           0.91              0.37
13     13  Japan whaling ban                                                 0.79                0.37               0.40            0.08           0.75              0.49
14     14  Japan whaling ban welcomed                                        0.97                0.16               0.16            0.10           0.75              0.65

15 rows × 8 columns

In [12]:
plt.rcParams['figure.figsize'] = 15, 10
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
yLabels = dfTweetsSentiment['Relevant Mention'].tolist()
y_pos = np.arange(len(yLabels))
ax.barh(y_pos, dfTweetsSentiment['Positive Sentiment'], color='cornflowerblue',align='center',label='Positive Sentiment')
lefts = dfTweetsSentiment['Positive Sentiment']
ax.barh(y_pos, dfTweetsSentiment['Negative Sentiment'], color='indianred', left=lefts,align='center',label='Negative Sentiment')
lefts = lefts + dfTweetsSentiment['Negative Sentiment']
ax.set_yticks(y_pos)
ax.set_yticklabels(yLabels, fontsize=24)
ax.set_xlabel('Sentiment Scores', fontsize=24)
ax.set_xticklabels(ax.get_xticks(), fontsize=24)
plt.text(0.5, 1.18, "Change in sentiment for relevant event mentions of 'whaling ban'",
         horizontalalignment='center', fontsize=30, transform = ax.transAxes)
plt.legend(loc=9, bbox_to_anchor=(0.5, 1.14), ncol=2, prop={'size':24})
plt.show()