Harnessing diversity in crowds and machines for better NER performance

This repository contains the experimental results of identifying and typing named entities in English Wikipedia sentences. Even though current named entity recognition (NER) tools achieve nearly human-like performance on particular data types or domains, they are still highly dependent on the gold standard used for training and testing. The mainstream approach to gathering ground truth, or gold standard, data for training and evaluating NER tools still relies on experts, who are typically expensive and hard to find. Furthermore, for each new input type or new domain, new gold standards need to be created. Moreover, experts follow over-generalized annotation guidelines that are meant to increase the inter-annotator agreement between experts. Such guidelines are thus prone to denying the intrinsic ambiguity of language and the multitude of perspectives and interpretations. As a consequence, ground truth datasets might not always be 'gold' or 'true' in the sense of capturing the real diversity of text meaning and interpretation. In the last decade, crowdsourcing has also proven to be a suitable method for gathering such ground truth, but data ambiguity is still not handled.

In our work, by contrast, we focus on capturing the inter-annotator disagreement to provide a new type of ground truth, i.e., crowd truth, by applying the CrowdTruth metrics and methodology, in which language features are taken into consideration. All the crowdsourcing experiments were performed through the CrowdTruth platform, and the results were processed and analyzed using the CrowdTruth methodology and metrics. For more information, check the CrowdTruth website. The annotated data was gathered through the CrowdFlower marketplace.

We propose a novel approach for extracting and typing named entities in text, i.e., a hybrid multi-machine-crowd approach in which the output of state-of-the-art NER tools is combined and the aggregated output is then validated and improved through crowdsourcing. We report here the results of:

  1. Five state-of-the-art named entity recognition tools (SingleNER)
  2. The combined output of the five state-of-the-art named entity recognition tools (MultiNER)
  3. Crowdsourcing experiments for correcting and improving the MultiNER output, as well as for improving the expert-based gold standard (MultiNER+Crowd).

Check the Results & Download the Data: Crowdsourcing-Improved-NE-Gold-Standard

Experimental Data:

We performed named entity extraction with five state-of-the-art NER tools: NERD-ML, TextRazor, THD, DBpediaSpotlight, and SemiTags. We performed a comparative analysis of (1) the performance (output) of each individual tool and (2) their combined performance (output), on two ground truth (GT) evaluation datasets used during Task 1 of the Open Knowledge Extraction (OKE) semantic challenge at ESWC in 2015 (OKE2015) and 2016 (OKE2016), respectively. The datasets can be checked here:

  1. OKE2015: Open Knowledge Extraction 2015 (OKE2015) semantic challenge: https://github.com/anuzzolese/oke-challenge
  2. OKE2016: Open Knowledge Extraction 2016 (OKE2016) semantic challenge: https://github.com/anuzzolese/oke-challenge-2016

In summary, there are 156 Wikipedia sentences with 1007 annotated named entities of types place, person, organization and role, distributed across the datasets in the following way:

| Dataset | Sentences | Place | Person | Organization | Role | Total Named Entities |
|---------|-----------|-------|--------|--------------|------|----------------------|
| OKE2015 | 101       | 120   | 304    | 139          | 103  | 664                  |
| OKE2016 | 55        | 44    | 105    | 105          | 86   | 340                  |

Dataset Files:

|--/aggregate

Various aggregated datasets for analyzing the output of multiple state-of-the-art named entity recognition tools (SingleNER), their combined output (MultiNER) and crowdsourcing data for correcting and improving the MultiNER approach and the gold standard.

|--/aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv
|--/aggregate/OKE2016/OKE2016_SingleNER_and_MultiNER_eval.csv

These files contain the results of the five SOTA NER tools and of the MultiNER approach on the two aforementioned gold standard datasets. The files contain all the named entities in the gold standards together with all the other alternatives (overlapping expressions) that were extracted by any SOTA NER tool for each entity. The columns are listed below; a short usage sketch follows the list:

  • Identifier: sentence ID as referenced in the gold standard datasets
  • Sentence: sentence content as referenced in the gold standard datasets
  • NamedEntity: a potential named entity extracted by any of the five SOTA NER tools
  • StartOffset: start offset of the named entity
  • EndOffset: end offset of the named entity
  • GoldEntityType: the type of the named entity as provided in the gold standard
  • EntityScore: the likelihood of an entity to be in the gold standard based on how many NER tools extracted it. The score is equal to the ratio of NER tools that extracted the entity.
  • SingleNERCount: the number of SOTA NER tools that extracted the named entity
  • Gold: binary value describing whether the named entity is contained in the gold standard (1) or not (0)
  • MultiNER: binary value describing whether any of the NER tools extracted the named entity (1) or not (0)
  • NERD,TextRazor,SemiTags,THD,DBpediaSpotlight: binary value describing whether the given NER tool extracted the named entity (1) or not (0)
  • TP_MultiNER: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the MultiNER approach
  • TP_NERD,TP_TextRazor,TP_SemiTags,TP_THD,TP_DBpediaSpotlight: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the SingleNER approach
  • TN_MultiNER: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the MultiNER approach
  • TN_NERD,TN_TextRazor,TN_SemiTags,TN_THD,TN_DBpediaSpotlight: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the SingleNER approach
  • FP_MultiNER: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the MultiNER approach
  • FP_NERD,FP_TextRazor,FP_SemiTags,FP_THD,FP_DBpediaSpotlight: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the SingleNER approach
  • FN_MultiNER: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the MultiNER approach
  • FN_NERD,FN_TextRazor,FN_SemiTags,FN_THD,FN_DBpediaSpotlight: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the SingleNER approach
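
As a rough illustration of how these columns can be used, the following sketch loads the OKE2015 evaluation file and recomputes precision, recall and F1 for the MultiNER approach and for one individual tool from the binary TP/FP/FN columns (the path and column names are the ones documented above):

import pandas as pd

df = pd.read_csv("aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv")

def precision_recall_f1(frame, approach):
    # Sum the binary TP/FP/FN columns of the given approach (e.g. "MultiNER", "NERD").
    tp = frame["TP_" + approach].sum()
    fp = frame["FP_" + approach].sum()
    fn = frame["FN_" + approach].sum()
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print("MultiNER:", precision_recall_f1(df, "MultiNER"))
print("NERD:", precision_recall_f1(df, "NERD"))
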
|--/aggregate/OKE2015/OKE2015_MultiNER_and_Crowd_eval.csv
|--/aggregate/OKE2016/OKE2016_MultiNER_and_Crowd_eval.csv

These files contain the results of the five SOTA NER tools, the results of the MultiNER approach on the two gold standards, and the crowdsourcing results for every named entity in the gold standard that has multiple alternatives. The columns are listed below; a short usage sketch follows the list:

  • Identifier: sentence ID as referenced in the gold standard datasets
  • Sentence: sentence content as referenced in the gold standard datasets
  • NamedEntity: a potential named entity extracted by any of the five SOTA NER tools
  • StartOffset: start offset of the named entity
  • EndOffset: end offset of the named entity
  • GoldEntityType: the type of the named entity as provided in the gold standard
  • EntityScore: the likelihood of an entity to be in the gold standard based on how many NER tools extracted it. The score is equal to the ratio of NER tools that extracted the entity.
  • SingleNERCount: the number of SOTA NER tools that extracted the named entity
  • Gold: binary value describing whether the named entity is contained in the gold standard (1) or not (0)
  • CrowdGold: binary value describing whether the named entity is considered a valid named entity by the crowd (1) or not (0)
  • MultiNER: binary value describing whether any of the NER tools extracted the named entity (1) or not (0)
  • NERD,TextRazor,SemiTags,THD,DBpediaSpotlight: binary value describing whether the given NER tool extracted the named entity (1) or not (0)
  • TP_MultiNER: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the MultiNER approach
  • TP_NERD,TP_TextRazor,TP_SemiTags,TP_THD,TP_DBpediaSpotlight: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the SingleNER approach
  • TN_MultiNER: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the MultiNER approach
  • TN_NERD,TN_TextRazor,TN_SemiTags,TN_THD,TN_DBpediaSpotlight: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the SingleNER approach
  • FP_MultiNER: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the MultiNER approach
  • FP_NERD,FP_TextRazor,FP_SemiTags,FP_THD,FP_DBpediaSpotlight: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the SingleNER approach
  • FN_MultiNER: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the MultiNER approach
  • FN_NERD,FN_TextRazor,FN_SemiTags,FN_THD,FN_DBpediaSpotlight: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the SingleNER approach
  • MainAlternativeSpan: the largest span extracted by any NER tool that overlaps with a named entity in the gold standard
  • AlternativeStartOffset: start offset of the named entity alternative
  • AlternativeEndOffset: end offset of the named entity alternative
  • AlternativeCrowdScore: the likelihood of an entity to be in the gold standard based on the crowd assessment. The score is computed using the cosine similarity measure
  • RoleScore,PersonScore,OrganizationScore,PlaceScore,OtherScore: the likelihood of an entity to refer to the given type based on the crowd assessment. The score is computed using the cosine similarity measure
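
For illustration, a minimal sketch of how the crowd columns could be used to obtain a crowd-validated entity list and a most likely type per entity; the 0.5 threshold is an arbitrary example (the analysis below sweeps this threshold instead of fixing it):

import pandas as pd

df = pd.read_csv("aggregate/OKE2015/OKE2015_MultiNER_and_Crowd_eval.csv")

# Keep the rows the crowd considers valid named entities (example threshold of 0.5).
crowd_score = pd.to_numeric(df["AlternativeCrowdScore"], errors="coerce")
crowd_valid = df[crowd_score >= 0.5].copy()

# For each validated entity, pick the type with the highest crowd type score.
type_cols = ["RoleScore", "PersonScore", "OrganizationScore", "PlaceScore", "OtherScore"]
type_scores = crowd_valid[type_cols].apply(pd.to_numeric, errors="coerce")
crowd_valid["CrowdType"] = type_scores.idxmax(axis=1).str.replace("Score", "")

print(crowd_valid[["NamedEntity", "AlternativeCrowdScore", "CrowdType"]].head())
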
|--/input
|  |--/Valid Named Entity Expressions
|  |  |--/OKE2015
|  |  |--/OKE2016

The files contain the input for the crowdsourcing tasks for each dataset. An input unit is composed of a sentence and a set of expressions that refer to a named entity.
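
The exact layout can be inspected in the files themselves; purely as an illustration, an input unit can be thought of as a sentence together with its candidate expressions, e.g. (hypothetical content):

# Hypothetical input unit, for illustration only; see the files under /input for the actual format.
unit = {
    "sentence": "The company was founded in Amsterdam by a Dutch engineer.",
    "candidate_expressions": ["Amsterdam", "Dutch", "Dutch engineer", "engineer"],
}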

|--/raw
|  |--/Valid Named Entity Expressions
|  |  |--/OKE2015
|  |  |--/OKE2016

The raw data collected from the crowdsourcing tasks for each of the two datasets.

Crowdsourcing Experiments:

Overall, the aim of the crowdsourcing experiments is to:

  1. correct the mistakes of the NER tools
  2. identify the ambiguities in the ground truth and provide a better ground truth

Crowdsourcing Experimental Data

We select every entity in the ground truth for which the NER tools provided alternatives. We have the following two cases:

  • Crowd reduces the number of FP: for each named entity in the ground truth that has multiple alternatives (span alternatives), we create an entity cluster. We also add the largest span among all the alternatives to the cluster.
  • Crowd reduces the number of FN: for each named entity in the ground truth that was not extracted, we create an entity cluster that contains the FN named entity and the alternatives returned by the NER tools. Furthermore, we add every other combination of words contained in the alternatives (see the sketch after this list). This step is necessary because we do not want to introduce bias in the task, i.e., the crowd should see all the possibilities, not only the expected one.
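
A minimal sketch of one way to generate such word combinations, assuming they are the contiguous sub-spans of an alternative (the actual generation procedure in the experiments may differ):

def contiguous_subspans(expression):
    # All contiguous word sub-spans of an expression, e.g. for populating an entity cluster.
    words = expression.split()
    spans = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            spans.append(" ".join(words[start:end]))
    return spans

print(contiguous_subspans("chief executive officer"))
# ['chief', 'chief executive', 'chief executive officer', 'executive', 'executive officer', 'officer']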

Crowdsourcing Annotation Task

For the two cases described above, the goal of the crowdsourcing task is two-fold:

  • identification of the valid expressions in a list that refer to a phrase highlighted in yellow (Step 2 in the crowdsourcing template below)
  • selection of the type of each expression in the list, from a predefined set of choices - place, person, organization, role and other (Step 3 in the crowdsourcing template below).

The input of the crowdsourcing task consists of a sentence and a named entity for which multiple expressions were given by the five state-of-the-art NER tools.

Check the crowdsourcing templates below.

Fig.1 and Fig.2: CrowdTruth Workflow for Identifying Valid Named Entity Expressions and their Type (crowdsourcing task templates).
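
The crowd scores in the aggregated files (AlternativeCrowdScore and the per-type scores) are computed with the CrowdTruth cosine similarity measure. A minimal sketch of the general idea, assuming the worker judgments on a unit have already been aggregated into selection counts per candidate answer (an illustration of the metric, not the CrowdTruth implementation itself):

import numpy as np

def unit_annotation_score(unit_vector, annotation_index):
    # Cosine similarity between the aggregated unit vector (selection counts per candidate
    # answer) and the unit basis vector of one candidate answer.
    basis = np.zeros(len(unit_vector))
    basis[annotation_index] = 1.0
    return float(np.dot(unit_vector, basis) / (np.linalg.norm(unit_vector) * np.linalg.norm(basis)))

# Example: candidate expressions A, B and C were selected by 12, 9 and 1 workers, respectively.
counts = np.array([12.0, 9.0, 1.0])
print([round(unit_annotation_score(counts, i), 2) for i in range(len(counts))])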

Experimental Results:

SingleNER vs. MultiNER - entity surface

In order to compare the performance of individual state-of-the-art named entity recognition tools with our MultiNER approach, we analyze all the named entities in the gold standards and all their alternatives that were extracted by any individual state-of-the-art named entity recognition tool.

  • true positive (TP): the extracted named entity has the same surface form and the same offsets as a named entity in the gold standard;
  • false positive (FP): the extracted named entity only partially overlaps with a named entity in the gold standard;
  • false negative (FN): a named entity in the gold standard that was not extracted by any individual NER tool, nor by the Multi-NER approach.

OKE2015

| Tool | TP | FP | FN | Precision | Recall | F1-score |
|------|----|----|----|-----------|--------|----------|
| NERD | 401 | 93 | 263 | 0.812 | 0.604 | 0.693 |
| SemiTags | 366 | 37 | 298 | 0.908 | 0.551 | 0.686 |
| THD | 199 | 114 | 465 | 0.636 | 0.3 | 0.407 |
| DBpediaSpotlight | 411 | 234 | 253 | 0.637 | 0.619 | 0.628 |
| TextRazor | 431 | 177 | 232 | 0.709 | 0.65 | 0.678 |
| Multi-NER | 555 | 403 | 109 | 0.579 | 0.836 | 0.684 |

OKE2016

| Tool | TP | FP | FN | Precision | Recall | F1-score |
|------|----|----|----|-----------|--------|----------|
| NERD | 209 | 37 | 131 | 0.85 | 0.615 | 0.713 |
| SemiTags | 161 | 14 | 179 | 0.92 | 0.474 | 0.625 |
| THD | 122 | 73 | 218 | 0.626 | 0.359 | 0.456 |
| DBpediaSpotlight | 228 | 119 | 112 | 0.657 | 0.671 | 0.664 |
| TextRazor | 207 | 105 | 133 | 0.663 | 0.609 | 0.635 |
| Multi-NER | 299 | 218 | 41 | 0.578 | 0.879 | 0.698 |
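
As a quick check, the precision, recall and F1 values above follow directly from the TP/FP/FN counts; for instance, for NERD on OKE2015:

tp, fp, fn = 401, 93, 263
precision = tp / float(tp + fp)                      # 0.812
recall = tp / float(tp + fn)                         # 0.604
f1 = 2 * precision * recall / (precision + recall)   # 0.693
print(round(precision, 3), round(recall, 3), round(f1, 3))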

Annotation quality F1 per negative/positive threshold for each named entity score - Multi-NER approach.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
%matplotlib inline

singleNER_multiNER_2015 = pd.read_csv("aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv")
singleNER_multiNER_2016 = pd.read_csv("aggregate/OKE2016/OKE2016_SingleNER_and_MultiNER_eval.csv")

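# Sweep the MultiNER entity score threshold (0.2, 0.4, ..., 1.0, i.e. extracted by at least
# 1, 2, ..., 5 of the tools) and compute the F1 of the resulting extraction against the gold standard.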
multiNER_ann_quality_f1_2015 = np.zeros(shape=(6, 2))
for idx in range(1, 6):
    thresh = 2 * idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(singleNER_multiNER_2015)):
        if singleNER_multiNER_2015["EntityScore"][gt_idx] >= thresh:
            if singleNER_multiNER_2015["Gold"][gt_idx] == 1:
                tp = tp + 1.0
            else: 
                fp = fp + 1.0
        else:
            if singleNER_multiNER_2015["Gold"][gt_idx] == 1:
                fn = fn + 1.0
            else:
                tn = tn + 1.0
                
    multiNER_ann_quality_f1_2015[idx, 0] = thresh
    multiNER_ann_quality_f1_2015[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)

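# Majority vote baseline: an entity is accepted when at least 3 of the 5 tools extracted it
# (EntityScore >= 0.6).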
tp = 0
tn = 0
fp = 0
fn = 0
majority_vote_multiNER_ann_quality_f1_2015 = 0
for gt_idx in range(0, len(singleNER_multiNER_2015)):
    if singleNER_multiNER_2015["EntityScore"][gt_idx] >= 0.6:
        if singleNER_multiNER_2015["Gold"][gt_idx] == 1:
            tp = tp + 1.0
        else: 
            fp = fp + 1.0
    else:
        if singleNER_multiNER_2015["Gold"][gt_idx] == 1:
            fn = fn + 1.0
        else:
            tn = tn + 1.0
majority_vote_multiNER_ann_quality_f1_2015 = 2.0 * tp / (2.0 * tp + fp + fn)
NERD_f1_2015 = 0.693
TextRazor_f1_2015 = 0.678
DBpediaSpotlight_f1_2015 = 0.628
SemiTags_f1_2015 = 0.686
THD_f1_2015 = 0.407


multiNER_ann_quality_f1_2016 = np.zeros(shape=(6, 2))
for idx in range(1, 6):
    thresh = 2 * idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(singleNER_multiNER_2016)):
        if singleNER_multiNER_2016["EntityScore"][gt_idx] >= thresh:
            if singleNER_multiNER_2016["Gold"][gt_idx] == 1:
                tp = tp + 1.0
            else: 
                fp = fp + 1.0
        else:
            if singleNER_multiNER_2016["Gold"][gt_idx] == 1:
                fn = fn + 1.0
            else:
                tn = tn + 1.0
                
    multiNER_ann_quality_f1_2016[idx, 0] = thresh
    multiNER_ann_quality_f1_2016[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)

tp = 0
tn = 0
fp = 0
fn = 0
majority_vote_multiNER_ann_quality_f1_2016 = 0
for gt_idx in range(0, len(singleNER_multiNER_2016)):
    if singleNER_multiNER_2016["EntityScore"][gt_idx] >= 0.6:
        if singleNER_multiNER_2016["Gold"][gt_idx] == 1:
            tp = tp + 1.0
        else: 
            fp = fp + 1.0
    else:
        if singleNER_multiNER_2016["Gold"][gt_idx] == 1:
            fn = fn + 1.0
        else:
            tn = tn + 1.0
majority_vote_multiNER_ann_quality_f1_2016 = 2.0 * tp / (2.0 * tp + fp + fn)
NERD_f1_2016 = 0.713
TextRazor_f1_2016 = 0.635
DBpediaSpotlight_f1_2016 = 0.664
SemiTags_f1_2016 = 0.625
THD_f1_2016 = 0.456
In [2]:
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 2, 1)
plt.plot(multiNER_ann_quality_f1_2015[:,0], multiNER_ann_quality_f1_2015[:,1], color = 'g', lw = 2, label = "Multi-NER")
plt.xlim(0.2,1.0)
plt.title('Annotation quality F1 per negative/positive \n sentence-entity threshold for OKE2015', size=18)
plt.axhline(y = majority_vote_multiNER_ann_quality_f1_2015, color = 'cornflowerblue', lw = 2, label = "Majority Vote")
plt.axhline(y = NERD_f1_2015, color = 'darksalmon', lw = 2, label = "NERD-ML")
plt.axhline(y = TextRazor_f1_2015, color = 'black', lw = 2, label = "TextRazor")
plt.axhline(y = DBpediaSpotlight_f1_2015, color = 'orange', lw = 2, label = "DBpediaSpotlight")
plt.axhline(y = SemiTags_f1_2015, color = 'b', lw = 2, label = "SemiTags")
plt.axhline(y = THD_f1_2015, color = 'purple', lw = 2, label = "THD")

max_thresh = np.argmax(multiNER_ann_quality_f1_2015[:, 1])
plt.axvline(x = multiNER_ann_quality_f1_2015[max_thresh, 0], color = 'r', ls = ':')
plt.axhline(y = multiNER_ann_quality_f1_2015[max_thresh, 1], color = 'r', ls = ':')

plt.ylabel('annotation quality F1 score', fontsize=14)
plt.xlabel('neg/pos sentence-entity score', fontsize=14)
plt.legend(bbox_to_anchor=(0.45, 0.5), loc=2, borderaxespad=0.)


plt.subplot(1, 2, 2)
plt.plot(multiNER_ann_quality_f1_2016[:,0], multiNER_ann_quality_f1_2016[:,1], color = 'g', lw = 2, label = "Multi-NER")
plt.xlim(0.2,1.0)
plt.title('Annotation quality F1 per negative/positive \n sentence-entity threshold for OKE2016', size=18)
plt.axhline(y = majority_vote_multiNER_ann_quality_f1_2016, color = 'cornflowerblue', lw = 2, label = "Majority Vote")
plt.axhline(y = NERD_f1_2016, color = 'darksalmon', lw = 2, label = "NERD-ML")
plt.axhline(y = TextRazor_f1_2016, color = 'black', lw = 2, label = "TextRazor")
plt.axhline(y = DBpediaSpotlight_f1_2016, color = 'orange', lw = 2, label = "DBpediaSpotlight")
plt.axhline(y = SemiTags_f1_2016, color = 'b', lw = 2, label = "SemiTags")
plt.axhline(y = THD_f1_2016, color = 'purple', lw = 2, label = "THD")

# Mark the threshold with the highest Multi-NER F1 for OKE2016.
max_thresh = np.argmax(multiNER_ann_quality_f1_2016[:, 1])
plt.axvline(x = multiNER_ann_quality_f1_2016[max_thresh, 0], color = 'r', ls = ':')
plt.axhline(y = multiNER_ann_quality_f1_2016[max_thresh, 1], color = 'r', ls = ':')

plt.ylabel('annotation quality F1 score', fontsize=14)
plt.xlabel('neg/pos sentence-entity score', fontsize=14)
plt.legend(bbox_to_anchor=(0.45, 0.5), loc=2, borderaxespad=0.)

plt.show()

SingleNER vs. MultiNER - entity surface and entity type

In order to understand which entity types are the most problematic, we perform a similar analysis at the level of both entity surface and entity type.


OKE2015

| Tool | TP Place | TP People | TP Org | TP Role | TP Total | FP Place | FP People | FP Org | FP Role | FP Total | FN Place | FN People | FN Org | FN Role | FN Total |
|------|----------|-----------|--------|---------|----------|----------|-----------|--------|---------|----------|----------|-----------|--------|---------|----------|
| NERD | 90 | 142 | 106 | 65 | 403 | 22 | 21 | 42 | 17 | 102 | 30 | 162 | 33 | 38 | 263 |
| SemiTags | 100 | 168 | 100 | 0 | 368 | 16 | 2 | 19 | 2 | 39 | 20 | 136 | 39 | 103 | 298 |
| THD | 62 | 35 | 55 | 49 | 201 | 17 | 17 | 62 | 29 | 125 | 58 | 269 | 84 | 54 | 465 |
| DBpediaSpotlight | 99 | 156 | 81 | 77 | 413 | 26 | 62 | 124 | 26 | 238 | 21 | 148 | 58 | 26 | 253 |
| TextRazor | 110 | 174 | 109 | 40 | 434 | 31 | 14 | 118 | 24 | 187 | 9 | 130 | 30 | 63 | 232 |
| Multi-NER | 117 | 219 | 130 | 92 | 558 | 54 | 91 | 214 | 66 | 425 | 4 | 85 | 9 | 11 | 108 |

OKE2016

| Tool | TP Place | TP People | TP Org | TP Role | TP Total | FP Place | FP People | FP Org | FP Role | FP Total | FN Place | FN People | FN Org | FN Role | FN Total |
|------|----------|-----------|--------|---------|----------|----------|-----------|--------|---------|----------|----------|-----------|--------|---------|----------|
| NERD | 40 | 47 | 71 | 51 | 209 | 1 | 3 | 30 | 6 | 40 | 4 | 58 | 34 | 35 | 131 |
| SemiTags | 36 | 57 | 67 | 1 | 161 | 5 | 2 | 7 | 1 | 15 | 8 | 48 | 38 | 85 | 179 |
| THD | 36 | 12 | 33 | 41 | 122 | 3 | 1 | 55 | 14 | 73 | 8 | 93 | 72 | 45 | 218 |
| DBpediaSpotlight | 38 | 70 | 56 | 64 | 228 | 5 | 7 | 93 | 14 | 119 | 6 | 35 | 49 | 22 | 112 |
| TextRazor | 36 | 57 | 83 | 31 | 207 | 15 | 4 | 79 | 12 | 110 | 8 | 48 | 22 | 55 | 133 |
| Multi-NER | 44 | 78 | 100 | 77 | 299 | 21 | 13 | 157 | 34 | 225 | 0 | 27 | 5 | 9 | 41 |

Precision (P), recall (R) and F1-score per entity type:

| Dataset | Place P | Place R | Place F1 | People P | People R | People F1 | Organization P | Organization R | Organization F1 | Role P | Role R | Role F1 |
|---------|---------|---------|----------|----------|----------|-----------|----------------|----------------|-----------------|--------|--------|---------|
| OKE2015 | 0.69 | 0.98 | 0.81 | 0.70 | 0.72 | 0.71 | 0.38 | 0.94 | 0.54 | 0.59 | 0.89 | 0.71 |
| OKE2016 | 0.68 | 1.00 | 0.81 | 0.86 | 0.74 | 0.80 | 0.39 | 0.95 | 0.55 | 0.70 | 0.90 | 0.79 |
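
The per-type counts can be recomputed from the aggregated evaluation files, e.g. by grouping on the gold entity type; this is a sketch and assumes the GoldEntityType column is populated for all rows of interest:

import pandas as pd

df = pd.read_csv("aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv")

# TP/FP/FN counts of the Multi-NER approach, grouped by the gold standard entity type.
per_type = df.groupby("GoldEntityType")[["TP_MultiNER", "FP_MultiNER", "FN_MultiNER"]].sum()
print(per_type)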

Crowd Enhanced Multi-NER

We measure and plot here (1) the Multi-NER+Crowd performance (the Multi-NER output improved through the crowdsourcing task) compared to the Multi-NER approach; and (2) the Multi-NER+CrowdGT performance, in which the crowd-enhanced output is evaluated against a crowd-improved ground truth obtained through a manual evaluation of the crowd results.

In [3]:
multiNER_crowd_2015 = pd.read_csv("aggregate/OKE2015/OKE2015_MultiNER_and_Crowd_eval.csv")
multiNER_crowd_2016 = pd.read_csv("aggregate/OKE2016/OKE2016_MultiNER_and_Crowd_eval.csv")
multiNER_crowd_2015 = multiNER_crowd_2015.replace(np.nan,' ', regex=True)
multiNER_crowd_2016 = multiNER_crowd_2016.replace(np.nan,' ', regex=True)

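# Sweep the crowd score threshold (0.1, 0.2, ..., 0.9) and compute the F1 of the
# crowd-corrected MultiNER output (Multi-NER+Crowd) against the original gold standard (Gold).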
multiNER_crowd_ann_quality_f1_2015 = np.zeros(shape=(10, 2))
for idx in range(1, 10):
    thresh = idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(multiNER_crowd_2015)):
        if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] == "":
            if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] >= 0.2:
                if multiNER_crowd_2015["Gold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2015["Gold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
        else:
            if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] >= thresh:
                if multiNER_crowd_2015["Gold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2015["Gold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
                    
    multiNER_crowd_ann_quality_f1_2015[idx, 0] = thresh
    multiNER_crowd_ann_quality_f1_2015[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)

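# Same sweep, but evaluated against the crowd-improved gold standard (CrowdGold),
# i.e. the Multi-NER+CrowdGT curve.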
multiNER_crowdgt_ann_quality_f1_2015 = np.zeros(shape=(10, 2))
for idx in range(1, 10):
    thresh = idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(multiNER_crowd_2015)):
        if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] == "":
            if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] >= 0.2:
                if multiNER_crowd_2015["CrowdGold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2015["CrowdGold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
        else:
            if multiNER_crowd_2015["AlternativeCrowdScore"][gt_idx] >= thresh:
                if multiNER_crowd_2015["CrowdGold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2015["CrowdGold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
                    
    multiNER_crowdgt_ann_quality_f1_2015[idx, 0] = thresh
    multiNER_crowdgt_ann_quality_f1_2015[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)
multiNER_f1_score_2015 = 0.684

multiNER_crowd_ann_quality_f1_2016 = np.zeros(shape=(10, 2))
for idx in range(1, 10):
    thresh = idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(multiNER_crowd_2016)):
        if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] == "":
            if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] >= 0.2:
                if multiNER_crowd_2016["Gold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2016["Gold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
        else:
            if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] >= thresh:
                if multiNER_crowd_2016["Gold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2016["Gold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
                    
    multiNER_crowd_ann_quality_f1_2016[idx, 0] = thresh
    multiNER_crowd_ann_quality_f1_2016[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)

multiNER_crowdgt_ann_quality_f1_2016 = np.zeros(shape=(10, 2))
for idx in range(1, 10):
    thresh = idx / 10.0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for gt_idx in range(0, len(multiNER_crowd_2016)):
        if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] == "":
            if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] >= 0.2:
                if multiNER_crowd_2016["CrowdGold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2016["CrowdGold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
        else:
            if multiNER_crowd_2016["AlternativeCrowdScore"][gt_idx] >= thresh:
                if multiNER_crowd_2016["CrowdGold"][gt_idx] == 1:
                    tp = tp + 1.0
                else: 
                    fp = fp + 1.0
            else:
                if multiNER_crowd_2016["CrowdGold"][gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
                    
    multiNER_crowdgt_ann_quality_f1_2016[idx, 0] = thresh
    multiNER_crowdgt_ann_quality_f1_2016[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)
multiNER_f1_score_2016 = 0.698
In [4]:
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 2, 1)
plt.plot(multiNER_crowd_ann_quality_f1_2015[:,0], multiNER_crowd_ann_quality_f1_2015[:,1], color = 'lightgreen', lw = 2, label = "Multi-NER+Crowd")
plt.plot(multiNER_crowdgt_ann_quality_f1_2015[:,0], multiNER_crowdgt_ann_quality_f1_2015[:,1], color = 'darkgreen', lw = 2, label = "Multi-NER+CrowdGT")

plt.xlim(0.1,0.9)
plt.ylim(0.4,1.0)
plt.title('Annotation quality F1 per negative/positive \n crowd-sentence-entity threshold for OKE2015', size=18)
plt.axhline(y = multiNER_f1_score_2015, color = 'cornflowerblue', lw = 2, label = "Multi-NER")

max_thresh = np.argmax(multiNER_crowdgt_ann_quality_f1_2015[:, 1])
plt.axvline(x = multiNER_crowdgt_ann_quality_f1_2015[max_thresh, 0], color = 'r', ls = ':')
plt.axhline(y = multiNER_crowdgt_ann_quality_f1_2015[max_thresh, 1], color = 'r', ls = ':')

plt.ylabel('annotation quality F1 score', fontsize=14)
plt.xlabel('neg/pos sentence-entity score', fontsize=14)
plt.legend(bbox_to_anchor=(0.55, 0.25), loc=2, borderaxespad=0.)


plt.subplot(1, 2, 2)
plt.plot(multiNER_crowd_ann_quality_f1_2016[:,0], multiNER_crowd_ann_quality_f1_2016[:,1], color = 'lightgreen', lw = 2, label = "Multi-NER+Crowd")
plt.plot(multiNER_crowdgt_ann_quality_f1_2016[:,0], multiNER_crowdgt_ann_quality_f1_2016[:,1], color = 'darkgreen', lw = 2, label = "Multi-NER+CrowdGT")

plt.xlim(0.1,0.9)
plt.ylim(0.4,1.0)
plt.title('Annotation quality F1 per negative/positive \n crowd-sentence-entity threshold for OKE2016', size=18)
plt.axhline(y = multiNER_f1_score_2016, color = 'cornflowerblue', lw = 2, label = "Multi-NER")

max_thresh = np.argmax(multiNER_crowdgt_ann_quality_f1_2016[:, 1])
plt.axvline(x = multiNER_crowdgt_ann_quality_f1_2016[max_thresh, 0], color = 'r', ls = ':')
plt.axhline(y = multiNER_crowdgt_ann_quality_f1_2016[max_thresh, 1], color = 'r', ls = ':')

plt.ylabel('annotation quality F1 score', fontsize=14)
plt.xlabel('neg/pos sentence-entity score', fontsize=14)
plt.legend(bbox_to_anchor=(0.55, 0.25), loc=2, borderaxespad=0.)

plt.show()

Crowd Performance on Identifying Named Entity Types

In [5]:
crowd_entity_type_eval = pd.read_csv("aggregate/OKE2015_OKE2016_Crowd_EntityType_Eval.csv")
crowd_entity_type_eval
Out[5]:
Crowd-EntityType-Score MicroF1_OKE2015 MacroF1_OKE2015 MicroF1_OKE2016 MacroF1_OKE2016
0 0.1 0.728 0.754 0.690 0.657
1 0.2 0.758 0.793 0.734 0.706
2 0.3 0.775 0.811 0.773 0.748
3 0.4 0.797 0.826 0.796 0.763
4 0.5 0.861 0.869 0.848 0.802
5 0.6 0.904 0.906 0.887 0.848
6 0.7 0.937 0.938 0.888 0.859
7 0.8 0.892 0.900 0.879 0.844
8 0.9 0.738 0.767 0.715 0.759

9 rows × 5 columns
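
For reference, micro F1 aggregates the confusion counts over all entities, while macro F1 averages the per-type F1 scores. A minimal sketch of the two measures using scikit-learn (the label lists here are hypothetical):

from sklearn.metrics import f1_score

# Hypothetical gold and crowd-assigned entity types, for illustration only.
gold_types = ["Person", "Place", "Organization", "Role", "Person", "Place"]
crowd_types = ["Person", "Place", "Organization", "Role", "Organization", "Place"]

print("micro F1:", f1_score(gold_types, crowd_types, average="micro"))
print("macro F1:", f1_score(gold_types, crowd_types, average="macro"))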

In [6]:
plt.rcParams['figure.figsize'] = 10, 5
plt.plot([x / 10.0 for x in range(1, 10)], crowd_entity_type_eval["MicroF1_OKE2015"][range(0,9)],
         color = 'r', lw = 2, label = "MicroF1_OKE2015")
plt.plot([x / 10.0 for x in range(1, 10)], crowd_entity_type_eval["MacroF1_OKE2015"][range(0,9)],
         color = 'orange', lw = 2, label = "MacroF1_OKE2015")
plt.plot([x / 10.0 for x in range(1, 10)], crowd_entity_type_eval["MicroF1_OKE2016"][range(0,9)],
         color = 'lightgreen', lw = 2, label = "MicroF1_OKE2016")
plt.plot([x / 10.0 for x in range(1, 10)], crowd_entity_type_eval["MacroF1_OKE2016"][range(0,9)],
         color = 'darkgreen', lw = 2, label = "MacroF1_OKE2016")
plt.title("Annotation quality F1 per neg/pos crowd entity-type threshold", fontsize=18)
plt.xlim(0.1, 0.9)
plt.ylim(0.5, 1)

plt.ylabel('crowd annotation quality F1 score', fontsize=14)
plt.xlabel('neg/pos crowd entity types score', fontsize=14)
plt.legend(bbox_to_anchor=(0.65, 0.4), loc=2, borderaxespad=0.)
Out[6]:
<matplotlib.legend.Legend at 0x429d810>