Sketch Engine KB - Why are some words in the corpus tagged incorrectly?

Part-of-speech (POS) tagging is performed automatically. It would take many years and a large team of people to tag or check multi-billion-word corpora manually.

The tagging tool (tagger) is highly accurate but not perfect. Accuracy is typically around 95%, and the English tagger exceeds 95%. At 95% accuracy, about one token in 20 is, on average, tagged incorrectly.
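The arithmetic behind the "every 20th token" figure can be sketched as follows. This is only an illustration: the function names are hypothetical, and the one-billion-token corpus size is an assumed example value, not a figure for any particular Sketch Engine corpus.

```python
def expected_errors(token_count: int, accuracy: float) -> float:
    """Expected number of mis-tagged tokens at a given per-token accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return token_count * (1.0 - accuracy)


def tokens_per_error(accuracy: float) -> float:
    """Average number of tokens per tagging error (undefined at 100% accuracy)."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be at least 0 and below 1")
    return 1.0 / (1.0 - accuracy)


# Assumed example values: 95% accuracy, a one-billion-token corpus.
print(f"one error per {tokens_per_error(0.95):.0f} tokens on average")
print(f"about {expected_errors(1_000_000_000, 0.95):,.0f} mis-tagged tokens")
```

These are averages over the whole corpus. They say nothing about where in the corpus the errors fall.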

The tagging errors are not distributed evenly across the corpus. They are concentrated in situations where words are used in a highly non-standard context, without context, or with very little context. For example, 'Help!' may sometimes be tagged as a noun and sometimes as a verb.

A newspaper article will have very few tagging errors. A tweet, on the other hand, will have more because it will probably contain incomplete sentences, very little context, and non-standard variants of words.

It is important to understand that automatic taggers learn from people. Taggers are trained on manually tagged text. This can also be a source of problems. Situations where people cannot agree on the correct POS will be replicated by the tagger during automatic tagging.

Therefore, 'go to work' may be tagged as verb + preposition + noun or as verb + infinitive, especially in situations where even humans cannot agree on the correct tag.

Similarly, 'people like that' might be tagged as noun + preposition + pronoun or as noun + verb + pronoun.
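How disagreement in manually tagged text carries over into automatic tagging can be illustrated with a deliberately simplified toy tagger. It is not the tagger used in Sketch Engine, which is far more sophisticated. The training sentences, tag names, and function names below are all hypothetical. The toy tagger simply assigns each word the tag it received most often in the hand-tagged data.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged data: annotators disagreed on 'like' in the same phrase.
TRAINING = [
    [("people", "noun"), ("like", "preposition"), ("that", "pronoun")],
    [("people", "noun"), ("like", "verb"), ("that", "pronoun")],
    [("people", "noun"), ("like", "verb"), ("music", "noun")],
]


def train(sentences):
    """Count how often each word received each tag in the hand-tagged data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_words(model, words, default="noun"):
    """Assign each word its most frequent training tag; unseen words get `default`."""
    tagged = []
    for word in words:
        tag_counts = model.get(word.lower())
        best = tag_counts.most_common(1)[0][0] if tag_counts else default
        tagged.append((word, best))
    return tagged


model = train(TRAINING)
print(dict(model["like"]))  # {'preposition': 1, 'verb': 2}
print(tag_words(model, ["people", "like", "that"]))
```

In this sketch the split between 'preposition' and 'verb' for 'like' comes straight from the annotators. The tagger has no way to resolve it and simply follows the majority tag in its training data.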