Why is the number of words in my corpus different or wrong?

If you have also used your corpus in another tool such as AntConc, LancsBox or WordSmith Tools, you may notice that Sketch Engine (and possibly each of the other tools) reports a different number of words for the same corpus.
Different systems may tokenize text differently, i.e. they apply different rules for deciding what counts as a token, the smallest unit a corpus is divided into. See: https://www.sketchengine.eu/my_keywords/token/
In addition to tokens, Sketch Engine also gives the total number of words. Each system may define a word differently. Sketch Engine defines a word as a token which starts with an alphabetic character. Tokens starting with digits or punctuation are not words. See: https://www.sketchengine.eu/my_keywords/word/
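The word definition above can be illustrated with a minimal Python sketch. It is not Sketch Engine's actual code: the function name is_word, the sample token list and the use of Python's str.isalpha() to stand in for "alphabetic character" are all assumptions made for illustration.

```python
def is_word(token: str) -> bool:
    """Return True if the token starts with an alphabetic character.

    Assumption: str.isalpha() is used as an approximation of
    "alphabetic character"; a real tool may apply different rules.
    """
    return bool(token) and token[0].isalpha()


# Hypothetical output of some tokenizer for "The 3 cats slept until 10 a.m.!"
tokens = ["The", "3", "cats", "slept", "until", "10", "a.m.", "!"]

token_count = len(tokens)
word_count = sum(1 for t in tokens if is_word(t))

print(f"tokens: {token_count}, words: {word_count}")
```

In this sketch the tokens "3", "10" and "!" are excluded from the word count because they start with a digit or punctuation, so the word count is lower than the token count.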

Examples of English text that can be tokenized as one, two or more tokens:
  • contractions (isn't, didn't…)
  • possessives (John's, children's…)
  • sequences of punctuation such as an ellipsis (…), repeated question marks (???), repeated exclamation marks (!!) or mixed punctuation (!")
Similar differences can be found in other languages too.
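
To show how much the counts can diverge, here is a small Python sketch that runs three hypothetical tokenizers over the same sentence. None of them reproduces the rules of Sketch Engine, AntConc, LancsBox or WordSmith Tools; the regular expressions and function names are assumptions chosen only to demonstrate the effect.

```python
import re

text = "John's dog isn't here!!"


def tokenize_whitespace(s: str) -> list[str]:
    # Split on whitespace only; punctuation stays attached to words.
    return s.split()


def tokenize_keep_contractions(s: str) -> list[str]:
    # Keep contractions and possessives whole; group runs of punctuation.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]+", s)


def tokenize_split_everything(s: str) -> list[str]:
    # Every run of letters/digits and every punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", s)


for tokenizer in (tokenize_whitespace,
                  tokenize_keep_contractions,
                  tokenize_split_everything):
    tokens = tokenizer(text)
    print(f"{tokenizer.__name__}: {len(tokens)} tokens -> {tokens}")
```

The same four-word sentence yields 4, 5 and 10 tokens respectively, which is why totals reported by different tools are rarely directly comparable.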