from nltk.tokenize import RegexpTokenizer
While you can tokenize sentences into words using the
TreebankTokenizer, NLTK also provides the
RegexpTokenizer to give you more flexibility in tokenizing sentences. You could do the same thing with the built-in
re module, but the interface of the
RegexpTokenizer matches the other tokenizers, so you can use them interchangeably depending on what you need.
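For comparison, here's a rough equivalent using only the built-in re module (the pattern and sample sentence mirror the tokenizer examples in this recipe; this is a sketch of the idea, not NLTK's actual implementation):

```python
import re

# re.findall returns every non-overlapping match of the pattern,
# which behaves like tokenizing on the matches themselves.
source = "I'm a man of wealth and taste, and touch, and smell."
tokens = re.findall(r"[\w']+", source)
print(tokens)
```

The difference is purely one of interface: re.findall gives you the same list, but a RegexpTokenizer instance can be dropped in anywhere NLTK expects a tokenizer.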
By default, the
RegexpTokenizer matches the tokens themselves and splits on anything that doesn't match the given expression, treating the non-matching text as the gaps between tokens. Here's how to match any alphanumeric characters and apostrophes.
tokenizer = RegexpTokenizer(r"[\w']+")
source = "I'm a man of wealth and taste, and touch, and smell."
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell']
Note that by adding the apostrophe ("'") to the expression we were able to keep the contraction together, something that the
TreebankTokenizer doesn't do.
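To see the difference, here's what the TreebankWordTokenizer does with the same contraction (the sample sentence is just an illustration):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Treebank rules split contractions into separate tokens,
# so "I'm" comes out as "I" and "'m".
tokens = tokenizer.tokenize("I'm a man of wealth and taste.")
print(tokens)
```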
Alternatively, you can define what the tokenizer should split on. There are already tokenizers for whitespace, but you could also add punctuation this way.
tokenizer = RegexpTokenizer(r"[\s,]+", gaps=True)
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell.']
The only difference here is that the period was kept, though if you run a sentence tokenizer first, trailing sentence-ending punctuation is less of a concern anyway.
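The gaps behavior maps directly onto re.split in the standard library; a minimal sketch using the same pattern:

```python
import re

source = "I'm a man of wealth and taste, and touch, and smell."
# Splitting on runs of whitespace and commas leaves the final
# period attached, just like the gaps=True tokenizer does.
parts = re.split(r"[\s,]+", source)
print(parts)
```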