from nltk.tokenize import RegexpTokenizer
While you can tokenize sentences into words using the
TreebankTokenizer, NLTK also provides the
RegexpTokenizer to give you more flexibility in tokenizing sentences. You could do the same thing with the built-in
re module, but the interface of the
RegexpTokenizer matches the other tokenizers, so you can use them interchangeably depending on what you need.
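For comparison, here's a rough equivalent using only the built-in re module (the pattern and sample sentence mirror the tokenizer examples in this recipe; this is a sketch of the idea, not NLTK's actual implementation):

```python
import re

# re.findall returns every non-overlapping match of the pattern,
# which behaves like tokenizing on the matches themselves.
source = "I'm a man of wealth and taste, and touch, and smell."
tokens = re.findall(r"[\w']+", source)
print(tokens)
```

The difference is purely one of interface: re.findall gives you the same list, but a RegexpTokenizer instance can be dropped in anywhere NLTK expects a tokenizer.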
By default, the
RegexpTokenizer matches the tokens themselves and splits on anything that doesn't match the given expression, treating the non-matching text as the gaps between tokens. Here's how to match any alphanumeric characters and apostrophes.
tokenizer = RegexpTokenizer(r"[\w']+")
source = "I'm a man of wealth and taste, and touch, and smell."
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell']
Note that by adding the apostrophe ("'") to the expression we were able to keep the contraction together, something that the
TreebankTokenizer doesn't do.
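To see the difference, here's what the TreebankWordTokenizer does with the same contraction (the sample sentence is just an illustration):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Treebank rules split contractions into separate tokens,
# so "I'm" comes out as "I" and "'m".
tokens = tokenizer.tokenize("I'm a man of wealth and taste.")
print(tokens)
```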
Alternatively, you can define what the tokenizer should split on. There are already tokenizers for whitespace, but you could also add punctuation this way.
tokenizer = RegexpTokenizer(r"[\s,]+", gaps=True)
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell.']
The only difference here is that the period was kept, though if you run a sentence tokenizer first, trailing sentence-ending punctuation is less of a concern anyway.
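The gaps behavior maps directly onto re.split in the standard library; a minimal sketch using the same pattern:

```python
import re

source = "I'm a man of wealth and taste, and touch, and smell."
# Splitting on runs of whitespace and commas leaves the final
# period attached, just like the gaps=True tokenizer does.
parts = re.split(r"[\s,]+", source)
print(parts)
```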