Tokenizing Words With Regular Expressions

from nltk.tokenize import RegexpTokenizer

Introduction

While you can tokenize sentences into words using the TreebankTokenizer, NLTK also provides the RegexpTokenizer to give you more flexibility in tokenizing sentences. You could do the same thing with the built-in re module, but the interface of the RegexpTokenizer matches the other tokenizers, so you can use them interchangeably depending on what you need.
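For a sense of what that means, here's a rough sketch of the same idea using only the standard-library re module; the pattern and sample sentence are just placeholders.

import re

# Roughly equivalent to RegexpTokenizer's default behavior:
# return every non-overlapping match of the pattern, in order.
pattern = r"[\w']+"
text = "I'm a man of wealth and taste."
print(re.findall(pattern, text))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste']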

Word Tokenization

By default, the RegexpTokenizer matches tokens with the expression you give it and discards everything in between, treating the non-matching text as the gaps. Here's how to match runs of word characters (letters, digits, and underscores) and apostrophes.

tokenizer = RegexpTokenizer(r"[\w']+")
source = "I'm a man of wealth and taste, and touch, and smell."
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell']

Note that by adding the apostrophe ("'") to the expression, we were able to keep the contraction together, which the TreebankTokenizer doesn't do.
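For comparison, here's what the TreebankWordTokenizer does with the same sentence. The exact output can vary slightly between NLTK versions, but the contraction gets split apart.

from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()
print(treebank.tokenize(source))
['I', "'m", 'a', 'man', 'of', 'wealth', 'and', 'taste', ',', 'and', 'touch', ',', 'and', 'smell', '.']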

Gap Tokenization

Alternatively, you can define what the tokenizer should split on. There are already tokenizers for whitespace, but you can also add punctuation this way.

tokenizer = RegexpTokenizer(r"[\s,]+", gaps=True)
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell.']

The only difference here is that the trailing period was kept, but if you use a sentence tokenizer first this would likely not be a problem anyway.
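If you do want to drop the period in this setup, one option is to add it to the gap expression. This is just a sketch; a pattern like this would also split abbreviations such as "e.g." apart.

tokenizer = RegexpTokenizer(r"[\s,.]+", gaps=True)
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell']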