Tokenizing Words With Regular Expressions

from nltk.tokenize import RegexpTokenizer

Introduction

While you can tokenize sentences into words using the TreebankWordTokenizer, NLTK also provides the RegexpTokenizer to give you more flexibility in tokenizing sentences. You could do the same thing with the built-in re module, but the interface to the RegexpTokenizer matches the other tokenizers, so you can use them interchangeably depending on what you need.
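For comparison, here is a minimal sketch of the built-in re approach. It assumes that re.findall with a "match the words" pattern is a fair stand-in for what a pattern-based tokenizer does; the sample sentence and variable names are only for illustration.

```python
import re

source = "I'm a man of wealth and taste, and touch, and smell."

# Collect every run of word characters and apostrophes;
# whatever falls between the matches is thrown away.
tokens = re.findall(r"[\w']+", source)
print(tokens)
```

This works, but it returns a plain list from a one-off call rather than giving you a tokenizer object with a tokenize method that can be swapped for another tokenizer.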

Word Tokenization

By default, the RegexpTokenizer treats the expression as a description of the tokens: it returns every match and discards whatever falls between the matches as gaps. Here's how to match any alphanumeric characters (plus underscores, which \w also matches) and apostrophes.

# use a raw string so that Python doesn't treat \w as an invalid string escape
tokenizer = RegexpTokenizer(r"[\w']+")
source = "I'm a man of wealth and taste, and touch, and smell."
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell']

Note that by adding the apostrophe ("'") to the expression we were able to keep the contraction together, something that the TreebankWordTokenizer doesn't do.

Gap Tokenization

Alternatively, you can define what the tokenizer should split on. There are already tokenizers for whitespace, but you could also add punctuation this way.

# gaps=True means the pattern describes the separators, not the tokens
tokenizer = RegexpTokenizer(r"[\s,]+", gaps=True)
print(tokenizer.tokenize(source))
["I'm", 'a', 'man', 'of', 'wealth', 'and', 'taste', 'and', 'touch', 'and', 'smell.']

The only difference here is that the period stays attached to the last word, because the pattern only treats whitespace and commas as gaps.
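The gap approach can also be sketched with the standard library. This assumes that re.split plus a filter for empty strings behaves like gaps=True for this input; split_on_gaps is a hypothetical helper, not an NLTK function. The second call adds the period to the gap pattern to show one way to get rid of it.

```python
import re

source = "I'm a man of wealth and taste, and touch, and smell."


def split_on_gaps(pattern, text):
    """Split the text on the pattern and drop any empty strings."""
    return [token for token in re.split(pattern, text) if token]


# Whitespace and commas are gaps, so the period stays on the last word.
with_period = split_on_gaps(r"[\s,]+", source)
print(with_period)

# Treating the period as a gap as well removes it from the last token.
without_period = split_on_gaps(r"[\s,.]+", source)
print(without_period)
```

The filter matters in the second call: once the trailing period counts as a gap, re.split leaves an empty string at the end of the list.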

Sentences to Words

# from pypi
from nltk.tokenize import (
    TreebankWordTokenizer,
    )

Introduction

We're going to look at an NLTK tokenizer that takes a sentence and breaks it up into words. In the simplest case you could just split the sentence on whitespace, but the presence of punctuation makes the job a little harder.
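To make the punctuation problem concrete, here is a small sketch of the whitespace-only approach using plain str.split (no NLTK involved).

```python
source = "I'm a man who can't say no."

# With no arguments, split() breaks on any run of whitespace.
tokens = source.split()
print(tokens)
```

The last token comes back as 'no.' with the period still attached, so it won't compare equal to the word 'no'.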

TreebankWordTokenizer

The TreebankWordTokenizer follows the tokenization conventions of the Penn Treebank, a corpus built largely from Wall Street Journal articles.

source = "I'm a man who can't say no."
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(source))
['I', "'m", 'a', 'man', 'who', 'ca', "n't", 'say', 'no', '.']

Looking at the output, you can see that contractions are broken up. At first this seemed odd to me, but when you realize that contractions are made up of multiple words it makes sense, although the actual output seems like it would be hard to use (how would you know that 'I' "'m" means 'I' 'am'?).
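One possible answer to that question is a lookup table that maps the split pieces back to full words. The sketch below is only an assumption about how you might do it: EXPANSIONS, STEMS and expand are hypothetical names, nothing here comes from NLTK, and the tables are far from complete.

```python
# Hypothetical lookup tables, just large enough for this example.
EXPANSIONS = {"'m": "am", "n't": "not", "'re": "are", "'ll": "will", "'ve": "have"}
# Stems that are left truncated when "n't" is split off.
STEMS = {"ca": "can"}


def expand(tokens):
    """Replace contraction fragments with full words where the tables allow."""
    expanded = []
    for index, token in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else None
        if following == "n't" and token in STEMS:
            # 'ca' only means 'can' when it is followed by "n't"
            expanded.append(STEMS[token])
        else:
            expanded.append(EXPANSIONS.get(token, token))
    return expanded


tokens = ['I', "'m", 'a', 'man', 'who', 'ca', "n't", 'say', 'no', '.']
expanded = expand(tokens)
print(expanded)
```

A table like this can't handle ambiguous fragments such as "'s", which might stand for 'is', 'has', or a possessive, so it is a starting point at best.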

Tokenizing Text To Sentences With NLTK

# python standard library
from http import HTTPStatus
# pypi
import requests
import nltk

Introduction

This is going to be a quick look at splitting a text source into sentences.

The Source

I'm going to use Siddhartha, by Hermann Hesse, which is available from Project Gutenberg.

url = "https://www.gutenberg.org/cache/epub/2500/pg2500.txt"
response = requests.get(url)
assert response.status_code == HTTPStatus.OK
source = response.text
print(source[:200])
The Project Gutenberg EBook of Siddhartha, by Herman Hesse

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
r

The Punkt Sentence Tokenizer

The NLTK PunktSentenceTokenizer uses an unsupervised model to break texts up into sentences. Since the model has to be trained before it can be used, it's easier to load a pre-trained one from a pickle file. In this case we want the model trained on English, so I'll load english.pickle.

tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")

Now you can break Siddhartha up into sentences using the tokenizer.

sentences = tokenizer.tokenize(response.text)
print(len(sentences))
1972

for index, sentence in enumerate(sentences[1:5]):
    print("   {}: {}".format(index, sentence.replace("\r", "")))
   0: You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Siddhartha

Author: Herman Hesse

Translator: Gunther Olesch, Anke Dreher, Amy Coulter, Stefan Langer and Semyon Chaichenets

Release Date: April 6, 2008 [EBook #2500]
Last updated: July 2, 2011
Last updated: January 23, 2013

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA ***




Produced by Michael Pullen,  Chandra Yenco, Isaac Jones





SIDDHARTHA

An Indian Tale

by Hermann Hesse





FIRST PART

To Romain Rolland, my dear friend




THE SON OF THE BRAHMAN

In the shade of the house, in the sunshine of the riverbank near the
boats, in the shade of the Sal-wood forest, in the shade of the fig tree
is where Siddhartha grew up, the handsome son of the Brahman, the young
falcon, together with his friend Govinda, son of a Brahman.
   1: The sun
tanned his light shoulders by the banks of the river when bathing,
performing the sacred ablutions, the sacred offerings.
   2: In the mango
grove, shade poured into his black eyes, when playing as a boy, when
his mother sang, when the sacred offerings were made, when his father,
the scholar, taught him, when the wise men talked.
   3: For a long time,
Siddhartha had been partaking in the discussions of the wise men,
practising debate with Govinda, practising with Govinda the art of
reflection, the service of meditation.

So you can see that it didn't get the preamble quite right. It lumped everything from the license notice through the first sentence of the story into a single sentence, but after that it seemed to do okay. It was probably looking for sentence-ending punctuation such as periods, which the title and metadata lines don't have. This is important to note, because it means that it might not work as well for things like logs or other sources where the input isn't made up of complete, well-formed sentences.
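One way to reduce the problem is to cut the license header off before tokenizing. The sketch below assumes the "*** START OF" marker line seen in the output above is a reliable place to cut; strip_preamble and START_MARKER are hypothetical names, and the sample text is a shortened stand-in for the downloaded file.

```python
START_MARKER = "*** START OF"


def strip_preamble(text, marker=START_MARKER):
    """Return the text after the marker line, or the text unchanged if there is no marker."""
    position = text.find(marker)
    if position == -1:
        return text
    end_of_line = text.find("\n", position)
    if end_of_line == -1:
        # The marker is on the last line, so nothing follows it.
        return ""
    return text[end_of_line + 1:]


sample = (
    "Title: Siddhartha\n"
    "\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA ***\n"
    "\n"
    "SIDDHARTHA\n"
)
body = strip_preamble(sample)
print(repr(body))
```

This only removes the header. The title and chapter headings that follow the marker still have no periods, so they would probably still be merged into the first real sentence.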

References

  • “Natural Language Toolkit — NLTK 3.3 Documentation.” Accessed July 30, 2018. http://www.nltk.org/.
  • Perkins, Jacob. Python 3 Text Processing with NLTK 3 Cookbook: Over 80 Practical Recipes on Natural Language Processing Techniques Using Python’s NLTK 3.0. 2. ed. Packt Open Source. Birmingham: Packt Publ, 2014.