Tokenizing Text To Sentences With NLTK
# python standard library
from http import HTTPStatus
# pypi
import requests
import nltk
Introduction
This is going to be a quick look at splitting a text source into sentences.
The Source
I'm going to use Siddhartha, by Hermann Hesse, which is available from Project Gutenberg.
url = "https://www.gutenberg.org/cache/epub/2500/pg2500.txt"
response = requests.get(url)
assert response.status_code == HTTPStatus.OK
source = response.text
print(source[:200])
The Project Gutenberg EBook of Siddhartha, by Herman Hesse This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or r
The Punkt Sentence Tokenizer
The NLTK PunktSentenceTokenizer uses an unsupervised model to break texts up into sentences. Because the model has to be trained before it can be used, it's easier to load a pre-trained one from a pickle file. In this case we want the model trained on the English language, so I'll load english.pickle.
tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")
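To give a feel for what "training" means here, this is a minimal sketch of building a Punkt tokenizer from raw text instead of loading the pickle. The training text is made up for illustration and is far too small to produce a useful model; the names `training_text`, `custom_tokenizer`, and `sample` are my own and not part of NLTK.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# A tiny, made-up training text (an assumption for illustration only).
# Punkt is unsupervised, so the text needs no sentence labels, but a
# real model needs much more text than this to learn abbreviations.
training_text = (
    "The river flowed past the village. The ferryman watched it every day. "
    "He listened to the water. It taught him patience. "
    "Travellers came and went. None of them stayed for long."
)

# Passing text to the constructor trains the tokenizer on it.
custom_tokenizer = PunktSentenceTokenizer(training_text)

sample = "The sun rose over the forest. The two friends set out at once."
for sentence in custom_tokenizer.tokenize(sample):
    print(sentence)
```

The rest of this post sticks with the pre-trained English model loaded above.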
Now you can break Siddhartha up into sentences using the tokenizer.
sentences = tokenizer.tokenize(response.text)
print(len(sentences))
1972
for index, sentence in enumerate(sentences[1:5]):
    print(" {}: {}".format(index, sentence.replace("\r", "")))
0: You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: Siddhartha Author: Herman Hesse Translator: Gunther Olesch, Anke Dreher, Amy Coulter, Stefan Langer and Semyon Chaichenets Release Date: April 6, 2008 [EBook #2500] Last updated: July 2, 2011 Last updated: January 23, 2013 Language: English *** START OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA *** Produced by Michael Pullen, Chandra Yenco, Isaac Jones SIDDHARTHA An Indian Tale by Hermann Hesse FIRST PART To Romain Rolland, my dear friend THE SON OF THE BRAHMAN In the shade of the house, in the sunshine of the riverbank near the boats, in the shade of the Sal-wood forest, in the shade of the fig tree is where Siddhartha grew up, the handsome son of the Brahman, the young falcon, together with his friend Govinda, son of a Brahman.
1: The sun tanned his light shoulders by the banks of the river when bathing, performing the sacred ablutions, the sacred offerings.
2: In the mango grove, shade poured into his black eyes, when playing as a boy, when his mother sang, when the sacred offerings were made, when his father, the scholar, taught him, when the wise men talked.
3: For a long time, Siddhartha had been partaking in the discussions of the wise men, practising debate with Govinda, practising with Govinda the art of reflection, the service of meditation.
So you can see that it didn't get the preamble quite right. It treated everything up to the end of the first sentence of the text as one sentence, but after that it seemed to do okay. Punkt only considers a sentence break after punctuation such as a period, question mark, or exclamation mark, and the header lines don't end with any of these. This is important to note, because it means that it might not work as well for things like logs or other sources where the input isn't made up of complete, well-formed sentences.
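One way to keep the license header out of the first sentence is to cut everything up to the START marker that shows up in the output above before tokenizing. This is a minimal sketch, assuming the marker line always begins with `*** START OF`; the helper name `strip_preamble` is hypothetical and not part of NLTK.

```python
def strip_preamble(text: str, marker: str = "*** START OF") -> str:
    """Return the text that follows the line containing ``marker``.

    If the marker is not found, the text is returned unchanged.
    """
    start = text.find(marker)
    if start == -1:
        return text
    # Skip to the end of the marker line.
    line_end = text.find("\n", start)
    if line_end == -1:
        return ""
    # Drop the blank lines that follow the marker.
    return text[line_end + 1:].lstrip()


# A shortened stand-in for the downloaded text (assumed layout).
sample = (
    "The Project Gutenberg EBook of Siddhartha, by Herman Hesse\r\n"
    "Language: English\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA ***\r\n"
    "\r\n"
    "SIDDHARTHA\r\n"
)
body = strip_preamble(sample)
print(repr(body))
```

With the real text you would call `tokenizer.tokenize(strip_preamble(source))`. The credits, title, and chapter headings that follow the marker still have no closing punctuation, so they would likely remain attached to the first sentence.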
References
- “Natural Language Toolkit — NLTK 3.3 Documentation.” Accessed July 30, 2018. http://www.nltk.org/.
- Perkins, Jacob. Python 3 Text Processing with NLTK 3 Cookbook: Over 80 Practical Recipes on Natural Language Processing Techniques Using Python’s NLTK 3.0. 2. ed. Packt Open Source. Birmingham: Packt Publ, 2014.