Python 3 Text Processing with NLTK 3 Cookbook

How do you tokenize text and what is WordNet?

How do you tokenize text into sentences?

How do you tokenize sentences into words?

How do you tokenize sentences using regular expressions?

How do you train a sentence tokenizer?

How do you filter stopwords?

What are Synsets and how do you use them?

What are lemmas and synonyms and how do you use them?

How do you calculate similarity using WordNet Synsets?

How do you discover word collocations?

How do you replace and correct words?

How do you create a custom corpus?

What is part-of-speech tagging and how do you do it?

How do you extract chunks of text?

How do you transform chunks and trees?

How do you classify text?

How do you use distributed processing to handle large data sets?

How do you parse specific data types?
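The tokenization questions above can be sketched with plain regular expressions. This is a minimal sketch using Python's built-in re module (the sample sentence is made up); NLTK's trained punkt sentence tokenizer exists precisely because a naive split mishandles abbreviations like "Mr.", as shown here:

```python
import re

text = "Mr. Smith arrived. He sat down! Then he spoke?"

# Naive sentence split: sentence-ending punctuation followed by whitespace.
# Note it wrongly splits after the abbreviation "Mr." -- the motivation
# for training a sentence tokenizer on real text.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: words (optionally with a contraction) or a single
# punctuation character.
words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentences[1])
```

The same word-level pattern is roughly what NLTK's RegexpTokenizer lets you supply directly.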


  • 1. Perkins J. Python 3 text processing with NLTK 3 cookbook: over 80 practical recipes on natural language processing techniques using Python’s NLTK 3.0. 2. ed. Birmingham: Packt Publ; 2014. 288 p. (Packt open source).

The Undoing Project


  • Lewis, Michael. The Undoing Project: A Friendship That Changed Our Minds. First edition. New York: W.W. Norton & Company, 2017.

Barking Up the Wrong Tree

What really produces success?

The introduction opens with some information about a record-holder in the Race Across America, Jure Robic, who succeeded because of what might be seen as a mental illness. The point is that the conventional wisdom of how to be successful is often wrong. Instead of relying on platitudes, the author suggests that we should use what research has found to guide us on how to be more successful.

Should we play it safe and do what we're told if we want to succeed?

On Valedictorians

  • High School success predicts college success
  • College success predicts professional success
  • College success doesn't predict exceptional success, just reasonable success
  • Valedictorians almost never reach the top of their field - "they settle into the system instead of shaking it up" (Karen Arnold)
  • School rewards meeting expectations and generalization, not excelling in any one area

Filtered and Unfiltered Leaders

  • Most leaders rise up through the ranks - they are pre-filtered and will do what is expected of them
  • Some leaders get in by a chance of history, and they tend to be the ones we remember as great (or horrible), because they don't do what the consensus expects
  • The unfiltered leaders are great or horrible based on whether the circumstances favor their idiosyncrasies or not

Orchid Children and Intensifiers

  • The same genes that lead to bad outcomes (e.g. alcoholism) can also lead to unusually good ones depending on the person's upbringing
  • Some genes seem to intensify traits based on the situation, rather than dictate what they will be
  • Understanding the Orchid Child (Berkeley Wellness)
  • The Science of Success (The Atlantic)
  • Dandelions: They will come out the same in pretty much any circumstance
  • Orchids: They are sensitive to their settings


  • abused or neglected children with this chromosome tend to become alcoholics and bullies
  • those raised with good parenting tended to be unusually generous


  • poor parenting produces delinquents
  • good parenting produces unusually successful adults


  • poor parenting produces cheaters
  • good parenting produces rule-followers

Hopeful Monsters

  • Nature produces 'freaks' because in the right circumstances they can have advantages over the 'normal'

Know Thyself

  • whether you are a conformist or a hopeful monster you are unlikely to change, so change your circumstances to favor who you are
  • Feedback Analysis (Peter Drucker): Any time you take on a project, write down what you think will happen; when it's over, write down what actually happened. Over time you'll figure out what works for you and what doesn't. Sounds like the NPR/scientific method.
    • What did you think would happen?
    • What actually happened?
    • Why do you think it happened the way it did?

Choose the right pond

  • Whether you succeed or fail will depend on whether your context supports your particular traits
  • Once you know who you are, pick a context that supports you; don't try to conform to a place you don't belong (because you never will)

Do nice guys finish last?

Do quitters never win and winners never quit?

It's not what you know, it's who you know (unless it really is what you know)

Believe in yourself… sometimes

Work, work, work… or work-life balance?

What makes a successful life?


  • Barker, Eric. Barking up the Wrong Tree: The Surprising Science behind Why Everything You Know about Success Is (Mostly) Wrong. First edition. New York, NY: HarperOne, 2017.

Notes on the Elements of Data Analytic Style

What's this then?

These are my notes on The Elements of Data Analytic Style, by Jeff Leek. I first encountered it while taking the Johns Hopkins Data Science Specialization (which I eventually dropped because I was turned off by R; maybe I'll take it up again and use rpy2, or maybe not).

What is the data analytic question?

Before doing a data analysis you should give a broad categorization for the question you are trying to answer using your analysis. This doesn't ask what the question is, specifically, but rather it gives your question a type.

Define the data analytic question first


What is a descriptive data analysis?

A descriptive data analysis attempts to summarize the data as a value without interpreting what it means. The U.S. Census is an example of a descriptive analysis. They provide summary statistics about subjects like the population, the economy, etc. but they leave it to other people to interpret what the numbers mean.

What is an exploratory data analysis?

An exploratory data analysis extends a descriptive data analysis by looking for interesting things within the data.

What is an inferential data analysis?

An inferential data analysis tries to give the probability that a discovery made in an exploratory data analysis will hold if a new sample is collected.

What is a predictive data analysis?

While an inferential data analysis tries to quantify existing relationships within a population a predictive data analysis attempts to predict an outcome based on some other measured data. Finding that unemployed males voted more frequently for a candidate is inferential, predicting that unemployed males will vote for a certain candidate based on polling data is the outcome of a predictive data analysis.

What is a causal data analysis?

A causal data analysis tries to find out what would happen to one measurement if you make a change to another. If you introduce a treatment, will the outcome be different from the control group that didn't get the treatment?

What is a mechanistic data analysis?

Causal data analysis uses probability to infer relationships: you are trying to see if, on average, there will be a change in the outcome given a treatment. A mechanistic data analysis tries to see if there is an effect that will always occur. If you change the resistance of a brake pad, then given a certain set of conditions you would expect the change it brings to be the same every time. This type of analysis is largely limited to engineering projects.

What are some common mistakes people make when doing a data analysis?

Why does correlation not imply causation when doing inference?

When you do an inferential analysis you are seeing if there is some kind of relationship between variables, but without doing a randomized study you can't tell if one variable is causing a change in another. The author Tyler Vigen has a website illustrating correlations, taken from real data sources, between variables that appear to be entirely unrelated.

What is overfitting in an exploratory analysis?

If you try to interpret an exploratory analysis as predictive, then you are likely to overfit the data. You should keep the training data for your model separate from your testing data so that you don't fit your model to the specific subset of data you have.
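A minimal illustration of keeping training and testing data separate, in plain Python with a made-up data set (in practice a helper such as scikit-learn's train_test_split does this):

```python
import random

random.seed(42)

# Hypothetical data set of 100 observations (indices stand in for rows).
data = list(range(100))
random.shuffle(data)

# Hold out 20% for testing: fit the model on `train` only, and report
# performance only on `test`, which the model has never seen.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
```

Because the two sets share no rows, a good score on `test` can't come from memorizing the sample.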

What is an n of 1 analysis?

n of 1 refers to the common mistake of trying to infer things about the population from a single (anecdotal) example. In order to make an inferential analysis you need a large data set that is representative of your population of interest.

What is data dredging?

This is also known as p-hacking and refers to the practice of repeatedly changing your hypothesis while testing the same data set until you find something with statistical significance. The data used to construct the hypothesis (the exploratory data) should always be different from the data used to test the hypothesis.

What is tidy data?

Tidy Data is a concept characterized by Hadley Wickham in a paper called, appropriately enough, Tidy Data. It is a way of organizing data to make it easy to share, use in computation, and analyze.

What makes up a data set?

These are the components of a dataset that has been processed:

  1. The raw data
  2. A tidy data set
  3. A code book that describes each variable and its values in the tidy data set
  4. An explicit and exact recipe to go from the raw data to the tidy data set and code book

What is Raw Data?

This is data that you haven't changed in any way, exactly as it was recorded.

Why is Raw Data relative?

Sometimes the raw data needs to have at least some processing to be useful. If it's in a format you can't interpret, for instance, you won't be able to work with it until it has had at least some pre-processing. You want it to be as raw as possible, but it still has to be usable.

What are the four principles of Tidy Data?

  1. Each variable should be in one column
  2. Each observation should be in a different row
  3. There should be one table for each "type" of variable
  4. If you have multiple tables, they should include a column in the table that allows you to link them
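Principles 1 and 2 can be sketched in plain Python (the country/population figures here are hypothetical; with pandas the same wide-to-tidy reshaping is usually done with melt):

```python
# Wide table: one row per country, one column per year -- NOT tidy,
# because "year" is a variable hidden in the column names.
wide = [
    {"country": "US", "2019": 328, "2020": 331},
    {"country": "FR", "2019": 67, "2020": 67},
]

# Tidy: each variable (country, year, population) is its own column,
# and each observation (one country in one year) is its own row.
tidy = [
    {"country": row["country"], "year": year, "population": row[year]}
    for row in wide
    for year in ("2019", "2020")
]
```

Two countries times two years yields four observations, one per row, which is exactly the shape most analysis tools expect.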

What should the first row in the data be?

How should you share it if you are using Excel?

What is a code book?

How do you create an instruction list or script?

What is the ideal instruction list?

What do you have to include if you don't include a script?

What are some common errors people make when creating tidy data?

Checking the data

Exploratory analysis

Statistical modeling and inference

Prediction and machine learning


Written analyses

Creating figures

Presenting data


A few matters of form

The data analysis checklist

Answering The Question [/]

  • [ ] Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
  • [ ] Did you define the metric for success before beginning?
  • [ ] Did you understand the context for the question and the scientific or business application?
  • [ ] Did you record the experimental design?
  • [ ] Did you consider whether the question could be answered with the available data?

Checking the Data

  • [ ] Did you plot univariate and multivariate summaries of the data?
  • [ ] Did you check for outliers?
  • [ ] Did you identify the missing data code?

Tidying the data

  • [ ] Is each variable one column?
  • [ ] Is each observation one row?
  • [ ] Do different data types appear in each table?
  • [ ] Did you record the recipe for moving from raw to tidy data?
  • [ ] Did you create a code book?
  • [ ] Did you record all parameters, units, and functions applied to the data?

Exploratory analysis

  • [ ] Did you identify missing values?
  • [ ] Did you make univariate plots?
    • Histograms
    • Density Plots
    • Boxplots
  • [ ] Did you consider correlations between variables (scatterplots)?
  • [ ] Did you check the units of all data points to make sure they are in the right range?
  • [ ] Did you consider plotting on a log scale?
  • [ ] Would a scatterplot be more informative?




Written analyses






  1. Leek, Jeff. The Elements of Data Analytic Style. Leanpub; 2015. 93 p.