Messy Medical Data
Introduction
The source data is some messy medical data. The goal is to extract dates from the data using regular expressions.
Data Description
The data is in a file named medical_records.csv
.
- Each line of the
medical_records.csv
file corresponds to a medical note. - Each note has a date that needs to be extracted, but there is no single format for the dates.
class Messy:
source = "medical_records.csv"
def __init__(self):
self._lines = None
return
@property
def lines(self):
"""Lines of medical notes
Each line represents one medical none
Returns:
list: list of medical notes
"""
if self._lines is None:
with open(self.source) as reader:
self._lines = reader.readlines()
return self._lines
def __len__(self):
"""the number of lines in the data"""
return len(self.lines)
def __getitem__(self, slicer):
"""Return the line matching the slice
Args:
slicer (slice): slice for the lines
Raises:
IndexError: index out of range for the lines
"""
return self.lines[slicer]
There are 500 notes in the messy medical data.
Sample Lines
Here are the first ten lines in the file.
for line in messy_data[:10]:
print(line.strip())
The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.
Here is a list of some of the variants you might encounter in this dataset.
Description | Examples |
---|---|
Month/Day/Year | 04/20/2009; 04/20/09; 4/20/09; 4/3/09 |
Named Month Day Year | Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; |
Day Month Year | 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 |
Month Ordinal Day Year | - Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009 |
Named Month Year | Feb 2009; Sep 2009; Oct 2010 |
Month/Year | 6/2008; 12/2009 |
Year | 2009; 2010 |
Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:
- Assume all dates in xx/xx/xx format are mm/dd/yy
- Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
- If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
- If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.
For example if the original series was this:
0 1999 1 2010 2 1978 3 2015 4 1985
0 2 1 4 2 0 3 1 4 3
Your score will be calculated using Kendall's tau, a correlation measure for ordinal data.
This function should return a Series of length 500 and dtype int.
Imports
# from pypi
import pandas
Loading The Data
with open('dates.txt') as reader:
data = pandas.Series(reader.readlines())
data.head(10)
data.describe()
The Grammar
Cardinality
ZERO_OR_MORE = '*'
ONE_OR_MORE = "+"
ZERO_OR_ONE = '?'
EXACTLY_TWO = "{2}"
ONE_OR_TWO = "{1,2}"
EXACTLY_ONE = '{1}'
Groups and Classes
GROUP = r"({})"
NAMED = r"(?P<{}>{})"
CLASS = "[{}]"
NEGATIVE_LOOKAHEAD = "(?!{})"
NEGATIVE_LOOKBEHIND = "(?<!{})"
POSITIVE_LOOKAHEAD = "(?={})"
POSITIVE_LOOKBEHIND = "(?<={})"
ESCAPE = "\{}"
Numbers
DIGIT = r"\d"
ONE_DIGIT = DIGIT + EXACTLY_ONE
ONE_OR_TWO_DIGITS = DIGIT + ONE_OR_TWO
NON_DIGIT = NEGATIVE_LOOKAHEAD.format(DIGIT)
TWO_DIGITS = DIGIT + EXACTLY_TWO
THREE_DIGITS = DIGIT + "{3}"
EXACTLY_TWO_DIGITS = DIGIT + EXACTLY_TWO + NON_DIGIT
FOUR_DIGITS = DIGIT + r"{4}" + NON_DIGIT
String Literals
SLASH = r"/"
OR = r'|'
LOWER_CASE = "a-z"
SPACE = "\s"
DOT = "."
DASH = "-"
COMMA = ","
PUNCTUATION = CLASS.format(DOT + COMMA + DASH)
EMPTY_STRING = ""
Dates
These are parts to build up the date-expressions.
MONTH_SUFFIX = (CLASS.format(LOWER_CASE) + ZERO_OR_MORE
+ CLASS.format(SPACE + DOT + COMMA + DASH) + ONE_OR_TWO)
MONTH_PREFIXES = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTHS = [month + MONTH_SUFFIX for month in MONTH_PREFIXES]
MONTHS = GROUP.format(OR.join(MONTHS))
DAY_SUFFIX = CLASS.format(DASH + COMMA + SPACE) + ONE_OR_TWO
DAYS = ONE_OR_TWO_DIGITS + DAY_SUFFIX
YEAR = FOUR_DIGITS
This is for dates like Mar 21st, 2009, those with suffixes on the days.
CONTRACTED = (ONE_OR_TWO_DIGITS
+ LOWER_CASE
+ EXACTLY_TWO
)
CONTRACTION = NAMED.format("contraction",
MONTHS
+ CONTRACTED
+ DAY_SUFFIX
+ YEAR)
This is for dates that have no days in them, like May 2009.
NO_DAY_BEHIND = NEGATIVE_LOOKBEHIND.format(DIGIT + SPACE)
NO_DAY = NAMED.format("no_day", NO_DAY_BEHIND + MONTHS + YEAR)
This is for the most common form (that I use) - May 21, 2017.
WORDS = NAMED.format("words", MONTHS + DAYS + YEAR)
This is for the case where the day is placed before them month - 20 March, 2009.
BACKWARDS = NAMED.format("backwards", ONE_OR_TWO_DIGITS + SPACE + MONTHS + YEAR)
This is the case where slashes are used but only two digits were used for the year (so we're assuming it's in the twentieth century) - 8/4/98.
slashed = SLASH.join([ONE_OR_TWO_DIGITS,
ONE_OR_TWO_DIGITS,
EXACTLY_TWO_DIGITS])
dashed = DASH.join([ONE_OR_TWO_DIGITS,
ONE_OR_TWO_DIGITS,
EXACTLY_TWO_DIGITS])
TWENTIETH_CENTURY = NAMED.format("twentieth",
OR.join([slashed, dashed]))
This is the case where digits with slashes are used and all four digits are used for the year - 8/4/1998.
NUMERIC = NAMED.format("numeric",
SLASH.join([ONE_OR_TWO_DIGITS,
ONE_OR_TWO_DIGITS,
FOUR_DIGITS]))
This is the case where only month and year are given as digits - 9/2009. There are two expressions, because the day can be one or two digits.
NO_PRECEDING_SLASH = NEGATIVE_LOOKBEHIND.format(SLASH)
NO_PRECEDING_SLASH_DIGIT = NEGATIVE_LOOKBEHIND.format(CLASS.format(SLASH + DIGIT))
NO_ONE_DAY = (NO_PRECEDING_SLASH_DIGIT
+ ONE_DIGIT
+ SLASH
+ FOUR_DIGITS)
NO_TWO_DAYS = (NO_PRECEDING_SLASH
+ TWO_DIGITS
+ SLASH
+ FOUR_DIGITS)
NO_DAY_NUMERIC = NAMED.format("no_day_numeric",
NO_ONE_DAY
+ OR
+ NO_TWO_DAYS
)
This is the case where only a year was given. This is the hardest case, since you don't want to accidentally match the other cases, but the text preceding and following it could be anything. For the look-behind, all the cases have to have the same number of characters so we can't re-use the other expressions
CENTURY = GROUP.format('19' + OR + "20") + TWO_DIGITS
DIGIT_SLASH = DIGIT + SLASH
DIGIT_DASH = DIGIT + DASH
DIGIT_SPACE = DIGIT + SPACE
LETTER_SPACE = CLASS.format(LOWER_CASE) + SPACE
COMMA_SPACE = COMMA + SPACE
YEAR_PREFIX = NEGATIVE_LOOKBEHIND.format(OR.join([
DIGIT_SLASH,
DIGIT_DASH,
DIGIT_SPACE,
LETTER_SPACE,
COMMA_SPACE,
]))
YEAR_ONLY = NAMED.format("year_only",
YEAR_PREFIX + CENTURY
)
These are leftovers that don't really match anything.
IN_PREFIX = POSITIVE_LOOKBEHIND.format(CLASS.format('iI') + 'n' + SPACE) + CENTURY
SINCE_PREFIX = POSITIVE_LOOKBEHIND.format(CLASS.format("Ss") + 'ince' + SPACE) + CENTURY
AGE = POSITIVE_LOOKBEHIND.format("Age" + SPACE + TWO_DIGITS + COMMA + SPACE) + CENTURY
AGE_COMMA = POSITIVE_LOOKBEHIND.format("Age" + COMMA + SPACE + TWO_DIGITS + COMMA + SPACE) + CENTURY
OTHERS = ['delivery', "quit", "attempt", "nephrectomy", THREE_DIGITS]
OTHERS = [POSITIVE_LOOKBEHIND.format(label + SPACE) + CENTURY for label in OTHERS]
OTHERS = OR.join(OTHERS)
LEFTOVERS_PREFIX = OR.join([IN_PREFIX, SINCE_PREFIX, AGE, AGE_COMMA]) + OR + OTHERS
LEFTOVERS = NAMED.format("leftovers", LEFTOVERS_PREFIX)
This is the combined expression for all the dates - the one that should be used to extract them from the data.
DATE = NAMED.format("date", OR.join([NUMERIC,
TWENTIETH_CENTURY,
WORDS,
BACKWARDS,
CONTRACTION,
NO_DAY,
NO_DAY_NUMERIC,
YEAR_ONLY,
LEFTOVERS]))
def twentieth_century(date):
"""adds a 19 to the year
Args:
date (re.Regex): Extracted date
"""
month, day, year = date.group(1).split(SLASH)
year = "19{}".format(year)
return SLASH.join([month, day, year])
def take_two(line):
match = re.search(TWENTIETH_CENTURY, line)
if match:
return twentieth_century(match)
return line
Applying The Grammer
def extract_and_count(expression, data, name):
"""extract all matches and report the count
Args:
expression (str): regular expression to match
data (pandas.Series): data with dates to extratc
name (str): name of the group for the expression
Returns:
tuple (pandas.Series, int): extracted dates, count
"""
extracted = data.str.extractall(expression)[name]
count = len(extracted)
print("'{}' matched {} rows".format(name, count))
return extracted, count
numeric, numeric_count = extract_and_count(NUMERIC, data, 'numeric')
twentieth, twentieth_count = extract_and_count(TWENTIETH_CENTURY, data, 'twentieth')
words, words_count = extract_and_count(WORDS, data, 'words')
backwards, backwards_count = extract_and_count(BACKWARDS, data, 'backwards')
contraction_data, contraction = extract_and_count(CONTRACTION, data, 'contraction')
no_day, no_day_count = extract_and_count(NO_DAY, data, 'no_day')
no_day_numeric, no_day_numeric_count = extract_and_count(NO_DAY_NUMERIC, data,
"no_day_numeric")
year_only, year_only_count = extract_and_count(YEAR_ONLY, data, "year_only")
leftovers, leftovers_count = extract_and_count(LEFTOVERS, data, "leftovers")
found = data.str.extractall(DATE)
total_found = len(found.date)
print("Total Found: {}".format(total_found))
print("Remaining: {}".format(len(data) - total_found))
print("Discrepancy: {}".format(total_found - (numeric_count
+ twentieth_count
+ words_count
+ backwards_count
+ contraction
+ no_day_count
+ no_day_numeric_count
+ year_only_count
+ leftovers_count)))
missing = [label for label in data.index if label not in found.index.levels[0]]
try:
print(missing[0], data.loc[missing[0]])
except IndexError:
print("all rows matched")
Unifying the Formats
To make it simpler, I'm going to use the mm/dd/yyyy
format for the dates. I'm going to use the extracted series to avoid having different clean-up cases contaminating each other - e.g. dealing with 'January' when the day comes first as opposed to when the month comes first.
Helper Functions
Clean
This is a generic function to clean up some data. I was initially using it directly, but for cases where the expression and replacement function are used more than once, there are helper functions to make it easier.
def clean(source, expression, replacement, sample=5):
"""applies the replacement to the source
as a side-effect shows sample rows before and after
Args:
source (pandas.Series): source of the strings
expression (str): regular expression to match what to replace
replacement: function or expression to replace the matching expression
sample (int): number of randomly chosen examples to show
Returns:
pandas.Series: the source with the replacement applied to it
"""
print("Random Sample Before:")
print(source.sample(sample))
cleaned = source.str.replace(expression, replacement)
print("\nRandom Sample After:")
print(cleaned.sample(sample))
print("\nCount of cleaned: {}".format(len(cleaned)))
assert len(source) == len(cleaned)
return cleaned
Clean Punctuation
def clean_punctuation(source, sample=5):
"""removes punctuation
Args:
source (pandas.Series): data to clean
sample (int): size of sample to show
Returns:
pandas.Series: source with punctuation removed
"""
print("Cleaning Punctuation")
if any(source.str.contains(PUNCTUATION)):
source = clean(source, PUNCTUATION, EMPTY_STRING)
return source
Convert Long Month Names to Three-Letter Names
LONG_TO_SHORT = dict(January="Jan",
February="Feb",
March="Mar",
April="Apr",
May="May",
June="Jun",
July="Jul",
August="Aug",
September="Sep",
October="Oct",
November="Nov",
December="Dec")
# it turns out there are spelling errors in the data so this has to be fuzzy
LONG_TO_SHORT_EXPRESSION = OR.join([GROUP.format(month)
+ CLASS.format(LOWER_CASE)
+ ZERO_OR_MORE
for month in LONG_TO_SHORT.values()])
def long_month_to_short(match):
"""convert long month to short
Args:
match (re.Match): object matching a long month
Returns:
str: shortened version of the month
"""
return match.group(match.lastindex)
This next function is the one you would actually use to make the conversion.
def convert_long_months_to_short(source, sample=5):
"""convert long month names to short
Args:
source (pandas.Series): data with months
sample (int): size of sample to show
Returns:
pandas.Series: data with short months
"""
return clean(source,
LONG_TO_SHORT_EXPRESSION,
long_month_to_short)
Add January 1 to year-only dates
def add_month_date(match):
"""adds 01/01 to years
Args:
match (re.Match): object that only matched a 4-digit year
Returns:
str: 01/01/YYYY
"""
return "01/01/" + match.group()
And now the function to actually call.
def add_january_one(source):
"""adds /01/01/ to year-only dates
Args:
source (pandas.Series): data with the dates
Returns:
pandas.Series: years in source with /01/01/ added
"""
return clean(source, YEAR_ONLY, add_month_date)
Two-Digit Numbers
This makes sure that there are exactly two digits in a number, adding a leading zero if needed.
two_digit_expression = GROUP.format(ONE_OR_TWO_DIGITS) + POSITIVE_LOOKAHEAD.format(SLASH)
def two_digits(match):
"""add a leading zero if needed
Args:
match (re.Match): match with one or two digits
Returns:
str: the matched string with leading zero if needed
"""
# for some reason the string-formatting raises an error if it's a string
# so cast it to an int
return "{:02}".format(int(match.group()))
This is the function to call for the case where the number is followed by a slash (e.g. 2/
).
def clean_two_digits(source, sample=5):
"""makes sure source has two-digits
Args:
source (pandas.Series): data with digit followed by slash
sample (int): number of samples to show
Returns:
pandas.Series: source with digits coerced to two digits
"""
return clean(source, two_digit_expression, two_digits, sample)
This is like clean_two_digits
but it doesn't check for the trailing slash. Use this if you have an isolated column of numbers that need to be two-digits.
def clean_two_digits_isolated(source, sample=5):
"""cleans two digits that are standalone
Args:
source (pandas.Series): source of the data
sample (int): number of samples to show
Returns:
pandas.Series: converted data
"""
return clean(source, ONE_OR_TWO_DIGITS, two_digits, sample)
Cleaning Up Months
These clean up and convert written months (e.g. change Aug
to 08
).
digits = ("{:02}".format(month) for month in range(1, 13))
MONTH_TO_DIGITS = dict(zip(MONTH_PREFIXES, digits))
SHORT_MONTHS_EXPRESSION = OR.join((GROUP.format(month) for month in MONTH_TO_DIGITS))
def month_to_digits(match):
"""converts short month to digits
Args:
match (re.Match): object with short-month
Returns:
str: month as two-digit number (e.g. Jan -> 01)
"""
return MONTH_TO_DIGITS[match.group()]
def convert_short_month_to_digits(source, sample=5):
"""converts three-letter months to two-digits
Args:
source (pandas.Series): data with three-letter months
sample (int): number of samples to show
Returns:
pandas.Series: source with short-months coverted to digits
"""
return clean(source,
SHORT_MONTHS_EXPRESSION,
month_to_digits,
sample)
This function runs the previous three and is the main one that should be used. The others can be run individually for troubleshooting, though.
def clean_months(source, sample=5):
"""clean up months (which start as words)
Args:
source (pandas.Series): source of the months
sample (int): number of random samples to show
"""
cleaned = clean_punctuation(source)
print("Converting long months to short")
cleaned = clean(cleaned,
LONG_TO_SHORT_EXPRESSION,
long_month_to_short, sample)
print("Converting short months to digits")
cleaned = clean(cleaned,
SHORT_MONTHS_EXPRESSION,
month_to_digits, sample)
return cleaned
Frame To Series
This is for the case where the date-fields were broken up into columns in a data-frame.
def frame_to_series(frame, index_source, samples=5):
"""re-combines data-frame into a series
Args:
frame (pandas.DataFrame): frame with month, day, year columns
index_source (pandas.series): source to copy index from
samples (index): number of random entries to print when done
Returns:
pandas.Series: series with dates as month/day/year
"""
combined = frame.month + SLASH + frame.day + SLASH + frame.year
combined.index = index_source.index
print(combined.sample(samples))
return combined
Year Only
For the case where there is only a year, I'll add January 1 to the dates.
year_only_cleaned = add_january_one(year_only)
Leftovers
These were the odd cases that didn't seem to have a real pattern. Since I used a negative lookbehind they only have the years in them, like the other year-only cases.
leftovers_cleaned = add_january_one(leftovers)
cleaned = pandas.concat([year_only_cleaned, leftovers_cleaned])
print(len(cleaned))
No Day Numeric
This is for the case where the date is formatted with slashes and there are no day-values. To make the months uniform I'm going to make them all two-digits first.
no_day_numeric_cleaned = clean_two_digits(no_day_numeric)
Now I'll add the day.
no_day_numeric_cleaned = clean(no_day_numeric_cleaned,
SLASH,
lambda m: "/01/")
And add it to the total.
original = len(cleaned)
cleaned = pandas.concat([cleaned, no_day_numeric_cleaned])
assert len(cleaned) == no_day_numeric_count + original
print(len(cleaned))
No Day
This is for cases like Mar 2011 where no day was given. We're going to assume that it's the first day of the month for each case.
no_day_cleaned = clean_months(no_day)
Now we need to replace the spaces with the days.
no_day_cleaned = clean(no_day_cleaned,
SPACE + ONE_OR_MORE,
lambda match: "/01/")
Now we can add it to the cleaned.
original = len(cleaned)
cleaned = pandas.concat([cleaned, no_day_cleaned])
print(len(cleaned))
Now to make sure we're where we expect we are.
assert len(cleaned) == no_day_count + original
Contraction
There were no matches for the contraction so I'll ignore it for now.
Backwards
This is the case where the day comes first. The first thing I'll do is split them up.
frame = pandas.DataFrame(backwards.str.split().tolist(),
columns="day month year".split())
frame.head()
The next thing to do is to make sure the days all have two digits.
frame.day = clean_two_digits(frame.day)
Next comes the months. This is basically the same problem as with the no day case so I'll re-use some of the code for that.
frame.month = clean_months(frame.month)
Now we need to combine them back together. In hindsight it might have been easier to convert everything into data frames instead of the other way around. Or maybe not. Since we want the indexes from the original data as our final answer I also have to copy the index from the original series
backwards_cleaned = frame_to_series(frame, backwards)
140 0 02/14/1995 177 0 01/18/1990 156 0 05/26/1974 189 0 10/21/1977 160 0 10/21/2007 dtype: object
No it gets added to the combined series.
original = len(cleaned)
cleaned = pandas.concat([cleaned, backwards_cleaned])
assert len(cleaned) == original + backwards_count
print(len(cleaned))
Words
Since working with the data frame was easier than I though it would be I'll do that again.
frame = pandas.DataFrame(words.str.split().tolist(), columns="month day year".split())
print(frame.head())
First we'll clean out the months.
frame.month = clean_months(frame.month)
Now we'll clean up the punctuation for the days.
frame.day = clean_punctuation(frame.day)
So, what do we have so far?
frame.head()
At this point we need to combine everything with a slash and restore the index.
words_cleaned = frame_to_series(frame, words)
Now we'll add it to the total.
original = len(cleaned)
cleaned = pandas.concat([cleaned, words_cleaned])
assert len(cleaned) == original + words_count
print(len(cleaned))
Twentieth Century
We'll do the same trick with creating a dataframe. The first thing, though, is to replace the dashes with slashes.
print(twentieth.iloc[21])
twentieth_cleaned = twentieth.str.replace(DASH, SLASH)
print(cleaned.iloc[21])
Now, we'll create the frame.
frame = pandas.DataFrame(twentieth_cleaned.str.split(SLASH).tolist(),
columns=["month", "day", "year"])
print(frame.head())
Months
The months need to be converted to two-digits.
frame.month = clean_two_digits_isolated(frame.month)
As do the days.
frame.day = clean_two_digits_isolated(frame.day)
frame.head()
Now we have to add 19
to each of the years.
frame.year = clean(frame.year, TWO_DIGITS, lambda match: "19" + match.group())
Now we have to join them back up.
twentieth_cleaned = frame_to_series(frame, twentieth)
original = len(cleaned)
cleaned = pandas.concat([cleaned, twentieth_cleaned])
assert len(cleaned) == original + twentieth_count
Numeric
The final category is dates with the format mm/dd/yyyy
.
print(numeric.head())
We should check and make sure there are no dashes here.
has_dashes = numeric.str.contains(DASH)
print(numeric[has_dashes])
It looks like it doesn't so we'll skip this check.
frame = pandas.DataFrame(numeric.str.split(SLASH).tolist(),
columns="month day year".split())
print(frame.head())
frame.month = clean_two_digits_isolated(frame.month)
frame.day = clean_two_digits_isolated(frame.day)
numeric_cleaned = frame_to_series(frame, numeric)
original = len(cleaned)
cleaned = pandas.concat([cleaned, numeric_cleaned])
assert len(cleaned) == original + numeric_count
print(len(cleaned))
At this point it looks like we've cleaned all the cases.
Re-combining The Cleaned
Because these notebooks can execute things out of order I'm going to create one monolithic concatenation and ignore the one that I was using to keep the running total.
cleaned = pandas.concat([numeric_cleaned,
twentieth_cleaned,
words_cleaned,
backwards_cleaned,
no_day_cleaned,
no_day_numeric_cleaned,
year_only_cleaned,
leftovers_cleaned,
])
print(len(cleaned))
print(cleaned.head())
assert len(cleaned) == len(data)
Convert to Datetimes
print(cleaned.head())
datetimes = pandas.to_datetime(cleaned, format="%m/%d/%Y")
print(datetimes.head())
sorted_dates = datetimes.sort_values()
print(sorted_dates.head())
print(sorted_dates.tail())
The grader wants a Series with the indices of the original data put in the order of the sorted dates.
answer = pandas.Series(sorted_dates.index.labels[0])
print(answer.head())
The datesorter Function
This is the function called by the grader. Since the work was done outside of it we just need to make sure that it returns our answer.
def date_sorter():
return answer
note: This produced a 94% score, so there are still some cases not correctly handled.