Messy Medical Data

Cloistered Monkey

2018-07-23 12:12

Introduction

The source data is some messy medical data. The goal is to extract dates from the data using regular expressions.

Data Description

The data is in a file named medical_records.csv.

Each line of the medical_records.csv file corresponds to a medical note.
Each note has a date that needs to be extracted, but there is no single format for the dates.

class Messy:
    source = "medical_records.csv"
    def __init__(self):
	self._lines = None
	return

    @property
    def lines(self):
	"""Lines of medical notes

	Each line represents one medical none

	Returns:
	 list: list of medical notes
	"""
	if self._lines is None:
	    with open(self.source) as reader:
		self._lines = reader.readlines()
	return self._lines

    def __len__(self):
	"""the number of lines in the data"""
	return len(self.lines)

    def __getitem__(self, slicer):
	"""Return the line matching the slice

	Args:
	 slicer (slice): slice for the lines

	Raises:
	 IndexError: index out of range for the lines
	"""
	return self.lines[slicer]

There are 500 notes in the messy medical data.

Sample Lines

Here are the first ten lines in the file.

for line in messy_data[:10]:
    print(line.strip())

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.

Here is a list of some of the variants you might encounter in this dataset.

Description	Examples
Month/Day/Year	04/20/2009; 04/20/09; 4/20/09; 4/3/09
Named Month Day Year	Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
Day Month Year	20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Month Ordinal Day Year	- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Named Month Year	Feb 2009; Sep 2009; Oct 2010
Month/Year	6/2008; 12/2009
Year	2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:

Assume all dates in xx/xx/xx format are mm/dd/yy
Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

Your score will be calculated using Kendall's tau, a correlation measure for ordinal data.

This function should return a Series of length 500 and dtype int.

Imports

# from pypi
import pandas

Loading The Data

with open('dates.txt') as reader:
    data = pandas.Series(reader.readlines())

data.head(10)

data.describe()

The Grammar

Cardinality

ZERO_OR_MORE = '*'
ONE_OR_MORE = "+"
ZERO_OR_ONE = '?'
EXACTLY_TWO = "{2}"
ONE_OR_TWO = "{1,2}"
EXACTLY_ONE = '{1}'

Groups and Classes

GROUP = r"({})"
NAMED = r"(?P<{}>{})"
CLASS = "[{}]"
NEGATIVE_LOOKAHEAD = "(?!{})"
NEGATIVE_LOOKBEHIND = "(?<!{})"
POSITIVE_LOOKAHEAD = "(?={})"
POSITIVE_LOOKBEHIND = "(?<={})"
ESCAPE = "\{}"

Numbers

DIGIT = r"\d"
ONE_DIGIT  = DIGIT + EXACTLY_ONE
ONE_OR_TWO_DIGITS = DIGIT + ONE_OR_TWO
NON_DIGIT = NEGATIVE_LOOKAHEAD.format(DIGIT)
TWO_DIGITS = DIGIT + EXACTLY_TWO
THREE_DIGITS = DIGIT + "{3}"
EXACTLY_TWO_DIGITS = DIGIT + EXACTLY_TWO + NON_DIGIT
FOUR_DIGITS = DIGIT + r"{4}" + NON_DIGIT

String Literals

SLASH = r"/"
OR = r'|'
LOWER_CASE = "a-z"
SPACE = "\s"
DOT = "."
DASH = "-"
COMMA = ","
PUNCTUATION = CLASS.format(DOT + COMMA + DASH)
EMPTY_STRING = ""

Dates

These are parts to build up the date-expressions.

MONTH_SUFFIX = (CLASS.format(LOWER_CASE) + ZERO_OR_MORE
		+ CLASS.format(SPACE + DOT + COMMA + DASH) + ONE_OR_TWO)
MONTH_PREFIXES = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTHS = [month + MONTH_SUFFIX for month in MONTH_PREFIXES]
MONTHS = GROUP.format(OR.join(MONTHS))
DAY_SUFFIX = CLASS.format(DASH + COMMA + SPACE) + ONE_OR_TWO
DAYS = ONE_OR_TWO_DIGITS + DAY_SUFFIX
YEAR = FOUR_DIGITS

This is for dates like Mar 21st, 2009, those with suffixes on the days.

CONTRACTED = (ONE_OR_TWO_DIGITS
	      + LOWER_CASE
	      + EXACTLY_TWO
	      )
CONTRACTION = NAMED.format("contraction",
			   MONTHS
			   + CONTRACTED
			   + DAY_SUFFIX
			   + YEAR)

This is for dates that have no days in them, like May 2009.

NO_DAY_BEHIND = NEGATIVE_LOOKBEHIND.format(DIGIT + SPACE)
NO_DAY = NAMED.format("no_day", NO_DAY_BEHIND + MONTHS + YEAR)

This is for the most common form (that I use) - May 21, 2017.

WORDS = NAMED.format("words", MONTHS + DAYS + YEAR)

This is for the case where the day is placed before them month - 20 March, 2009.

BACKWARDS = NAMED.format("backwards", ONE_OR_TWO_DIGITS + SPACE + MONTHS + YEAR)

This is the case where slashes are used but only two digits were used for the year (so we're assuming it's in the twentieth century) - 8/4/98.

slashed = SLASH.join([ONE_OR_TWO_DIGITS,
		      ONE_OR_TWO_DIGITS,
		      EXACTLY_TWO_DIGITS])
dashed = DASH.join([ONE_OR_TWO_DIGITS,
		    ONE_OR_TWO_DIGITS,
		    EXACTLY_TWO_DIGITS])
TWENTIETH_CENTURY = NAMED.format("twentieth",
				 OR.join([slashed, dashed]))

This is the case where digits with slashes are used and all four digits are used for the year - 8/4/1998.

NUMERIC = NAMED.format("numeric",
		       SLASH.join([ONE_OR_TWO_DIGITS,
				   ONE_OR_TWO_DIGITS,
				   FOUR_DIGITS]))

This is the case where only month and year are given as digits - 9/2009. There are two expressions, because the day can be one or two digits.

NO_PRECEDING_SLASH = NEGATIVE_LOOKBEHIND.format(SLASH)
NO_PRECEDING_SLASH_DIGIT = NEGATIVE_LOOKBEHIND.format(CLASS.format(SLASH + DIGIT))
NO_ONE_DAY = (NO_PRECEDING_SLASH_DIGIT
	      + ONE_DIGIT
	      + SLASH
	      + FOUR_DIGITS)
NO_TWO_DAYS = (NO_PRECEDING_SLASH
	       + TWO_DIGITS
	       + SLASH
	       + FOUR_DIGITS)
NO_DAY_NUMERIC = NAMED.format("no_day_numeric",
			      NO_ONE_DAY
			      + OR
			      + NO_TWO_DAYS
			      )

This is the case where only a year was given. This is the hardest case, since you don't want to accidentally match the other cases, but the text preceding and following it could be anything. For the look-behind, all the cases have to have the same number of characters so we can't re-use the other expressions

CENTURY = GROUP.format('19' + OR + "20") + TWO_DIGITS
DIGIT_SLASH = DIGIT + SLASH
DIGIT_DASH = DIGIT + DASH
DIGIT_SPACE = DIGIT + SPACE
LETTER_SPACE = CLASS.format(LOWER_CASE) + SPACE
COMMA_SPACE = COMMA + SPACE
YEAR_PREFIX = NEGATIVE_LOOKBEHIND.format(OR.join([
    DIGIT_SLASH,
    DIGIT_DASH,
    DIGIT_SPACE,
    LETTER_SPACE,
    COMMA_SPACE,    
]))

YEAR_ONLY = NAMED.format("year_only",
			 YEAR_PREFIX + CENTURY
)

These are leftovers that don't really match anything.

IN_PREFIX = POSITIVE_LOOKBEHIND.format(CLASS.format('iI') + 'n' + SPACE) + CENTURY
SINCE_PREFIX = POSITIVE_LOOKBEHIND.format(CLASS.format("Ss") + 'ince' + SPACE) + CENTURY
AGE = POSITIVE_LOOKBEHIND.format("Age" + SPACE + TWO_DIGITS + COMMA + SPACE) + CENTURY
AGE_COMMA = POSITIVE_LOOKBEHIND.format("Age" + COMMA + SPACE + TWO_DIGITS + COMMA + SPACE) + CENTURY
OTHERS = ['delivery', "quit", "attempt", "nephrectomy", THREE_DIGITS]
OTHERS = [POSITIVE_LOOKBEHIND.format(label + SPACE) + CENTURY for label in OTHERS]
OTHERS = OR.join(OTHERS)
LEFTOVERS_PREFIX = OR.join([IN_PREFIX, SINCE_PREFIX, AGE, AGE_COMMA]) + OR + OTHERS
LEFTOVERS = NAMED.format("leftovers", LEFTOVERS_PREFIX)

This is the combined expression for all the dates - the one that should be used to extract them from the data.

DATE = NAMED.format("date", OR.join([NUMERIC,
				     TWENTIETH_CENTURY,
				     WORDS,
				     BACKWARDS,
				     CONTRACTION,
				     NO_DAY,
				     NO_DAY_NUMERIC,
				     YEAR_ONLY,
				     LEFTOVERS]))

def twentieth_century(date):
    """adds a 19 to the year

    Args:
     date (re.Regex): Extracted date
    """
    month, day, year = date.group(1).split(SLASH)
    year = "19{}".format(year)
    return SLASH.join([month, day, year])

def take_two(line):
    match = re.search(TWENTIETH_CENTURY, line)
    if match:
	return twentieth_century(match)
    return line

Applying The Grammer

def extract_and_count(expression, data, name):
    """extract all matches and report the count

    Args:
     expression (str): regular expression to match
     data (pandas.Series): data with dates to extratc
     name (str): name of the group for the expression

    Returns:
     tuple (pandas.Series, int): extracted dates, count
    """
    extracted = data.str.extractall(expression)[name]
    count = len(extracted)
    print("'{}' matched {} rows".format(name, count))
    return extracted, count

numeric, numeric_count = extract_and_count(NUMERIC, data, 'numeric')

twentieth, twentieth_count = extract_and_count(TWENTIETH_CENTURY, data, 'twentieth')

words, words_count = extract_and_count(WORDS, data, 'words')

backwards, backwards_count = extract_and_count(BACKWARDS, data, 'backwards')

contraction_data, contraction = extract_and_count(CONTRACTION, data, 'contraction')

no_day, no_day_count = extract_and_count(NO_DAY, data, 'no_day')

no_day_numeric, no_day_numeric_count = extract_and_count(NO_DAY_NUMERIC, data,
							 "no_day_numeric")

year_only, year_only_count = extract_and_count(YEAR_ONLY, data, "year_only")

leftovers, leftovers_count = extract_and_count(LEFTOVERS, data, "leftovers")

found = data.str.extractall(DATE)
total_found = len(found.date)

print("Total Found: {}".format(total_found))
print("Remaining: {}".format(len(data) - total_found))
print("Discrepancy: {}".format(total_found - (numeric_count
					      + twentieth_count
					      + words_count
					      + backwards_count
					      + contraction
					      + no_day_count
					      + no_day_numeric_count
					      + year_only_count
					      + leftovers_count)))

missing = [label for label in data.index if label not in found.index.levels[0]]
try:
    print(missing[0], data.loc[missing[0]])
except IndexError:
    print("all rows matched")

Unifying the Formats

To make it simpler, I'm going to use the mm/dd/yyyy format for the dates. I'm going to use the extracted series to avoid having different clean-up cases contaminating each other - e.g. dealing with 'January' when the day comes first as opposed to when the month comes first.

Helper Functions

Clean

This is a generic function to clean up some data. I was initially using it directly, but for cases where the expression and replacement function are used more than once, there are helper functions to make it easier.

def clean(source, expression, replacement, sample=5):
    """applies the replacement to the source

    as a side-effect shows sample rows before and after

    Args:
     source (pandas.Series): source of the strings
     expression (str): regular expression to match what to replace
     replacement: function or expression to replace the matching expression
     sample (int): number of randomly chosen examples to show

    Returns:
     pandas.Series: the source with the replacement applied to it
    """
    print("Random Sample Before:")
    print(source.sample(sample))
    cleaned = source.str.replace(expression, replacement)
    print("\nRandom Sample After:")
    print(cleaned.sample(sample))
    print("\nCount of cleaned: {}".format(len(cleaned)))
    assert len(source) == len(cleaned)
    return cleaned

Clean Punctuation

def clean_punctuation(source, sample=5):
    """removes punctuation

    Args:
     source (pandas.Series): data to clean
     sample (int): size of sample to show

    Returns:
     pandas.Series: source with punctuation removed
    """
    print("Cleaning Punctuation")
    if any(source.str.contains(PUNCTUATION)):
	source = clean(source, PUNCTUATION, EMPTY_STRING)
    return source

Convert Long Month Names to Three-Letter Names

LONG_TO_SHORT = dict(January="Jan",
		     February="Feb",
		     March="Mar",
		     April="Apr",
		     May="May",
		     June="Jun",
		     July="Jul",
		     August="Aug",
		     September="Sep",
		     October="Oct",
		     November="Nov",
		     December="Dec")

# it turns out there are spelling errors in the data so this has to be fuzzy
LONG_TO_SHORT_EXPRESSION = OR.join([GROUP.format(month)
				    + CLASS.format(LOWER_CASE)
				    + ZERO_OR_MORE
				    for month in LONG_TO_SHORT.values()])

def long_month_to_short(match):
    """convert long month to short

    Args:
     match (re.Match): object matching a long month

    Returns:
     str: shortened version of the month
    """
    return match.group(match.lastindex)

This next function is the one you would actually use to make the conversion.

def convert_long_months_to_short(source, sample=5):
    """convert long month names to short

    Args:
     source (pandas.Series): data with months
     sample (int): size of sample to show

    Returns:
     pandas.Series: data with short months
    """
    return clean(source,
		 LONG_TO_SHORT_EXPRESSION,
		 long_month_to_short)

Add January 1 to year-only dates

def add_month_date(match):
    """adds 01/01 to years

    Args:
     match (re.Match): object that only matched a 4-digit year

    Returns:
     str: 01/01/YYYY
    """
    return "01/01/" + match.group()

And now the function to actually call.

def add_january_one(source):
    """adds /01/01/ to year-only dates

    Args:
     source (pandas.Series): data with the dates

    Returns:
     pandas.Series: years in source with /01/01/ added
    """
    return clean(source, YEAR_ONLY, add_month_date)

Two-Digit Numbers

This makes sure that there are exactly two digits in a number, adding a leading zero if needed.

two_digit_expression = GROUP.format(ONE_OR_TWO_DIGITS) + POSITIVE_LOOKAHEAD.format(SLASH)

def two_digits(match):
    """add a leading zero if needed

    Args:
     match (re.Match): match with one or two digits

    Returns:
     str: the matched string with leading zero if needed
    """
    # for some reason the string-formatting raises an error if it's a string
    # so cast it to an int
    return "{:02}".format(int(match.group()))

This is the function to call for the case where the number is followed by a slash (e.g. 2/).

def clean_two_digits(source, sample=5):
    """makes sure source has two-digits

    Args:
     source (pandas.Series): data with digit followed by slash
     sample (int): number of samples to show

    Returns:
     pandas.Series: source with digits coerced to two digits
    """
    return clean(source, two_digit_expression, two_digits, sample)

This is like clean_two_digits but it doesn't check for the trailing slash. Use this if you have an isolated column of numbers that need to be two-digits.

def clean_two_digits_isolated(source, sample=5):
    """cleans two digits that are standalone

    Args:
     source (pandas.Series): source of the data
     sample (int): number of samples to show

    Returns:
     pandas.Series: converted data
    """
    return clean(source, ONE_OR_TWO_DIGITS, two_digits, sample)

Cleaning Up Months

These clean up and convert written months (e.g. change Aug to 08).

digits = ("{:02}".format(month) for month in range(1, 13))
MONTH_TO_DIGITS = dict(zip(MONTH_PREFIXES, digits))
SHORT_MONTHS_EXPRESSION = OR.join((GROUP.format(month) for month in MONTH_TO_DIGITS))
def month_to_digits(match):
    """converts short month to digits

    Args:
     match (re.Match): object with short-month

    Returns:
     str: month as two-digit number (e.g. Jan -> 01)
    """
    return MONTH_TO_DIGITS[match.group()]

def convert_short_month_to_digits(source, sample=5):
    """converts three-letter months to two-digits

    Args:
     source (pandas.Series): data with three-letter months
     sample (int): number of samples to show

    Returns:
     pandas.Series: source with short-months coverted to digits
    """
    return clean(source,
		 SHORT_MONTHS_EXPRESSION,
		 month_to_digits,
		 sample)

This function runs the previous three and is the main one that should be used. The others can be run individually for troubleshooting, though.

def clean_months(source, sample=5):
    """clean up months (which start as words)

    Args:
     source (pandas.Series): source of the months
     sample (int): number of random samples to show
    """
    cleaned = clean_punctuation(source)

    print("Converting long months to short")
    cleaned = clean(cleaned,
		    LONG_TO_SHORT_EXPRESSION,
		    long_month_to_short, sample)

    print("Converting short months to digits")
    cleaned = clean(cleaned,
		    SHORT_MONTHS_EXPRESSION,
		    month_to_digits, sample)
    return cleaned

Frame To Series

This is for the case where the date-fields were broken up into columns in a data-frame.

def frame_to_series(frame, index_source, samples=5):
    """re-combines data-frame into a series

    Args:
     frame (pandas.DataFrame): frame with month, day, year columns
     index_source (pandas.series): source to copy index from
     samples (index): number of random entries to print when done

    Returns:
     pandas.Series: series with dates as month/day/year
    """
    combined = frame.month + SLASH + frame.day + SLASH + frame.year
    combined.index = index_source.index
    print(combined.sample(samples))
    return combined

Year Only

For the case where there is only a year, I'll add January 1 to the dates.

year_only_cleaned = add_january_one(year_only)

Leftovers

These were the odd cases that didn't seem to have a real pattern. Since I used a negative lookbehind they only have the years in them, like the other year-only cases.

leftovers_cleaned = add_january_one(leftovers)

cleaned = pandas.concat([year_only_cleaned, leftovers_cleaned])
print(len(cleaned))

No Day Numeric

This is for the case where the date is formatted with slashes and there are no day-values. To make the months uniform I'm going to make them all two-digits first.

no_day_numeric_cleaned = clean_two_digits(no_day_numeric)

Now I'll add the day.

no_day_numeric_cleaned = clean(no_day_numeric_cleaned,
			       SLASH,
			       lambda m: "/01/")

And add it to the total.

original = len(cleaned)
cleaned = pandas.concat([cleaned, no_day_numeric_cleaned])
assert len(cleaned) == no_day_numeric_count + original

print(len(cleaned))

No Day

This is for cases like Mar 2011 where no day was given. We're going to assume that it's the first day of the month for each case.

no_day_cleaned = clean_months(no_day)

Now we need to replace the spaces with the days.

no_day_cleaned = clean(no_day_cleaned,
		       SPACE + ONE_OR_MORE,
		       lambda match: "/01/")

Now we can add it to the cleaned.

original = len(cleaned)
cleaned = pandas.concat([cleaned, no_day_cleaned])
print(len(cleaned))

Now to make sure we're where we expect we are.

assert len(cleaned) == no_day_count + original

Contraction

There were no matches for the contraction so I'll ignore it for now.

Backwards

This is the case where the day comes first. The first thing I'll do is split them up.

frame = pandas.DataFrame(backwards.str.split().tolist(),
			 columns="day month year".split())
frame.head()

The next thing to do is to make sure the days all have two digits.

frame.day = clean_two_digits(frame.day)

Next comes the months. This is basically the same problem as with the no day case so I'll re-use some of the code for that.

frame.month = clean_months(frame.month)

Now we need to combine them back together. In hindsight it might have been easier to convert everything into data frames instead of the other way around. Or maybe not. Since we want the indexes from the original data as our final answer I also have to copy the index from the original series

backwards_cleaned = frame_to_series(frame, backwards)

140  0        02/14/1995
177  0        01/18/1990
156  0        05/26/1974
189  0        10/21/1977
160  0        10/21/2007
dtype: object

No it gets added to the combined series.

original = len(cleaned)
cleaned = pandas.concat([cleaned, backwards_cleaned])
assert len(cleaned) == original + backwards_count

print(len(cleaned))

Words

Since working with the data frame was easier than I though it would be I'll do that again.

frame = pandas.DataFrame(words.str.split().tolist(), columns="month day year".split())
print(frame.head())

First we'll clean out the months.

frame.month = clean_months(frame.month)

Now we'll clean up the punctuation for the days.

frame.day = clean_punctuation(frame.day)

So, what do we have so far?

frame.head()

At this point we need to combine everything with a slash and restore the index.

words_cleaned = frame_to_series(frame, words)

Now we'll add it to the total.

original = len(cleaned)
cleaned = pandas.concat([cleaned, words_cleaned])
assert len(cleaned) == original + words_count
print(len(cleaned))

Twentieth Century

We'll do the same trick with creating a dataframe. The first thing, though, is to replace the dashes with slashes.

print(twentieth.iloc[21])
twentieth_cleaned = twentieth.str.replace(DASH, SLASH)
print(cleaned.iloc[21])

Now, we'll create the frame.

frame = pandas.DataFrame(twentieth_cleaned.str.split(SLASH).tolist(),
			 columns=["month", "day", "year"])
print(frame.head())

Months

The months need to be converted to two-digits.

frame.month = clean_two_digits_isolated(frame.month)

As do the days.

frame.day = clean_two_digits_isolated(frame.day)

frame.head()

Now we have to add 19 to each of the years.

frame.year = clean(frame.year, TWO_DIGITS, lambda match: "19" + match.group())

Now we have to join them back up.

twentieth_cleaned = frame_to_series(frame, twentieth)

original = len(cleaned)
cleaned = pandas.concat([cleaned, twentieth_cleaned])

assert len(cleaned) == original + twentieth_count

Numeric

The final category is dates with the format mm/dd/yyyy.

print(numeric.head())

We should check and make sure there are no dashes here.

has_dashes = numeric.str.contains(DASH)
print(numeric[has_dashes])

It looks like it doesn't so we'll skip this check.

frame = pandas.DataFrame(numeric.str.split(SLASH).tolist(),
			 columns="month day year".split())
print(frame.head())

frame.month = clean_two_digits_isolated(frame.month)

frame.day = clean_two_digits_isolated(frame.day)

numeric_cleaned = frame_to_series(frame, numeric)

original = len(cleaned)
cleaned = pandas.concat([cleaned, numeric_cleaned])
assert len(cleaned) == original + numeric_count
print(len(cleaned))

At this point it looks like we've cleaned all the cases.

Re-combining The Cleaned

Because these notebooks can execute things out of order I'm going to create one monolithic concatenation and ignore the one that I was using to keep the running total.

cleaned = pandas.concat([numeric_cleaned,
			 twentieth_cleaned,
			 words_cleaned,
			 backwards_cleaned,
			 no_day_cleaned,
			 no_day_numeric_cleaned,
			 year_only_cleaned,
			 leftovers_cleaned,
])
print(len(cleaned))
print(cleaned.head())
assert len(cleaned) == len(data)

Convert to Datetimes

print(cleaned.head())
datetimes = pandas.to_datetime(cleaned, format="%m/%d/%Y")
print(datetimes.head())

sorted_dates = datetimes.sort_values()
print(sorted_dates.head())

print(sorted_dates.tail())

The grader wants a Series with the indices of the original data put in the order of the sorted dates.

answer = pandas.Series(sorted_dates.index.labels[0])
print(answer.head())

The date_sorter Function

This is the function called by the grader. Since the work was done outside of it we just need to make sure that it returns our answer.

def date_sorter():
    return answer

note: This produced a 94% score, so there are still some cases not correctly handled.