Date Mean Squared Error

What is this?

This is a short sketch to figure out how to group a set of values by month and calculate the Root Mean Squared Error (RMSE) when using the mean as the estimate for the values in each month. This probably isn't the most efficient way to do it, but I'm trying to double-check everything as I go rather than writing the typical Train Wreck (one long chain of method calls) like you see in most examples on Stack Overflow (although I'll sketch what the chained version might look like at the end).

Imports

From PyPi

Just pandas.

import pandas

The Data

I'm going to create some simple values so that it's easy(ish) to do the math by hand and double-check what comes out. I'll use the pandas Timestamp for the dates. I'm still not one-hundred percent sure why it's better than Python's built-in datetime, but hopefully it's optimized or something.

data = {"date": [
    pandas.Timestamp("2018-09-01"),
    pandas.Timestamp("2018-09-05"),
    pandas.Timestamp("2018-09-05"),
    pandas.Timestamp("2018-10-01"),
    pandas.Timestamp("2018-10-05"),
                 ],
        "value": [1, 2, 3, 1, 2]}
frame = pandas.DataFrame.from_dict(data)
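
As a quick aside on those Timestamps, here's a minimal check (my own addition, not part of the pipeline) that a Timestamp exposes the usual datetime-style attributes, which is most of what I need from it here.

# peek at the Timestamp interface using the first date above
first = data["date"][0]
print(first.year, first.month, first.day)
assert (first.year, first.month, first.day) == (2018, 9, 1)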

I'm going to use pandas' resample method to group the data by months. The resample method expects the data to have the dates as the index, so I'm going to create a new frame by setting the index to the date column.

date_frame = frame.set_index("date")

The Mean

The estimate I'm going to use for the values in each month is their mean.

monthly = date_frame.resample("M")
means = monthly.mean()
print(means)
assert all(means.value == [2.0, 1.5])
            value
date             
2018-09-30    2.0
2018-10-31    1.5
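
Just to double-check the arithmetic by hand: September's values are 1, 2, and 3, so its mean is (1 + 2 + 3)/3 = 2, and October's are 1 and 2, so its mean is (1 + 2)/2 = 1.5, which matches the output.

# hand-calculated means as a sanity check
assert means.value.iloc[0] == (1 + 2 + 3)/3
assert means.value.iloc[1] == (1 + 2)/2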

Getting the Mean Back Into the Frame

Now that we have the monthly means, I want to get them back into the original data-frame. To do that I'll give both frames a common column named year_month (using apply) so I can broadcast the means by merging the two data-frames on it.

frame["year_month"] = frame.date.apply(
    lambda date: pandas.Timestamp(year=date.year,
                                  month=date.month, day=1))
print(frame.head())
        date  value year_month
0 2018-09-01      1 2018-09-01
1 2018-09-05      2 2018-09-01
2 2018-09-05      3 2018-09-01
3 2018-10-01      1 2018-10-01
4 2018-10-05      2 2018-10-01
mean_frame = means.reset_index()
mean_frame["year_month"] = mean_frame.date.apply(
    lambda date: pandas.Timestamp(year=date.year,
                                  month=date.month,
                                  day=1))
print(mean_frame)
        date  value year_month
0 2018-09-30    2.0 2018-09-01
1 2018-10-31    1.5 2018-10-01
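
As an aside, I think the same year_month column could be built without apply by round-tripping through a monthly Period with the dt accessor. This is just a sketch of the idea (an assumption on my part, so I'm sticking with the apply version above).

# possible alternative: snap each date to the first of its month via a monthly Period
alternative = frame.date.dt.to_period("M").dt.to_timestamp()
assert all(alternative == frame.year_month)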

The value column in the mean_frame is actually the mean of the values for that month, so I'll rename it before I forget.

mean_frame.rename(dict(value="mean"), axis="columns",
                  inplace=True)
print(mean_frame)
        date  mean year_month
0 2018-09-30   2.0 2018-09-01
1 2018-10-31   1.5 2018-10-01

Now I'll merge the two data frames on the year_month column using the default inner-join (intersection) method.

merged = frame.merge(mean_frame, on="year_month")
del(merged["date_y"])
merged.rename(dict(date_x="date"), axis="columns", inplace=True)
print(merged)
assert all(merged["mean"] == [2, 2, 2, 1.5, 1.5])
        date  value year_month  mean
0 2018-09-01      1 2018-09-01   2.0
1 2018-09-05      2 2018-09-01   2.0
2 2018-09-05      3 2018-09-01   2.0
3 2018-10-01      1 2018-10-01   1.5
4 2018-10-05      2 2018-10-01   1.5

Note that I had to use the merged["mean"] form because the data-frame has a mean method, which is what the dot-notation (merged.mean) would give you instead of the column.
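
Here's a minimal illustration of that gotcha (my own check, not part of the original walkthrough).

# dot-notation gives you the DataFrame's mean method, bracket-access gives the column
print(type(merged.mean))     # a bound method
print(type(merged["mean"]))  # a pandas Series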

Calculating the RMSE

Error

Since I'm estimating the values for each month using the mean, the error is the difference between each value and the mean for its month.

merged["error"] = merged["value"] - merged["mean"]
print(merged)
assert all(merged.error==[-1, 0, 1, -.5, .5])
        date  value year_month  mean  error
0 2018-09-01      1 2018-09-01   2.0   -1.0
1 2018-09-05      2 2018-09-01   2.0    0.0
2 2018-09-05      3 2018-09-01   2.0    1.0
3 2018-10-01      1 2018-10-01   1.5   -0.5
4 2018-10-05      2 2018-10-01   1.5    0.5

Error Squared

Now I'll square the error to get rid of the negative error values (which would otherwise cancel out the positive ones when we take the mean) and to make the effect of the errors non-linear (larger errors are exaggerated).

merged["error_squared"] = merged.error.pow(2)
print(merged)
        date  value year_month  mean  error  error_squared
0 2018-09-01      1 2018-09-01   2.0   -1.0           1.00
1 2018-09-05      2 2018-09-01   2.0    0.0           0.00
2 2018-09-05      3 2018-09-01   2.0    1.0           1.00
3 2018-10-01      1 2018-10-01   1.5   -0.5           0.25
4 2018-10-05      2 2018-10-01   1.5    0.5           0.25

Mean Squared Error

So now we take the mean of the squared errors within each month to get the Mean Squared Error (MSE), an initial estimate of how far off we are each month.

mean_of = merged.groupby("year_month").mean()
print(mean_of.error_squared)
year_month
2018-09-01    0.666667
2018-10-01    0.250000
Name: error_squared, dtype: float64
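
Checking the arithmetic by hand again: September's squared errors are 1, 0, and 1, so their mean is 2/3 (about 0.667), and October's are 0.25 and 0.25, so their mean is 0.25, which matches the output.

# hand-calculated mean squared errors as a sanity check
assert abs(mean_of.error_squared.iloc[0] - (1 + 0 + 1)/3) < 1e-6
assert mean_of.error_squared.iloc[1] == (0.25 + 0.25)/2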

RMSE

Since the squared error is in squared units, I'll take the square root to get a more interpretable estimate of the error.

print(mean_of.error_squared.pow(.5))
year_month
2018-09-01    0.816497
2018-10-01    0.500000
Name: error_squared, dtype: float64
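
And, for contrast with the step-by-step version above, here's a sketch of what I think the Train Wreck (one long chain) version would look like, using groupby and transform to broadcast the monthly means. This is just my guess at the chained form; the assert at the end checks that it matches the step-by-step result.

# the whole calculation as one chain:
#  - broadcast each month's mean back onto the rows with transform
#  - subtract to get the errors, square them
#  - average the squared errors per month and take the square root
rmse = (
    (frame.value - frame.groupby("year_month")["value"].transform("mean"))
    .pow(2)
    .groupby(frame.year_month)
    .mean()
    .pow(0.5)
)
assert all(abs(rmse.values - mean_of.error_squared.pow(.5).values) < 1e-6)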