Date Mean Squared Error
Table of Contents
What is this?
This is a short sketch to figure out how to group a bunch of values by month and calculate the Root-Mean-Squared-Error (RMSE) for the mean for the values in that month. This probably isn't the most efficient way to do this, but I'm trying to double check everything as I go and doing the typical Train Wreck like you see in most examples on Stack Overflow.
Imports
From PyPi
Just pandas.
import pandas
The Data
I'm going to create some simple values so that it's easy(ish) to do the math by hand and double-check what comes out. I'll use the pandas Timestamp for the dates. I'm still not one-hundred percent sure why it's better than date-time, but hopefully it's optimized or something.
data = {"date": [
pandas.Timestamp("2018-09-01"),
pandas.Timestamp("2018-09-05"),
pandas.Timestamp("2018-09-05"),
pandas.Timestamp("2018-10-01"),
pandas.Timestamp("2018-10-05"),
],
"value": [1, 2, 3, 1, 2]}
frame = pandas.DataFrame.from_dict(data)
I'm going to use pandas' resample method to group the data by months. the resample
method expets the data to have the dates as the index, so I'm going to create a new frame by setting the index to the date-column.
date_frame = frame.set_index("date")
The Mean
The value I'm going to use to estimate the values for each month is the mean.
monthly = date_frame.resample("M")
means = monthly.mean()
print(means)
assert all(means.value == [2.0, 1.5])
value date 2018-09-30 2.0 2018-10-31 1.5
Getting the Mean Back Into the Frame
Now that we have the monthly means, I want to re-add them to the original data-frame by giving them a common column named year_month
(using apply) so I can broadcast the means by merging the two data-frames.
frame["year_month"] = frame.date.apply(
lambda date: pandas.Timestamp(year=date.year,
month=date.month, day=1))
print(frame.head())
date value year_month 0 2018-09-01 1 2018-09-01 1 2018-09-05 2 2018-09-01 2 2018-09-05 3 2018-09-01 3 2018-10-01 1 2018-10-01 4 2018-10-05 2 2018-10-01
mean_frame = means.reset_index()
mean_frame["year_month"] = mean_frame.date.apply(
lambda date: pandas.Timestamp(year=date.year,
month=date.month,
day=1))
print(mean_frame)
date value year_month 0 2018-09-30 2.0 2018-09-01 1 2018-10-31 1.5 2018-10-01
The value
column in the mean_frame
is actually the mean of the values for that month so I'll re-name it before I forget.
mean_frame.rename(dict(value="mean"), axis="columns",
inplace=True)
print(mean_frame)
date mean year_month 0 2018-09-30 2.0 2018-09-01 1 2018-10-31 1.5 2018-10-01
Now I'll merge the two data frames on the year_month
column using the default inner-join (intersection) method.
merged = frame.merge(mean_frame, on="year_month")
del(merged["date_y"])
merged.rename(dict(date_x="date"), axis="columns", inplace=True)
print(merged)
assert all(merged["mean"] == [2, 2, 2, 1.5, 1.5])
date value year_month mean 0 2018-09-01 1 2018-09-01 2.0 1 2018-09-05 2 2018-09-01 2.0 2 2018-09-05 3 2018-09-01 2.0 3 2018-10-01 1 2018-10-01 1.5 4 2018-10-05 2 2018-10-01 1.5
Note that I had to use the merged["mean"]
form because the data-frame has a mean method which the dot-notation (merged.mean
) would call instead of grabbing the column.
Calculating the RMSE
Error
Since I'm estimating the values for each month using the mean the error is the difference between the mean and each of the values.
merged["error"] = merged["value"] - merged["mean"]
print(merged)
assert all(merged.error==[-1, 0, 1, -.5, .5])
date value year_month mean error 0 2018-09-01 1 2018-09-01 2.0 -1.0 1 2018-09-05 2 2018-09-01 2.0 0.0 2 2018-09-05 3 2018-09-01 2.0 1.0 3 2018-10-01 1 2018-10-01 1.5 -0.5 4 2018-10-05 2 2018-10-01 1.5 0.5
Error Squared
Now I'll square the error to get rid of the negative error values (which would cancel each other out when we take the mean errors) and to make the effect of the errors non-linear (the errors are exagerrated).
merged["error_squared"] = merged.error.pow(2)
print(merged)
date value year_month mean error error_squared 0 2018-09-01 1 2018-09-01 2.0 -1.0 1.00 1 2018-09-05 2 2018-09-01 2.0 0.0 0.00 2 2018-09-05 3 2018-09-01 2.0 1.0 1.00 3 2018-10-01 1 2018-10-01 1.5 -0.5 0.25 4 2018-10-05 2 2018-10-01 1.5 0.5 0.25
Mean Squared Error
So now we take the mean of our squared errors to get an initial estimate of how much we are off each month.
mean_of = merged.groupby("year_month").mean()
print(mean_of.error_squared)
year_month 2018-09-01 0.666667 2018-10-01 0.250000 Name: error_squared, dtype: float64
RMSE
Since the squared error would have units squared, I'll take the root of it to get a more interpretable estimate of the error.
print(mean_of.error_squared.pow(.5))
year_month 2018-09-01 0.816497 2018-10-01 0.500000 Name: error_squared, dtype: float64