# Date Mean Squared Error

## Table of Contents

## What is this?

This is a short sketch to figure out how to group a bunch of values by month and calculate the Root-Mean-Squared-Error (RMSE) for the mean for the values in that month. This probably isn't the most efficient way to do this, but I'm trying to double check everything as I go and doing the typical Train Wreck like you see in most examples on Stack Overflow.

## Imports

### From PyPi

Just pandas.

```
import pandas
```

## The Data

I'm going to create some simple values so that it's easy(ish) to do the math by hand and double-check what comes out. I'll use the pandas Timestamp for the dates. I'm still not one-hundred percent sure why it's better than date-time, but hopefully it's optimized or something.

```
data = {"date": [
pandas.Timestamp("2018-09-01"),
pandas.Timestamp("2018-09-05"),
pandas.Timestamp("2018-09-05"),
pandas.Timestamp("2018-10-01"),
pandas.Timestamp("2018-10-05"),
],
"value": [1, 2, 3, 1, 2]}
frame = pandas.DataFrame.from_dict(data)
```

I'm going to use pandas' resample method to group the data by months. the `resample`

method expets the data to have the dates as the index, so I'm going to create a new frame by setting the index to the date-column.

```
date_frame = frame.set_index("date")
```

## The Mean

The value I'm going to use to estimate the values for each month is the mean.

```
monthly = date_frame.resample("M")
means = monthly.mean()
print(means)
assert all(means.value == [2.0, 1.5])
```

value date 2018-09-30 2.0 2018-10-31 1.5

## Getting the Mean Back Into the Frame

Now that we have the monthly means, I want to re-add them to the original data-frame by giving them a common column named `year_month`

(using apply) so I can broadcast the means by merging the two data-frames.

```
frame["year_month"] = frame.date.apply(
lambda date: pandas.Timestamp(year=date.year,
month=date.month, day=1))
print(frame.head())
```

date value year_month 0 2018-09-01 1 2018-09-01 1 2018-09-05 2 2018-09-01 2 2018-09-05 3 2018-09-01 3 2018-10-01 1 2018-10-01 4 2018-10-05 2 2018-10-01

```
mean_frame = means.reset_index()
mean_frame["year_month"] = mean_frame.date.apply(
lambda date: pandas.Timestamp(year=date.year,
month=date.month,
day=1))
print(mean_frame)
```

date value year_month 0 2018-09-30 2.0 2018-09-01 1 2018-10-31 1.5 2018-10-01

The `value`

column in the `mean_frame`

is actually the mean of the values for that month so I'll re-name it before I forget.

```
mean_frame.rename(dict(value="mean"), axis="columns",
inplace=True)
print(mean_frame)
```

date mean year_month 0 2018-09-30 2.0 2018-09-01 1 2018-10-31 1.5 2018-10-01

Now I'll merge the two data frames on the `year_month`

column using the default inner-join (intersection) method.

```
merged = frame.merge(mean_frame, on="year_month")
del(merged["date_y"])
merged.rename(dict(date_x="date"), axis="columns", inplace=True)
print(merged)
assert all(merged["mean"] == [2, 2, 2, 1.5, 1.5])
```

date value year_month mean 0 2018-09-01 1 2018-09-01 2.0 1 2018-09-05 2 2018-09-01 2.0 2 2018-09-05 3 2018-09-01 2.0 3 2018-10-01 1 2018-10-01 1.5 4 2018-10-05 2 2018-10-01 1.5

Note that I had to use the `merged["mean"]`

form because the data-frame has a mean method which the dot-notation (`merged.mean`

) would call instead of grabbing the column.

## Calculating the RMSE

### Error

Since I'm estimating the values for each month using the mean the *error* is the difference between the *mean* and each of the *values*.

```
merged["error"] = merged["value"] - merged["mean"]
print(merged)
assert all(merged.error==[-1, 0, 1, -.5, .5])
```

date value year_month mean error 0 2018-09-01 1 2018-09-01 2.0 -1.0 1 2018-09-05 2 2018-09-01 2.0 0.0 2 2018-09-05 3 2018-09-01 2.0 1.0 3 2018-10-01 1 2018-10-01 1.5 -0.5 4 2018-10-05 2 2018-10-01 1.5 0.5

### Error Squared

Now I'll square the error to get rid of the negative error values (which would cancel each other out when we take the mean errors) and to make the effect of the errors non-linear (the errors are exagerrated).

```
merged["error_squared"] = merged.error.pow(2)
```

```
print(merged)
```

date value year_month mean error error_squared 0 2018-09-01 1 2018-09-01 2.0 -1.0 1.00 1 2018-09-05 2 2018-09-01 2.0 0.0 0.00 2 2018-09-05 3 2018-09-01 2.0 1.0 1.00 3 2018-10-01 1 2018-10-01 1.5 -0.5 0.25 4 2018-10-05 2 2018-10-01 1.5 0.5 0.25

### Mean Squared Error

So now we take the mean of our squared errors to get an initial estimate of how much we are off each month.

```
mean_of = merged.groupby("year_month").mean()
print(mean_of.error_squared)
```

year_month 2018-09-01 0.666667 2018-10-01 0.250000 Name: error_squared, dtype: float64

### RMSE

Since the squared error would have units squared, I'll take the root of it to get a more interpretable estimate of the error.

```
print(mean_of.error_squared.pow(.5))
```

year_month 2018-09-01 0.816497 2018-10-01 0.500000 Name: error_squared, dtype: float64