GridCure: Predictive Modeling Challenge¶

In persuing some data scientist jobs, I stumbled across an interesting challenge associated with a Data Scientist job posting at GridCure. GridCure roughly "offers simple and customizable solutions to help utilities make sense of their data and implement data-driven change."

Since I'm working as an energy efficiency engineer contracting primarily to electric utilities, I thought my domain expertise might provide some unique insight.

For those interested in exploring the data themsevles, you can download the 'Files for Electric Car Practrice Problem' here.

This will be a fairly lengthy series of posts. Rather than just framing the problem, stating my conclusions and key assumptions, I wanted to outline my workflow, which, as someone who has been self-teaching, I'm hoping to get some feedback on.

In this post, I seek to wrap my head around the problem statement with some initial exploration of the data.

Here are the files I have at my disposal:

EV_files
1. Electric Vehicle Detection-1.docx
2. EV_test.csv
3. EV_train_labels.csv
4. EV_train.csv
5. sample_submission.csv

Problem Statement: Electric Vehicle Detection¶

The training set contains two months of smart meter power readings from 1590 houses. The readings were taken at half-hour intervals. Some of the homes have electric vehicles and some do not. The file "EV_train_labels.csv" indicates the time intervals on which an electric vehicle was charging (1 indicates a vehicle was charging at some point during the interval and 0 indicates no vehicle was charging at any point during the interval). Can you determine:

A) Which residences have electric vehicles?

B) When the electric vehicles were charging?

C) Any other interesting aspects of the dataset?

A solution to part B might consist of a prediction of the probability that an electric car was charging for each house and time interval in the test set. Please include code and explain your reasoning. What do you expect the accuracy of your predictions to be?

Part 0: Data Exploration¶

Let's try and wrap our head around the data we have at our disposal using Pandas and some visualization in matplotlib, starting with EV_train.csv.

In [50]:

import pandas as pd
train = pd.read_csv('/Users/ky/Documents/Workspace/gridcure/EV_files/EV_train.csv')

In [51]:

train.head()

Out[51]:

	House ID	Interval_1	Interval_2	Interval_3	Interval_4	Interval_5	Interval_6	Interval_7	Interval_8	Interval_9	...	Interval_2871	Interval_2872	Interval_2873	Interval_2874	Interval_2875	Interval_2876	Interval_2877	Interval_2878	Interval_2879	Interval_2880
0	11655099	0.950	0.826	0.361	0.238	0.342	0.233000	0.351000	0.194000	0.292000	...	0.664	0.783	0.601	0.639	0.417	0.439	0.226	0.190	0.710	0.728
1	11633257	0.353	0.327	0.358	0.292	0.285	0.304000	0.361000	0.342000	0.355000	...	0.536	0.558	0.622	0.634	0.513	0.421	0.273	0.296	0.291	0.289
2	11651552	0.150	0.181	0.150	0.150	0.131	0.125000	0.088000	0.106000	0.094000	...	2.125	0.881	0.481	1.194	0.138	0.119	0.038	0.088	0.056	0.113
3	11636092	2.088	2.075	2.121	2.098	2.046	2.081000	1.847000	0.420000	0.399000	...	0.620	0.487	0.563	0.419	0.379	0.359	0.347	0.325	0.330	0.340
4	11647239	1.416	1.250	1.270	1.258	1.239	1.753105	4.609256	4.619256	4.075151	...	1.596	1.667	1.569	1.664	1.580	1.635	1.568	1.565	1.575	1.571

5 rows × 2881 columns

In [52]:

train.shape

Out[52]:

(1590, 2881)

As promised, smart meter power readings (in kW) from 1590 houses taken at half-hour intervals. 2880 measurements at half-hour intervals means the data spans 60 days. Let's see how much of that data looks good.

In [53]:

total_meas = 1590 * 2880
print("Total number of measurements:", total_meas)
missing_meas = train.isnull().values.sum()
print("Missing measurements:", missing_meas)
print("%.3f%% of measurements missing." % (100*float(missing_meas)/float(total_meas)))

('Total number of measurements:', 4579200)
('Missing measurements:', 720)
0.016% of measurements missing.

Not bad, less than 0.016% of the data is missing. We don't have a handle on faulty readings or outliers yet, but that will come.

Lets's sample a few random houses to get an idea how the data looks.

In [54]:

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np

train.sample(3)

Out[54]:

	House ID	Interval_1	Interval_2	Interval_3	Interval_4	Interval_5	Interval_6	Interval_7	Interval_8	Interval_9	...	Interval_2871	Interval_2872	Interval_2873	Interval_2874	Interval_2875	Interval_2876	Interval_2877	Interval_2878	Interval_2879	Interval_2880
183	11653078	1.001	0.995	1.019	0.994	0.998	0.975	1.046	1.019	1.002	...	2.030	2.922	3.619	3.469	2.516	1.239	1.235	1.180	1.088	1.123
667	11651252	1.106	1.006	0.413	0.406	0.425	0.450	0.438	0.381	0.406	...	0.888	0.563	0.338	0.100	0.056	0.075	0.063	0.056	0.050	0.050
389	11628822	0.019	0.018	0.018	0.018	0.017	0.018	0.018	0.018	0.400	...	0.390	0.393	0.393	0.390	0.393	0.026	0.013	0.014	0.013	0.014

3 rows × 2881 columns

Let's plot the 60 days of data we have for these houses to see what it looks like.

In [55]:

x = np.linspace(1,2880,num=2880) # x values for intervals[1,2...2880]
y1 = train.iloc[996][1:] # interval data for house at index 996
y2 = train.iloc[482][1:]
y3 = train.iloc[610][1:]

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(x, y1, c='b', marker='.', label='y1')
ax1.scatter(x, y2, c='r', marker='.', label='y2')
ax1.scatter(x, y3, c='g', marker='.', label='y3')
plt.legend(prop={'size':10})
plt.title('60 days of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,3000)
plt.ylim(0)
plt.show()

Some houses appear to have a pretty steady elecrical usage, while others exhibit more variance. Remeber we're looking over a 60 day interval. It would be interesting to see what these houses do over the course of a day or week. Is their some periodicty to the data (i.e get home from work and change electric vehicle)? Let's slice the data into a 24-hour period.

In [56]:

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(x[:48], y1[:48], c='b', label='y1') # 48 intervals (24 hours) of data
ax1.scatter(x[:48], y2[:48], c='r', label='y2')
ax1.scatter(x[:48], y3[:48], c='g', label='y3')
plt.legend(prop={'size':10})
plt.title('24 hours of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,50)
plt.ylim(0)
plt.show()

Now let's plot the data at a week's resolution.

In [57]:

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(x[:48*7], y1[:48*7], c='b', label='y1') # one week of data
ax1.scatter(x[:48*7], y2[:48*7], c='r', label='y2')
ax1.scatter(x[:48*7], y3[:48*7], c='g', label='y3')
plt.legend(prop={'size':10})
plt.title('One week of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,350)
plt.ylim(0)
plt.show()

My takeaway: It's not obivous when EVs are charging (maybe these sample houses don't even have electric EVs). If we had timestamps, maybe we could glean some day of the week, time of day, insights...but without anymore info we're not going to build a very robust model.

Luckily we have another piece of of data, EV_train_labels.csv, which indicates the time intervals on which an electric vehicle was charging (1 indicates a vehicle was charging at some point during the interval and 0 indicates no vehicle was charging at any point during the interval). Let's make sure the data lines up with EV_train.csv.

In [58]:

labels = pd.read_csv('/Users/ky/Documents/Workspace/gridcure/EV_files/EV_train_labels.csv')

In [59]:

train.head()

Out[59]:

	House ID	Interval_1	Interval_2	Interval_3	Interval_4	Interval_5	Interval_6	Interval_7	Interval_8	Interval_9	...	Interval_2871	Interval_2872	Interval_2873	Interval_2874	Interval_2875	Interval_2876	Interval_2877	Interval_2878	Interval_2879	Interval_2880
0	11655099	0.950	0.826	0.361	0.238	0.342	0.233000	0.351000	0.194000	0.292000	...	0.664	0.783	0.601	0.639	0.417	0.439	0.226	0.190	0.710	0.728
1	11633257	0.353	0.327	0.358	0.292	0.285	0.304000	0.361000	0.342000	0.355000	...	0.536	0.558	0.622	0.634	0.513	0.421	0.273	0.296	0.291	0.289
2	11651552	0.150	0.181	0.150	0.150	0.131	0.125000	0.088000	0.106000	0.094000	...	2.125	0.881	0.481	1.194	0.138	0.119	0.038	0.088	0.056	0.113
3	11636092	2.088	2.075	2.121	2.098	2.046	2.081000	1.847000	0.420000	0.399000	...	0.620	0.487	0.563	0.419	0.379	0.359	0.347	0.325	0.330	0.340
4	11647239	1.416	1.250	1.270	1.258	1.239	1.753105	4.609256	4.619256	4.075151	...	1.596	1.667	1.569	1.664	1.580	1.635	1.568	1.565	1.575	1.571

5 rows × 2881 columns

In [60]:

labels.head()

Out[60]:

	House ID	Interval_6	Interval_7	Interval_8	Interval_9	...
0	11655099	0	0	0	0	...
1	11633257	0	0	0	0	...
2	11651552	0	0	0	0	...
3	11636092	0	0	0	0	...
4	11647239	1	1	1	1	...

5 rows × 2881 columns

In [61]:

labels.shape == train.shape

Out[61]:

True

Both dataframes have the same shape, and the rows and columns line up, lovely! First let's rename all the columns so they have distinct names from our training data, which will come in handy if we merge that dataframes (Interval_1 --> IND_Interval_1).

In [62]:

labels.columns = "IND_" + labels.columns
labels.head()

Out[62]:

	IND_House ID	IND_Interval_6	IND_Interval_7	IND_Interval_8	IND_Interval_9	...
0	11655099	0	0	0	0	...
1	11633257	0	0	0	0	...
2	11651552	0	0	0	0	...
3	11636092	0	0	0	0	...
4	11647239	1	1	1	1	...

5 rows × 2881 columns

Then let's get an idea of the distribution of our sample...how many households have EVs?

In [63]:

charging_indicators = list(labels) # list of columns
charging_indicators.remove('IND_House ID') # drops House ID so you only have intervals data
labels['Has EV'] = labels[charging_indicators].max(axis=1) # returns 1 if househould ever charged an EV.
labels.head()

Out[63]:

	IND_House ID	IND_Interval_6	IND_Interval_7	IND_Interval_8	IND_Interval_9	...	Has EV
0	11655099	0	0	0	0	...	0
1	11633257	0	0	0	0	...	0
2	11651552	0	0	0	0	...	0
3	11636092	0	0	0	0	...	0
4	11647239	1	1	1	1	...	1

5 rows × 2882 columns

In [64]:

houses_with_EVs = labels['Has EV'].sum()
print("Houses with EVs:", houses_with_EVs)

print("%.1f%% of househoulds in the sample have EVs." % (100*float(houses_with_EVs)/len(labels)))

('Houses with EVs:', 485)
30.5% of househoulds in the sample have EVs.

Let's concatenate the data such that we have columns of 2880 power interval readings, followed by another 2880 interval readings indicationg (with a 1 or 0) whether or not an EV vehicle was charging at any point during the interval.

In [65]:

result = pd.concat([train, labels], axis=1, join="inner")

In [66]:

result.head()

Out[66]:

	House ID	Interval_1	Interval_2	Interval_3	Interval_4	Interval_5	Interval_6	Interval_7	Interval_8	Interval_9	...	Has EV
0	11655099	0.950	0.826	0.361	0.238	0.342	0.233000	0.351000	0.194000	0.292000	...	0
1	11633257	0.353	0.327	0.358	0.292	0.285	0.304000	0.361000	0.342000	0.355000	...	0
2	11651552	0.150	0.181	0.150	0.150	0.131	0.125000	0.088000	0.106000	0.094000	...	0
3	11636092	2.088	2.075	2.121	2.098	2.046	2.081000	1.847000	0.420000	0.399000	...	0
4	11647239	1.416	1.250	1.270	1.258	1.239	1.753105	4.609256	4.619256	4.075151	...	1

5 rows × 5763 columns

Let's get a little more info about the each household before we start playing with the data visualizing the data.

In [67]:

power_readings = list(train) # list of columns in train
power_readings.remove('House ID') # removes House ID so list only contains interval power data
result['Average Power'] = result[power_readings].mean(axis=1) # average power for each house

In [68]:

result.head()

Out[68]:

	House ID	Interval_1	Interval_2	Interval_3	Interval_4	Interval_5	Interval_6	Interval_7	Interval_8	Interval_9	...	Has EV	Average Power
0	11655099	0.950	0.826	0.361	0.238	0.342	0.233000	0.351000	0.194000	0.292000	...	0	0.569631
1	11633257	0.353	0.327	0.358	0.292	0.285	0.304000	0.361000	0.342000	0.355000	...	0	0.667629
2	11651552	0.150	0.181	0.150	0.150	0.131	0.125000	0.088000	0.106000	0.094000	...	0	0.659690
3	11636092	2.088	2.075	2.121	2.098	2.046	2.081000	1.847000	0.420000	0.399000	...	0	1.397184
4	11647239	1.416	1.250	1.270	1.258	1.239	1.753105	4.609256	4.619256	4.075151	...	1	1.423121

5 rows × 5764 columns

Let's get a sense if having an EV influences your average power.

In [69]:

result.groupby(['Has EV']).describe()['Average Power'] # summary stats for groubed by EV (1/0)

Out[69]:

Has EV       
0       count    1105.000000
        mean        1.383133
        std         2.556401
        min         0.282908
        25%         0.611511
        50%         0.794672
        75%         1.177832
        max        45.663464
1       count     485.000000
        mean        1.438249
        std         4.073427
        min         0.361415
        25%         0.757703
        50%         0.935592
        75%         1.265925
        max        84.905235
Name: Average Power, dtype: float64

The average power draw from a house without an EV is 1.383 kW, while the average power draw from a house with EV(s) is 1.438 kW. That's a 4% increase in power usage.

Conclusion¶

We've done a bit of exploration on the training data set. Here's what we know:

30.5% of our sample of 1590 houses have EVs.
Houses with EVs exhibit a 4% increase in average power
Average power of a house over the 60 days ranges from 0.28 kW to 84.9 kW
- Min: Someone on vacation for 60 days maybe
- Max: Multi-unit home perhaps
- 2015 average house power was 1.23 kW

Next Steps¶

We'll need to pair each indicator variable (EV charging - 1/0) with the corresponding power data for the interval. I can think how to do this Excel...with one house, but not how to generalize this analysis to 1590 houses.

It's going to involve machine learning...Python's package for this is scikit-learn, which I've never worked with, so I'm going to read up on that and consult with some friends who have more know-how than I.

Hopefully, we can build out a model on the training set, and apply to the test set to determine:

A) Which residences have electric vehicles?

B) When the electric vehicles were charging?

C) Any other interesting aspects of the dataset?

Stay tuned for Part 1...