Predicting Which Houses have Electric Vehicles
Posted on Sun 11 December 2016 in Energy
GridCure: Predictive Modeling Challenge
While perusing data scientist job postings, I stumbled across an interesting challenge attached to a Data Scientist posting at GridCure. GridCure roughly "offers simple and customizable solutions to help utilities make sense of their data and implement data-driven change."
Since I'm working as an energy efficiency engineer contracting primarily to electric utilities, I thought my domain expertise might provide some unique insight.
For those interested in exploring the data themselves, you can download the 'Files for Electric Car Practice Problem' here.
This will be a fairly lengthy series of posts. Rather than just framing the problem and stating my conclusions and key assumptions, I want to outline my workflow. I've been self-teaching, so I'm hoping to get some feedback on it.
In this post, I seek to wrap my head around the problem statement with some initial exploration of the data.
Here are the files I have at my disposal:
- EV_files
- Electric Vehicle Detection-1.docx
- EV_test.csv
- EV_train_labels.csv
- EV_train.csv
- sample_submission.csv
Problem Statement: Electric Vehicle Detection
The training set contains two months of smart meter power readings from 1590 houses. The readings were taken at half-hour intervals. Some of the homes have electric vehicles and some do not. The file "EV_train_labels.csv" indicates the time intervals on which an electric vehicle was charging (1 indicates a vehicle was charging at some point during the interval and 0 indicates no vehicle was charging at any point during the interval). Can you determine:
A) Which residences have electric vehicles?
B) When the electric vehicles were charging?
C) Any other interesting aspects of the dataset?
A solution to part B might consist of a prediction of the probability that an electric car was charging for each house and time interval in the test set. Please include code and explain your reasoning. What do you expect the accuracy of your predictions to be?
Part 0: Data Exploration
Let's try and wrap our head around the data we have at our disposal using Pandas and some visualization in matplotlib, starting with EV_train.csv.
import pandas as pd
train = pd.read_csv('/Users/ky/Documents/Workspace/gridcure/EV_files/EV_train.csv')
train.head()
train.shape
As promised, we have smart meter power readings (in kW) from 1590 houses taken at half-hour intervals. At 48 half-hour intervals per day, 2880 measurements means the data spans 60 days. Let's see how much of that data looks good.
total_meas = 1590 * 2880
print("Total number of measurements:", total_meas)
missing_meas = train.isnull().values.sum()
print("Missing measurements:", missing_meas)
print("%.3f%% of measurements missing." % (100*float(missing_meas)/float(total_meas)))
Not bad, less than 0.016% of the data is missing. We don't have a handle on faulty readings or outliers yet, but that will come.
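The total above hard-codes 1590 * 2880. As a minimal sketch, assuming a toy frame with the same layout as EV_train.csv (a House ID column followed by interval columns; the values below are made up), the counts can be derived from the frame itself, along with the number of missing readings per house:

```python
import numpy as np
import pandas as pd

# Toy stand-in for EV_train.csv: 3 houses, 4 intervals (values are made up).
toy = pd.DataFrame({
    "House ID": [101, 102, 103],
    "Interval_1": [0.5, 1.2, np.nan],
    "Interval_2": [0.6, np.nan, 0.9],
    "Interval_3": [0.4, 1.1, 1.0],
    "Interval_4": [0.7, 1.3, np.nan],
})

readings = toy.drop(columns="House ID")            # interval columns only
total_meas = readings.size                         # houses * intervals
missing_meas = int(readings.isnull().sum().sum())  # count of NaN readings
pct_missing = 100 * missing_meas / total_meas

# Missing readings per house, to spot meters with long gaps.
missing_per_house = readings.isnull().sum(axis=1)

print("Total:", total_meas, "Missing:", missing_meas)
print("%.1f%% of measurements missing." % pct_missing)
```

On the real data the same lines would run against train instead of toy; the per-house count is an extra check I'm suggesting, not something computed above.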
Let's sample a few random houses to get an idea of how the data looks.
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np
train.sample(3)
Let's plot the 60 days of data we have for these houses to see what it looks like.
x = np.linspace(1,2880,num=2880) # x values for intervals[1,2...2880]
y1 = train.iloc[996][1:] # interval data for house at index 996
y2 = train.iloc[482][1:]
y3 = train.iloc[610][1:]
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x, y1, c='b', marker='.', label='y1')
ax1.scatter(x, y2, c='r', marker='.', label='y2')
ax1.scatter(x, y3, c='g', marker='.', label='y3')
plt.legend(prop={'size':10})
plt.title('60 days of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,3000)
plt.ylim(0)
plt.show()
Some houses appear to have pretty steady electrical usage, while others exhibit more variance. Remember, we're looking at a 60-day span. It would be interesting to see what these houses do over the course of a day or a week. Is there some periodicity to the data (e.g. get home from work and charge the electric vehicle)? Let's slice the data into a 24-hour period.
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x[:48], y1[:48], c='b', label='y1') # 48 intervals (24 hours) of data
ax1.scatter(x[:48], y2[:48], c='r', label='y2')
ax1.scatter(x[:48], y3[:48], c='g', label='y3')
plt.legend(prop={'size':10})
plt.title('24 hours of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,50)
plt.ylim(0)
plt.show()
Now let's plot the data at a week's resolution.
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x[:48*7], y1[:48*7], c='b', label='y1') # one week of data
ax1.scatter(x[:48*7], y2[:48*7], c='r', label='y2')
ax1.scatter(x[:48*7], y3[:48*7], c='g', label='y3')
plt.legend(prop={'size':10})
plt.title('One week of 30-minute interval power data')
plt.ylabel('Power')
plt.xlabel('Interval #')
plt.xlim(0,350)
plt.ylim(0)
plt.show()
My takeaway: It's not obvious when EVs are charging (maybe these sample houses don't even have EVs). If we had timestamps, maybe we could glean some day-of-week or time-of-day insights, but without any more info we're not going to build a very robust model.
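Even without timestamps, the interval number modulo 48 still gives a position within the day, just with an unknown clock offset. One hedged way to look for daily periodicity is to fold a house's 2880 readings into a 60 x 48 grid and average over days. The sketch below uses a synthetic series (a made-up base load plus a made-up daily bump), not the real data:

```python
import numpy as np

INTERVALS_PER_DAY = 48   # 24 hours of half-hour readings
N_DAYS = 60

# Synthetic series for one house (made up): a flat 1 kW base load plus a
# 3 kW bump during slots 40-43 of every day.
day = np.full(INTERVALS_PER_DAY, 1.0)
day[40:44] += 3.0
series = np.tile(day, N_DAYS)              # 2880 readings

# Fold the series so each row is one day and each column is a slot in the day.
by_day = series.reshape(N_DAYS, INTERVALS_PER_DAY)
daily_profile = by_day.mean(axis=0)        # average load per half-hour slot

peak_slot = int(daily_profile.argmax())    # first slot with the highest mean
print("Peak slot:", peak_slot, "mean load there:", daily_profile[peak_slot])
```

For a real house the series would come from a row of train, and np.nanmean would be the safer choice since a few readings are missing. A recurring bump in the profile would hint at a habitual load, though not necessarily an EV.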
Luckily we have another piece of data, EV_train_labels.csv, which indicates the time intervals on which an electric vehicle was charging (1 indicates a vehicle was charging at some point during the interval and 0 indicates no vehicle was charging at any point during the interval). Let's make sure the data lines up with EV_train.csv.
labels = pd.read_csv('/Users/ky/Documents/Workspace/gridcure/EV_files/EV_train_labels.csv')
train.head()
labels.head()
labels.shape == train.shape
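Matching shapes is necessary but not sufficient: the rows could still be in a different order. A minimal sketch of a stricter check, run here on toy frames with made-up values (the House ID column name follows the files; everything else is illustrative):

```python
import pandas as pd

# Toy frames standing in for train and labels (values are made up).
toy_train = pd.DataFrame({"House ID": [1, 2, 3], "Interval_1": [0.5, 1.2, 0.9]})
toy_labels = pd.DataFrame({"House ID": [1, 2, 3], "Interval_1": [0, 1, 0]})

same_shape = toy_train.shape == toy_labels.shape
same_columns = toy_train.columns.equals(toy_labels.columns)
# Same IDs in the same row order.
same_houses = toy_train["House ID"].equals(toy_labels["House ID"])

# A reordered copy has the same shape but fails the House ID check.
shuffled = toy_labels.iloc[::-1].reset_index(drop=True)
shuffled_matches = toy_train["House ID"].equals(shuffled["House ID"])

print(same_shape, same_columns, same_houses, shuffled_matches)
```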
Both dataframes have the same shape, and the rows and columns line up, lovely! First let's rename all the columns so their names are distinct from those in our training data, which will come in handy if we merge the dataframes (Interval_1 --> IND_Interval_1).
labels.columns = "IND_" + labels.columns
labels.head()
Then let's get an idea of the distribution of our sample...how many households have EVs?
charging_indicators = list(labels) # list of columns
charging_indicators.remove('IND_House ID') # drops House ID so only interval columns remain
labels['Has EV'] = labels[charging_indicators].max(axis=1) # 1 if the household ever charged an EV, else 0
labels.head()
houses_with_EVs = labels['Has EV'].sum()
print("Houses with EVs:", houses_with_EVs)
print("%.1f%% of households in the sample have EVs." % (100*float(houses_with_EVs)/len(labels)))
Let's concatenate the data so that we have 2880 columns of power readings, followed by another 2880 columns indicating (with a 1 or 0) whether or not an EV was charging at any point during the interval.
result = pd.concat([train, labels], axis=1, join="inner")
result.head()
Let's get a little more info about each household before we start visualizing the data.
power_readings = list(train) # list of columns in train
power_readings.remove('House ID') # removes House ID so list only contains interval power data
result['Average Power'] = result[power_readings].mean(axis=1) # average power for each house
result.head()
Let's get a sense of whether having an EV influences average power.
result.groupby('Has EV')['Average Power'].describe() # summary stats grouped by EV (1/0)
The average power draw from a house without an EV is 1.383 kW, while the average power draw from a house with EV(s) is 1.438 kW. That's a 4% increase in power usage.
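The 4% figure is just the relative difference between the two group means. Spelled out with the two values quoted above (the energy-per-day line is an extra illustration of scale, not part of the original analysis):

```python
avg_no_ev = 1.383   # kW, mean for houses without an EV (quoted above)
avg_ev = 1.438      # kW, mean for houses with an EV (quoted above)

# Relative increase of the EV group over the non-EV group.
pct_increase = 100 * (avg_ev - avg_no_ev) / avg_no_ev

# The same gap expressed as energy: kW * 24 hours = kWh per day.
extra_kwh_per_day = (avg_ev - avg_no_ev) * 24

print("%.1f%% increase, about %.2f kWh per day" % (pct_increase, extra_kwh_per_day))
```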
Conclusion
We've done a bit of exploration on the training data set. Here's what we know:
- 30.5% of our sample of 1590 houses have EVs.
- Houses with EVs exhibit a 4% increase in average power.
- Average power of a house over the 60 days ranges from 0.28 kW to 84.9 kW.
  - Min: Someone on vacation for 60 days maybe
  - Max: Multi-unit home perhaps
  - 2015 average house power was 1.23 kW
Next Steps
We'll need to pair each indicator variable (EV charging - 1/0) with the corresponding power data for the interval. I can think of how to do this in Excel...with one house, but not how to generalize the analysis to 1590 houses.
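One possible way to do that pairing for every house at once is to reshape both tables from wide to long and join them on house and interval. This is only a sketch on toy frames with made-up values, and it assumes the label columns still carry their original names (before the IND_ prefix was added):

```python
import pandas as pd

# Toy frames (values made up): 2 houses, 3 intervals each.
toy_train = pd.DataFrame({
    "House ID": [1, 2],
    "Interval_1": [0.5, 1.2], "Interval_2": [3.6, 1.1], "Interval_3": [0.4, 1.3],
})
toy_labels = pd.DataFrame({
    "House ID": [1, 2],
    "Interval_1": [0, 0], "Interval_2": [1, 0], "Interval_3": [0, 0],
})

# Wide -> long: one row per (house, interval).
power_long = toy_train.melt(id_vars="House ID", var_name="Interval", value_name="Power")
label_long = toy_labels.melt(id_vars="House ID", var_name="Interval", value_name="Charging")

# Pair each power reading with its charging indicator.
paired = power_long.merge(label_long, on=["House ID", "Interval"])

print(paired.sort_values(["House ID", "Interval"]))
```

Each row of paired is then one observation (a power reading plus a 1/0 label), which is the shape most modeling tools expect. On the full data this would produce roughly 4.6 million rows, so memory is worth keeping an eye on.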
It's going to involve machine learning...Python's package for this is scikit-learn, which I've never worked with, so I'm going to read up on that and consult with some friends who have more know-how than I.
Hopefully, we can build out a model on the training set and apply it to the test set to determine:
A) Which residences have electric vehicles?
B) When the electric vehicles were charging?
C) Any other interesting aspects of the dataset?
Stay tuned for Part 1...