EDA on Sensor Reliability

As part of my home automation project I have built a number of wireless temperature sensors using the wonderful little Arduino Mini (more coming in the Home Automation section here). These submit temperature, humidity and light level readings every 30 or 60 seconds. However, I have started to notice that they are not the most reliable of things and can often go for extended periods (10 minutes to a couple of hours) without any readings being recorded.

What kind of aspiring data scientist would I be if I didn't want to explore what was going on a little, using data to see if it can help me understand where the issue lies…

Loading the Libraries and Data

In [1]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell

The setting below can really help if you want to look at multiple outputs from a single cell. It does cause some strange behaviour when plotting, however, so we'll turn it off again later on.

In [2]:
InteractiveShell.ast_node_interactivity = "all"

I have created a structured dataset from my raw sensor logs using my Hadoop cluster, reading the output into a pandas dataframe with a function from a previous entry (Combining Pig and Python to explore raw datasets via Pydoop). As this entry focuses on EDA in Python rather than Hadoop, I haven't gone into the details, as they are covered in other entries.

In [3]:
import pydoop.hdfs as hdfs

def read_csv_from_hdfs(path, cols, col_types=None):
    files = hdfs.ls(path)
    pieces = []
    for f in files:
        fhandle = hdfs.open(f)
        pieces.append(pd.read_csv(fhandle, names=cols, dtype=col_types))
        fhandle.close()
    return pd.concat(pieces, ignore_index=True)
In [4]:
TempReadings = read_csv_from_hdfs('/user/olloyd/Hive Outputs/temperature.csv',["Device","datetime","datetimestring","Temperature","secondssincelast"],
                                  {"Device": object, 
                                   "datetime": object,
                                   "datetimestring":object,
                                   "Temperature":float,
                                   "secondssincelast":int
                                  })

As this step can take a few minutes, I save the dataframe to disk so that I can read it again later without touching the cluster.

In [5]:
TempReadings.to_pickle('20170812_TempReadings')
In [6]:
TempReadings = pd.read_pickle('20170812_TempReadings')

Now let's take a quick look at the dataset. This is where the InteractiveShell setting comes in handy; without it these would need to be in three separate cells.

In [7]:
TempReadings.shape
TempReadings.columns
TempReadings.sample(10)
Out[7]:
(1266840, 5)
Out[7]:
Index([u'Device', u'datetime', u'datetimestring', u'Temperature',
       u'secondssincelast'],
      dtype='object')
Out[7]:
Device datetime datetimestring Temperature secondssincelast
1013758 Lounge TempHum_Temperature 2017-05-20 13:49:37 2017/05/20 13:49:37 23.100000 31
748850 Bathroom TempHum_Temperature 2017-07-05 23:01:59 2017/07/05 23:01:59 26.700001 33
421195 Temperature Sensor Lounge_Temperature 2017-06-06 08:24:58 2017/06/06 08:24:58 23.000000 35
159441 Bedroom TempHum_Temperature 2017-06-21 07:03:24 2017/06/21 07:03:24 27.299999 0
916256 Lounge TempHum_Temperature 2017-04-17 08:14:28 2017/04/17 08:14:28 21.100000 32
1200905 Lounge TempHum_Temperature 2017-08-06 04:47:38 2017/08/06 04:47:38 23.799999 0
102705 Bedroom TempHum_Temperature 2017-05-29 20:15:25 2017/05/29 20:15:25 26.600000 0
1054837 Lounge TempHum_Temperature 2017-06-02 04:47:12 2017/06/02 04:47:12 24.100000 0
866877 Bathroom TempHum_Temperature 2017-07-31 19:30:52 2017/07/31 19:30:52 29.700001 33
478820 Temperature Sensor Lounge_Temperature 2017-08-08 21:23:10 2017/08/08 21:23:10 25.200001 34

As you can see, I have a number of different temperature sensors around my house. To start off with we will look at the Bathroom sensor, which seems to be the most problematic (it will be interesting to see if the data supports this!).

Bathroom Sensor

I will create a new dataset called Bathroom looking at just that sensor.

In [8]:
Bathroom=TempReadings.query('Device == "Bathroom TempHum_Temperature"')
In [9]:
Bathroom.shape
Bathroom.sample(10)
Out[9]:
(397777, 5)
Out[9]:
Device datetime datetimestring Temperature secondssincelast
801848 Bathroom TempHum_Temperature 2017-07-19 05:31:24 2017/07/19 05:31:24 24.500000 33
662400 Bathroom TempHum_Temperature 2017-06-15 07:54:51 2017/06/15 07:54:51 24.200001 33
801802 Bathroom TempHum_Temperature 2017-07-19 05:18:46 2017/07/19 05:18:46 24.600000 34
899446 Bathroom TempHum_Temperature 2017-08-11 21:59:52 2017/08/11 21:59:52 23.299999 1
512954 Bathroom TempHum_Temperature 2017-04-16 13:13:01 2017/04/16 13:13:01 21.500000 33
886192 Bathroom TempHum_Temperature 2017-08-05 20:48:29 2017/08/05 20:48:29 25.299999 0
589709 Bathroom TempHum_Temperature 2017-05-11 20:52:22 2017/05/11 20:52:22 23.400000 33
526386 Bathroom TempHum_Temperature 2017-04-19 03:09:27 2017/04/19 03:09:27 19.000000 33
707897 Bathroom TempHum_Temperature 2017-06-26 03:24:25 2017/06/26 03:24:25 22.600000 33
823683 Bathroom TempHum_Temperature 2017-07-23 09:57:39 2017/07/23 09:57:39 22.799999 0

We can see here that the dataset has indeed been filtered to just the Bathroom sensor.

We will use the matplotlib library to produce the charts.

In [10]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

On the surface this doesn't appear to tell us much and looks like a blank graph.

In [11]:
fig1 = plt.hist((Bathroom['secondssincelast']), bins= np.arange(min(Bathroom['secondssincelast'])
   ,max(Bathroom['secondssincelast'])+ 100,100))
In [12]:
Bathroom.describe()
Out[12]:
Temperature secondssincelast
count 397777.000000 397777.000000
mean 23.904850 26.014672
std 3.477373 482.423301
min 16.600000 0.000000
25% 21.500000 0.000000
50% 23.200001 32.000000
75% 25.700001 33.000000
max 38.900002 300222.000000

A couple of surprises here.

The bathroom sensor is set to send a temperature reading every 30 seconds. We expect some gaps longer than this, as not all readings get through (battery died, interference etc.), and the whole point of looking is that I've noticed the sensors are unreliable, with large gaps between readings at times. The surprise, however, is the large number of readings less than 30 seconds apart.

We can also see some large outliers, presumably from when the battery died and it took me a while to notice. That also explains the seemingly blank chart.

Let's limit the x axis of the chart to exclude the outliers and see what we get.

In [13]:
fig2 = plt.hist((Bathroom['secondssincelast']), bins= np.arange(min(Bathroom['secondssincelast'])
   ,max(Bathroom['secondssincelast'])+ 100,100));plt.xlim(0,1000)
Out[13]:
(0, 1000)

It doesn't look quite so blank now. We'll pick this up a little bit later, but for now… those readings coming in less than 30 seconds apart are really intriguing me.

We can explore this in a little more detail by describing a filtered dataset. This introduces a quite powerful concept: combining multiple actions into a single line. We take our dataset, filter it to the records we want, and then perform a function on the result.

In [14]:
Bathroom[Bathroom.secondssincelast < 30].describe()
Out[14]:
Temperature secondssincelast
count 192548.000000 192548.000000
mean 23.896168 0.216221
std 3.475227 0.900133
min 16.600000 0.000000
25% 21.500000 0.000000
50% 23.200001 0.000000
75% 25.700001 0.000000
max 38.900002 29.000000

A very large number of zeros. It could be interesting to see the percentage of zero to non-zero gaps. This can be done by using the .count() function on a filtered dataset. There is a little trick here: convert the numbers to floats before the calculation, else Python will think I want an integer returned. Not very useful for a percentage!

In [15]:
Bathroom.secondssincelast[Bathroom.secondssincelast == 0].count()
Bathroom.secondssincelast.count()
Bathroom.secondssincelast[Bathroom.secondssincelast == 0].count()*1.0/Bathroom.secondssincelast.count()*1.0
Out[15]:
156283
Out[15]:
397777
Out[15]:
0.3928909916862966
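As an aside, the same proportion can be had without the float trick at all: averaging a boolean mask gives the fraction of True values directly. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

gaps = pd.Series([0, 0, 33, 33, 0, 62])  # toy stand-in for secondssincelast

# A boolean Series averages to the share of True values,
# so there is no integer-division gotcha to worry about.
zero_share = (gaps == 0).mean()
print(zero_share)  # 0.5
```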

So 39% of my readings come in at the same time. If I tweak the threshold to less than, say, 2 seconds it becomes 48% - suspiciously close to 50%.

In [16]:
Bathroom.secondssincelast[Bathroom.secondssincelast < 2].count()*1.0/Bathroom.secondssincelast.count()*1.0
Out[16]:
0.48291127943546258

So, to cut a long story short... looking at the C++ code for my sensors and the output logs from Domoticz (the software I use for my home automation), the reason lies in the way Domoticz creates the sensors. My sensors actually return three readings: one each for temperature, humidity and light level. For each reading the sensor sends a separate message to the server. At the server end, the temperature and humidity readings combine into a single sensor in Domoticz. It appears that a message to either one of the components actually triggers an update to the other, hence the doubling up of readings. The updated value is just the same as the previous reading, so when I start looking at temperatures I need to be careful to filter out the duplicated readings.
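The clean-up this implies can be sketched on a toy frame (hypothetical values; the under-2-second threshold mirrors the one used above):

```python
import pandas as pd

# Toy stand-in for the sensor log (hypothetical values).
readings = pd.DataFrame({
    "Temperature":      [22.1, 22.1, 22.3, 22.3, 22.5],
    "secondssincelast": [33,   0,    32,   1,    33],
})

# Treat anything arriving under 2 seconds after the previous message
# as a Domoticz-triggered echo of the same reading and drop it.
deduped = readings[readings["secondssincelast"] >= 2]
print(len(deduped))  # 3 genuine readings remain
```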

Now let's look at the other end of the scale, where the time between readings is more than I would expect.

In [17]:
Bathroom[Bathroom.secondssincelast > 34].describe()
Out[17]:
Temperature secondssincelast
count 27409.000000 27409.000000
mean 24.023791 163.091795
std 3.486359 1831.365698
min 16.600000 35.000000
25% 21.500000 66.000000
50% 23.400000 131.000000
75% 25.900000 132.000000
max 38.700001 300222.000000

So clearly there are outliers - which we've already put down to periods when I didn't notice that the batteries had died. Let's assume that anything greater than 3 hours (10,800 seconds) represents a dead battery rather than unstable readings and remove them... But before we do, let's take a quick look just to be sure. I never like removing data without first checking it out!

In [18]:
Bathroom[Bathroom.secondssincelast > 10800]
Out[18]:
Device datetime datetimestring Temperature secondssincelast
624280 Bathroom TempHum_Temperature 2017-05-29 04:06:56 2017/05/29 04:06:56 23.000000 15977
624284 Bathroom TempHum_Temperature 2017-05-29 07:46:44 2017/05/29 07:46:44 23.200001 13056
630645 Bathroom TempHum_Temperature 2017-06-04 20:12:46 2017/06/04 20:12:46 26.600000 300222

Three readings above this threshold: two on the same day and one about five days later. A possible explanation could be that the battery started dying and the readings got very unstable, and then five days later I replaced the battery. Either way, I'm quite happy to remove these from my dataset.

To speed things up a little I'm going to create a filtered dataset containing just the readings I'm interested in. Here I am creating a filtered dataset and then filtering that dataset again to create my final data. We will see how to combine this into a single statement later on.

In [19]:
BathroomFiltered = Bathroom[Bathroom.secondssincelast < 10800]
BathroomFiltered  = BathroomFiltered[BathroomFiltered.secondssincelast > 2]
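As a preview, those two steps can be collapsed into a single boolean expression with &. A minimal sketch on a hypothetical toy frame, using the same thresholds:

```python
import pandas as pd

# Toy stand-in for the Bathroom dataframe (hypothetical gap values).
Bathroom = pd.DataFrame({"secondssincelast": [0, 1, 33, 33, 15977, 300222]})

# Both conditions combined in one boolean expression with &.
BathroomFiltered = Bathroom[
    (Bathroom.secondssincelast > 2) & (Bathroom.secondssincelast < 10800)
]
print(len(BathroomFiltered))  # 2
```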

Let's take a quick look at what that's given us.

In [20]:
BathroomFiltered.describe()
Out[20]:
Temperature secondssincelast
count 205535.000000 205535.000000
mean 23.913923 48.569241
std 3.477929 93.599093
min 16.600000 3.000000
25% 21.600000 33.000000
50% 23.200001 33.000000
75% 25.700001 33.000000
max 38.900002 8389.000000
In [21]:
fig3 = plt.hist((BathroomFiltered['secondssincelast']), bins= np.arange(min(BathroomFiltered['secondssincelast'])
   ,max(BathroomFiltered['secondssincelast'])+ 100,100))

Let's now look at the lower end of the scale.

In [22]:
fig5 = plt.hist((BathroomFiltered['secondssincelast']), bins= np.arange(min(BathroomFiltered['secondssincelast'])
   ,max(BathroomFiltered['secondssincelast'])+ 10,10));plt.xlim(0,1000)
Out[22]:
(0, 1000)

We can start to see some distinct spikes in the data. Converting the y axis to a log scale could reveal a little bit more of the detail.

In [23]:
fig6 = plt.hist((BathroomFiltered['secondssincelast']), bins= np.arange(min(BathroomFiltered['secondssincelast'])
   ,max(BathroomFiltered['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log')
Out[23]:
(0, 1000)

Conclusion #1

What is this telling us?

Well, we can see distinct spikes at regular intervals rather than an even distribution. This suggests that the underlying timekeeping of the device is not at fault and that the problem is more to do with the delivery of the messages. The spikes suggest that the device is keeping to its 30 (or 33!) second update cycle; it's just that not all of the messages are getting through.

What this doesn't tell us is the nature of those missed messages. Are they evenly spread out over time, suggesting some form of general interference, or does the sensor become much less reliable for certain periods, maybe due to battery life, specific interference from another device, or environmental conditions (i.e. temperature and/or humidity!)?

Analysis over time

Whenever looking at time periods we need to import the datetime library

In [24]:
from datetime import datetime

When imported, the datetime comes in as a string and must be converted to a datetime datatype. To help with the analysis I'm also going to create specific columns representing elements of the date and time: DayofMonth, Month, Week, Hour and Day (the last combining the month and day of month to help with plotting across different weeks).

In [25]:
BathroomFiltered['Date/Time']=pd.to_datetime(BathroomFiltered['datetime'])
BathroomFiltered['DayofMonth'] = BathroomFiltered['Date/Time'].apply(lambda x: x.day)
BathroomFiltered['Month'] = BathroomFiltered['Date/Time'].apply(lambda x: x.month)
BathroomFiltered['Week'] = BathroomFiltered['Date/Time'].apply(lambda x: x.week)
BathroomFiltered['Hour'] = BathroomFiltered['Date/Time'].apply(lambda x: x.hour)
BathroomFiltered['Day'] = BathroomFiltered['Month'] + BathroomFiltered['DayofMonth']/100
BathroomFiltered.sample(10)
Out[25]:
Device datetime datetimestring Temperature secondssincelast Date/Time DayofMonth Month Week Hour Day
696178 Bathroom TempHum_Temperature 2017-06-23 09:54:58 2017/06/23 09:54:58 24.200001 33 2017-06-23 09:54:58 23 6 25 9 6.23
635121 Bathroom TempHum_Temperature 2017-06-06 08:07:31 2017/06/06 08:07:31 21.299999 33 2017-06-06 08:07:31 6 6 23 8 6.06
678005 Bathroom TempHum_Temperature 2017-06-19 08:29:33 2017/06/19 08:29:33 25.900000 33 2017-06-19 08:29:33 19 6 25 8 6.19
609895 Bathroom TempHum_Temperature 2017-05-22 00:40:22 2017/05/22 00:40:22 22.200001 33 2017-05-22 00:40:22 22 5 21 0 5.22
805862 Bathroom TempHum_Temperature 2017-07-20 00:06:39 2017/07/20 00:06:39 24.799999 32 2017-07-20 00:06:39 20 7 29 0 7.20
804365 Bathroom TempHum_Temperature 2017-07-19 17:13:52 2017/07/19 17:13:52 26.700001 33 2017-07-19 17:13:52 19 7 29 17 7.19
890439 Bathroom TempHum_Temperature 2017-08-07 05:23:10 2017/08/07 05:23:10 21.200001 33 2017-08-07 05:23:10 7 8 32 5 8.07
853608 Bathroom TempHum_Temperature 2017-07-29 03:16:10 2017/07/29 03:16:10 21.000000 32 2017-07-29 03:16:10 29 7 30 3 7.29
900563 Bathroom TempHum_Temperature 2017-08-12 03:32:02 2017/08/12 03:32:02 22.400000 32 2017-08-12 03:32:02 12 8 32 3 8.12
582046 Bathroom TempHum_Temperature 2017-05-08 00:04:30 2017/05/08 00:04:30 20.500000 32 2017-05-08 00:04:30 8 5 19 0 5.08
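As an aside, the per-row lambdas above can be replaced with pandas' vectorised .dt accessor, which derives the same columns more idiomatically. A sketch on a toy series (isocalendar() is the modern spelling of the week lookup):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2017-06-23 09:54:58", "2017-07-20 00:06:39"]))

# Same derived fields as the lambda version, but vectorised.
day_of_month = ts.dt.day            # 23, 20
month = ts.dt.month                 # 6, 7
week = ts.dt.isocalendar().week     # 25, 29 (ts.dt.week in older pandas)
hour = ts.dt.hour                   # 9, 0
print(day_of_month.tolist(), list(week))
```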

I can now easily create charts looking at specific time periods.

In [26]:
fig7 = plt.hist((BathroomFiltered.query("Week == 28")['secondssincelast']), bins= np.arange(min(BathroomFiltered['secondssincelast'])
   ,max(BathroomFiltered['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log')
Out[26]:
(0, 1000)

However, to truly compare across time periods we need to see them side by side in a grid rather than individually. Now this is one of my gripes about matplotlib: those familiar with R will be screaming facet_wrap about now… it doesn't exist in matplotlib. We need to do that ourselves, which is a pain (there is a ggplot Python library out there and I'll be taking a look at that in another entry).

To do this we create a grid of plots, group our data, and then loop through the groups, populating each plot as we go and remembering to move to the next row when we've filled all the columns in the current row.

This is where turning off that InteractiveShell setting comes in handy else we create a huge heap of messy arrays before the charts get shown.

In [27]:
InteractiveShell.ast_node_interactivity = "last_expr"
In [28]:
col_num = 5
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  BathroomFiltered.groupby('Week'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 

Here we can see individual histograms showing the time between readings split out by week. I’ve annotated the charts so we can clearly see which week is being shown and I’ve also added the total number of readings for that week to make it easier to spot any differences.

Looking at this, we can see that one week stands out more than the rest: week number 30. In this week the readings seem much more reliable, with the total number of readings close to the maximum possible (assuming a reading approximately every 33 seconds, the total number of readings in a week should be around 18,300).
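That ceiling is simple arithmetic, assuming an uninterrupted 33-second cycle for the whole week:

```python
# Seconds in a week divided by the 33-second update cycle.
seconds_per_week = 7 * 24 * 60 * 60   # 604,800
max_readings = seconds_per_week // 33
print(max_readings)  # 18327
```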

So apart from a small secondary peak showing under 100 readings, week 30 seems pretty much spot on. Why is this week so much better, and what was different?

I think a zoom in around this time period could be interesting...

If we want to filter a dataset for multiple values we can use the 'isin' function. This operates very similarly to the SQL 'IN' operator, for those like me coming from a database background.

In [29]:
BathroomFiltered[BathroomFiltered.Week.isin([29,30])].sample(5)
Out[29]:
Device datetime datetimestring Temperature secondssincelast Date/Time DayofMonth Month Week Hour Day
792652 Bathroom TempHum_Temperature 2017-07-17 07:12:24 2017/07/17 07:12:24 23.700001 33 2017-07-17 07:12:24 17 7 29 7 7.17
811936 Bathroom TempHum_Temperature 2017-07-21 03:59:02 2017/07/21 03:59:02 21.700001 33 2017-07-21 03:59:02 21 7 29 3 7.21
822959 Bathroom TempHum_Temperature 2017-07-23 06:38:42 2017/07/23 06:38:42 21.100000 33 2017-07-23 06:38:42 23 7 29 6 7.23
817603 Bathroom TempHum_Temperature 2017-07-22 06:02:31 2017/07/22 06:02:31 22.799999 33 2017-07-22 06:02:31 22 7 29 6 7.22
817857 Bathroom TempHum_Temperature 2017-07-22 07:12:53 2017/07/22 07:12:53 23.000000 33 2017-07-22 07:12:53 22 7 29 7 7.22

We can now change our grid to group by day, just for specific weeks. I've chosen to change the number of columns to 7 so that each row represents a week of data, and I've added the month and day of month onto the graph. (Note that 7.1 actually means the 10th, not the 1st.)

In [30]:
col_num = 7
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  BathroomFiltered[BathroomFiltered.Week.isin([28,29,30,31])].groupby('Day'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 

So it looks like between the 16th and the 30th of July the readings were much more reliable. There is a noticeable exception on the 17th and 18th - but even on these dates the total reading counts are higher.

What was special about this period… well, it just so happens that very early on the 16th of July we left our house to fly out to Italy for a few weeks. Guess when we returned? Late evening on the 30th!

It could be interesting now to look at how the other sensors behaved during this period.

Other Sensors

We will now revisit the main dataset we imported at the beginning and add the same date time columns that we did for the bathroom dataset.

Then we will create a filtered dataset for the other Sensors and produce similar plots.

In [31]:
TempReadings['Date/Time']=pd.to_datetime(TempReadings['datetime'])
TempReadings['DayofMonth'] = TempReadings['Date/Time'].apply(lambda x: x.day)
TempReadings['Month'] = TempReadings['Date/Time'].apply(lambda x: x.month)
TempReadings['Week'] = TempReadings['Date/Time'].apply(lambda x: x.week)
TempReadings['Hour'] = TempReadings['Date/Time'].apply(lambda x: x.hour)
TempReadings['Day'] = TempReadings['Month'] + TempReadings['DayofMonth']/100
TempReadings.sample(2)
Out[31]:
Device datetime datetimestring Temperature secondssincelast Date/Time DayofMonth Month Week Hour Day
70917 Bedroom TempHum_Temperature 2017-05-16 09:49:31 2017/05/16 09:49:31 25.200001 62 2017-05-16 09:49:31 16 5 20 9 5.16
933892 Lounge TempHum_Temperature 2017-04-20 15:44:01 2017/04/20 15:44:01 21.600000 0 2017-04-20 15:44:01 20 4 16 15 4.20

Lounge

We can apply multiple filters at the same time by using & within the filter.

In [32]:
Lounge=TempReadings[(TempReadings["Device"] == "Lounge TempHum_Temperature")
              & (TempReadings["secondssincelast"] <18000)
              & (TempReadings["secondssincelast"] >2)]
Lounge.describe()
Out[32]:
Temperature secondssincelast DayofMonth Month Week Hour Day
count 158250.000000 158250.000000 158250.000000 158250.000000 158250.000000 158250.000000 158250.000000
mean 24.283980 54.969864 16.781750 5.500392 22.039084 11.141308 5.668209
std 1.595651 124.534765 8.617764 1.114601 4.411913 7.096819 1.076045
min 19.700001 3.000000 1.000000 4.000000 15.000000 0.000000 4.140000
25% 23.200001 32.000000 10.000000 5.000000 18.000000 5.000000 5.020000
50% 24.299999 32.000000 18.000000 6.000000 22.000000 11.000000 6.020000
75% 25.299999 32.000000 24.000000 6.000000 26.000000 17.000000 6.260000
max 29.799999 17163.000000 31.000000 8.000000 32.000000 23.000000 8.120000
In [33]:
col_num = 5
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  Lounge.groupby('Week'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 
In [34]:
col_num = 7
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  Lounge[Lounge.Week.isin([28,29,30,31])].groupby('Day'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 

Notice the lack of data for the 17th, 18th, 19th, 20th and so on. This shows the battery dying whilst we were on holiday; I only noticed on the 2nd of August, when I replaced it. The number of readings is still very low even after that, however, so I don't think the data from this sensor can be included.

Bedroom

Let's look at the last sensor. This one is set to return a reading only every 60 seconds, not every 30, as shown in the stats below.

In [35]:
Bedroom = TempReadings[(TempReadings["Device"] == "Bedroom TempHum_Temperature")
              & (TempReadings["secondssincelast"] <18000)
              & (TempReadings["secondssincelast"] >2)]
Bedroom.describe()
Out[35]:
Temperature secondssincelast DayofMonth Month Week Hour Day
count 151659.000000 151659.000000 151659.000000 151659.000000 151659.000000 151659.000000 151659.000000
mean 25.165810 67.684377 16.170277 5.932375 23.837141 11.428441 6.094077
std 1.645430 53.461854 8.868133 1.201952 4.955950 6.951519 1.174410
min 17.900000 3.000000 1.000000 4.000000 15.000000 0.000000 4.140000
25% 23.700001 62.000000 8.000000 5.000000 20.000000 5.000000 5.160000
50% 25.200001 62.000000 17.000000 6.000000 24.000000 11.000000 6.150000
75% 26.299999 63.000000 24.000000 7.000000 28.000000 17.000000 7.140000
max 30.400000 12250.000000 31.000000 8.000000 32.000000 23.000000 8.120000
In [36]:
col_num = 5
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  Bedroom.groupby('Week'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 
In [37]:
col_num = 7
row_num = 4
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 15))
for name, group in  Bedroom[Bedroom.Week.isin([28,29,30,31])].groupby('Day'):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (900, 1)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 

This shows a similar story to the Bathroom sensor although the stability seems to go on for a little bit longer even after we returned from holiday.

What could be very useful is to compare these two sensors side by side.

To do this I will create a comparison dataset looking at just these two sensors and then replot, this time grouping by both Day and Device. Plotting with two columns means that the two sensors will be shown side by side.

In [38]:
ComparisonData = TempReadings[(TempReadings["Device"].isin(["Bedroom TempHum_Temperature",
                                               "Bathroom TempHum_Temperature"]))
              & (TempReadings["secondssincelast"] <18000)
              & (TempReadings["secondssincelast"] >2)]
ComparisonData[["secondssincelast","Temperature","Device"]].groupby("Device").describe()
Out[38]:
Temperature secondssincelast
Device
Bathroom TempHum_Temperature count 205537.000000 205537.000000
mean 23.913915 48.710023
std 3.477913 104.010922
min 16.600000 3.000000
25% 21.600000 33.000000
50% 23.200001 33.000000
75% 25.700001 33.000000
max 38.900002 15977.000000
Bedroom TempHum_Temperature count 151659.000000 151659.000000
mean 25.165810 67.684377
std 1.645430 53.461854
min 17.900000 3.000000
25% 23.700001 62.000000
50% 25.200001 62.000000
75% 26.299999 63.000000
max 30.400000 12250.000000
In [39]:
col_num = 2
row_num = 28
row, col = 0, 0 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True,figsize=(15, 50))
for name, group in  ComparisonData[ComparisonData.Week.isin([28,29,30,31])].groupby(["Day","Device"]):
    axes[row,col].hist((group['secondssincelast']), bins= np.arange(min(group['secondssincelast'])
   ,max(group['secondssincelast'])+ 10,10));plt.xlim(0,1000);plt.yscale('log');plt.ylim([1,10**5]) 
    axes[row,col].set_ylabel('No. of Readings') # set y label
    axes[row,col].annotate(str(name), xy= (1, 2000)) # write date in corner of each plot
    axes[row,col].annotate("Total readings: " + str(group['secondssincelast'].count().max()), xy= (100, 5000))
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0