class: center, middle, inverse, title-slide # Exploring probability distributions for bivariate temporal granularities ###
Sayani Gupta
Sayani07
@SayaniGupta07
https://sayanigupta-iisa2019.netlify.com/
###
International Indian Statistical Association
December 26, 2019

---

# Electricity smart meter technology (~ 40 billion half-hourly observations)

<!-- .pull-left[ -->
<!-- .center-left[ -->

- Source: Department of the Environment and Energy, Australia <br> <br>
- Frequency: Half-hourly (interval meter reading, in kWh) <br> <br>
- Time Span: 2012 to 2014 <br> <br>
- Spread: 14K (approx.) households based in Newcastle, New South Wales, and parts of Sydney <br> <br>

???

Smart meters record electricity usage (in kWh) every 30 minutes and send this information to the electricity retailer for billing.

**Consumers** can save a considerable amount on their electricity bill by

- Switching on their hot water heater or doing laundry when energy is cheaper, or when their solar system is generating surplus energy
- Switching off appliances during peak demand
- Checking usage and comparing with similar homes

**Retailers** can reduce costs and increase efficiency by

- Lowering metering and connection fees
- Drawing insights into when a customer is home, or sleeping, or even what appliances they are using, based on usage figures
- Rewarding customers for mindful usage

Just to give you some perspective, I have this data from the Department of the Environment and Energy, Australia, which provides interval meter readings every 30 minutes from 2012 to 2014. So you can think of the finest temporal unit here as the half hour and the coarsest temporal unit as the year. The data is available for about 14K customers located in different local government areas. It is spread across both time and space, and hence is spatio-temporal data.
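
To make the "finest unit is the half hour, coarsest unit is the year" point concrete, here is a minimal base-R sketch. It is only an illustration, not how gravitas computes anything: the timestamps are made-up values in UTC, and the names `hhour_day` and `year` are my own.

```r
# Four consecutive half-hourly timestamps (illustrative values, UTC assumed)
reading_datetime <- seq(
  as.POSIXct("2012-02-10 08:00:00", tz = "UTC"),
  by = "30 min",
  length.out = 4
)

# Finest unit: half-hour of the day, from 0 (00:00) to 47 (23:30)
hour <- as.integer(format(reading_datetime, "%H"))
minute <- as.integer(format(reading_datetime, "%M"))
hhour_day <- hour * 2L + minute %/% 30L

# Coarsest unit: calendar year
year <- as.integer(format(reading_datetime, "%Y"))

# One row per reading, with both deconstructions side by side
data.frame(reading_datetime, hhour_day, year)
```

Everything between these two extremes (hour, day, week, month) can be pulled out of the same timestamp in the same way, which is what the rest of the talk builds on.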
---
<!-- class: hide-slide-number -->

## Visualize the raw data from 2012 to 2014 for 50 households

<img src="images/smart_allcust.gif" style="display: block; margin: auto;" />

---
<!-- class: hide-slide-number -->

## Visualize the periodicities in half-hourly energy usage for 1 household from 2012 to 2014

<img src="figure/motivation5-1.svg" style="display: block; margin: auto;" />

---

# Exploratory Data Analysis

<img src="images/dino-saurus.gif" style="display: block; margin: auto;" />

### ["Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"](https://www.autodeskresearch.com/publications/samestats)

<!-- -- [Nick Tierney, WOMBAT2019](https://github.com/njtierney/wombat19) -->

---
background-image: url("images/problem.png")
background-position: center
background-size: contain

???

Well, there are numerous ways to analyse this data! But I was interested in answering one question: given this huge volume and spread, how can one explore this data systematically?

---
class: center,middle

## **Problem** : How do we systematically explore large quantities of temporal data across different deconstructions of time (half-hour, day, type of day, year) to find regular patterns or anomalies in behaviour?

## **Solution** : Visualize probability distributions over different time granularities.

???

Exploratory data analysis was developed by **John Tukey** as a way of _*systematically*_ using the tools of statistics on a problem before hypotheses about the data were developed. It encourages breaking the big problem into pieces and focusing on subsets. So the reduced goal that I set for myself is to look at time only and to provide ... . The smart meter example is the one that motivated this problem; however, the idea is to provide the same for any temporal data following a hierarchy. The key terms are deconstructing time and visualizing distributions.
In the next couple of slides, we will talk about the strengths and challenges of each of these.

---
class: middle center

.animated.bounce[
<img src="images/gravitas.png" height=280px>
]

## Visualize probability distributions over different time granularities

---

# Time granularities

### abstractions of time based on the calendar

<br>

.pull-left[
### Arrangement

<br>

<i> **Linear**</i>

- days, weeks, months, years

<br>

<i> **Cyclic** </i>

- <i> **Circular** </i> day-of-week, month-of-year or hour-of-day
- <i> **Quasi-circular** </i> day-of-month, week-of-month
- <i> **Aperiodic** </i> public holidays, school vacations
]

.pull-right[
<!-- # ```{r lineartime, out.width="260%", out.height="300%"} -->
<!-- # ``` -->
<!-- # <br> -->
<!-- # ```{r circulartime, out.width = "100%", out.height="70%"} -->
<!-- # ``` -->
<!-- # -->

### Order

- <i>**Single-order-up**</i> second-of-minute, hour-of-day <br> <br> <br>
- <i>**Multiple-order-up**</i> second-of-hour, hour-of-week

<!-- # ```{r calendar, out.width="100%"} -->
<!-- # -->
<!-- # ``` -->
]

---
class: middle center

## Data Structure for exploration

#### Extension of a tsibble - data abstraction for tidy temporal data

<img src="images/datastructure.png" height="400px">

<!-- # Computation of granularities -->
<!-- `\(z\)` : index of a tsibble -->
<!-- <br> -->
<!-- `\(x\)`, `\(y\)` : two units in the hierarchy table with `\(order(x) < order(y)\)` -->
<!-- <br> -->
<!-- `\(f(x, y)\)` : accessor function for computing the granularity -->
<!-- <br> -->
<!-- `\(c(x, y)\)` : a constant which relates x and y -->
<!-- <br> -->
<!-- #### **Single-order-up** -->
<!-- `$$f(x, y) = \lfloor z/c(z,x) \rfloor\mod c(x,y)$$` where `\(y = x+1\)` -->
<!-- #### **Multiple-order-up** -->
<!-- \begin{split} -->
<!-- f(x,y) & = \sum_{i=0}^{order(y) - order(x) - 1} c(x, x+i)(f(x -->
<!-- +i, x+i+1) - 1)\\ -->
<!-- \end{split} -->

---

## Relationship of cyclic granularities

**Harmonies** : pairs of granularities that aid exploratory data analysis

**Clashes** : pairs leading to structurally empty sets

<img src="images/clash.png" width="100%" style="display: block; margin: auto;" />

---

## Summarising probability distributions

#### Types of statistical distribution plots

<img src="figure/allplot-1.svg" style="display: block; margin: auto;" />

---

# R package: **gravitas**

.center[
### Computation
  ---
.left[
Compute any cyclic granularity? <span style="color:Red">`create_gran()`</span> <br> <br>
Exhaustive list of granularities to explore? <span style="color:Red">`search_gran()`</span> <br>
]
]

.pull-left[
### Interaction
  ---
Check if cyclic granularities are harmonies/clashes? `is.harmony()` <br> <br>
List of harmonies to explore? `harmony()` <br>
]

.pull-right[
### Visualization
  ---
Possible probability distribution plots for harmonies? `prob_plot()` <br> <br>
Sufficient observations? `gran_obs()`

Recommendation on a harmony? `gran_advice()`
]

---
<!-- class: center,middle -->
<!-- # <span style="color:MediumVioletRed"> Package gravitas </span> -->
<!-- ## granularity visualization of time series data -->

.left-column[
## smart meter example
#### - the data
]

.right-column[
```
#> # A tsibble: 1,450,232 x 8 [30m] <UTC>
#> # Key:       customer_id [50]
#>   customer_id reading_datetime    general_supply_…
#>   <chr>       <dttm>                         <dbl>
#> 1 10006414    2012-02-10 08:00:00            0.141
#> 2 10006414    2012-02-10 08:30:00            0.088
#> 3 10006414    2012-02-10 09:00:00            0.078
#> 4 10006414    2012-02-10 09:30:00            0.151
#> # … with 1.45e+06 more rows, and 5 more variables:
#> #   event_key <dbl>, controlled_load_kwh <dbl>,
#> #   gross_generation_kwh <dbl>,
#> #   net_generation_kwh <dbl>, other_kwh <dbl>
```

<small><i>Data source</i></small> : [<small><i>Department of the Environment and Energy, Australia</i></small>](https://data.gov.au/dataset/4e21dea3-9b87-4610-94c7-15a8a77907ef)
]

---

.left-column[
## smart meter example
#### - the data
#### - possible cyclic granularities `search_gran()`
]

.right-column[
```r
smart_meter %>%
*  search_gran(lowest_unit = "hhour",
*              highest_unit = "month",
*              filter_out = c("fortnight",
*                             "hhour"))
```

.pull-left[
<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> hour_day </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> hour_week </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> hour_month </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> day_week </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> day_month </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> week_month </td> </tr> </tbody> </table>
]

.pull-right[
<large> So there are `\(^{6} P_2\)` = 30 ordered pairs of granularities to look at. </large>
]
]

---

.left-column[
## smart meter example
#### - the data
#### - possible cyclic granularities `search_gran()`
#### - set of harmonies `harmony()`
]

.right-column[
```r
smart_meter %>%
*  harmony(ugran = "month", lgran = "hhour",
*          filter_out = c("fortnight", "hhour"))
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> facet_variable </th> <th style="text-align:left;"> x_variable </th> <th style="text-align:right;"> facet_levels </th> <th style="text-align:right;"> x_levels </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 31 </td> <td
style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> hour_week </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 168 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> hour_week </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 168 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> hour_month </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 744 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> day_month </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> day_month </td> <td style="text-align:right;"> 7 </td> <td 
style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> week_month </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> week_month </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table>

### <large> Good news! Only 13 out of 30 are harmonies </large>
]

---

.left-column[
## smart meter example
#### - the data
#### - possible cyclic granularities `search_gran()`
#### - set of harmonies `harmony()`
#### - advice `gran_advice()`
]

.right-column[
```r
smart_meter %>%
*  gran_advice("month_year", "hour_day")
```

```
#> The chosen granularities are harmonies
#>
#> Recommended plots are: quantile
#>
#> Number of observations are homogenous across facets
#>
#> Number of observations are homogenous within facets
#>
#> Cross tabulation of granularities :
#>
#> # A tibble: 24 x 13
#>   hour_day   Jan   Feb   Mar   Apr   May   Jun   Jul
#>   <fct>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0         1151  1095   730   660   682  1008  1054
#> 2 1         1154  1094   730   660   682  1008  1054
#> 3 2         1152  1094   730   660   682  1008  1054
#> 4 3         1150  1094   730   660   682  1008  1054
#> # … with 20 more rows, and 5 more variables:
#> #   Aug <dbl>, Sep <dbl>, Oct <dbl>, Nov <dbl>,
#> #   Dec <dbl>
```

### <large> Quantile plots recommended for the harmony pair (month_year, hour_day) </large>
]

---

.left-column[
## smart meter example
#### - the data
#### - possible cyclic granularities `search_gran()`
#### - set of harmonies `harmony()`
#### - advice `gran_advice()`
#### - visualize harmonies `prob_plot()`
]

.right-column[
```r
smart_meter %>%
*  prob_plot("month_year", "hour_day",
*            plot_type = "quantile",
*            response = "general_supply_kwh",
*            quantile_prob = c(0.05, 0.1, 0.25,
*                              0.5, 0.75, 0.9, 0.95))
```

<img
src="figure/granplotoverlay3-1.svg" style="display: block; margin: auto;" />
]

---

## Another example: Cricket data of Indian Premier League

<small><i>Data source</i></small>: [<small><i>Cricsheet</i></small>](http://cricsheet.org/) , [<small><i>Kaggle</i></small>](https://www.kaggle.com/josephgpinto/ipl-data-analysis/data)

```
#> Observations: 136,598
#> Variables: 38
#> $ season           <dbl> 2008, 2008, 2008, 2008, 200…
#> $ match_id         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ inning           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ over             <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, …
#> $ ball             <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, …
#> $ winner           <chr> "Kolkata Knight Riders", "K…
#> $ total_runs       <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 4, …
#> $ batting_team     <chr> "Kolkata Knight Riders", "K…
#> $ bowling_team     <chr> "Royal Challengers Bangalor…
#> $ batsman          <chr> "SC Ganguly", "BB McCullum"…
#> $ non_striker      <chr> "BB McCullum", "SC Ganguly"…
#> $ bowler           <chr> "P Kumar", "P Kumar", "P Ku…
#> $ is_super_over    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ wide_runs        <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ bye_runs         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ legbye_runs      <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ noball_runs      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ penalty_runs     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ batsman_runs     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, …
#> $ extra_runs       <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, …
#> $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ dismissal_kind   <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ fielder          <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ city             <chr> "Bangalore", "Bangalore", "…
#> $ date             <date> 2008-04-18, 2008-04-18, 20…
#> $ team1            <chr> "Kolkata Knight Riders", "K…
#> $ team2            <chr> "Royal Challengers Bangalor…
#> $ toss_winner      <chr> "Royal Challengers Bangalor…
#> $ toss_decision    <chr> "field", "field", "field", …
#> $ result           <chr> "normal", "normal", "normal…
#> $ dl_applied       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ win_by_runs      <dbl> 140, 140, 140, 140, 140, 14…
#> $ win_by_wickets   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ player_of_match  <chr> "BB McCullum", "BB McCullum…
#> $ venue            <chr> "M Chinnaswamy Stadium", "M…
#> $ umpire1          <chr> "Asad Rauf", "Asad Rauf", "…
#> $ umpire2          <chr> "RE Koertzen", "RE Koertzen…
#> $ umpire3          <lgl> NA, NA, NA, NA, NA, NA, NA,…
```

---

## Difference in strategy between two top teams

<img src="images/cricketex.gif" style="display: block; margin: auto;" />

---
class: center, middle

### Special thanks to

.pull-left[
.portrait[
Di Cook
]

<img src="images/Numbats.png">

NUMBATS, Monash University
]

.pull-right[
.portrait[
Rob J Hyndman
]

### More Information

Package: [gravitas 0.1.0 on CRAN](https://cran.r-project.org/web/packages/gravitas/index.html)

Slides: https://sayanigupta-iisa2019.netlify.com/

Materials: https://github.com/Sayani07/IISA2019

Slides created with <i> Rmarkdown, knitr, xaringan, xaringanthemer</i>
]
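
---

## Appendix: harmonies vs clashes, a base-R sketch

The idea that a clash is a pair of granularities leading to structurally empty sets can be checked by hand with a cross tabulation. The sketch below is only an illustration under simple assumptions: it does not call gravitas, the hourly index `z` is assumed to start at hour 0 of the first day of a week, and the helper `has_empty_cells()` is a hypothetical name of my own. A real harmony check may apply further criteria.

```r
# Hourly index over four full weeks, starting at hour 0 of day 0 (assumption)
z <- 0:(24 * 7 * 4 - 1)

# Cyclic granularities obtained with integer division and modulo
hour_day  <- z %% 24            # 0..23
day_week  <- (z %/% 24) %% 7    # 0..6
hour_week <- z %% (24 * 7)      # 0..167

# Hypothetical helper: TRUE if some combination of levels never occurs
has_empty_cells <- function(facet, x) any(table(facet, x) == 0)

# Every hour of the day occurs on every day of the week: no empty cells
has_empty_cells(day_week, hour_day)   # FALSE

# Hours 0..23 of the week only ever fall on day 0: many empty cells
has_empty_cells(day_week, hour_week)  # TRUE
```

The first pair matches row 1 of the harmony table in the smart meter example, while the second pair does not appear in that table.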