{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Tukey Outlier Detection\n\nIdentifying outliers in time series using\nTukey outlier detection.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Identifying and removing outliers from PV sensor time series\ndata allows for more accurate data analysis.\nIn this example, we demonstrate how to use\n:py:func:`pvanalytics.quality.outliers.tukey` to identify and filter\nout outliers in a time series.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pvanalytics\nfrom pvanalytics.quality.outliers import tukey\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport pathlib"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we read in the ac_power_inv_7539_outliers example. Min-max normalized\nAC power is represented by the \"value_normalized\" column. There is a boolean\ncolumn \"outlier\" where inserted outliers are labeled as True, and all other\nvalues are labeled as False. These outlier values were inserted manually into\nthe data set to illustrate outlier detection by each of the functions.\nWe use a normalized time series example provided by the PV Fleets Initiative.\nThis example is adapted from the DuraMAT DataHub\nclipping data set:\nhttps://datahub.duramat.org/dataset/inverter-clipping-ml-training-set-real-data\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pvanalytics_dir = pathlib.Path(pvanalytics.__file__).parent\nac_power_file_1 = pvanalytics_dir / 'data' / 'ac_power_inv_7539_outliers.csv'\ndata = pd.read_csv(ac_power_file_1, index_col=0, parse_dates=True)\nprint(data.head(10))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We then use :py:func:`pvanalytics.quality.outliers.tukey` to identify\noutliers in the time series, and plot the data with the tukey outlier mask.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tukey_outlier_mask = tukey(data=data['value_normalized'],\n                           k=0.5)\ndata['value_normalized'].plot()\ndata.loc[tukey_outlier_mask, 'value_normalized'].plot(ls='', marker='o')\nplt.legend(labels=[\"AC Power\", \"Detected Outlier\"])\nplt.xlabel(\"Date\")\nplt.ylabel(\"Normalized AC Power\")\nplt.tight_layout()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}