{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Hampel Outlier Detection\n\nIdentifying outliers in time series using\nHampel outlier detection.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Identifying and removing outliers from PV sensor time series\ndata allows for more accurate data analysis.\nIn this example, we demonstrate how to use\n:py:func:`pvanalytics.quality.outliers.hampel` to identify outliers in a\ntime series so that they can be filtered out.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pvanalytics\nfrom pvanalytics.quality.outliers import hampel\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport pathlib"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we read in the ac_power_inv_7539_outliers example, a normalized time\nseries provided by the PV Fleets Initiative and adapted from the DuraMAT\nDataHub clipping data set:\nhttps://datahub.duramat.org/dataset/inverter-clipping-ml-training-set-real-data\n\nMin-max normalized AC power is stored in the \"value_normalized\" column. The\nboolean column \"outlier\" is True for the outliers that were manually inserted\ninto the data set to illustrate outlier detection, and False for all other\nvalues.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pvanalytics_dir = pathlib.Path(pvanalytics.__file__).parent\nac_power_file_1 = pvanalytics_dir / 'data' / 'ac_power_inv_7539_outliers.csv'\ndata = pd.read_csv(ac_power_file_1, index_col=0, parse_dates=True)\nprint(data.head(10))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We then use :py:func:`pvanalytics.quality.outliers.hampel` to identify\noutliers in the time series, and plot the data with the detected outliers\nmarked using the Hampel outlier mask.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Flag values that deviate strongly from the median of a 10-sample rolling\n# window; the result is a boolean mask where True marks an outlier.\nhampel_outlier_mask = hampel(data=data['value_normalized'],\n                             window=10)\n# Plot the full series, then overlay the detected outliers as markers.\ndata['value_normalized'].plot()\ndata.loc[hampel_outlier_mask, 'value_normalized'].plot(ls='', marker='o')\nplt.legend(labels=[\"AC Power\", \"Detected Outlier\"])\nplt.xlabel(\"Date\")\nplt.ylabel(\"Normalized AC Power\")\nplt.tight_layout()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}