.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/data-completeness.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_data-completeness.py:


Missing Data Periods
====================

Identifying days with missing data using a "completeness" score metric.

.. GENERATED FROM PYTHON SOURCE LINES 9-16

Identifying days with missing data and filtering them out reduces noise
in data analysis. This example shows how to use a daily data
"completeness" score to identify and filter out days with missing data.
It uses :py:func:`pvanalytics.quality.gaps.completeness_score`,
:py:func:`pvanalytics.quality.gaps.complete`, and
:py:func:`pvanalytics.quality.gaps.trim_incomplete`.

.. GENERATED FROM PYTHON SOURCE LINES 16-23

.. code-block:: default


    import pvanalytics
    from pvanalytics.quality import gaps
    import matplotlib.pyplot as plt
    import pandas as pd
    import pathlib


.. GENERATED FROM PYTHON SOURCE LINES 24-33

First, we import the AC power data stream that we are going to check for
completeness. The time series is a normalized AC power time series from
the PV Fleets Initiative, and is available via the DuraMAT DataHub:
https://datahub.duramat.org/dataset/inverter-clipping-ml-training-set-real-data.
This data set has a pandas DatetimeIndex, with the min-max normalized AC
power time series in the 'value_normalized' column. The data is sampled
at 15-minute intervals and contains NaN values.

.. GENERATED FROM PYTHON SOURCE LINES 33-39

.. code-block:: default


    # Locate the example CSV that ships with the pvanalytics package.
    pvanalytics_dir = pathlib.Path(pvanalytics.__file__).parent
    file = pvanalytics_dir / 'data' / 'ac_power_inv_2173.csv'
    data = pd.read_csv(file, index_col=0, parse_dates=True)
    # Put the data on a regular 15-minute grid ("15min" is the
    # non-deprecated spelling of the "15T" alias).
    data = data.asfreq("15min")
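The ``asfreq`` call above reindexes the data onto a regular 15-minute
grid, so any timestamp absent from the file becomes an explicit NaN row.
The following is a minimal sketch of that behavior on a small synthetic
series; the timestamps and values are made up for illustration and are
not taken from the example data set.

.. code-block:: python

    import pandas as pd

    # Synthetic 15-minute readings; the 00:30 sample is absent.
    timestamps = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 00:15",
                                 "2022-01-01 00:45", "2022-01-01 01:00"])
    raw = pd.Series([0.0, 0.1, 0.3, 0.4], index=timestamps)

    # Reindex onto a regular 15-minute grid; the absent timestamp
    # appears as a NaN row instead of being skipped.
    regular = raw.asfreq("15min")

    print(len(raw), len(regular))
    print(regular.isna().sum())
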
Now, we use :py:func:`pvanalytics.quality.gaps.completeness_score` to get
the fraction of each day's data that isn't NaN. The score is the number
of non-NaN values relative to the number of values expected over a full
24-hour period, meaning that nighttime values are expected.

.. GENERATED FROM PYTHON SOURCE LINES 44-53

.. code-block:: default


    data_completeness_score = gaps.completeness_score(data['value_normalized'])

    # Visualize data completeness score as a time series.
    data_completeness_score.plot()
    plt.xlabel("Date")
    plt.ylabel("Daily Completeness Score (Fractional)")
    plt.tight_layout()
    plt.show()



.. image-sg:: /generated/gallery/images/sphx_glr_data-completeness_001.png
   :alt: data completeness
   :srcset: /generated/gallery/images/sphx_glr_data-completeness_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 54-56

We mask complete days, based on the daily completeness score, using
:py:func:`pvanalytics.quality.gaps.complete`.

.. GENERATED FROM PYTHON SOURCE LINES 56-73

.. code-block:: default


    min_completeness = 0.333
    daily_completeness_mask = gaps.complete(data['value_normalized'],
                                            minimum_completeness=min_completeness)

    # Mask complete days, based on daily completeness score
    data_completeness_score.plot()
    data_completeness_score.loc[daily_completeness_mask].plot(ls='', marker='.')
    data_completeness_score.loc[~daily_completeness_mask].plot(ls='', marker='.')
    plt.axhline(y=min_completeness, color='r', linestyle='--')
    plt.legend(labels=["Completeness Score", "Threshold met",
                       "Threshold not met", "Completeness Threshold (.33)"],
               loc="upper left")
    plt.xlabel("Date")
    plt.ylabel("Daily Completeness Score (Fractional)")
    plt.tight_layout()
    plt.show()



.. image-sg:: /generated/gallery/images/sphx_glr_data-completeness_002.png
   :alt: data completeness
   :srcset: /generated/gallery/images/sphx_glr_data-completeness_002.png
   :class: sphx-glr-single-img
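As a rough illustration of the scoring and thresholding steps above, the
sketch below computes a per-day fraction of non-NaN samples with plain
pandas and compares it to the threshold. It assumes regularly sampled
data in which every expected timestamp is present as a row, and it uses
synthetic values; it is not the pvanalytics implementation.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Three synthetic days of 15-minute data (96 samples per day).
    index = pd.date_range("2022-01-01", periods=3 * 96, freq="15min")
    power = pd.Series(1.0, index=index)
    power.iloc[96:168] = np.nan    # day 2: 72 of 96 samples missing
    power.iloc[192:216] = np.nan   # day 3: 24 of 96 samples missing

    # Fraction of non-NaN samples per calendar day, repeated for
    # every timestamp that falls on that day.
    present = power.notna().astype(float)
    score = present.groupby(power.index.normalize()).transform("mean")

    # Flag timestamps whose day meets the completeness threshold.
    min_completeness = 0.333
    mask = score >= min_completeness

    print(score.groupby(score.index.normalize()).first())

In this synthetic case the second day scores 0.25 and falls below the
threshold, while the first and third days (1.0 and 0.75) pass.
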
We trim the time series based on the completeness score, requiring at
least 10 consecutive days of data that meet the completeness threshold.
This is done using :py:func:`pvanalytics.quality.gaps.trim_incomplete`.

.. GENERATED FROM PYTHON SOURCE LINES 78-90

.. code-block:: default


    number_consecutive_days = 10
    completeness_trim_mask = gaps.trim_incomplete(data['value_normalized'],
                                                  days=number_consecutive_days)
    # Re-visualize the time series with the data masked by the trim mask
    data[completeness_trim_mask]['value_normalized'].plot()
    data[~completeness_trim_mask]['value_normalized'].plot()
    plt.legend(labels=[True, False],
               title="Daily Data Passing")
    plt.xlabel("Date")
    plt.ylabel("Normalized AC Power")
    plt.tight_layout()
    plt.show()



.. image-sg:: /generated/gallery/images/sphx_glr_data-completeness_003.png
   :alt: data completeness
   :srcset: /generated/gallery/images/sphx_glr_data-completeness_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/pvanalytics/checkouts/v0.1.2/pvanalytics/quality/gaps.py:416: FutureWarning: Indexing a timezone-aware DatetimeIndex with a timezone-naive datetime is deprecated and will raise KeyError in a future version. Use a timezone-aware object instead.
      mask.loc[start.date():end.date()] = True
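The trimming rule described above can be illustrated on a series of
per-day pass/fail flags. The sketch below is only one plausible reading
of that rule: it assumes the kept span runs from the start of the first
run of ``days`` consecutive passing days to the end of the last such
run, so a short failing stretch inside that span is retained. The helper
``trim_mask`` is a hypothetical name, works on one flag per day rather
than on the raw time series, and is not the pvanalytics implementation.

.. code-block:: python

    import pandas as pd

    def trim_mask(daily_ok, days):
        """Mark the span between the first and last run of `days` passing days."""
        # A rolling sum equal to `days` marks the last day of each
        # window that contains only passing days.
        run_end = daily_ok.astype(int).rolling(days).sum() == days
        mask = pd.Series(False, index=daily_ok.index)
        if run_end.any():
            end_positions = run_end.to_numpy().nonzero()[0]
            start_pos = end_positions[0] - days + 1  # first day of first run
            stop_pos = end_positions[-1]             # last day of last run
            mask.iloc[start_pos:stop_pos + 1] = True
        return mask

    # Eight synthetic days: a run of three passing days, one failing
    # day, then a run of two passing days.
    days_index = pd.date_range("2022-01-01", periods=8, freq="D")
    daily_ok = pd.Series(
        [False, True, True, True, False, True, True, False],
        index=days_index,
    )

    trimmed = trim_mask(daily_ok, days=2)
    print(trimmed)

With ``days=2`` both runs qualify, so the single failing day between
them stays inside the kept span; with ``days=3`` only the first run
qualifies and everything after it is trimmed.
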