Power Outage Influence Prediction
This is a homework project for EECS 398 at U-M, by Serena Chen & Andy Guo.
Introduction
In this project, we analyze and make predictions based on the “Power Outages” dataset. This dataset provides information on power outages in the U.S. from 2000 to 2016, including details about their timing, location, causes, and effects. Initially, it contained 1,535 rows and 57 columns. After removing redundant columns (`variable` and `OBS`) and merging date and time fields (`OUTAGE.RESTORATION.DATE`, `OUTAGE.RESTORATION.TIME`, `OUTAGE.START.DATE`, and `OUTAGE.START.TIME`), we reduced the dataset to 53 meaningful columns. We then extracted the day and hour from these merged columns, resulting in a total of 55 meaningful columns.
Our work focuses on estimating the duration of a power outage. We consider this a key metric for evaluating outage severity, as it directly influences the resources required for recovery. Understanding which factors are associated with longer outage durations can help local authorities better prepare for future outages and reduce the losses they cause. Although megawatts lost (`DEMAND.LOSS.MW`) could also be informative, that column has more than half of its entries missing, making it less suitable as a prediction target.
At this initial stage, we identified several relevant columns: `MONTH`, `YEAR`, `CLIMATE.REGION`, `CLIMATE.CATEGORY`, `ANOMALY.LEVEL`, `CAUSE.CATEGORY`, `OUTAGE.DURATION`, `DEMAND.LOSS.MW`, and `CUSTOMERS.AFFECTED`. Among these, `CLIMATE.REGION`, `CLIMATE.CATEGORY`, and `CAUSE.CATEGORY` are categorical variables that provide context (e.g., climate conditions like `normal` or `cold`, regions like `east north central`, and causes like `severe weather` or `intentional attack`). The numerical variables include `MONTH` and `YEAR` for timing, `ANOMALY.LEVEL` (ranging from -1.6 to 2.3), `OUTAGE.DURATION` (minutes per outage), `DEMAND.LOSS.MW` (megawatts lost), and `CUSTOMERS.AFFECTED` (number of customers impacted). The relevant dataframe has 1,535 rows and 11 columns in total.
Variable name | Description |
---|---|
MONTH | The month when the outage event occurred |
YEAR | The year when the outage event occurred |
CLIMATE.REGION | U.S. Climate regions as specified by National Centers for Environmental Information (9 Regions) |
CLIMATE.CATEGORY | The climate episodes corresponding to the years |
CAUSE.CATEGORY | Categories of all the events causing the major power outages |
OUTAGE.DURATION | Duration of outage events (in minutes) |
ANOMALY.LEVEL | The oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season |
DEMAND.LOSS.MW | Amount of peak demand lost during an outage event (in Megawatt) |
CUSTOMERS.AFFECTED | Number of customers affected by the power outage event |
start_day | The date of the outage’s start |
start_hour | The hour of the outage’s start |
This selection of columns will guide our investigation into understanding and predicting the duration and impact of power outages.
Data Cleaning and Exploratory Data Analysis
Data Cleaning, Univariate Analysis, and Imputation:
Data Cleaning
Our data cleaning includes the following steps:
1. We identified and removed two columns that lacked meaningful information: `variable` (which contained only NaN values) and `OBS` (which duplicated the dataframe’s indices).
2. We extracted specific information from the `OUTAGE.START.DATE` and `OUTAGE.START.TIME` columns. Specifically:
   - `OUTAGE.START.DATE` was used to create a new column named `start_day`, capturing the date of the outage’s start.
   - `OUTAGE.START.TIME` was used to create a new column named `start_hour`, capturing the hour of the outage’s start.
   - The original date and time columns were dropped to avoid multicollinearity.
3. Then we dropped irrelevant columns, keeping only the features we are interested in (see the sketch after the table below).
The columns of interest include:
Variable Name | Unit |
---|---|
MONTH | Month (1–12) |
YEAR | Year (e.g., 2000–2016) |
CLIMATE.REGION | Categorical (9 regions in the U.S.) |
CLIMATE.CATEGORY | Categorical (“Warm,” “Cold,” or “Normal”) |
CAUSE.CATEGORY | Categorical (e.g., “Severe Weather,” “Intentional Attack”) |
OUTAGE.DURATION | Minutes |
ANOMALY.LEVEL | Oceanic Niño Index (ONI), measured as a 3-month running mean |
DEMAND.LOSS.MW | Megawatts |
CUSTOMERS.AFFECTED | Number of customers |
start_day | Date (1–31) |
start_hour | Hour (0–23) |
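The cleaning steps above could be carried out with pandas along the following lines. This is a minimal sketch: the file path is hypothetical, and the raw column names (`variable`, `OBS`) are taken from the report rather than verified against the raw file.

```python
import pandas as pd

# Load the raw data (the path is hypothetical).
raw = pd.read_excel("outage.xlsx")

# Step 1: drop the two uninformative columns named in the report.
df = raw.drop(columns=["variable", "OBS"])

# Step 2: combine the start date and time, then extract day and hour.
start = pd.to_datetime(
    df["OUTAGE.START.DATE"].astype(str) + " " + df["OUTAGE.START.TIME"].astype(str),
    errors="coerce",
)
df["start_day"] = start.dt.day
df["start_hour"] = start.dt.hour
df = df.drop(columns=["OUTAGE.START.DATE", "OUTAGE.START.TIME"])

# Step 3: keep only the columns of interest.
keep = [
    "MONTH", "YEAR", "CLIMATE.REGION", "CLIMATE.CATEGORY", "ANOMALY.LEVEL",
    "CAUSE.CATEGORY", "OUTAGE.DURATION", "DEMAND.LOSS.MW",
    "CUSTOMERS.AFFECTED", "start_day", "start_hour",
]
df = df[keep]
```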
After data cleaning, the dataframe looks like this:
MONTH | YEAR | CLIMATE.REGION | CLIMATE.CATEGORY | ANOMALY.LEVEL | CAUSE.CATEGORY | OUTAGE.DURATION | DEMAND.LOSS.MW | CUSTOMERS.AFFECTED | start_day | start_hour |
---|---|---|---|---|---|---|---|---|---|---|
7 | 2011 | East North Central | normal | -0.3 | severe weather | 3060 | nan | 70000 | 1 | 17 |
5 | 2014 | East North Central | normal | -0.1 | intentional attack | 1 | nan | nan | 11 | 18 |
10 | 2010 | East North Central | cold | -1.5 | severe weather | 3000 | nan | 70000 | 26 | 20 |
6 | 2012 | East North Central | normal | -0.1 | severe weather | 2550 | nan | 68200 | 19 | 4 |
7 | 2015 | East North Central | warm | 1.2 | severe weather | 1740 | 250 | 250000 | 18 | 2 |
Univariate Analysis and Interesting Aggregates:
- Here are the cause categories recorded in the dataset. We can see that the two major causes of power outages are ‘severe weather’ and ‘intentional attack’.
- Next, we explore the distribution of the `OUTAGE.DURATION` variable. Here is a histogram of power outage duration in hours: about 80% of the power outages ended within 75 hours.
- Here are the numbers of outages for each year and month. We can see a peak of outages in 2011, and the peak month for power outages is June.
- We also want to find out whether there is any geographical pattern in the number of power outages.
- What is the shape of the numerical variables?
Skewness Table for Key Variables
Variable Name | Skewness |
---|---|
MONTH | Neutral |
YEAR | Neutral |
CLIMATE.REGION | Neutral |
CLIMATE.CATEGORY | Neutral |
CAUSE.CATEGORY | Neutral |
OUTAGE.DURATION | Right-skewed |
ANOMALY.LEVEL | Neutral |
DEMAND.LOSS.MW | Right-skewed |
CUSTOMERS.AFFECTED | Right-skewed |
start_day | Neutral |
start_hour | Neutral |
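The skewness labels for the numerical columns could be derived with pandas’ built-in `skew`. A minimal sketch, assuming the cleaned dataframe `df` from above; the cutoff of |skewness| > 1 is our assumption, not a threshold stated in the report:

```python
num_cols = [
    "MONTH", "YEAR", "ANOMALY.LEVEL", "OUTAGE.DURATION",
    "DEMAND.LOSS.MW", "CUSTOMERS.AFFECTED", "start_day", "start_hour",
]

# Sample skewness of each numerical column (NaNs are ignored).
skew = df[num_cols].skew()

# Label each column the same way the table above does.
labels = skew.apply(
    lambda s: "Right-skewed" if s > 1 else ("Left-skewed" if s < -1 else "Neutral")
)
print(labels)
```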
Bivariate Analysis and Interesting Aggregates:
We proceeded with the nine columns identified in the introduction. We separated them into categorical and numerical columns and applied one-hot encoding to the categorical variables. Afterward, we computed correlation coefficients via a correlation matrix and selected the columns most strongly correlated with `OUTAGE.DURATION`.
- Here we examine the relationship between Customers Affected and Outage Duration.
The scatter plot reveals a concentration of data points where outage durations are short (below 20k minutes) and the number of customers affected is relatively small (under 500k). There are fewer instances of prolonged outages (exceeding 60k minutes), and these tend to involve varying numbers of customers, up to a maximum of approximately 3 million.
- Here we examine the relationship between Month and Outage Duration.
Outage durations are scattered across all months, with no clear concentration in any particular month. Most outages last less than 20k minutes, and there are only a few extreme outliers with durations exceeding 80k minutes. Surprisingly, the distribution of outage durations across the months suggests that outages occur consistently throughout the year, without any strong seasonal trend affecting outage duration.
- Here we examine the relationship between Cause of Outage and Mean Outage Duration.
The bar chart highlights that outages caused by weather have the highest mean duration, followed by equipment failure. Other causes such as maintenance, vandalism, and other categories result in significantly shorter outage durations on average.
- What is the correlation between the one-hot encoded variables and `OUTAGE.DURATION`?
- We used a correlation matrix to identify the top 30 features most correlated with `OUTAGE.DURATION` (a potential response variable that we are interested in) after one-hot encoding the categorical variables.
- Here is the correlation matrix of the variables with the top-5 correlation values. We might use this for future feature selection to improve the baseline model.
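A minimal sketch of this correlation-based ranking, assuming the cleaned dataframe `df` from the data cleaning step:

```python
import pandas as pd

# One-hot encode the categorical columns, then rank all features by the
# absolute value of their correlation with OUTAGE.DURATION.
encoded = pd.get_dummies(
    df, columns=["CLIMATE.REGION", "CLIMATE.CATEGORY", "CAUSE.CATEGORY"],
    dtype=float,
)
corr = encoded.corr(numeric_only=True)["OUTAGE.DURATION"].drop("OUTAGE.DURATION")
top30 = corr.abs().sort_values(ascending=False).head(30)
print(top30)
```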
Aggregates & Pivot Tables
- Here is a pivot table of mean outage duration by climate region:
CLIMATE.REGION | Mean OUTAGE.DURATION (minutes) |
---|---|
Central | 2701.13 |
East North Central | 5352.04 |
Northeast | 2991.66 |
Northwest | 1284.5 |
South | 2846.1 |
Southeast | 2217.69 |
Southwest | 1566.14 |
West | 1628.33 |
West North Central | 696.562 |
East North Central appears to have the highest mean `OUTAGE.DURATION` among all regions.
- Here is a pivot table of mean outage duration by climate category:
CLIMATE.CATEGORY | Mean OUTAGE.DURATION (minutes) |
---|---|
cold | 2656.96 |
normal | 2530.98 |
warm | 2817.32 |
There is no obvious pattern relating these two variables.
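Both pivot tables could be produced with pandas’ `pivot_table`; a minimal sketch, again assuming the cleaned dataframe `df`:

```python
# Mean outage duration (in minutes) by climate region and by climate category.
by_region = df.pivot_table(
    index="CLIMATE.REGION", values="OUTAGE.DURATION", aggfunc="mean"
)
by_category = df.pivot_table(
    index="CLIMATE.CATEGORY", values="OUTAGE.DURATION", aggfunc="mean"
)
print(by_region, by_category, sep="\n\n")
```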
Handling Missing Values
We developed an imputation strategy informed by our exploratory analysis:
Categorical Variables
- Variables like `CLIMATE.REGION` and `CLIMATE.CATEGORY` were analyzed using univariate analysis, which showed relatively even distributions across their subcategories (i.e., no single category dominated the data significantly).
- As a result, we applied probabilistic imputation to address missing values in these categorical variables.
Numerical Variables
- For variables such as `MONTH`, `YEAR`, `start_day`, and `start_hour`, probabilistic imputation was used.
- For other numerical variables, we examined their distributions:
  - If a variable was skewed, the median was used for imputation.
  - If a variable was not skewed, the mean was used for imputation.
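A minimal sketch of both imputation rules described above; the helper names and the skewness cutoff of 1 are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def probabilistic_impute(series):
    """Fill NaNs by sampling from the column's observed value distribution."""
    out = series.copy()
    probs = series.value_counts(normalize=True)  # NaNs excluded by default
    n_missing = out.isna().sum()
    out.loc[out.isna()] = rng.choice(probs.index, size=n_missing, p=probs.values)
    return out

def skew_aware_impute(series, cutoff=1.0):
    """Fill NaNs with the median if the column is skewed, else the mean."""
    fill = series.median() if abs(series.skew()) > cutoff else series.mean()
    return series.fillna(fill)

# Example usage on the cleaned dataframe.
df["CLIMATE.REGION"] = probabilistic_impute(df["CLIMATE.REGION"])
df["CUSTOMERS.AFFECTED"] = skew_aware_impute(df["CUSTOMERS.AFFECTED"])
```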
Prediction Problem
We plan to predict the duration of each individual power outage occurrence in 2016, framing this as a regression problem. The response variable for this analysis is `OUTAGE.DURATION`, which measures the total duration of an outage in minutes. We chose this focus because the length of an outage is a critical factor for recovery planning, resource allocation, and minimizing the impact on affected communities, making it a vital metric for assessing outage severity. We also chose it because, in the EDA section, we found that it correlates with many of the variables and has few missing values, which potentially allows us to build a better model.
To evaluate the performance of our model, we will use Mean Squared Error (MSE) as the primary metric. MSE was selected because it penalizes larger errors more heavily, which is important in the context of predicting outage durations where significant underestimations or overestimations could have substantial real-world consequences. By focusing on minimizing MSE, we aim to prioritize accuracy for severe or prolonged outages.
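For reference, the MSE over $n$ predicted durations $\hat{y}_i$ against true durations $y_i$ is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$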
In line with the requirements, we will carefully ensure that all features included in the model are available at the time of prediction, avoiding any data leakage. Notably, we will include the number of customers affected as a predictor: this feature can be reasonably estimated from the population of the outage area, which is typically known or accessible at the time of the outage.
Baseline Model
For this model we have:
1. Features in the Model
- Categorical Variables (Nominal):
- CLIMATE.REGION: Represents the geographical climate region.
- CLIMATE.CATEGORY: Indicates climate episodes such as “Warm,” “Cold,” or “Normal.”
- CAUSE.CATEGORY: Categories of events causing power outages (e.g., “Severe Weather,” “Intentional Attack”).
- Quantitative Variables:
- MONTH: Represents the month of the outage event (1–12).
- YEAR: Represents the year of the outage event (e.g., 2000–2016).
- ANOMALY.LEVEL: Oceanic Niño Index (ONI), indicating unusual weather conditions.
- CUSTOMERS.AFFECTED: The number of customers impacted by the outage, which can be estimated based on the population of the affected area.
- DEMAND.LOSS.MW: Amount of peak demand lost during the outage (in megawatts).
2. Pipeline Steps
The following is the pipeline of our baseline model.
Preprocessing Pipeline (`columntransformer`):
- Categorical variables (`pipeline-1`):
  - FunctionTransformer: Imputes missing values for categorical variables based on their observed probabilities in the training data. This step ensures that missing values are replaced with plausible categories based on the distribution of each feature.
  - OneHotEncoder: Converts categorical variables into binary (one-hot) encoded features. Unknown categories during inference are ignored, and the first category is dropped to avoid multicollinearity.
- Numerical variables (`pipeline-2`):
  - FunctionTransformer: Imputes missing values for numerical variables using either the mean or median, depending on the skewness of the distribution. This ensures robust imputation for both symmetric and skewed distributions.
  - StandardScaler: Standardizes numerical variables by centering them to a mean of 0 and scaling them to a standard deviation of 1. This step is critical for models like Linear Regression that are sensitive to feature scaling.
Linear Regression (`linearregression`):
After imputing the missing values and transforming the variables accordingly, we fit a linear regression model.
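A minimal scikit-learn sketch of this pipeline; the `probabilistic_impute` and `skew_aware_impute` helpers are the ones sketched in the imputation section (both names are our own), applied column by column:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

cat_cols = ["CLIMATE.REGION", "CLIMATE.CATEGORY", "CAUSE.CATEGORY"]
num_cols = ["MONTH", "YEAR", "ANOMALY.LEVEL", "CUSTOMERS.AFFECTED", "DEMAND.LOSS.MW"]

preprocess = ColumnTransformer([
    ("categorical", Pipeline([
        # Probabilistic imputation, applied to each categorical column.
        ("impute", FunctionTransformer(lambda X: X.apply(probabilistic_impute))),
        # Drop the first level and ignore unseen categories at inference time.
        ("onehot", OneHotEncoder(drop="first", handle_unknown="ignore")),
    ]), cat_cols),
    ("numerical", Pipeline([
        # Mean or median imputation depending on each column's skewness.
        ("impute", FunctionTransformer(lambda X: X.apply(skew_aware_impute))),
        ("scale", StandardScaler()),
    ]), num_cols),
])

baseline = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
```

Note that a plain `FunctionTransformer` re-estimates the fill values on whatever data it sees rather than remembering the training distribution, which is a simplification of the behavior described above.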
3. Baseline Model Performance
Mean Squared Error (MSE)
The Mean Squared Error (MSE) of the current model is 14,147,329.13, which corresponds to a root-mean-squared error of roughly 3,760 minutes (about 2.6 days). This indicates that the predicted values deviate significantly from the true values and suggests poor model performance. The high MSE shows that the model is not effectively minimizing the prediction errors.
R-squared (R²)
The R-squared (R²) score is 0.274, meaning that only about 27.4% of the variance in the target variable is explained by the model. This is relatively low and suggests that the model is not capturing the relationship between the features and the target variable effectively.
We plan to try some of the following improvement strategies in our final model tuning stage:
1. Feature Selection
The combination of high MSE and low R² might indicate that irrelevant or noisy features are included in the model. To address this:
- Try LASSO regression or other feature-importance techniques to identify and select the most impactful features.
- Remove or transform irrelevant features that might be introducing noise.
2. Data Cleaning
Data quality is critical to improving the model’s performance. Recommendations include:
- Outliers: Identify and handle outliers in the dataset that might skew results.
- Missing Values: While the `not_na_mask_test` step suggests missing data is already being handled, further cleaning might be required, such as:
  - Imputation with appropriate strategies (mean, median, or predictive methods).
  - Removal of records with excessive missing data, if applicable.
3. Hyperparameter Tuning
- Use Grid Search with cross-validation to find the best hyperparameters for the chosen algorithm.
- Evaluate various settings to balance bias and variance effectively.
4. Cross-Validation
To ensure the model generalizes well to unseen data:
- Implement k-fold cross-validation to evaluate the model’s performance more robustly.
- This will help avoid overfitting or underfitting and provide a better estimate of the model’s true performance.
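As an illustration of points 3 and 4, hyperparameter search and k-fold cross-validation compose naturally in scikit-learn. A minimal sketch, reusing `preprocess` from the baseline sketch above; the alpha grid and the `X_train`/`y_train` split are our assumptions:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Swap the final estimator for Ridge and search its regularization strength,
# scoring each candidate by negative MSE over 5 folds.
search = GridSearchCV(
    Pipeline([("preprocess", preprocess), ("regress", Ridge())]),
    param_grid={"regress__alpha": [0.1, 0.5, 1, 2, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)  # best alpha and its CV MSE
```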
Final Model
When finalizing our model, we adopted a systematic, iterative approach.
Comparison of Larger and Smaller Models
We started by comparing larger models to smaller ones, using the same strategy to develop and optimize each. Since our baseline model was relatively small, we initially explored a larger model. This larger model incorporated all features from the dataset and applied the same imputation strategy. However, this approach resulted in a much higher mean squared error (MSE) compared to our baseline model.
Feature Dropping and Debugging Challenges
To address this, we attempted to drop irrelevant columns after one-hot encoding. While we successfully identified columns to exclude, incorporating a transformer to drop these columns introduced bugs. Interestingly, Yihan had previously implemented a similar column-dropping approach (based on Lasso regression coefficients of 0) in Homework 10, Question 3.4, without any issues. Despite debugging for nearly two hours, we were unable to resolve the errors and decided to pivot toward using a feature selection method instead.
Using this approach, we discovered that selecting the top 10 features yielded the best performance for the larger model—similar to the number of features in our original smaller model. Nevertheless, the MSE remained higher than the baseline.
Interaction Terms and Reverting to the Smaller Model
Next, we experimented with adding interaction terms, guided by correlation coefficients and mutual information scores. Unfortunately, this again led to debugging challenges, prompting us to shift our focus back to the smaller model. This decision was supported by two observations:
- The smaller model consistently achieved lower MSE.
- It was easier to apply meaningful feature engineering to a simpler model.
Improvements to the Smaller Model
We began by incorporating GridSearchCV into the baseline model. Interestingly, this increased the MSE by approximately 12%. We then added polynomial features and applied feature selection, which reduced the MSE back to levels comparable to the baseline. Attempting to reintroduce interaction terms, we encountered the same debugging challenges as with the larger models. This reinforced our conclusion that smaller models were better suited for our task.
Final Feature Selection
We decided to drop specific columns (`MONTH`, `ANOMALY.LEVEL`, and `DEMAND.LOSS.MW`) from the final model. The rationale was:
- These features had low correlation with `OUTAGE.DURATION`.
- `DEMAND.LOSS.MW` represents post-outage data rather than a predictive feature.
Final Model Components
Our final model incorporated the following enhancements:
- Cross-validation: To ensure robust evaluation.
- Ridge regression: As a regularized regression approach.
- Polynomial features: To capture non-linear relationships.
- Feature selection: To reduce dimensionality and focus on the most informative features.
- Feature elimination: Dropping distracting columns (`MONTH`, `ANOMALY.LEVEL`, `DEMAND.LOSS.MW`).
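A minimal sketch of how these components could be assembled. `preprocess_small` is our own name for an assumed variant of the baseline preprocessing without the three dropped columns; the polynomial degree of 2 is an assumption, while `alpha=2` and `k=5` match the tuned values reported below:

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

final_model = Pipeline([
    ("preprocess", preprocess_small),                            # baseline minus dropped columns
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # non-linear terms
    ("select", SelectKBest(f_regression, k=5)),                  # keep the 5 best features
    ("regress", Ridge(alpha=2)),                                 # regularized regression
])
final_model.fit(X_train, y_train)
```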
Results
These adjustments improved the model’s performance by approximately 10% compared to the baseline. The optimal ridge regularization strength was 2, and the best number of numerical features selected (after applying polynomial features and feature elimination) was 5.