Implement a Battery Management System (BMS) using Reinforcement Learning (RL) to manage energy flow in a Renewable Energy Community (REC) comprising:
- Photovoltaic (PV) Generator
- Residential Load
- Battery Storage System
- Electricity Price Model
The main objectives are:
- Increase Self-Consumption: Maximize the utilization of locally generated PV energy to meet residential load demands.
- Generate Profit: Sell surplus energy back to the grid during advantageous periods.
- Maintain Battery Health: Ensure the battery operates within safe State of Charge (SoC) limits.
- Adhere to Operational Constraints: Enforce physical and operational constraints in the energy management process.
The environment is characterized by the following quantities:
- PV Generation ($P^G_t$):
  - Description: Energy generated by the PV system at time $t$.
  - Unit: Kilowatts (kW)
  - Characteristics: Continuous variable.
  - Data Source: Historical data reflecting realistic PV generation patterns.
- Load Demand ($P^L_t$):
  - Description: Energy consumption of the residential load at time $t$.
  - Unit: Kilowatts (kW)
  - Characteristics: Continuous variable.
  - Data Source: Historical data reflecting realistic load demand patterns.
- State of Charge (SoC):
  - Description: Current energy level in the battery as a percentage of its capacity.
  - Constraints: $$ \text{SoC}_{\text{min}} \leq \text{SoC}_t \leq \text{SoC}_{\text{max}} $$
  - Typical values: $\text{SoC}_{\text{min}} = 10\%$, $\text{SoC}_{\text{max}} = 95\%$
  - Characteristics: Continuous variable.
- Charging/Discharging Efficiency ($\eta$):
  - Description: Represents the efficiency of the battery when charging or discharging.
  - Typical value: $\eta = 0.9$
- Time Encoding:
  - Description: Represents the current time in a cyclical manner to capture temporal patterns.
  - Encoding Method: Cyclical encoding using sine and cosine functions.
    - Hour of Day: $$ \text{Hour}_{\sin} = \sin\left(2\pi \times \frac{\text{Hour}}{24}\right), \quad \text{Hour}_{\cos} = \cos\left(2\pi \times \frac{\text{Hour}}{24}\right) $$
    - Day of Week: $$ \text{Day}_{\sin} = \sin\left(2\pi \times \frac{\text{Day}}{7}\right), \quad \text{Day}_{\cos} = \cos\left(2\pi \times \frac{\text{Day}}{7}\right) $$
  - Characteristics: Continuous variables.
- Electricity Price:
  - Description: The cost of electricity, determined internally based on the current time.
  - Calculation: The price is calculated from the time information according to predefined time phases (as per Italian law); a sketch of this mapping follows this list.
  - Phase 1 (F1):
    - Time: 8 AM – 7 PM, Monday to Friday
    - Price: High ($c_{\text{max}}$)
  - Phase 2 (F2):
    - Time:
      - 7 AM – 8 AM and 7 PM – 11 PM, Monday to Friday
      - 7 AM – 11 PM, Saturday
    - Price: Medium ($c_{\text{mid}}$)
  - Phase 3 (F3):
    - Time:
      - 11 PM – 7 AM, Monday to Saturday
      - All day Sunday
    - Price: Low ($c_{\text{min}}$)
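As a minimal sketch of the phase lookup (not part of the specification itself), the price could be resolved from the hour and weekday as below; the numeric tariff values are placeholders, and Python's `datetime.weekday()` convention (Monday = 0) is assumed.

```python
from datetime import datetime

# Placeholder tariff levels (e.g., EUR/kWh); replace with the actual F1/F2/F3 prices.
C_MAX, C_MID, C_MIN = 0.30, 0.25, 0.20

def electricity_price(ts: datetime) -> float:
    """Return the time-of-use price for the F1/F2/F3 phase containing timestamp ts."""
    hour, weekday = ts.hour, ts.weekday()  # weekday: Monday = 0 ... Sunday = 6
    if weekday == 6:                        # Sunday: F3 all day
        return C_MIN
    if weekday == 5:                        # Saturday: F2 from 7 AM to 11 PM, else F3
        return C_MID if 7 <= hour < 23 else C_MIN
    # Monday to Friday
    if 8 <= hour < 19:                      # F1: 8 AM - 7 PM
        return C_MAX
    if 7 <= hour < 8 or 19 <= hour < 23:    # F2: 7-8 AM and 7-11 PM
        return C_MID
    return C_MIN                            # F3: 11 PM - 7 AM
```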
The state at time $t$ is defined as:
$$ s_t = \left[ \text{SoC}_t,\ P^G_t,\ P^L_t,\ \text{Hour}_{\sin},\ \text{Hour}_{\cos},\ \text{Day}_{\sin},\ \text{Day}_{\cos} \right] $$
- SoC $\text{SoC}_t$: Continuous, between $\text{SoC}_{\text{min}}$ and $\text{SoC}_{\text{max}}$.
- PV Generation $P^G_t$: Continuous, based on historical data.
- Load Demand $P^L_t$: Continuous, based on historical data.
- Time Encoding: Continuous variables representing time cyclically.
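For illustration, the observation vector could be assembled as below; the scalar inputs (`soc`, `pv`, `load`, `hour`, `day_of_week`) are assumed to be already available at each step.

```python
import numpy as np

def build_observation(soc: float, pv: float, load: float,
                      hour: int, day_of_week: int) -> np.ndarray:
    """Assemble s_t = [SoC, P^G, P^L, hour_sin, hour_cos, day_sin, day_cos]."""
    hour_sin = np.sin(2 * np.pi * hour / 24)
    hour_cos = np.cos(2 * np.pi * hour / 24)
    day_sin = np.sin(2 * np.pi * day_of_week / 7)
    day_cos = np.cos(2 * np.pi * day_of_week / 7)
    return np.array([soc, pv, load, hour_sin, hour_cos, day_sin, day_cos],
                    dtype=np.float32)
```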
The agent's action at time $t$ is defined as follows:
- Continuous Action Space: The action $a_t$ represents the power used to charge or discharge the battery.
  - Charging: $a_t > 0$
  - Discharging: $a_t < 0$
  - Idle: $a_t = 0$
- Constraints:
  - Charging Rate Limit: $0 \leq a_t \leq a_{\text{charge\_max}}$
  - Discharging Rate Limit: $-a_{\text{discharge\_max}} \leq a_t \leq 0$
  - Energy Availability:
    - Charging: Limited to surplus PV energy. $$ a_t \leq \max\left(0,\ P^G_t - P^L_t\right) $$
    - Discharging: Limited to net load demand. $$ -a_t \leq \max\left(0,\ P^L_t - P^G_t\right) $$
Action Adjustment
- Adjust Action for Constraints:
  - Clip the agent's proposed action $a_t^\text{proposed}$ to satisfy the physical and operational constraints.
  - Adjusted Action: $a_t^\text{adjusted}$
  - Action Adjustment Difference: $$ \Delta a_t = a_t^\text{proposed} - a_t^\text{adjusted} $$
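A possible sketch of this clipping step, combining the rate limits with the energy-availability bounds; the function and argument names are illustrative, not prescribed by the specification.

```python
def clip_action(a_proposed: float, pv: float, load: float,
                a_charge_max: float, a_discharge_max: float) -> tuple[float, float]:
    """Clip a proposed battery power (kW) to rate limits and energy availability.

    Returns the adjusted action and the adjustment difference used for the penalty.
    """
    if a_proposed > 0:    # charging: limited by the rate limit and the PV surplus
        upper = min(a_charge_max, max(0.0, pv - load))
        a_adjusted = min(a_proposed, upper)
    elif a_proposed < 0:  # discharging: limited by the rate limit and the net load
        lower = -min(a_discharge_max, max(0.0, load - pv))
        a_adjusted = max(a_proposed, lower)
    else:
        a_adjusted = 0.0
    delta_a = a_proposed - a_adjusted
    return a_adjusted, delta_a
```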
State of Charge Update
- Proposed SoC Update: $$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + \eta \times \frac{a_t^\text{adjusted} \times \Delta t}{E_{\text{cap}}} $$
- Adjust SoC for Constraints:
  - If $\text{SoC}_{t+1}^\text{proposed}$ violates the SoC constraints, adjust it: $$ \text{SoC}_{t+1}^\text{adjusted} = \text{clip}\left( \text{SoC}_{t+1}^\text{proposed},\ \text{SoC}_{\text{min}},\ \text{SoC}_{\text{max}} \right) $$
  - SoC Adjustment Difference: $$ \Delta \text{SoC} = \text{SoC}_{t+1}^\text{proposed} - \text{SoC}_{t+1}^\text{adjusted} $$
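For example, assuming $E_{\text{cap}} = 10$ kWh, $\eta = 0.9$, $\Delta t = 1$ h and an adjusted charging action $a_t^\text{adjusted} = 2$ kW, the proposed update is
$$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + 0.9 \times \frac{2 \times 1}{10} = \text{SoC}_t + 0.18, $$
i.e. an increase of 0.18 on a 0–1 SoC scale, which would then be clipped if it exceeded $\text{SoC}_{\text{max}}$.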
Energy Balance Equations
- Net Load After PV Generation: $$ \text{Net Load} = P^L_t - P^G_t $$
- Battery Contribution:
  - Actual Action: $$ a_t^\text{actual} = a_t^\text{adjusted} $$
  - Adjust for Energy Availability:
    - Charging: $$ \text{If } a_t^\text{actual} > 0:\ a_t^\text{actual} = \min\left( a_t^\text{actual},\ \max\left(0,\ -\text{Net Load}\right) \right) $$
    - Discharging: $$ \text{If } a_t^\text{actual} < 0:\ a_t^\text{actual} = \max\left( a_t^\text{actual},\ -\max\left(0,\ \text{Net Load}\right) \right) $$
- Grid Interaction:
  - Energy Purchased: $$ P^{\text{grid}}_t = \max\left(0,\ \text{Net Load} + a_t^\text{actual}\right) $$
  - Energy Sold: $$ P^{\text{surplus}}_t = \max\left(0,\ -\left( \text{Net Load} + a_t^\text{actual} \right) \right) $$
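A compact sketch of the energy-balance step implied by these equations, using the sign convention above (positive action = charging); the variable names are illustrative.

```python
def energy_balance(pv: float, load: float, a_adjusted: float) -> tuple[float, float, float]:
    """Return (a_actual, grid_purchase, grid_surplus) for one time step, in kW."""
    net_load = load - pv
    a_actual = a_adjusted
    if a_actual > 0:    # charging limited to the available PV surplus
        a_actual = min(a_actual, max(0.0, -net_load))
    elif a_actual < 0:  # discharging limited to the residual load
        a_actual = max(a_actual, -max(0.0, net_load))
    exchange = net_load + a_actual          # positive: import, negative: export
    grid_purchase = max(0.0, exchange)
    grid_surplus = max(0.0, -exchange)
    return a_actual, grid_purchase, grid_surplus
```

For instance, with `pv = 3`, `load = 1` and `a_adjusted = 1.5` (all in kW), the battery absorbs 1.5 kW and the remaining 0.5 kW is exported to the grid.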
Price Calculation
- Price Determination: $\text{Price}_t$ is calculated internally based on the current time phase.

Penalties
- Action Penalty: $$ P_{\text{action}} = -\mu \times \left| \Delta a_t \right| $$
- SoC Adjustment Penalty: $$ P_{\text{SoC\_adjust}} = -\lambda_{\text{SoC}} \times \left| \Delta \text{SoC} \right| $$
The reward at time $t$ consists of the following terms:
- Cost of Energy Purchased from the Grid $C_{\text{purchase}}$: $$ C_{\text{purchase}} = c_{\text{buy}} \times P^{\text{grid}}_t $$
- Revenue from Energy Sold to the Grid $R_{\text{sale}}$: $$ R_{\text{sale}} = c_{\text{sell}} \times P^{\text{surplus}}_t $$
- Total Penalty $P_{\text{total}}$: $$ P_{\text{total}} = P_{\text{action}} + P_{\text{SoC\_adjust}} $$
- Reward: $$ r_t = R_{\text{sale}} - C_{\text{purchase}} + P_{\text{total}} $$
- Objective: Maximize $r_t$ over time.
- Note: Penalties are added to the reward (since they are negative), effectively reducing the reward when constraints are violated.
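As a small illustration with assumed values $c_{\text{buy}} = 0.25$ and $c_{\text{sell}} = 0.10$ per kWh, $P^{\text{grid}}_t = 1$ kW, $P^{\text{surplus}}_t = 0$, $\mu \left| \Delta a_t \right| = 0.05$ and $\Delta \text{SoC} = 0$ over a one-hour step:
$$ r_t = (0.10 \times 0) - (0.25 \times 1) - 0.05 + 0 = -0.30 $$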
- Algorithm: Use RL algorithms suitable for continuous action spaces, such as:
- Deep Deterministic Policy Gradient (DDPG)
- Soft Actor-Critic (SAC)
- Proximal Policy Optimization (PPO) with continuous actions
- Observation Space: Continuous space represented by:
  $$ \text{Observation} = \begin{bmatrix} \text{SoC}_t \\ P^G_t \\ P^L_t \\ \text{Hour}_{\sin} \\ \text{Hour}_{\cos} \\ \text{Day}_{\sin} \\ \text{Day}_{\cos} \end{bmatrix} $$
- Action Space: Continuous space within the charging and discharging rate limits.
- Time Interval:
  - Duration: 1 hour per time step.
  - Episode Length: Spans multiple days, depending on data length.
- Data Integration:
  - Historical Data: Use real historical data for PV generation and load demand to create a realistic environment.
- Data Handling (see the data-loading sketch after the code snippets below):
  - Load data into pandas DataFrames.
  - Align and preprocess data (e.g., handle missing values, resample if necessary).
  - At each time step, read the corresponding data point.
- Action Adjustment in Code:
  ```python
  action_corrected, penalty_action = self._get_action_check(action, info)
  # Delta action for penalty calculation
  delta_action = action - action_corrected
  penalty_action = -mu * abs(delta_action)
  ```
- SoC Update in Code:
  ```python
  # Proposed SoC after applying the corrected action for one time interval
  SoC_proposed = self.SoC + self.eta * (action_corrected * self.time_interval) / self.battery_capacity
  # Clip to the allowed SoC range and penalize the adjustment
  SoC_adjusted = np.clip(SoC_proposed, self.SoC_min, self.SoC_max)
  delta_SoC = SoC_proposed - SoC_adjusted
  penalty_SoC_adjust = -lambda_SoC * abs(delta_SoC)
  self.SoC = SoC_adjusted
  ```
- Energy Balance in Code:
  ```python
  # Net load after local PV generation (positive: residual demand, negative: surplus)
  net_load = self.L - self.G
  ```
- Reward Calculation in Code:
  ```python
  # Revenue from exported energy minus cost of imported energy, plus (negative) penalties
  reward = R_sale - C_purchase + penalty_action + penalty_SoC_adjust
  ```
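For the Data Handling step above, a minimal loading and alignment sketch is shown below; the file names and column names (`pv_kw`, `load_kw`) are placeholders for whichever dataset is used.

```python
import pandas as pd

# Hypothetical input files; adjust paths and column names to the actual dataset.
pv = pd.read_csv("pv_generation.csv", parse_dates=["timestamp"], index_col="timestamp")
load = pd.read_csv("load_demand.csv", parse_dates=["timestamp"], index_col="timestamp")

# Align both series on a common index, resample to hourly, and fill short gaps.
data = pv.join(load, how="inner")
data = data.resample("1H").mean().interpolate(limit=3)

# At each environment step t, the current PV generation and load are read as:
# pv_t = data["pv_kw"].iloc[t]; load_t = data["load_kw"].iloc[t]
```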
Assumptions
- No Forecasting: The agent only considers data from the current time step.
- Battery Charging Constraints:
  - The battery cannot be charged from the grid.
  - Charging is limited to surplus PV energy.
- Price Levels: Determined internally based on time.
- Agent Penalization:
  - The agent is penalized for proposing invalid actions and causing SoC violations, even if the environment adjusts these values.
- Historical Data Usage:
  - Realistic PV generation and load demand patterns improve the agent’s learning and policy effectiveness.
Summary of Key Equations
- Action Adjustment Difference: $$ \Delta a_t = a_t^\text{proposed} - a_t^\text{adjusted} $$
- SoC Update:
  - Proposed SoC: $$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + \eta \times \frac{a_t^\text{adjusted} \times \Delta t}{E_{\text{cap}}} $$
  - SoC Adjustment Difference: $$ \Delta \text{SoC} = \text{SoC}_{t+1}^\text{proposed} - \text{SoC}_{t+1}^\text{adjusted} $$
- Action Penalty: $$ P_{\text{action}} = -\mu \times \left| \Delta a_t \right| $$
- SoC Adjustment Penalty: $$ P_{\text{SoC\_adjust}} = -\lambda_{\text{SoC}} \times \left| \Delta \text{SoC} \right| $$
- Grid Interaction:
  - Energy Purchased: $$ P^{\text{grid}}_t = \max\left(0,\ \text{Net Load} + a_t^\text{actual}\right) $$
  - Energy Sold: $$ P^{\text{surplus}}_t = \max\left(0,\ -\left( \text{Net Load} + a_t^\text{actual} \right) \right) $$
- Reward Function: $$ r_t = \left[ c_{\text{sell}} \times P^{\text{surplus}}_t \right] - \left[ c_{\text{buy}} \times P^{\text{grid}}_t \right] + P_{\text{action}} + P_{\text{SoC\_adjust}} $$
Conclusion
This project aims to develop an RL-based BMS that optimizes energy flow within a REC by maximizing self-consumption and generating profit while maintaining battery health and adhering to operational constraints. By incorporating penalties for action and SoC adjustments, the agent is incentivized to operate within valid constraints, leading to more effective and realistic policy learning.
Additional Information
Data Sources
• PV Generation and Load Demand:
  • Use datasets that provide detailed energy consumption and PV generation data, such as:
    • Pecan Street Dataport
    • UCI Machine Learning Repository
    • REFIT Electrical Load Measurements
Agent Learning Considerations
• Penalties and Adjustments:
  • The environment adjusts invalid actions and SoC values to maintain physical realism.
  • Penalties are applied to the agent to encourage learning valid actions.
• Agent Observations:
  • The agent receives observations based on adjusted state variables.
  • Over time, the agent learns to propose actions within valid constraints to maximize rewards.
Testing and Validation
• Environment Testing:
  • Before training the agent, thoroughly test the environment with known scenarios.
  • Ensure that energy flows, constraints, and rewards are calculated correctly.
• Agent Training:
  • Start with a simple algorithm and gradually increase complexity.
  • Monitor the agent’s performance and adjust hyperparameters as needed.
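As a starting point, a minimal training sketch using Stable-Baselines3's SAC is shown below; `BatteryEnv` and the `battery_env` module are placeholders for a Gymnasium-compatible implementation of the environment described above.

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_checker import check_env

from battery_env import BatteryEnv  # hypothetical module implementing the environment

env = BatteryEnv()   # placeholder constructor; pass data/config as needed
check_env(env)       # verify the observation/action spaces and step/reset API

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("bms_sac")

# Evaluate the learned policy on one episode.
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```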
Future Enhancements
• Forecasting:
  • Incorporate short-term forecasting of PV generation and load demand to improve decision-making.
• Dynamic Pricing:
  • Implement dynamic electricity pricing models based on market conditions.
• Scalability:
  • Extend the environment to manage multiple batteries or interact with a larger grid.