Implement a Battery Management System (BMS) using Reinforcement Learning (RL) to manage energy flow in a Renewable Energy Community (REC) comprising:
- Photovoltaic (PV) Generator
- Residential Load
- Battery Storage System
- Electricity Price Model
The main objectives are:
- Increase Self-Consumption: Maximize the utilization of locally generated PV energy to meet residential load demands.
- Generate Profit: Sell surplus energy back to the grid during advantageous periods.
- Maintain Battery Health: Ensure the battery operates within safe State of Charge (SoC) limits.
- Adhere to Operational Constraints: Enforce physical and operational constraints in the energy management process.
The environment is characterized by the following quantities:
- PV Generation ($P^G_t$):
  - Description: Energy generated by the PV system at time $t$.
  - Unit: Kilowatts (kW)
  - Characteristics: Continuous variable.
  - Data Source: Historical data reflecting realistic PV generation patterns.
- Load Demand ($P^L_t$):
  - Description: Energy consumption of the residential load at time $t$.
  - Unit: Kilowatts (kW)
  - Characteristics: Continuous variable.
  - Data Source: Historical data reflecting realistic load demand patterns.
- State of Charge (SoC):
  - Description: Current energy level in the battery as a percentage of its capacity.
  - Constraints: $$ \text{SoC}_{\text{min}} \leq \text{SoC}_t \leq \text{SoC}_{\text{max}} $$
  - Typical values: $\text{SoC}_{\text{min}} = 10\%$, $\text{SoC}_{\text{max}} = 95\%$
  - Characteristics: Continuous variable.
- Charging/Discharging Efficiency ($\eta$):
  - Description: Represents the efficiency of the battery when charging or discharging.
  - Typical value: $\eta = 0.9$
- Time Encoding:
  - Description: Represents the current time in a cyclical manner to capture temporal patterns.
  - Encoding Method: Cyclical encoding using sine and cosine functions.
    - Hour of Day: $$ \text{Hour}_{\sin} = \sin\left(2\pi \times \frac{\text{Hour}}{24}\right), \quad \text{Hour}_{\cos} = \cos\left(2\pi \times \frac{\text{Hour}}{24}\right) $$
    - Day of Week: $$ \text{Day}_{\sin} = \sin\left(2\pi \times \frac{\text{Day}}{7}\right), \quad \text{Day}_{\cos} = \cos\left(2\pi \times \frac{\text{Day}}{7}\right) $$
  - Characteristics: Continuous variables.
- Electricity Price:
  - Description: The cost of electricity, determined internally based on the current time.
  - Calculation: The price is calculated from the time information according to predefined time phases (as per Italian law); a sketch of this mapping follows this list.
  - Phase 1 (F1):
    - Time: 8 AM – 7 PM, Monday to Friday
    - Price: High ($c_{\text{max}}$)
  - Phase 2 (F2):
    - Time:
      - 7 AM – 8 AM and 7 PM – 11 PM, Monday to Friday
      - 7 AM – 11 PM, Saturday
    - Price: Medium ($c_{\text{mid}}$)
  - Phase 3 (F3):
    - Time:
      - 11 PM – 7 AM, Monday to Saturday
      - All day Sunday
    - Price: Low ($c_{\text{min}}$)
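As a minimal sketch of the phase lookup (not part of the specification itself), the price could be resolved from the hour and weekday as below; the numeric tariff values are placeholders, and Python's `datetime.weekday()` convention (Monday = 0) is assumed.

```python
from datetime import datetime

# Placeholder tariff levels (e.g., EUR/kWh); replace with the actual F1/F2/F3 prices.
C_MAX, C_MID, C_MIN = 0.30, 0.25, 0.20

def electricity_price(ts: datetime) -> float:
    """Return the time-of-use price for the F1/F2/F3 phase containing timestamp ts."""
    hour, weekday = ts.hour, ts.weekday()  # weekday: Monday = 0 ... Sunday = 6
    if weekday == 6:                        # Sunday: F3 all day
        return C_MIN
    if weekday == 5:                        # Saturday: F2 from 7 AM to 11 PM, else F3
        return C_MID if 7 <= hour < 23 else C_MIN
    # Monday to Friday
    if 8 <= hour < 19:                      # F1: 8 AM - 7 PM
        return C_MAX
    if 7 <= hour < 8 or 19 <= hour < 23:    # F2: 7-8 AM and 7-11 PM
        return C_MID
    return C_MIN                            # F3: 11 PM - 7 AM
```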
The state at time $t$ is defined as:
$$ s_t = \left[ \text{SoC}_t,\ P^G_t,\ P^L_t,\ \text{Hour}_{\sin},\ \text{Hour}_{\cos},\ \text{Day}_{\sin},\ \text{Day}_{\cos} \right] $$
- SoC $\text{SoC}_t$: Continuous, between $\text{SoC}_{\text{min}}$ and $\text{SoC}_{\text{max}}$.
- PV Generation $P^G_t$: Continuous, based on historical data.
- Load Demand $P^L_t$: Continuous, based on historical data.
- Time Encoding: Continuous variables representing time cyclically.
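For illustration, the observation vector could be assembled as below; the scalar inputs (`soc`, `pv`, `load`, `hour`, `day_of_week`) are assumed to be already available at each step.

```python
import numpy as np

def build_observation(soc: float, pv: float, load: float,
                      hour: int, day_of_week: int) -> np.ndarray:
    """Assemble s_t = [SoC, P^G, P^L, hour_sin, hour_cos, day_sin, day_cos]."""
    hour_sin = np.sin(2 * np.pi * hour / 24)
    hour_cos = np.cos(2 * np.pi * hour / 24)
    day_sin = np.sin(2 * np.pi * day_of_week / 7)
    day_cos = np.cos(2 * np.pi * day_of_week / 7)
    return np.array([soc, pv, load, hour_sin, hour_cos, day_sin, day_cos],
                    dtype=np.float32)
```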
The agent's action at time $t$ is defined as follows:
- Continuous Action Space: The action $a_t$ represents the power used to charge or discharge the battery.
  - Charging: $a_t > 0$
  - Discharging: $a_t < 0$
  - Idle: $a_t = 0$
- Constraints:
  - Charging Rate Limit: $0 \leq a_t \leq a_{\text{charge\_max}}$
  - Discharging Rate Limit: $-a_{\text{discharge\_max}} \leq a_t \leq 0$
  - Energy Availability:
    - Charging: Limited to surplus PV energy. $$ a_t \leq \max\left(0,\ P^G_t - P^L_t\right) $$
    - Discharging: Limited to net load demand. $$ -a_t \leq \max\left(0,\ P^L_t - P^G_t\right) $$
Action Adjustment
- Adjust Action for Constraints:
  - Clip the agent's proposed action $a_t^\text{proposed}$ to satisfy the physical and operational constraints.
  - Adjusted Action: $a_t^\text{adjusted}$
  - Action Adjustment Difference: $$ \Delta a_t = a_t^\text{proposed} - a_t^\text{adjusted} $$
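A possible sketch of this clipping step, combining the rate limits with the energy-availability bounds; the function and argument names are illustrative, not prescribed by the specification.

```python
def clip_action(a_proposed: float, pv: float, load: float,
                a_charge_max: float, a_discharge_max: float) -> tuple[float, float]:
    """Clip a proposed battery power (kW) to rate limits and energy availability.

    Returns the adjusted action and the adjustment difference used for the penalty.
    """
    if a_proposed > 0:    # charging: limited by the rate limit and the PV surplus
        upper = min(a_charge_max, max(0.0, pv - load))
        a_adjusted = min(a_proposed, upper)
    elif a_proposed < 0:  # discharging: limited by the rate limit and the net load
        lower = -min(a_discharge_max, max(0.0, load - pv))
        a_adjusted = max(a_proposed, lower)
    else:
        a_adjusted = 0.0
    delta_a = a_proposed - a_adjusted
    return a_adjusted, delta_a
```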
State of Charge Update
- Proposed SoC Update: $$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + \eta \times \frac{a_t^\text{adjusted} \times \Delta t}{E_{\text{cap}}} $$
- Adjust SoC for Constraints:
  - If $\text{SoC}_{t+1}^\text{proposed}$ violates the SoC constraints, adjust it: $$ \text{SoC}_{t+1}^\text{adjusted} = \text{clip}\left( \text{SoC}_{t+1}^\text{proposed},\ \text{SoC}_{\text{min}},\ \text{SoC}_{\text{max}} \right) $$
  - SoC Adjustment Difference: $$ \Delta \text{SoC} = \text{SoC}_{t+1}^\text{proposed} - \text{SoC}_{t+1}^\text{adjusted} $$
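For example, assuming $E_{\text{cap}} = 10$ kWh, $\eta = 0.9$, $\Delta t = 1$ h and an adjusted charging action $a_t^\text{adjusted} = 2$ kW, the proposed update is
$$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + 0.9 \times \frac{2 \times 1}{10} = \text{SoC}_t + 0.18, $$
i.e. an increase of 0.18 on a 0–1 SoC scale, which would then be clipped if it exceeded $\text{SoC}_{\text{max}}$.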
Energy Balance Equations
- Net Load After PV Generation: $$ \text{Net Load} = P^L_t - P^G_t $$
- Battery Contribution:
  - Actual Action: $$ a_t^\text{actual} = a_t^\text{adjusted} $$
  - Adjust for Energy Availability:
    - Charging: $$ \text{If } a_t^\text{actual} > 0:\ a_t^\text{actual} = \min\left( a_t^\text{actual},\ \max\left(0,\ -\text{Net Load}\right) \right) $$
    - Discharging: $$ \text{If } a_t^\text{actual} < 0:\ a_t^\text{actual} = \max\left( a_t^\text{actual},\ -\max\left(0,\ \text{Net Load}\right) \right) $$
- Grid Interaction:
  - Energy Purchased: $$ P^{\text{grid}}_t = \max\left(0,\ \text{Net Load} + a_t^\text{actual}\right) $$
  - Energy Sold: $$ P^{\text{surplus}}_t = \max\left(0,\ -\left( \text{Net Load} + a_t^\text{actual} \right) \right) $$
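A compact sketch of the energy-balance step implied by these equations, using the sign convention above (positive action = charging); the variable names are illustrative.

```python
def energy_balance(pv: float, load: float, a_adjusted: float) -> tuple[float, float, float]:
    """Return (a_actual, grid_purchase, grid_surplus) for one time step, in kW."""
    net_load = load - pv
    a_actual = a_adjusted
    if a_actual > 0:    # charging limited to the available PV surplus
        a_actual = min(a_actual, max(0.0, -net_load))
    elif a_actual < 0:  # discharging limited to the residual load
        a_actual = max(a_actual, -max(0.0, net_load))
    exchange = net_load + a_actual          # positive: import, negative: export
    grid_purchase = max(0.0, exchange)
    grid_surplus = max(0.0, -exchange)
    return a_actual, grid_purchase, grid_surplus
```

For instance, with `pv = 3`, `load = 1` and `a_adjusted = 1.5` (all in kW), the battery absorbs 1.5 kW and the remaining 0.5 kW is exported to the grid.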
Price Calculation
- Price Determination: $\text{Price}_t$ is calculated internally based on the current time phase.

Penalties
- Action Penalty: $$ P_{\text{action}} = -\mu \times \left| \Delta a_t \right| $$
- SoC Adjustment Penalty: $$ P_{\text{SoC\_adjust}} = -\lambda_{\text{SoC}} \times \left| \Delta \text{SoC} \right| $$
The reward at time $t$ consists of the following terms:
- Cost of Energy Purchased from the Grid $C_{\text{purchase}}$: $$ C_{\text{purchase}} = c_{\text{buy}} \times P^{\text{grid}}_t $$
- Revenue from Energy Sold to the Grid $R_{\text{sale}}$: $$ R_{\text{sale}} = c_{\text{sell}} \times P^{\text{surplus}}_t $$
- Total Penalty $P_{\text{total}}$: $$ P_{\text{total}} = P_{\text{action}} + P_{\text{SoC\_adjust}} $$
- Reward: $$ r_t = R_{\text{sale}} - C_{\text{purchase}} + P_{\text{total}} $$
- Objective: Maximize $r_t$ over time.
- Note: Penalties are added to the reward (since they are negative), effectively reducing the reward when constraints are violated.
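As a small illustration with assumed values $c_{\text{buy}} = 0.25$ and $c_{\text{sell}} = 0.10$ per kWh, $P^{\text{grid}}_t = 1$ kW, $P^{\text{surplus}}_t = 0$, $\mu \left| \Delta a_t \right| = 0.05$ and $\Delta \text{SoC} = 0$ over a one-hour step:
$$ r_t = (0.10 \times 0) - (0.25 \times 1) - 0.05 + 0 = -0.30 $$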
- Algorithm: Use RL algorithms suitable for continuous action spaces, such as:
- Deep Deterministic Policy Gradient (DDPG)
- Soft Actor-Critic (SAC)
- Proximal Policy Optimization (PPO) with continuous actions
- Observation Space: Continuous space represented by:
  $$ \text{Observation} = \begin{bmatrix} \text{SoC}_t \\ P^G_t \\ P^L_t \\ \text{Hour}_{\sin} \\ \text{Hour}_{\cos} \\ \text{Day}_{\sin} \\ \text{Day}_{\cos} \end{bmatrix} $$
- Action Space: Continuous space within the charging and discharging rate limits.
- Time Interval:
  - Duration: 1 hour per time step.
  - Episode Length: Spans multiple days, depending on data length.
- Data Integration:
  - Historical Data: Use real historical data for PV generation and load demand to create a realistic environment.
- Data Handling (see the data-loading sketch after the code snippets below):
  - Load data into pandas DataFrames.
  - Align and preprocess data (e.g., handle missing values, resample if necessary).
  - At each time step, read the corresponding data point.
- Action Adjustment in Code:
  ```python
  action_corrected, penalty_action = self._get_action_check(action, info)
  # Delta action for penalty calculation
  delta_action = action - action_corrected
  penalty_action = -mu * abs(delta_action)
  ```
- SoC Update in Code:
  ```python
  # Proposed SoC after applying the corrected action for one time interval
  SoC_proposed = self.SoC + self.eta * (action_corrected * self.time_interval) / self.battery_capacity
  # Clip to the allowed SoC range and penalize the adjustment
  SoC_adjusted = np.clip(SoC_proposed, self.SoC_min, self.SoC_max)
  delta_SoC = SoC_proposed - SoC_adjusted
  penalty_SoC_adjust = -lambda_SoC * abs(delta_SoC)
  self.SoC = SoC_adjusted
  ```
- Energy Balance in Code:
  ```python
  # Net load after local PV generation (positive: residual demand, negative: surplus)
  net_load = self.L - self.G
  ```
- Reward Calculation in Code:
  ```python
  # Revenue from exported energy minus cost of imported energy, plus (negative) penalties
  reward = R_sale - C_purchase + penalty_action + penalty_SoC_adjust
  ```
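For the Data Handling step above, a minimal loading and alignment sketch is shown below; the file names and column names (`pv_kw`, `load_kw`) are placeholders for whichever dataset is used.

```python
import pandas as pd

# Hypothetical input files; adjust paths and column names to the actual dataset.
pv = pd.read_csv("pv_generation.csv", parse_dates=["timestamp"], index_col="timestamp")
load = pd.read_csv("load_demand.csv", parse_dates=["timestamp"], index_col="timestamp")

# Align both series on a common index, resample to hourly, and fill short gaps.
data = pv.join(load, how="inner")
data = data.resample("1H").mean().interpolate(limit=3)

# At each environment step t, the current PV generation and load are read as:
# pv_t = data["pv_kw"].iloc[t]; load_t = data["load_kw"].iloc[t]
```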
Assumptions
- No Forecasting: The agent only considers data from the current time step.
- Battery Charging Constraints:
  - The battery cannot be charged from the grid.
  - Charging is limited to surplus PV energy.
- Price Levels: Determined internally based on time.
- Agent Penalization:
  - The agent is penalized for proposing invalid actions and causing SoC violations, even if the environment adjusts these values.
- Historical Data Usage:
  - Realistic PV generation and load demand patterns improve the agent’s learning and policy effectiveness.
Summary of Key Equations
- Action Adjustment Difference: $$ \Delta a_t = a_t^\text{proposed} - a_t^\text{adjusted} $$
- SoC Update:
  - Proposed SoC: $$ \text{SoC}_{t+1}^\text{proposed} = \text{SoC}_t + \eta \times \frac{a_t^\text{adjusted} \times \Delta t}{E_{\text{cap}}} $$
  - SoC Adjustment Difference: $$ \Delta \text{SoC} = \text{SoC}_{t+1}^\text{proposed} - \text{SoC}_{t+1}^\text{adjusted} $$
- Action Penalty: $$ P_{\text{action}} = -\mu \times \left| \Delta a_t \right| $$
- SoC Adjustment Penalty: $$ P_{\text{SoC\_adjust}} = -\lambda_{\text{SoC}} \times \left| \Delta \text{SoC} \right| $$
- Grid Interaction:
  - Energy Purchased: $$ P^{\text{grid}}_t = \max\left(0,\ \text{Net Load} + a_t^\text{actual}\right) $$
  - Energy Sold: $$ P^{\text{surplus}}_t = \max\left(0,\ -\left( \text{Net Load} + a_t^\text{actual} \right) \right) $$
- Reward Function: $$ r_t = \left[ c_{\text{sell}} \times P^{\text{surplus}}_t \right] - \left[ c_{\text{buy}} \times P^{\text{grid}}_t \right] + P_{\text{action}} + P_{\text{SoC\_adjust}} $$
Conclusion
This project aims to develop an RL-based BMS that optimizes energy flow within a REC by maximizing self-consumption and generating profit while maintaining battery health and adhering to operational constraints. By incorporating penalties for action and SoC adjustments, the agent is incentivized to operate within valid constraints, leading to more effective and realistic policy learning.
Additional Information
Data Sources
• PV Generation and Load Demand:
  • Use datasets that provide detailed energy consumption and PV generation data, such as:
    • Pecan Street Dataport
    • UCI Machine Learning Repository
    • REFIT Electrical Load Measurements
Agent Learning Considerations
• Penalties and Adjustments:
  • The environment adjusts invalid actions and SoC values to maintain physical realism.
  • Penalties are applied to the agent to encourage learning valid actions.
• Agent Observations:
  • The agent receives observations based on adjusted state variables.
  • Over time, the agent learns to propose actions within valid constraints to maximize rewards.
Testing and Validation
• Environment Testing:
  • Before training the agent, thoroughly test the environment with known scenarios.
  • Ensure that energy flows, constraints, and rewards are calculated correctly.
• Agent Training:
  • Start with a simple algorithm and gradually increase complexity.
  • Monitor the agent’s performance and adjust hyperparameters as needed.
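As a starting point, a minimal training sketch using Stable-Baselines3's SAC is shown below; `BatteryEnv` and the `battery_env` module are placeholders for a Gymnasium-compatible implementation of the environment described above.

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_checker import check_env

from battery_env import BatteryEnv  # hypothetical module implementing the environment

env = BatteryEnv()   # placeholder constructor; pass data/config as needed
check_env(env)       # verify the observation/action spaces and step/reset API

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("bms_sac")

# Evaluate the learned policy on one episode.
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```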
Future Enhancements
• Forecasting:
  • Incorporate short-term forecasting of PV generation and load demand to improve decision-making.
• Dynamic Pricing:
  • Implement dynamic electricity pricing models based on market conditions.
• Scalability:
  • Extend the environment to manage multiple batteries or interact with a larger grid.