Add data grabbing code

BaiqingL · Sep 11, 2024 · 5bc3b13
1 parent 6eb4877
commit 5bc3b13

Showing 3 changed files with 48 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -29,3 +29,15 @@ while battleSimulator.simulate_new_turn():
 
 During the simulation, you can access the current state of the battle and the Pokémon involved at any point.
 
+
+## Data Processing Pipeline
+
+The following files are used in sequence to clean and parse the data:
+
+1. `paginated_search.py`: Grabs replays from the replay API and filters out ones with rating < 2200
+2. `grab_battle_logs.py`: Grabs the battle logs for the replays
+3. `produce_question_prompts.py`: Produces the question prompts for the replays
+4. `clean_filter_prompt_parquet.py`: Cleans the prompts for the replays: drops rows with no prompts or with more than 50 prompts, then keeps every other prompt in each list
+
+Each script reads the output of the previous one, so run them in the order listed.
+
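Note on the README hunk above: the four steps could be chained with a small driver like the one below. This is only a sketch under assumptions the commit does not confirm: that each script is run from the repository root, takes no command-line arguments, and exits non-zero on failure. `run_pipeline` and its `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Script order copied from the README list in this commit.
PIPELINE = [
    "paginated_search.py",
    "grab_battle_logs.py",
    "produce_question_prompts.py",
    "clean_filter_prompt_parquet.py",
]


def run_pipeline(
    scripts: Sequence[str] = PIPELINE,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each script in order and return the ones that completed.

    Assumption: the scripts need no arguments. `runner` is injectable
    so the ordering can be checked without executing anything.
    """
    completed = []
    for script in scripts:
        # With subprocess.run, check=True raises CalledProcessError on a
        # non-zero exit code, so later steps never see a failed step's output.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` with the default runner would execute the scripts for real; passing a stub as `runner` only records the order.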
diff --git a/clean_prompt_parquet.py b/clean_filter_prompt_parquet.py
@@ -7,6 +7,12 @@
 # Remove rows where prompts are empty
 df_cleaned = df[df['prompts'].apply(lambda x: len(x) > 0)]
 
+# Remove rows with more than 50 prompts; copy so the column assignment below does not trigger SettingWithCopyWarning
+df_cleaned = df_cleaned[df_cleaned['prompts'].apply(lambda x: len(x) <= 50)].copy()
+
+# Keep every other prompt in each list (indices 0, 2, 4, ...)
+df_cleaned['prompts'] = df_cleaned['prompts'].apply(lambda x: x[::2])
+
 # Save to a new parquet file
 df_cleaned.to_parquet("data/battle_logs_with_prompts_cleaned.parquet", engine="pyarrow")
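Note on the cleaning steps in this hunk: their combined effect can be seen on a toy frame. The sketch below assumes the `prompts` column holds Python lists (values read back from parquet may be arrays, which slice the same way). `clean_prompts` and `max_prompts` are hypothetical names, not part of the commit.

```python
import pandas as pd


def clean_prompts(df: pd.DataFrame, max_prompts: int = 50) -> pd.DataFrame:
    """Apply the three cleaning steps from the diff to a 'prompts' column."""
    lengths = df["prompts"].apply(len)
    # Keep rows with between 1 and max_prompts prompts. The copy makes the
    # column assignment below write to an independent frame.
    cleaned = df[(lengths > 0) & (lengths <= max_prompts)].copy()
    # Keep the prompts at even positions (0, 2, 4, ...).
    cleaned["prompts"] = cleaned["prompts"].apply(lambda prompts: prompts[::2])
    return cleaned


# Toy data: an empty list, a short list, and a list of 51 prompts.
toy = pd.DataFrame(
    {"prompts": [[], ["a", "b", "c"], [str(i) for i in range(51)]]}
)
result = clean_prompts(toy)
```

Only the middle row survives here: the empty list and the 51-item list are dropped, and `["a", "b", "c"]` is reduced to `["a", "c"]`.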
 

diff --git a/produce_question_prompts.py b/produce_question_prompts.py
@@ -229,6 +229,32 @@ def produce_question_prompt(scenario: str, winner_move: Tuple[BattleOrder, bool]
 Observing how the opponent switches can also yield significant information, particularly with deciding which Pokemon is a threat to their team. As an example, Choice Specs Heliolisk is out against the opponent's Golduck. Instead of switching in a Pokemon that resists Electric, the opponent sacks Golduck to Thunderbolt. This indicates that the opponent either has no Electric resists or no checks to Heliolisk, so it can be ascertained that Heliolisk is a massive threat and thus a win condition. Furthermore, if you have another Electric type like Raikou, then it can be determined that it also is a threat as it is quite similar to Heliolisk. In this situation, Heliolisk and Raikou should pretty much guarantee a win because as one punches holes in the opponent's team, the other should have no problem cleaning up. So, in a nutshell, if the opponent doesn't switch in a Pokemon that has a type advantage against the one you currently have in play, you can determine that that Pokemon is a threat, or that the type of that Pokemon threatens your opponent's team.
 Generation 9 introduces Terastallization, which lets your Pokemon transform in the middle of battle from its current typing into its Tera type. This adds a new layer of depth to Gen 9 battles. Tera typing can be super useful for things such as setting up STAB moves for that Tera type, setting up your Tera Blast users, or resisting a predicted attack you know your opponent is going to use."""
 
+    type_effectiveness_prompt = """
+Type      | Strong Against         | Weak To
+----------|------------------------|------------------
+Normal    | -                      | Fighting
+Fire      | Grass, Ice, Bug, Steel | Water, Ground, Rock
+Water     | Fire, Ground, Rock     | Electric, Grass
+Electric  | Water, Flying          | Ground
+Grass     | Water, Ground, Rock    | Fire, Ice, Poison, Flying, Bug
+Ice       | Grass, Ground, Flying, | Fire, Fighting, Rock, Steel
+          | Dragon                 |
+Fighting  | Normal, Ice, Rock,     | Flying, Psychic, Fairy
+          | Dark, Steel            |
+Poison    | Grass, Fairy           | Ground, Psychic
+Ground    | Fire, Electric, Poison,| Water, Grass, Ice
+          | Rock, Steel            |
+Flying    | Grass, Fighting, Bug   | Electric, Ice, Rock
+Psychic   | Fighting, Poison       | Bug, Ghost, Dark
+Bug       | Grass, Psychic, Dark   | Fire, Flying, Rock
+Rock      | Fire, Ice, Flying, Bug | Water, Grass, Fighting, Ground, Steel
+Ghost     | Psychic, Ghost         | Ghost, Dark
+Dragon    | Dragon                 | Ice, Dragon, Fairy
+Dark      | Psychic, Ghost         | Fighting, Bug, Fairy
+Steel     | Ice, Rock, Fairy       | Fire, Fighting, Ground
+Fairy     | Fighting, Dragon, Dark | Poison, Steel
+"""
+
     question_prompt = """Imagine you're an expert Pokemon Showdown player analyzing a random battle. I'll provide you with a scenario from a Gen 9 random battle, including details about both teams, the current field conditions, and the move that was just made. I want you to explain why the player likely chose that specific move.
 In your response, please:
 
@@ -239,6 +265,9 @@ def produce_question_prompt(scenario: str, winner_move: Tuple[BattleOrder, bool]
 Consider type advantages, the alternative moves the player could have made and why they might have been rejected.
 Conclude with a summary of why this move was likely the best choice in this situation.
 
+Here's the type effectiveness chart:
+[TYPE EFFECTIVENESS CHART]
+
 Here's the scenario:
 
 [SCENARIO]
@@ -271,7 +300,7 @@ def produce_question_prompt(scenario: str, winner_move: Tuple[BattleOrder, bool]
     available_orders_prompt = ""
     for i, order in enumerate(available_orders):
         available_orders_prompt += f"{i}. {str(order)}\n"
-    result = question_prompt.replace("[STRATEGY PROMPT]", strategy_prompt).replace("[SCENARIO]", scenario).replace("[WINNER_POKEMON]", winner_pokemon).replace("[LOSER_POKEMON]", loser_pokemon).replace("[WINNER_CHOICES]", available_orders_prompt)
+    result = question_prompt.replace("[STRATEGY PROMPT]", strategy_prompt).replace("[SCENARIO]", scenario).replace("[TYPE EFFECTIVENESS CHART]", type_effectiveness_prompt).replace("[WINNER_POKEMON]", winner_pokemon).replace("[LOSER_POKEMON]", loser_pokemon).replace("[WINNER_CHOICES]", available_orders_prompt)
     if not winner_move[1]:
         result = result.replace("[WINNER_MOVE]", str(winner_move[0]))
     else: