Use Vec in place of HashMap<usize, T> (guillaume-be#252)
**This Commit**
Attempts to simplify the `predict` function in the
`token_classification` pipeline by replacing a `HashMap` whose keys
are indices with a `Vec`.

**Why?**
Because the `HashMap` eagerly creates token buckets for all indices in
`0..input.len()`, we can get the same behavior by using a `Vec`. This
also cleans up some later code that was sorting on index, because the
`Vec` maintains order by index naturally.
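
A minimal sketch of the equivalence described above, using plain strings instead of the pipeline's `Token` type. The function names `group_with_map` and `group_with_vec` are hypothetical and exist only for this illustration; they are not part of `rust_bert`.

```rust
use std::collections::HashMap;

/// Old shape: pre-fill one bucket per index in a HashMap, then sort by key
/// at the end to recover index order.
fn group_with_map(len: usize, items: &[(usize, &str)]) -> Vec<Vec<String>> {
    let mut buckets: HashMap<usize, Vec<String>> = HashMap::new();
    for idx in 0..len {
        buckets.insert(idx, Vec::new());
    }
    for (idx, item) in items {
        buckets.get_mut(idx).unwrap().push(item.to_string());
    }
    let mut pairs: Vec<(usize, Vec<String>)> = buckets.into_iter().collect();
    pairs.sort_by_key(|kv| kv.0);
    pairs.into_iter().map(|(_, v)| v).collect()
}

/// New shape: a Vec of buckets, where the position is the index,
/// so no sorting step is needed.
fn group_with_vec(len: usize, items: &[(usize, &str)]) -> Vec<Vec<String>> {
    let mut buckets: Vec<Vec<String>> = vec![Vec::new(); len];
    for (idx, item) in items {
        buckets[*idx].push(item.to_string());
    }
    buckets
}

fn main() {
    // Items arrive out of index order; index 2 never receives an item.
    let items: [(usize, &str); 3] = [(1, "b0"), (0, "a0"), (1, "b1")];
    assert_eq!(group_with_map(3, &items), group_with_vec(3, &items));
}
```

Both functions keep an empty bucket for an index that receives no items, which is why the eager pre-fill in the `HashMap` version makes the two interchangeable.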

**Note**
I also switched from `get_mut().unwrap()` to `[]` notation because it
behaves the same (both panic on a missing index) but is shorter. Happy
to revert that if `get_mut().unwrap()` is specifically preferred, for
example to make panic points easy to find by grepping for `unwrap`!
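
For illustration only, a small sketch of the two access styles on a `Vec` of buckets. The helper names `push_indexed` and `push_unwrapped` are hypothetical. Note that the original code called `get_mut` on a `HashMap` with a key reference, whereas this sketch calls `Vec::get_mut` with a position.

```rust
/// Index notation: panics if `idx` is out of bounds.
fn push_indexed(buckets: &mut [Vec<u32>], idx: usize, value: u32) {
    buckets[idx].push(value);
}

/// `get_mut().unwrap()`: also panics if `idx` is out of bounds,
/// but the panic site contains the greppable word `unwrap`.
fn push_unwrapped(buckets: &mut [Vec<u32>], idx: usize, value: u32) {
    buckets.get_mut(idx).unwrap().push(value);
}

fn main() {
    let mut buckets: Vec<Vec<u32>> = vec![Vec::new(); 2];
    push_indexed(&mut buckets, 1, 10);
    push_unwrapped(&mut buckets, 1, 20);
    assert_eq!(buckets[1], vec![10, 20]);

    // Without the `unwrap`, `get_mut` reports a missing index as `None`
    // instead of panicking.
    assert!(buckets.get_mut(5).is_none());
}
```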

**Note**
I wrote a benchmark, and the change didn't seem to make `predict` faster
or slower, but hopefully the benchmark will be helpful to others in the
future :crossed_fingers:.
mlodato517 authored May 10, 2022
1 parent b49d853 commit c5faadc
Showing 4 changed files with 65 additions and 18 deletions.
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -46,6 +46,10 @@ harness = false
name = "tensor_operations_benchmark"
harness = false

[[bench]]
name = "token_classification_benchmark"
harness = false

[profile.bench]
opt-level = 3

5 changes: 3 additions & 2 deletions benches/sst2_benchmark.rs
@@ -83,8 +83,9 @@ fn bench_sst2(c: &mut Criterion) {
torch_sys::dummy_cuda_dependency();
}
// Define input
- let mut sst2_path = PathBuf::from(env::var("SST2_PATH")
-     .expect("Please set the \"squad_dataset\" environment variable pointing to the SQuAD dataset folder"));
+ let mut sst2_path = PathBuf::from(env::var("SST2_PATH").expect(
+     "Please set the \"SST2_PATH\" environment variable pointing to the SST2 dataset folder",
+ ));
sst2_path.push("train.tsv");
let mut inputs = ss2_processor(sst2_path).unwrap();
inputs.truncate(2000);
55 changes: 55 additions & 0 deletions benches/token_classification_benchmark.rs
@@ -0,0 +1,55 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rust_bert::pipelines::token_classification::{
    TokenClassificationConfig, TokenClassificationModel,
};
use tch::Device;

fn create_model() -> TokenClassificationModel {
    let config = TokenClassificationConfig {
        device: Device::cuda_if_available(),
        ..Default::default()
    };
    TokenClassificationModel::new(config).unwrap()
}

fn bench_token_classification_predict(c: &mut Criterion) {
    // Set-up model
    unsafe {
        torch_sys::dummy_cuda_dependency();
    }
    let model = create_model();

    // Define input
let input = ["In findings published Tuesday in Cornell University's arXiv by a team of scientists \
from the University of Montreal and a separate report published Wednesday in Nature Astronomy by a team \
from University College London (UCL), the presence of water vapour was confirmed in the atmosphere of K2-18b, \
a planet circling a star in the constellation Leo. This is the first such discovery in a planet in its star's \
habitable zone — not too hot and not too cold for liquid water to exist. The Montreal team, led by Björn Benneke, \
used data from the NASA's Hubble telescope to assess changes in the light coming from K2-18b's star as the planet \
passed between it and Earth. They found that certain wavelengths of light, which are usually absorbed by water, \
weakened when the planet was in the way, indicating not only does K2-18b have an atmosphere, but the atmosphere \
contains water in vapour form. The team from UCL then analyzed the Montreal team's data using their own software \
and confirmed their conclusion. This was not the first time scientists have found signs of water on an exoplanet, \
but previous discoveries were made on planets with high temperatures or other pronounced differences from Earth. \
\"This is the first potentially habitable planet where the temperature is right and where we now know there is water,\" \
said UCL astronomer Angelos Tsiaras. \"It's the best candidate for habitability right now.\" \"It's a good sign\", \
said Ryan Cloutier of the Harvard–Smithsonian Center for Astrophysics, who was not one of either study's authors. \
\"Overall,\" he continued, \"the presence of water in its atmosphere certainly improves the prospect of K2-18b being \
a potentially habitable planet, but further observations will be required to say for sure. \" \
K2-18b was first identified in 2015 by the Kepler space telescope. It is about 110 light-years from Earth and larger \
but less dense. Its star, a red dwarf, is cooler than the Sun, but the planet's orbit is much closer, such that a year \
on K2-18b lasts 33 Earth days. According to The Guardian, astronomers were optimistic that NASA's James Webb space \
telescope — scheduled for launch in 2021 — and the European Space Agency's 2028 ARIEL program, could reveal more \
about exoplanets like K2-18b."];
    // (New sample credits: [WikiNews](https://en.wikinews.org/wiki/Astronomers_find_water_vapour_in_atmosphere_of_exoplanet_K2-18b))
    c.bench_function("token_classification_predict", |b| {
        b.iter(|| model.predict(black_box(&input), true, true))
    });
}

criterion_group! {
    name = benches;
    config = Criterion::default().sample_size(10);
    targets = bench_token_classification_predict
}
criterion_main!(benches);
19 changes: 3 additions & 16 deletions src/pipelines/token_classification.rs
@@ -870,10 +870,7 @@ impl TokenClassificationModel {
.flat_map(|(example_index, example)| self.generate_features(example, example_index))
.collect();

- let mut example_tokens_map: HashMap<usize, Vec<Token>> = HashMap::new();
- for example_idx in 0..input.len() {
-     example_tokens_map.insert(example_idx, Vec::new());
- }
+ let mut example_tokens_map: Vec<Vec<Token>> = vec![Vec::new(); input.len()];
let mut start = 0usize;
let len_features = features.len();

@@ -927,23 +924,13 @@ impl TokenClassificationModel {
word_idx,
)
};
- example_tokens_map
-     .get_mut(&(feature.example_index))
-     .unwrap()
-     .push(token);
+ example_tokens_map[feature.example_index].push(token);
}
}
});
start = end;
}
- let mut tokens = example_tokens_map
-     .into_iter()
-     .collect::<Vec<(usize, Vec<Token>)>>();
- tokens.sort_by_key(|kv| kv.0);
- let mut tokens = tokens
-     .into_iter()
-     .map(|(_, v)| v)
-     .collect::<Vec<Vec<Token>>>();
+ let mut tokens = example_tokens_map;

if consolidate_sub_tokens {
self.consolidate_tokens(&mut tokens, &self.label_aggregation_function);