deal with curly quotes
Chris Dyer committed May 22, 2015
1 parent d3c9c36 commit cf61f8b
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions corpus/support/tokenizer.pl
@@ -388,15 +388,18 @@ sub deep_proc_token {
##### step 1: separate by punct T2 on the boundary
my $t2 = '\`|\!|\@|\+|\=|\[|\]|\<|\>|\||\(|\)|\{|\}|\?|\"|;|●|○';
if($line =~ s/^(($t2)+)/$1 /){
$line =~ s/"//;
return proc_line($line);
}

if($line =~ s/(($t2)+)$/ $1/){
$line =~ s/"//;
return proc_line($line);
}

## step 2: separate by punct T2 in any position
if($line =~ s/(($t2)+)/ $1 /g){
$line =~ s/"//g; # probably before punctuation char
return proc_line($line);
}
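
For readers unfamiliar with this part of the tokenizer, the three substitutions in the hunk can be tried in isolation. The following is a minimal sketch, not code from tokenizer.pl: the helper name split_t2_once is hypothetical, it applies only the first rule that matches, and it leaves out both the recursive proc_line() call and the quote handling this commit adds.

```perl
use strict;
use warnings;
use utf8;    # the pattern below contains the literal characters ● and ○

# Same alternation as $t2 in the hunk: each alternative is one punctuation character.
my $t2 = '\`|\!|\@|\+|\=|\[|\]|\<|\>|\||\(|\)|\{|\}|\?|\"|;|●|○';

# Hypothetical helper (an assumption for illustration, not part of tokenizer.pl).
# It applies the first rule that matches and returns the modified line.
sub split_t2_once {
    my ($line) = @_;
    return $line if $line =~ s/^(($t2)+)/$1 /;     # step 1: run of T2 punctuation at the start
    return $line if $line =~ s/(($t2)+)$/ $1/;     # step 1: run of T2 punctuation at the end
    return $line if $line =~ s/(($t2)+)/ $1 /g;    # step 2: runs of T2 punctuation anywhere
    return $line;                                  # no T2 punctuation present
}
```

Under these assumptions, split_t2_once('(hello') yields '( hello', split_t2_once('what?!') yields 'what ?!', and split_t2_once('a=b') yields 'a = b'. In the real routine each branch hands the result back to proc_line(), so the pieces are tokenized again rather than returned directly.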
