forked from forem/forem
Commit
Language detection with CLD3 (forem#19756)
* Language detection with CLD3 POC
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update app/models/article.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Clean up tests
* Clean up tests
* Move language detection to service
* rubocop
* Update app/services/languages/detection.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* More linting cleanup
* Remove allow_any_instance_of
* Update tests to better exercise different paths
* Fix flaky spec
* Update spec/services/languages/detection_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/models/article_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/services/languages/detection_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/services/languages/detection_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/services/languages/detection_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update spec/services/languages/detection_spec.rb
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 77148d8, commit 8335a37
Showing 10 changed files with 145 additions and 1 deletion.
@@ -0,0 +1,22 @@
module Languages
  class Detection
    attr_reader :text

    PROBABILITY_THRESHOLD = 0.5

    def self.call(...)
      new(...).call
    end

    def initialize(text)
      @text = text
    end

    def call(identifier: CLD3::NNetLanguageIdentifier.new(0, 1000))
      language_outcome = identifier.find_language(text)
      return unless language_outcome.probability > PROBABILITY_THRESHOLD && language_outcome.reliable?

      language_outcome.language
    end
  end
end
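
The guard in `call` returns a language only when the result is both above `PROBABILITY_THRESHOLD` and flagged reliable. A minimal sketch of that rule, runnable without the cld3 gem, might look like the following. `FakeResult`, `FakeIdentifier`, and the free-standing `detect` method are hypothetical stand-ins invented for illustration; they are not part of the commit, and only the `find_language`, `probability`, `reliable?`, and `language` names are taken from the service above.

```ruby
# Hypothetical stand-in for the result object the real identifier returns.
# Only the three readers used by the service are modelled here.
FakeResult = Struct.new(:language, :probability, :reliable, keyword_init: true) do
  def reliable?
    reliable
  end
end

# Hypothetical identifier that always answers with a canned result.
class FakeIdentifier
  def initialize(result)
    @result = result
  end

  def find_language(_text)
    @result
  end
end

# Same threshold value as in the service.
PROBABILITY_THRESHOLD = 0.5

# Illustrative extraction of the decision rule in Languages::Detection#call.
def detect(text, identifier:)
  outcome = identifier.find_language(text)
  # Both conditions must hold; otherwise the method returns nil.
  return unless outcome.probability > PROBABILITY_THRESHOLD && outcome.reliable?

  outcome.language
end

confident  = FakeIdentifier.new(FakeResult.new(language: :es, probability: 0.9, reliable: true))
low_prob   = FakeIdentifier.new(FakeResult.new(language: :es, probability: 0.4, reliable: true))
unreliable = FakeIdentifier.new(FakeResult.new(language: :es, probability: 0.9, reliable: false))

detect("texto", identifier: confident)  # => :es
detect("texto", identifier: low_prob)   # => nil
detect("texto", identifier: unreliable) # => nil
```

Because the service takes the identifier as a keyword argument on the instance method, a similar fake could presumably be injected with `Languages::Detection.new(text).call(identifier: fake)`. The class-level `call(...)` forwards its arguments to `new`, so the keyword is not reachable through it.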
@@ -0,0 +1,5 @@
class AddHumanLanguageToArticles < ActiveRecord::Migration[7.0]
  def change
    add_column :articles, :language, :string
  end
end
@@ -0,0 +1,6 @@
class AddLanguageIndexToArticles < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!
  def change
    add_index :articles, :language, algorithm: :concurrently
  end
end
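
The two migrations add and index a `language` column, but the code that fills it is not visible in this part of the diff. As a rough, hypothetical sketch of how a caller might store the service's result, the plain-Ruby class below takes any object responding to `call(text)` as a detector. `ArticleSketch`, `detect_language!`, and the `detector` parameter are invented names, and none of this should be read as the actual `app/models/article.rb` change.

```ruby
# Hypothetical, framework-free stand-in for a record with a language column.
class ArticleSketch
  attr_accessor :title, :body, :language

  # detector: any callable that takes text and returns a language or nil.
  def initialize(title:, body:, detector:, language: nil)
    @title = title
    @body = body
    @detector = detector
    @language = language
  end

  # Fill in the language only when none is stored yet. A nil from the
  # detector (low probability or unreliable) leaves the attribute nil.
  def detect_language!
    self.language ||= @detector.call("#{title}. #{body}")
  end
end

# A lambda stands in for the detection service in this sketch.
always_spanish = ->(_text) { :es }

article = ArticleSketch.new(title: "Hola", body: "Esto es un texto.", detector: always_spanish)
article.detect_language!
article.language # => :es
```

Keeping the detector injectable is one possible design choice: it lets the storing code be exercised without loading the language model.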
@@ -0,0 +1,76 @@
require "rails_helper"

RSpec.describe Languages::Detection, type: :service do
  subject(:language_detection) { described_class.call(text) }

  context "when the text is clearly identifiable as English" do
    let(:text) { "This is clearly English text." }

    it "returns en" do
      expect(language_detection).to eq(:en)
    end
  end

  context "when the text is clearly identifiable as Spanish" do
    let(:text) { "Esto es claramente un texto en español." }

    it "returns es" do
      expect(language_detection).to eq(:es)
    end
  end

  context "when probability and reliability vary" do
    let(:text) { "This is some dummy text." }
    let(:identifier) { instance_double(CLD3::NNetLanguageIdentifier) }

    before do
      allow(CLD3::NNetLanguageIdentifier).to receive(:new).and_return(identifier)
      allow(identifier).to receive(:find_language).with(text).and_return(language_outcome)
    end

    context "when probability is low" do
      let(:language_outcome) do
        instance_double(
          CLD3::NNetLanguageIdentifier::Result,
          language: :es,
          probability: 0.4,
          reliable?: true
        )
      end

      it "returns nil" do
        expect(described_class.call(text)).to eq(nil)
      end
    end

    context "when reliability is low" do
      let(:language_outcome) do
        instance_double(
          'CLD3::NNetLanguageIdentifier::Result',
          language: :es,
          probability: 0.9,
          reliable?: false
        )
      end

      it "returns nil" do
        expect(described_class.call(text)).to be(nil)
      end
    end

    context "when probability and reliability are high" do
      let(:language_outcome) do
        instance_double(
          CLD3::NNetLanguageIdentifier::Result,
          language: :es,
          probability: 0.9,
          reliable?: true,
        )
      end

      it "returns es" do
        expect(described_class.call(text)).to eq(:es)
      end
    end
  end
end
Binary file not shown.