Language detection with CLD3 (forem#19756)
* Language detection with CLD3 POC

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update app/models/article.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Clean up tests

* Clean up tests

* Move language detection to service

* rubocop

* Update app/services/languages/detection.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* More linting cleanup

* Remove allow_any_instance_of

* Update tests to better exercise different paths

* Fix flaky spec

* Update spec/services/languages/detection_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/models/article_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/services/languages/detection_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/services/languages/detection_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/services/languages/detection_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update spec/services/languages/detection_spec.rb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
benhalpern and github-actions[bot] authored Jul 17, 2023
1 parent 77148d8 commit 8335a37
Showing 10 changed files with 145 additions and 1 deletion.
1 change: 1 addition & 0 deletions Gemfile
@@ -21,6 +21,7 @@ gem "blazer", "~> 2.6" # Allows admins to query data
gem "bootsnap", ">= 1.1.0", require: false # Boot large ruby/rails apps faster
gem "carrierwave", "~> 2.2" # Upload files in your Ruby applications, map them to a range of ORMs, store them on different backends
gem "carrierwave-bombshelter", "~> 0.2" # Protect your carrierwave from image bombs
gem "cld3", "~> 3.5" # Ruby interface for Compact Language Detector v3
gem "cloudinary", "~> 1.23" # Client library for easily using the Cloudinary service
gem "counter_culture", "~> 3.2" # counter_culture provides turbo-charged counter caches that are kept up-to-date
gem "ddtrace", "~> 1.3.0" # ddtrace is Datadog’s tracing client for Ruby.
2 changes: 2 additions & 0 deletions Gemfile.lock
@@ -158,6 +158,7 @@ GEM
fastimage
cgi (0.3.6)
chartkick (4.2.1)
cld3 (3.5.3)
cloudinary (1.26.0)
aws_cf_signer
rest-client (>= 2.0.0)
@@ -982,6 +983,7 @@ DEPENDENCIES
carrierwave (~> 2.2)
carrierwave-bombshelter (~> 0.2)
cgi (~> 0.3.6)
cld3 (~> 3.5)
cloudinary (~> 1.23)
counter_culture (~> 3.2)
cuprite (~> 0.13)
7 changes: 7 additions & 0 deletions app/models/article.rb
@@ -198,6 +198,7 @@ def self.unique_url_error
before_save :calculate_base_scores
before_save :fetch_video_duration
before_save :set_caches
before_save :detect_language
before_create :create_password
before_destroy :before_destroy_actions, prepend: true

@@ -625,6 +626,12 @@ def collection_cleanup
    collection.destroy
  end

  def detect_language
    return unless title_changed? || body_markdown_changed?

    self.language = Languages::Detection.call("#{title}. #{body_text}")
  end

  def search_score
    comments_score = (comments_count * 3).to_i
    partial_score = (comments_score + (public_reactions_count.to_i * 300 * user.reputation_modifier * score.to_i))
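
The `detect_language` callback above only re-runs detection when the title or body changed. A minimal plain-Ruby sketch of that gating, with no Rails involved: `SketchArticle`, its `detector:` argument, and the hand-rolled `@text_changed` flag are hypothetical stand-ins for the ActiveRecord model, `Languages::Detection`, and `title_changed? || body_markdown_changed?`.

```ruby
# Plain-Ruby stand-in for an ActiveRecord model; no Rails required.
# All names here are illustrative and not part of the commit.
class SketchArticle
  attr_reader :title, :body, :language

  def initialize(title:, body:, detector:)
    @title = title
    @body = body
    @detector = detector  # any object responding to #call(text)
    @text_changed = true  # a new record counts as changed
  end

  def title=(value)
    @text_changed = true if value != @title
    @title = value
  end

  def body=(value)
    @text_changed = true if value != @body
    @body = value
  end

  # Mimics the before_save hook, then resets the change flag.
  def save
    detect_language
    @text_changed = false
    true
  end

  private

  # Same guard shape as the callback in the diff: skip unless text changed.
  def detect_language
    return unless @text_changed

    @language = @detector.call("#{title}. #{body}")
  end
end

calls = []
detector = lambda do |text|
  calls << text
  :en
end

article = SketchArticle.new(title: "Hello", body: "World", detector: detector)
article.save          # first save runs detection
article.save          # nothing changed, detection is skipped
article.title = "Hi"
article.save          # title changed, detection runs again
puts calls.length     # prints 2
```

Saving without touching the title or body leaves `language` alone, which is the behaviour the third example in the `#detect_language` spec checks.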
22 changes: 22 additions & 0 deletions app/services/languages/detection.rb
@@ -0,0 +1,22 @@
module Languages
  class Detection
    attr_reader :text

    PROBABILITY_THRESHOLD = 0.5

    def self.call(...)
      new(...).call
    end

    def initialize(text)
      @text = text
    end

    def call(identifier: CLD3::NNetLanguageIdentifier.new(0, 1000))
      language_outcome = identifier.find_language(text)
      return unless language_outcome.probability > PROBABILITY_THRESHOLD && language_outcome.reliable?

      language_outcome.language
    end
  end
end
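
The service returns a language only when the outcome is both above `PROBABILITY_THRESHOLD` and marked reliable, and otherwise returns `nil`. A minimal sketch of that gating, runnable without the `cld3` gem: `FakeOutcome`, `FakeIdentifier`, `SketchDetection`, and `detect_with` are hypothetical names, and the identifier is injected through the constructor here purely for illustration (the service in the diff instead takes it as a keyword default on the instance-level `call`).

```ruby
# Hypothetical stand-ins: the real service uses CLD3::NNetLanguageIdentifier.
FakeOutcome = Struct.new(:language, :probability, :reliable, keyword_init: true) do
  def reliable?
    reliable
  end
end

# Always answers with the outcome it was built with.
class FakeIdentifier
  def initialize(outcome)
    @outcome = outcome
  end

  def find_language(_text)
    @outcome
  end
end

class SketchDetection
  PROBABILITY_THRESHOLD = 0.5

  # Forwards every argument to .new, the same pattern as the service above.
  def self.call(...)
    new(...).call
  end

  # Assumption for this sketch: the identifier is passed in explicitly.
  def initialize(text, identifier:)
    @text = text
    @identifier = identifier
  end

  def call
    outcome = @identifier.find_language(@text)
    # Strictly greater than the threshold AND flagged reliable, else nil.
    return unless outcome.probability > PROBABILITY_THRESHOLD && outcome.reliable?

    outcome.language
  end
end

def detect_with(probability:, reliable:)
  outcome = FakeOutcome.new(language: :es, probability: probability, reliable: reliable)
  SketchDetection.call("texto", identifier: FakeIdentifier.new(outcome))
end

p detect_with(probability: 0.9, reliable: true)   # prints :es
p detect_with(probability: 0.4, reliable: true)   # prints nil
p detect_with(probability: 0.9, reliable: false)  # prints nil
p detect_with(probability: 0.5, reliable: true)   # prints nil (comparison is strict)
```

Because the comparison is `>` rather than `>=`, an outcome sitting exactly on the threshold is discarded.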
5 changes: 5 additions & 0 deletions db/migrate/20230712195950_add_human_language_to_articles.rb
@@ -0,0 +1,5 @@
class AddHumanLanguageToArticles < ActiveRecord::Migration[7.0]
  def change
    add_column :articles, :language, :string
  end
end
6 changes: 6 additions & 0 deletions db/migrate/20230713150940_add_language_index_to_articles.rb
@@ -0,0 +1,6 @@
class AddLanguageIndexToArticles < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!
  def change
    add_index :articles, :language, algorithm: :concurrently
  end
end
4 changes: 3 additions & 1 deletion db/schema.rb
@@ -10,7 +10,7 @@
#
# It's strongly recommended that you check this file into your version control system.

-ActiveRecord::Schema[7.0].define(version: 2023_06_27_154435) do
+ActiveRecord::Schema[7.0].define(version: 2023_07_13_150940) do
# These are extensions that must be enabled in order to support this database
enable_extension "citext"
enable_extension "pg_trgm"
@@ -105,6 +105,7 @@
t.boolean "featured", default: false
t.string "feed_source_url"
t.integer "hotness_score", default: 0
t.string "language"
t.datetime "last_comment_at", precision: nil, default: "2017-01-01 05:00:00"
t.datetime "last_experience_level_rating_at", precision: nil
t.string "main_image"
@@ -160,6 +161,7 @@
t.index ["feed_source_url"], name: "index_articles_on_feed_source_url_unscoped"
t.index ["hotness_score", "comments_count"], name: "index_articles_on_hotness_score_and_comments_count"
t.index ["hotness_score"], name: "index_articles_on_hotness_score"
t.index ["language"], name: "index_articles_on_language"
t.index ["path"], name: "index_articles_on_path"
t.index ["public_reactions_count"], name: "index_articles_on_public_reactions_count", order: :desc
t.index ["published"], name: "index_articles_on_published"
23 changes: 23 additions & 0 deletions spec/models/article_spec.rb
@@ -1420,4 +1420,27 @@ def foo():
      end
    end
  end
  describe "#detect_language" do
    let(:detected_language) { :kl } # kl for Klingon

    before do
      allow(Languages::Detection).to receive(:call).and_return(detected_language)
    end

    it "detects language using title and body for newly created articles" do
      article = create(:article)
      expect(Languages::Detection).to have_received(:call).with("#{article.title}. #{article.body_text}")
    end

    it "detects language using title and body for updated articles" do
      article.update(body_markdown: "---title: This is a new english article\n---\n\n# Hello World")
      expect(Languages::Detection).to have_received(:call).with("#{article.title}. #{article.body_text}")
    end

    it "does not call detection when title and body_markdown are unchanged" do
      article.language = "es"
      article.update(nth_published_by_author: 5)
      expect(Languages::Detection).not_to have_received(:call)
    end
  end
end
76 changes: 76 additions & 0 deletions spec/services/languages/detection_spec.rb
@@ -0,0 +1,76 @@
require "rails_helper"

RSpec.describe Languages::Detection, type: :service do
  subject(:language_detection) { described_class.call(text) }

  context "when the text is clearly identifiable as English" do
    let(:text) { "This is clearly English text." }

    it "returns en" do
      expect(language_detection).to eq(:en)
    end
  end

  context "when the text is clearly identifiable as Spanish" do
    let(:text) { "Esto es claramente un texto en español." }

    it "returns es" do
      expect(language_detection).to eq(:es)
    end
  end

  context "when probability and reliability vary" do
    let(:text) { "This is some dummy text." }
    let(:identifier) { instance_double(CLD3::NNetLanguageIdentifier) }

    before do
      allow(CLD3::NNetLanguageIdentifier).to receive(:new).and_return(identifier)
      allow(identifier).to receive(:find_language).with(text).and_return(language_outcome)
    end

    context "when probability is low" do
      let(:language_outcome) do
        instance_double(
          CLD3::NNetLanguageIdentifier::Result,
          language: :es,
          probability: 0.4,
          reliable?: true
        )
      end

      it "returns nil" do
        expect(described_class.call(text)).to eq(nil)
      end
    end

    context "when reliability is low" do
      let(:language_outcome) do
        instance_double(
          'CLD3::NNetLanguageIdentifier::Result',
          language: :es,
          probability: 0.9,
          reliable?: false
        )
      end

      it "returns nil" do
        expect(described_class.call(text)).to be(nil)
      end
    end

    context "when probability and reliability are high" do
      let(:language_outcome) do
        instance_double(
          CLD3::NNetLanguageIdentifier::Result,
          language: :es,
          probability: 0.9,
          reliable?: true,
        )
      end

      it "returns es" do
        expect(described_class.call(text)).to eq(:es)
      end
    end
  end
end
Binary file added vendor/cache/cld3-3.5.3.gem
Binary file not shown.
