Support ingesting DOCX files using pandoc
oxaroky02 committed Jun 10, 2024
1 parent 63729a4 commit 552ce1e
Showing 4 changed files with 25 additions and 1 deletion.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -51,7 +51,7 @@ FROM base

 # Install packages needed for deployment
 RUN apt-get update -qq && \
-    apt-get install --no-install-recommends -y curl libsqlite3-0 libvips netcat-traditional libpq5 poppler-utils && \
+    apt-get install --no-install-recommends -y curl libsqlite3-0 libvips netcat-traditional libpq5 poppler-utils pandoc && \
     rm -rf /var/lib/apt/lists /var/cache/apt/archives

 # Copy built artifacts: gems, application
1 change: 1 addition & 0 deletions app/services/parsers.rb
@@ -2,6 +2,7 @@ module Parsers
   def self.parser_for(filename)
     name_locase = filename.downcase
     return Parsers::Pdf if name_locase.end_with?(".pdf")
+    return Parsers::Docx if name_locase.end_with?(".docx")
     return Parsers::Text if name_locase.end_with?(".txt", ".html", ".md")

     raise StandardError, "Unsupported file extension: '#{filename.slice(/\.\w+$/)}'"
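
To see how the dispatch behaves with the new branch, here is a minimal standalone sketch. It assumes stand-in classes in a hypothetical `ParsersSketch` module instead of the real `Parsers::Pdf`, `Parsers::Docx`, and `Parsers::Text`, so it can run outside the Rails app; the extension checks are copied from the hunk.

```ruby
# Stand-in namespace so the sketch runs outside the Rails app (hypothetical name).
module ParsersSketch
  # Empty placeholders for the real parser classes.
  class Pdf; end
  class Docx; end
  class Text; end

  # Same extension checks as in parsers.rb, against the stand-in classes.
  def self.parser_for(filename)
    name_locase = filename.downcase
    return Pdf if name_locase.end_with?(".pdf")
    return Docx if name_locase.end_with?(".docx")
    return Text if name_locase.end_with?(".txt", ".html", ".md")

    raise StandardError, "Unsupported file extension: '#{filename.slice(/\.\w+$/)}'"
  end
end

# The match is case-insensitive because the name is downcased first.
puts ParsersSketch.parser_for("Quarterly Report.DOCX")

# Other extensions, including legacy .doc, still raise.
begin
  ParsersSketch.parser_for("notes.doc")
rescue StandardError => e
  puts e.message
end
```

Note that only `.docx` is routed to the new parser; a legacy `.doc` file falls through to the error branch.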
22 changes: 22 additions & 0 deletions app/services/parsers/docx.rb
@@ -0,0 +1,22 @@
+module Parsers
+  class Docx
+    include BasicTextChunker
+
+    def initialize(document)
+      @document = document
+    end
+
+    def text
+      # NOTE: Using -raw opt causes text to be broken up a lot; but not using raw
+      # may cause tables to be "pretty" in text which may not be ideal for chunking.
+      # Not specifying works best for rotated pages, so doing that for now
+      cmd = 'pandoc -f docx --to=commonmark -'
+      txt, serr, status = Open3.capture3(cmd, stdin_data: @document.contents, binmode: true)
+      return txt if status.success?
+
+      Rails.logger.error("Error running '#{cmd}' on DOCX: #{@document.filename}\n#{serr}")
+
+      raise StandardError, "Error converting DOCX to text: '#{@document.filename}'"
+    end
+  end
+end
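
The parser's `text` method pipes the document bytes into an external command and returns its stdout, raising when the exit status is non-zero. A minimal standalone sketch of that pattern follows. The helper name `convert_via_stdin` is hypothetical and not part of the commit, and the demo pipes through the Ruby interpreter itself rather than pandoc so it runs on machines where pandoc is not installed. A standalone script needs an explicit `require "open3"`; the commit does not show where the app loads it.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: feed bytes to a command on stdin and return its stdout.
# Raises StandardError with the captured stderr when the command exits non-zero.
def convert_via_stdin(cmd, bytes)
  # binmode: true keeps the pipes binary, which matters for DOCX (zip) input.
  out, err, status = Open3.capture3(*cmd, stdin_data: bytes, binmode: true)
  return out if status.success?

  raise StandardError, "Command #{cmd.inspect} failed: #{err}"
end

# Demo command standing in for pandoc: upcase whatever arrives on stdin.
upcase_cmd = [RbConfig.ruby, "-e", "print STDIN.read.upcase"]
puts convert_via_stdin(upcase_cmd, "hello docx")

# With pandoc installed, the equivalent of the commit's command would be:
#   convert_via_stdin(%w[pandoc -f docx --to=commonmark -], File.binread("report.docx"))
```

Passing the command as an array, as the sketch does, bypasses the shell; the commit passes a single string, which is also fine here because the command contains no user input.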
1 change: 1 addition & 0 deletions ops.yml
@@ -2,6 +2,7 @@ dependencies:
   brew:
     - overmind
     - poppler
+    - pandoc
   custom:
     - bundle config --local path vendor/bundle
     - bundle config set --local build.pg ${PG_OPTS}