
Doc classifier #37
Merged (5 commits) on Sep 26, 2022

Conversation

@ankrgyl (Contributor) commented Sep 20, 2022

Replace the hacked image classification pipeline with a new DocumentClassificationPipeline, which is loosely based on the DocumentQuestionAnsweringPipeline and TextClassificationPipeline.
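
For anyone trying this out, here is a rough usage sketch of the new pipeline (the load_document call and the file path are illustrative; the image= keyword and the label/score keys match the pipeline code discussed in the review below):

from docquery import document, pipeline

# Load a document and classify it with the new pipeline.
doc = document.load_document("path/to/document.pdf")
classify = pipeline("document-classification")
prediction = classify(image=doc.preview[0])[0]
print(prediction["label"], prediction["score"])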

@jeffsrobertson

Nice! Will start digging in now 👍

@jeffsrobertson

There are two instances of "image-classification" that need to be replaced by "document-classification" in src/docquery/cmd/scan.py. Specifically, here and here.

I think a good unit test in the future would be to run each of the pipelines from (a) the CLI and (b) the Python module and verify that the returned answers match, which would also catch either entry point being inadvertently broken. A rough sketch of what that could look like follows.
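
Something along these lines, perhaps (very rough sketch; it assumes a docquery console script with a scan subcommand, a sample fixture document, and that scan still prints a "Document Type: <label>" line -- the exact arguments and output parsing would need adjusting):

import subprocess

from docquery import document, pipeline


def test_cli_and_module_agree():
    path = "tests/fixtures/sample.pdf"  # hypothetical fixture

    # (a) Python module entry point
    doc = document.load_document(path)
    classify = pipeline("document-classification")
    module_label = classify(image=doc.preview[0])[0]["label"]

    # (b) CLI entry point
    result = subprocess.run(
        ["docquery", "scan", "--classify", path],
        capture_output=True, text=True, check=True,
    )

    # scan prints "<path> Document Type: <label>", so the module's label
    # should show up in the CLI output.
    assert f"Document Type: {module_label}" in result.stdout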

@jeffsrobertson

This line needs the parameter images changed to image.

@jeffsrobertson left a comment

Looks great 👍 I left plenty of comments, many of which are nits and can probably be ignored, but I think I also identified a few breaking bugs that need to be addressed before merging. In particular, docquery/cmd/scan.py needs a few very easy fixes to get the CLI working again with the --classify flag. There are also a couple of items I need to look into further on my end, which I've referenced in comments.

👍 Let me know if you'd like me to clarify or expand on any of the comments I left!



PIPELINE_DEFAULTS = {
    "document-question-answering": "impira/layoutlm-document-qa",
    "image-classification": "naver-clova-ix/donut-base-finetuned-rvlcdip",
    "document-classification": "impira/layoutlm-document-classification",

I believe this should be "impira/layoutlm-document-classifier" (either that or I've misunderstood how registering models with huggingface works, also a possibility). The default model name you have here raises the following error:

OSError: impira/layoutlm-document-classification is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

Whereas impira/layoutlm-document-classifier is a valid model identifier (found here), which I assume is the model you're trying to set as default.

@ankrgyl (Author) replied:

Good catch

from transformers.utils import (
    ExplicitEnum,
    add_end_docstrings,
    is_tf_available,

nit: There are some imports in this file that are no longer used.

@ankrgyl (Author) replied:

fixed

"""
# XXX Refactor this or combine with document_question_answering
if isinstance(image, list):
normalized_images = (i if isinstance(i, tuple) or isinstance(i, list) else (i, None) for i in image)

nit: isinstance(i, tuple) or isinstance(i, list) -> isinstance(i, (tuple, list))

@ankrgyl (Author) replied:

Fixed

max_seq_len=None,
function_to_apply=None,
top_k=None,
**kwargs,

kwargs isn't currently used anywhere

@ankrgyl (Author) replied:

Good catch.

tesseract_config: Optional[str] = None,
max_seq_len=None,
function_to_apply=None,
top_k=None,

I took a quick look at huggingface's TextClassificationPipeline and it looks like they set the default value for top_k to an empty string, as None is reserved for something else. Not sure if that would make sense for our use case, but if you wanted to more closely follow huggingface's patterns: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L77

@ankrgyl (Author) replied:

Interesting, thanks for pointing that out. I think that is some legacy code that is probably still around for back compat.

I mostly followed the newer image pipeline which I think we should do here too.

@ankrgyl (Author) replied:

I set the default to 1 to roughly match this behavior.

return_overflowing_tokens=True,
)

num_spans = len(encoding["input_ids"])

nit: Sometimes we access members of encoding by dict key (e.g. encoding["input_ids"]), and sometimes by attribute (e.g. encoding.input_ids). I think we should pick one or the other, for code consistency.

@ankrgyl (Author) replied:

I'm going to leave this as is for consistency with the document question answering pipeline.

elif i == self.tokenizer.sep_token_id:
    bbox.append([1000] * 4)
else:
    bbox.append([0] * 4)

This might be somewhat of a nit, but I believe this else: condition currently assigns [0, 0, 0, 0] to both the [CLS] token and padded tokens. If we create global variables for the default CLS and SEP bboxes, we should also create one for DEFAULT_PAD_BBOX = [0, 0, 0, 0] and be more explicit in the code about when we're handling the [CLS] token vs pad tokens (even if they happen to have the same default bbox value for our use case, since this might not be true for other tokenizers).

@ankrgyl (Author) replied:

We assign different things in the document question answering pipeline, but I changed it to this to match what's in the Impira repo. Can you clarify whether there are any SEP tokens in document classification?

@jeffsrobertson replied:

For doc classification, the first token of the token sequence is the [CLS] token and the last one is the [SEP] token, with every token in between assigned 0 for its sequence id.

I would personally remove the dependence on sequence_ids entirely, as it's not really useful for classification, and write this if block like the following:

if i == self.tokenizer.cls_token_id:
    bbox.append(DEFAULT_CLS_BBOX)
elif i == self.tokenizer.sep_token_id:
    bbox.append(DEFAULT_SEP_BBOX)
elif i == self.tokenizer.pad_token_id:
    bbox.append(DEFAULT_PAD_BBOX)
else:
    bbox.append(boxes[w])

Note that the above code can also be used in the QA pipeline (although I believe an additional condition needs to be added to assign the appropriate bboxes for the "question" part of the token sequence).
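
To make that concrete, the QA version might look roughly like the block below. This is only a sketch: s stands for the sequence id of the current token (the question comes first in the QA encoding, so its tokens have sequence id 0), and DEFAULT_QUESTION_BBOX is a made-up name for whatever default box question tokens should get; boxes, w, and the other DEFAULT_* constants are the same as in the snippet above.

if i == self.tokenizer.cls_token_id:
    bbox.append(DEFAULT_CLS_BBOX)
elif i == self.tokenizer.sep_token_id:
    bbox.append(DEFAULT_SEP_BBOX)
elif i == self.tokenizer.pad_token_id:
    bbox.append(DEFAULT_PAD_BBOX)
elif s == 0:
    # Question tokens have no position on the page, so give them a default box.
    bbox.append(DEFAULT_QUESTION_BBOX)
else:
    # Document tokens keep the OCR bounding box of the word they came from.
    bbox.append(boxes[w])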

if self.model_type == ModelType.VisionEncoderDecoder:
    model_outputs = self.model.generate(**model_inputs)
else:
    model_outputs = self.model(**model_inputs)

nit: It's a bit confusing that model_outputs in this method is a ModelOutput object, but model_outputs in the postprocess() method is a list of ModelOutputs. If I were writing this I would tweak the variable names to try and reflect this, but I won't push too hard on this 😄

@ankrgyl (Author) replied:

I'll change it -- I think that is valid feedback.

model_outputs, function_to_apply=function_to_apply, top_k=top_k, **kwargs
)

answers = sorted(answers, key=lambda x: x.get("score", 0), reverse=True)[:top_k]

Will this break if top_k > len(answers)?

@ankrgyl (Author) replied:

I don't think so?

>>> a = [1, 2, 3]
>>> a[:10]
[1, 2, 3]

@jeffsrobertson

Below is a diff that applies a couple minor changes to docquery/cmd/scan.py to get the CLI working again. Could you include these changes in your PR?

diff --git a/src/docquery/cmd/scan.py b/src/docquery/cmd/scan.py
index 640ee9b..4e3223e 100644
--- a/src/docquery/cmd/scan.py
+++ b/src/docquery/cmd/scan.py
@@ -36,7 +36,7 @@ def build_parser(subparsers, parent_parser):
     parser.add_argument(
         "--classify-checkpoint",
         default=None,
-        help=f"A custom model checkpoint to use (other than {PIPELINE_DEFAULTS['image-classification']})",
+        help=f"A custom model checkpoint to use (other than {PIPELINE_DEFAULTS['document-classification']})",
     )
 
     parser.set_defaults(func=main)
@@ -65,7 +65,7 @@ def main(args):
     log.info("Ready to start evaluating!")
 
     if args.classify:
-        classify = pipeline("image-classification", model=args.classify_checkpoint)
+        classify = pipeline("document-classification", model=args.classify_checkpoint)
 
     max_fname_len = max(len(str(p)) for (p, d) in docs)
     max_question_len = max(len(q) for q in args.questions) if len(args.questions) > 0 else 0
@@ -73,7 +73,7 @@ def main(args):
         if i > 0 and len(args.questions) > 1:
             print("")
         if args.classify:
-            cls = classify(images=d.preview[0])[0]
+            cls = classify(image=d.preview[0])[0]
             print(f"{str(p):<{max_fname_len}} Document Type: {cls['label']}")
 
         for q in args.questions:

@jeffsrobertson

Everything else looks good, thanks for the quick changes. After you've applied the above diff and had a chance to look at this comment (though I'll leave it up to you whether to make that change), I'm happy to give a ✅!

@ankrgyl (Author) commented Sep 26, 2022

@jeffsrobertson just applied your changes -- please take a look when you have a chance.

@jeffsrobertson left a comment

Awesome, everything looks good to me! You have my ✅

@ankrgyl merged commit 9012b11 into main on Sep 26, 2022