
Doc classifier #37
Merged (5 commits) on Sep 26, 2022

Conversation

@ankrgyl (Contributor) commented Sep 20, 2022

Replace the hacked image classification pipeline with a new DocumentClassificationPipeline, which is loosely based on the DocumentQuestionAnsweringPipeline and TextClassificationPipeline.
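
For anyone trying this out, here is a rough usage sketch of the new pipeline (the load_document call and the file path are illustrative; the image= keyword and the label/score keys match the pipeline code discussed in the review below):

from docquery import document, pipeline

# Load a document and classify it with the new pipeline.
doc = document.load_document("path/to/document.pdf")
classify = pipeline("document-classification")
prediction = classify(image=doc.preview[0])[0]
print(prediction["label"], prediction["score"])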

@jeffsrobertson

Nice! Will start digging in now 👍

@jeffsrobertson

There are two instances of "image-classification" that need to be replaced by "document-classification" in src/docquery/cmd/scan.py. Specifically, here and here.

I think a good unit test in the future would be to run each of the pipelines from (a) the CLI and (b) the Python module and verify that the returned answers match, which would also catch either entry point being inadvertently broken. A rough sketch of what that could look like follows.
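
Something along these lines, perhaps (very rough sketch; it assumes a docquery console script with a scan subcommand, a sample fixture document, and that scan still prints a "Document Type: <label>" line -- the exact arguments and output parsing would need adjusting):

import subprocess

from docquery import document, pipeline


def test_cli_and_module_agree():
    path = "tests/fixtures/sample.pdf"  # hypothetical fixture

    # (a) Python module entry point
    doc = document.load_document(path)
    classify = pipeline("document-classification")
    module_label = classify(image=doc.preview[0])[0]["label"]

    # (b) CLI entry point
    result = subprocess.run(
        ["docquery", "scan", "--classify", path],
        capture_output=True, text=True, check=True,
    )

    # scan prints "<path> Document Type: <label>", so the module's label
    # should show up in the CLI output.
    assert f"Document Type: {module_label}" in result.stdout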

@jeffsrobertson

This line needs the parameter images changed to image.

@jeffsrobertson left a comment

Looks great 👍 I left plenty of comments, many of which are nits and can probably be ignored, but I think I also identified a few breaking bugs that need to be addressed before merging. In particular, docquery/cmd/scan.py needs a few very easy fixes to get the CLI working again with the --classify flag. There are also a couple of items I need to look into further on my end, which I've referenced in comments.

👍 Let me know if you'd like me to clarify or expand on any of the comments I left!



PIPELINE_DEFAULTS = {
    "document-question-answering": "impira/layoutlm-document-qa",
    "image-classification": "naver-clova-ix/donut-base-finetuned-rvlcdip",
    "document-classification": "impira/layoutlm-document-classification",

I believe this should be "impira/layoutlm-document-classifier" (either that or I've misunderstood how registering models with huggingface works, also a possibility). The default model name you have here raises the following error:

OSError: impira/layoutlm-document-classification is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

Whereas impira/layoutlm-document-classifier is a valid model identifier (found here), which I assume is the model you're trying to set as default.

@ankrgyl (Author) replied:

Good catch

from transformers.utils import (
    ExplicitEnum,
    add_end_docstrings,
    is_tf_available,

nit: There are some imports in this file that are no longer used.

@ankrgyl (Author) replied:

fixed

"""
# XXX Refactor this or combine with document_question_answering
if isinstance(image, list):
normalized_images = (i if isinstance(i, tuple) or isinstance(i, list) else (i, None) for i in image)

nit: isinstance(i, tuple) or isinstance(i, list) -> isinstance(i, (tuple, list))

@ankrgyl (Author) replied:

Fixed

max_seq_len=None,
function_to_apply=None,
top_k=None,
**kwargs,

kwargs isn't currently used anywhere

@ankrgyl (Author) replied:

Good catch.

tesseract_config: Optional[str] = None,
max_seq_len=None,
function_to_apply=None,
top_k=None,

I took a quick look at huggingface's TextClassificationPipeline and it looks like they set the default value for top_k to an empty string, as None is reserved for something else. Not sure if that would make sense for our use case, but if you wanted to more closely follow huggingface's patterns: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L77

@ankrgyl (Author) replied:

Interesting, thanks for pointing that out. I think that is some legacy code that is probably still around for back compat.

I mostly followed the newer image pipeline which I think we should do here too.

@ankrgyl (Author) replied:

I set the default to 1 to roughly match this behavior.

return_overflowing_tokens=True,
)

num_spans = len(encoding["input_ids"])

nit: Sometimes we access members of encoding by dict key (e.g. encoding["input_ids"]), and sometimes by attribute (e.g. encoding.input_ids). I think we should pick one or the other, for code consistency.

@ankrgyl (Author) replied:

I'm going to leave this as is for consistency with the document question answering pipeline.

elif i == self.tokenizer.sep_token_id:
    bbox.append([1000] * 4)
else:
    bbox.append([0] * 4)

This might be somewhat of a nit, but I believe this else: condition currently assigns [0, 0, 0, 0] to both the [CLS] token and padded tokens. If we create global variables for the default CLS and SEP bboxes, we should also create one for DEFAULT_PAD_BBOX = [0, 0, 0, 0] and be more explicit in the code about when we're handling the [CLS] token vs pad tokens (even if they happen to have the same default bbox value for our use case, since this might not be true for other tokenizers).

@ankrgyl (Author) replied:

We assign different things in the document question answering pipeline, but I changed it to this to match what's in the Impira repo. Can you clarify whether there are any SEP tokens in document classification?

@jeffsrobertson replied:

For doc classification, the first token of the token sequence is the [CLS] token and the last one is the [SEP] token, with every token in between assigned 0 for its sequence id.

I would personally remove the dependence on sequence_ids entirely, as it's not really useful for classification, and write this if block like the following:

if i == self.tokenizer.cls_token_id:
    bbox.append(DEFAULT_CLS_BBOX)
elif i == self.tokenizer.sep_token_id:
    bbox.append(DEFAULT_SEP_BBOX)
elif i == self.tokenizer.pad_token_id:
    bbox.append(DEFAULT_PAD_BBOX)
else:
    bbox.append(boxes[w])

Note that the above code can also be used in the QA pipeline (although I believe an additional condition needs to be added to assign the appropriate bboxes for the "question" part of the token sequence).
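
To make that concrete, the QA version might look roughly like the block below. This is only a sketch: s stands for the sequence id of the current token (the question comes first in the QA encoding, so its tokens have sequence id 0), and DEFAULT_QUESTION_BBOX is a made-up name for whatever default box question tokens should get; boxes, w, and the other DEFAULT_* constants are the same as in the snippet above.

if i == self.tokenizer.cls_token_id:
    bbox.append(DEFAULT_CLS_BBOX)
elif i == self.tokenizer.sep_token_id:
    bbox.append(DEFAULT_SEP_BBOX)
elif i == self.tokenizer.pad_token_id:
    bbox.append(DEFAULT_PAD_BBOX)
elif s == 0:
    # Question tokens have no position on the page, so give them a default box.
    bbox.append(DEFAULT_QUESTION_BBOX)
else:
    # Document tokens keep the OCR bounding box of the word they came from.
    bbox.append(boxes[w])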

if self.model_type == ModelType.VisionEncoderDecoder:
    model_outputs = self.model.generate(**model_inputs)
else:
    model_outputs = self.model(**model_inputs)

nit: It's a bit confusing that model_outputs in this method is a ModelOutput object, but model_outputs in the postprocess() method is a list of ModelOutputs. If I were writing this I would tweak the variable names to try and reflect this, but I won't push too hard on this 😄

@ankrgyl (Author) replied:

I'll change it -- I think that is valid feedback.

model_outputs, function_to_apply=function_to_apply, top_k=top_k, **kwargs
)

answers = sorted(answers, key=lambda x: x.get("score", 0), reverse=True)[:top_k]

Will this break if top_k > len(answers)?

@ankrgyl (Author) replied:

I don't think so?

>>> a = [1, 2, 3]
>>> a[:10]
[1, 2, 3]

@jeffsrobertson

Below is a diff that applies a couple minor changes to docquery/cmd/scan.py to get the CLI working again. Could you include these changes in your PR?

diff --git a/src/docquery/cmd/scan.py b/src/docquery/cmd/scan.py
index 640ee9b..4e3223e 100644
--- a/src/docquery/cmd/scan.py
+++ b/src/docquery/cmd/scan.py
@@ -36,7 +36,7 @@ def build_parser(subparsers, parent_parser):
     parser.add_argument(
         "--classify-checkpoint",
         default=None,
-        help=f"A custom model checkpoint to use (other than {PIPELINE_DEFAULTS['image-classification']})",
+        help=f"A custom model checkpoint to use (other than {PIPELINE_DEFAULTS['document-classification']})",
     )
 
     parser.set_defaults(func=main)
@@ -65,7 +65,7 @@ def main(args):
     log.info("Ready to start evaluating!")
 
     if args.classify:
-        classify = pipeline("image-classification", model=args.classify_checkpoint)
+        classify = pipeline("document-classification", model=args.classify_checkpoint)
 
     max_fname_len = max(len(str(p)) for (p, d) in docs)
     max_question_len = max(len(q) for q in args.questions) if len(args.questions) > 0 else 0
@@ -73,7 +73,7 @@ def main(args):
         if i > 0 and len(args.questions) > 1:
             print("")
         if args.classify:
-            cls = classify(images=d.preview[0])[0]
+            cls = classify(image=d.preview[0])[0]
             print(f"{str(p):<{max_fname_len}} Document Type: {cls['label']}")
 
         for q in args.questions:

@jeffsrobertson

Everything else looks good, thanks for the quick changes. After you've applied the above diff and had a chance to look at this comment (though I'll leave it up to you whether to make that change), I'm happy to give a ✅!

@ankrgyl (Author) commented Sep 26, 2022

@jeffsrobertson just applied your changes -- please take a look when you have a chance.

@jeffsrobertson left a comment

Awesome, everything looks good to me! You have my ✅

@ankrgyl merged commit 9012b11 into main on Sep 26, 2022