Doc classifier #37
Conversation
Nice! Will start digging in now 👍

There are two instances of

I think a good unit test in the future would be to go through and run each of the pipelines from (a) the CLI and (b) the Python module and verify that the returned answers match, as well as verifying that neither entry point has been inadvertently broken.
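A future parity test along the lines suggested above might look something like the following sketch. All names here are illustrative, not docquery's real API: the actual test would shell out to the `docquery` CLI and call the pipeline from Python, both of which are stubbed below so only the structure of the check is shown.

```python
# Hypothetical CLI-vs-module parity test. The real version would invoke
# the docquery CLI via subprocess and call the pipeline module directly;
# both entry points are stubbed here to keep the sketch self-contained.
import json
import subprocess
import sys


def run_via_cli(document):
    # Stand-in for: subprocess.run(["docquery", "scan", ...]).
    # Here we just spawn a Python one-liner that emits a JSON result.
    out = subprocess.run(
        [
            sys.executable,
            "-c",
            f"import json; print(json.dumps({{'answer': len({document!r})}}))",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(out.stdout)


def run_via_module(document):
    # Stand-in for calling the pipeline from Python directly.
    return {"answer": len(document)}


def test_entry_points_agree():
    doc = "sample.pdf"
    assert run_via_cli(doc) == run_via_module(doc)
```

The same pattern would be repeated once per pipeline, so a regression in either entry point shows up as a mismatch.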
This line needs the parameter
Looks great 👍 I left plenty of comments, many of which are nits and can probably be ignored, but I think I also identified a few breaking bugs that need to be addressed before merging. In particular, docquery/cmd/scan.py requires a few very easy fixes in order to get the CLI working again with the --classify flag. I also have a couple of items that I need to look further into on my end of this work, which I've left comments referencing.

👍 Let me know if you'd like me to clarify or expand on any of the comments I left!
src/docquery/transformers_patch.py (outdated)

    PIPELINE_DEFAULTS = {
        "document-question-answering": "impira/layoutlm-document-qa",
        "image-classification": "naver-clova-ix/donut-base-finetuned-rvlcdip",
        "document-classification": "impira/layoutlm-document-classification",
I believe this should be "impira/layoutlm-document-classifier" (either that, or I've misunderstood how registering models with huggingface works, which is also a possibility). The default model name you have here raises the following error:

    OSError: impira/layoutlm-document-classification is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

Whereas impira/layoutlm-document-classifier is a valid model identifier (found here), which I assume is the model you're trying to set as default.
Good catch
    from transformers.utils import (
        ExplicitEnum,
        add_end_docstrings,
        is_tf_available,
nit: There are some imports in this file that are no longer used.
Fixed
""" | ||
# XXX Refactor this or combine with document_question_answering | ||
if isinstance(image, list): | ||
normalized_images = (i if isinstance(i, tuple) or isinstance(i, list) else (i, None) for i in image) |
nit: isinstance(i, tuple) or isinstance(i, list) -> isinstance(i, (tuple, list))
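As a quick illustration of this suggestion: isinstance accepts a tuple of types, so both forms behave identically (toy inputs below, not the pipeline's real data):

```python
xs = [("a", [[0, 0, 1, 1]]), "b", ["c", None]]

# Original form: two separate isinstance checks.
norm1 = [x if isinstance(x, tuple) or isinstance(x, list) else (x, None) for x in xs]

# Suggested form: isinstance with a tuple of types.
norm2 = [x if isinstance(x, (tuple, list)) else (x, None) for x in xs]

assert norm1 == norm2
print(norm2)  # [('a', [[0, 0, 1, 1]]), ('b', None), ['c', None]]
```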
Fixed
    max_seq_len=None,
    function_to_apply=None,
    top_k=None,
    **kwargs,
kwargs isn't currently used anywhere
Good catch.
    tesseract_config: Optional[str] = None,
    max_seq_len=None,
    function_to_apply=None,
    top_k=None,
I took a quick look at huggingface's TextClassificationPipeline and it looks like they set the default value for top_k to an empty string, as None is reserved for something else. Not sure if that would make sense for our use case, but if you wanted to more closely follow huggingface's patterns: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L77
Interesting, thanks for pointing that out. I think that is some legacy code that is probably still around for backwards compatibility. I mostly followed the newer image pipeline, and I think we should do the same here.
I set the default to 1 to match roughly this behavior.
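For reference, here is a minimal pure-Python sketch of what a classification postprocess step with a top_k=1 default can look like (this is a stand-in to illustrate the behavior being discussed, not the actual pipeline code; the labels and logits are made up):

```python
import math


def postprocess(logits, labels, top_k=1):
    """Softmax over raw logits, then return the top_k labels by score.

    With the default top_k=1 this returns a single best answer, roughly
    matching the behavior described above.
    """
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "score": score} for label, score in ranked[:top_k]]


result = postprocess([2.0, 1.0, 0.0], ["invoice", "letter", "resume"])
print(result[0]["label"])  # invoice
```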
        return_overflowing_tokens=True,
    )

    num_spans = len(encoding["input_ids"])
nit: Sometimes we access members of encoding by dict keys (i.e. encoding["input_ids"]), and sometimes by attribute (i.e. encoding.input_ids). I think we should pick one or the other, for code consistency.
I'm going to leave this as is, for consistency with the document question answering pipeline.
    elif i == self.tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
This might be somewhat of a nit, but I believe this else: condition currently assigns [0, 0, 0, 0] to both the [CLS] token as well as padded tokens. If we create global variables for the default CLS and SEP tokens, we should also create one for DEFAULT_PAD_BBOX = [0, 0, 0, 0] and be more explicit in the code logic about when we're handling the [CLS] token vs pad tokens (even if they happen to have the same default bbox value for our use case, since this might not be true for other tokenizers).
We assign different things in the document question answering pipeline, but I changed it to this to match what's in the Impira repo. Can you clarify whether there are any SEP tokens in document classification?
For doc classification, the first token of the token sequence is the [CLS] token and the last one is the [SEP] token, with every token in between assigned 0 for its sequence id.

I would personally remove the dependence on sequence_ids entirely, as it's not really useful for classification, and write this if block like the following:

    if i == self.tokenizer.cls_token_id:
        bbox.append(DEFAULT_CLS_BBOX)
    elif i == self.tokenizer.sep_token_id:
        bbox.append(DEFAULT_SEP_BBOX)
    elif i == self.tokenizer.pad_token_id:
        bbox.append(DEFAULT_PAD_BBOX)
    else:
        bbox.append(boxes[w])

Note that the above code can also be used in the QA pipeline (although I believe an additional condition needs to be added to add the appropriate bboxes for the "question" part of the token sequence).
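To make the suggestion above concrete, here is a self-contained sketch of the special-token handling as a standalone function. The token ids and default boxes are placeholders chosen for illustration, not any particular tokenizer's real values:

```python
# Placeholder ids standing in for a tokenizer's cls/sep/pad token ids.
CLS_TOKEN_ID, SEP_TOKEN_ID, PAD_TOKEN_ID = 101, 102, 0

# Explicit defaults for every special token, as suggested above. CLS and
# PAD happen to share a value here, but keeping them separate makes the
# intent clear for tokenizers where they differ.
DEFAULT_CLS_BBOX = [0, 0, 0, 0]
DEFAULT_SEP_BBOX = [1000, 1000, 1000, 1000]
DEFAULT_PAD_BBOX = [0, 0, 0, 0]


def build_bboxes(input_ids, word_boxes, word_ids):
    """Assign a bounding box to every token: special tokens get fixed
    defaults, and word tokens get the box of the word they came from."""
    bboxes = []
    for token_id, word_idx in zip(input_ids, word_ids):
        if token_id == CLS_TOKEN_ID:
            bboxes.append(DEFAULT_CLS_BBOX)
        elif token_id == SEP_TOKEN_ID:
            bboxes.append(DEFAULT_SEP_BBOX)
        elif token_id == PAD_TOKEN_ID:
            bboxes.append(DEFAULT_PAD_BBOX)
        else:
            bboxes.append(word_boxes[word_idx])
    return bboxes
```

Because each special token is matched on its own id, padded tokens no longer fall through to the same branch as [CLS], which was the ambiguity flagged in the review.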
    if self.model_type == ModelType.VisionEncoderDecoder:
        model_outputs = self.model.generate(**model_inputs)
    else:
        model_outputs = self.model(**model_inputs)
nit: It's a bit confusing that model_outputs in this method is a ModelOutput object, but model_outputs in the postprocess() method is a list of ModelOutputs. If I were writing this I would tweak the variable names to try and reflect this, but I won't push too hard on this 😄
I'll change it -- I think that is valid feedback.
        model_outputs, function_to_apply=function_to_apply, top_k=top_k, **kwargs
    )

    answers = sorted(answers, key=lambda x: x.get("score", 0), reverse=True)[:top_k]
Will this break if top_k > len(answers)?
I don't think so?

    >>> a = [1, 2, 3]
    >>> a[:10]
    [1, 2, 3]
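Expanding on that reply: Python list slicing silently clamps an oversized upper bound, and list[:None] returns the whole list, so the sorted(...)[:top_k] line is safe both when top_k exceeds len(answers) and when top_k is None (example data made up for illustration):

```python
answers = [
    {"label": "invoice", "score": 0.9},
    {"label": "letter", "score": 0.1},
]
top = sorted(answers, key=lambda x: x.get("score", 0), reverse=True)

assert top[:10] == top    # oversized bound: no error, returns everything
assert top[:None] == top  # top_k=None: also returns everything
assert top[:1] == [{"label": "invoice", "score": 0.9}]  # top_k=1: best answer only
```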
Below is a diff that applies a couple of minor changes to

Everything else looks good, thanks for the quick changes. After you've applied the above diff and had a chance to look at this comment (I'll leave it up to you if you want to make this change or not, though), I'm happy to give a ✅!
Force-pushed from 4bef4e4 to c2e9945
@jeffsrobertson just applied your changes -- please take a look when you have a chance.
Awesome, everything looks good to me! You have my ✅
Replace the hacked image classification pipeline with a new DocumentClassificationPipeline, which is loosely based on the DocumentQuestionAnsweringPipeline and TextClassificationPipeline.