cfg: other languages compatibility #66

nlpirate · 2022-10-07T06:24:58Z

I am interested in adapting lambeq (and consequently DisCoCat) to languages other than English, particularly Italian. As far as I can tell from the documentation, the framework is built, from a linguistic point of view, on CFG grammars. I know this formalism very well. How can I inspect and extract the grammar rules and structures from the library so that they can be extended or modified?

chirico85 · 2023-01-31T16:37:48Z

Good point.
In my case German would be interesting.

Thommy257 · 2023-02-06T14:52:04Z

Hi @nlpirate and @chirico85,

Our parser is based on the combinatory categorial grammar (CCG) formalism, which is a bit different from context-free grammar (CFG). While CFGs are generative, i.e. they produce valid sentences, CCG models are used to infer grammar trees from well-formed sentences. Hence, CCGs are parsable, which we leverage in our BobcatParser.

Bobcat works in two stages: First, we apply a BERT model to determine the most likely CCG types per word. The outcome of that step is a weighted list of the k most likely types. After that, we apply a deterministic chart parser that aims to find the most probable CCG reduction tree from the possible word types. You can read more about our parser here.
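
The two stages can be sketched in plain Python. This is only an illustration of the idea, not Bobcat's actual code: the `TOY_TAGGER` table stands in for the BERT model with made-up probabilities, the chart parser uses only forward and backward application, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str  # e.g. "n", "np", "s"

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Functor:
    result: "Cat"
    slash: str  # "/" expects its argument on the right, "\\" on the left
    arg: "Cat"

    def __str__(self):
        return f"({self.result}{self.slash}{self.arg})"


Cat = Union[Atom, Functor]

S, NP, N = Atom("s"), Atom("np"), Atom("n")
IV = Functor(S, "\\", NP)   # s\np
TV = Functor(IV, "/", NP)   # (s\np)/np
DET = Functor(NP, "/", N)   # np/n

# Stage 1 stand-in: weighted candidate types per word (invented numbers).
TOY_TAGGER = {
    "Alice": [(NP, 0.9), (N, 0.1)],
    "likes": [(TV, 0.7), (IV, 0.3)],
    "the": [(DET, 1.0)],
    "cat": [(N, 0.8), (NP, 0.2)],
}


def combine(left, right):
    # Forward application: X/Y followed by Y gives X.
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application: Y followed by X\Y gives X.
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def parse(words, k=2):
    """Stage 2: CKY-style chart parse keeping the best derivation per category."""
    n = len(words)
    chart = {}  # (i, j) -> {category: (probability, tree)} for words[i:j]
    for i, word in enumerate(words):
        top_k = sorted(TOY_TAGGER[word], key=lambda t: -t[1])[:k]
        chart[i, i + 1] = {cat: (p, (str(cat), word)) for cat, p in top_k}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for m in range(i + 1, j):
                for lcat, (lp, ltree) in chart[i, m].items():
                    for rcat, (rp, rtree) in chart[m, j].items():
                        res = combine(lcat, rcat)
                        if res is None:
                            continue
                        p = lp * rp
                        if res not in cell or p > cell[res][0]:
                            cell[res] = (p, (str(res), ltree, rtree))
            chart[i, j] = cell
    return chart[0, n].get(S)  # None if no sentence-level derivation exists


prob, tree = parse("Alice likes the cat".split())
```

With `k=1` only the top type per word survives, so "Alice likes" (which needs the less likely intransitive reading of "likes") no longer parses; that is why the tagger hands the chart parser several candidates.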

As you can see, we use a statistical model for the first step, which needs to be trained on data. The data we use to train Bobcat is CCGbank, a translation of the Penn Treebank into CCG derivations. Hence, to support other languages, we would need such a CCG bank for each language, which might require a lot of work.

However, if you don't need a fully comprehensive CCG parser, you can always create your own (deterministic) parser based on our abstract CCGParser class:

lambeq/lambeq/text2diagram/ccg_parser.py

Line 34 in 70a1fe8

class CCGParser(Reader):
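
To give a rough idea of what such a deterministic parser could look like for a tiny Italian fragment, here is a minimal sketch. It does not import lambeq: the lexicon, the category encoding and the function names are all hypothetical, and the greedy shift-reduce strategy is just one simple choice. A real integration would subclass CCGParser and return lambeq's own tree objects; that wiring is not shown here, so check the class linked above for the methods it expects.

```python
# Categories: an atom is a string; a functor is a (result, slash, argument) tuple.
NP, N, S = "np", "n", "s"
IV = (S, "\\", NP)   # s\np: intransitive verb
TV = (IV, "/", NP)   # (s\np)/np: transitive verb
DET = (NP, "/", N)   # np/n: determiner

# Hypothetical hand-written lexicon: exactly one category per word,
# so no statistical model is needed.
LEXICON = {
    "Maria": NP, "Luca": NP,
    "il": DET, "la": DET, "un": DET,
    "gatto": N, "mela": N, "libro": N,
    "dorme": IV,
    "mangia": TV, "legge": TV,
}


def reduce_pair(left, right):
    # Forward application: X/Y followed by Y gives X.
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    # Backward application: Y followed by X\Y gives X.
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def parse(sentence):
    """Greedy shift-reduce parse.

    Returns a tree whose leaves are (category, word) and whose inner nodes
    are (category, left_subtree, right_subtree), or None on failure.
    """
    stack = []
    for word in sentence.split():
        if word not in LEXICON:
            return None  # unknown word: the lexicon has to be extended by hand
        stack.append((LEXICON[word], word))
        # Reduce the two topmost items for as long as a rule applies.
        while len(stack) >= 2:
            left, right = stack[-2], stack[-1]
            result = reduce_pair(left[0], right[0])
            if result is None:
                break
            stack[-2:] = [(result, left, right)]
    if len(stack) == 1 and stack[0][0] == S:
        return stack[0]
    return None


tree = parse("Maria mangia la mela")
```

Greedy reduction is enough for this fragment but commits to the first reduction it finds, so a grammar with real ambiguity would need a chart parser instead.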

I hope this helps!

Thommy257 · 2023-02-06T15:20:56Z

Also, DisCoPy supports CFG grammars (https://docs.discopy.org/en/0.5/discopy/grammar.cfg.html?highlight=cfg#module-discopy.grammar.cfg), so CFGs are also supported by lambeq. Furthermore, CCGs can express CFGs, i.e. they can also be used to generate sentences.
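
For readers less familiar with the generative view, here is a minimal sketch of a CFG producing sentences, written in plain Python rather than with DisCoPy's CFG module so that nothing about that API is assumed. The toy Italian grammar and the `generate` helper are hypothetical.

```python
from itertools import product

# Toy grammar: each nonterminal maps to a list of right-hand sides.
# Any symbol that is not a key is treated as a terminal (a word).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["PN"]],
    "VP": [["V", "NP"], ["IV"]],
    "Det": [["il"]],
    "N": [["gatto"]],
    "PN": [["Maria"]],
    "V": [["vede"]],
    "IV": [["dorme"]],
}


def generate(symbol):
    """Yield every word list derivable from `symbol`.

    Only terminates for non-recursive grammars like the one above.
    """
    if symbol not in GRAMMAR:  # terminal symbol
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Expand each symbol of the right-hand side, then take all combinations.
        expansions = [list(generate(s)) for s in rhs]
        for parts in product(*expansions):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in generate("S")]
```

Extending the grammar to another language then amounts to editing the rule table, which is the kind of inspection and modification asked about in the original question.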

dimkart · 2023-02-08T14:31:01Z · Maintainer

I think there was an effort to create an Italian CCGbank in the past (University of Turin?), but I'm not sure what happened with that project.

nlpirate · 2023-02-12T15:27:13Z · Author

Yes, it does exist and is available on the site (tut-ccg), but its annotation differs from the one used in lambeq.

dimkart · 2023-03-16T12:08:46Z · Maintainer

I'm not sure there is anything more to say here since, as @Thommy257 explained above, you can't train a statistical parser without an annotated corpus like CCGbank. However, we are very much interested in adding support to lambeq for languages other than English (and we welcome any community work towards this goal), so this issue will be converted into a Discussion to keep it alive.
