How can I use the original text in the snippets after cleaning? #54
Comments
The preferred way to remove stopwords in Scattertext is to pass the full documents into a Corpus factory and then use the Corpus.remove_terms method to create a corpus free of stopwords. You'll still be able to view the original documents in the Scattertext explorer. For example:
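A minimal sketch of that approach (the DataFrame, column names, and category labels below are placeholders; ignore_absences=True is assumed to skip stopwords that never occur in the corpus):

import scattertext as st
from nltk.corpus import stopwords

# Build the corpus from the full, uncleaned documents
corpus = st.CorpusFromPandas(df,
                             category_col='category',   # placeholder column names
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

# Remove stopwords from the term statistics; the stored documents are untouched,
# so the explorer's snippets still show the original text
corpus = corpus.remove_terms(stopwords.words('english'), ignore_absences=True)

html = st.produce_scattertext_explorer(corpus,
                                       category='category_a',   # placeholder labels
                                       category_name='Category A',
                                       not_category_name='Category B')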
On the other hand, you could pass an alternative_text_field argument to produce_scattertext_explorer to display a different column, such as the original, uncleaned text, in the snippets.
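A sketch of that alternative (the 'cleaned_text' and 'original_text' column names are hypothetical):

import scattertext as st

# Parse the cleaned text, but keep the untouched original text in its own column
df['parse'] = df['cleaned_text'].apply(st.whitespace_nlp_with_sentences)
corpus = st.CorpusFromParsedDocuments(df,
                                      category_col='category',
                                      parsed_col='parse').build()

html = st.produce_scattertext_explorer(corpus,
                                       category='category_a',
                                       category_name='Category A',
                                       not_category_name='Category B',
                                       alternative_text_field='original_text')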
I am actually facing the same issue as mentioned above. I am making a Scattertext plot for Chinese text, and I followed your instructions above by passing an alternate_text_field parameter into produce_scattertext_explorer. When I click a term in the Scattertext plot, no original text shows up in the snippets; in fact, nothing shows up there at all. How do I get the original text to appear there?
Could you upload the example which fails to show snippets?
@JasonKessler I just uploaded the example that reproduces the issue, please see https://github.com/sound118/Scatter-text-for-Chinese I used the "jieba" package to remove the stopword list and load a user-defined dictionary in case of incorrect Chinese term segmentation, and applied your "chinese_nlp" afterwards. You can change the file path to run the program on your local machine to find out the issue. Thanks.
I think the issue is that the alternative text field has to be whitespace-tokenized for the matcher to work.
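A sketch of that fix, assuming jieba is used for segmentation as in the linked example (the DataFrame and column names are hypothetical): join the segmented tokens with spaces and pass that column as the alternative text field.

import jieba

# Whitespace-tokenize the original Chinese text so the snippet matcher can locate terms
df['original_text_tokenized'] = df['original_text'].apply(
    lambda text: ' '.join(jieba.cut(text)))

# ...then pass alternative_text_field='original_text_tokenized' to produce_scattertext_explorer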
@JasonKessler, thanks for the hint. It works after adding whitespace between the segmented tokens in the alternative text field.
Glad to hear it works. It would be a good feature for someone in the community to pick up and build.
Hi @JasonKessler, I have the same issue here and I could not solve it with your suggestion. Here is my code:

import scattertext as st
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

data = data.loc[:, ['id', 'language', 'ProcessedText', 'OriginalText']]
data['parse'] = data['ProcessedText'].apply(st.whitespace_nlp_with_sentences)
unigram_corpus = (st.CorpusFromParsedDocuments(data,
                                               category_col='language',
                                               parsed_col='parse')
                  .build()
                  .get_stoplisted_unigram_corpus())
html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    transform=st.Scalers.dense_rank
)

What I expect is to see the full text from the OriginalText column after clicking on a given word in the chart. However, at the moment I only see a chunk of that text. For example, when clicking on the word 'thank', I see something like the following:

Thank you!

when I expect to see this instead:

This was a great moment. Thank you!

Basically, I do not want the chunking when searching for a given word in my text column. Can we achieve that? 😄
Try adding use_full_doc=True as an argument to produce_scattertext_explorer. If that doesn't work, could you please post an independently runnable example which demonstrates the problem?
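For the call in the previous comment, that change would look roughly like this sketch (reusing unigram_corpus from the code above):

html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    use_full_doc=True,  # show the whole document instead of a snippet around the clicked term
    transform=st.Scalers.dense_rank
)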
Works great @JasonKessler! Thanks 😄
Once I've removed stopwords using nltk or similar, I want to be able to see the original text snippets and not the ones without stopwords. How can I achieve that?