How can I use the original text in the snippets after cleaning? #54
Comments
The preferred way to remove stopwords in Scattertext is to pass the full documents into a Corpus factory and then use the Corpus.remove_terms method to create a corpus free of stopwords. You'll still be able to view the original documents in the Scattertext explorer. For example:
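A minimal sketch of that approach (the DataFrame, column names, and category labels below are placeholders; ignore_absences=True is assumed to skip stopwords that never occur in the corpus):

import scattertext as st
from nltk.corpus import stopwords

# Build the corpus from the full, uncleaned documents
corpus = st.CorpusFromPandas(df,
                             category_col='category',   # placeholder column names
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

# Remove stopwords from the term statistics; the stored documents are untouched,
# so the explorer's snippets still show the original text
corpus = corpus.remove_terms(stopwords.words('english'), ignore_absences=True)

html = st.produce_scattertext_explorer(corpus,
                                       category='category_a',   # placeholder labels
                                       category_name='Category A',
                                       not_category_name='Category B')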
On the other hand, you could pass an alternative_text_field argument to produce_scattertext_explorer to display a different column, such as the original, uncleaned text, in the snippets.
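A sketch of that alternative (the 'cleaned_text' and 'original_text' column names are hypothetical):

import scattertext as st

# Parse the cleaned text, but keep the untouched original text in its own column
df['parse'] = df['cleaned_text'].apply(st.whitespace_nlp_with_sentences)
corpus = st.CorpusFromParsedDocuments(df,
                                      category_col='category',
                                      parsed_col='parse').build()

html = st.produce_scattertext_explorer(corpus,
                                       category='category_a',
                                       category_name='Category A',
                                       not_category_name='Category B',
                                       alternative_text_field='original_text')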
I am actually facing the same issue as mentioned above. I am making a Scattertext plot for Chinese text, and I followed your instructions above by passing an alternate_text_field parameter into produce_scattertext_explorer. When I click a term in the Scattertext plot, no original text shows up in the snippets; in fact, nothing shows up there at all. How do I get the original text to appear there?
Could you upload the example which fails to show snippets?
@JasonKessler I just uploaded the example that reproduces the issue, please see https://github.com/sound118/Scatter-text-for-Chinese I used the "jieba" package to remove the stopword list and load a user-defined dictionary in case of incorrect Chinese term segmentation, and applied your "chinese_nlp" afterwards. You can change the file path to run the program on your local machine to find out the issue. Thanks.
I think the issue is that the alternative text field has to be whitespace-tokenized for the matcher to work.
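A sketch of that fix, assuming jieba is used for segmentation as in the linked example (the DataFrame and column names are hypothetical): join the segmented tokens with spaces and pass that column as the alternative text field.

import jieba

# Whitespace-tokenize the original Chinese text so the snippet matcher can locate terms
df['original_text_tokenized'] = df['original_text'].apply(
    lambda text: ' '.join(jieba.cut(text)))

# ...then pass alternative_text_field='original_text_tokenized' to produce_scattertext_explorer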
@JasonKessler, thanks for the hint. It works after adding whitespace between the segmented tokens in the alternative text field.
Glad to hear it works. It would be a good feature for someone in the community to pick up and build.
Hi @JasonKessler, I have the same issue here and I could not solve it with your suggestion. Here is my code:

import scattertext as st
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

data = data.loc[:, ['id', 'language', 'ProcessedText', 'OriginalText']]
data['parse'] = data['ProcessedText'].apply(st.whitespace_nlp_with_sentences)
unigram_corpus = (st.CorpusFromParsedDocuments(data,
                                               category_col='language',
                                               parsed_col='parse')
                  .build()
                  .get_stoplisted_unigram_corpus())
html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    transform=st.Scalers.dense_rank
)

What I expect is to see the full text from the OriginalText column after clicking on a given word in the chart. However, at the moment I only see a chunk of that text. For example, when clicking on the word 'thank', I see something like the following:

Thank you!

when I expect to see this instead:

This was a great moment. Thank you!

Basically, I do not want the chunking when searching for a given word in my text column. Can we achieve that? 😄
Try adding use_full_doc=True as an argument to produce_scattertext_explorer. If that doesn't work, could you please post an independently runnable example which demonstrates the problem?
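For the call in the previous comment, that change would look roughly like this sketch (reusing unigram_corpus from the code above):

html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    use_full_doc=True,  # show the whole document instead of a snippet around the clicked term
    transform=st.Scalers.dense_rank
)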
Works great @JasonKessler! Thanks 😄
Once I've removed stopwords using nltk or similar, I want to be able to see the original text snippets and not the ones without stopwords. How can I achieve that?