Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix URL parser for parentheses and Chinese Characters #43

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

singularity-s0
Copy link

@singularity-s0 singularity-s0 commented May 29, 2021

Currently, URLs like http://foo.com)你好 will be interpreted as is. Instead, it should be separated into http://foo.com and )你好 (according to URL conventions).

Although parentheses are by definition allowed in URLs, they usually appear encoded. So parentheses that appear at the end of a URL are, for most of the time, not part of it.

Chinese characters are simply not allowed in URLs.

This PR fixes those issues.

@EsteveAguilera
Copy link
Contributor

May I suggest adding a test to reproduce that?
Just to be sure it never breaks in the future.

@@ -1,6 +1,8 @@
import 'package:linkify/linkify.dart';

void main() {
print(linkify("Made by https://cretezy.com [email protected]"));
print(linkify(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, mind reverting this and adding a test instead?

@singularity-s0
Copy link
Author

Sorry, I'm quite new to GitHub and it looks like I'm not quite aware of how PRs and tests work.

The last 5 "experimental" commits were not meant for merging. I was trying to find the best Regex that meets the need of my forum (where users may post links). Although I've found the best Regex for my app, it may not be the best for general purposes and thus might not be suitable for merging.

Since this is only a simple Regex change, I suggest closing this PR and manually "merge" this commit, if needed.

As for tests, since I'm not familiar with how it works..., could you please provide instructions on how to add one? Sorry for troubling you...

@Cretezy
Copy link
Owner

Cretezy commented Aug 3, 2021

Hey, thanks for the contribution! I could do that, or if you wish I can walk you through bring up your PR to something we can merge together (and you learn!). Let me know!

@singularity-s0
Copy link
Author

Thank you very much for your kind offer :) Where should we begin?

@singularity-s0
Copy link
Author

singularity-s0 commented Aug 3, 2021

After experimenting with the URL parser, I came up with several ideas:

Unexpected Unicode Characters

Currently, an URL element will end only when there is a whitespace character matched by \s. This will cause the URL parser to match unexpected characters (like Chinese ones).

According to RFC3986, encoded URLs may only contain the following characters:

%:/?#[]@!$&'()*+,;=-._~
plus ALPHA and DIGIT

which translates to [%:/\?#\[\]@!\$&'\(\)\*\+,;=-\.~\w] in Regex.
So the _urlRegex constant can be safely updated to

final _urlRegex = RegExp(
  r'^(.*?)((?:https?:\/\/|www\.)[^\s/$.?#].[%:/\?#\[\]@!\$&'\(\)\*\+,;=\-~.\w]*)',
  caseSensitive: false,
  dotAll: true,
);

(not sure why but it looks like the syntax is a little different in Dart, why does - needs escaping while . does not?)

This will prevent unexpected unicode characters from being matched into our LinkElement.

Parentheses

Parentheses appear as sub-delimiters in RFC3986, so they are technically allowed as a part of the URL. But in practical context they are often confused with the surrounding text (e.g. in Markdown, where links always ends with )). Previously I have suggested removing them from the URL Regex, but that's not ideal and only serves as a workaround. There should be better ways to handle this. For example, since parentheses usually appear in pairs, maybe we can count the number of ( before the LinkElement and match it against the ) at end of LinkElement?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants