Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Mount URL with spaces #137

Closed
wants to merge 1 commit into from
Closed

Conversation

EyitopeIO
Copy link
Contributor

@EyitopeIO EyitopeIO commented Apr 17, 2024

To prevent PR from being too large, the path resolution was kept separate and not merged into one. I thought perhaps this would be a refactoring for later. Looking forward to the review and what to change.

Tests on Ubuntu 22

httpdirfs -d -f "http://0.0.0.0:9000/SP ACE/G B" foo

httpdirfs -d -f --cache --single-file-mode "http://0.0.0.0:9000/SP ACE/G B/code.deb" foo

httpdirfs -d --cache https://old-releases.ubuntu.com/releases/23.04 foo

httpdirfs -d --single-file-mode https://download.iopsys.eu/iopsys/toolchain/crosstools-gcc-9.2-linux-4.19-glibc-2.30-binutils-2.32-Rel1.12-full.tar.bz2 foo

httpdirfs -f https://www.crawler-test.com/urls/url_with_spaces/URL%20with%20spaces foo

@EyitopeIO
Copy link
Contributor Author

Fix for #86

@fangfufu
Copy link
Owner

To prevent PR from being too large, the path resolution was kept separate and not merged into one.

Yes this is a good idea.

src/util.c Outdated
lprintf(fatal, "URL too long: %s\n", url);
}
if (url[i] == escape_me) {
switch (c) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a switch statement here?

src/util.h Outdated
/**
* \brief URL characters that can be escaped
*/
typedef enum {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we specify the actual list of characters that can be escaped here? This is very indirect.

src/link.c Outdated
@@ -77,10 +77,15 @@ static CURL *Link_to_curl(Link *link)
if (ret) {
lprintf(error, "%s", curl_easy_strerror(ret));
}
ret = curl_easy_setopt(curl, CURLOPT_URL, link->f_url);

char *escaped_spaces = escape_char(link->f_url, ESCAPED_CHAR_SPACE);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have difficulty with understanding what ESCAPED_CHAR_SPACE gets evaluated to.

src/util.c Outdated

switch (c) {
case ESCAPED_CHAR_SPACE:
escape_me = ' ';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good idea, the characters that need to be escape is not going to change on the fly. You should define it globally, rather than creating a new type to wrap around it. I don't think we need this many indirection.

@EyitopeIOn
Copy link

@fangfufu Replying all comments as @EyitopeIO

I would look for examples on how to reduce/remove the indirection and update in a bit. I'm just recalling that I've seen good example in libnetfilter_conntrack when updating patches.

@fangfufu
Copy link
Owner

@fangfufu Replying all comments as @EyitopeIO

I would look for examples on how to reduce/remove the indirection and update in a bit. I'm just recalling that I've seen good example in libnetfilter_conntrack when updating patches.

Could you point me to your example please? All I am suggesting is that we don't unnecessarily define new types. I understand you might want to do what you did when there might be multiple characters that need to be escaped, but this is not the case here.

@EyitopeIO
Copy link
Contributor Author

EyitopeIO commented Apr 19, 2024

@fangfufu
I've checked again the "examples" I thought I'd seen in libnetfilter_conntrack source and they unrelated. Never mind

I've added eb274b3 to address unresolved issues. I hope I understood them correctly. Please don't hesitate to let me know as I'm willing to adapt and learn.

I also tested this by setting MAX_PATH_LEN and MAX_FILENAME_LEN to the length of http://0.0.0.0:9000/SP ACE/G B.

@fangfufu
Copy link
Owner

Please could you squash your 3 commits together into one commit, your 3 commits are all addressing the same issue. It really should be squashed together.

In future consider using git commit --amend then git push --force.

src/util.c Outdated
@@ -49,6 +49,53 @@ char *path_append(const char *path, const char *filename)
return str;
}

char *escape_char(const char *url, const char escape_me)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only escaping a single space character, please rename this function into escape_space or something. and remove the escape_me variable.

src/util.c Outdated
Comment on lines 70 to 75
if (escape_me == ' ') {
replacement = "%20";
} else {
lprintf(warning, "Not escaping: '%c' in URL\n", escape_me);
return NULL;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer necessary.

src/util.h Outdated
Comment on lines 20 to 28
/**
* \brief escape a single character in a URL
* \details This function escapes a character a URL and returns a pointer to
* the string with the escaped character. We don't use curl_easy_escape() on
* the entire URL because it would break the URL. For example, 'http://a c'
* becomes 'http:%2F%2Fa%20c', escaping more characters than we want.
* \note You need to free the char * after use.
*/
char *escape_char(const char *s, const char c);
Copy link
Owner

@fangfufu fangfufu Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rewrite the description to escaping only the space character.

@EyitopeIO EyitopeIO force-pushed the fix-space branch 2 times, most recently from 8f0b787 to 27ecec6 Compare April 20, 2024 00:42
@EyitopeIO
Copy link
Contributor Author

@fangfufu
Squashed commits to one. #137 (comment)
Updated function name and description. #137 (comment)

src/util.c Outdated
@@ -49,6 +49,47 @@ char *path_append(const char *path, const char *filename)
return str;
}

char *escape_spaces(const char *path)
{
char space = ' ';
Copy link
Owner

@fangfufu fangfufu Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this variable. Remove please. ' ' has fewer characters than space.

@fangfufu
Copy link
Owner

fangfufu commented Apr 20, 2024

Actually I have an important question, why don't we use curl_easy_escape? Why are you writing a function which escape a single space character?

@EyitopeIO
Copy link
Contributor Author

EyitopeIO commented Apr 20, 2024

@fangfufu Using curl_easy_escape unfortunately does not allow us to specify the exact character we want to escape. It escapes everything in the provided string.

Ideally, I think you'd want to curl_easy_escape(url, len, ' '), but instead it escapes everything that can be escaped, including the forward slashes in the URL; thus breaking the URL.

so http://foo/b r becomes http:%2F%2Ffoo%2Fb%20r

So originally my thinking was to have a function that behaves like the libcurl function, but allows you specify what you want to escape.

@fangfufu
Copy link
Owner

So originally my thinking was to have a function that behaves like the libcurl function, but allows you specify what you want to escape.

No, in this situation you should not be writing code that duplicate what libcurl does.

You can remove the protocol part of the URL, please refer to make_link_relative in link.c

Have a look at https://curl.se/libcurl/c/curl_easy_getinfo.html

I am sorry but I don't like the current implementation, as it is a hack for the space character.

@fangfufu
Copy link
Owner

Just for the record, this is for fixing #138

Copy link

Quality Gate Passed Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@EyitopeIO
Copy link
Contributor Author

EyitopeIO commented Apr 22, 2024

@fangfufu
I looked carefully at make_link_relative and here are my observations.

Let's say we want to mount https://old-releases.ubuntu.com. This is easy because everything after the protocol can be escaped and BOOM! We're done!

If we now want to mount https://old-releases.ubuntu.com/releases/23.04 using curl's function (curl_easy_escape), it would be necessary to add another "hack" to prevent the slashes from being escaped and invalidating the URL. In this case, we would want to escape everything after the protocol while leaving slashes intact. Mounting a URL likehttp://0.0.0.0:9000/SP ACE/G B and using curl's function after the protocol would also escape the colon and break the link.

Now that you mention it, changes so far sure looks like a hack for just the space character.

I think the links could be sanitised (escaping funny characters) from within the function LinkTable_uninitialised_fill instead of in function LinkTable_fill, since both single-file mode and normal mode take a path to LinkTable_uninitialised_fill.

Lastly, if a user specified a path https://fufufang.com/Work in Progress/财富, I'm assuming that is exactly how the user expects it be mounted. So I've been avoiding touching the content of the Link structure, since the only part of the code where the characters cause a problem is when making the request to the server i.e. CURLOPT_URL

Truth is I don't really know what I'm doing and just guessing.

@fangfufu
Copy link
Owner

@EyitopeIO, I have given you the function that is used to escape percent-encoding which is curl_easy_escape. If you looked up that function, you would find that the opposite is curl_easy_unescape. These are the functions I used to convert between percentage encoding and whatever encoding the URL uses. They are placed in different places between cache.c and link.c. All I am suggesting is that you place those functions at the correct places.

@fangfufu
Copy link
Owner

This pull request will not be merged, as you said earlier, it is basically a hack. If you are still interested in working on it, please submit something that actually addresses issue #138. This problem actually has occurred in #127 and #137.

@fangfufu fangfufu closed this Apr 24, 2024
@fangfufu
Copy link
Owner

fangfufu commented Apr 24, 2024

There is also path_to_Link_recursive ( https://github.com/fangfufu/httpdirfs/blob/master/src/link.c#L827) which consumes each part of the URL to construct directories, if you only want part of the URL.

@EyitopeIO
Copy link
Contributor Author

Okay, I'd try one more time but with the understanding that the fix should generally address characters that could break URLs (e.g. in the related issues), and not just a hack for the space character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants