Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filename character sequence with tilde sometimes causes false error on Linux #152

Open
akubot opened this issue Apr 23, 2021 · 9 comments
Open

Comments

@akubot
Copy link

akubot commented Apr 23, 2021

Observed on Linux, a failure at line 931 of bagit.py in git commit hash code bff20d2

RHEL 7.5 (Maipo), Linux 3.10.0-1160.15.2.el7.x86_64 GNU/Linux

Python 3.7.4 using miniconda3

File names containing this 4-character sequence cause an "is unsafe" false error due to expansion of the tilde:

~$_-

To reproduce, create a file name such as

data/blah/~$_-expect_fail.doc

In Python v3, do this

import os, sys
f = "data/blah/~$_-expect_fail.doc"

# Note that Linux home directory is inserted in expansion
os.path.expandvars(f) 
'data/blah/~/home/users/jdoe-expect_fail.doc'

A similar test with filename that does not have the hyphen after the underscore

data/blah/~$_expect_success.doc

works OK, i.e. without unwanted expansion of the tilde.

@acdha
Copy link
Member

acdha commented Apr 23, 2021

That's certainly quite odd — does it reproduce if you simply run python -c 'import os.path; os.path.expanduser("data/blah/~$_-expect_fail.doc")'? Looking at the stdlib implementation, my first reaction is surprise that this isn't failing the getpwnam() lookup:

https://github.com/python/cpython/blob/3.8/Lib/posixpath.py#L228-L271

@akubot
Copy link
Author

akubot commented Apr 23, 2021

I modified your test slightly, it confirms the problem.

I changed it to expandvars, which is the one failing.
and also wrapped it with a print()

The original command you posted does not throw an error but also doesn't print anything.

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/~$_-expect_fail.doc"))'
data/blah/~/home/users/jdoe-expect_fail.doc

I don't have an explanation for the issue, but somehow these 4 characters together expand on RHEL anyway.

@acdha
Copy link
Member

acdha commented Apr 23, 2021

Ah, I missed that and it explains the problem: $_ is a shell variable (last argument of the previous command, if memory serves), and that would explain why you're seeing this depending on the next character since it has to be something which stops parsing at the next character.

I'm wondering whether we want to remove that part of #88 on the theory that this is being overzealous. I do not want to be in the business of having to try to maintain a separate implementation of expandvars.

@akubot
Copy link
Author

akubot commented Apr 23, 2021

It looks like it's slightly more subtle -- the sequence "$_-" also causes the problem

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/$_-expect_fail.doc"))'
data/blah/~/home/users/jdoe-expect_fail.doc

while just "$_" does not

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/$_expect_success.doc"))'
data/blah/$_expect_success.doc

and similarly for just "~$_"

python3 -c 'import os.path; print(os.path.expandvars("data/blah/~$_expect_success.doc"))'
data/blah/~$_expect_success.doc

So it appears the hyphen helps trigger this. But that makes me think there are other cases lurking.

Regarding the bagit.py source code, I'm not sure what the correct choice would be. I suppose you could put some guard around the expandvars code, to check for the character sequences that mess it up on Linux?

But I'm not sure specifically what you would do, because you are currently matching a possibly expanded filename against the real one.

Probably better to discuss with regular contributors, and ask whether to back out the expandvars snippet completely? Then it should always pass on Linux.

BTW, a Windows user verified this issue does not occur there, which makes sense because this is a Linux expansion.

@acdha
Copy link
Member

acdha commented Apr 23, 2021

I was thinking that it might be good to discuss this with @runderwood to see whether we have a compelling reason to keep expandvars in here.

This variable is set by at least the popular Bourne shells and ones like Fish which try to be compatible so I suppose we could purge $_ at startup but that seems like it's just asking for some unexpected dependency to show up later.

@runderwood
Copy link
Contributor

I don't really recall why the expandvars call is there, to be honest.

While I can see the reasoning that it's safer to disallow potentially dangerous variable expansion, I believe it's system-dependent so you don't really enforce a meaningful universal policy doing it this way.

I don't have a compelling argument for keeping it there if it's causing problems.

@susannah-slv
Copy link

I have encountered similar behavior using Windows (both Windows 10 and Windows Server 2022).

When processing a filename with a repeated $ character (eg. filename $$$.mbox becomes filename $$.mbox) the bag is successfully created but fails during validation.

I checked how the _path_is_dangerous(entry_path) call was handling this, and it was failing the expandvars call.

The error I am getting:
Path "data\path\filename $$$.mbox" in manifest "X:\Directory\bag\manifest-sha256.txt" is unsafe

If paths with $$ characters are dangerous, should this behavior flag at the bag creation stage? Could it be a logged warning rather than complete failure?

@runderwood
Copy link
Contributor

runderwood commented Mar 6, 2025

I think the expandvars could be removed, unless somebody has a good argument for keeping it. It's been a long time, so I'm not sure what the rationale was. But it certainly seems like path-handling is asymmetrical (on read/write), at the least, in this regard. So I'm happy to either pull the call altogether or retain it but trigger a warning message rather than flagging the path as unsafe and returning True.

EDIT: Maybe this should be reworked to focus on preventing escape from the bag hierarchy rather than equality after expansion?

@acdha
Copy link
Member

acdha commented Mar 6, 2025

I think that's right - your comment on 70e197d certainly suggests that the real benefit comes from the os.path.commonprefix check after using os.path.realpath to avoid problems with things like 'bag_path/../../../etc/passwd`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants