Mastodon Archive

This tool enables you to make an archive of your statuses, your favourites and your bookmarks, along with the media attached to them. From this archive, you can generate a simple text file, or an HTML file with or without media. Take a look at an example if you're curious.

Note that Mastodon v2.3.0 added an account archive download feature: "Every 7 days you are able to request a full archive of your toots. The toots are exported in ActivityPub JSON format alongside the media files attached to them, your avatar and header images as well as the private key of your account used for signing content." If all you want to do is have a backup of your data, perhaps that is enough and you don't need this tool. Use something like tumelune, mastodon-data-viewer.py or meow to browse the archive.

This tool does not download the full archive of your toots from the server. Instead, it uses the Mastodon client API to fetch them incrementally.

You can get the latest sources from the author’s site.

Installation

There are multiple alternative ways to install mastodon-archive on your machine:

Linux Packages

There are now packages available for Debian (*.deb) and Redhat (*.rpm) based systems. They are not in the standard repositories, but for now they can be installed from IzzySoft's Repositories, where you will also find instructions on how to add the repository.

Once you've got the repo added and your indexes refreshed, just install using

  • sudo apt install mastodon-archive on Debian & derivatives
  • sudo yum install mastodon-archive on Redhat & derivatives

Should you get a notice on your Mastodon.py being outdated while running mastodon-archive (will e.g. happen on Ubuntu 20.04), please see the contrib/README.md for a fix (upgrade_python-mastodon.sh).

Using PIP

The following command will install mastodon-archive and all its dependencies:

# Python 3
pip3 install mastodon-archive

If this is the first tool you installed using pip then perhaps it installed mastodon-archive in a directory that's not on your PATH. I had to add the following to my ~/.bashrc file:

export PATH=$PATH:$HOME/.local/bin

🔥 If you're getting an error that ends with Command "python setup.py egg_info" failed with error code 1 ... you might have to install the setup tools. Try the following:

pip3 install --user setuptools
pip3 install mastodon-archive

Manually install the latest development code

You can always clone the repository and run python setup.py install from within its root directory:

git clone https://github.com/kensanata/mastodon-archive
cd mastodon-archive
python setup.py install

Global options

If you don't want the script to generate any output unless there are errors, e.g., because you are running it from a scheduled task and don't want to get email about it unless something goes wrong, you can specify --quiet before the command to suppress non-error output, e.g., mastodon-archive --quiet archive, mastodon-archive --quiet media, etc. This will not suppress output for commands whose main point is to generate output.

Making an archive

When using the app for the first time, you have to authorize it:

$ mastodon-archive archive kensanata@dice.camp
Registering app
Log in
Visit the following URL and authorize the app:
[the app gives you a huge URL which you need to visit using a browser]
Then paste the access token here:
[this is where you paste the authorization code]
Get user info
Get statuses (this may take a while)
Save 41 statuses

Note that the library we are using says: "Mastodons API rate limits per IP. By default, the limit is 300 requests per 5 minute time slot. This can differ from instance to instance and is subject to change." Thus, if every request gets 20 toots, then we can get at most 6000 toots per five minutes.
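The arithmetic above can be turned into a rough estimate of how long a large download has to wait. This is only a sketch: the function name is made up, and it assumes the defaults quoted above (20 toots per request, 300 requests per five minutes) and that the client waits one full window every time the limit is used up.

```python
import math

def min_wait_minutes(toots, per_request=20, requests_per_window=300, window_minutes=5):
    """Lower bound on the minutes spent waiting for the rate limit to reset."""
    requests = math.ceil(toots / per_request)
    windows = math.ceil(requests / requests_per_window)
    # The requests of the first window can be sent right away;
    # every additional window costs one full wait.
    return max(0, windows - 1) * window_minutes

print(min_wait_minutes(6000))    # fits into a single window
print(min_wait_minutes(100000))  # needs several windows
```

Your instance may use different limits, so treat the result as a lower bound and not as a promise.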

If this is taking too long, consider skipping your favourites and bookmarks:

$ mastodon-archive archive --no-favourites --no-bookmarks kensanata@dice.camp

If you want a better picture of conversations, you can also include mentions. Mentions are notifications of statuses in which you were mentioned as opposed to statuses of yours that were favoured or boosted by others. Note that if you used to dismiss notifications using the "Clear notifications" menu, then no mentions will be found as mentions are simply a particular kind of notification.

$ mastodon-archive archive --with-mentions kensanata@dice.camp

No matter what you did, you will end up with three new files:

dice.camp.client.secret is where the client secret for this instance is stored. dice.camp.user.kensanata.secret is where the authorisation token for this user and instance is stored. If these two files exist, you don't have to log in the next time you run the app. If your login expired, you need to remove the file containing the authorisation token and you will be asked to authorize the app again.

dice.camp.user.kensanata.json is the JSON file with your data (but without your media attachments). If this file exists, only the missing toots will be downloaded the next time you run the app. If you suspect a problem and want to make sure that everything is downloaded again, you need to remove this file.
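The naming pattern of these three files can be sketched as follows. The function is hypothetical and only reproduces the pattern visible in the file names above; it is not taken from the tool's source.

```python
def archive_file_names(account):
    """Derive the three file names from an account like 'user@example.org'."""
    username, domain = account.split("@", 1)
    return {
        "client_secret": f"{domain}.client.secret",
        "user_secret": f"{domain}.user.{username}.secret",
        "archive": f"{domain}.user.{username}.json",
    }

for kind, name in archive_file_names("kensanata@dice.camp").items():
    print(kind, name)
```

This is handy in a backup script if you want to check that the archive file exists before running other commands.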

Splitting an archive

If you keep adding to your archive, it eventually grows very large. When it reaches hundreds of megabytes, consider splitting it.

$ ls -lh *.json
-rw-r--r-- 1 alex alex 120M Apr 14 21:50 octodon.social.user.kensanata.json

You can provide an --older-than option to specify the number of weeks you want to keep. The default is four weeks.

If you don't provide the --confirmed option, this is a dry run.

$ mastodon-archive split --older-than=10 kensanata@octodon.social
This is a dry run and nothing will be moved.
Instead, we'll just list what would have happened.
Use --confirmed to actually do it.
Loading existing archive
Older than 2019-02-03 22:11:48.253408
statuses: 10623
favourites: 11233
mentions: 10773
Would have saved this to octodon.social.user.kensanata.0.json

When you do the split, the files are saved.

$ mastodon-archive split --older-than=10 --confirmed kensanata@octodon.social
Loading existing archive
Older than 2019-02-03 22:11:59.668432
statuses: 10623
favourites: 11233
mentions: 10773
Saving octodon.social.user.kensanata.json
Saving octodon.social.user.kensanata.0.json

Verify the result:

$ ls -lh *.json
-rw-r--r-- 1 alex alex 107M Apr 14 22:12 octodon.social.user.kensanata.0.json
-rw-r--r-- 1 alex alex  13M Apr 14 22:12 octodon.social.user.kensanata.json
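The idea behind the split can be sketched like this. It is a simplified illustration, not the tool's actual code: the function name is made up, and it assumes that created_at is stored as an ISO 8601 string with a UTC offset, like the dates shown in the text output.

```python
from datetime import datetime, timedelta, timezone

def split_by_age(toots, weeks, now=None):
    """Return (keep, move): toots newer than the cutoff, and older ones."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(weeks=weeks)
    keep, move = [], []
    for toot in toots:
        # Assumption: timestamps look like 2019-04-01T12:00:00+00:00
        created = datetime.fromisoformat(toot["created_at"])
        (move if created < cutoff else keep).append(toot)
    return keep, move
```

Applied to statuses, favourites and mentions in turn, the "move" lists would end up in the numbered file and the "keep" lists would stay in the main archive.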

Downloading media files

Assuming you already made an archive of your toots:

$ mastodon-archive media kensanata@dice.camp
44 urls in your archive (half of them are previews)
34 files already exist
Downloading |################################| 10/10

By default, only the media from your own statuses is downloaded; media of statuses you added to your favourites or bookmarks is not part of your archive. To download these too, specify the favourites collection:

$ mastodon-archive media --collection favourites kensanata@dice.camp

Or specify the bookmarks collection:

$ mastodon-archive media --collection bookmarks kensanata@dice.camp

You will end up with a new directory, dice.camp.user.kensanata. It contains all the media you uploaded, and their corresponding previews.

If you rerun it, it will simply try to get the remaining files. Note, however, that instance administrators can delete media files. Thus, you might be forever missing some files—particularly the ones from remote instances, if you added any to your favourites. If you don't want to see errors about media that fail to download for this reason, add --suppress-errors to the command.

There's one thing you need to remember, though: the media directory contains all the media from your statuses, and all the media from your favourites. There is no particular reason why the media files from both sources need to be in the same directory, see issue #11.

Generating a text file

Assuming you already made an archive of your toots:

$ mastodon-archive text kensanata@dice.camp
[lots of other toots]
Alex Schroeder 🐉 @kensanata 2017-11-14T22:21:50.599000+00:00
https://dice.camp/@kensanata/99005111284322450
[#introduction](https://dice.camp/tags/introduction) I run
[#osr](https://dice.camp/tags/osr) games using my own hose rule document but
it all started with Labyrinth Lord which I knew long before I knew B/X. Sadly,
my Indie Game Night is no longer a thing but I still love Lady Blackbird, all
the [#pbta](https://dice.camp/tags/pbta) hacks on my drive, and so much more.
But in the three campaigns I run, it’s all OSR right now.

Generating a text file just means redirecting the output to a text file:

$ mastodon-archive text kensanata@dice.camp > statuses.txt

If you're working with text, you might expect the first toot to be at the top and the last toot to be at the bottom. In this case, you need to reverse the list:

$ mastodon-archive text --reverse kensanata@dice.camp | head

Searching your archive

You can also filter using regular expressions. These will be checked against the status content (obviously), display name and username (both are important for boosted toots), and the created at date. Also note that the regular expression will be applied to the raw status content. In other words, the status contains all the HTML and probably starts with a <p>, which is then removed in the output.

$ mastodon-archive text kensanata@dice.camp house

You can provide multiple regular expressions and they will all be checked:

$ mastodon-archive text kensanata@dice.camp house rule

Remember basic regular expression syntax: \b is a word boundary, (?i) ignores case, (a|b) is for alternatives, just to pick some useful ones. Use single quotes to protect your backslashes and question marks.

$ mastodon-archive text kensanata@dice.camp house 'rule\b'

You can also search your favourites, your bookmarks or your mentions:

$ mastodon-archive text --collection favourites kensanata@dice.camp '(?i)blackbird'

Dates are in ISO format (e.g. 2017-11-19T14:00). I usually only care about year and month, though:

$ mastodon-archive text --collection favourites kensanata@dice.camp bird '2017-(07|08|09|10|11)'
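If you want to experiment with patterns before running them against a big archive, the matching described in this section can be approximated in a few lines. This is a sketch under assumptions: the function name is made up, and I assume that every pattern has to match at least one of the four fields.

```python
import re

def matches_all(status, patterns):
    """True if every pattern matches at least one searchable field."""
    fields = [
        status.get("content", ""),                  # raw HTML content
        status["account"].get("display_name", ""),
        status["account"].get("username", ""),
        str(status.get("created_at", "")),          # ISO date
    ]
    return all(any(re.search(p, f) for f in fields) for p in patterns)

status = {
    "content": "<p>My house rules</p>",
    "account": {"display_name": "Alex", "username": "kensanata"},
    "created_at": "2017-11-14T22:21:50+00:00",
}
print(matches_all(status, ["house", "rule"]))            # True
print(matches_all(status, ["house", r"rule\b"]))         # False, "rules" has no boundary after "rule"
print(matches_all(status, ["(?i)HOUSE", "2017-(10|11)"]))  # True
```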

Show context for a toot

Sometimes you only remember something about a thread. Let's say you asked a question a while back but now you can't remember the answer you got back then. First, find the question:

$ mastodon-archive text kensanata@dice.camp rules
Alex Schroeder 🐉 @kensanata 2018-05-28T21:19:27.483000+00:00
https://dice.camp/@kensanata/100109016572069901
...

Using the URL, you can now search the archive for some context:

$ mastodon-archive context kensanata@dice.camp https://dice.camp/@kensanata/100109016572069901

This shows the same information clicking on the toot shows you in the web client: all its ancestors and all its descendants. Obviously, if these toots are not in your archive, we can't find them. You'll have to click on the links and hope they're still around.
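Finding ancestors and descendants in an archive boils down to following the in_reply_to_id field of the Status entity. The following is a minimal sketch of that idea, assuming each archived toot has the id, url and in_reply_to_id fields from the Mastodon API; the function name is made up and the tool's own implementation may differ.

```python
def thread_context(toots, url):
    """Return (ancestors, target, descendants) for the toot with this URL."""
    by_id = {t["id"]: t for t in toots}
    target = next((t for t in toots if t.get("url") == url), None)
    if target is None:
        return [], None, []
    # Walk up the reply chain for as long as the parent is in the archive.
    ancestors = []
    parent_id = target.get("in_reply_to_id")
    while parent_id in by_id:
        parent = by_id[parent_id]
        ancestors.insert(0, parent)
        parent_id = parent.get("in_reply_to_id")
    # Walk down: collect replies, then replies to those replies, and so on.
    descendants = []
    queue = [target["id"]]
    while queue:
        current = queue.pop(0)
        for t in toots:
            if t.get("in_reply_to_id") == current:
                descendants.append(t)
                queue.append(t["id"])
    return ancestors, target, descendants
```

To see both sides of a conversation, pass in your statuses and your mentions together; toots that are in neither collection cannot be found, as noted above.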

Generating a HTML file

Assuming you already made an archive of your toots:

$ mastodon-archive html kensanata@dice.camp

This will create numbered HTML files starting with dice.camp.user.kensanata.statuses.0.html, each page with 2000 toots.

You can change the number of toots per page using an option:

$ mastodon-archive html --toots-per-page 100 kensanata@dice.camp
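To get a feeling for how many files to expect, here is a small sketch of the paging scheme described above. Both function names are made up; the file name pattern is copied from the example file name in this section.

```python
def paginate(toots, per_page=2000):
    """Cut the list of toots into pages of at most per_page items."""
    return [toots[i:i + per_page] for i in range(0, len(toots), per_page)]

def page_file_name(prefix, collection, index):
    """Build a name like dice.camp.user.kensanata.statuses.0.html."""
    return f"{prefix}.{collection}.{index}.html"

pages = paginate(list(range(4500)))
for index, page in enumerate(pages):
    print(page_file_name("dice.camp.user.kensanata", "statuses", index), len(page))
```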

If you have downloaded your media attachments, these will be used in the HTML files. Thus, if you want to upload the HTML files, you now need to upload the media directory as well or all the media links will be broken.

You can also generate a file for your favourites:

$ mastodon-archive html --collection favourites kensanata@dice.camp

This will create numbered HTML files starting with dice.camp.user.kensanata.favourites.0.html, each page with 2000 toots.

Note that both the HTML file with your statuses and the HTML file with your favourites will refer to the media files in your media directory.

Meow

Meow is a viewer for Mastodon export files (gratis but not free software). Such files contain all of one's toots, stars and bookmarks. It can also process your archives created with this tool. Meow runs locally in your browser and needs access to your archive. This is accomplished by serving the archive via a local web server.

Here’s how to serve your archive, locally, for Meow to access, including all the media in your archive, if you archived it:

$ mastodon-archive meow [email protected]

Once this is done, open Meow with the “Mastodon Archive Import URL” and it pulls the archived data from the local web server you just started:

https://purr.neocities.org/mastodon-archive-import/

Known limitations:

  • If a media file doesn't exist locally, Meow generally tries to load it from the remote server. One notable exception is profile pictures and banners — you need to download your media to see them.

  • Boosts and favorites use post contents and media from the backup, but not user profiles (because of how Meow works internally), those are fetched from their instances.

Reporting

Some numbers, including your ten most used hashtags:

$ mastodon-archive report [email protected]
Considering the last 12 weeks
Statuses:               296
Boosts:                  17
Media:                    9

Top 10 hashtags:
#caster(8) #20questions(5) #osr(3) #dungeonslayers(2) #introduction(2)
#currentprojects(2) #diaspora(1) #gygax(1) #yoonsuin(1) #casters(1)

Favourites:             248
Boosts:                   0
Media:                   20

Top 10 hashtags:
#1strpg(9) #rpg(5) #myfirstcharacter(5) #introduction(5) #osr(4)
#1strpgs(4) #dnd(3) #gamesnacks(1) #vancian(1) #mastoart(1)

You can specify a different number of weeks to consider using --newer-than N or use --all to consider all your statuses, favourites and bookmarks.

You can list a different number of hashtags using --top N and you can list all of them by using --top -1. This might result in a very long list.

By default only your toots are considered for the hashtags. Use --include-boosts to also include toots you have boosted.
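Counting hashtags in the archive yourself is not hard. Here is a sketch that mimics the options above; it assumes each toot has the tags and reblog fields of the Mastodon Status entity, and the function name and its parameters are my own invention, not the tool's code.

```python
from collections import Counter

def top_hashtags(toots, top=10, include_boosts=False):
    """Return (hashtag, count) pairs, most used first."""
    counts = Counter()
    for toot in toots:
        boost = toot.get("reblog")
        if boost and not include_boosts:
            continue
        # For a boost, the hashtags live in the boosted toot.
        source = boost if boost else toot
        for tag in source.get("tags", []):
            counts[tag["name"].lower()] += 1
    # A negative number means "list all of them", like --top -1.
    return counts.most_common(None if top < 0 else top)
```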

Expiring your toots and favourites

Somewhat deprecated: Please note that Mastodon now offers Preferences → Automated post deletion. Just make sure that you never skip your backups and you should be fine. 😅

Mastodon does not expire your favourites.

Warning: This is a destructive operation. You will delete your toots on your instance, or unfavour your favourites, or dismiss your notifications on your instance. Whereas it might be possible to favour all your favourites again, there is no way to repost all those toots. You will have a copy in your archive, but there is no way to restore these to your instance.

But why? I might want to keep a copy of my toots, but I don't think they have much value going back months and years. I never read through years of tweeting history! This only benefits your enemies, never your friends. So I want to expire my toots. We can always write a blog post about the good stuff. You can read more about this on my blog.

Alternatives: Check out ephemtoot (a Python script), or MastoPurgee. These tools expire your toots without archiving them. Or use the "Automated post deletion" feature you can find with your account preferences in recent versions of Mastodon.

Anyway, back to Mastodon Archive. 🙂

Sadly, I have some bad news for you: this has been rate limited to 30 statuses per 30 minutes! 😭

No, really! See the merge request. This is terrible. Expiry basically only works if you run it every time you have posted 30 statuses or so, in the long run. If you don't, be prepared for a long wait! 😴

In order to not go crazy, the code catches an interrupt (such as you pressing Ctrl-C) and saves the data even though it hasn't finished expiring your statuses.

Anyway, enough complaining. How do you do it?

You can expire your toots using the expire command and providing the --older-than option. This option specifies the number of weeks to keep on the server. Anything older than that is deleted or unfavoured. If you use --older-than 0, then all your toots will be deleted, or all your favourites will be unfavoured, or all your notifications will be dismissed.

~/src/mastodon-archive $ mastodon-archive expire --older-than 0 [email protected]
This is a dry run and nothing will be expired.
Instead, we'll just list what would have happened.
Use --confirmed to actually do it.
Delete: 2017-11-26 "<p>Testing äöü</p>"

Actually, the default operation just does a dry run. You need to use the --confirmed option to proceed.

And one more thing: since this requires the permission to write to your account, you will have to reauthorize the app.

$ mastodon-archive expire --collection favourites --older-than 0 \
  --confirmed [email protected]
Log in
Visit the following URL and authorize the app:
[long URL shown here]
Then paste the access token here:
[long token pasted here]
Expiring |################################| 1/1

After a while you'll notice that archiving mentions takes more and more time. The reason is that when expiring mentions, the tool goes through all your notifications, looks at those of the type "mention", and expires them if they are old enough. There are other types of notifications, however: "follow", "favourite", and "reblog" (at the time of this writing). As these are not archived, we also don't expire them. Thus, the list of notifications to look through when archiving keeps growing unless you use the "Clear notifications" menu in the Mastodon web client. Alternatively, you can use the --delete-other-notifications option together with --collection mentions and then the tool will dismiss all the older other notifications for you.

Troubleshooting

🔥 If you are archiving a ton of toots and you run into a General API problem, use the --pace option. This is what the problem looks like:

$ mastodon-archive archive [email protected]
...
Get statuses (this may take a while)
Traceback (most recent call last):
...
mastodon.Mastodon.MastodonAPIError: General API problem.

Solution:

$ mastodon-archive archive --pace [email protected]

The problem seems to be related to how Mastodon rate limits requests.

🔥 If you are expiring many toots, same thing. The default rate limit is 300 requests per five minutes, so when more than 300 toots are to be deleted, the app simply has to wait for five minutes before continuing. It takes time.

$ mastodon-archive expire --confirm [email protected]
Loading existing archive
Expiring |                                | 1/1236
We need to authorize the app to make changes to your account.
Log in
Visit the following URL and authorize the app:
[long URL here]
Then paste the access token here:
[access token here]
Considering the default rate limit of 300 requests per five minutes and having 1236 statuses,
this will take at least 20 minutes to complete.
Expiring |#######                         | 301/1236

🔥 If you are experimenting with expiry, you'll need to give the app write permissions. If you then delete the user secret file, hoping to start with a clean slate when archiving, you'll be asked to authorize the app again, but somehow Mastodon remembers that you have already granted the app read and write permissions, and you will get this error:

mastodon.Mastodon.MastodonAPIError: Granted scopes "read write" differ from requested scopes "read".

In order to get rid of this, you need to visit the website, go to Settings → Authorized apps and revoke your authorization for mastodon-archive. Now you can try the authorization URL again and you will only get read permissions instead of both read and write permissions.

🔥 Some servers are compatible with the Mastodon client protocol and yet you'll get the error "Version check failed". In these cases, you can skip this check by using the --no-version-check option.

$ mastodon-archive archive --pace --no-version-check [email protected]

The problem is that the library implementing the Mastodon client protocol tries to determine the exact feature-set available from your instance based on the instance's version string. When using mastodon-archive for instances that don't use Mastodon, you might have to skip the version check. When you disable the version check, whatever you're trying to do might work – or it might not. Unfortunately, you're on your own.

Followers

This is work in progress. I'm actually not sure where I want to go with this. Right now it either lists all your followers, or it lists all your followers that haven't interacted with you; in the latter case you can block them, too. This is for very grumpy users, for sure.

If a toot of theirs mentions you, then that counts as an interaction. Favouring and boosting does not count. By default, this looks at the last twelve weeks. In order for this to work, you need an archive containing both mentions and followers.
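The rule can be sketched as a set difference: followers minus everybody who mentioned you. This is a simplified illustration that ignores the twelve week window; the function name is made up, and it assumes that followers are stored as account dicts and mentions as toot dicts, each carrying the acct field from the Mastodon API.

```python
def followers_without_mentions(archive, whitelist=()):
    """List followers who never mentioned you and are not whitelisted."""
    # Everybody who wrote a toot mentioning you counts as having interacted.
    active = {m["account"]["acct"] for m in archive.get("mentions", [])}
    return [
        f["acct"]
        for f in archive.get("followers", [])
        if f["acct"] not in active and f["acct"] not in whitelist
    ]
```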

$ mastodon-archive archive --with-mentions --with-followers [email protected]
Loading existing archive
Get user info
Get new statuses
Fetched a total of 0 new toots
Get new favourites
Fetched a total of 0 new toots
Get new notifications
Fetched a total of 2 new toots
Get followers (this may take a while)
Saving 659 statuses, 376 favourites, 478 mentions, and 107 followers

Now you're ready to determine the list of lurkers:

$ mastodon-archive followers --no-mentions [email protected]
Considering the last 12 weeks
There is no whitelist
...

As I said, this is work in progress and I don't really know where I'm going with this. More on my blog.

This command supports the whitelist.

Following

Assume you're on the fediverse just for the conversation. You're not actually interested in following anybody who never talks to you: no journalists, no famous people, no pundits. You just want to follow regular people who interact with you. You can list the people you're following who never mentioned you, and you can unfollow them all!

There are two prerequisites, however:

  1. you need to add the people you're following to the archive
  2. you need to add the mentions to the archive (this can take a long time)

$ mastodon-archive archive --with-following --with-mentions [email protected]
Loading existing archive
Get user info
Get new statuses
X
Added a total of 11 new items
Get new favourites
X
Added a total of 7 new items
Get new notifications and look for mentions
.....
Added a total of 7 new items
Skipping followers
Get following (this may take a while)
Saving 932 statuses, 527 favourites, 657 mentions, 107 followers, and 192 following

Given this data, you can now list the people we're interested in:

$ mastodon-archive following [email protected]
Considering the last 12 weeks
...

All these people that never mentioned you: do you really want to follow them all? If you don't, here's how to unfollow them:

$ mastodon-archive following --unfollow [email protected]
Considering the last 12 weeks
Unfollowing |################################| 1/125
We need to authorize the app to make changes to your account.
Registering app
This app needs access to your Mastodon account.
Visit the following URL and authorize the app:
[long URL here]
Then paste the access token here:
[access token here]

Note that the application needs the permission to unfollow people in your name, which is why you need to authorize it again.

This command supports the whitelist.

Mutes and Blocks

You can download lists of users you've muted and/or blocked by adding --with-mutes and/or --with-blocks to the archive command.

User notes

There is currently a deficiency in the Mastodon API: it can't list all users for whom you have added private notes. Therefore, it is impossible for this script to definitively archive all private notes. However, if you add --with-notes to the archive command, then the script will download and archive notes for all users already downloaded for other reasons, i.e., followers, follows, mutes, and/or blocks. This is useful, e.g., if you are in the habit of adding private notes documenting for your future reference why you've followed, blocked, or muted someone.

Whitelist

You can have a whitelist of people you want to be exempt from some commands. Create a text file with a name like the following: dice.camp.user.kensanata.whitelist.txt.

That is: <your domain>.user.<your account>.whitelist.txt.

There, list the accounts you want to have in your whitelist, one per line. All of these formats should work:

kensanata
[email protected]
Alex Schroeder <[email protected]>
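Reading such a file can be sketched as follows. The function names are made up and the addresses at example.org are placeholders; the sketch only assumes the three line formats listed above and may not match how the tool parses the file.

```python
import re

def parse_whitelist_line(line):
    """Return the account from 'name', 'user@host' or 'Name <user@host>'."""
    line = line.strip()
    match = re.search(r"<([^<>]+)>", line)
    return match.group(1) if match else line

def parse_whitelist(text):
    """Return the set of accounts, skipping empty lines."""
    return {parse_whitelist_line(l) for l in text.splitlines() if l.strip()}

print(sorted(parse_whitelist("kensanata\n\nSome Person <someone@example.org>\n")))
```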

To verify your whitelist, use the whitelist command:

$ mastodon-archive whitelist kensanata@dice.camp
2 accounts are on the whitelist
[email protected]
kensanata

Using wc -l to count the lines in my output, here's how you can see that it works:

$ mastodon-archive followers kensanata@dice.camp | wc -l
58
$ echo [email protected] >> dice.camp.user.kensanata.whitelist.txt
$ mastodon-archive followers kensanata@dice.camp | wc -l
57

Mutuals

How do you go about creating a whitelist, though? It's hard! You could start with the list of people that are following you back, perhaps? Here's a command to do just that:

$ mastodon-archive mutuals [email protected]
Get user info
...

We don't currently store the relationship status in our archive so that's why this command requires a live connection. We do have the list of people we are following in our archive, so we use that. If you haven't done so, you need to create an archive using the --with-following option before you can use the mutuals command.

$ mastodon-archive archive --with-following [email protected]
Loading existing archive
...

Example Setup

I have a shell script called backup-mastodon which does the following:

#!/bin/sh
mkdir -p ~/Documents/Mastodon/
cd ~/Documents/Mastodon/ || exit

accounts="[email protected] [email protected] [email protected]"

echo Archive Statuses, Favourites, Mentions
for acc in $accounts; do
    echo "$acc"
    mastodon-archive archive --skip-bookmarks --with-mentions "$acc"
done

echo Expiring Statuses
for acc in $accounts; do
    echo "$acc"
    mastodon-archive expire --older-than 8 --collection statuses --confirm "$acc"
done

echo Expiring Favourites
for acc in $accounts; do
    echo "$acc"
    mastodon-archive expire --older-than 8 --collection favourites --confirm "$acc"
done

echo Dismissing Notifications
for acc in $accounts; do
    echo "$acc"
    mastodon-archive expire --older-than 8 --collection mentions --delete-other-notifications --confirm "$acc"
done

Documentation

The data we have in our archive file is a hash with the following keys:

  1. account is a User dict
  2. statuses is a list of Toot dicts
  3. favourites is a list of Toot dicts
  4. bookmarks is a list of Toot dicts
  5. mentions is a list of Toot dicts

If you want to understand the details and the nested nature of these data structures, you need to look at the Mastodon API documentation. One way to get started is to look at what a Status entity looks like.
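A quick way to get a feeling for the structure is to load the file in Python and count what's in there. This is just a sketch: the function names are made up, and it only looks at the list-valued keys named above, skipping any that are missing from your archive.

```python
import json

COLLECTIONS = ("statuses", "favourites", "bookmarks", "mentions")

def load_archive(path):
    """Read the archive JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def summarize(archive):
    """Count the toots in every collection present in the archive."""
    return {key: len(archive[key]) for key in COLLECTIONS if key in archive}

# Usage, assuming the file exists in the current directory:
# print(summarize(load_archive("dice.camp.user.kensanata.json")))
```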

Development

The setup.py determines how the app is installed and what its dependencies are.

If you checked out the repository and you want to run the code from the working directory on a single user system, use sudo pip3 install --upgrade --editable . in your working directory to make it "editable" (i.e. the installation you have is now linked to your working directory).

If you don't want to do this for the entire system, you need your own virtual environment: pip3 install virtualenvwrapper, mkvirtualenv ma --python python3 (this creates and activates a virtual environment called ma), pip install -e . (-e installs an "editable" copy) and you're set. Use workon ma to work in that virtual environment in the future.

Processing using jq

jq is a lightweight and flexible command-line JSON processor. That means you can use it to work with your archive.

The following command will take all your favourites and create a map with the keys user and message for each one of them, and put it all in an array.

$ jq '[.favourites[] | {user: .account.username, message: .content}]' < dice.camp.user.kensanata.json

Example output, assuming I had only a single favourite:

[
  {
    "user": "andrhia",
    "message": "<p>It’s nice to reinvent yourself every so often, don’t you think?</p>"
  }
]

Exploring the API

Now that you have token files, you can explore the Mastodon API using curl. Your access token is the long string in the file *.user.*.secret. Here is how to use it.

Get a single status:

curl --silent --show-error \
     --header "Authorization: Bearer $(cat dice.camp.user.kensanata.secret)" \
     https://dice.camp/api/v1/statuses/99005111284322450

Extract the account id from your archive using jq; the -r option prints the raw value without the surrounding double quotes. Then use the id to get some statuses from the account and use jq to print the status ids:

ID=$(jq -r .account.id < dice.camp.user.kensanata.json)
curl --silent --show-error \
     --header "Authorization: Bearer $(cat dice.camp.user.kensanata.secret)" \
     "https://dice.camp/api/v1/accounts/$ID/statuses?limit=3" \
     | jq '.[]|.id'

Alternatives

There are three kinds of alternatives:

  1. Solutions that extract your public toots from your profile, e.g. https://octodon.social/@kensanata. The problem there is that you'll only get "top level" toots and boosts but no replies.

  2. Solutions that extract your public toots from your Atom feed, e.g. https://octodon.social/users/kensanata.atom. The problem there is that you'll only get a few pages worth of toots, not all of them.

  3. Solutions that make use of the official Mastodon downloads.

    • Mastodon Data Viewer is a viewer for Mastodon backup data written in Python. It creates a local server that you can use to browse the data. Designed for large (>40,000) toot archives. This tool only takes official Mastodon backups.
    • mav-z is using HTML and Javascript. You can simply open its HTML file with your browser, pick the exported archive – and see nice stats, walk through your toots, jump to a given month, etc.