Merge pull request github#248 from github/ls/scripts
move scripts over from git-repo-analysis
larsxschneider authored Jun 20, 2019
2 parents ed56a4b + 30b2d54 commit 6df2ef6
Showing 12 changed files with 657 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -14,3 +14,4 @@ Make a pull request and we'll consider it.
* _graphql_: here's a bunch of sample GraphQL queries that can be run against our [GitHub GraphQL API](https://developer.github.com/early-access/graphql).
* _hooks_: want to find out how to write a consumer for [our web hooks](https://developer.github.com/webhooks/)? The examples in this subdirectory show you how. We are open for more contributions via pull requests.
* _pre-receive-hooks_: this one contains [pre-receive-hooks](https://help.github.com/enterprise/admin/guides/developer-workflow/about-pre-receive-hooks/) that can block commits on GitHub Enterprise that do not fit your requirements. Do you have more great examples? Create a pull request and we will check it out.
* _scripts_: want to analyze or clean up your Git repository? The scripts in this subdirectory show you how. We are open for more contributions via pull requests.
17 changes: 17 additions & 0 deletions scripts/README.md
@@ -0,0 +1,17 @@
# Git Repo Analysis Scripts

Git can become slow if a repository exceeds certain thresholds ([read this for details](http://larsxschneider.github.io/2016/09/21/large-git-repos)). Use the scripts explained below to identify possible culprits in a repository. The scripts have been tested on macOS but they should run on Linux as is.

_Hint:_ The scripts can run for a long time and output a lot of lines. Pipe their output to a file (`./script > myfile`) for further processing.

## Large by File Size
Use the [git-find-large-files](git-find-large-files) script to identify large files in your Git repository that you could move to [Git LFS](https://git-lfs.github.com/) (e.g. using [git-lfs-migrate](https://github.com/git-lfs/git-lfs/blob/master/docs/man/git-lfs-migrate.1.ronn)).

Use the [git-find-lfs-extensions](git-find-lfs-extensions) script to identify certain file types that you could move to [Git LFS](https://git-lfs.github.com/).
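
Once the large files or extensions are known, a minimal follow-up sketch (assuming Git LFS is installed; the file patterns below are placeholders for whatever the scripts report in your repository) could rewrite history so those files are stored in LFS:

```bash
# Sketch: move all *.psd and *.zip files on every branch and tag to Git LFS.
# Substitute the extensions reported by git-find-lfs-extensions.
# Rewriting history changes commit SHAs, so collaborators must re-clone.
git lfs migrate import --everything --include="*.psd,*.zip"
```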

## Large by File Count
Use the [git-find-dirs-many-files](git-find-dirs-many-files) and [git-find-dirs-unwanted](git-find-dirs-unwanted) scripts to identify directories with a large number of files. These might indicate 3rd party components that could be extracted.

Use the [git-find-dirs-deleted-files](git-find-dirs-deleted-files) script to identify directories that have been deleted but used to contain a lot of files. If you purge all files under these directories from your history, you might be able to significantly reduce the overall size of your repository. A hedged purge sketch follows below.
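
One way to purge such a directory (a sketch assuming [git filter-repo](https://github.com/newren/git-filter-repo) is installed; `old_vendor` is a placeholder path, not one of the scripts in this commit):

```bash
# Sketch: remove every file ever committed under the placeholder directory
# "old_vendor" from the entire history. This rewrites all commits, so run
# it on a fresh clone and coordinate before pushing the result.
git filter-repo --invert-paths --path old_vendor
```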


29 changes: 29 additions & 0 deletions scripts/git-change-author
@@ -0,0 +1,29 @@
#!/usr/bin/env bash
#
# Fix an invalid committer/author in all commits of your repository.
#
# Usage:
# git-change-author <old-email> <new-name> <new-email>
#
# Author: Lars Schneider, https://github.com/larsxschneider
#

filter=$(cat <<EOF
OLD_EMAIL='$1'
NEW_NAME='$2'
NEW_EMAIL='$3'
if [ "\$GIT_COMMITTER_EMAIL" = "\$OLD_EMAIL" ]
then
export GIT_COMMITTER_NAME="\$NEW_NAME"
export GIT_COMMITTER_EMAIL="\$NEW_EMAIL"
fi
if [ "\$GIT_AUTHOR_EMAIL" = "\$OLD_EMAIL" ]
then
export GIT_AUTHOR_NAME="\$NEW_NAME"
export GIT_AUTHOR_EMAIL="\$NEW_EMAIL"
fi
EOF
)

git filter-branch --env-filter "$filter" --tag-name-filter cat -- --all
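
A usage sketch (the name and email addresses are placeholders):

```bash
# Sketch: rewrite every commit whose author or committer email matches the
# old address. git filter-branch rewrites history, so run this on a fresh
# clone and coordinate with collaborators before pushing the result.
./git-change-author "old@example.com" "Jane Doe" "jane@example.com"
```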
34 changes: 34 additions & 0 deletions scripts/git-find-dirs-deleted-files
@@ -0,0 +1,34 @@
#!/usr/bin/env bash
#
# Print the number of deleted files per directory. The output indicates
# if the directory is present in the HEAD revision.
#
# A deleted directory with a lot of files could indicate a 3rd party
# component that has been deleted. These are usually good candidates for
# purging to make Git repositories smaller (see `git-purge-files`).
#
# The script must be called from the root of the Git repository.
#
# Usage:
# git-find-dirs-deleted-files
#
# Output: [deleted file count] [directory still in HEAD revision?] [directory]
#
# Author: Lars Schneider, https://github.com/larsxschneider
#

git -c diff.renameLimit=30000 log --diff-filter=D --summary |
grep ' delete mode ...... ' |
sed 's/ delete mode ...... //' |
while read -r F ; do
D=$(dirname "$F");
if ! [ -d "$D" ]; then
while ! [ -d "$(dirname "$D")" ] ; do D=$(dirname "$D"); done;
echo "deleted $D";
else
echo "present $D";
fi;
done |
sort |
uniq -c |
sort -k 2,2 -r
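
An invocation sketch (the script path is an assumption and the sample output line is hypothetical):

```bash
# Sketch: run from the repository root and keep the output for later review.
./scripts/git-find-dirs-deleted-files > deleted-dirs.txt

# A hypothetical line such as "  412 deleted vendor/foo" would mean that 412
# files were deleted under vendor/foo and the directory is gone from HEAD.
```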
31 changes: 31 additions & 0 deletions scripts/git-find-dirs-many-files
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
#
# Print directories with the number of files underneath them.
#
# A directory with a lot of files could indicate a 3rd party component.
# These are usually good candidates for purging to make Git repositories
# smaller (see `git-purge-files`).
#
# The script must be called from the root of the Git repository.
#
# Usage:
# git-find-dirs-many-files [file count threshold]
#
# Author: Lars Schneider, https://github.com/larsxschneider
#

if [ -z "$1" ]; then
FILE_COUNT=100
else
FILE_COUNT=$1
fi

IFS=$'\n';
DIRS=$(find . -type d -not -path "./.git/*" -exec bash -c 'COUNT=$(find "$0" -type f | wc -l); echo "$COUNT $0"' {} \; | sort -nr)

for DIR in $DIRS; do
if [ $(($(echo $DIR | sed 's/\..*//'))) -le $FILE_COUNT ]; then
break
fi
echo $DIR
done
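
A usage sketch (the threshold and script path are assumptions):

```bash
# Sketch: from the repository root, list directories containing more than
# 500 files (the default threshold is 100).
./scripts/git-find-dirs-many-files 500
```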
65 changes: 65 additions & 0 deletions scripts/git-find-dirs-unwanted
@@ -0,0 +1,65 @@
#!/usr/bin/env bash
#
# git-find-dirs-unwanted
#
# Search the entire history of a Git repository for (potentially)
# unwanted directories. E.g. 3rd party directories, temp, build or
# Perforce stream directories.
#
# The script prints the number of files under each directory to see the
# impact on the Git tree. Directories with a large number of files can
# be good candidates for exclusions in repository migrations to Git.
#
# The script must be called in the Git root directory.
#
# Author: Lars Schneider, https://github.com/larsxschneider
#

DIRS=$(git -c diff.renameLimit=30000 log --all --name-only --pretty=format: \
| awk -F'[^/]*$' '{print $1}' \
| sort -u \
| grep -i \
-e 3p \
-e 3rd \
-e artifacts \
-e assemblies \
-e backup \
-e bin \
-e build \
-e components \
-e debug \
-e deploy \
-e generated \
-e install \
-e lib \
-e modules \
-e obj \
-e output \
-e packages \
-e party \
-e recycle.bin \
-e release \
-e resources \
-e streams \
-e temp \
-e third \
-e tmp \
-e tools \
-e util \
-e vendor \
-e x64 \
-e x86 \
)

IFS=$'\n'
for I in $DIRS; do
if [ -e "$I" ]; then
FILE_COUNT=$(find "$I" -type f | wc -l)
echo "$FILE_COUNT $I"
else
while ! [ -e "$(dirname "$I")" ]; do
I=$(dirname "$I")/;
done;
echo "deleted $I"
fi
done | sort -n -r | uniq
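
A usage sketch (the script path is an assumption):

```bash
# Sketch: from the repository root, show the matches with the highest file
# counts first; lines prefixed with "deleted" mark matching directories that
# are no longer present in the working tree.
./scripts/git-find-dirs-unwanted | head -n 40
```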
63 changes: 63 additions & 0 deletions scripts/git-find-ignored-files
@@ -0,0 +1,63 @@
#!/usr/bin/env bash
#
# Find all files present in the index and working tree ignored by .gitignore.
#
# Usage: git-find-ignored-files [-s | --sort-by-size] [--help]
#
# Author: Patrick Lühne, https://www.luehne.de/
#

function print_help
{
grep "^# Usage" < "$0" | cut -c 3-
}

if [[ $# -gt 1 ]]
then
print_help
exit 1
fi

case "$1" in
-h|--help)
print_help
exit 0
;;
-s|--sort-by-size)
;;
*)
if [[ $# -gt 0 ]]
then
(>&2 echo "error: unknown option “$1")
print_help
exit 1
fi
;;
esac

# Find all ignored files
files=$(git ls-files --ignored --exclude-standard)

# Stop if no ignored files were found
if [[ -z $files ]]
then
(>&2 echo "info: no ignored files in working tree or index")
exit 0
fi

# Compute the file sizes of all these files
file_sizes=$(echo "$files" | tr '\n' '\0' | xargs -0 du -sh)

# Obtain the origins why these files are ignored
gitignore_origins=$(echo "$files" | git check-ignore --verbose --stdin --no-index)

# Merge the two lists into one
command="join -1 2 -2 2 -t $'\t' -o 1.1,1.2,2.1 <(echo \"$file_sizes\") <(echo \"$gitignore_origins\")"

if [[ $1 =~ ^(-s|--sort-by-size)$ ]]
then
command="$command | sort -h"
fi

eval "$command"
91 changes: 91 additions & 0 deletions scripts/git-find-large-files
@@ -0,0 +1,91 @@
#!/usr/bin/env bash
#
# Print the largest files in a Git repository. The script must be called
# from the root of the Git repository. You can pass a threshold to print
# only files greater than a certain size (compressed size in Git database,
# default is 500 KB).
#
# Files that have a large compressed size should usually be stored in
# Git LFS [2].
#
# Based on a script from Antony Stubbs [1] and improved with ideas from Peff.
#
# [1] http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# [2] https://git-lfs.github.com/
#
# Usage:
# git-find-large-files [size threshold in KB]
#
# Author: Lars Schneider, https://github.com/larsxschneider
#

if [ -z "$1" ]; then
MIN_SIZE_IN_KB=500
else
MIN_SIZE_IN_KB=$1
fi

# Use "look" if it is available, otherwise use "grep" (e.g. on Windows)
if command -v look >/dev/null 2>&1; then
# On Debian the "-b" is available and required to make "look" perform
# a binary search (see https://unix.stackexchange.com/a/499312/275508 ).
if look 2>&1 | grep -q .-b; then
search="look -b"
else
search=look
fi
else
search=grep
fi

# set the internal field separator to line break,
# so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by compressed size
OBJECTS=$(
git cat-file \
--batch-all-objects \
--batch-check='%(objectsize:disk) %(objectname)' \
| sort -nr
)

TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/git-find-large-files.XXXXXX") || exit
trap "rm -rf '$TMP_DIR'" EXIT

git rev-list --all --objects | sort > "$TMP_DIR/objects"
git rev-list --all --objects --max-count=1 | sort > "$TMP_DIR/objects.1"

for OBJ in $OBJECTS; do
# extract the compressed size in kilobytes
COMPRESSED_SIZE=$(($(echo $OBJ | cut -f 1 -d ' ')/1024))

if [ $COMPRESSED_SIZE -le $MIN_SIZE_IN_KB ]; then
break
fi

# extract the SHA
SHA=$(echo $OBJ | cut -f 2 -d ' ')

# find the objects location in the repository tree
LOCATION=$($search $SHA "$TMP_DIR/objects" | sed "s/$SHA //")
if $search $SHA "$TMP_DIR/objects.1" >/dev/null; then
# Object is in the head revision
HEAD="Present"
elif [ -e "$LOCATION" ]; then
# Object's path is still in the head revision
HEAD="Changed"
else
# Neither the object nor its path is in the head revision
HEAD="Deleted"
fi

echo "$COMPRESSED_SIZE,$HEAD,$LOCATION" >> "$TMP_DIR/output"
done

if [ -f "$TMP_DIR/output" ]; then
column -t -s ',' < "$TMP_DIR/output"
fi

rm -rf "$TMP_DIR"
exit 0
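
A usage sketch (the threshold and script path are assumptions):

```bash
# Sketch: from the repository root, report objects whose compressed size in
# the Git object database exceeds 1000 KB (the default threshold is 500 KB).
./scripts/git-find-large-files 1000

# Columns: compressed size in KB; Present/Changed/Deleted relative to HEAD;
# path of the object in the repository tree.
```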