How do I remove large files from Git history?

Background

If large, useless files are accidentally committed, they stay in the Git history and can make your repository very large. GitHub and Bitbucket each limit repository size (see https://help.github.com/articles/what-is-my-disk-quota/ and https://confluence.atlassian.com/pages/viewpage.action;jsessionid=735235A3CE151FB6D4C518F3971FD524.node2?pageId=273877699). So how do you find the large files in the Git history and delete them?

Solution

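1. In the root directory of the repository, run the following shell script. It lists the ten largest objects in the repository's pack files. The script reads .git/objects/pack/pack-*.idx, so if the repository has no pack files yet, run git gc first.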

#!/bin/bash
#set -x
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
# list all objects including their size, sort by size, take top 10
objects=$(git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head)
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
for y in $objects
do
    # extract the size in bytes (3rd column) and convert it to kB;
    # awk is used because the type column is padded with a varying number of spaces
    size=$(( $(echo "$y" | awk '{print $3}') / 1024 ))
    # extract the compressed size in bytes (4th column) and convert it to kB
    compressedSize=$(( $(echo "$y" | awk '{print $4}') / 1024 ))
    # extract the SHA
    sha=$(echo "$y" | awk '{print $1}')
    # find the object's location in the repository tree
    other=$(git rev-list --all --objects | grep "$sha")
    output="${output}\n${size},${compressedSize},${other}"
done
echo -e "$output" | column -t -s ', '

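Result: the script prints a table with the columns size, pack, SHA and location.

Note that the last column lists the location of each file. The location is used in step 2.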
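As a cross-check on the script above, the same information can be obtained without parsing pack files. The following is a minimal sketch, not part of the original script: it assumes a Git version whose git cat-file supports --batch-check with a format string, and the function name list_largest_blobs is made up for illustration. Unlike the script, it reports uncompressed sizes in bytes and also sees objects that have not been packed yet.

```bash
#!/bin/bash
# list_largest_blobs [N]  (hypothetical helper name)
# Prints "<SHA> <size in bytes> <path>" for the N largest blobs (default 10)
# reachable from any ref, largest first.
list_largest_blobs() {
    count="${1:-10}"
    # rev-list emits "<SHA> <path>"; cat-file looks up type and size for each SHA
    # and passes the path through via %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "$count" |
        cut -d' ' -f2-
}
```

Calling list_largest_blobs 10 in the repository root gives a list whose last column plays the same role as the location column above.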

2. Remove the files with git filter-branch:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.gz' HEAD

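Replace `*.gz` with the location obtained in step 1.

If the file you want to delete still exists in the current directory and you want to keep it, replace HEAD with HEAD^.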
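The command above rewrites only the commits reachable from HEAD. If the file was also committed on other branches or tags, a variant along the following lines may be closer to what is needed. This is a sketch under assumptions: the function name purge_path_from_history is hypothetical, the path is assumed to contain no single quotes, and the working tree must be clean before it runs.

```bash
#!/bin/bash
# purge_path_from_history <path>  (hypothetical helper name)
# Removes <path> from every commit on every branch and tag.
purge_path_from_history() {
    path="$1"
    # --index-filter     edits only the index, so no checkout is needed per commit
    # --prune-empty      drops commits that become empty once the file is gone
    # --tag-name-filter  "cat" keeps tag names and points them at the rewritten commits
    # -- --all           rewrites all refs instead of only HEAD
    git filter-branch --force --prune-empty \
        --index-filter "git rm -r --cached --ignore-unmatch --quiet -- '$path'" \
        --tag-name-filter cat \
        -- --all
}
```

For example, purge_path_from_history big.bin would remove a file named big.bin (a placeholder for a location found in step 1) from the whole history.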

3. Push back to the remote repository:

git push origin master --force
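A plain --force overwrites whatever is on the remote, including commits someone else pushed in the meantime. As a more cautious alternative, git push also accepts --force-with-lease, which refuses to overwrite the remote branch if it has moved since your last fetch. A minimal sketch, with a hypothetical function name and with origin and master as assumed defaults:

```bash
#!/bin/bash
# force_push_branch [remote] [branch]  (hypothetical helper name)
# Overwrites the remote branch with the rewritten local one, but only if the
# remote branch still points where the local remote-tracking ref says it does.
force_push_branch() {
    remote="${1:-origin}"
    branch="${2:-master}"
    git push --force-with-lease "$remote" "$branch"
}
```

Running force_push_branch origin master is then the counterpart of the command above.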

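4. Clean up the repository. This step is also required: until the reflog entries are expired and the unreachable objects are pruned, the old objects remain in .git and the repository does not shrink.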

git reflog expire --expire=now --all
git gc --aggressive --prune=now
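One detail worth knowing here: git filter-branch keeps a backup of the original refs under refs/original/, and as long as those refs exist the old objects remain reachable, so gc will not prune them. The sketch below deletes the backup refs first and then runs the two commands above. The function name cleanup_after_rewrite is hypothetical, and the sketch assumes you no longer need the backup.

```bash
#!/bin/bash
# cleanup_after_rewrite  (hypothetical helper name)
# Drops the filter-branch backup refs, then expires reflogs and prunes.
cleanup_after_rewrite() {
    # delete every backup ref that filter-branch stored under refs/original/
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    # expire all reflog entries so they no longer reference the old commits
    git reflog expire --expire=now --all
    # repack and remove every object that is now unreachable
    git gc --aggressive --prune=now
}
```

After cleanup_after_rewrite has run, the rewrite can no longer be undone locally, so it is worth checking the result of step 2 beforehand.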

5. Check the size of the repository with the following command:

du -sh .git
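du measures the whole .git directory, including hooks and other files. For a figure limited to the object store, git count-objects can be used as well. A small sketch, with a hypothetical function name, that prints only the loose-object size (size) and the packed size (size-pack):

```bash
#!/bin/bash
# repo_object_sizes  (hypothetical helper name)
# Prints the "size" and "size-pack" lines of git count-objects in
# human-readable units.
repo_object_sizes() {
    git count-objects -vH |
        awk -F': ' '$1 == "size" || $1 == "size-pack"'
}
```

Comparing the size-pack value before step 2 and after step 4 shows how much space the rewrite actually freed.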
