---
id: 9
title: A script for bulk downloading from Archive.org
date: 2015-06-30T13:39:00+00:00
author: pesceWanda
layout: post
guid: https://provaprova456789.wordpress.com/2015/06/30/script-per-il-bulk-download-da-archive-org
permalink: /index.php/archives/9
blogger_blog:
- caught-in-thenet.blogspot.com
blogger_author:
- pescedinomewanda
blogger_a4a2016d1c35883202d5ddff9b0ea4ff_permalink:
- 3355260023671997284
categories:
- PesceWanda
tags:
- archive.org
- bulk download archive.org
- script
---
Over the past few days I needed to download several collections from [archive.org](https://en.wikipedia.org/wiki/Internet_Archive), a multimedia digital library whose mission is universal access to all knowledge.
I mostly use it to download large numbers of live concert recordings, which in my opinion are recorded impeccably.
The site has a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.
This is the script I use, adapted from [this](https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh) script: it is written in bash and works on any distribution that has wget, tail and sed installed.
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# To accept only one file type, uncomment the next line, set the extension,
# and add  -A "$filetype"  to the final wget command.
#filetype=.flac
# Extensions of the files to reject, separated by commas.
fileremove=.null
if [ -z "$1" ]; then
    echo "USAGE: archivedownload.sh collectionname"
    echo "See Archive.org entry page for the collection name."
    echo "Collection name must be entered exactly as shown: lower case, with hyphens."
    exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
# Drop the CSV header line and strip the double quotes around each identifier.
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt
count=$(cat processedidentifiers.txt | wc -l)
if [ "$count" -eq 0 ]; then
    echo "No identifiers found for collection $1. Check name and try again."
    rm processedidentifiers.txt identifiers.txt
    exit 1
fi
echo "Beginning wget download of $count identifiers..."
# -e takes robots=off as its argument, so the two must stay adjacent.
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo "Complete."
</pre>
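
Two steps of the script can be tried on their own, without touching the network: cleaning up the identifier list and choosing the accept/reject filters. What follows is only a minimal sketch. The helper names `clean_identifiers` and `filter_options` and the sample identifiers are made up for illustration and do not appear in the original script; the sample input assumes the CSV has one header line followed by quoted identifiers, which is what the `tail` and `sed` step implies.

```shell
#!/bin/bash

# Hypothetical helper: drop the CSV header line and strip the double quotes,
# the same way the script does with tail and sed.
clean_identifiers() {
    tail -n +2 | sed 's/"//g'
}

# Hypothetical helper: print the wget filter options for an accept list
# and a reject list (either one may be empty).
filter_options() {
    local accept="$1" reject="$2" opts=""
    if [ -n "$accept" ]; then
        opts="-A $accept"
    fi
    if [ -n "$reject" ]; then
        opts="${opts:+$opts }-R $reject"
    fi
    printf '%s\n' "$opts"
}

# Made-up sample of the CSV that the search request is assumed to return.
printf '"identifier"\n"example-show-1"\n"example-show-2"\n' | clean_identifiers

# Keep only FLAC files.
filter_options ".flac" ""
# Keep FLAC files and skip MP3 and Ogg files.
filter_options ".flac" ".mp3,.ogg"
```

The first call prints the two sample identifiers without quotes, one per line; the other two print `-A .flac` and `-A .flac -R .mp3,.ogg`, which are the options one would add to the final wget command.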
Francesco Mecca