Caught in the Net (Posts about archive.org)
francescomecca.eu
Contents © 2020 <a href="mailto:francescomecca.eu">Francesco Mecca</a>
Wed, 29 Jan 2020 10:04:36 GMT

Script for bulk downloading from Archive.org
francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/
Francesco Mecca
<div><p>Over the past few days I needed to download several collections from <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, a multimedia digital library whose mission is universal access to all knowledge.</p>
<p>I mainly use it to download lots of live concert recordings, which in my opinion are impeccably recorded.</p>
<p>The site has a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.</p>
<p>This is the script I use, adapted from <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">this</a> script: it is written in bash and works on any distribution that has wget, tail and sed installed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash

# Optional: extension of the files you want to accept.
# To use it, uncomment the next line and add -A ".$filetype"
# to the final wget command.
#filetype=flac

# Extensions of the files you want to reject, separated by commas.
fileremove=.null

# The collection name is the only (mandatory) argument.
if [ -z "$1" ]; then
    echo "USAGE: archivedownload.sh collectionname"
    echo "See Archive.org entry page for the collection name."
    echo "Collection name must be entered exactly as shown: lower case, with hyphens."
    exit 1
fi

echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt

echo "Processing entry list for wget parsing..."
# Skip the CSV header row and strip the double quotes around each identifier.
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt

count=$(wc -l &lt; processedidentifiers.txt)
if [ "$count" -eq 0 ]; then
    echo "No identifiers found for collection $1. Check name and try again."
    rm processedidentifiers.txt identifiers.txt
    exit 1
fi

echo "Beginning wget download of $count identifiers..."
# -e takes the command that follows it, so robots=off must come right after -e.
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B 'http://archive.org/download/'

rm identifiers.txt processedidentifiers.txt
echo "Complete."
</pre>
<p>Francesco Mecca </p></div>
archive.org, bulk download archive.org, PesceWanda, script
Tue, 30 Jun 2015 13:39:00 GMT
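
The identifier-cleaning step of the script can be tried on its own, without touching the network. The following is only a minimal sketch: `strip_identifiers` is a hypothetical helper name that does not appear in the original script, and the sample identifiers are made up. It assumes the search endpoint returns a CSV with one header row followed by one double-quoted identifier per line, which is what the `tail` and `sed` calls in the script expect.

```shell
#!/bin/bash
# Sketch only: strip_identifiers is a hypothetical name, not part of the original script.
# It reads the CSV on stdin and prints one bare identifier per line.
strip_identifiers() {
    # tail -n +2 drops the CSV header row;
    # sed removes the double quotes wrapped around each value.
    tail -n +2 | sed 's/"//g'
}

# Made-up sample input in the assumed CSV shape (header row, then quoted identifiers).
printf '"identifier"\n"example-show-one"\n"example-show-two"\n' | strip_identifiers
```

Each line that comes out is what wget later appends to the base URL given with `-B`, so a stray quote or the leftover header row would turn into a broken download URL.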