<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Caught in the Net (Posts about bulk download archive.org)</title><link>francescomecca.eu</link><description></description><atom:link href="francescomecca.eu/categories/bulk-download-archiveorg.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2024 &lt;a href="mailto:francescomecca.eu"&gt;Francesco Mecca&lt;/a&gt; </copyright><lastBuildDate>Wed, 28 Feb 2024 09:29:26 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Script for bulk downloading from Archive.org</title><link>francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/</link><dc:creator>Francesco Mecca</dc:creator><description>&lt;p&gt;Over the past few days I needed to download several collections from &lt;a href="https://en.wikipedia.org/wiki/Internet_Archive"&gt;archive.org&lt;/a&gt;, a multimedia digital library whose mission is universal access to all knowledge.&lt;/p&gt;
&lt;p&gt;I mainly use it to download large numbers of live concert recordings, which in my opinion are recorded impeccably.&lt;/p&gt;
&lt;p&gt;The site has a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.&lt;/p&gt;
&lt;p&gt;This is the script I use, adapted from &lt;a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh"&gt;this&lt;/a&gt; script: it is written in bash and works on any distribution that has wget, tail and sed installed.&lt;/p&gt;
&lt;pre class="wp-code-highlight prettyprint linenums:1"&gt;#!/bin/bash

# Write here the extension of the files that you want to accept (without the leading dot)
#filetype=flac
# then append this option to the final wget command
#-A ".$filetype"
# Write here the extensions of the files that you want to reject, separated by commas
fileremove=.null

if [ -z "$1" ]; then
  echo USAGE: archivedownload.sh collectionname
  echo See Archive.org entry page for the collection name.
  echo Collection name must be entered exactly as shown: lower case, with hyphens.
  exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;amp;fl%5B%5D=identifier&amp;amp;sort%5B%5D=identifier+asc&amp;amp;sort%5B%5D=&amp;amp;sort%5B%5D=&amp;amp;rows=9999&amp;amp;page=1&amp;amp;callback=callback&amp;amp;save=yes&amp;amp;output=csv" -O identifiers.txt
echo Processing entry list for wget parsing…
# Drop the CSV header line and strip the double quotes around each identifier
tail -n +2 identifiers.txt | sed 's/"//g' &amp;gt; processedidentifiers.txt
if [ "$(cat processedidentifiers.txt | wc -l)" -eq 0 ]; then
  echo No identifiers found for collection $1. Check name and try again.
  rm processedidentifiers.txt identifiers.txt
  exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
# -e takes the wgetrc command as its argument, so robots=off must follow it directly
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B 'http://archive.org/download/'
rm identifiers.txt processedidentifiers.txt
echo Complete.
&lt;/pre&gt;

&lt;p&gt;Francesco Mecca &lt;/p&gt;</description><category>archive.org</category><category>bulk download archive.org</category><category>PesceWanda</category><category>script</category><guid>francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/</guid><pubDate>Tue, 30 Jun 2015 13:39:00 GMT</pubDate></item></channel></rss>