<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?> <rss version= "2.0" xmlns:dc= "http://purl.org/dc/elements/1.1/" xmlns:atom= "http://www.w3.org/2005/Atom" > <channel > <title > Caught in the Net (Posts about bulk download archive.org)</title> <link > francescomecca.eu</link> <description > </description> <atom:link href= "francescomecca.eu/categories/bulk-download-archiveorg.xml" rel= "self" type= "application/rss+xml" > </atom:link> <language > en</language> <copyright > Contents © 2020 < a href="mailto:francescomecca.eu"> Francesco Mecca< /a> </copyright> <lastBuildDate > Wed, 29 Jan 2020 10:04:36 GMT</lastBuildDate> <generator > Nikola (getnikola.com)</generator> <docs > http://blogs.law.harvard.edu/tech/rss</docs> <item > <title > A script for bulk downloads from Archive.org</title> <link > francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/</link> <dc:creator > Francesco Mecca</dc:creator> <description > < div> < p> Over the past few days I have had to download several collections from < a href="https://en.wikipedia.org/wiki/Internet_Archive"> archive.org< /a> , a multimedia digital library whose mission is universal access to all knowledge.< /p>
< p> I mainly use it to download large numbers of live concert recordings, which in my opinion are impeccably recorded.< /p>
< p> The site has a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.< /p>
< p> This is the script I use, adapted from < a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh"> this< /a> script: it is written in bash and works on any distribution that has wget, tail and sed installed.< /p>
< pre class="wp-code-highlight prettyprint linenums:1"> #!/bin/bash
# Write here the extension of the files that you want to accept
#filetype=flac
# then append this option to the recursive wget command near the end of the script:
#-A ".$filetype"
# Write here the extensions of the files that you want to reject, separated by commas
fileremove=.null
if [ "$1" = "" ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt
if [ ! -s processedidentifiers.txt ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B 'http://archive.org/download/'
rm identifiers.txt processedidentifiers.txt
echo Complete.
< /pre>
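< p> As a rough illustration of the list-processing step in the script above, the sketch below wraps the same tail and sed pipeline in a function and feeds it sample data. The function name clean_identifiers and the sample identifiers are hypothetical and only there for the example; in the script the input is the CSV file returned by the search query.< /p>

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): turn CSV text shaped
# like the search response into a plain list of identifiers, one per line.
clean_identifiers() {
    # Drop the CSV header line, then strip the double quotes around each value.
    tail -n +2 | sed 's/"//g'
}

# Sample input imitating the CSV response; these identifiers are made up.
printf '"identifier"\n"example-show-01"\n"example-show-02"\n' | clean_identifiers
```

< p> Each line that comes out is a bare identifier, which is what the final wget command resolves against the base URL given with -B.< /p>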
< p> Francesco Mecca < /p> < /div> </description> <category > archive.org</category> <category > bulk download archive.org</category> <category > PesceWanda</category> <category > script</category> <guid > francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/</guid> <pubDate > Tue, 30 Jun 2015 13:39:00 GMT</pubDate> </item> </channel> </rss>