<!DOCTYPE html>
< html lang = "en-us" >
< head >
< link href = "http://gmpg.org/xfn/11" rel = "profile" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta http-equiv = "content-type" content = "text/html; charset=utf-8" >
<!-- Enable responsiveness on mobile devices -->
< meta name = "viewport" content = "width=device-width, initial-scale=1.0, maximum-scale=1" >
< title >
A Script for Bulk Downloads from Archive.org · Caught in the Net
< / title >
<!-- CSS -->
< link rel = "stylesheet" href = "/public/css/poole.css" >
< link rel = "stylesheet" href = "/public/css/syntax.css" >
< link rel = "stylesheet" href = "/public/css/hyde.css" >
<!-- Icons -->
< link rel = "apple-touch-icon-precomposed" sizes = "144x144" href = "/public/apple-touch-icon-144-precomposed.png" >
< link rel = "shortcut icon" href = "/public/favicon.ico" >
<!-- RSS -->
< link rel = "alternate" type = "application/rss+xml" title = "RSS" href = "/atom.xml" >
< / head >
< body class = "theme-base-09" >
< div class = "sidebar" >
< div class = "container sidebar-sticky" >
< div class = "sidebar-about" >
< h1 >
< a href = "/" >
Caught in the Net
< / a >
< / h1 >
< p class = "lead" > < / p >
< / div >
< nav class = "sidebar-nav" >
< a class = "sidebar-nav-item" href = "/" > Home< / a >
< a class = "sidebar-nav-item" href = "/about/" > About< / a >
< a class = "sidebar-nav-item" href = "/archive/" > Archive< / a >
< a class = "sidebar-nav-item" href = "/contattami/" > Contact me< / a >
< a class = "sidebar-nav-item" href = "/atom.xml" > RSS< / a >
< a class = "sidebar-nav-item" href = "http://francescomecca.eu:3000" > Personal Git< / a >
< a class = "sidebar-nav-item" href = "https://github.com/s211897-studentipolito" > GitHub< / a >
< span class = "sidebar-nav-item" href = "" > Powered by Jekyll and Hyde< / span >
< / nav >
< p > © 2016. CC BY-SA 4.0 International < / p >
< / div >
< / div >
< h3 class = "masthead-title" >
< a href = "/" title = "Home" > Caught in the Net< / a >
< / h3 >
< div class = "content container" >
< div class = "post" >
< h1 class = "post-title" > A Script for Bulk Downloads from Archive.org< / h1 >
< span class = "post-date" > 30 Jun 2015< / span >
< p > Lately I have needed to download several collections from < a href = "https://en.wikipedia.org/wiki/Internet_Archive" > archive.org< / a > , a multimedia digital library whose mission is universal access to all knowledge.< / p >
< p > I mostly use it to download large numbers of live concert recordings, which in my opinion are recorded impeccably.< / p >
< p > The site offers a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.< / p >
< p > This is the script I use, adapted from < a href = "https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh" > this script< / a > : it is written in bash and works on any distribution that has wget, tail and sed installed.< / p >
< pre class = "wp-code-highlight prettyprint linenums:1" >#!/bin/bash
# Optional: extension of the files you want to accept
#filetype=.flac
# To enable it, append this to the wget command on line 24:
#-A "$filetype"
# Extensions of the files you want to reject, separated by commas
fileremove=.null
if [ -z "$1" ]; then
    echo "USAGE: archivedownload.sh collectionname"
    echo "See Archive.org entry page for the collection name."
    echo "Collection name must be entered exactly as shown: lower case, with hyphens."
    exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&fl%5B%5D=identifier&sort%5B%5D=identifier+asc&sort%5B%5D=&sort%5B%5D=&rows=9999&page=1&callback=callback&save=yes&output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
tail -n +2 identifiers.txt | sed 's/"//g' > processedidentifiers.txt  # drop the CSV header, strip quotes
if [ ! -s processedidentifiers.txt ]; then
    echo "No identifiers found for collection $1. Check name and try again."
    rm processedidentifiers.txt identifiers.txt
    exit 1
fi
echo "Beginning wget download of $(cat processedidentifiers.txt | wc -l) identifiers..."
wget -r -H -nc -np -nH -nd -R "$fileremove" -e robots=off -i processedidentifiers.txt -B 'http://archive.org/download/'
rm identifiers.txt processedidentifiers.txt
echo "Complete."
< / pre >
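< p > To make the list-processing step more concrete, here is a minimal sketch of what the tail and sed pipeline does to the CSV returned by the search. The function name clean_identifiers and the sample identifiers are made up for illustration; they are not part of the script above.< / p >

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): read the CSV from
# standard input and print one bare identifier per line.
clean_identifiers() {
    # tail -n +2 starts output at the second line, skipping the CSV header;
    # sed then deletes every double quote.
    tail -n +2 | sed 's/"//g'
}

# Invented sample input that mimics the shape of the CSV:
# a quoted header row followed by quoted identifiers.
sample='"identifier"
"example-show-2015-06-01"
"example-show-2015-06-02"'

printf '%s\n' "$sample" | clean_identifiers
```

< p > The first line of the CSV is the column header, which is why it gets dropped, and the quotes around each identifier have to go because wget -i expects one plain entry per line, which -B then resolves against the download URL.< / p >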
< p > Francesco Mecca< / p >
< / div >
< div class = "related" >
< h2 > Related Posts< / h2 >
< ul class = "related-posts" >
< li >
< h3 >
< a href = "/pescewanda/2016/07/05/arduino_keyboard/" >
Arduino Uno as HID keyboard
< small > 05 Jul 2016< / small >
< / a >
< / h3 >
< / li >
< li >
< h3 >
< a href = "/pescewanda/2016/05/16/lifehacks2/" >
Lifehacks (2)
< small > 16 May 2016< / small >
< / a >
< / h3 >
< / li >
< li >
< h3 >
< a href = "/pescewanda/2016/05/15/genetic-alg/" >
Interpolation using a genetic algorithm
< small > 15 May 2016< / small >
< / a >
< / h3 >
< / li >
< / ul >
< / div >
< / div >
< / body >
< / html >