<!DOCTYPE html>
<html lang="en-us">
<head>
<meta charset="UTF-8">
<title>Caught in the Net</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<link rel="stylesheet" href="/css/normalize.css">
<!-- <link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'> -->
<link rel="stylesheet" href="/fonts/opensans.css">
<link rel="stylesheet" href="/css/cayman.css">
</head>
<body>
<section class="page-header">
<h1 class="project-name">Caught in the Net</h1>
<h2 class="project-tagline">The net captures you but frees your thinking</h2>
<a class="btn" href="/">Home</a>
<a class="btn" href="/about/">About me</a>
<a class="btn" href="/contattami/">Contact me</a>
<a class="btn" href="/archive/">Archive</a>
<a class="btn" href="/feed.xml">RSS</a>
<a class="btn" href="http://francescomecca.eu/git/explore/repos">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
<a class="btn" href="/curriculum/CV_Mecca_Francesco.pdf">Curriculum</a>
</section>
<section class="main-content">
<div class="post">
<h1 class="post-title">Script for bulk downloading from Archive.org</h1>
<span class="post-date">30 Jun 2015</span>
<p>Over the past few days I have needed to download several collections from <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, a multimedia digital library whose mission is universal access to all knowledge.</p>
<p>I mainly use it to download large numbers of live concert recordings which are, in my opinion, impeccably recorded.</p>
<p>The site provides a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.</p>
<p>This is the script I use, adapted from <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">this</a> script: it is written in bash and works on any distribution that has wget, tail and sed installed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# To accept only one file type, set its extension here
#filetype=flac
# and append this option to the final wget command:
#-A ".$filetype"
# Extensions of the files to reject, separated by commas
fileremove=.null
if [ "$1" = "" ]; then
echo "USAGE: archivedownload.sh collectionname"
echo "See Archive.org entry page for the collection name."
echo "Collection name must be entered exactly as shown: lower case, with hyphens."
exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
# Drop the CSV header line and strip the double quotes around each identifier
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt
if [ "$(cat processedidentifiers.txt | wc -l)" = "0" ]; then
echo "No identifiers found for collection $1. Check name and try again."
rm processedidentifiers.txt identifiers.txt
exit 1
fi
echo "Beginning wget download of $(cat processedidentifiers.txt | wc -l) identifiers..."
# "-e robots=off" must stay together; -R takes the comma-separated reject list
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B 'http://archive.org/download/'
rm identifiers.txt processedidentifiers.txt
echo "Complete."
</pre>
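<p>To make the processing step easier to follow, here is a minimal sketch of what the <code>tail</code> and <code>sed</code> pipeline does to the downloaded list. The function name <code>process_identifiers</code> and the sample identifiers are hypothetical, and the CSV layout (one header line followed by double-quoted identifiers) is an assumption inferred from the fact that the script skips the first line and strips the quotes.</p>

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): turn the CSV list
# into one bare identifier per line, as wget -i expects.
process_identifiers() {
  # Skip the header line, then strip every double quote
  tail -n +2 "$1" | sed 's/"//g'
}

# Made-up sample in the assumed layout: a header, then quoted identifiers
sample="$(mktemp)"
printf '"identifier"\n"example-show-one"\n"example-show-two"\n' > "$sample"

# Prints the two identifiers without quotes, one per line
process_identifiers "$sample"
rm -f "$sample"
```

<p>Each resulting line is then appended by wget to the base URL given with <code>-B</code>, which is how the script builds the download address for every item in the collection.</p>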
<p>Francesco Mecca</p>
</div>
<footer class="site-footer">
<!-- <span class="site-footer-owner"><a href="http://francescomecca.eu">Caught in the Net</a> is maintained by <a href="contattami">Francesco Mecca</a>.</span> -->
<span>CC BY-SA 4.0 International.<br></span>
<span class="site-footer-credits"><a href="https://jekyllrb.com">Jekyll</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
</footer>
</section>
</body>
</html>