<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width" >
< title > Script for bulk downloads from Archive.org | Caught in the Net< / title >
< link rel = "stylesheet" href = "../../../../../assets/blog/fonts/opensans.css" >
< link href = "../../../../../assets/blog/css/normalize.css" rel = "stylesheet" type = "text/css" >
< link href = "../../../../../assets/blog/css/cayman.css" rel = "stylesheet" type = "text/css" >
< meta name = "theme-color" content = "#5670d4" >
< meta name = "generator" content = "Nikola (getnikola.com)" >
< link rel = "alternate" type = "application/rss+xml" title = "RSS" hreflang = "en" href = "../../../../../rss.xml" >
< link rel = "canonical" href = "francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/" >
< meta name = "author" content = "Francesco Mecca" >
< link rel = "prev" href = "../../13/lfbi-contro-la-crittografia/" title = "The FBI versus encryption" type = "text/html" >
< link rel = "next" href = "../../../7/7/la-rivoluzione-digitale-nella-professione-dellavvocato/" title = "The Digital Revolution in the Legal Profession" type = "text/html" >
< meta property = "og:site_name" content = "Caught in the Net" >
< meta property = "og:title" content = "Script for bulk downloads from Archive.org" >
< meta property = "og:url" content = "francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/" >
< meta property = "og:description" content = "Over the past few days I have had to download several collections from archive.org, a multimedia digital library whose mission is universal access to all knowledge." >
< meta property = "og:type" content = "article" >
< meta property = "article:published_time" content = "2015-06-30T13:39:00Z" >
< meta property = "article:tag" content = "archive.org" >
< meta property = "article:tag" content = "bulk download archive.org" >
< meta property = "article:tag" content = "PesceWanda" >
< meta property = "article:tag" content = "script" >
< / head >
< body >
< div id = "container" >
< section class = "page-header" > < h1 class = "project-name" >
Caught in the Net
< / h1 >
< h2 class = "project-tagline" > The net catches you but frees your thoughts< / h2 >
< a class = "btn" href = "../../../../../" > Home< / a >
< a class = "btn" href = "../../../../../pages/about/" > About me< / a >
< a class = "btn" href = "../../../../../pages/contattami/" > Contact me< / a >
< a class = "btn" href = "../../../../../archiveall.html" > Archive< / a >
< a class = "btn" href = "../../../../../rss.xml" > RSS< / a >
< a class = "btn" href = "http://francescomecca.eu/git/pesceWanda" > Personal Git< / a >
< a class = "btn" href = "https://github.com/FraMecca" > Github< / a >
< a class = "btn" href = "../../../../../wp-content/curriculum/CV_Mecca_Francesco.pdf" > Curriculum< / a >
< / section > < section class = "main-content" > < div class = "post" >
< header > < h1 class = "post-title" >
< h1 class = "p-name post-title" itemprop = "headline name" > Script for bulk downloads from Archive.org< / h1 >
< / h1 >
< / header > < p class = "dateline post-date" > 30 June 2015< / p >
< / div >
< div class = "e-content entry-content" itemprop = "articleBody text" >
< div >
< p > Over the past few days I have had to download several collections from < a href = "https://en.wikipedia.org/wiki/Internet_Archive" > archive.org< / a > , a multimedia digital library whose mission is universal access to all knowledge.< / p >
< p > I mainly use it to download large numbers of live concert recordings, which in my opinion are recorded impeccably.< / p >
< p > The site offers a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.< / p >
< p > This is the script I use, adapted from < a href = "https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh" > this< / a > script: it is written in bash and works on any distribution that has wget, tail and sed installed.< / p >
< pre class = "wp-code-highlight prettyprint linenums:1" >#!/bin/bash
# Write here the extension of the files that you want to accept
#filetype=.flac
# then append this option to the final wget command (line 24)
#-A "$filetype"
# Write here the extensions of the files that you want to reject, separated by commas
fileremove=.null
if [ "$1" = "" ]; then
echo "USAGE: archivedownload.sh collectionname"
echo "See Archive.org entry page for the collection name."
echo "Collection name must be entered exactly as shown: lower case, with hyphens."
exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
tail -n +2 identifiers.txt | sed 's/"//g' > processedidentifiers.txt
if [ "$(cat processedidentifiers.txt | wc -l)" -eq 0 ]; then
echo "No identifiers found for collection $1. Check name and try again."
rm processedidentifiers.txt identifiers.txt
exit 1
fi
echo "Beginning wget download of $(cat processedidentifiers.txt | wc -l) identifiers..."
wget -r -H -nc -np -nH -nd -e robots=off -R "$fileremove" -i processedidentifiers.txt -B 'http://archive.org/download/'
rm identifiers.txt processedidentifiers.txt
echo "Complete."
< / pre >
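< p > To try the script's two local steps without contacting archive.org, here is a minimal sketch. The helper names < code > require_tools< / code > and < code > clean_identifiers< / code > are hypothetical and not part of the script, and the sample identifiers are made up; the only assumption taken from the script is that the CSV has one header line followed by one quoted identifier per line.< / p >

```shell
# Hypothetical helpers for illustration; they are not part of the script above.

# Return 0 only if every tool named in the arguments is found on PATH.
require_tools() {
    rt_missing=0
    for rt_tool in "$@"; do
        if ! command -v "$rt_tool" >/dev/null 2>&1; then
            echo "Missing required tool: $rt_tool" >&2
            rt_missing=1
        fi
    done
    return "$rt_missing"
}

# Read the CSV on stdin, drop the header line and strip the double quotes,
# mirroring the tail + sed step of the script.
clean_identifiers() {
    tail -n +2 | sed 's/"//g'
}

# Made-up CSV in the shape the script expects (header, then quoted identifiers).
printf '"identifier"\n"example-item-one"\n"example-item-two"\n' | clean_identifiers
```

< p > Calling < code > require_tools wget tail sed< / code > before a download reports any missing dependency on standard error and returns a non-zero status, so the check can guard the call to < code > archivedownload.sh collectionname< / code > .< / p >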
< p > Francesco Mecca < / p >
< / div >
< / div >
< aside class = "postpromonav" > < nav > < h4 > Categories< / h4 >
< ul itemprop = "keywords" class = "tags" >
< li > < a class = "tag p-category" href = "../../../../../categories/archiveorg/" rel = "tag" > archive.org< / a > < / li >
< li > < a class = "tag p-category" href = "../../../../../categories/bulk-download-archiveorg/" rel = "tag" > bulk download archive.org< / a > < / li >
< li > < a class = "tag p-category" href = "../../../../../categories/pescewanda/" rel = "tag" > PesceWanda< / a > < / li >
< li > < a class = "tag p-category" href = "../../../../../categories/script/" rel = "tag" > script< / a > < / li >
< / ul > < / nav > < / aside > < p class = "sourceline" > < a href = "index.md" class = "sourcelink" > Source< / a > < / p >
< footer class = "site-footer" id = "footer" > < span > CC BY-SA 4.0 International.< br > < / span >
< span class = "site-footer-credits" > < a href = "https://getnikola.com" > Nikola< / a > , < a href = "https://github.com/jasonlong/cayman-theme" > Cayman theme< / a > .< / span >
< / footer > < / section >
< / div >
< / body >
< / html >