francescomecca.eu/_site/index.php/archives/9.html
2017-05-09 11:29:54 +02:00

581 lines
9.5 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en-us">
<head>
<meta charset="UTF-8">
<title>Caught in the Net</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<link rel="stylesheet" href="/css/normalize.css">
<!--<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>-->
<link rel="stylesheet" href="/fonts/opensans.css">
<link rel="stylesheet" href="/css/cayman.css">
</head>
<body>
<section class="page-header">
<h1 class="project-name">Caught in the Net</h1>
<h2 class="project-tagline">La rete ti cattura ma libera il pensiero</h2>
<a class="btn" href="/">Home</a>
<a class="btn" href="/archive/">Archive</a>
<a class="btn" href="/contattami/">Contact me</a>
<a class="btn" href="http://francescomecca.eu:3000/explore/repos">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
<a class="btn" href="/feed.xml">RSS</a>
</section>
<section class="main-content">
<div class="post">
<h1 class="post-title">Script per il bulk download da Archive.org</h1>
<span class="post-date">30 Jun 2015</span>
<p>In questi giorni mi e` capitato di dover scaricare varie collezioni da <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, una libreria digitale multimediale la cui missione e` l&#8217;accesso universale a tutta la conoscenza.</p>
<p>Principalmente lo uso per scaricare tantissime registrazioni live di vari concerti registrati a mio avviso in maniera impeccabile.</p>
<p>Nel sito si trova una guida per scaricare in bulk usando wget e gli strumenti del sito, ma risulta piuttosto prolissa e complicata se si vuole fare un download al volo.</p>
<p>Questo e` lo script che uso, modificato da <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">questo</a> script: e` scritto in bash e funziona su tutte le distribuzioni sulle quali e` installato wget, tail e sed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Write here the extension of the file that you want to accept
#filetype =.flac
#append this to line 24
#-A .$filetype
#Write here the extension of the file that you want to reject, divided by a comma
fileremove = .null
if [ “$1” = “” ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q “http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv” -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed s/”//g &gt; processedidentifiers.txt
if [ “`cat processedidentifiers.txt | wc -l`” = “0” ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e -R $fileremove robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo Complete.
</pre>
<p>Francesco Mecca </p>
</div>
<!--<div class="related">-->
<!--<related-posts />-->
<!--<h2>Related Posts</h2>-->
<!--<ul class="related-posts">-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2017/05/07/latestage_handbrake/">-->
<!--Late Stage Capitalism meets FOSS-->
<!--<small>07 May 2017</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2017/03/20/spazio-digitale-rant-facebook__eng/">-->
<!--Some shallow thoughts from my tiny virtual space-->
<!--<small>20 Mar 2017</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2017/03/07/spazio-digitale-rant-facebook/">-->
<!--Breve riflessione dal mio piccolo mondo virtuale-->
<!--<small>07 Mar 2017</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2016/11/15/machine-learning-PARTE3/">-->
<!--Capire il Machine Learning (parte 3)-->
<!--<small>15 Nov 2016</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2016/11/11/machine-learning-PARTE2/">-->
<!--Capire il Machine Learning (parte 2)-->
<!--<small>11 Nov 2016</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--</ul>-->
<!--</div>-->
<footer class="site-footer">
<span class="site-footer-owner"><a href="http://localhost:4000">Caught in the Net</a> is maintained by <a href="">Francesco Mecca</a>.</span>
<span> CC BY-SA 4.0 International.</br> </span>
<span class="site-footer-credits"><a href="https://jekyllrb.com">Jekyll</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
</footer>
</section>
</body>
</html>