<!DOCTYPE html>
<html lang="en-us">
<head>
<meta charset="UTF-8">
<title>Caught in the Net</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<link rel="stylesheet" href="/css/normalize.css">
<!--<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>-->
<link rel="stylesheet" href="/fonts/opensans.css">
<link rel="stylesheet" href="/css/cayman.css">
</head>
<body>
<section class="page-header">
<h1 class="project-name">Caught in the Net</h1>
<h2 class="project-tagline">The net catches you but sets your thinking free</h2>
<a class="btn" href="/">Home</a>
<a class="btn" href="/about/">About me</a>
<a class="btn" href="/contattami/">Contact me</a>
<a class="btn" href="/archive/">Archive</a>
<a class="btn" href="/feed.xml">RSS</a>
<a class="btn" href="http://francescomecca.eu/git/pesceWanda">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
<a class="btn" href="/curriculum/CV_Mecca_Francesco.pdf">Curriculum</a>
</section>
<section class="main-content">
<div class="post">
<h1 class="post-title">A script for bulk downloading from Archive.org</h1>
<span class="post-date">30 Jun 2015</span>
<p>Lately I have needed to download several collections from <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, a multimedia digital library whose mission is universal access to all knowledge.</p>
<p>I mainly use it to download the many live concert recordings it hosts, which in my opinion are impeccably recorded.</p>
<p>The site offers a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if you just want a quick download.</p>
<p>This is the script I use, adapted from <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">this</a> script: it is written in bash and works on any distribution that has wget, tail and sed installed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Optional: extension of the files you want to accept (without the leading dot).
# To use it, uncomment the next line and add -A ".$filetype" to the final wget command.
#filetype=flac
# Extensions of the files you want to reject, separated by commas.
fileremove=.null
# The collection name is the only (mandatory) argument.
if [ -z "$1" ]; then
    echo "USAGE: archivedownload.sh collectionname"
    echo "See Archive.org entry page for the collection name."
    echo "Collection name must be entered exactly as shown: lower case, with hyphens."
    exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
# Drop the CSV header line and strip the double quotes around each identifier.
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt
# Count the identifiers once; an empty list usually means a wrong collection name.
count=$(wc -l &lt; processedidentifiers.txt)
if [ "$count" -eq 0 ]; then
    echo "No identifiers found for collection $1. Check name and try again."
    rm processedidentifiers.txt identifiers.txt
    exit 1
fi
echo "Beginning wget download of $count identifiers..."
# -e takes "robots=off" as its argument, so -R must come before it.
wget -r -H -nc -np -nH -nd -R "$fileremove" -e robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo "Complete."
</pre>
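<p>To make the processing step and the optional accept filter more concrete, here is a minimal sketch that runs without touching the network. The sample identifiers are made up, the <code>wget_args</code> array is a hypothetical helper that is not part of the script above, and the final command is only printed, never executed.</p>

```shell
#!/bin/bash
# Sketch only: the identifiers below are made up for illustration.
sample='"identifier"
"example-item-one"
"example-item-two"'

# Same pipeline as the script: skip the CSV header, strip the double quotes.
processed=$(printf '%s\n' "$sample" | tail -n +2 | sed 's/"//g')
printf '%s\n' "$processed"

# Optional accept filter: collect the wget arguments in an array so that
# -A is appended only when filetype is set. Nothing is downloaded here.
filetype=flac   # assumption: we only want FLAC files
wget_args=(-r -H -nc -np -nH -nd -e robots=off)
if [ -n "$filetype" ]; then
    wget_args+=(-A ".$filetype")
fi
printf '%s\n' "wget ${wget_args[*]}"
```

<p>Keeping the arguments in an array is just one way to append the accept filter conditionally; adding <code>-A</code> by hand to the final wget line works equally well.</p>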
<p>Francesco Mecca </p>
</div>
<!--<div class="related">-->
<!--<related-posts />-->
<!--<h2>Related Posts</h2>-->
<!--<ul class="related-posts">-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2018/10/24/eduhack_coventry/">-->
<!--eLearning in the age of Social Networks, the EduHack Platform-->
<!--<small>24 Oct 2018</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2018/07/27/dtldr/">-->
<!--Un articolo per r/italyinformatica-->
<!--<small>27 Jul 2018</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2018/03/27/addio-reddit/">-->
<!--Addio Reddit-->
<!--<small>27 Mar 2018</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2017/10/02/minidoxguide/">-->
<!--Minidox, a guide for the Europeans and the Scrooges-->
<!--<small>02 Oct 2017</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--<li>-->
<!--<h3>-->
<!--<a href="/pescewanda/2017/05/09/vaporwave/">-->
<!--Cyber-utopia and vaporwave-->
<!--<small>09 May 2017</small>-->
<!--</a>-->
<!--</h3>-->
<!--</li>-->
<!---->
<!--</ul>-->
<!--</div>-->
<footer class="site-footer">
<!--<span class="site-footer-owner"><a href="http://francescomecca.eu">Caught in the Net</a> is maintained by <a href="contattami">Francesco Mecca</a>.</span>-->
<span>CC BY-SA 4.0 International.<br> </span>
<span class="site-footer-credits"><a href="https://jekyllrb.com">Jekyll</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
</footer>
</section>
</body>
</html>