francescomecca.eu/_site/index.php/archives/9.html

209 lines
5.6 KiB
HTML
Raw Normal View History

2016-05-01 11:13:57 +02:00
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="http://gmpg.org/xfn/11" rel="profile">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<!-- Enable responsiveness on mobile devices-->
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<title>
Script per il bulk download da Archive.org &middot; Caught in the Net
</title>
<!-- CSS -->
<link rel="stylesheet" href="/public/css/poole.css">
<link rel="stylesheet" href="/public/css/syntax.css">
<link rel="stylesheet" href="/public/css/hyde.css">
<!-- Icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="/public/apple-touch-icon-144-precomposed.png">
<link rel="shortcut icon" href="/public/favicon.ico">
<!-- RSS -->
<link rel="alternate" type="application/rss+xml" title="RSS" href="/atom.xml">
</head>
<body class="theme-base-09">
<div class="sidebar">
<div class="container sidebar-sticky">
<div class="sidebar-about">
<h1>
<a href="/">
Caught in the Net
</a>
</h1>
<p class="lead"></p>
</div>
<nav class="sidebar-nav">
<a class="sidebar-nav-item" href="/">Home</a>
<a class="sidebar-nav-item" href="/about/">About</a>
<a class="sidebar-nav-item" href="/archive/">Archive</a>
<a class="sidebar-nav-item" href="/contattami/">Contattami</a>
<a class="sidebar-nav-item" href="/atom.xml">RSS</a>
<a class="sidebar-nav-item" href="http://francescomecca.eu:3000">Personal Git</a>
<a cleass="sidebar-nav-item" href="https://github.com/s211897-studentipolito">github</a>
<span class="sidebar-nav-item" href="" >Powered by Jekyll and Hyde</span>
</nav>
<p>&copy; 2016. CC BY-SA 4.0 International </p>
</div>
</div>
<h3 class="masthead-title">
<a href="/" title="Home">Caught in the Net</a>
</h3>
<div class="content container">
<div class="post">
<h1 class="post-title">Script per il bulk download da Archive.org</h1>
<span class="post-date">30 Jun 2015</span>
<p>In questi giorni mi e` capitato di dover scaricare varie collezioni da <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, una libreria digitale multimediale la cui missione e` l&#8217;accesso universale a tutta la conoscenza.</p>
<p>Principalmente lo uso per scaricare tantissime registrazioni live di vari concerti registrati a mio avviso in maniera impeccabile.</p>
<p>Nel sito si trova una guida per scaricare in bulk usando wget e gli strumenti del sito, ma risulta piuttosto prolissa e complicata se si vuole fare un download al volo.</p>
<p>Questo e` lo script che uso, modificato da <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">questo</a> script: e` scritto in bash e funziona su tutte le distribuzioni sulle quali e` installato wget, tail e sed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Write here the extension of the file that you want to accept
#filetype =.flac
#append this to line 24
#-A .$filetype
#Write here the extension of the file that you want to reject, divided by a comma
fileremove = .null
if [ “$1” = “” ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q “http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv” -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed s/”//g &gt; processedidentifiers.txt
if [ “`cat processedidentifiers.txt | wc -l`” = “0” ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e -R $fileremove robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo Complete.
</pre>
<p>Francesco Mecca</p>
</div>
<div class="related">
<h2>Related Posts</h2>
<ul class="related-posts">
<li>
<h3>
2016-05-03 14:40:37 +02:00
<a href="/pescewanda/2016/04/17/wright-nakamoto/">
#JeSuisSatoshiNakamoto
2016-05-01 11:13:57 +02:00
<small>17 Apr 2016</small>
</a>
</h3>
</li>
<li>
<h3>
2016-05-03 14:40:37 +02:00
<a href="/pescewanda/2016/04/17/kpd-player/">
Kyuss Music Player
<small>17 Apr 2016</small>
2016-05-01 11:13:57 +02:00
</a>
</h3>
</li>
<li>
<h3>
2016-05-03 14:40:37 +02:00
<a href="/pescewanda/2016/04/10/short-lesson-from-reddit/">
Bright Father
2016-05-01 11:13:57 +02:00
<small>10 Apr 2016</small>
</a>
</h3>
</li>
</ul>
</div>
</div>
</body>
</html>