francescomecca.eu/_site/index.php/archives/9.html
PesceWanda 6249a5dfc9 blog
2016-05-01 11:13:57 +02:00

<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="http://gmpg.org/xfn/11" rel="profile">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<!-- Enable responsiveness on mobile devices-->
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<title>
A script for bulk downloads from Archive.org &middot; Caught in the Net
</title>
<!-- CSS -->
<link rel="stylesheet" href="/public/css/poole.css">
<link rel="stylesheet" href="/public/css/syntax.css">
<link rel="stylesheet" href="/public/css/hyde.css">
<!-- Icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="/public/apple-touch-icon-144-precomposed.png">
<link rel="shortcut icon" href="/public/favicon.ico">
<!-- RSS -->
<link rel="alternate" type="application/rss+xml" title="RSS" href="/atom.xml">
</head>
<body class="theme-base-09">
<div class="sidebar">
<div class="container sidebar-sticky">
<div class="sidebar-about">
<h1>
<a href="/">
Caught in the Net
</a>
</h1>
<p class="lead"></p>
</div>
<nav class="sidebar-nav">
<a class="sidebar-nav-item" href="/">Home</a>
<a class="sidebar-nav-item" href="/about/">About</a>
<a class="sidebar-nav-item" href="/archive/">Archive</a>
<a class="sidebar-nav-item" href="/contattami/">Contact me</a>
<a class="sidebar-nav-item" href="/atom.xml">RSS</a>
<a class="sidebar-nav-item" href="http://francescomecca.eu:3000">Personal Git</a>
<a class="sidebar-nav-item" href="https://github.com/s211897-studentipolito">github</a>
<span class="sidebar-nav-item" href="" >Powered by Jekyll and Hyde</span>
</nav>
<p>&copy; 2016. CC BY-SA 4.0 International </p>
</div>
</div>
<h3 class="masthead-title">
<a href="/" title="Home">Caught in the Net</a>
</h3>
<div class="content container">
<div class="post">
<h1 class="post-title">A script for bulk downloads from Archive.org</h1>
<span class="post-date">30 Jun 2015</span>
<p>Over the past few days I needed to download several collections from <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, a multimedia digital library whose mission is universal access to all knowledge.</p>
<p>I mainly use it to download lots of live concert recordings, which in my opinion are recorded impeccably.</p>
<p>The site offers a guide to bulk downloading with wget and the site's own tools, but it is rather long-winded and complicated if all you want is a quick download.</p>
<p>This is the script I use, adapted from <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">this</a> script: it is written in bash and works on any distribution that has wget, tail and sed installed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Extension of the files you want to accept (optional)
#filetype=.flac
# To use it, append this to the wget command on line 24:
#-A "$filetype"
# Extensions of the files you want to reject, separated by commas
fileremove=.null
if [ "$1" = "" ]; then
echo "USAGE: archivedownload.sh collectionname"
echo "See Archive.org entry page for the collection name."
echo "Collection name must be entered exactly as shown: lower case, with hyphens."
exit 1
fi
echo "Downloading list of entries for collection name $1..."
wget -nd -q "http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv" -O identifiers.txt
echo "Processing entry list for wget parsing..."
tail -n +2 identifiers.txt | sed 's/"//g' &gt; processedidentifiers.txt
if [ "$(wc -l &lt; processedidentifiers.txt)" -eq 0 ]; then
echo "No identifiers found for collection $1. Check name and try again."
rm processedidentifiers.txt identifiers.txt
exit 1
fi
echo "Beginning wget download of $(wc -l &lt; processedidentifiers.txt) identifiers..."
wget -r -H -nc -np -nH -nd -R "$fileremove" -e robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo "Complete."
</pre>
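<p>To see what the tail and sed step does without touching the network, here is a minimal sketch. It assumes the search returns a CSV with one header line followed by one quoted identifier per line, which is what the script's processing implies; the identifiers below are made up for illustration.</p>

```shell
#!/bin/bash
# Hypothetical sample of the CSV saved as identifiers.txt:
# a header line, then one quoted identifier per line (made-up names).
sample_csv='"identifier"
"example-show-2001-01-01"
"example-show-2001-01-02"'

# Same pipeline as the script: skip the header line, strip the double quotes.
identifiers=$(printf '%s\n' "$sample_csv" | tail -n +2 | sed 's/"//g')

# Show the address each entry is meant to resolve to once wget
# combines it with the base URL given via -B.
for id in $identifiers; do
    echo "http://archive.org/download/$id"
done
```

<p>Each line left after this step is a bare identifier, which is why the script hands the file to wget with -i and supplies the download prefix separately with -B.</p>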
<p>Francesco Mecca</p>
</div>
<div class="related">
<h2>Related Posts</h2>
<ul class="related-posts">
<li>
<h3>
<a href="/pescewanda/2016/04/17/kpd-player/">
Kyuss Music Player
<small>17 Apr 2016</small>
</a>
</h3>
</li>
<li>
<h3>
<a href="/pescewanda/2016/04/10/short-lesson-from-reddit/">
Bright Father
<small>10 Apr 2016</small>
</a>
</h3>
</li>
<li>
<h3>
<a href="/pescewanda/2016/04/10/lifehacks/">
Lifehacks
<small>10 Apr 2016</small>
</a>
</h3>
</li>
</ul>
</div>
</div>
</body>
</html>