francescomecca.eu/output/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/index.html

112 lines
6.3 KiB
HTML
Raw Normal View History

2018-11-10 18:19:00 +01:00
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>Script per il bulk download da Archive.org | Caught in the Net</title>
2018-11-10 18:32:04 +01:00
<link rel="stylesheet" href="../../../../../assets/blog/fonts/opensans.css">
<link href="../../../../../assets/blog/css/normalize.css" rel="stylesheet" type="text/css">
<link href="../../../../../assets/blog/css/cayman.css" rel="stylesheet" type="text/css">
2018-11-10 18:19:00 +01:00
<meta name="theme-color" content="#5670d4">
<meta name="generator" content="Nikola (getnikola.com)">
<link rel="alternate" type="application/rss+xml" title="RSS" hreflang="en" href="../../../../../rss.xml">
<link rel="canonical" href="francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/">
<meta name="author" content="Francesco Mecca">
<link rel="prev" href="../../13/lfbi-contro-la-crittografia/" title="L&amp;#8217;FBI contro la crittografia" type="text/html">
<link rel="next" href="../../../7/7/la-rivoluzione-digitale-nella-professione-dellavvocato/" title="La Rivoluzione Digitale nella Professione dell&amp;#8217;Avvocato" type="text/html">
<meta property="og:site_name" content="Caught in the Net">
<meta property="og:title" content="Script per il bulk download da Archive.org">
<meta property="og:url" content="francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/">
<meta property="og:description" content="In questi giorni mi e` capitato di dover scaricare varie collezioni da archive.org, una libreria digitale multimediale la cui missione e` laccesso universale a tutta la conoscenza.
Principalmente lo ">
<meta property="og:type" content="article">
<meta property="article:published_time" content="2015-06-30T13:39:00Z">
<meta property="article:tag" content="archive.org">
<meta property="article:tag" content="bulk download archive.org">
<meta property="article:tag" content="PesceWanda">
<meta property="article:tag" content="script">
</head>
<body>
<div id="container">
<section class="page-header"><h1 class="project-name">
Caught in the Net
</h1>
<h2 class="project-tagline">La rete ti cattura ma libera il pensiero</h2>
<a class="btn" href="../../../../../">Home</a>
<a class="btn" href="../../../../../pages/about/">About me</a>
<a class="btn" href="../../../../../pages/contattami/">Contact me</a>
<a class="btn" href="../../../../../archiveall.html">Archive</a>
<a class="btn" href="../../../../../rss.xml">RSS</a>
<a class="btn" href="http://francescomecca.eu/git/pesceWanda">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
2020-01-29 11:08:46 +01:00
<a class="btn" href="https://francescomecca.eu/git/pesceWanda/Curriculum_vitae/raw/master/latex.dir/cv.pdf">Curriculum</a>
2018-11-10 18:19:00 +01:00
</section><section class="main-content"><div class="post">
<header><h1 class="post-title">
<h1 class="p-name post-title" itemprop="headline name">Script per il bulk download da Archive.org</h1>
</h1>
</header><p class="dateline post-date">30 June 2015</p>
</div>
<div class="e-content entry-content" itemprop="articleBody text">
<div>
<p>In questi giorni mi e` capitato di dover scaricare varie collezioni da <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, una libreria digitale multimediale la cui missione e` laccesso universale a tutta la conoscenza.</p>
<p>Principalmente lo uso per scaricare tantissime registrazioni live di vari concerti registrati a mio avviso in maniera impeccabile.</p>
<p>Nel sito si trova una guida per scaricare in bulk usando wget e gli strumenti del sito, ma risulta piuttosto prolissa e complicata se si vuole fare un download al volo.</p>
<p>Questo e` lo script che uso, modificato da <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">questo</a> script: e` scritto in bash e funziona su tutte le distribuzioni sulle quali e` installato wget, tail e sed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Write here the extension of the file that you want to accept
#filetype =.flac
#append this to line 24
#-A .$filetype
#Write here the extension of the file that you want to reject, divided by a comma
fileremove = .null
if [ “$1” = “” ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q “http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv” -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed s/”//g &gt; processedidentifiers.txt
if [ “`cat processedidentifiers.txt | wc -l`” = “0” ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e -R $fileremove robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo Complete.
</pre>
<p>Francesco Mecca </p>
</div>
</div>
<aside class="postpromonav"><nav><h4>Categories</h4>
<ul itemprop="keywords" class="tags">
<li><a class="tag p-category" href="../../../../../categories/archiveorg/" rel="tag">archive.org</a></li>
<li><a class="tag p-category" href="../../../../../categories/bulk-download-archiveorg/" rel="tag">bulk download archive.org</a></li>
<li><a class="tag p-category" href="../../../../../categories/pescewanda/" rel="tag">PesceWanda</a></li>
<li><a class="tag p-category" href="../../../../../categories/script/" rel="tag">script</a></li>
</ul></nav></aside><p class="sourceline"><a href="index.md" class="sourcelink">Source</a></p>
<footer class="site-footer" id="footer"><span> CC BY-SA 4.0 International.<br></span>
2018-11-10 18:19:11 +01:00
<span class="site-footer-credits"><a href="https://getnikola.com">Nikola</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
2018-11-10 18:19:00 +01:00
</footer></section>
</div>
</body>
</html>