francescomecca.eu/output/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/index.html
Francesco Mecca 2558df9d00 assets dir
2018-11-10 18:32:04 +01:00

111 lines
6.3 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>Script per il bulk download da Archive.org | Caught in the Net</title>
<link rel="stylesheet" href="../../../../../assets/blog/fonts/opensans.css">
<link href="../../../../../assets/blog/css/normalize.css" rel="stylesheet" type="text/css">
<link href="../../../../../assets/blog/css/cayman.css" rel="stylesheet" type="text/css">
<meta name="theme-color" content="#5670d4">
<meta name="generator" content="Nikola (getnikola.com)">
<link rel="alternate" type="application/rss+xml" title="RSS" hreflang="en" href="../../../../../rss.xml">
<link rel="canonical" href="francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/">
<meta name="author" content="Francesco Mecca">
<link rel="prev" href="../../13/lfbi-contro-la-crittografia/" title="L&amp;#8217;FBI contro la crittografia" type="text/html">
<link rel="next" href="../../../7/7/la-rivoluzione-digitale-nella-professione-dellavvocato/" title="La Rivoluzione Digitale nella Professione dell&amp;#8217;Avvocato" type="text/html">
<meta property="og:site_name" content="Caught in the Net">
<meta property="og:title" content="Script per il bulk download da Archive.org">
<meta property="og:url" content="francescomecca.eu/blog/2015/6/30/script-per-il-bulk-download-da-archive-org/">
<meta property="og:description" content="In questi giorni mi e` capitato di dover scaricare varie collezioni da archive.org, una libreria digitale multimediale la cui missione e` laccesso universale a tutta la conoscenza.
Principalmente lo ">
<meta property="og:type" content="article">
<meta property="article:published_time" content="2015-06-30T13:39:00Z">
<meta property="article:tag" content="archive.org">
<meta property="article:tag" content="bulk download archive.org">
<meta property="article:tag" content="PesceWanda">
<meta property="article:tag" content="script">
</head>
<body>
<div id="container">
<section class="page-header"><h1 class="project-name">
Caught in the Net
</h1>
<h2 class="project-tagline">La rete ti cattura ma libera il pensiero</h2>
<a class="btn" href="../../../../../">Home</a>
<a class="btn" href="../../../../../pages/about/">About me</a>
<a class="btn" href="../../../../../pages/contattami/">Contact me</a>
<a class="btn" href="../../../../../archiveall.html">Archive</a>
<a class="btn" href="../../../../../rss.xml">RSS</a>
<a class="btn" href="http://francescomecca.eu/git/pesceWanda">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
<a class="btn" href="../../../../../wp-content/curriculum/CV_Mecca_Francesco.pdf">Curriculum</a>
</section><section class="main-content"><div class="post">
<header><h1 class="post-title">
<h1 class="p-name post-title" itemprop="headline name">Script per il bulk download da Archive.org</h1>
</h1>
</header><p class="dateline post-date">30 June 2015</p>
</div>
<div class="e-content entry-content" itemprop="articleBody text">
<div>
<p>In questi giorni mi e` capitato di dover scaricare varie collezioni da <a href="https://en.wikipedia.org/wiki/Internet_Archive">archive.org</a>, una libreria digitale multimediale la cui missione e` laccesso universale a tutta la conoscenza.</p>
<p>Principalmente lo uso per scaricare tantissime registrazioni live di vari concerti registrati a mio avviso in maniera impeccabile.</p>
<p>Nel sito si trova una guida per scaricare in bulk usando wget e gli strumenti del sito, ma risulta piuttosto prolissa e complicata se si vuole fare un download al volo.</p>
<p>Questo e` lo script che uso, modificato da <a href="https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh">questo</a> script: e` scritto in bash e funziona su tutte le distribuzioni sulle quali e` installato wget, tail e sed.</p>
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Write here the extension of the file that you want to accept
#filetype =.flac
#append this to line 24
#-A .$filetype
#Write here the extension of the file that you want to reject, divided by a comma
fileremove = .null
if [ “$1” = “” ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q “http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv” -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed s/”//g &gt; processedidentifiers.txt
if [ “`cat processedidentifiers.txt | wc -l`” = “0” ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e -R $fileremove robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo Complete.
</pre>
<p>Francesco Mecca </p>
</div>
</div>
<aside class="postpromonav"><nav><h4>Categories</h4>
<ul itemprop="keywords" class="tags">
<li><a class="tag p-category" href="../../../../../categories/archiveorg/" rel="tag">archive.org</a></li>
<li><a class="tag p-category" href="../../../../../categories/bulk-download-archiveorg/" rel="tag">bulk download archive.org</a></li>
<li><a class="tag p-category" href="../../../../../categories/pescewanda/" rel="tag">PesceWanda</a></li>
<li><a class="tag p-category" href="../../../../../categories/script/" rel="tag">script</a></li>
</ul></nav></aside><p class="sourceline"><a href="index.md" class="sourcelink">Source</a></p>
<footer class="site-footer" id="footer"><span> CC BY-SA 4.0 International.<br></span>
<span class="site-footer-credits"><a href="https://getnikola.com">Nikola</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
</footer></section>
</div>
</body>
</html>