francescomecca.eu/_posts/2015-06-30-script-per-il-bulk-download-da-archive-org.md
2017-03-20 00:46:14 +01:00

60 lines
2.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
id: 9
title: Script per il bulk download da Archive.org
date: 2015-06-30T13:39:00+00:00
author: pesceWanda
layout: post
guid: https://provaprova456789.wordpress.com/2015/06/30/script-per-il-bulk-download-da-archive-org
permalink: /index.php/archives/9
blogger_blog:
- caught-in-thenet.blogspot.com
blogger_author:
- pescedinomewanda
blogger_a4a2016d1c35883202d5ddff9b0ea4ff_permalink:
- 3355260023671997284
categories:
- PesceWanda
tags:
- archive.org
- bulk download archive.org
- script
---
In questi giorni mi e\` capitato di dover scaricare varie collezioni da [archive.org](https://en.wikipedia.org/wiki/Internet_Archive), una libreria digitale multimediale la cui missione e\` l’accesso universale a tutta la conoscenza.
Principalmente lo uso per scaricare tantissime registrazioni live di vari concerti registrati a mio avviso in maniera impeccabile.
Nel sito si trova una guida per scaricare in bulk usando wget e gli strumenti del sito, ma risulta piuttosto prolissa e complicata se si vuole fare un download al volo.
Questo e\` lo script che uso, modificato da [questo](https://github.com/ghalfacree/bash-scripts/blob/master/archivedownload.sh) script: e\` scritto in bash e funziona su tutte le distribuzioni sulle quali e\` installato wget, tail e sed.
<pre class="wp-code-highlight prettyprint linenums:1">#!/bin/bash
# Write here the extension of the file that you want to accept
#filetype =.flac
#append this to line 24
#-A .$filetype
#Write here the extension of the file that you want to reject, divided by a comma
fileremove = .null
if [ “$1” = “” ]; then
echo USAGE: archivedownload.sh collectionname
echo See Archive.org entry page for the collection name.
echo Collection name must be entered exactly as shown: lower case, with hyphens.
exit
fi
echo Downloading list of entries for collection name $1…
wget -nd -q “http://archive.org/advancedsearch.php?q=collection%3A$1&amp;fl%5B%5D=identifier&amp;sort%5B%5D=identifier+asc&amp;sort%5B%5D=&amp;sort%5B%5D=&amp;rows=9999&amp;page=1&amp;callback=callback&amp;save=yes&amp;output=csv” -O identifiers.txt
echo Processing entry list for wget parsing…
tail -n +2 identifiers.txt | sed s/”//g &gt; processedidentifiers.txt
if [ “`cat processedidentifiers.txt | wc -l`” = “0” ]; then
echo No identifiers found for collection $1. Check name and try again.
rm processedidentifiers.txt identifiers.txt
exit
fi
echo Beginning wget download of `cat processedidentifiers.txt | wc -l` identifiers…
wget -r -H -nc -np -nH -nd -e -R $fileremove robots=off -i processedidentifiers.txt -B http://archive.org/download/
rm identifiers.txt processedidentifiers.txt
echo Complete.
</pre>
Francesco Mecca