<!--
.. title: Experiment part 1
.. slug: programming-session
.. date: 2018-11-15
.. tags: programming,dlang,web
.. category: PesceWanda
.. link:
.. description:
.. type: text
-->

This is an experimental blog post; the goal is to improve my skills at designing system programs.

The following words are a stream that tries to recapture the process I follow when I program.

# 17.11

This is the [tracklist](https://www.discogs.com/Various-Psychedelic-Underground-13/release/1251814):

1. Waniyetula - Lindis Farne
2. Missus Beastly - Steel's Electric
3. Vikings Invasion - Shadow Boogie
4. Live - Fly Like A Bird
5. Arktis - Speed Boogie
6. Level π - Hubble's Dream - Dream Without End
7. Emma Myldenberger - RAA
8. Skyline - Beautiful Lady
9. Guru Guru - Oxymoron

# The problem

Yesterday I got an insider tip that http://isole.ecn.org/ might be taken down, and I remembered that I had an old archive of the entire site.
I am an amateur data hoarder, so I keep archives/backups of pretty much everything I find amusing.

I wanted to update the archive, so I used httrack.
I ran into the following problems:

1. It is too slow (it doesn't even use 1/20 of my connection).
2. It doesn't understand that if I want to mirror ecn.org I probably want isole.ecn.org as well.
3. If I configure httrack to archive basically everything, it eventually hits some external website (like google.com) and starts archiving that too.
4. I can't configure it to skip every external link, because I may need some of them.

A perfect scraper, for me, would:

1. record a list of every external website it hits, so that I can archive those later as well;
2. let me scrape subdomains too;
3. let me configure how deeply external websites are archived.

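The first two requirements boil down to a link-classification rule: anything on the root domain or one of its subdomains is internal and gets crawled, everything else is skipped but remembered for a later run. A minimal sketch in Python (the `ROOT` constant and `should_follow` helper are my own names, not part of any existing tool):

```python
from urllib.parse import urlparse

ROOT = "ecn.org"  # hypothetical root domain of the mirror

external_hosts = set()  # requirement 1: external sites recorded for later

def should_follow(url: str) -> bool:
    """Requirement 2: treat the root domain and any of its subdomains
    (isole.ecn.org, www.ecn.org, ...) as internal; everything else is
    recorded as external and not followed."""
    host = urlparse(url).hostname or ""
    if host == ROOT or host.endswith("." + ROOT):
        return True
    external_hosts.add(host)
    return False

print(should_follow("http://isole.ecn.org/some/page"))  # True: subdomain
print(should_follow("https://google.com/search?q=x"))   # False: recorded
print(sorted(external_hosts))                           # ['google.com']
```

The `endswith("." + ROOT)` check matters: a plain `endswith(ROOT)` would wrongly accept a host like `notecn.org`. The third requirement would then be a per-host depth counter consulted before `should_follow` returns `False`.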
# The approach

Right now httrack does a perfect job of storing the webpages with a coherent folder layout and offline browsing capabilities.
Let's see how it does that.

On the original page I find:

`<link rel='stylesheet' id='avia-google-webfont' href='//fonts.googleapis.com/css?family=Lato:300,400,700' type='text/css' media='all'/>`

When httrack saves it, the link becomes:

`<link rel='stylesheet' id='avia-google-webfont' href='../fonts.googleapis.com/css8f84.css?family=Lato:300,400,700' type='text/css' media='all'/>`
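So the host becomes a directory, the path becomes a relative link from the saved page, and the `8f84` suffix looks like a short hash of the query string, so that `css?family=A` and `css?family=B` land in different local files. A sketch of that mapping (this is my guess at the scheme from the observed output; `local_path` is a hypothetical helper and the digest will not match httrack's actual `8f84`):

```python
import hashlib
from urllib.parse import urlsplit

def local_path(url: str, ext: str, depth: int = 1) -> str:
    """Map a remote URL to a local mirror path, httrack-style.

    Assumptions: the host becomes a directory; a query string is folded
    into the file name as a short stable digest; the extension comes
    from the served Content-Type (passed in here as `ext`)."""
    parts = urlsplit(url if "://" in url else "http:" + url)
    name = parts.path.lstrip("/") or "index"
    if parts.query:
        # disambiguate query-string variants of the same path
        name += hashlib.md5(parts.query.encode()).hexdigest()[:4]
    if not name.endswith("." + ext):
        name += "." + ext
    # climb out of the saved page's directory, then into the host's
    return "../" * depth + parts.netloc + "/" + name

print(local_path("//fonts.googleapis.com/css?family=Lato:300,400,700", "css"))
# e.g. ../fonts.googleapis.com/css<digest>.css
```

Any stable short digest works for a mirror; the important property is that the same URL always maps to the same file, so every page that references it rewrites to the same relative link.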