<p>This is an experimental blog post with the goal of getting better at designing systems programs.</p>
<p>What follows is a stream of thoughts that tries to recapture the process I follow when I program.</p>
<p>This is the <a href="">tracklist</a>:</p>
<ol>
<li>Waniyetula - Lindis Farne</li>
<li>Missus Beastly - Steel's Electric</li>
<li>Vikings Invasion - Shadow Boogie</li>
<li>Live - Fly Like A Bird</li>
<li>Arktis - Speed Boogie</li>
<li>Level π - Hubble's Dream - Dream Without End</li>
<li>Emma Myldenberger - RAA</li>
<li>Skyline - Beautiful Lady</li>
<li>Guru Guru - Oxymoron</li>
</ol>
<h2>The problem</h2>
<p>Yesterday I got an insider tip that a site might be taken down, and I remembered that I had an old archive of the entire site.
I am an amateur data hoarder, so I keep archives and backups of pretty much everything I find amusing.</p>
<p>I wanted to update the archive, so I used httrack.
I ran into the following problems:</p>
<ul>
<li>it is too slow (it doesn't even use 1/20 of my connection)</li>
<li>it doesn't understand that if I want to mirror a domain I may want its subdomains as well</li>
<li>if I configure httrack to archive everything, I end up hitting some external websites and it starts archiving those as well</li>
<li>I can't configure it to avoid every external link, because I may need some of them</li>
</ul>
<p>A perfect scraper for me would:</p>
<ol>
<li>record a list of every external website it hits, so that I can archive them later as well</li>
<li>allow me to scrape subdomains as well</li>
<li>let me configure a level of archiving for external websites</li>
</ol>
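<p>To make these three requirements concrete, here is a minimal sketch of the decision a scraper could make for every link it finds. Everything in it is an assumption for illustration: the names <code>classify</code>, <code>CrawlPolicy</code> and <code>external_hops</code> are hypothetical, the hosts are placeholders, and "level of archiving" is interpreted as the number of link hops allowed outside the root domain. It is not how httrack works.</p>

```python
from urllib.parse import urljoin, urlsplit


def classify(url, root_host):
    """Return 'internal', 'subdomain' or 'external' for an absolute URL."""
    host = (urlsplit(url).hostname or "").lower()
    root = root_host.lower()
    if host == root:
        return "internal"
    # The leading dot keeps "notexample.org" from matching "example.org".
    if host.endswith("." + root):
        return "subdomain"
    return "external"


class CrawlPolicy:
    """Hypothetical policy object: decides whether a link should be fetched."""

    def __init__(self, root_host, follow_subdomains=True, external_depth=0):
        self.root_host = root_host
        self.follow_subdomains = follow_subdomains
        # Assumption: number of hops the scraper may take outside the root domain.
        self.external_depth = external_depth
        # Requirement 1: remember every external host we came across.
        self.external_hosts = set()

    def should_fetch(self, page_url, href, external_hops=0):
        """external_hops: external pages already traversed to reach page_url."""
        url = urljoin(page_url, href)  # resolves relative and //host/path links
        kind = classify(url, self.root_host)
        if kind == "internal":
            return True
        if kind == "subdomain":
            return self.follow_subdomains  # requirement 2
        host = urlsplit(url).hostname
        if host:
            self.external_hosts.add(host)
        return external_hops < self.external_depth  # requirement 3
```

<p>With <code>external_depth=0</code> the scraper never leaves the root domain and its subdomains, but <code>external_hosts</code> still ends up holding the list of sites to archive later.</p>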
<h2>The approach</h2>
<p>Right now httrack does a perfect job of storing the webpages with a coherent folder layout and offline browsing capabilities.
Let's see how it does that.</p>
<p>On the original page I found:
<code>&lt;link rel='stylesheet' id='avia-google-webfont' href='//,400,700' type='text/css' media='all'/&gt;</code></p>
<p>When httrack saves it:
<code>&lt;link rel='stylesheet' id='avia-google-webfont' href='../,400,700' type='text/css' media='all'/&gt;</code></p>
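<p>The change from <code>//</code> to <code>../</code> is what you would expect if every host gets its own folder inside the mirror and links are rewritten relative to the page that contains them. Here is a minimal sketch of that idea. It assumes a <code>host/path</code> folder layout and ignores query strings; the helpers <code>local_path</code> and <code>rewrite_link</code> and the hosts are hypothetical, and this is not httrack's actual algorithm.</p>

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Map an absolute URL to a path inside the mirror folder (host/path).

    Assumption: the query string is ignored and directory URLs are saved
    as index.html.
    """
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"
    return posixpath.join(parts.hostname, path.lstrip("/"))


def rewrite_link(page_url, href):
    """Return href rewritten as a path relative to the saved copy of page_url."""
    target = urljoin(page_url, href)  # also resolves //host/path links
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target), start=page_dir)


# A protocol-relative link on the front page points one folder up,
# into the folder of the other host.
print(rewrite_link("https://site.example/", "//fonts.example/css"))
# A root-relative link on a nested page stays inside the same host folder.
print(rewrite_link("https://site.example/blog/post.html", "/style.css"))
```

<p>Because the rewritten links are relative, the mirror folder can be moved around or opened straight from disk without breaking them.</p>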