<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>Experiment part 1 | Caught in the Net</title>
<link rel="stylesheet" href="../../../../../assets/blog/fonts/opensans.css">
<link href="../../../../../assets/blog/css/normalize.css" rel="stylesheet" type="text/css">
<link href="../../../../../assets/blog/css/cayman.css" rel="stylesheet" type="text/css">
<meta name="theme-color" content="#5670d4">
<meta name="generator" content="Nikola (getnikola.com)">
<link rel="alternate" type="application/rss+xml" title="RSS" hreflang="en" href="../../../../../rss.xml">
<link rel="canonical" href="francescomecca.eu/blog/2018/15/11/programming-session/">
<meta name="author" content="Francesco Mecca">
<link rel="prev" href="../../../10/2/eduhack-coventry/" title="eLearning in the age of Social Networks, the EduHack Platform" type="text/html">
<link rel="next" href="../../../../2019/03/06/Dconf-2019/" title="Dconf 2019" type="text/html">
<meta property="og:site_name" content="Caught in the Net">
<meta property="og:title" content="Experiment part 1">
<meta property="og:url" content="francescomecca.eu/blog/2018/15/11/programming-session/">
<meta property="og:description" content="This is an experimental blogpost with the goal of improving myself on the design of system programs.
The following words are a flow that tries to recapture the process that I follow when I program.
17">
<meta property="og:type" content="article">
<meta property="article:published_time" content="2018-11-15T00:00:00Z">
<meta property="article:tag" content="dlang">
<meta property="article:tag" content="programming">
<meta property="article:tag" content="web">
</head>
<body>
<div id="container">

<section class="page-header"><h1 class="project-name">
Caught in the Net
</h1>
<h2 class="project-tagline">La rete ti cattura ma libera il pensiero</h2>
<a class="btn" href="../../../../../">Home</a>
<a class="btn" href="../../../../../pages/about/">About me</a>
<a class="btn" href="../../../../../pages/contattami/">Contact me</a>
<a class="btn" href="../../../../../archiveall.html">Archive</a>
<a class="btn" href="../../../../../rss.xml">RSS</a>
<a class="btn" href="http://francescomecca.eu/git/pesceWanda">Personal Git</a>
<a class="btn" href="https://github.com/FraMecca">Github</a>
<a class="btn" href="../../../../../wp-content/curriculum/CV_Mecca_Francesco.pdf">Curriculum</a>
</section><section class="main-content"><div class="post">

<header><h1 class="post-title">

<h1 class="p-name post-title" itemprop="headline name">Experiment part 1</h1>

</h1>
</header><p class="dateline post-date">15 November 2018</p>
</div>



<div class="e-content entry-content" itemprop="articleBody text">
<div>
<p>This is an experimental blog post whose goal is to get better at designing systems programs.</p>
<p>What follows is a stream of notes that tries to capture the process I follow when I program.</p>
<h2>17.11</h2>
<p>This is the <a href="https://www.discogs.com/Various-Psychedelic-Underground-13/release/1251814">tracklist</a>:</p>
<ol>
<li>Waniyetula - Lindis Farne</li>
<li>Missus Beastly - Steel's Electric</li>
<li>Vikings Invasion - Shadow Boogie</li>
<li>Live - Fly Like A Bird</li>
<li>Arktis - Speed Boogie</li>
<li>Level π - Hubble's Dream - Dream Without End</li>
<li>Emma Myldenberger - RAA</li>
<li>Skyline - Beautiful Lady</li>
<li>Guru Guru - Oxymoron</li>
</ol>
<h2>The problem</h2>
<p>Yesterday I got an insider tip that http://isole.ecn.org/ might be taken down, and I remembered that I had an old archive of the entire site.
I am an amateur data hoarder, so I keep archives and backups of pretty much everything I find amusing.</p>
<p>I wanted to update the archive, so I used httrack.
I ran into the following problems:</p>
<ol>
<li>it is too slow (it doesn't even use 1/20 of my connection)</li>
<li>it doesn't understand that if I want to mirror ecn.org I may want isole.ecn.org as well</li>
<li>if I configure httrack to archive everything, it eventually hits external websites (like google.com) and starts archiving those as well</li>
<li>I can't configure it to skip every external link, because I may need some of them</li>
</ol>
<p>A perfect scraper for me would:</p>
<ol>
<li>record a list of every external website it hits, so that I can archive them later as well</li>
<li>allow me to scrape subdomains as well</li>
<li>let me configure the level of archiving for external websites</li>
</ol>
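<p>The three wishes above can be sketched as a small decision function. This is only a rough sketch in Python, not how httrack works: the names <code>classify</code>, <code>plan</code>, <code>seen_external</code> and <code>max_external_depth</code> are hypothetical, and I am assuming that the "level of archiving" means how many links away from the mirrored site the scraper is allowed to go.</p>

```python
from urllib.parse import urlsplit


def classify(url, root="ecn.org"):
    """Return 'internal', 'subdomain' or 'external' for url relative to root."""
    host = (urlsplit(url).hostname or "").lower()
    if not host or host == root:
        # Links without a host are relative, so they stay on the mirrored site.
        return "internal"
    if host.endswith("." + root):
        return "subdomain"
    return "external"


def plan(url, external_depth, seen_external, root="ecn.org", max_external_depth=1):
    """Decide whether to fetch url.

    external_depth is how many links away from the mirrored site the
    current page already is (0 while still on the site itself).
    Returns (fetch, depth_of_the_linked_page).
    """
    if classify(url, root) != "external":
        # The root domain and its subdomains are always mirrored.
        return True, 0
    # Remember every external host, so it can be archived later.
    seen_external.add(urlsplit(url).hostname)
    next_depth = external_depth + 1
    return max_external_depth >= next_depth, next_depth
```

<p>With the defaults, a page on isole.ecn.org is always fetched, a page on google.com linked from the site is fetched once, and anything google.com links to is only recorded in <code>seen_external</code>.</p>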
<h2>The approach</h2>
<p>Right now httrack already does a perfect job of storing webpages with a coherent folder layout and offline browsing capabilities.
Let's see how it does that.</p>
<p>On the original page I can find:
<code>&lt;link rel='stylesheet' id='avia-google-webfont' href='//fonts.googleapis.com/css?family=Lato:300,400,700' type='text/css' media='all'/&gt;</code></p>
<p>When httrack saves it, the link becomes:
<code>&lt;link rel='stylesheet' id='avia-google-webfont' href='../fonts.googleapis.com/css8f84.css?family=Lato:300,400,700' type='text/css' media='all'/&gt;</code></p>
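<p>The shape of that rewrite (one folder per host, a short suffix for the query string, a path relative to the saved page) can be approximated in a few lines of Python. This is a guess at the general idea and not httrack's real naming algorithm: <code>local_path</code> and <code>rewrite</code> are hypothetical helpers, the MD5-based suffix is my own stand-in for the <code>8f84</code> part, and the sketch drops the query string from the rewritten link, which httrack keeps.</p>

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def local_path(url, ext=".html"):
    """Map an absolute or scheme-relative URL to a path inside the mirror.

    ext is used only when the URL path has no extension of its own; a real
    scraper would probably derive it from the Content-Type header.
    """
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index" + ext
    stem, own_ext = posixpath.splitext(path)
    if parts.query:
        # Assumption: a short digest of the query string keeps URLs that
        # differ only by their query apart on disk.
        stem += hashlib.md5(parts.query.encode()).hexdigest()[:4]
    return parts.hostname + stem + (own_ext or ext)


def rewrite(link, page_url, ext=".html"):
    """Return link as a path relative to the saved copy of page_url."""
    target = "/" + local_path(link, ext)
    page_dir = posixpath.dirname("/" + local_path(page_url))
    return posixpath.relpath(target, start=page_dir)
```

<p>Calling <code>rewrite("//fonts.googleapis.com/css?family=Lato:300,400,700", "http://isole.ecn.org/", ext=".css")</code> gives a path that starts with <code>../fonts.googleapis.com/css</code> and ends with <code>.css</code>, the same layout that httrack produced above.</p>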
</div>
</div>
<aside class="postpromonav"><nav><h4>Categories</h4>

<ul itemprop="keywords" class="tags">
<li><a class="tag p-category" href="../../../../../categories/dlang/" rel="tag">dlang</a></li>
<li><a class="tag p-category" href="../../../../../categories/programming/" rel="tag">programming</a></li>
<li><a class="tag p-category" href="../../../../../categories/web/" rel="tag">web</a></li>
</ul></nav></aside><p class="sourceline"><a href="index.md" class="sourcelink">Source</a></p>



<footer class="site-footer" id="footer"><span> CC BY-SA 4.0 International.<br></span>
<span class="site-footer-credits"><a href="https://getnikola.com">Nikola</a>, <a href="https://github.com/jasonlong/cayman-theme">Cayman theme</a>.</span>
</footer></section>
</div>
</body>
</html>