<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Caught in the Net (Posts about web)</title><link>francescomecca.eu</link><description></description><atom:link href="francescomecca.eu/categories/web.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2019 &lt;a href="mailto:francescomecca.eu"&gt;Francesco Mecca&lt;/a&gt; </copyright><lastBuildDate>Sun, 24 Mar 2019 14:38:24 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Experiment part 1</title><link>francescomecca.eu/blog/2018/15/11/programming-session/</link><dc:creator>Francesco Mecca</dc:creator><description>&lt;div&gt;&lt;p&gt;This is an experimental blog post whose goal is to help me get better at designing system programs.&lt;/p&gt;
&lt;p&gt;What follows is a stream of notes that tries to capture the process I follow when I program.&lt;/p&gt;
&lt;h2&gt;17.11&lt;/h2&gt;
&lt;p&gt;This is the &lt;a href="https://www.discogs.com/Various-Psychedelic-Underground-13/release/1251814"&gt;tracklist&lt;/a&gt;:&lt;/p&gt;
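&lt;ol&gt;
&lt;li&gt;Waniyetula - Lindis Farne&lt;/li&gt;
&lt;li&gt;Missus Beastly - Steel's Electric&lt;/li&gt;
&lt;li&gt;Vikings Invasion - Shadow Boogie&lt;/li&gt;
&lt;li&gt;Live - Fly Like A Bird&lt;/li&gt;
&lt;li&gt;Arktis - Speed Boogie&lt;/li&gt;
&lt;li&gt;Level π - Hubble's Dream - Dream Without End&lt;/li&gt;
&lt;li&gt;Emma Myldenberger - RAA&lt;/li&gt;
&lt;li&gt;Skyline - Beautiful Lady&lt;/li&gt;
&lt;li&gt;Guru Guru - Oxymoron&lt;/li&gt;
&lt;/ol&gt;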
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;Yesterday I got an insider tip that http://isole.ecn.org/ might be taken down, and I remembered that I had an old archive of the entire site.
I am an amateur data hoarder, so I keep archives and backups of pretty much everything I find amusing.&lt;/p&gt;
&lt;p&gt;I wanted to update the archive so I used httrack.
I found the following problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;too slow (doesn't even use 1/20 of my connection)&lt;/li&gt;
&lt;li&gt;doesn't understand that if I want to mirror ecn.org I may want isole.ecn.org as well&lt;/li&gt;
&lt;li&gt;if I configure httrack to archive everything, it follows links to external websites (like google.com) and starts archiving those as well&lt;/li&gt;
&lt;li&gt;I can't simply configure it to skip every external link, because I may need some of them&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A perfect scraper for me would:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;record a list of every external website it hits, so that I can archive them later as well&lt;/li&gt;
&lt;li&gt;allow me to scrape subdomains as well&lt;/li&gt;
&lt;li&gt;let me configure a level of archiving for external websites&lt;/li&gt;
&lt;/ol&gt;
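&lt;p&gt;As a rough sketch, the three wishes above could be expressed as a small link-filtering policy. This is only an illustration in Python, not how httrack works: the names CrawlPolicy, should_fetch, external_depth and hops_outside are hypothetical, and "level of archiving" is assumed here to mean the number of hops the crawler may take outside the root domain.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin, urlsplit


@dataclass
class CrawlPolicy:
    """Decides which links a crawler should follow (hypothetical helper)."""

    root_domain: str                 # e.g. "ecn.org"
    external_depth: int = 0          # assumed meaning: hops allowed outside root_domain
    external_hosts: set = field(default_factory=set)  # every external host seen

    def is_internal(self, host: str) -> bool:
        # Wish 2: the root domain and all of its subdomains count as internal.
        return host == self.root_domain or host.endswith("." + self.root_domain)

    def should_fetch(self, page_url: str, link: str, hops_outside: int) -> bool:
        """Return True if `link`, found on `page_url`, should be downloaded.

        `hops_outside` is the number of external links already followed
        to reach `page_url` (0 while still on the root domain).
        """
        # Resolve relative and protocol-relative links against the page URL.
        host = urlsplit(urljoin(page_url, link)).hostname or ""
        if not host:
            return False  # mailto:, javascript: and similar links have no host
        if self.is_internal(host):
            return True
        self.external_hosts.add(host)  # wish 1: remember it for a later archive
        return self.external_depth > hops_outside  # wish 3: bounded external depth


policy = CrawlPolicy(root_domain="ecn.org", external_depth=1)
page = "http://isole.ecn.org/index.html"
print(policy.should_fetch(page, "/about.html", 0))
print(policy.should_fetch(page, "//fonts.googleapis.com/css?family=Lato:300,400,700", 0))
print(sorted(policy.external_hosts))
```

&lt;p&gt;Keeping the record of external hosts separate from the decision to fetch means that even a run with external_depth set to 0 still produces the list of sites to archive later.&lt;/p&gt;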
&lt;h2&gt;The approach&lt;/h2&gt;
&lt;p&gt;Right now httrack does a perfect job of storing the webpages with a coherent folder layout and offline browsing capabilities.
Let's see how it does that.&lt;/p&gt;
&lt;p&gt;On the original page I can find:
&lt;code&gt;&amp;lt;link rel='stylesheet' id='avia-google-webfont' href='//fonts.googleapis.com/css?family=Lato:300,400,700' type='text/css' media='all'/&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When httrack saves it:
&lt;code&gt;&amp;lt;link rel='stylesheet' id='avia-google-webfont' href='../fonts.googleapis.com/css8f84.css?family=Lato:300,400,700' type='text/css' media='all'/&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;</description><category>dlang</category><category>programming</category><category>web</category><guid>francescomecca.eu/blog/2018/15/11/programming-session/</guid><pubDate>Thu, 15 Nov 2018 00:00:00 GMT</pubDate></item></channel></rss>