Caught in the Net (Posts about web) · francescomecca.eu · Contents © 2019 <a href="mailto:francescomecca.eu">Francesco Mecca</a> · Sun, 24 Mar 2019 14:38:24 GMT · Nikola (getnikola.com)

Experiment part 1
francescomecca.eu/blog/2018/15/11/programming-session/
Francesco Mecca

<div><p>This is an experimental blogpost with the goal of improving my design of system programs.</p>
<p>What follows is a stream of notes that tries to capture the process I follow when I program.</p>
<h2>17.11</h2>
<p>This is the <a href="https://www.discogs.com/Various-Psychedelic-Underground-13/release/1251814">tracklist</a>:</p>
<ol>
<li>Waniyetula - Lindis Farne</li>
<li>Missus Beastly - Steel's Electric</li>
<li>Vikings Invasion - Shadow Boogie</li>
<li>Live - Fly Like A Bird</li>
<li>Arktis - Speed Boogie</li>
<li>Level π - Hubble's Dream - Dream Without End</li>
<li>Emma Myldenberger - RAA</li>
<li>Skyline - Beautiful Lady</li>
<li>Guru Guru - Oxymoron</li>
</ol>
<h2>The problem</h2>
<p>Yesterday I got an insider tip that http://isole.ecn.org/ might be taken down, and I remembered that I had an old archive of the entire site. I am an amateur data hoarder, so I keep archives and backups of pretty much everything I find amusing.</p>
<p>I wanted to update the archive, so I used httrack. I ran into the following problems:</p>
<ol>
<li>it is too slow (it doesn't even use 1/20 of my connection)</li>
<li>it doesn't understand that if I want to mirror ecn.org I probably want isole.ecn.org as well</li>
<li>if I configure httrack to archive everything, it eventually hits external websites (like google.com) and starts archiving those as well</li>
<li>I can't configure it to skip every external link, because I may need some of them</li>
</ol>
<p>A perfect scraper for me would: 1. record a list of every external website it hits, so that later I can archive them as well; 2. allow me to scrape subdomains as well; 3.
configure a level of archiving for external websites.</p>
<h2>The approach</h2>
<p>Right now httrack does a good job of storing the webpages with a coherent folder layout and offline browsing capabilities. Let's see how it does that.</p>
<p>On the original page I find: <code>&lt;link rel='stylesheet' id='avia-google-webfont' href='//fonts.googleapis.com/css?family=Lato:300,400,700' type='text/css' media='all'/&gt;</code></p>
<p>When httrack saves it: <code>&lt;link rel='stylesheet' id='avia-google-webfont' href='../fonts.googleapis.com/css8f84.css?family=Lato:300,400,700' type='text/css' media='all'/&gt;</code></p></div>
Tags: dlang, programming, web · Thu, 15 Nov 2018 00:00:00 GMT
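The scraper wishlist above (record external hits for later, treat subdomains as internal, keep externals out of the current pass) can be sketched in a few lines. This is only an illustration of the classification logic, not httrack code: <code>classify</code> and <code>crawl_plan</code> are hypothetical names, and the root domain ecn.org is taken from the example in the post.

```python
from urllib.parse import urlparse

ROOT = "ecn.org"  # the domain being mirrored, from the example above

def classify(url, root=ROOT):
    """Return 'internal', 'subdomain' or 'external' for a URL
    relative to the root domain being mirrored."""
    host = urlparse(url).hostname or ""
    if host == root:
        return "internal"
    if host.endswith("." + root):
        return "subdomain"          # e.g. isole.ecn.org
    return "external"

def crawl_plan(urls, root=ROOT):
    """Split discovered URLs into those to mirror now (root plus
    subdomains) and a deduplicated list of external sites to
    record, so they can be archived in a later pass."""
    mirror, external = [], []
    for u in urls:
        if classify(u, root) == "external":
            external.append(u)      # archive these later
        else:
            mirror.append(u)
    return mirror, sorted(set(external))
```

With this split, wishlist item 1 is just the <code>external</code> list written to disk, and item 2 falls out of treating subdomains as internal; item 3 would add a per-external-site depth limit on top.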
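The rewrite shown above suggests the general scheme: the hostname becomes a directory, and the query string is folded into a short suffix (<code>8f84</code>) plus a guessed extension, so that URLs differing only in their query map to distinct local files. The sketch below imitates that idea; it is an assumption, not httrack's actual algorithm: <code>localize</code> is a hypothetical helper, and the 4-character md5 prefix is my guess at how the query could be disambiguated (httrack's real hash scheme is likely different, so it will not reproduce <code>8f84</code> exactly, and it drops the query string that httrack keeps in the href).

```python
import hashlib
from urllib.parse import urlsplit

def localize(url, ext=".css"):
    """Map a remote URL to a local relative path the way the saved
    page above suggests: '../' + hostname + path, with the query
    string replaced by a short hash plus a guessed extension."""
    parts = urlsplit(url)           # handles '//host/path?q' URLs too
    suffix = ""
    if parts.query:
        digest = hashlib.md5(parts.query.encode()).hexdigest()[:4]
        suffix = digest + ext       # e.g. 'css' -> 'css8f84.css'
    return "../" + parts.netloc + parts.path + suffix
```

For example, <code>localize("//fonts.googleapis.com/css?family=Lato:300,400,700")</code> yields a path of the form <code>../fonts.googleapis.com/cssXXXX.css</code>, matching the shape of the rewritten href above.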