<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>island94.org &#187; data</title>
	<atom:link href="http://www.island94.org/tag/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.island94.org</link>
	<description>an internet backwater</description>
	<lastBuildDate>Wed, 08 Sep 2010 17:21:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Good enough data</title>
		<link>http://www.island94.org/2010/02/good-enough-data/</link>
		<comments>http://www.island94.org/2010/02/good-enough-data/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 01:10:51 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[ability]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[yawn]]></category>

		<guid isPermaLink="false">http://www.island94.org/?p=1776</guid>
		<description><![CDATA[I&#8217;ve been spending some time at work scraping data. Long story short: government transparency is not transparent when the only access they give you is a pile of poorly structured html. That&#8217;s better than government opacity but not past the level of frosted glass: titillating but unsatisfying. If your expected audience is pencil pushers, please [...]


]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-medium wp-image-1783" title="btop-map-combined" src="http://www.island94.org/wp-content/uploads/2010/02/btop-map-combined-500x462.png" alt="" width="500" height="462" /></p>
<p>I&#8217;ve been spending some time at work scraping data. Long story short: government transparency is not transparent when the only access they give you is a pile of poorly structured html. That&#8217;s better than government opacity but not past the level of frosted glass: titillating but unsatisfying. If your expected audience is pencil pushers, please release your data in a spreadsheet. <a href="http://transmissionproject.org/current/2009/11/ntia-broadband-access-data">That&#8217;s what I did</a>.</p>
<p>Notes for nerds:</p>
<p><strong>Regular Expressions vs. Parsing Engines: </strong>I wrote the first parser in Python with regular expressions, then rewrote it with BeautifulSoup (a Python HTML parsing library). The regular-expression version took me about 2 hours to write; the BeautifulSoup version took about 2 days. It&#8217;s slightly easier to maintain now, but you tell me which one is more semantically correct:</p>
<p><code>project_title = re.search('&lt;tr&gt;&lt;td&gt;&lt;b&gt;Project&amp;nbsp;title&lt;/b&gt;&lt;/td&gt;&lt;td&gt;(.+)&lt;/td&gt;&lt;/tr&gt;', line)</code></p>
<p>versus</p>
<p><code>project_title = app.find(text="Project&amp;nbsp;title").parent.parent.nextSibling.string</code></p>
<p>Yep, the source is written in 2-column tables with each row being a different data-set: the first column holds a key (if there is a key; sometimes there isn&#8217;t) and the second column holds the data. With RegExp, I know exactly what I&#8217;m looking for. With the parser, I have to find the element in the tree, then traverse up, over and down (if there isn&#8217;t a key, I have to go up, up, over, over, over, down, over, down). The data itself is a big set of applications (2000+ total) and each application has about 15 different data-sets (some with keys, some just following a consistent-ish pattern).</p>
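<p>Here is a minimal sketch of the two approaches side by side, run against a made-up two-row table. Everything in it is an assumption for illustration: the sample rows and the names <code>SAMPLE</code>, <code>rows_by_regex</code>, <code>TwoColumnTable</code> and <code>rows_by_parser</code> are hypothetical, and the parser half uses the standard library&#8217;s <code>html.parser</code> instead of BeautifulSoup so that it runs with nothing installed. It is not the scraper described above.</p>

```python
import re
from html.parser import HTMLParser

# Hypothetical sample mimicking the 2-column key/value layout described above.
SAMPLE = (
    "<table>"
    "<tr><td><b>Project&nbsp;title</b></td><td>Rural Fiber Buildout</td></tr>"
    "<tr><td><b>Applicant</b></td><td>Example County</td></tr>"
    "</table>"
)

# Approach 1: a regular expression that spells out the exact markup of a row.
ROW_RE = re.compile(r"<tr><td><b>(.+?)</b></td><td>(.+?)</td></tr>")


def rows_by_regex(html):
    """Return {key: value} for every row that matches the literal pattern."""
    return {key.replace("&nbsp;", " "): value for key, value in ROW_RE.findall(html)}


# Approach 2: an event-driven parser that collects the text of each cell.
class TwoColumnTable(HTMLParser):
    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &nbsp; to "\xa0"
        self.rows = []
        self._cells = None  # cells of the row being read, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._in_cell:
            # Normalise the non-breaking space so keys compare cleanly.
            self._cells[-1] += data.replace("\xa0", " ")


def rows_by_parser(html):
    """Return {key: value} for every row that has exactly two cells."""
    parser = TwoColumnTable()
    parser.feed(html)
    parser.close()
    return {cells[0]: cells[1] for cells in parser.rows if len(cells) == 2}


if __name__ == "__main__":
    print(rows_by_regex(SAMPLE))
    print(rows_by_parser(SAMPLE))
```

<p>On this sample both functions return the same dictionary. If the <code>&lt;td&gt;</code> tags gain an attribute, the regex stops matching while the parser version keeps working, which is roughly the maintainability trade-off described above.</p>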
<p>Fortunately, I have an <a href="http://www.media-democracy.net/">appreciative audience</a> for my troubles and it lets me <a href="http://transmissionproject.org/current/2010/2/btop-applications-and-awards-by-state">draw pretty maps</a> like the ones above. Those were also <a href="http://flowingdata.com/2009/11/12/how-to-make-a-us-county-thematic-map-using-free-tools/">done with Python</a>, by parsing an SVG vector image.</p>
<p><strong>Michigan boaters beware</strong>: there is now an isthmus between Mackinaw City and St. Ignace. Rather than rewrite the process for grouped-shapes&#8212;Michigan being in 2 parts&#8212;it was good enough to make Michigan 1. Hawaii somehow endured.</p>


]]></content:encoded>
			<wfw:commentRss>http://www.island94.org/2010/02/good-enough-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
