<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <channel>
    
    <title>DDJ &#45; Resources</title>
    <link>http://datadrivenjournalism.net/resources</link>
    <description>DDJ &#45; Resources</description>
    <dc:language>en</dc:language>
    <dc:creator>support@ejc.net</dc:creator>
    <dc:rights>Copyright 2012</dc:rights>
    <dc:date>2012-01-25T09:46:52+00:00</dc:date>
    <admin:generatorAgent rdf:resource="http://expressionengine.com/" />
    

    <item>
      <title>Essential visualisation resources: Tools for analysis, collection and enterprise</title>
      <link>http://datadrivenjournalism.net/resources/part_1_the_essential_collection_of_visualisation_resources</link>
      <guid>http://datadrivenjournalism.net/resources/part_1_the_essential_collection_of_visualisation_resources#When:14:33:57Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by <a href="http://www.visualisingdata.com/index.php/about/">Andy Kirk</a> on <a href="http://www.visualisingdata.com/">Visualising Data</a>, 17 March 2011. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	This is the first part of a <a href="http://www.visualisingdata.com/index.php/resources/">multi-part series</a> designed to share with readers an inspiring collection of the most important, effective, useful and practical data visualisation resources. The series will cover visualisation tools, resources for sourcing and handling data, online learning tutorials, visualisation blogs, visualisation books and academic papers. Your feedback is most welcome to help capture any additions or revisions so that this collection can live up to its claim as the essential list of resources.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7143/6765533427_a1f01be303.jpg" style="width: 500px; height: 46px; " /></p>
<p>
	This first part presents the data visualisation tools associated with conducting analysis, creating effective graphs and implementing business intelligence operations.</p>
<p>
	Please note, I may not have personally used all tools presented but have seen sufficient evidence of their value from other sources. Also, to avoid re-inventing the wheel, descriptive text may have been reproduced from the native websites for some resources.</p>
<p>
	&nbsp;</p>
<h3>
	Microsoft Excel</h3>
<p>
	Microsoft Excel is the most popular spreadsheet tool in the world, with over 400 million users, and therefore the most accessible tool for conducting analysis and presenting data in graphical format. The package receives a great deal of justified criticism within the visualisation field for its appalling defaults and the range of bad-practice graph designs it promotes, yet in the right hands it can be an incredibly powerful and effective visualisation tool.</p>
<p>
	<a href="http://office.microsoft.com/en-us/excel/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; under &pound;100/$150 per license | <span style="color:#808080;">Tags:</span> Spreadsheet, Office, Graphing</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://peltiertech.com/">Peltier Tech</a> | <a href="http://www.excelcharts.com/blog/">Excel Charts Blog</a> | <a href="http://chandoo.org/wp/">Chandoo</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7163/6765537767_f436436d21.jpg" style="width: 500px; height: 182px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Open Office Calc</h3>
<p>
	For those who cannot afford or get access to a Microsoft Excel license, OpenOffice.org is an open-source project providing a free office suite that mirrors much of the functionality provided by Microsoft Office. The aim is &ldquo;to create the best possible office suite that all can use&rdquo;. The Excel equivalent is Calc and, although some of the graphing features are limited, it is an ever-evolving tool that is being used by many and is only going to improve.</p>
<p>
	<a href="http://why.openoffice.org/why_great.html">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Spreadsheet, Office, Graphing</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://wiki.services.openoffice.org/wiki/Documentation/OOo3_User_Guides/Calc_Guide/Gallery_of_chart_types">Gallery of chart types</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7152/6765537773_2bc757e183.jpg" style="width: 500px; height: 317px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Tableau Desktop</h3>
<p>
	Tableau Desktop is based on breakthrough technology from Stanford University that lets you drag &amp; drop to analyse data rapidly and fluidly, connect to data in a few clicks, then visualise and create interactive dashboards in an instant. Tableau have based their product on years of research to build a system that supports people&rsquo;s natural ability to think visually, providing a tool that lets you easily build beautiful, effective, rich data visualisations.</p>
<p>
	<a href="http://www.tableausoftware.com/products">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;600/$999 Personal, &pound;1200/$1999 Professional | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Dashboard</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://theinformationlab.co.uk/">The Information Lab</a> | <a href="http://www.thedatastudio.co.uk/">The Data Studio</a> | <a href="http://www.freakalytics.com/">Freakalytics</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7030/6765537785_b24b8fcb52.jpg" style="width: 500px; height: 413px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Tableau Public</h3>
<p>
	Tableau Public is the web-based, publicly accessible version of Tableau Desktop which enables you to create interactive visualisations and embed them into your website, publish them on the Tableau Public Gallery or share them within the Tableau Public community. Note that the visualisations cannot be saved locally; that is the &lsquo;public&rsquo; essence of this free tool.</p>
<p>
	<a href="http://www.tableausoftware.com/public">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Community</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://theinformationlab.co.uk/">The Information Lab</a> | <a href="http://twitter.com/#!/joemako">Joe Mako</a> | <a href="http://oecdfactblog.org/tableaugallery.html">OECD Factblog</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7156/6765537789_12a51dd7f2.jpg" style="width: 500px; height: 384px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	TIBCO Spotfire</h3>
<p>
	TIBCO Spotfire Professional aims to make it easier to build and deploy analytic applications over the web or perform ad-hoc analytics on-the-fly by letting you interactively query, visualise, aggregate, filter, and drill into datasets of virtually any size.</p>
<p>
	<a href="http://spotfire.tibco.com/products/spotfire-professional/exploratory-data-analysis.aspx">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;/$ unknown | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://spotfire.tibco.com/demo/default.aspx">Demo Gallery</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7170/6765537793_40f9dfdfed.jpg" style="width: 500px; height: 363px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	QlikView</h3>
<p>
	The QlikView platform aims to bridge the gap between traditional BI solutions and standalone office productivity applications, enabling users to forge new paths and make new discoveries. QlikView infuses a broad set of new capabilities, analysis, insight, and value to existing data stores with user interfaces that are clean, simple, and straightforward.</p>
<p>
	<a href="http://www.qlikview.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;/$ unknown | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Dashboard</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://www.qlikview.com/us/explore/experience">Demo Gallery</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7156/6765537801_0e9616f3c1.jpg" style="width: 500px; height: 328px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Grapheur</h3>
<p>
	Grapheur is a reactive Business Intelligence tool integrating data mining, modeling, multi-variate analysis and interactive visualisation into an end-to-end discovery and continuous innovation process powered by creativity and curiosity.</p>
<p>
	<a href="http://grapheur.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;/$ unknown | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Multi-Variate</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://grapheur.com/info/cases/">Demo Gallery</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7164/6765546741_f407bdc225.jpg" style="width: 500px; height: 400px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Gephi</h3>
<p>
	Gephi is an open-source, free interactive visualisation and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. It claims to be &ldquo;like Photoshop but for data&rdquo;, allowing the user to interact with the data representation, manipulate structures, shapes and colors to reveal hidden properties.</p>
<p>
	<a href="http://gephi.org/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Complex Systems</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://gephi.org/features/">Demo Gallery</a> | <a href="http://www.visualizing.org/full-screen/29391">The VIZoSPHERE</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7142/6765548487_7ec9119afd.jpg" style="width: 500px; height: 399px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Visokio Omniscope</h3>
<p>
	Visokio Omniscope is a versatile, multi-tab and multi-view interactive data analysis, filtering and presentation tool. It offers a powerful new way to visualise, explore and report on large tables of data &ndash; with related images, maps, links, and more &ndash; then lets you share your file with others using the free Viewer.</p>
<p>
	<a href="http://www.visokio.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;/$ unknown | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Multi-Format</p>
<p>
	<span style="color:#808080;">Examples and references:</span> <a href="http://www.visokio.com/omniscope/demos">Demo Gallery</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7034/6765550455_4a4548c258.jpg" style="width: 500px; height: 378px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Panopticon</h3>
<p>
	Panopticon data visualisation software supports rapid analysis of fast-changing and historical time series data sets. You can deploy it on the desktop or over the web, or embed it into your own enterprise applications. Originally focusing on real-time treemap visualisations, the product suite is now much broader, encompassing traditional options such as bar charts and line graphs, an innovative time-series solution termed &lsquo;horizon graphs&rsquo;, and more contemporary solutions such as Stephen Few&rsquo;s bullet graph and Edward Tufte&rsquo;s sparklines. This creates a great variety of innovative and effective visualisations that can be combined into a single powerful, interactive dashboard display. Most importantly, they employ best practice visual principles throughout their offering, which sets them apart from competitors.</p>
<p>
	<a href="http://www.panopticon.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Trial &gt; &pound;/$ unknown | <span style="color:#808080;">Tags:</span> Statistical Analysis, Business Intelligence, Dashboard</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://www.panopticon.com/demo_gallery/index.php">Gallery</a> | <a href="http://www.perceptualedge.com/blog/?p=965">Review from Stephen Few</a> | <a href="http://www.panopticon.com/showroom/white_papers_data_visualization_software_technology.htm">White papers</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (July 7, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7166/6765551963_33b0e567dc.jpg" style="width: 500px; height: 302px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Notable Others&hellip;</h3>
<p>
	Here are some additional suggestions you may wish to consider within this category of visualisation resources:</p>
<p>
	<a href="http://www.wolfram.com/mathematica/">Wolfram Mathematica</a> | Bring in your data, combine it with Wolfram Alpha&rsquo;s ever-increasing store of knowledge, apply sophisticated symbolic and numeric analysis, and create state-of-the-art visualizations&mdash;all in one system, with one integrated workflow.</p>
<p>
	<a href="http://www.visualdatatools.com/DataGraph/">Data Graph</a> | DataGraph is a simple and powerful graphing application for Mac OS X &ndash; a companion for Excel, Numbers or any of the big statistical packages.</p>
<p>
	<a href="http://www.omnigroup.com/products/omnigraphsketcher">OmniGraphSketcher</a> | OmniGraphSketcher helps you make elegant and precise graphs in seconds, whether you have specific data to visualise or you just have a concept to explain.</p>
<p>
	<a href="http://plot.micw.eu/">PLOT</a> | PLOT is a scientific 2D plotting program for Mac OS X designed for everyday plotting &ndash; it is easy to use, creates high quality plots, allows easy and powerful manipulations and calculations of data, and is free.</p>
<p>
	<a href="http://www.mathworks.com/products/matlab/">MATLAB</a> | All the graphics features that are required to visualise engineering and scientific data are available in MATLAB&reg;, including 2-D and 3-D plotting functions, 3-D volume visualization functions, tools for interactively creating plots, and the ability to export results to all popular graphics formats.</p>
<p>
	<a href="http://www-01.ibm.com/software/analytics/spss/products/statistics/vizdesigner/">SPSS Visualisation Designer</a> | Easily develop and build new visualisations that enable new ways to portray and communicate analytics to others. No extensive programming skills are required to conceive, create and share compelling visualizations.</p>
<p>
	<a href="http://www.stata.com/">STATA</a> | Stata is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics &ndash; you get everything you need in one package.</p>
<p>
	<a href="http://visualizefree.com/index.jsp">Visualize Free</a> | Visualize Free is a free visual analysis tool, providing the perfect solution for visually exploring and presenting data that standard office charting software cannot handle.</p>
<p>
	<a href="http://www.dundas.com/dashboard/">Dundas</a> | Dundas Dashboard brings together all of the tools you need to build meaningful, interactive and fully customized dashboards in one easy to use platform.</p>
<p>
	<a href="http://www.wondergraphs.com/">Wondergraphs</a> | Wondergraphs strives to be the best way to get and share insights from your data, offering a free &gt; premium set of online, graphical report design tools for analysts and businesses.</p>
]]></description> 
      <dc:date>2012-01-26T14:33:57+00:00</dc:date>
    </item>

    <item>
      <title>Essential visualisation resources: Tools for mapping</title>
      <link>http://datadrivenjournalism.net/resources/part_4_the_essential_collection_of_visualisation_resources</link>
      <guid>http://datadrivenjournalism.net/resources/part_4_the_essential_collection_of_visualisation_resources#When:09:46:52Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by <a href="http://www.visualisingdata.com/index.php/about/">Andy Kirk</a>&nbsp;on <a href="http://www.visualisingdata.com">Visualising Data</a>, 1 May 2011. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	This is the fourth part of a <a href="http://www.visualisingdata.com/index.php/resources/">multi-part series</a> designed to share with readers an inspiring collection of the most important, effective, useful and practical data visualisation resources. The series will cover visualisation tools, resources for sourcing and handling data, online learning tutorials, visualisation blogs, visualisation books and academic papers. Your feedback is most welcome to help capture any additions or revisions so that this collection can live up to its claim as the essential list of resources.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7018/6759420767_10acbc109e.jpg" style="width: 500px; height: 46px; " /></p>
<p>
	This fourth part presents a broad range of visualisation resources that can be used for representing data via maps. This is a rapidly evolving subset of the population of visualisation resources, one that seems to be constantly introducing us to new tools and clever technologies to bring innovation to the representation of geographical data. There are many tools for creating maps and plotting location data, but this collection focuses on those tools that provide the means of overlaying data to represent a visualisation within a geographical context.</p>
<p>
	Please note, I may not have personally used all the tools presented here but have seen sufficient evidence of their value from other sources. Also, to avoid re-inventing the wheel, descriptive text may have been reproduced from native websites for some resources.</p>
<p>
	&nbsp;</p>
<h3>
	Google Maps &amp; Google Earth</h3>
<p>
	Google Maps is undoubtedly the most commonly used mapping technology on the web, allowing users to explore the world in incredible detail. Google Earth essentially provides a 3D interface view of the globe, letting you pan and zoom to explore the Earth. The real power of these tools from a visualisation sense comes particularly through the features of accompanying APIs (e.g. Google Maps API and Google Visualization API) and through the process of combining KML data, which enables you to overlay your own visual data onto the foundation 2D or 3D mapped views.</p>
<p>
	<a href="http://www.google.co.uk/help/maps/tour/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Google, KML, 3D</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://www.google.co.uk/intl/en_uk/earth/learn/">Google Earth Tutorials</a> | <a href="http://econym.org.uk/gmap/">Google Maps Tutorial</a> | <a href="http://thematicmapping.org/">Thematic Mapping</a> | <a href="http://www.casa.ucl.ac.uk/software/gmapcreator.asp">GMapCreator</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7167/6759432053_fe272a691e.jpg" style="width: 500px; height: 263px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	ArcGIS</h3>
<p>
	ArcGIS is a range of powerful and versatile mapping tools from ESRI. The products allow you to integrate data layers onto maps, globes, and models on the desktop and serve them out for use on a desktop, in a browser, or in the field via mobile devices. For developers, ArcGIS gives you APIs for building rich, interactive applications using JavaScript, Flex, or Silverlight, and for embedding your applications into Web pages or launching stand-alone Web applications.</p>
<p>
	<a href="http://www.arcgis.com/home/index.html">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free &gt; Paid Licenses | <span style="color:#808080;">Tags:</span> Desktop, web-based, APIs</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://www.esri.com/software/arcgis/arcinfo/index.html">ArcGIS Desktop</a> | <a href="http://www.arcgis.com/home/webmap/viewer.html">ArcGIS Web Mapping</a> | <a href="http://explorer.arcgis.com/">ArcGIS Explorer</a> | <a href="http://www.arcgis.com/home/gallery.html">Gallery</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7163/6759433115_7fd28de0f2.jpg" style="width: 500px; height: 214px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	GeoCommons</h3>
<p>
	GeoCommons enables everyone to create rich interactive visualisations to solve problems without any experience using traditional mapping tools. Map real-time social data and the over 50,000 open-source data sets in GeoCommons, then share your interactive maps and analysis with others by embedding them in websites or blogs, or by sharing via Facebook or Twitter. APIs allow developers to enhance the scope of the GeoCommons visualisations. GeoIQ extends the functionality of GeoCommons by delivering advanced security, large enterprise data support and robust location analytics.</p>
<p>
	<a href="http://geocommons.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> GeoCommons = Free, GeoIQ = Paid Licenses | <span style="color:#808080;">Tags:</span> Interactive, web-based, datasets, APIs</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://www.geoiq.com/">GeoIQ</a> | <a href="http://geocommons.com/help/About">User Manual</a> | <a href="http://issuemap.org/">IssueMap</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7173/6759434223_f7cf54e1be.jpg" style="width: 500px; height: 207px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	OpenHeatMap</h3>
<p>
	OpenHeatMap is a straightforward and accessible way for non-specialists to upload data and create maps that communicate information. Developed by Pete Warden, and incorporating the <a href="http://www.openstreetmap.org/">OpenStreetMap</a> map data, it transforms data from sources such as a Google Spreadsheet into an interactive, animated view of a geographical area, which you can then share online. For developers, it&rsquo;s a jQuery plugin that makes it easy to create a completely open-source mapping component on any web page, using either Flash or HTML5&rsquo;s Canvas element.</p>
<p>
	<a href="http://www.openheatmap.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags: </span>Spreadsheet, interactive, JQuery</p>
<p>
	<span style="color:#808080;">Good examples and references: </span><a href="http://www.openheatmap.com/gallery.html">Gallery</a> | <a href="http://wiki.github.com/petewarden/openheatmap/">Documentation</a></p>
<p>
	<span style="color:#808080;">Status: </span>Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7171/6759436377_2bb281ce73.jpg" style="width: 500px; height: 377px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Indiemapper</h3>
<p>
	Indiemapper describes itself as a smarter, easier, more elegant way to make thematic maps from digital data, closing the gap between data and map by taking a visual approach to map-making. It is a web-based app that loads geo-data, allows custom control over mapmaking, and exports static maps in vector and raster formats. Indiemapper gives you the tools you need to make beautiful thematic maps without overwhelming you with hundreds of obscure GIS functions, and nothing is more than 2 clicks away, which keeps mapmaking simple, fast, and fun.</p>
<p>
	<a href="http://indiemapper.com/">Find out more information</a> | <span style="color:#808080;">Cost: </span>Trial &gt; $30 per month license | <span style="color:#808080;">Tags:</span> Portable, KML, interactive</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://indiemapper.com/gallery.php">Gallery</a> | <a href="http://indiemapper.com/about.php">System information</a> | <a href="http://flowingdata.com/2010/04/28/review-indiemapper-makes-thematic-maps-easy/">FlowingData Review</a></p>
<p>
	<span style="color:#808080;">Status: </span>Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<embed allowfullscreen="true" allowscriptaccess="always" flashvars="config=http://videos.indiemapper.com.s3.amazonaws.com/intro.xml" height="360" src="http://videos.indiemapper.com.s3.amazonaws.com/player.swf" width="640"></embed></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	InstantAtlas</h3>
<p>
	InstantAtlas is premium data presentation software for location-based statistical data enabling information analysts and researchers to create highly-interactive dynamic and profile reports that combine statistics and map data to improve data visualization, enhance communication, and engage people in more informed decision making. With a range of products, features and reporting options it represents a powerful option for this particular type of visualisation challenge.</p>
<p>
	Find out more information | <span style="color:#808080;">Cost: </span>Trial &gt; $1,000+ license plus add-ons | <span style="color:#808080;">Tags:</span> Desktop &amp; server, interactive, collaborative</p>
<p>
	<span style="color:#808080;">Good examples and references:</span> <a href="http://www.instantatlas.com/products.xhtml">Products</a> | <a href="http://www.instantatlas.com/support.xhtml">Tutorials and Support Zone</a> |&nbsp; <a href="http://www.instantatlas.com/iashots_SM.xhtml">Single Map Examples</a> | <a href="http://flowingdata.com/2010/04/02/map-and-report-data-with-instantatlas/">FlowingData Review</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7019/6759438143_5cb70c534f.jpg" style="width: 500px; height: 260px; " /></p>
<p>
	&nbsp;</p>
<h3>
	Target Map</h3>
<p>
	Developed by MapGenia of Barcelona, TargetMap aims to provide an easy way to create and share customised data maps online, allowing everyone from individuals to large organisations to represent their data on maps of any country in the world and to share their knowledge through an online community and gallery of creations.</p>
<p>
	<a href="http://www.targetmap.com/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Community, Embeddable</p>
<p>
	<span style="color:#808080;">Good examples and references: </span><a href="http://targetmap.blogspot.com/">Blog/Help</a></p>
<p>
	<span style="color:#808080;">Status: </span>Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7018/6759439545_1a49e0f4d9.jpg" style="width: 500px; height: 238px; " /></p>
<p>
	&nbsp;</p>
<h3>
	TileMill</h3>
<p>
	TileMill is a modern map design studio powered by open source technology. It is a tool for cartographers to quickly and easily design maps for the web using custom geographical data and then integrate layers of data for visual representation. Maps created with TileMill can be displayed using the Google Maps API, OpenLayers and a number of other projects.</p>
<p>
	<a href="http://tilemill.com/index.html">Find out more information</a> |<span style="color:#808080;"> Cost:</span> Free | <span style="color:#808080;">Tags:</span> Open Source, KML, Mac/Ubuntu</p>
<p>
	<span style="color:#808080;">Good examples and references: </span><a href="http://tilemill.com/manual.html">Manual</a> | <a href="http://developmentseed.org/blog/2011/feb/16/announcing-tilemill-modern-map-design-studio-powered-open-source">TileMill Announcement</a> | <a href="http://mapbox.com/">MapBox</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<iframe allowfullscreen="" frameborder="0" height="250" mozallowfullscreen="" src="http://player.vimeo.com/video/20006926?title=0&amp;byline=0&amp;portrait=0" webkitallowfullscreen="" width="400"></iframe></p>
<p>
	&nbsp;</p>
<p style="text-align: center; ">
	<a href="http://vimeo.com/20006926">TileMill: Open Source Map Design</a> from <a href="http://vimeo.com/developmentseed">Development Seed</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>
	&nbsp;</p>
<h3>
	Polymaps</h3>
<p>
	Polymaps is a project from <a href="https://simplegeo.com/">SimpleGeo</a> and <a href="http://stamen.com/">Stamen</a> which offers a free JavaScript library for making dynamic, interactive maps in modern web browsers. It provides speedy display of multi-zoom datasets over maps, and supports a variety of visual presentations for tiled vector data. Because Polymaps can load data at a full range of scales, it&rsquo;s ideal for showing information from country level on down to states, cities, neighborhoods, and individual streets.</p>
<p>
	<a href="http://polymaps.org/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free |<span style="color:#808080;"> Tags: </span>JavaScript, interactive, embeddable</p>
<p>
	<span style="color:#808080;">Good examples and references: </span><a href="http://polymaps.org/ex/">Examples</a> | <a href="http://polymaps.org/docs/">Documentation</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7005/6759440729_e741f8db79.jpg" style="width: 500px; height: 278px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
<h3>
	Color Brewer</h3>
<p>
	Color Brewer offers a slightly different utility to those listed above. It is a wonderfully useful diagnostic tool that helps you evaluate the effectiveness of individual colour schemes for use on map designs that represent data, such as choropleth maps. Using the recommended colour classes will aid map legibility and enhance the potential for accurate interpretation of, and general insights from, map-based visualisations.</p>
<p>
	<a href="http://colorbrewer2.org/">Find out more information</a> | <span style="color:#808080;">Cost:</span> Free | <span style="color:#808080;">Tags:</span> Colour use, map design</p>
<p>
	<span style="color:#808080;">Good examples and references: </span><a href="http://www.ingentaconnect.com/content/maney/caj/2003/00000040/00000001/art00004">Academic Paper</a> | <a href="http://www.personal.psu.edu/faculty/c/a/cab38/">Cynthia Brewer</a> | <a href="http://www.typebrewer.org/">TypeBrewer</a></p>
<p>
	<span style="color:#808080;">Status:</span> Ongoing (7th July, 2011)</p>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7009/6759441657_00bd578031.jpg" style="width: 500px; height: 317px; " /></p>
<p style="text-align: center; ">
	&nbsp;</p>
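<p>
	As a rough illustration of how a classed colour scheme gets applied to data, here is a minimal sketch in R. The rates are made up, the variable names are hypothetical, and the five hex values are assumed to match Color Brewer&rsquo;s five-class yellow-orange-red sequential scheme; check them against the tool before relying on them.</p>

```r
# Hypothetical rates per 100,000 for eight areas (made-up values)
rates = c(9.8, 12.4, 15.1, 17.6, 18.2, 21.0, 24.7, 30.3)

# Five-class sequential scheme, light to dark
# (assumed to match Color Brewer's five-class YlOrRd scheme)
pal = c('#ffffb2', '#fecc5c', '#fd8d3c', '#f03b20', '#bd0026')

# Class breaks at the quintiles of the data: six breaks give five classes
breaks = quantile(rates, probs = seq(0, 1, length.out = 6))

# Assign each area to a class (1 to 5), keeping the minimum in class 1
classes = cut(rates, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Look up the fill colour for each area
fill = pal[classes]
```

<p>
	Quantile breaks are only one possible classification; equal intervals or hand-picked thresholds may suit a given map better.</p>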
<h3>
	Notable others&hellip;</h3>
<p>
	<a href="http://dotspotting.org/">Dotspotting</a> | Dotspotting is the first project Stamen is releasing as part of Citytracking, a project funded by the Knight News Challenge, making tools to help people gather data about cities and make that data more legible.</p>
<p>
	<a href="http://www.datamaps.eu/">DataMaps.eu</a> | DataMaps.eu offers a free visualization tool that can convert your complex location-related data, without any programming effort, into appealing, easy to understand visualisations.</p>
<p>
	<a href="http://geotime.com/">GeoTime</a> | This award-winning visual analysis tool places an emphasis on visual presentations, introducing new ways to visualize events over time, including the ability to run statistical functions on numerical attributes within your data.</p>
]]></description> 
      <dc:date>2012-01-25T09:46:52+00:00</dc:date>
    </item>

    <item>
      <title>Power tools for aspiring data journalists: Funnel Plots in R</title>
      <link>http://datadrivenjournalism.net/resources/power_tools_for_aspiring_data_journalists_funnel_plots_in_r</link>
      <guid>http://datadrivenjournalism.net/resources/power_tools_for_aspiring_data_journalists_funnel_plots_in_r#When:15:26:05Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by Tony Hirst on <a href="http://blog.ouseful.info">OUseful.Info, the blog&hellip;</a>, 31 October 2011. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	Picking up on Paul Bradshaw&rsquo;s post <a href="http://onlinejournalismblog.com/2011/10/31/a-quick-exercise-for-aspiring-data-journalists/">A quick exercise for aspiring data journalists</a> which hints at how you can use Google Spreadsheets to grab &ndash; and explore &ndash; a mortality dataset highlighted by Ben Goldacre in <a href="http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis">DIY statistical analysis: experience the thrill of touching real data</a>, I thought I&rsquo;d describe a quick way of analysing the data using R, a very powerful statistical programming environment that should probably be part of your toolkit if you ever want to get round to doing some serious stats, and have a go at reproducing the analysis using a bit of judicious websearching and some cut-and-paste action&hellip;</p>
<p>
	R is an open-source, cross-platform environment that allows you to do programming-like things with stats, as well as producing a wide range of graphical statistics (stats visualisations) as if by magic. (Which is to say, it can be terrifying to try to get your head round&hellip; but once you&rsquo;ve grasped a few key concepts, it becomes a really powerful tool&hellip; At least, that&rsquo;s what I&rsquo;m hoping as I struggle to learn how to use it myself!)</p>
<p>
	I&rsquo;ve been using <a href="http://rstudio.org/">RStudio</a> to work with R because a) it&rsquo;s free and works cross-platform, b) it can be run as a service and accessed via the web (though I haven&rsquo;t tried that yet; the hosted option still hasn&rsquo;t appeared, either&hellip;), and c) it offers a structured environment for managing R projects.</p>
<p>
	So, to get started. Paul describes a dataset posted as an HTML table by Ben Goldacre that is used to generate the dots on this graph:</p>
<p style="text-align: left; ">
	<img alt="" src="http://farm8.staticflickr.com/7147/6679185861_4de0c66617.jpg" style="width: 600px; height: 360px; " /></p>
<p>
	The lines come from a probabilistic model that helps us see the likely spread of death rates given a particular population size.</p>
<p>
	If we want to do stats on the data, then we could, as Paul suggests, pull the data into a spreadsheet and then work from there&hellip; Or, we could pull it directly into R, at which point all manner of voodoo stats capabilities become available to us.</p>
<p>
	As with the <em>=importHTML</em> formula in Google spreadsheets, R has a way of scraping data from an HTML table anywhere on the public web:</p>
<p style="margin-left: 40px; ">
	#First, we need to load in the XML library that contains the scraper function<br />
	library(XML)<br />
	#Scrape the table<br />
	cancerdata=data.frame( readHTMLTable( &#39;http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis&#39;,which=1, header=c(&#39;Area&#39;,&#39;Rate&#39;,&#39;Population&#39;,&#39;Number&#39;)))</p>
<p>
	The format is simple: <em>readHTMLTable(url,which=TABLENUMBER) </em>(TABLENUMBER is used to extract the N&rsquo;th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).</p>
<p>
	We can inspect the data we&rsquo;ve imported as follows:</p>
<p style="margin-left: 40px; ">
	#Look at the whole table<br />
	cancerdata<br />
	#Look at the column headers<br />
	names(cancerdata)<br />
	#Look at the first 10 rows<br />
	head(cancerdata, 10)<br />
	#Look at the last 10 rows<br />
	tail(cancerdata, 10)<br />
	#What sort of datatype is in the Number column?<br />
	class(cancerdata$Number)</p>
<p>
	The last line &ndash; <em>class(cancerdata$Number)</em> &ndash; identifies the data as type &lsquo;factor&rsquo;. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers&hellip; (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it&rsquo;s identified as a category.)</p>
<p style="margin-left: 40px; ">
	#Convert the numerical columns to a numeric datatype<br />
	cancerdata$Rate=as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])<br />
	cancerdata$Population=as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])<br />
	cancerdata$Number=as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])<br />
	#Just check it worked&hellip;<br />
	class(cancerdata$Number)<br />
	head(cancerdata)</p>
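<p>
	To see why the conversion goes via <em>levels()</em>, here is a minimal sketch using a made-up factor rather than the scraped table (the values and variable names are purely illustrative):</p>

```r
# A made-up column of numbers that was read in as text and stored as a factor
rate = factor(c('17.5', '9.2', '21.0', '9.2'))

# Calling as.numeric() directly returns the internal category codes,
# not the numbers we are after
codes = as.numeric(rate)

# Look each entry up in the factor's levels, then convert that text to numbers
values = as.numeric(levels(rate)[as.integer(rate)])

# An equivalent route that goes via character strings
values2 = as.numeric(as.character(rate))
```

<p>
	Here <em>codes</em> holds category positions rather than rates, which is the trap the lookup avoids. If your versions of R and the XML package hand you character strings rather than factors, the <em>as.numeric(as.character(...))</em> route should cope with either case.</p>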
<p>
	We can now plot the data:</p>
<p style="margin-left: 40px; ">
	#Plot the Number of deaths by the Population<br />
	plot(Number ~ Population,data=cancerdata)</p>
<p>
	If we want to, we can add a title:</p>
<p style="margin-left: 40px; ">
	#Add a title to the plot<br />
	plot(Number ~ Population,data=cancerdata, main=&#39;Bowel Cancer Occurrence by Population&#39;)</p>
<p>
	We can also tweak the axis labels:</p>
<p style="margin-left: 40px; ">
	plot(Number ~ Population,data=cancerdata, main=&#39;Bowel Cancer Occurrence by Population&#39;,ylab=&#39;Number of deaths&#39;)</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7008/6679192939_fc123dfe11_z.jpg" style="width: 600px; height: 419px; " /></p>
<p>
	The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the <em>ggplot2</em> library is the way to go. (<em>ggplot2</em> isn&#39;t part of the standard R bundle, so you&#39;ll need to install the package yourself if you haven&#39;t already installed it. In RStudio, find the <em>Packages</em> tab, click <em>Install Packages</em>, search for <em>ggplot2</em> and then install it, along with its dependencies...):</p>
<p style="margin-left: 40px; ">
	require(ggplot2)<br />
	ggplot(cancerdata)+geom_point(aes(x=Population,y=Number))+ggtitle(&#39;Bowel Cancer Data&#39;)+ylab(&#39;Number of Deaths&#39;)</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7014/6679190379_3963a32e64_z.jpg" style="width: 600px; height: 419px; " /></p>
<p>
	Doing a bit of searching for the &quot;funnel plot&quot; chart type used to display the data in Goldacre&#39;s article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics-related Q&amp;A: <a href="http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r">How to draw funnel plot using ggplot2 in R?</a></p>
<p>
	The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code... This is a dangerous thing to do, and I can&#39;t guarantee that the analysis is the same type of analysis as the one Goldacre refers to... but what I&#39;m trying to do is show (quickly) that R provides a very powerful stats analysis environment and could probably do the sort of analysis you want in the hands of someone who knows how to drive it, and also knows what stats methods can be appropriately applied for any given data set...</p>
<p>
	Anyway - here&#39;s something resembling the Goldacre plot, using the cribbed code which has confidence limits at the 95% and 99.9% levels. Note that I needed to do a couple of things:</p>
<p>
	1) work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.</p>
<p style="margin-left: 40px; ">
	#TH: funnel plot code from:<br />
	#TH: <a href="http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210">http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210</a><br />
	#TH: Use our cancerdata<br />
	number=cancerdata$Population<br />
	#TH: The rate is given as a &#39;per 100,000&#39; value, so normalise it<br />
	p=cancerdata$Rate/100000</p>
<p style="margin-left: 40px; ">
	p.se &lt;- sqrt((p*(1-p)) / (number))<br />
	df &lt;- data.frame(p, number, p.se)</p>
<p style="margin-left: 40px; ">
	## common effect (fixed effect model)<br />
	p.fem &lt;- weighted.mean(p, 1/p.se^2)</p>
<p style="margin-left: 40px; ">
	## lower and upper limits for 95% and 99.9% CI, based on FEM estimator<br />
	#TH: I&#39;m going to alter the spacing of the samples used to generate the curves<br />
	number.seq &lt;- seq(1000, max(number), 1000)<br />
	number.ll95 &lt;- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))<br />
	number.ul95 &lt;- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))<br />
	number.ll999 &lt;- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))<br />
	number.ul999 &lt;- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))<br />
	dfCI &lt;- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)</p>
<p style="margin-left: 40px; ">
	## draw plot<br />
	#TH: note that we need to tweak the limits of the y-axis<br />
	#The dashed line type is a fixed setting, so it goes outside aes()<br />
	fp &lt;- ggplot(aes(x = number, y = p), data = df) +<br />
	geom_point(shape = 1) +<br />
	geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +<br />
	geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +<br />
	geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +<br />
	geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +<br />
	geom_hline(aes(yintercept = p.fem), data = dfCI) +<br />
	scale_y_continuous(limits = c(0,0.0004)) +<br />
	xlab(&quot;number&quot;) + ylab(&quot;p&quot;) + theme_bw()</p>
<p style="margin-left: 40px; ">
	fp</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7143/6679191561_2459d08a75_z.jpg" style="width: 600px; height: 419px; " /></p>
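<p>
	For what it&#39;s worth, the multipliers 1.96 and 3.29 in the cribbed code appear to be the standard normal quantiles for two-sided 95% and 99.9% limits, and each curve is the pooled rate plus or minus that multiple of the binomial standard error at a given population size. Here is a minimal sketch of that arithmetic; the function name, the pooled rate of 0.0002 and the population sizes are all hypothetical, and clipping the lower limit at zero is an addition that isn&#39;t in the cribbed code:</p>

```r
# Standard normal quantiles for two-sided 95% and 99.9% limits
z95 = qnorm(1 - 0.05 / 2)
z999 = qnorm(1 - 0.001 / 2)

# Control limits around a pooled proportion p0 for population sizes n.
# The standard error of a proportion shrinks as n grows, which is what
# gives the funnel its shape.
funnel_limits = function(p0, n, z) {
  se = sqrt(p0 * (1 - p0) / n)
  data.frame(n = n,
             lower = pmax(p0 - z * se, 0),  # a rate cannot go below zero
             upper = p0 + z * se)
}

# Hypothetical pooled rate of 20 per 100,000 at three population sizes
lims = funnel_limits(p0 = 0.0002, n = c(50000, 200000, 800000), z = z95)
```

<p>
	The gap between <em>lower</em> and <em>upper</em> narrows as the population grows, which is why small areas can sit a long way from the average without being unusual.</p>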
<p>
	As I said above, it can be quite dangerous just pinching other folks&#39; stats code if you aren&#39;t a statistician and don&#39;t really know whether you have actually replicated someone else&#39;s analysis or done something completely different... (this is a situation I often find myself in!); which is why I think we need to encourage folk who release statistical reports to not only release their data, but also show their working, including the code they used to generate any summary tables or charts that appear in those reports.</p>
<p>
	In addition, it&#39;s worth noting that cribbing other folk&#39;s code and analyses and applying it to your own data may lead to a nonsense result because some stats analyses only work if the data has the right sort of distribution...So be aware of that, always post your own working somewhere, and if someone then points out that it&#39;s nonsense, you&#39;ll hopefully be able to learn from it...</p>
<p>
	Given those caveats, what I hope to have done is raise awareness of what R can be used to do (including pulling data into a stats computing environment via an HTML table screenscrape) and also produced some sort of recipe we could take to a statistician to say: is this the sort of thing Ben Goldacre was talking about? And if not, why not?</p>
<p>
	[If I&#39;ve made any huge - or even minor - blunders in the above, please let me know... There&#39;s always a risk in cutting and pasting things that look like they produce the sort of thing you&#39;re interested in, but may actually be doing something completely different!]</p>
]]></description> 
      <dc:date>2012-01-11T15:26:05+00:00</dc:date>
    </item>

    <item>
      <title>The top 10 data&#45;mining links of 2011</title>
      <link>http://datadrivenjournalism.net/resources/the_top_10_data_mining_links_of_2011</link>
      <guid>http://datadrivenjournalism.net/resources/the_top_10_data_mining_links_of_2011#When:10:34:48Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by <a href="http://www.pbs.org/idealab/author-bios.html#jonathan_stray">Jonathan Stray</a> on <a href="http://www.pbs.org/idealab/">MediaShift Idea Lab</a>, 10 January 2012. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	<a href="http://overview.ap.org/">Overview</a> is a project to create an open-source document-mining system for investigative journalists and other curious people. We&#39;ve written before about the <a href="http://www.pbs.org/idealab/2011/10/3-difficult-document-mining-problems-that-overview-wants-to-solve297.html">goals</a> of the project, and we&#39;re developing some new technology, but mostly we&#39;re stealing it from other fields.</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7162/6678113937_3392e2d898.jpg" style="float: right; width: 400px; height: 111px; " />The following are some of the best ideas we saw in 2011, the data-mining work that we found most inspirational. Many of these links are educational resources for learning about specific technology. Some of this work illuminates how algorithms and humans treat information differently. Others are just amazing, mind-bending work.</p>
<p>
	1.&nbsp;<strong>What do your connections say about you?</strong>&nbsp;A lot. It is possible to accurately <a href="http://www.criticalinsight.net/publications/conover_prediction_socialcom.pdf">predict your political orientation</a> solely on the basis of your network on Twitter. You can also work out <a href="http://www.fastcompany.com/1769217/there-are-no-secrets-from-twitter">gender and other things</a> from public information.</p>
<p>
	2. <strong>Free textbooks from Stanford University. </strong>&quot;<a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html">Introduction to Information Retrieval</a>&quot; teaches you how a search engine works, in great detail. &quot;<a href="http://infolab.stanford.edu/~ullman/mmds.html">Mining Massive Data Sets</a>&quot; covers a variety of big-data principles that apply to different types of information.</p>
<p>
	3. <strong>We&#39;re not above having a list of lists. </strong>Here&#39;s the Data Mining Blog&#39;s <a href="http://www.dataminingblog.com/top-five-articles-in-data-mining/">top 5 articles</a>. Most of these are foundational, covering basic philosophy and technique such as choosing variables, finding clusters, and deciding what you&#39;re looking for.</p>
<p>
	4. The<strong> MINE technique</strong> looks for patterns between hundreds or thousands of variables -- say, patterns of gene expression inside a single cell. It&#39;s very general, and finds not only individual relationships but networks of cause and effect. Here&#39;s a nifty <a href="http://www.broadinstitute.org/news-and-publications/mine-detecting-novel-associations-large-data-sets">video</a>, here&#39;s the original <a href="http://jonathanstray.com/papers/MINE.pdf">paper</a>, and here&#39;s one statistician&#39;s <a href="http://andrewgelman.com/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/">review</a>.</p>
<p>
	5. This is one of those papers that really changed the way I look at things. How do we know when a data visualization shows us something that is &quot;actually there,&quot; as opposed to an artifact of the numbers? &quot;<a href="http://jonathanstray.com/papers/wickham.pdf">Graphical Inference for Infovis</a>&quot; provides one excellent answer, based on a clever analogy with numerical statistics.</p>
<p>
	6. Lots of text-mining work uses &quot;clustering&quot; or &quot;classification&quot; techniques to sort documents into topics. But doesn&#39;t a categorization algorithm impose its own preconceptions? This is a deep issue, which you might think of as &quot;<a href="http://bit.ly/thYSU">framing</a>&quot; in code. To explore this question Justin Grimmer and Gary King went meta with a <a href="http://gking.harvard.edu/sites/scholar.iq.harvard.edu/files/gking/files/201018067_online_1.pdf">system</a> <strong>that visualizes all possible categorizations of a document set</strong>, and how they relate.</p>
<p>
	7. A few years ago Google <a href="http://www.google.org/flutrends/">showed</a> that the number of searches for &quot;flu&quot; was a great predictor of the actual number of outbreaks in a given location -- faster and more specific than the Centers for Disease Control&#39;s own surveillance data. The team has now expanded the technique into <a href="http://www.google.com/trends/correlate/whitepaper.pdf">Google Correlate</a>, which instantly scans through petabytes of data to find search terms which follow any user-supplied time series. Here&#39;s New Scientist taking it for a <a href="http://www.newscientist.com/blogs/onepercent/2011/05/google-correlate-passes-our-we.html">test drive</a>.</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7152/6678115261_8c6b1babd2.jpg" style="float: right; width: 300px; height: 189px; " />8. Not content with free professional textbooks, Stanford has created <strong>two free online courses</strong> for <a href="http://ml-class.org/">machine learning</a> and <a href="http://nlp-class.org/">natural language processing</a>. Both are live-streamed lecture series taught by experts, with homework. Learning these intricate technologies has never been easier.</p>
<p>
	9. Lots of people have speculated about the <strong>role of social media in protest movements</strong>. A team of researchers looked at the data, analyzing a huge set of tweets from the &quot;May 20&quot; protests in Spain last year. How do protests spread from social media? Now we have at least one <a href="http://www.nature.com/srep/2011/111215/srep00197/full/srep00197.html">solid answer</a>.</p>
<p>
	10. And the craziest data-mining link we ran across in 2011: <strong>IBM&#39;s DeepQA project</strong>, which beat human Jeopardy champions. This project looks into an unstructured database to correctly answer about 80% of all general questions posed to it, in just a few seconds. Here&#39;s a <a href="http://blog.ted.com/2011/02/18/experts-and-ibm-insiders-break-down-watsons-jeopardy-win/">TED talk</a>, and here&#39;s the technical <a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf">paper</a> that explains how it works. I can&#39;t tell you how badly I want one of these in the newsroom. If enough journalist hackers build on each other&#39;s work, maybe one day ...</p>
<p>
	Happy data mining! We&#39;ll be releasing our own prototype document-mining system, and the source, at the <a href="http://www.ire.org/conferences/nicar-2012/">NICAR</a> conference next month. If these are the sorts of algorithms you like to play with, we&#39;re also <a href="http://overview.ap.org/">hiring</a> programmers who want to bring these sorts of advanced techniques within everyone&#39;s reach.</p>
<p>
	&nbsp;</p>
]]></description> 
      <dc:date>2012-01-11T10:34:48+00:00</dc:date>
    </item>

    <item>
      <title>A computational journalism reading list</title>
      <link>http://datadrivenjournalism.net/resources/a_computational_journalism_reading_list</link>
      <guid>http://datadrivenjournalism.net/resources/a_computational_journalism_reading_list#When:11:30:27Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by</em><em>&nbsp;<a href="http://jonathanstray.com/me">Jonathan Stray</a> on <a href="http://jonathanstray.com">jonathanstray.com</a>, 31 January 2011.&nbsp;</em><em>This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	There is something extraordinarily rich in the intersection of computer science and journalism. It feels like there&rsquo;s a nascent field in the making, tied to the rise of the internet. The last few years have seen calls for a new class of&nbsp; &ldquo;<a href="http://www.niemanlab.org/2011/01/dave-winer-how-can-universities-educate-journo-programmers/">programmer journalist</a>&rdquo; and the birth of a community of <a href="http://hackshackers.com/">hacks and hackers</a>. Meanwhile, several schools are now <a href="http://www.wired.com/epicenter/2010/04/will-columbia-trained-code-savvy-journalists-bridge-the-mediatech-divide/">offering joint degrees</a>. But we&rsquo;ll need more than competent programmers in newsrooms. What are the key problems of computational journalism? What other fields can we draw upon for ideas and theory? For that matter, what is it?</p>
<p>
	I&rsquo;d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of &ldquo;reporting&rdquo; &mdash; because information not published is information not known &mdash; but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.</p>
<p>
	&ldquo;Computational journalism&rdquo; has no textbooks yet. In fact the term is barely recognized. The phrase seems to have emerged at Georgia Tech in 2006 or <a href="http://www.cc.gatech.edu/classes/AY2007/cs4803cj_spring/">2007</a>. Nonetheless I feel like there are already important topics and key references.</p>
<p>
	&nbsp;</p>
<h3>
	Data journalism</h3>
<p>
	Data journalism is obtaining, reporting on, curating and publishing data in the public interest. The practice is often more about spreadsheets than algorithms, so I&rsquo;ll suggest that not all data journalism is &ldquo;computational,&rdquo; in the same way that a novel written on a word processor isn&rsquo;t &ldquo;computational.&rdquo; But data journalism is interesting and important and dovetails with computational journalism in many ways.</p>
<ul>
	<li>
		The Nieman Journalism Lab&rsquo;s <a href="http://www.niemanlab.org/2010/08/how-the-guardian-is-pioneering-data-journalism-with-free-tools/">interview with Guardian Data Blog editor Simon Rogers</a> remains a solid introduction to (one kind of) contemporary practice.</li>
	<li>
		The best practical guides I know are Rogers&rsquo; &ldquo;<a href="http://www.journalism.co.uk/skills/how-to-get-to-grips-with-data-journalism/s7/a542402/">How to: get to grips with data journalism</a>&rdquo; and Dan Nguyen&rsquo;s <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">series of data-scraping tutorials at ProPublica</a>.</li>
	<li>
		Stanford&rsquo;s <a href="http://datajournalism.stanford.edu/">Journalism in the Age of Data</a> is an hour-long documentary on data journalism and visualization.</li>
	<li>
		The web is a linked system of human-readable documents. Now Tim Berners-Lee wants to create a web of machine-readable <a href="http://blog.ted.com/2009/03/13/tim_berners_lee_web/">linked data</a>. The full potential is unclear, but it&rsquo;s a big idea that may come to be the backbone of <a href="http://en.wikipedia.org/wiki/Semantic_Web">semantic web</a> visions. The <a href="http://data.nytimes.com/">New York Times</a>, <a href="http://www.guardian.co.uk/open-platform">The Guardian</a>, and others are experimenting with open data APIs.</li>
	<li>
		Everyblock creator Adrian Holovaty seems to have been the first to suggest that reporters file structured data in his 2006 &ldquo;<a href="http://www.holovaty.com/writing/fundamental-change/">A Fundamental Way Newspaper Websites Need to Change</a>.&rdquo; This idea is beautifully expanded in Stijn Debrouwere&rsquo;s &ldquo;<a href="http://stdout.be/2010/information-architecture-for-news-websites/">Information Architecture for News Websites</a>&rdquo; series.</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Visualization</h3>
<p>
	Big data requires powerful exploration and storytelling tools, and increasingly that means visualization. But there&rsquo;s good visualization and bad visualization, and the field has advanced tremendously since Tufte wrote <a href="http://www.edwardtufte.com/tufte/books_vdqi">The Visual Display of Quantitative Information</a>. There is lots of good science that is too little known, and many open problems here.</p>
<ul>
	<li>
		Tamara Munzner&rsquo;s <a href="http://www.cs.ubc.ca/labs/imager/tr/2009/VisChapter/">chapter on visualization</a> is the essential primer. She puts visualization on rigorous perceptual footing, and discusses all the major categories of practice. Absolutely required reading for anyone who works with pictures of data.</li>
	<li>
		Ben Fry invented the Processing language and wrote his <a href="http://benfry.com/phd/">PhD thesis on &ldquo;computational information design</a>,&rdquo; which is his powerful conception of the iterative, interactive practice of designing useful visualizations.</li>
	<li>
		How do we make visualization statistically rigorous? How do we know we&rsquo;re not just fooling ourselves when we see patterns in the pixels? This <a href="http://jonathanstray.com/papers/wickham.pdf">amazing paper by Wickham</a> et al. has some answers.</li>
	<li>
		Is a visualization a story? Segel and Heer explore this question in &ldquo;<a href="http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf">Narrative Visualization: Telling Stories with Data</a>.&rdquo;</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Computational linguistics</h3>
<p>
	Data is more than numbers. Given that the web is designed to be read by humans, it makes heavy use of human language. And then there are all the world&rsquo;s books, and the archival recordings of millions of speeches and interviews. Computers are slowly getting better at dealing with language.</p>
<ul>
	<li>
		Word frequency techniques like <a href="http://en.wikipedia.org/wiki/Tfidf">tf-idf</a> and the <a href="http://en.wikipedia.org/wiki/Vector_space_model">vector space document model</a> are very simple and very useful. See also <a href="http://en.wikipedia.org/wiki/Stemming">stemming</a>. Lots more in the wonderful (and free!) <a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html">Introduction to Information Retrieval</a>. This book explains how search engines are built, and&nbsp; discusses tf-idf etc. in great technical detail.</li>
	<li>
		Statistical language models are increasingly important for all kinds of applications. Michael Nielsen has a great <a href="http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/">introduction to statistical machine translation</a>. Google&rsquo;s Peter Norvig discusses how he implemented <a href="http://norvig.com/spell-correct.html">statistical spelling correction</a> on his laptop during a long plane flight. For the full deal, see the book <a href="http://books.google.com/books?id=YiFDxbEX3SUC&amp;lpg=PP1&amp;dq=Foundations%20of%20statistical%20language%20processing%22&amp;pg=PP1#v=onepage&amp;q&amp;f=false">Foundations of Statistical Natural Language Processing</a>.</li>
	<li>
		On a related note, <a href="http://ngrams.googlelabs.com/">Google N-gram viewer</a> lets you look at the frequency of short phrases within 4% of all books published, ever. The <a href="http://mfi.uchicago.edu/publications/papers/Science_Culturomics.pdf">excellent paper</a> gives examples of how to use this for cultural research. Dan Cohen has <a href="http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/">important criticisms</a>.</li>
	<li>
		Speech-to-text algorithms enable automated transcription, and Matt Thompson explores the <a href="http://www.niemanlab.org/2010/12/coming-soon-to-journalism-matt-thompson-sees-the-speakularity-and-universal-instant-transcription/">huge implications for journalism</a>.</li>
	<li>
		Reuters maintains the <a href="http://www.opencalais.com/">OpenCalais</a> entity extraction service, which parses text to contextually determine who and what is referenced.</li>
	<li>
		IBM&rsquo;s Watson project built a question-answering system that reads reference books and wins at Jeopardy. Imagine how useful to journalists and curious readers this could be! This <a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf">paper on the DeepQA system</a> describes how they did it.</li>
</ul>
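<p>
	To give a feel for how simple the word frequency techniques mentioned above are, here is a minimal tf-idf sketch in base R. The three documents are made up, the variable names are hypothetical, and this uses one common weighting (raw counts times the log of inverse document frequency) out of several variants.</p>

```r
# Three tiny made-up documents
docs = c(a = 'the council cut the budget',
         b = 'the council met',
         c = 'budget protest grows')

# Split each document into lower-case words
tokens = strsplit(tolower(docs), ' ', fixed = TRUE)
vocab = sort(unique(unlist(tokens)))

# Term frequency: how often each vocabulary word occurs in each document
# (rows are documents, columns are words)
tf = t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# Inverse document frequency: log(N / number of documents containing the word)
idf = log(length(docs) / colSums(tf > 0))

# Weight each word's column of counts by that word's idf
tfidf = sweep(tf, 2, idf, '*')
```

<p>
	Each row of <em>tfidf</em> is that document&rsquo;s vector in the vector space model. Words that appear in most documents, such as &ldquo;the,&rdquo; are weighted down, while words specific to one document are weighted up.</p>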
<h3>
	&nbsp;</h3>
<h3>
	Communications technology and free speech</h3>
<p>
	<a href="http://harvardmagazine.com/2000/01/code-is-law.html">Code is law</a>. Because our communications systems use software, the underlying mathematics of communication lead to staggering political consequences &mdash; including whether or not it is possible for governments to verify online identity or remove things from the internet. The key topics here are networks, cryptography, and information theory.</p>
<ul>
	<li>
		The <a href="http://www.cacr.math.uwaterloo.ca/hac/index.html">Handbook of Applied Cryptography</a> is a classic, and free online. But despite the title it doesn&rsquo;t really explain how crypto is used in the real world, <a href="http://en.wikipedia.org/wiki/Cryptography">like Wikipedia does</a>.</li>
	<li>
		It&rsquo;s important to know how the internet routes information, using <a href="http://en.wikipedia.org/wiki/Transmission_Control_Protocol">TCP/IP</a> and <a href="http://en.wikipedia.org/wiki/Border_Gateway_Protocol">BGP</a>, or at a somewhat higher level, things like the <a href="http://www.ittc.ku.edu/~niehaus/classes/750-s06/documents/BT-description.pdf">BitTorrent protocol</a>. The technical details determine how hard it is to do things like block websites, suppress the dissemination of a file, or <a href="http://blog.torproject.org/blog/recent-events-egypt">remove entire countries from the internet</a>.</li>
	<li>
		Anonymity is deeply important to online free speech, and very hard. The <a href="http://www.torproject.org/">Tor project</a> is the outstanding leader in anonymity-related research.</li>
	<li>
		Information theory is stunningly useful across almost every technical discipline. Pierce&rsquo;s <a href="http://www.amazon.com/Introduction-Information-Theory-Symbols-Signals/dp/0486240614/ref=pd_rhf_p_t_1">short textbook</a> is the classic introduction, while Tom Schneider&rsquo;s <a href="http://www-lmmb.ncifcrf.gov/~toms/paper/primer/">Information Theory Primer</a> seems to be the best free online reference.</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Tracking the spread of information (and misinformation)</h3>
<p>
	What do we know about how information spreads through society? Very little. But one nice side effect of our increasingly digital public sphere is the ability to track such things, at least in principle.</p>
<ul>
	<li>
		<a href="http://memetracker.org/">Memetracker</a> was (AFAIK) the first credible demonstration of whole-web information tracking, following quoted soundbites through blogs and mainstream news sites and everything in between. Zach Seward has cogent <a href="http://www.niemanlab.org/2009/07/in-the-news-cycle-memes-spread-more-like-a-heartbeat-than-a-virus/">reflections on their findings</a>.</li>
	<li>
		The <a href="http://truthy.indiana.edu/">Truthy Project</a> aims for automated detection of astro-turfing on Twitter. They specialize in covert political messaging, or as I like to call it, computational propaganda.</li>
	<li>
		We badly need tools to help us determine the source of any given online &ldquo;fact.&rdquo; There are many existing techniques that could be applied to the problem, as I discussed in a <a href="http://jonathanstray.com/escaping-the-news-hall-of-mirrors">previous post</a>.</li>
	<li>
		If we had information provenance tools that worked across a spectrum of media outlets and feed types (web, social media, etc.) it would be much cheaper to do the sort of <a href="http://www.journalism.org/analysis_report/how_news_happens">information ecosystem studies</a> that Pew and others occasionally undertake. This would lead to a much better understanding of <a href="http://www.niemanlab.org/2010/02/the-googlechina-hacking-case-how-many-news-outlets-do-the-original-reporting-on-a-big-story/">who does original reporting</a>.</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Filtering and recommendation</h3>
<p>
	With <a href="http://techcrunch.com/2010/08/04/schmidt-data/">vastly more information than ever before</a> available to us, attention becomes the scarcest resource. Algorithms are an essential tool in filtering the flood of information that reaches each person. (Social media networks also <a href="http://jonathanstray.com/whats-the-point-of-social-news">act as filters</a>.)</p>
<ul>
	<li>
		The paper on <a href="http://crpit.com/confpapers/CRPITV70Truyen.pdf">preference networks</a> by Truyen et al. is probably as good an introduction as anything to the state of the art in recommendation engines, those algorithms that tell you what articles you might like to read or what <a href="http://en.wikipedia.org/wiki/Netflix_Prize">movies you might like to watch</a>.</li>
	<li>
		Before Google News there was Columbia News Blaster, which incorporated a number of interesting algorithms such as multi-lingual article clustering, automatic summarization, and more as described in <a href="http://www.cs.columbia.edu/~sable/research/hlt-blaster.pdf">this paper</a> by McKeown et al.</li>
	<li>
		Anyone playing with clustering algorithms needs to have a deep appreciation of the <a href="http://en.wikipedia.org/wiki/Ugly_duckling_theorem">ugly duckling theorem</a>, which says that there is no categorization without preconceptions. King and Grimmer explore this with their technique for <a href="http://gking.harvard.edu/files/abs/discov-abs.shtml">visualizing the space of clusterings</a>.</li>
	<li>
		Any digital journalism product which involves the audience to any degree &mdash; that should be all digital journalism products &mdash; is a piece of social software, well defined by Clay Shirky in his classic essay, &ldquo;<a href="http://www.shirky.com/writings/group_enemy.html">A Group Is Its Own Worst Enemy</a>.&rdquo; It&rsquo;s also a &ldquo;<a href="http://cdixon.org/2010/01/17/collective-knowledge-systems/">collective knowledge system</a>&rdquo; as articulated by Chris Dixon.</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Measuring public knowledge</h3>
<p>
	If journalism is about &ldquo;informing the public&rdquo; then we must consider what happens to stories after publication &mdash; this is the <a href="http://jonathanstray.com/does-journalism-work">&ldquo;last mile&rdquo; problem in journalism</a>. There is almost none of this happening in professional journalism today, aside from basic traffic analytics. The key question here is, how does journalism change ideas and action? Can we apply computers to help answer this question empirically?</p>
<ul>
	<li>
		World Public Opinion&rsquo;s recent <a href="http://www.worldpublicopinion.org/pipa/articles/brunitedstatescanadara/671.php?nid=&amp;id=&amp;pnt=671&amp;lb=">survey of misinformation among American voters</a> solves this problem in the classic way, by doing a randomly sampled opinion poll. I discuss their bleak results <a href="http://jonathanstray.com/american-journalism-failed-to-inform-voters">here</a>.</li>
	<li>
		Blogosphere maps and other kinds of visualizations can help us understand the public information ecosystem, such as this <a href="http://cyber.law.harvard.edu/publications/2008/Mapping_Irans_Online_Public/interactive_blogosphere_map">interactive visualization of Iranian blogs</a>. I have previously suggested using such maps as a navigation tool that might <a href="http://jonathanstray.com/mapping-the-daily-me">broaden our information horizons</a>.</li>
	<li>
		<a href="http://www.unglobalpulse.org/">UN Global Pulse</a> is a serious attempt to create a real-time global monitoring system to detect humanitarian threats in crisis situations. They plan to do this by mining the &ldquo;data exhaust&rdquo; of entire societies &mdash; social media postings, online records, news reports, and whatever else they can get their hands on. Sounds like <a href="http://www.unglobalpulse.org/blog/real-time-information-everyone-journalists-perspective-un-global-pulse">key technology for journalism</a>.</li>
	<li>
		<a href="http://sm.rutgers.edu/vox/event/">Vox Civitas</a> is an ambitious social media mining tool designed for journalists. Computational linguistics, visualization, and more.</li>
</ul>
<h3>
	&nbsp;</h3>
<h3>
	Research agenda</h3>
<p>
	I know of only one work which proposes a research agenda for computational journalism.</p>
<ul>
	<li>
		&ldquo;<a href="http://www.eecs.umich.edu/~congy/work/cidr11.pdf">Computational Journalism: A Call to Arms for Database Researchers</a>&rdquo; by Sarah Cohen et al. raises the very intriguing possibility of building systems that automatically or semi-automatically scan databases for stories, document the rationale for believing certain facts, etc.</li>
</ul>
<p>
	This paper presents a broad vision and is really a must-read. However, it deals almost exclusively with reporting, that is, finding new knowledge and making it public. I&rsquo;d like to suggest that the following unsolved problems are also important:</p>
<ul>
	<li>
		Tracing the source of any particular &ldquo;fact&rdquo; found online, and generally tracking the spread and mutation of information.</li>
	<li>
		Cheap metrics for the state of the public information ecosystem. How accurate is the web? How accurate is a particular source?</li>
	<li>
		Techniques for mapping public knowledge. What is it that people actually know and believe? How polarized is a population? What is under-reported? What is well reported but poorly appreciated?</li>
	<li>
		Information routing and timing: how can we route each story to the set of people who might be most concerned about it, or best in a position to act, at the moment when it will be most relevant to them?</li>
</ul>
<p>
	This sort of attention to the health of the public information ecosystem as a whole, beyond just the traditional surfacing of new stories, seems essential to the project of <a href="http://jonathanstray.com/does-journalism-work">making journalism work</a>.</p>
<p>
	&nbsp;</p>
<p>
	<em>Teaser image: based on the visual representation of the <a href="http://cyber.law.harvard.edu/publications/2008/Mapping_Irans_Online_Public/interactive_blogosphere_map">Iranian blogosphere</a>.</em></p>
<p>
	&nbsp;</p>
<p>
	&nbsp;</p>
]]></description> 
      <dc:date>2011-12-15T11:30:27+00:00</dc:date>
    </item>

    <item>
      <title>The Bastards Book of Ruby</title>
      <link>http://datadrivenjournalism.net/resources/the_bastards_book_of_ruby</link>
      <guid>http://datadrivenjournalism.net/resources/the_bastards_book_of_ruby#When:07:48:14Z</guid>
      <description><![CDATA[<p>
	The <em>Bastards Book of Ruby</em> is an introduction to programming for non-programmers. The online book focuses on the use of programming for the gathering, organising, and analysing of data in all its forms.</p>
<p>
	<em>&ldquo;The world of data has exploded in the past few years without a corresponding increase in the people or tools to efficiently make sense of it. And so I&rsquo;ve had a hankering to create a more cohesive, useful programming guide aimed at not just journalists, but for anyone in any field,&rdquo;</em> explains&nbsp;<a href="http://danwin.com/">Dan Nguyen</a>, the author of the book, and journalist at ProPublica.</p>
<p>
	The book is a work in progress. At this moment it contains four sections and about 30 chapters, in various stages of completion. For an overview of the chapters, check out the full <a href="http://ruby.bastardsbook.com/toc/">table of contents</a>.&nbsp;The four sections are:</p>
<ul>
	<li>
		<a href="http://ruby.bastardsbook.com/toc/#fundamentals">The Fundamentals</a>: Ruby installation guide and basics of computer science</li>
	<li>
		<a href="http://ruby.bastardsbook.com/toc/#supplementals">Supplementals</a>: useful concepts of programming and how to scrape the Web</li>
	<li>
		<a href="http://ruby.bastardsbook.com/toc/#theory">Design and Theory</a>: more useful computer science and software engineering concepts</li>
	<li>
		<a href="http://ruby.bastardsbook.com/toc/#projects">The Projects</a>: examples of data projects, the rationale behind them and step by step guides to producing them</li>
</ul>
<p style="text-align: center; ">
	<img alt="" src="http://farm8.staticflickr.com/7013/6504023995_d5f3e47b96.jpg" style="margin-left: 5px; margin-right: 5px; width: 400px; height: 264px; " /></p>
<p style="text-align: center; ">
	<em>A random sampling of the mugshots collected from the Putnam County Sheriff&#39;s jail history, the subject of one of the chapters.&nbsp;</em><em>Image credit:&nbsp;Dan Nguyen.</em></p>
<p>
	The book&nbsp;is freely available&nbsp;at: <a href="http://ruby.bastardsbook.com/">ruby.bastardsbook.com</a>.</p>
]]></description> 
      <dc:date>2011-12-13T07:48:14+00:00</dc:date>
    </item>

    <item>
      <title>Programmer&#45;journalist job openings</title>
      <link>http://datadrivenjournalism.net/resources/programmer_journalist_jobs</link>
      <guid>http://datadrivenjournalism.net/resources/programmer_journalist_jobs#When:14:02:20Z</guid>
      <description><![CDATA[<p>
	A spreadsheet listing over 50 programmer-journalist jobs has been circulating online for some time now.&nbsp;All the jobs require technical skills and range from newsroom developer to interactive designer, multimedia producer and social media editor.&nbsp;</p>
<p>
	All the openings are in the United States. Most of them are with news publishers, including the New York Times, the Boston Globe,&nbsp;the Huffington Post, the Guardian and the Washington Post.&nbsp;</p>
<p>
	<img alt="Screen_shot_2011-12-02_at_3.18.30_PM.png" src="http://datadrivenjournalism.net/uploads/Screen_shot_2011-12-02_at_3.18.30_PM.png" style="width: 700px; height: 395px; " /></p>
<p>
	&nbsp;</p>
<p>
	To see and edit the list of programmer-journalist jobs please click <a href="https://docs.google.com/a/ejc.net/spreadsheet/ccc?key=0AmqohgGX3YQadE1VSktrWG1nNFF6RUFNT1RKa0k0a2c&amp;authkey=CK7OlpsI&amp;hl=en_US#gid=4">here</a>. Do you know of data journalism jobs that are not in the list? Please add them to the list or leave a comment.</p>
<p>
	To see and edit a list of data journalism and graphics internships please click <a href="https://docs.google.com/a/ejc.net/spreadsheet/ccc?key=0AsJrqt3yp-JydEp6ZGl2STU2by1YNlB3b1RYNXN4TVE#gid=0">here</a>.&nbsp;Do you know of any internships that are not in the list? Please add them to the list or leave a comment.</p>
]]></description> 
      <dc:date>2011-12-02T14:02:20+00:00</dc:date>
    </item>

    <item>
      <title>Getting text out of an image&#45;only PDF</title>
      <link>http://datadrivenjournalism.net/resources/getting_text_out_of_an_image_only_pdf2</link>
      <guid>http://datadrivenjournalism.net/resources/getting_text_out_of_an_image_only_pdf2#When:07:54:03Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by </em><em><a href="http://www.propublica.org/site/author/dan_nguyen/">Dan Nguyen</a>&nbsp;</em><em>on&nbsp;</em><em><a href="http://www.propublica.org/">ProPublica</a>, 30 December, 2010</em><em>. This article is republished with permission.&nbsp;</em></p>
<p>
	&nbsp;</p>
<p>
	In the <a href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide">previous guide</a>, we describe several methods for turning <a href="http://sunlightlabs.com/blog/2009/adobe-bad-open-government/">PDFs</a> into data usable for spreadsheets. However, those only handle PDFs that have actual text embedded within them. When a PDF contains just <em>images </em>of text, as scanned documents do, then the problem isn&#39;t just how to convert it into neat tabular data, but how to extract <em>any text,</em> period.</p>
<p>
	In this tutorial, we&#39;ll explain how to write a program to extract the data into tabular format. Here&#39;s an overview of the basic steps:</p>
<p>
	&nbsp;</p>
<h4>
	1. Determine the positions of the lines that divide the rows and columns on a page.</h4>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7143/6440534995_9e8aebdc4f.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 286px; " /></p>
<h4>
	&nbsp;</h4>
<h4>
	2. Break the image apart along those lines to create (hundreds of) individual image files, one for each cell.</h4>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7011/6440547249_e91d8d4a93.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 292px; " /></p>
<h4>
	&nbsp;</h4>
<h4>
	3. Perform <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a> on each cell to translate the image into a textfile.</h4>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7022/6440548181_337bfbfab4.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 348px; " /></p>
<h4>
	&nbsp;</h4>
<h4>
	4. Reassemble these (hundreds of) text files in the same order that you divided the main image, creating a (text) spreadsheet of the data.</h4>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7147/6440549275_ffd68329f0.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 240px; " /></p>
<p>
	A caveat: The code examples provided here rely on features specific to the FlashPaper version of Eli Lilly&#39;s doctor payment disclosures, such as the black outlines of its table cells.</p>
<h4>
	&nbsp;</h4>
<h3>
	Background</h3>
<p>
	&nbsp;</p>
<p>
	Eli Lilly became the first major drug company to post its physician payments online in <a href="http://blogs.wsj.com/health/2009/07/31/eli-lillys-payments-to-doctors-revealed/">July 2009</a>. However, Lilly <a href="http://www.nytimes.com/2010/04/13/business/13docpay.html">was criticized</a> in an April 2010 New York Times article for using a proprietary and <a href="http://www.adobe.com/products/flashpaper/">discontinued</a> format -- Adobe&#39;s &quot;FlashPaper&quot; -- which made the data virtually impossible to download or copy. In fact,&nbsp; <a href="https://www.pharmashine.com/">PharmaShine</a>, a company that maintains a commercial database of physician payments, said it had to <a href="http://www.nytimes.com/2010/04/13/business/13docpay.html?adxnnl=1&amp;adxnnlx=1322814273-X1tinBK79LP7WwWK06G8BQ">manually retype the entire list</a>.</p>
<p>
	Eli Lilly disagrees with the Times&#39; story&#39;s characterization that it had &quot;purposely made its report impossible to download.&quot; In an e-mail to ProPublica, Lilly spokesman J. Scott MacGregor called the characterization &quot;misleading, as it was never our intention to make it difficult for people to access information&quot; and said that preserving integrity of the data was the reason for not making it downloadable initially.</p>
<p>
	In any event, Lilly now provides their <a href="http://www.lillyphysicianpaymentregistry.com/">data as a PDF</a>, with copyable text. However, when we began work on <a href="http://projects.propublica.org/docdollars/">Dollars for Docs</a>, Lilly was only providing data in the FlashPaper format. We came up with a way to download the file and programmatically extract tabular data from it.</p>
<h4>
	&nbsp;</h4>
<h3>
	Software to Get</h3>
<p>
	&nbsp;</p>
<ul>
	<li>
		<a href="http://www.ruby-lang.org/en/">Ruby</a></li>
	<li>
		<a href="http://rmagick.rubyforge.org/">RMagick</a> - a library that provides Ruby methods for image processing. Warning: <a href="http://rmagick.rubyforge.org/install-faq.html">installing it</a> can be laborious.</li>
	<li>
		<a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a> - An <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition engine</a> maintained by Google. It&#39;s used to do the computerized translation of images to text.</li>
	<li>
		<a href="http://www.mozilla.org/en-US/firefox/new/?from=getfirefox">Firefox</a> and the <a href="http://getfirebug.com/">Firebug</a> plugin - read my <a href="http://www.propublica.org/nerds/item/reading-flash-data">tutorial on Flash data-scraping</a> for a primer on how to use Firebug to discover the files sent to your web browser.</li>
	<li>
		<a href="http://www.adobe.com/products/acrobatpro.html">Adobe Acrobat Pro</a> (or just <a href="http://acrobatusers.com/tutorials/acrobat-distiller-9">Distiller</a>?) -- we used this to convert FlashPaper to a regular PDF</li>
	<li>
		<a href="https://github.com/thejefflarson/pdf-splitter">PDF-Splitter</a> - My colleague Jeff Larson&#39;s command-line utility to split a PDF into the TIFF image format (requires Mac OS X Snow Leopard). This can be done (though not as easily) with Adobe Acrobat Pro.</li>
</ul>
<h4>
	&nbsp;</h4>
<h3>
	Downloading the FlashPaper Document</h3>
<p>
	&nbsp;</p>
<p>
	Visit the Lilly <a href="http://www.lillyphysicianpaymentregistry.com/">faculty registry</a> and observe your browser&#39;s traffic through Firebug (see my <a href="http://www.propublica.org/nerds/item/reading-flash-data">previous tutorial</a>). You should see a file called <a href="http://www.lillyphysicianpaymentregistry.com/">data.swf</a> with a size of more than 1MB.</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7005/6440550405_b5315d739c.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 481px; height: 323px; " /></p>
<p>
	<em>The Safari browser&#39;s Activity window also will reveal the files downloaded to your browser; <strong>data.swf </strong>is what we want.</em></p>
<p>
	<br />
	Download the file and open it in your web browser. There&#39;s a printer icon in the top-right. Click on it and choose PDF as the output. You&#39;ll end up with a PDF weighing over 200MB. When you open it up, you&#39;ll see that it <em>appears </em>to be a normal table of text. But you won&#39;t be able to highlight-and-copy (which is also the case for secure PDFs), and saving as text will create an empty file.</p>
<p>
	So let&#39;s convert it to the <a href="http://en.wikipedia.org/wiki/Tagged_Image_File_Format">TIFF</a> image format, the only format Tesseract can read. You can do this either with Adobe Acrobat Pro&#39;s <strong>Export </strong>function, or my colleague Jeff Larson&#39;s <a href="https://github.com/thejefflarson/docsplit">pdf-splitter</a>:</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7004/6440552003_514d16f495.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 280px; height: 251px; " /></p>
<p>
	<em>Click the printer icon, circled in red, and print the file as a PDF.</em></p>
<p>
	<script src="https://gist.github.com/1422592.js"> </script></p>
<p>
	This will result in <strong>.tif </strong>files for every page in the original PDF.</p>
<p>
	&nbsp;</p>
<h4>
	Step 1: Reading Lines With RMagick</h4>
<p>
	RMagick is a library for Ruby that allows us to do a variety of graphics operations in our program, such as changing an image&#39;s color. In fact, that&#39;s the first thing we want to do because Tesseract works best with black-and-white images. This is done using RMagick&#39;s <a href="http://studio.imagemagick.org/RMagick/doc/image3.html#quantize">quantize</a> method. We also can reduce the gray by boosting the image&#39;s <a href="http://studio.imagemagick.org/RMagick/doc/image3.html#sigmoidal_contrast_channel">contrast</a>.</p>
<p>
	We had limited success doing this programmatically and so just used a Photoshop batch operation to get the desired black-and-white contrast. For the purposes of this tutorial, you can use this sample black-and-white excerpt: <a href="http://propublica.s3.amazonaws.com/assets/nerds/table-to-ocr.tif">table-to-ocr.tif</a></p>
<p>
	<script src="https://gist.github.com/1422403.js"> </script></p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7012/6440552693_9cd31faf37_m.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; float: right; width: 150px; height: 136px; " />Detecting the lines in the .tif file simply involves finding the lines in which all the pixels are non-white (some may be gray, which you&#39;ll see if you zoom in at the pixel level). This can be done using RMagick&#39;s <a href="http://studio.imagemagick.org/RMagick/doc/image2.html#get_pixels">get_pixels</a>&nbsp;method on every horizontal and vertical line; <a href="http://studio.imagemagick.org/RMagick/doc/image2.html#get_pixels">get_pixels</a> returns an array of pixels within the boundaries we specify.</p>
<p>
	RMagick&#39;s <a href="http://studio.imagemagick.org/RMagick/doc/struct.html#Pixel">Pixel class</a> has <strong>red</strong>, <strong>blue</strong>, and <strong>green </strong>attributes. Examining the red, blue, and green values of a white pixel should give you 65535 for each; a black pixel will return the value 0. Gray pixels are anywhere in between.</p>
<p>
	So first we crop the image to remove the white space surrounding the table (using <a href="http://studio.imagemagick.org/RMagick/doc/image1.html#bounding_box">bounding_box</a>). Then we examine each pixel of a line and record the positions where every pixel in that line had color values less than a dark gray (63000 seems to be enough tolerance):</p>
<p>
	&nbsp;</p>
<p>
	<script src="https://gist.github.com/1422626.js"> </script></p>
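<p>
	The detection logic itself can be tried without RMagick. The following is a minimal sketch, not the code from the gist above: it assumes the image has already been reduced to nested arrays of grayscale channel values on the 0 to 65535 scale described earlier, and the helper name <strong>dark_line_indexes</strong> and the sample data are hypothetical.</p>

```ruby
# Channel values follow the 16-bit scale described in the text:
# 65535 is white, 0 is black, gray is anywhere in between.
WHITE = 65_535
DARK_THRESHOLD = 63_000 # the tolerance suggested in the text

# Returns the indexes of the rows in which every pixel is darker
# than the threshold, i.e. rows that look like a table line.
def dark_line_indexes(rows, threshold = DARK_THRESHOLD)
  rows.each_index.select do |i|
    rows[i].all? { |value| value < threshold }
  end
end

# A tiny 5x5 "image" as nested arrays (hypothetical sample data).
image = [
  [0,     0,      0,     0, 0],
  [0, WHITE, 40_000, WHITE, 0],
  [0, WHITE, 30_000, WHITE, 0],
  [0,     0,      0,     0, 0],
  [0, WHITE, 20_000, WHITE, 0]
]

horizontal_lines = dark_line_indexes(image)           # scan each row
vertical_lines   = dark_line_indexes(image.transpose) # scan each column

puts "rows: #{horizontal_lines.inspect}"
puts "columns: #{vertical_lines.inspect}"
```

<p>
	Note that the middle column counts as a line even though it is gray rather than black, which is the point of using a tolerance. In a real scan a line is usually several pixels thick, so runs of consecutive indexes would still need to be collapsed into a single position.</p>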
<h4>
	Step 2: Breaking It Into Pieces</h4>
<p>
	Now that we have two arrays defining rows and columns, we iterate through each one and call RMagick&#39;s constitute method, which creates new images based on the dimensions we provided it. We then write each image to a file named <em>column_numberxrow_number</em>.tif:</p>
<p>
	<script src="https://gist.github.com/1422497.js"> </script></p>
<p>
	You should end up with a directory called <strong>cell-files</strong> with nearly 500 TIFF files in it.</p>
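<p>
	The arithmetic behind the split is easy to get wrong by a pixel, so here is a small sketch of it in plain Ruby. It assumes the detected line positions are sorted arrays of pixel offsets; the helper <strong>spans_between</strong> and the sample positions are hypothetical, and the actual cropping would still be done with RMagick as in the gist above.</p>

```ruby
# Turn sorted line positions into [start, length] pairs describing
# the gaps between consecutive lines. Adjacent positions (a thick
# line detected twice) produce a zero-length gap and are dropped.
def spans_between(lines)
  gaps = lines.each_cons(2).map { |a, b| [a + 1, b - a - 1] }
  gaps.select { |_start, length| length.positive? }
end

# Hypothetical line positions, as the detection step might return them.
column_lines = [0, 120, 121, 300]
row_lines    = [0, 40, 80]

cells = []
spans_between(column_lines).each_with_index do |(x, width), col|
  spans_between(row_lines).each_with_index do |(y, height), row|
    # File names follow the column_numberxrow_number.tif convention.
    cells << { file: "#{col}x#{row}.tif", x: x, y: y, width: width, height: height }
  end
end

cells.each { |cell| puts cell.inspect }
```

<p>
	Each hash holds the rectangle for one cell, excluding the border lines themselves so that they do not confuse the OCR step.</p>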
<p>
	&nbsp;</p>
<h4>
	Step 3: Tesseract Each Image</h4>
<p>
	<a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a> is a free optical character recognition program, first developed by HP and now maintained as open-source software by Google. Its operation is simple: point it to an image file, and it produces a text file with what it interprets as text from that image. So, in the above code, we simply run <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a> on each TIFF as it is created. Add this to the above code, after the <strong>constitute </strong>call:</p>
<p>
	<script src="https://gist.github.com/1422602.js"> </script></p>
<p>
	Now you should have nearly 500 text files in <strong>cell-files</strong>.</p>
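<p>
	For reference, a hedged sketch of how one such call might look from Ruby. It relies on the basic command-line form, <strong>tesseract image outputbase</strong>, where Tesseract adds the .txt extension itself; the helper name <strong>tesseract_command</strong> is hypothetical and this is not necessarily how the gist above invokes it.</p>

```ruby
# Build the argument list for one cell image. Tesseract appends
# ".txt" to the output base name on its own.
def tesseract_command(tif_path)
  output_base = tif_path.sub(/\.tif\z/, "")
  ["tesseract", tif_path, output_base]
end

cmd = tesseract_command("cell-files/0x0.tif")
puts cmd.join(" ")

# Passing separate arguments to system avoids shell-quoting problems.
# The call is skipped when the image is missing, so this sketch can
# be run even before any cells have been cut out.
system(*cmd) if File.exist?(cmd[1])
```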
<p>
	&nbsp;</p>
<h4>
	Step 4: All Together Now</h4>
<p>
	In the previous code, we&#39;re essentially stepping through the image column by column, line by line. While we&#39;re in this loop, we might as well record each of the text files&#39; content into one master text file. If we add a delimiting character each time, we end up with tabular data.</p>
<p>
	So, in the previous block of code, open a text file called &quot;1-table.txt&quot; and after the tesseract call, write the contents of that tiny text file into &quot;1-table.txt.&quot; The combined code for Steps 2-4 is:</p>
<p>
	<script src="https://gist.github.com/1422514.js"> </script></p>
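<p>
	The reassembly can also be done as a separate pass after all the OCR has finished. The sketch below assumes the text files are named <em>column_numberxrow_number</em>.txt and uses a tab as the delimiter; the function name <strong>assemble_table</strong> and the sample cell contents are made up for illustration.</p>

```ruby
require "tmpdir"

# Read cell text files named "<column>x<row>.txt" and join them into
# tab-delimited lines, one line per table row.
def assemble_table(dir, columns, rows)
  (0...rows).map do |row|
    (0...columns).map do |col|
      path = File.join(dir, "#{col}x#{row}.txt")
      # OCR output may span several lines; collapse all whitespace
      # so that each cell stays on one line of the final table.
      File.exist?(path) ? File.read(path).split.join(" ") : ""
    end.join("\t")
  end.join("\n")
end

table = Dir.mktmpdir do |dir|
  # Hypothetical OCR output for a 2x2 table.
  samples = {
    "0x0" => "Name\n",         "1x0" => "Amount\n",
    "0x1" => "Smith,\nJane\n", "1x1" => "$1,500\n"
  }
  samples.each { |name, text| File.write(File.join(dir, "#{name}.txt"), text) }
  assemble_table(dir, 2, 2)
end

puts table
```

<p>
	A missing file becomes an empty cell rather than an error, which keeps the columns aligned when Tesseract produces no output for a blank cell.</p>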
<h3>
	&nbsp;</h3>
<h3>
	Cleanup Time</h3>
<p>
	&nbsp;</p>
<p>
	If that seemed too easy, it was. Open the resulting text file in a spreadsheet:</p>
<p>
	<img alt="" src="http://farm8.staticflickr.com/7003/6440553275_71ed04950d.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 312px; " /></p>
<p>
	<em>Tesseract&#39;s imperfect translation of the images we sent it.</em></p>
<p>
	Tesseract isn&#39;t perfect, and on the first pass it may mistranslate many characters, especially ones that look similar to another, such as &#39;O&#39; and &#39;0&#39; (zero). You <a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract">can train the Tesseract engine</a>, though it&#39;s tedious and involves giving it a character-by-character correction of a sample test.</p>
<p>
	Even then it won&#39;t be perfect. Your next step will be to determine the best way to clean this data. You could start by validating each column against what you know it should contain (such as a currency format). Or you could design a <a href="http://www.propublica.org/article/propublicas-guide-to-mechanical-turk">Mechanical Turk task</a> in which you send an individual text file and TIFF for each cell and ask workers to perform a simple verification. You could even write your own Rails application to display the images and text-values side by side, so that your co-workers can collaboratively do the verification (this is what we did until <a href="http://www.lillyphysicianpaymentregistry.com/">Lilly released their data as PDFs</a>).</p>
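<p>
	As an illustration of the first option, here is a rough sketch of validating a column that is known to contain currency. The pattern, the character substitutions and the sample values are assumptions chosen for this example, not rules we derived from the Lilly data.</p>

```ruby
# A currency cell should look like "$1,500" or "$250.00".
CURRENCY = /\A\$\d{1,3}(,\d{3})*(\.\d{2})?\z/

# In a column known to hold money, letters that resemble digits are
# almost certainly OCR slips: O and o for 0, l and I for 1, S for 5.
def repair_currency(text)
  text.strip.tr("OolIS", "00115")
end

cells = ["$1,5O0", "$25l.00", "$3,000", "1,200", "$12,34"]

cells.each do |raw|
  fixed = repair_currency(raw)
  status = fixed.match?(CURRENCY) ? "ok" : "needs review"
  puts "#{raw.inspect} -> #{fixed.inspect} (#{status})"
end
```

<p>
	Anything still marked as needing review after the automatic repair is a candidate for the human verification described above.</p>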
<p>
	Again, the code here is specific to Lilly&#39;s format and may not be as successful on a scanned document where, for example, the lines aren&#39;t as easy to determine.</p>
<p>
	We hope to craft a more generalized version of this guide in the near future. We&#39;ll call the project &quot;Tableract&quot; for now.</p>
<h4>
	&nbsp;</h4>
<h3>
	The Dollars for Docs Data Guides</h3>
<p>
	&nbsp;</p>
<p>
	<strong>Introduction:</strong> <a href="http://www.propublica.org/nerds/item/the-coders-cause-in-dollars-for-docs">The Coder&#39;s Cause</a> &ndash; Public records gathering as a programming challenge.</p>
<ol>
	<li>
		<a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning">Using Google Refine to Clean Messy Data</a> &ndash; Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/reading-flash-data">Reading Data from Flash Sites</a> &ndash; Use Firefox&#39;s Firebug plugin to discover and capture raw data sent to your browser.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide">Parsing PDFs</a> &ndash; Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/scraping-websites">Scraping HTML</a> &ndash; Write Ruby code to traverse a website and copy the data you need.</li>
	<li>
		<strong>Getting Text Out of an Image-only PDF</strong> &ndash; Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).</li>
</ol>
<p>
	<link href="http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick" rel="syndication-source" />
</p>
<p>
	<link href="http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick" rel="canonical" />
</p>
]]></description> 
      <dc:date>2011-12-02T07:54:03+00:00</dc:date>
    </item>

    <item>
      <title>Turning PDFs to text</title>
      <link>http://datadrivenjournalism.net/resources/chapter_3_turning_pdfs_to_text</link>
      <guid>http://datadrivenjournalism.net/resources/chapter_3_turning_pdfs_to_text#When:12:40:48Z</guid>
      <description><![CDATA[<p>
	<em>Originally published by&nbsp;</em><em><a href="http://www.propublica.org/site/author/dan_nguyen">Dan Nguyen</a>&nbsp;on <a href="http://www.propublica.org">ProPublica</a>,&nbsp;</em><em>30 December, 2010. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	<em><strong>Update (1/18/2011)</strong>: We originally wrote that we had promising results with the commercial product deskUNPDF&#39;s trial mode. We have since ordered the full version of deskUNPDF <a href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide#deskunpdf">and tried using it on some of the latest payments data.</a></em></p>
<p>
	Adobe&rsquo;s Portable Document Format is a great format for digital documents when it&rsquo;s important to maintain the layout of the original format. However, it&rsquo;s a document format and not a data format.</p>
<p>
	Unfortunately, it <a href="http://sunlightlabs.com/blog/2009/adobe-bad-open-government/">seems to be treated like a data transfer format, especially by some government agencies and others</a>, who use it to release data that would be much more useful for journalists and researchers as a spreadsheet or even as a plain text file.</p>
<p>
	In our <a href="http://projects.propublica.org/docdollars/companies/">Dollars for Docs project</a>, companies provided their data in PDF format.</p>
<p>
	Wikipedia has a <a href="http://en.wikipedia.org/wiki/List_of_PDF_software">good list of PDF tools and converters</a>. However, we didn&rsquo;t find a one-click-does-it-all solution for converting PDFs into spreadsheets while gathering the <a href="http://projects.propublica.org/docdollars/">Dollars for Docs data</a>.</p>
<p>
	We recently tested the commercial product <a href="http://www.docudesk.com/deskunpdf_product_home.shtml">deskUNPDF</a> on several of the latest payment lists. In the vast majority of entries, deskUNPDF does an accurate conversion. But like the other methods described in this guide, it does not work perfectly for all the sets of data. For example, with the most recent Johnson &amp; Johnson PDF, deskUNPDF omitted some of the text within some cells that contained long strings (like the names of the payees). This required us to manually verify each cell for accuracy.</p>
<p>
	Here are three other conversion methods we used for Dollars for Docs that involve a mix of software and coding. However, they still require some manual clean-up, which can be time-consuming for 50+ page documents.</p>
<p>
	Note: The following guide is for PDFs that actually have embedded text in them. Can you highlight the text to copy and paste it? Then this is the right guide. Otherwise, for PDFs that are secure, or PDFs that are essentially images of text &ndash; such as scanned documents, <a href="http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick">visit this tutorial</a>.</p>
<h4>
	<strong>Method 1: Third-Party Sites</strong></h4>
<p>
	<a href="http://www.cometdocs.com/">Cometdocs</a> and <a href="http://zamzar.com/">Zamzar</a>&nbsp;are web-based services that convert PDF files that you upload. After a short turnaround time, you&rsquo;ll receive an e-mail with a download link (as well as an advertisement for their enterprise services).</p>
<p>
	We&rsquo;ve had good results from <a href="http://www.cometdocs.com/">CometDocs</a>. For the <a href="http://www.janssenpharmaceuticalsinc.com/?r=www.ortho-mcneil.com">Johnson &amp; Johnson (Ortho-Mcneil-Janssen division) file</a>, which you can download here, we still had to manually clean up entries that were split across several lines.</p>
<p>
	However, the mistakes in conversion can be more than superficial. For example, using <a href="http://www.cometdocs.com/">CometDocs</a> on the <a href="http://www.lillyphysicianpaymentregistry.com/">Eli Lilly PDF</a> yielded this conversion:</p>
<p>
	<img alt="" src="http://www.propublica.org/images/ngen/gypsy_big_image/pdf-lilly-comet-unalign.png" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 216px; " /></p>
<p>
	<em>Left: The PDF translated to spreadsheet format; the numbers in red are in the wrong column.<br />
	Right: The original PDF.</em></p>
<p>
	On this page, it appears that an entire column of numbers was shifted over. This is an error that would be difficult to catch without comparing the output to the original PDFs.</p>
<h4>
	<strong>Method 2: Convert to HTML in Acrobat</strong></h4>
<p>
	As it turns out, Lilly&rsquo;s PDF has some structure behind it, which we can take advantage of by converting the PDF to HTML. We don&rsquo;t know of any free PDF to HTML tools, so hopefully your shop already has a copy of <a href="http://www.adobe.com/products/acrobat.html">Adobe Acrobat Pro</a>.</p>
<p>
	After downloading the <a href="http://www.lillyphysicianpaymentregistry.com/">Lilly report</a>, open it with Acrobat. Then select <strong>Save As</strong>, then select <strong>HTML 3.2</strong> as the format.</p>
<p>
	<strong>Optional programming</strong></p>
<p>
	At this point, you are pretty much done. You can use your web browser to open up the gigantic HTML file that was just created, <strong>Select All</strong>, <strong>Copy</strong>, and then <strong>Paste</strong> into Excel. You&rsquo;ll spend a little time deleting the header rows and finding anomalies, but Excel generally does a good job of automatically converting HTML tables into spreadsheet form.</p>
<p>
	With a little programming, you can parse through the file and do some cleanup at the same time (we go into more explanatory detail about the Ruby parsing library, <a href="http://nokogiri.org/">Nokogiri</a>, in the <a href="http://www.propublica.org/nerds/item/reading-flash-data">Flash</a>&nbsp;and <a href="http://www.propublica.org/nerds/item/scraping-websites">web scraping</a> tutorials):</p>
<p>
	<script src="https://gist.github.com/1417045.js"> </script></p>
<p>
	The above code will print out all the PDF contents, including the header row and narrative description text. So, assuming that actual data fits in a specified format (a table row with nine columns), we can alter the script to separate the rows into different files by column count. Rows with three columns, for example, are written to a file called &#39;pdf-columns-3.txt&#39;.</p>
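<p>
	As a rough sketch of that sorting step, assume the table rows have already been pulled out of the HTML as arrays of cell text. The sample rows and method names below are made up for illustration; only the file-name pattern comes from the description above.</p>

```ruby
# Group parsed table rows by how many columns each one has.
# `rows` is an array of arrays of cell text (hypothetical sample data below).
def group_rows_by_column_count(rows)
  rows.group_by { |cells| cells.length }
end

# Write each group to its own tab-delimited file, e.g. pdf-columns-3.txt.
def write_groups(groups, dir = ".")
  groups.each do |count, rows|
    path = File.join(dir, "pdf-columns-#{count}.txt")
    File.open(path, "w") do |file|
      rows.each { |cells| file.puts(cells.join("\t")) }
    end
  end
end

sample_rows = [
  ["Name", "City", "State"],
  ["Registry report"],
  ["A", "B", "C"]
]
groups = group_rows_by_column_count(sample_rows)
groups.each { |count, rows| puts "#{count} columns: #{rows.length} rows" }
```

<p>
	Looking at the row count per group is a quick way to see which column count holds the real data and which groups are headers or stray text.</p>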
<p>
	When you do this, you&#39;ll find that all valid data rows have nine columns. But there is one more issue with this particular PDF: some rows have each column value repeated twice:</p>
<p>
	<img alt="" src="http://www.propublica.org/images/ngen/gypsy_big_image/gsk-duplicate.png" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 480px; height: 175px; " /></p>
<p>
	<em>In the highlighted row, the values are repeated twice in each column.</em></p>
<p>
	So, for data rows in which there are nine columns, we can check to see if the third column (state initials) contains exactly two capital letters. If not, then the row has the duplicated-data error. In this special case, we can print the corrected data (by splitting the duplicated-data values in half) next to the erroneous columns and then go into a spreadsheet program to compare the results. Here is the code for the entire process:</p>
<p>
	<script src="https://gist.github.com/1417296.js"> </script></p>
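<p>
	The check-and-split idea can be sketched in a few lines. This is a simplified illustration, not the script embedded above: it assumes each duplicated value is the same text written twice back to back with no separator, and the sample row and method names are invented.</p>

```ruby
STATE_PATTERN = /\A[A-Z]{2}\z/ # exactly two capital letters

# True when a nine-column row's state cell (third column) is not a plain
# two-letter code, which we take as the sign of the duplicated-data error.
def duplicated_row?(cells)
  return false unless cells.length == 9
  !STATE_PATTERN.match?(cells[2])
end

# Keep the first half of a value that was written twice back to back,
# e.g. "NYNY" becomes "NY". Assumes no separator between the two copies.
def first_half(value)
  value[0, value.length / 2]
end

# Return a corrected copy of a damaged row; leave good rows untouched.
def corrected_row(cells)
  duplicated_row?(cells) ? cells.map { |value| first_half(value) } : cells
end

row = ["DoeDoe", "JaneJane", "NYNY", "11", "22", "33", "44", "55", "66"]
puts corrected_row(row).join("|")
```

<p>
	Printing the corrected values next to the originals, rather than overwriting them, keeps the comparison step in the spreadsheet possible.</p>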
<h4>
	<strong>Method 3: Convert to Text, Measure Column Widths</strong></h4>
<p>
	Unfortunately, not all PDF tables convert to nice HTML. Try the above method on the GSK file, for example. Converting it to HTML results in this mess:</p>
<p>
	<img alt="" src="http://www.propublica.org/images/ngen/gypsy_big_image/gsk-html-mess-columns.png" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 298px; height: 233px; " /></p>
<p>
	One possible strategy is to analyze the whitespace between columns. This requires the <a href="http://www.regular-expressions.info/">use of regular expressions</a>. If you don&#39;t know about them, they&rsquo;re worth learning. Even without programming experience, you&#39;ll find regular expressions extremely useful when doing data cleaning or even advanced document searches.</p>
<p>
	The first step is to convert the PDF to plain text. You can use the aptly named <a href="http://en.wikipedia.org/wiki/Pdftotext">pdftotext</a>, which is part of the free <a href="http://www.foolabs.com/xpdf/download.html">xpdf</a> package. We&#39;re using a Mac to do this. Linux instructions are pretty similar. Under Windows, your best bet would be to use <a href="http://www.cygwin.com/">Cygwin</a>.</p>
<p>
	For this example, we will use the GSK disclosure PDF, <a href="http://us.gsk.com/docs-pdf/responsibility/hcp-fee-disclosure-2q-4q2009.pdf">which you can download here</a>.</p>
<p>
	<script src="https://gist.github.com/1417307.js"> </script></p>
<p>
	This produces <strong>hcp-fee-disclosure-2q-4q2009.txt</strong>. The <strong>-layout </strong>flag preserves the spacing of the words as they were in the original PDF. This is what the <a href="http://us.gsk.com/docs-pdf/responsibility/hcp-fee-disclosure-2q-4q2009.pdf">GSK file</a> looks like in text form:</p>
<p>
	<script src="https://gist.github.com/1417315.js"> </script></p>
<p>
	Let&#39;s look at the easiest scenario of text-handling, where every cell has a value:</p>
<p>
	<script src="https://gist.github.com/1417328.js"> </script></p>
<p>
	There&#39;s no special character, such as a comma or tab, that defines where each column ends and begins.</p>
<p>
	However, values in separate columns appear to have two or more spaces separating them. So, we can just use our text editing program to find and replace those to a special character of our choosing.</p>
<p>
	Regular expressions allow us to specify a match of something like &quot;one space <strong>or more</strong>.&quot; In this case, we want to convert every set of two-or-more consecutive spaces into a pipe character (&quot;|&quot;).</p>
<p>
	Many major text-editors allow the use of <a href="http://www.regular-expressions.info/">regular expressions</a>. We use <a href="http://macromates.com/">TextMate</a>. For Mac users, <a href="http://www.barebones.com/products/textwrangler/textwranglerpower.html">TextWrangler</a> is a great free text editor that supports find-and-replace operations with regular expressions. <a href="http://notepad-plus-plus.org/">Notepad++</a> is a free Windows text-editor; here&#39;s a tutorial on <a href="http://www.slideshare.net/anjesh/the-power-of-regular-expression-use-in-notepad">how to use regular expressions in it</a>.</p>
<p>
	In regular expression syntax, curly brackets <strong>{x,y}</strong> denote a range between<em> x</em> and <em>y </em>occurrences of the character <em>preceding the brackets</em>. So <strong>e{1,2}</strong> will match 1 to 2 &#39;e&#39; characters. So the regular expression to find &quot;bet&quot; and &quot;beet&quot; is: <strong>be{1,2}t</strong>.</p>
<p>
	Leaving off the second number, as in <strong>e{1,}</strong>, means we want to match at least one &#39;e&#39;, and any number of that character thereafter. So, to capture two-or-more whitespaces, we simply do: &quot; <strong>{2,}</strong>&quot;.</p>
<p>
	So entering &quot; {2,}&quot; into the &quot;Find:&quot; field and &quot;|&quot; into &quot;Replace:&quot;, we get:</p>
<p>
	<script src="https://gist.github.com/1417348.js"> </script></p>
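<p>
	The same find-and-replace can be scripted. Here is a minimal Ruby sketch using <strong>String#gsub</strong> with the &quot; {2,}&quot; pattern; the sample line and the method name are hypothetical.</p>

```ruby
# Replace every run of two or more spaces with a pipe character.
# Leading and trailing whitespace is stripped first so that page
# indentation does not turn into an empty first column.
def pipe_delimit(line)
  line.strip.gsub(/ {2,}/, "|")
end

puts pipe_delimit("Smith, John    Boston   MA    $1,500")
# prints Smith, John|Boston|MA|$1,500
```

<p>
	The single space inside &quot;Smith, John&quot; is left alone because the pattern only matches two or more spaces in a row.</p>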
<p>
	Easy enough. But a common problem is when a cell is left blank. This causes two empty columns to be seen as just one empty column, according to our regular expression:</p>
<p>
	<script src="https://gist.github.com/1417340.js"> </script></p>
<p>
	<script src="https://gist.github.com/1417347.js"> </script></p>
<p>
	If you&#39;ve worked with older textfile databases or mainframe output, you probably have come across tables with <strong>fixed-width</strong> columns, where each column occupies a predetermined number of characters.</p>
<p>
	Looking at the above table, we can see that even if there are blanks in the column, the actual data falls within a certain space. So, using regular expressions with a little Ruby scripting, we can programmatically determine these columns.</p>
<p>
	We first delimit each row with the &quot; {2,}&quot; regular expression. As we saw in the example above, we&#39;ll end up with lines that have varying numbers of columns.</p>
<p>
	If we then iterate through each column and find the farthest-left and the farthest-right position per column on the page, according to each word&rsquo;s position and length, we should be able to produce on-the-fly a fixed-width format for this table.</p>
<p>
	This is easier to explain with a diagram. Here&#39;s a sparsely populated table of four columns.</p>
<p>
	<script src="https://gist.github.com/1417355.js"> </script></p>
<p>
	If we delimit the above with &quot; {2,}&quot;, we&#39;ll find that the first row will have 2 columns; the second row, 3 columns; and the third, 1 column.</p>
<p>
	Programmatically, we&#39;re going to store each of these lines of text as an array, so Row_1 would be [&quot;Banana&quot;, &quot;Currant&quot;], for instance. This is just an intermediary step, though. What we really want is where each word begins and ends on that line. If the very first position is 0, then &quot;Banana&quot; begins at position 13 and ends at position 19, that is, 19 spaces from the beginning of the line. Doing this for each line gets us:</p>
<p>
	<script src="https://gist.github.com/1417362.js"> </script></p>
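<p>
	One way to collect those positions in Ruby is to scan each line for cells and record where every match starts and ends. This is a sketch under our own assumptions: the <strong>CELL</strong> pattern and the method name are ours, and the sample line is built to match the positions quoted above.</p>

```ruby
# A cell is a run of non-space text; single spaces inside it are allowed,
# so a gap of two or more spaces ends the cell.
CELL = /\S+(?: \S+)*/

# Return [text, start, finish] for each cell on the line. start is the
# zero-based offset of the first character; finish is start plus length.
def cell_positions(line)
  positions = []
  line.scan(CELL) do
    match = Regexp.last_match
    positions.push([match[0], match.begin(0), match.end(0)])
  end
  positions
end

line = (" " * 13) + "Banana" + (" " * 5) + "Currant"
p cell_positions(line)
# prints [["Banana", 13, 19], ["Currant", 24, 31]]
```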
<p>
	So as we read the values for each line, let&#39;s keep a master list of the farthest-left and farthest-right positions of each column.</p>
<p>
	Reading through the first line, this list will be: [13,19], [24,31], where &ldquo;Banana&rdquo; and &ldquo;Currant&rdquo; are positioned, respectively.</p>
<p>
	When our script reads through the second line, it finds a word (Alaska) at position 4 and ending at 10.</p>
<p>
	Since it ends before the starting position (10 &lt; 13) of what the program previously thought was the starting boundary of the first column, it stands to reason that the space containing &quot;Alaska&quot; is actually the table&#39;s first column.</p>
<p>
	When the script reads &quot;Colorado&quot;, it sees that it intersects with &quot;Currant&quot;&#39;s position in the first line. It assumes that the two share the same column (now the third), and changes the definition of that column from [24,31] to [24,33], since &quot;Colorado&quot; is a slightly longer word.</p>
<p>
	The list of columns is now: [4,10], [13,19], [24,33], [36,44].</p>
<p>
	In the third line, the only word is &quot;Bear&quot; and its dimensions fall within the previously defined second column&#39;s positions [13,19].</p>
<p>
	So now with our master list of positions, we can read each line again and break it apart by these column definitions, getting us a four-column table as expected.</p>
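<p>
	The walkthrough above can be condensed into a short sketch. It is a simplified version of the idea, not a full program: it works on start and end positions that have already been measured, and the exact positions used for &quot;Colorado&quot;, &quot;Bear&quot; and the fourth-column value are our own guesses chosen to reproduce the column list described above.</p>

```ruby
# Fold one cell's [start, finish] span into the master list of columns.
# A span that overlaps an existing column widens it; otherwise it
# becomes a new column. The list is kept sorted left to right.
def merge_span(columns, start, finish)
  hit = columns.find do |left, right|
    ([finish, right].min - [start, left].max).positive?
  end
  if hit
    hit[0] = [hit[0], start].min
    hit[1] = [hit[1], finish].max
  else
    columns.push([start, finish])
  end
  columns.sort_by! { |left, _right| left }
end

# Build the column list from every row's spans.
def find_columns(rows_of_spans)
  rows_of_spans.each_with_object([]) do |spans, columns|
    spans.each { |start, finish| merge_span(columns, start, finish) }
  end
end

# Cut a line of text at the column boundaries; blank cells come back empty.
def split_by_columns(line, columns)
  columns.map { |left, right| (line[left...right] || "").strip }
end

spans = [
  [[13, 19], [24, 31]],          # Banana, Currant
  [[4, 10], [25, 33], [36, 44]], # Alaska, Colorado, a fourth-column value
  [[14, 18]]                     # Bear
]
p find_columns(spans)
# prints [[4, 10], [13, 19], [24, 33], [36, 44]]
```

<p>
	This sketch does not handle the case where one widened column grows into its neighbor, so pages with ragged spacing still need a manual look.</p>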
<h4>
	<strong>Splitting the PDF</strong></h4>
<p>
	When converting the PDF to text, sometimes the columns won&#39;t be positioned the same across every page. So let&rsquo;s begin by splitting the PDF into separate pages by calling pdftotext within Ruby:</p>
<p>
	<script src="https://gist.github.com/1417158.js"> </script></p>
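<p>
	For readers who cannot load the embedded script, here is a hedged sketch of the same idea. It relies on pdftotext&#39;s <strong>-f</strong> and <strong>-l</strong> options (first and last page to convert) together with <strong>-layout</strong>. The method names and the output file naming are our own, and the page count is assumed to be known in advance.</p>

```ruby
# Build the pdftotext argument list for a single page.
# -f and -l set the first and last page; -layout keeps the original spacing.
def page_command(pdf_path, page, out_dir = ".")
  base = File.basename(pdf_path, ".pdf")
  out_path = File.join(out_dir, "#{base}-page-#{page}.txt")
  ["pdftotext", "-layout", "-f", page.to_s, "-l", page.to_s, pdf_path, out_path]
end

# Convert pages 1..page_count, writing one text file per page.
# Requires pdftotext to be installed and on the PATH.
def split_pdf(pdf_path, page_count, out_dir = ".")
  (1..page_count).each do |page|
    system(*page_command(pdf_path, page, out_dir))
  end
end

# Show the command for page 1 without running it.
puts page_command("hcp-fee-disclosure-2q-4q2009.pdf", 1).join(" ")
```

<p>
	Passing the arguments to <strong>system</strong> as a list, rather than as one string, avoids shell-quoting problems with file names that contain spaces.</p>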
<p>
	And then iterate through each page to calculate its fixed-width format with the algorithm described above. Here&#39;s the commented code for the entire program:</p>
<p>
	<script src="https://gist.github.com/1417148.js"> </script></p>
<p>
	You&rsquo;ll note that in the section where we output the results to compiled_file, we&rsquo;ve also included the page number, line number, and number of columns in that page. When we try this program on Lilly&rsquo;s PDF, there are some columns in which the data is spread out enough to be considered separate columns by our program. So keeping track of the columns found per page allows us to quickly identify problem pages and fix them manually.</p>
<p>
	<img alt="" src="http://www.propublica.org/images/ngen/gypsy_big_image/pdf-bad-spacing-column-lilly.png" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; width: 502px; height: 203px; " /></p>
<p>
	<em>Because of the wide spacing in this particular PDF-to-text translation, our program would mistakenly create two columns where the original PDF only had one.</em></p>
<h4>
	<strong>PDF-to-Text Anomalies</strong></h4>
<p>
	Almost every conversion ends up with some strange artifacts. For example, in the above conversion of the GSK document, we get some entries in the last column that are repeated over several lines.</p>
<p>
	I don&#39;t know enough about how PDFs are generated to prevent this. But after any conversion, you&#39;ll need to use Excel, Google <a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning">Refine</a>, or some custom code to check that all the fields have values in an expected range.</p>
<p>
	Regular expressions are pretty much essential to this, allowing you to determine which cells don&#39;t fit a certain format, such as an exact length of characters, or a currency format like $xx,xxx.00.</p>
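<p>
	As an illustration, here is a minimal sketch of such a check for a column of dollar amounts. The pattern is one plausible reading of the $xx,xxx.00 format, and the sample values and method name are invented.</p>

```ruby
# Dollar amounts such as $1,500.00 or $250.00: a dollar sign, one to three
# digits, optional comma-separated groups of three digits, two decimals.
CURRENCY = /\A\$\d{1,3}(?:,\d{3})*\.\d{2}\z/

# Return [row_index, value] pairs for values that do not fit the pattern.
def misfits(values, pattern)
  pairs = values.each_with_index.map { |value, index| [index, value] }
  pairs.reject { |_index, value| pattern.match?(value) }
end

amounts = ["$1,500.00", "$250.00", "1500", "$12,34.00"]
p misfits(amounts, CURRENCY)
# prints [[2, "1500"], [3, "$12,34.00"]]
```

<p>
	The row index in each pair makes it easy to jump to the suspect cell in the spreadsheet and compare it against the original PDF.</p>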
<h4>
	<strong>Conclusions</strong></h4>
<p>
	There is no single method we could find that does PDF translation perfectly. We recommend trying one of the web services first. If the result isn&rsquo;t as accurate as you like, it&rsquo;s not too much work to write some text-processing code.</p>
<p>
	With any method, you may end up spending lots of time cleaning up the occasionally mistranslated cell, but at least it won&#39;t be as arduous as manually retyping the entire PDF.</p>
<p>
	&nbsp;</p>
<h4>
	The Dollars for Docs Data Guides</h4>
<p>
	<strong>Introduction:</strong> <a href="http://www.propublica.org/nerds/item/the-coders-cause-in-dollars-for-docs" style="font-weight: normal; ">The Coder&#39;s Cause</a> &ndash; Public records gathering as a programming challenge.</p>
<ol>
	<li>
		<a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning">Using Google Refine to Clean Messy Data</a> &ndash; Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/reading-flash-data">Reading Data from Flash Sites</a> &ndash; Use Firefox&#39;s Firebug plugin to discover and capture raw data sent to your browser.</li>
	<li>
		<strong>Parsing PDFs</strong> &ndash; Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/scraping-websites">Scraping HTML</a> &ndash; Write Ruby code to traverse a website and copy the data you need.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick">Getting Text Out of an Image-only PDF</a> &ndash; Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).</li>
</ol>
<p>
	<link href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide" rel="syndication-source" />
</p>
<p>
	<link href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide" rel="canonical" />
</p>
]]></description> 
      <dc:date>2011-12-01T12:40:48+00:00</dc:date>
    </item>

    <item>
      <title>Using Google Refine to clean messy data</title>
      <link>http://datadrivenjournalism.net/resources/using_google_refine_to_clean_messy_data</link>
      <guid>http://datadrivenjournalism.net/resources/using_google_refine_to_clean_messy_data#When:15:18:45Z</guid>
      <description><![CDATA[<script type="text/javascript" src="http://pixel.propublica.org/pixel.js" async="true"></script>
<p>
	<em>This article&nbsp;</em><em>was originally published&nbsp;</em><em>by <a href="http://www.propublica.org/site/author/dan_nguyen/">Dan Nguyen</a>&nbsp;on&nbsp;<a href="http://www.propublica.org/">ProPublica</a>,&nbsp;</em><em>30 December, 2010. This article is republished with permission.</em></p>
<p>
	&nbsp;</p>
<p>
	<a href="http://code.google.com/p/google-refine/">Google Refine</a> (the program formerly known as Freebase Gridworks) is described by its creators as a &ldquo;power tool for working with messy data&rdquo; but could very well be advertised as &ldquo;remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.&rdquo;</p>
<p>
	Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn&#39;t require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.</p>
<p>
	Other reasons why you should try <a href="http://code.google.com/p/google-refine/">Google Refine</a>:</p>
<ul>
	<li>
		It&rsquo;s free.</li>
	<li>
		It works in any browser and uses a point-and-click interface similar to Google Docs.</li>
	<li>
		Despite the Google moniker, it works offline. There&rsquo;s no requirement to send anything across the Internet.</li>
	<li>
		There&rsquo;s a host of convenient features, such as an <a href="http://code.google.com/p/google-refine/wiki/History">undo</a> function, and a way to visualize your data&rsquo;s characteristics. For example, check out this guide on how to use it for <a href="http://www.youtube.com/watch?v=m5ER2qRH1OQ">geocoding addresses</a>.</li>
</ul>
<p>
	<a href="http://code.google.com/p/google-refine/wiki/Downloads?tm=2">Download and installation instructions for Refine are here</a>.</p>
<p>
	This tutorial covers the same ground as this screencast by Refine&rsquo;s developer David Huynh (the <a href="http://google-opensource.blogspot.com/2010/11/announcing-google-refine-20-power-tool.html">other two videos are here</a>):</p>
<p>
	<object height="360" width="640"><param name="movie" value="http://www.youtube.com/v/yNccGtn3Wb0&amp;hl=en_US&amp;feature=player_embedded&amp;version=3" /><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><embed allowfullscreen="true" allowscriptaccess="always" height="360" src="http://www.youtube.com/v/yNccGtn3Wb0&amp;hl=en_US&amp;feature=player_embedded&amp;version=3" type="application/x-shockwave-flash" width="640"></embed></object></p>
<h3>
	&nbsp;</h3>
<h3>
	<strong>The Basics of &quot;Messy Data&quot;</strong></h3>
<p>
	&ldquo;Messy data&rdquo; refers to data that&rsquo;s riddled with inconsistencies, either because of human error or poorly designed record systems. So, a column that contains dates may hold values such as &ldquo;12-10-2004&rdquo;, &ldquo;May 9, 1989&rdquo;, and &ldquo;12/4/10.&rdquo;</p>
<p>
	These inconsistencies can wreak havoc on any analysis of the data, so they have to be addressed before the analysis starts.</p>
<p>
	Badly formatted dates are such a common problem that most modern software, such as Excel, has been programmed to (usually) handle all the different date formats.</p>
<p>
	But there&rsquo;s no easy conversion standard for other kinds of data, such as names. If you wanted to get all rows with the name &ldquo;Tina Fey,&rdquo; you&rsquo;d miss the rows that had &ldquo;TINA FEY&rdquo; and &ldquo;Fey, Tina.&rdquo; Even differences in capitalization throw off a computer&rsquo;s basic comparison routine.</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-fey.png" style="width: 460px; height: 233px; " /></p>
<p style="text-align: left; ">
	<em>Photo by <a href="http://www.flickr.com/photos/danielgene/2925216514">daniel.gene</a></em></p>
<p style="text-align: left; ">
	It&rsquo;s easy enough to write a piece of software that ignores capitalization and punctuation. It takes a little more thinking to have that software ignore middle initials and names if they exist in certain rows but not others.</p>
<p style="text-align: left; ">
	To handle nicknames, you could write up a long list of translations, such as &ldquo;Jake&rdquo;=&ldquo;Jacob&rdquo; and &ldquo;Sam&rdquo;=&ldquo;Samuel&rdquo; (or &ldquo;Samantha,&rdquo; for that matter).</p>
<p>
	You could also deal with typos by having your code allow for differences of one or two letters when matching names. But to be certain the computer doesn&rsquo;t match &ldquo;Dan Smith&rdquo; and &ldquo;Don Smith,&rdquo; you&rsquo;d have to write a way to flag ambiguous matches.</p>
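<p>
	A common way to measure &ldquo;differences of one or two letters&rdquo; is edit distance. The sketch below uses the Levenshtein distance, which is our choice for illustration and not necessarily what any particular tool uses; the method names and the two-edit threshold are assumptions.</p>

```ruby
# Levenshtein edit distance: the number of single-character insertions,
# deletions, or substitutions needed to turn one string into the other.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current.push([previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min)
    end
    previous = current
  end
  previous.last
end

# Treat two names as a possible match when they differ by only a few edits,
# ignoring capitalization.
def possible_match?(a, b, max_edits = 2)
  max_edits >= edit_distance(a.downcase, b.downcase)
end

puts edit_distance("Dan Smith", "Don Smith")
# prints 1
```

<p>
	The result shows the ambiguity problem directly: a one-letter difference is just as likely to be two different people as a typo, so close matches should be flagged for review rather than merged automatically.</p>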
<p>
	You can see how doing all of this would be tedious. Enter Google Refine, which does all of the above and more.</p>
<p>
	&nbsp;</p>
<h3>
	<strong>How We Used Refine in &ldquo;Dollars for Docs&rdquo;</strong></h3>
<p>
	For the <a href="http://projects.propublica.org/docdollars/">Dollars for Docs project</a>, one of our initial inquiries was to find out who the top-paid doctors were and why they were held in high regard by the industry. This became the crux of our first stories, as we discovered problems in how companies <a href="http://www.propublica.org/article/dollars-to-doctors-physician-disciplinary-records">screened doctors they hired for promotional work</a>.</p>
<p>
	In <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">our series of technical guides</a>, we detail how we converted the companies&rsquo; payment records from unsortable PDFs to spreadsheets. While this allowed us to find top-paid doctors per company, we wanted to know the total they earned from all the companies.</p>
<p>
	But even within the same company&rsquo;s records, the names of payees varied. Some companies included middle names and suffixes, others didn&rsquo;t. Born names also made simple comparisons tricky.</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-maria-carmen.png" style="width: 678px; height: 388px; " /></p>
<p>
	So before attempting any calculations or analysis in Excel or Access, we used <a href="http://code.google.com/p/google-refine/">Google Refine</a>.</p>
<p>
	&nbsp;</p>
<h3>
	<strong>Starting a Project</strong></h3>
<p>
	You can download Refine&rsquo;s one-step installation package <a href="http://code.google.com/p/google-refine/">here</a>. After Refine is installed, clicking on its application icon will pop open your default web browser (I&rsquo;ve had the best performance with <a href="http://www.google.com/chrome">Google Chrome</a>). Start a new project by opening a delimited-text file or Excel spreadsheet.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-first-page.png" style="width: 700px; height: 351px; " /></p>
<p style="text-align: center; ">
	<em style="text-align: -webkit-auto; ">Your data as viewed through Refine</em></p>
<p style="text-align: center; ">
	&nbsp;</p>
<p style="text-align: left; ">
	Right away you&rsquo;ll see that the data is arranged in a familiar spreadsheet format. Clicking on an individual cell lets you edit it. Clicking on a column header brings up a submenu of operations, including sorting.</p>
<h3>
	&nbsp;</h3>
<h3>
	<strong>Faceting</strong></h3>
<p>
	Refine&rsquo;s faceting feature allows us to summarize the unique values in a column. The easiest way to see its effect is to try it. Clicking on the <strong>companies.name</strong> column header brings up a pop-up menu, from which we choose <strong>Facet -&gt; Text Facet</strong>.</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-text-facet-submenu.png" style="width: 475px; height: 479px; " /></p>
<p style="text-align: left; ">
	<em style="text-align: -webkit-auto; ">Click on the column-header to bring up submenus.</em></p>
<p style="text-align: left; ">
	Now check out the left panel. Refine has listed the seven different company names found in that column, as well as the number of records per company. This is a very convenient way to see if there are any unexpected values in a column.</p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-companies-faceted.png" style="width: 612px; height: 404px; " /></p>
<p style="text-align: left; ">
	<em style="text-align: -webkit-auto; ">Faceting the <strong>companies</strong> column lets us see the different companies that exist in our data and how many records belong to each.</em></p>
<p style="text-align: left; ">
	If some of the entries had the company &ldquo;Merck&rdquo; misspelled as &ldquo;Merk&rdquo;, both would&rsquo;ve shown up in the left panel. Clicking <strong>Remove All</strong> removes the company name faceting.</p>
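<p>
	Conceptually, a text facet is a count of each distinct value in a column. For readers who think in code, a rough Ruby equivalent might look like this; the sample values and the method name are made up, and this is not how Refine itself is implemented.</p>

```ruby
# Count how often each distinct value appears, sorted by value,
# roughly what a text facet lists in the left panel.
def text_facet(values)
  counts = Hash.new(0)
  values.each { |value| counts[value] += 1 }
  counts.sort_by { |value, _count| value }
end

companies = ["Merck", "Pfizer", "Merk", "Merck", "Pfizer", "Merck"]
text_facet(companies).each { |name, count| puts "#{name}: #{count}" }
```

<p>
	A misspelling such as &ldquo;Merk&rdquo; stands out because it appears as its own entry with a suspiciously low count.</p>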
<p>
	We will be faceting the records&rsquo; doctor names in the <strong>full_name</strong> column to see how many variations exist. We will then use Refine&rsquo;s <a href="http://code.google.com/p/google-refine/wiki/Clustering">clustering</a> feature to condense all the variations (and misspellings) of a name into a single identity, which allows us to find all the records associated with that name, and then add up the payment amounts.</p>
<p>
	Read <a href="http://code.google.com/p/google-refine/wiki/Clustering">more about clustering</a> from Refine developer David Huynh.</p>
<p>
	Refine can automate the clustering process, and make it easy for you to do it by hand, but there&rsquo;s always room for error. For proofreading purposes, we want to keep track of what was originally in the <strong>full_name </strong>column, so let&rsquo;s duplicate the column by clicking on the column header, then <strong>Edit Column-&gt;Add column based on this column</strong>.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-add-column.png" style="width: 684px; height: 474px; " /></p>
<p>
	<em>Add a column that duplicates (and upper-cases) the values in the <strong>full_name</strong> column</em></p>
<p>
	&nbsp;</p>
<p>
	Refine supports <a href="http://code.google.com/p/google-refine/wiki/UnderstandingExpressions">several programming languages</a> for transforming or calculating values in a column. For our purposes, we&rsquo;ll just populate our new column, <strong>common_name</strong>, with the results of <strong>toUppercase(value)</strong>, which uppercases the text in the corresponding <strong>full_name</strong> cell.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-column-added.png" style="width: 391px; height: 332px; " /></p>
<p>
	<em>The new column, <strong>common_name</strong></em></p>
<p>
	&nbsp;</p>
<p>
	Now we facet the <strong>common_name</strong> with the <strong>Facet-&gt;Text Facet</strong> command. Refine&rsquo;s left panel is now populated with all the variations of names found in the records.</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-common-name-faceted.png" style="width: 536px; height: 445px; " /></p>
<p>
	<em>The common_name column faceted.</em></p>
<p>
	&nbsp;</p>
<p>
	Notice (circled in red) how the number of total rows (1,238) differs from the number of choices (938). The latter number represents the unique names found among the 1,238. You can see in the list, for instance, two names (circled in yellow) that have two repetitions each.</p>
<p>
	Now, we click on the <strong>Cluster</strong> button to begin the relatively painless process of clustering these names.</p>
<h3>
	&nbsp;</h3>
<h3>
	<strong>Clustering</strong></h3>
<p>
	Refine gives you five algorithms for guessing the similarity of names. It starts you off with the <a href="http://code.google.com/p/google-refine/wiki/ClusteringInDepth#Fingerprint"><strong>fingerprint</strong> function</a>, which uses the strictest &ndash; and safest &ndash; formula. <strong>Fingerprint</strong> treats two names as equivalent when they contain the same words, regardless of capitalization, punctuation, or word order.</p>
<p>
	So, &ldquo;Johnny R. Cash,&rdquo; &ldquo;JOHNNY R. CASH,&rdquo; and &ldquo;Cash, Johnny R,&rdquo; when translated by the <strong>fingerprint</strong> function, all end up being equivalent to &ldquo;cash johnny r.&rdquo;</p>
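<p>
	To make the idea concrete, here is a simplified Ruby sketch of a fingerprint-style key. It is our own approximation for illustration, not Refine&rsquo;s code, and it skips refinements such as normalizing accented characters.</p>

```ruby
# Simplified fingerprint key: lowercase the name, drop punctuation,
# then sort and de-duplicate the words so that word order is ignored.
def fingerprint(name)
  name.strip.downcase.gsub(/[[:punct:]]/, "").split.uniq.sort.join(" ")
end

names = ["Johnny R. Cash", "JOHNNY R. CASH", "Cash, Johnny R"]
names.each { |name| puts fingerprint(name) }
# prints cash johnny r three times
```

<p>
	Names that produce the same key land in the same cluster, which is why this method rarely groups names that are actually different.</p>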
<p>
	Refine conveniently lets you click on which of the three variations you want to settle on. Or you can choose<strong> Select All</strong>, and <strong>Merge Selected and Re-Cluster</strong>, and Refine will make the choices for you. With the <strong>fingerprint</strong> function, you can feel pretty confident that the names it clusters together are indeed equivalent.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-fingerprint.png" style="width: 780px; height: 509px; " /></p>
<p style="text-align: center; ">
	<em style="text-align: -webkit-auto; ">The fingerprint clustering function</em></p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	You can read the nitty gritty on <a href="http://code.google.com/p/google-refine/wiki/ClusteringInDepth">all of Refine&rsquo;s clustering functions here</a>. They progress from stricter to looser. For example, the <a href="http://code.google.com/p/google-refine/wiki/ClusteringInDepth#Metaphone_Fingerprint">double-metaphone function</a> groups names by how they sound. This catches variations caused by typos, such as &ldquo;Bobb Woodword&rdquo; and &ldquo;Bob Woodward&rdquo; that stricter formulas would consider to be different. However, it also considers &ldquo;Samir&rdquo; and &ldquo;Semir&rdquo; to be equivalent. So clustering isn&rsquo;t an automatic process, but at least Refine makes it as painless as possible.</p>
<p>
	The most lax algorithm, <a href="http://code.google.com/p/google-refine/wiki/ClusteringInDepth#PPM">PPM</a> (short for <strong>Prediction by Partial Matching</strong>), can match up particularly different names, though this means you have to put more effort in weeding out false positives.</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-ppm-1.png" style="width: 780px; height: 578px; " /></p>
<p style="text-align: center; ">
	<em>First pass with PPM clustering</em></p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	The first pass with PPM appears to return only real matches, pairing names that differ only in whether the doctor&rsquo;s full middle name or just the middle initial was included.</p>
<p>
	It&rsquo;s pretty easy to write a database command to match entries where first name, last name, and the first letter of the middle name are equivalent.</p>
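<p>
	Such a command might look like the Python and SQLite sketch below. The <code>payments</code> table, its column names, and the sample rows are all hypothetical; they are not the layout of any company&rsquo;s actual disclosure.</p>

```python
import sqlite3

# An in-memory database with a made-up table layout and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(first_name TEXT, middle_name TEXT, last_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        ("Jane", "Quinn", "Doe", 1000.0),
        ("JANE", "Q", "DOE", 2500.0),
        ("Jane", "R", "Doe", 400.0),
    ],
)

# Group rows whose first name, middle initial, and last name agree,
# ignoring capitalization.
QUERY = """
    SELECT lower(first_name) AS first_key,
           lower(substr(middle_name, 1, 1)) AS middle_key,
           lower(last_name) AS last_key,
           COUNT(*) AS records,
           SUM(amount) AS total
    FROM payments
    GROUP BY first_key, middle_key, last_key
    ORDER BY first_key, middle_key, last_key
"""

matches = conn.execute(QUERY).fetchall()
for row in matches:
    print(row)
```

<p>
	The query only works because the name parts already sit in separate, known columns.</p>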
<p>
	But since we were dealing with seven different companies with seven different ways of recording names, we didn&rsquo;t know that beforehand. Some companies may include the middle name in the first name field. Others may have a separate column for it, and others may omit the middle name entirely.</p>
<p>
	So Refine not only gives us a quick way to cluster names, but does so without needing to know the details of the data format beforehand. The full_name column, from which the common_name column is derived, could consist of entries that are &ldquo;LAST_NAME, FIRST_NAME&rdquo; or &ldquo;FIRST_NAME MIDDLE_NAME LAST_NAME&rdquo;; Refine doesn&rsquo;t care.</p>
<h3 style="text-align: left; ">
	&nbsp;</h3>
<h3 style="text-align: left; ">
	<strong>The Net Size</strong></h3>
<p style="text-align: left; ">
	If we increase the radius of the PPM function to 2 (the larger the number, the wider the net; read the <a href="http://code.google.com/p/google-refine/wiki/ClusteringInDepth#PPM">technical details here</a>), we start to see where Refine&rsquo;s algorithm makes inaccurate guesses:</p>
<p>
	&nbsp;</p>
<p>
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-ppm-2.png" style="width: 743px; height: 613px; " /></p>
<p style="text-align: center; ">
	<em>Looser PPM clustering</em></p>
<p>
	&nbsp;</p>
<p>
	The looser algorithm catches the likely match of Cathleen Mullarkey and Cathleen J. Mullarkey-Desapio. However, it also guesses that &ldquo;Joseph N Gritzzanti&rdquo; and &ldquo;Joseph N Ranieri&rdquo; are the same person.</p>
<p>
	In ambiguous cases, Refine makes it easy to examine the cluster yourself. In the example below, Refine has grouped &ldquo;Edward Julie&rdquo; and &ldquo;Ed Julie&rdquo; together. Is &ldquo;Ed&rdquo; short for &ldquo;Edward,&rdquo; or could it be short for &ldquo;Edmund,&rdquo; and thus refer to a completely different person?</p>
<p>
	Moving the mouse pointer over the row brings up the <strong>Browse this cluster</strong> option.</p>
<p>
	&nbsp;</p>
<p style="text-align: left; ">
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-browse-cluster-click.png" style="width: 695px; height: 344px; " /></p>
<p style="text-align: left; ">
	<em style="text-align: -webkit-auto; ">Hover your mouse over a clustered entry to bring up the option to get a closer look.</em></p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	Clicking it pops up a new browser tab with just the entries from the potential cluster. Here, we can see that Dr. &ldquo;Ed&rdquo; Julie shares the same city and consulting firm as another record that lists his name as &ldquo;Edward,&rdquo; so it seems likely the records refer to the same person.</p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	<img alt="" src="http://propublica.s3.amazonaws.com/assets/docdollars/guide/refine/refine-inside-cluster.png" style="width: 813px; height: 396px; " /></p>
<p style="text-align: center; ">
	<em style="text-align: -webkit-auto; ">Inside the cluster</em></p>
<p style="text-align: left; ">
	&nbsp;</p>
<p style="text-align: left; ">
	Refine lets you export to a delimited format or to an Excel spreadsheet, where you can do your math and graphing.</p>
<p>
	And to answer our original question &ndash; which doctors made $100,000 or more through drug company work &ndash; we simply group entries by the <strong>common_name</strong> column and sum up the amounts. <a href="http://projects.propublica.org/docdollars/top_earners">You can see our findings here</a>.</p>
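<p>
	The group-and-sum step can be sketched in a few lines of Python. This is only an illustration: the function name <code>top_earners</code>, the record layout, and the sample names and amounts are invented for the example and are not taken from the Dollars for Docs data.</p>

```python
from collections import defaultdict


def top_earners(records, threshold=100000):
    """Sum amounts per common_name and keep totals at or above the threshold."""
    totals = defaultdict(float)
    for common_name, amount in records:
        totals[common_name] += amount
    return {name: total for name, total in totals.items() if total >= threshold}


# Each record pairs a cleaned-up common_name with one payment amount.
sample_records = [
    ("doe jane q", 60000.0),
    ("doe jane q", 55000.0),
    ("roe richard", 20000.0),
]

print(top_earners(sample_records))
```

<p>
	In the sample, the two payments to the same clustered name add up past the threshold, while neither would have qualified on its own. That is why the name cleanup has to come first.</p>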
<p>
	It&rsquo;s important to point out that even after using Refine, we researched each of these identities to confirm that it was a single person. In at least one case, there was a father and son, with identical names, who were both doctors in the same geographical area. This research requires time and labor beyond what computer analysis can provide. The rest of the Dollars for Docs database <a href="http://projects.propublica.org/docdollars/payments">keeps the payment records separate</a>, as originally listed by the companies.</p>
<h3>
	&nbsp;</h3>
<h3>
	<strong>Refined Journalism</strong></h3>
<p>
	<img alt="" src="http://www.propublica.org/images/uploads/mobile/File-IBM1403controltape.jpg" style="margin-left: 5px; margin-right: 5px; margin-top: 5px; margin-bottom: 5px; float: right; width: 275px; height: 245px; " /></p>
<p>
	A major time sink in data cleaning is designing a way to easily pore through the results list, eliminate false positives, and find missed matches. Refine has this built in, and reporters with no programming experience can jump in and help clean and proofread.</p>
<p>
	Real-world data never comes as clean as we&rsquo;d like. And the tedium and difficulty of poring through messy data stifles our ability to see trends and leads. Refine&rsquo;s a great tool for speeding up the cleaning process and for getting a clear view of your data, no matter your technical aptitude.</p>
<p>
	You can download <a href="http://code.google.com/p/google-refine/">Google Refine here</a>.</p>
<p>
	For a more detailed tutorial, watch Refine developer David Huynh&rsquo;s <a href="http://google-opensource.blogspot.com/2010/11/announcing-google-refine-20-power-tool.html">excellent screencast tutorials</a>. Also, check out Refine&rsquo;s <a href="http://groups.google.com/group/google-refine">Google Group</a>, where Refine developers respond quickly to bug reports and feature requests.</p>
<p>
	<em>*Photo (right) by <a href="http://www.flickr.com/photos/8399025@N07/2355775225/">Marcin Wichary</a></em></p>
<h3>
	&nbsp;</h3>
<h3>
	<strong>The <a href="http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data">Dollars for Docs Data Guides</a></strong></h3>
<p>
	<strong>Introduction: </strong><a href="http://www.propublica.org/nerds/item/the-coders-cause-in-dollars-for-docs">The Coder&#39;s Cause</a> &ndash; Public records gathering as a programming challenge.</p>
<ol>
	<li>
		<strong>Using Google Refine to Clean Messy Data</strong> &ndash; Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/reading-flash-data">Reading Data from Flash Sites</a> &ndash; Use Firefox&#39;s Firebug plugin to discover and capture raw data sent to your browser.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide">Parsing PDFs</a> &ndash; Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/scraping-websites">Scraping HTML</a> &ndash; Write Ruby code to traverse a website and copy the data you need.</li>
	<li>
		<a href="http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick">Getting Text Out of an Image-only PDF</a> &ndash; Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).</li>
</ol>
<p>
	<link href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning" rel="syndication-source" />
</p>
<p>
	<link href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning" rel="canonical" />
</p>
]]></description> 
      <dc:date>2011-11-25T15:18:45+00:00</dc:date>
    </item>

    
    </channel>
</rss>
