A Data Point on Every Block
An Interview with Adrian Holovaty
The first time you try to describe EveryBlock to someone, it can sound kinda boring. It aggregates piles of local information, like restaurant reviews and crime stats, which are then displayed block-by-block. Hm, that's interesting, but is it compelling?
If you give it some time, the answer is absolutely. Once you start playing with the site (and "playing" might be the best word to describe the meandering sensation of floating around in the data pools), your mind begins to wander with speculation: How did they get that? What does this say about my neighborhood? What else could be done with all this data? How can I add to this?
Those were just some of the many questions I had about EveryBlock, which launched a few weeks ago with the help of a $1.1 million Knight News Challenge grant. A few stories and interviews popped up when the site launched, but I noticed that the interviewers seldom asked the other questions that I had about the site. So I decided to ask the site's founder, Adrian Holovaty, some questions directly. Here's our exchange:
Last year, New York City famously banned trans fats in restaurants. I found a page on EveryBlock that shows all the violations of this ban -- several every day! I love these little hidden narratives inside of EveryBlock. Do you have any favorites?
Great question. Here are a few interesting nuggets:
- San Francisco public housing listings by accessibility status. Just over 90% of the public housing listings posted to the San Francisco city site are not accessible.
- Chicago has a special business license type of "Wrigley Field," which applies to the famous rooftop decks across the street from the park.
- Many elevator violations are reported in New York -- more than any other type of building violation that we tabulate.
- Building permits for bath houses in San Francisco. You know, just because.
Also, more generally, it's fascinating to follow address-specific breaking news/events on our site. For example, a couple of weeks ago, a water main broke on the north side of Chicago. Afterward, on the relevant EveryBlock pages -- for example, Ravenswood or the 1800 block of W. Montrose -- you could see a bunch of assorted news items about the incident: newspaper articles from the Trib and Sun-Times, TV station reports and Flickr photos of the torn-up street that were taken by some people who happen to live nearby. Each of those "raw" chunks of information was displayed in the timeline of news for that block.
We've seen a similar thing happen with trendy new restaurants. First you see the business license, then (possibly) the liquor license application a few days later, then the restaurant inspection, then a Yelp review or two, then a writeup by the newspaper's dining critic. The story slowly unfolds over time.
One of our post-launch priorities is to clean up the fire-hose of raw information, to introduce concepts of priority and improved relevance -- but I do think there's a certain appeal to that raw dump of "here's everything that's happened around this address, in simple, reverse-chronological order." When significant events happen, they sort of "pop out" of the list.
Can you talk a little bit about what you're doing behind-the-scenes? Are you using Django as a framework?
Sure. The first layer is the army of scripts that compile data from all over the Web. This includes public APIs, private APIs, screen-scraping the "deep Web," crawling news sites, plus harvesting data from PDFs and other non-Web-friendly documents. Some data also comes to us manually, like in spreadsheets e-mailed to us on a weekly basis. For each bit of data, we determine geographic relevance and normalize it so that it fits into our system.
The second layer is the data storage layer, which we built in a way that can handle an arbitrary number of data types, each with arbitrary attributes. For example, a restaurant inspection has a violation (or multiple violations), whereas a crime has a crime type (e.g., homicide). Of course, we want to be able to query across that whole database to get a geographic "slice," so there's a strong geo focus baked into everything.
The next layer is the Web layer, which is standard Django. Oh, and I should mention that we use Python for everything, from the ground up.
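To make the storage idea above concrete, here is a minimal sketch in plain Python of one record shape that can hold any data type with its own attributes and still be queried by location. It is not EveryBlock's actual schema: the names NewsItem, data_type, attributes and geographic_slice are hypothetical, and the sample records are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List


@dataclass
class NewsItem:
    """One normalized record. All field names here are hypothetical."""
    data_type: str    # e.g. "restaurant-inspection" or "crime"
    item_date: date
    block: str        # normalized location, e.g. "1800 W. Montrose"
    # Type-specific attributes live in a free-form mapping, so a new
    # data type does not require a new record shape.
    attributes: Dict[str, Any] = field(default_factory=dict)


def geographic_slice(items: List[NewsItem], block: str) -> List[NewsItem]:
    """Return every item for one block, newest first, whatever its data type."""
    matching = [item for item in items if item.block == block]
    return sorted(matching, key=lambda item: item.item_date, reverse=True)


# Invented sample data, for illustration only.
items = [
    NewsItem("restaurant-inspection", date(2008, 2, 1), "1800 W. Montrose",
             {"violations": ["no hot water"]}),
    NewsItem("crime", date(2008, 2, 3), "1800 W. Montrose",
             {"crime_type": "burglary"}),
    NewsItem("crime", date(2008, 2, 2), "100 N. State",
             {"crime_type": "theft"}),
]
timeline = geographic_slice(items, "1800 W. Montrose")
```

Keeping the type-specific fields in a mapping is one way to get "arbitrary attributes"; a real system would more likely store locations as geometries and filter with spatial queries rather than by matching a block name.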
What has been the hardest piece to accomplish so far?
I honestly can't decide what the hardest piece has been. A number of pieces were all hard to pull off in their own way.
The user interface was, and continues to be, a challenge. How do you display so many disparate pieces of data together, without overwhelming people? How do you account for the variety of distinct data types? (That's both a user-interface and a backend challenge.) How do you maintain visual interest when dealing with so much raw textual data? How do you make the block page feel like a geographic home page rather than a search result? Wilson, our designer, has done a great job within these constraints, but we all agree there's still much room for experimentation and gradual improvement.
Dealing with structured data is relatively easy, but attempting to determine structure from unstructured data is a challenge. The main example of unstructured data parsing is our geocoding of news articles. We do a pretty good job here, but we're not crawling all of the sources we want to crawl -- again, there's a lot of room to grow.
On a completely different note, it's been a challenge to acquire data from governments. We (namely Dan, our People Person) have been working since July to request formal data feeds from various agencies, and we've run into many roadblocks there, from the political to the technical. We expected that, of course, but the expectation doesn't make it any less of a challenge.
How much of your data aggregation is scraping html pages versus getting structured data?
At this point, we're doing more scraping than consuming formal APIs and data feeds, but I expect (and hope) the balance will shift over time. It's been tricky explaining our concept to data providers in government, but we're hoping that gets easier now that we have a public site that people can browse and understand.
Do you have any fears of scaling the system?
Yes and no. We knew from the start that EveryBlock isn't something that can be scaled overnight to every city in the world. There are too many special cases, too many relationships to build, too many local quirks to work out. There's no nationwide database of restaurant inspections or building permits that we can magically tap into; every city is different. Aggregating local information is a deep, difficult problem.
Some companies try to scale pieces of what we're doing -- like geocoding every news story in the U.S., or making maps of blog entries, or aggregating crime, or aggregating restaurant inspections -- but we're the first ones to do all of that. That's why we're taking a depth, not a breadth, approach: I'd much rather do three cities well than 1,000 cities poorly.
Rather than use Google Maps or Microsoft's Virtual Earth, you built your own mapping service application. Why?
That, along with "When will you bring EveryBlock to city XXX?", is by far the most frequently asked question we get. Paul, our developer in charge of maps, is working on an article explaining our reasoning, so I don't want to steal his thunder. I'll just say that the existing free maps APIs are optimized for driving directions and wayfinding, not for data visualization. And, besides, having non-clichéd maps is an easy way to set yourself apart. Google Maps is so 2005. ;-)
How hard was it to build?
We use an open-source library called Mapnik to render the maps, so that library does the heavy lifting for us. Paul is also working on a how-to article, in the spirit of giving back to the open-source community, that explains how to use Mapnik.
In many ways, what you're doing is taking a bunch of data sources and normalizing them for a single use case. Now that it's normalized, I imagine developers could do a ton of interesting things with this data. Are there plans to do an API?
Yes, I strongly suspect we'll have an API eventually -- it's one of the many things on our site wish list. We had to draw a line and call the thing "ready" at some point, so despite the fact that we're launched, we've got hundreds more features and data sources to add.
I was talking to someone recently about all the cool mashups you could do, and we decided that looking for patterns between Republicans and sex offenders would be the best!
Beyond the technical difficulties of creating parsers and algorithms for geotagging this data, have you had any political/legal obstacles? Is there data you'd like to get your hands on but can't for some reason?
Yes, and yes. I'd estimate we only have about 10% of the data we'd like in the long term, for Chicago, New York and San Francisco. As we expected, some government agencies haven't been able to provide us their public data, and the reasons vary. A common reason is a lack of resources. In other cases, we've simply been stymied by bureaucracy. But we're keeping at it.
An obvious example of data that's EveryBlocky (EveryBlockish? Um, location-specific?) but not yet on our site is the set of recent home sales -- lots of local relevance there. Of course, we're a news site, not a real-estate site, so it'll be interesting managing people's expectations about what real-estate data and features we offer.
I'd like to even out the three cities' data offerings, too. We publish building permits in San Francisco and New York, but not in Chicago. We publish filming locations in Chicago, but not in New York or San Francisco. We publish zoning agenda items in San Francisco, but not in the other two cities.
We're also working on improving the data we already have. An example is crime in San Francisco. After running into some problems when we requested a formal data feed from the SFPD directly, we now get the data by screen-scraping the SFPD's site -- but that site doesn't publish the location of each crime. In fact, the only location data the SFPD site publishes is implicit in the searches you do. The site lets you search for crimes by police district, ZIP code or neighborhood, so the best we can do is to deduce the police district, ZIP code and neighborhood that contain a particular crime. (If you search for ZIP code 94109, you can safely assume the resulting crimes are in that ZIP code.)
That's why San Francisco crime on EveryBlock, lamely, only geocodes crimes to the ZIP code level: because that's the only data we could get, and something is better than nothing. But, anyway, we're hoping the SFPD will release more granular locations in their crime data.
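The deduction described above, where a record's location is inferred from the search that returned it, can be sketched roughly as follows. This is an illustration and not EveryBlock's scraper: the function name infer_locations, the (kind, value) search keys and the record IDs are all hypothetical, and the searches are assumed to have been run and collected already.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def infer_locations(
    search_results: Dict[Tuple[str, str], Iterable[str]]
) -> Dict[str, Dict[str, str]]:
    """Invert search results into per-record location hints.

    search_results maps a search such as ("zip", "94109") to the IDs of the
    records that search returned. Any record returned by a search is assumed
    to lie inside the area that was searched.
    """
    inferred: Dict[str, Dict[str, str]] = defaultdict(dict)
    for (kind, value), record_ids in search_results.items():
        for record_id in record_ids:
            inferred[record_id][kind] = value
    return dict(inferred)


# Invented record IDs and search results, for illustration only.
results = {
    ("zip", "94109"): ["crime-1", "crime-2"],
    ("district", "Northern"): ["crime-1"],
}
locations = infer_locations(results)
```

Each record ends up with only as much location detail as the searches that returned it, which is why the resulting geocoding can be no finer than the search granularity.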
You've mentioned your hope that EveryBlock could introduce some standards for news organizations to do geotagging. I'm sure you've discovered whole swaths of civic data that could use standardization. Can you talk a little bit about what you want to do in this area?
The standards we're thinking about are related to the geotagging of unstructured data -- namely, news articles. I guess there'd be some value in standardizing approaches to structured data (like, building a nationwide crime database), but we're more immediately interested in standardizing the geocoding of "blobs." The main premise is that locations in news articles should be defined in a machine-readable way. Look for something from us soon.
EveryBlock lets me find everything in my neighborhood... except other people. Why is that? Do you have any plans to incorporate direct input of local voices into the site?
In time, Rex. In time. :-)
If we'd launched with awesome reader-contributed content features, that's all that people would be talking about. "EveryBlock: a user-generated news site!" People are very quick to make judgments about a Web site, pigeonholing it into some generic "user-generated" or "Web 2.0" bucket. I wanted to send the message that our focus is on providing a newspaper for your block. The tone was set. Any subsequent features that we add -- whether they involve local voices or not -- are in support of that core goal.
Besides, we already have the problem of offering so many interesting data sets and features that people can only focus on one or two of them. The classic example is that a lot of people haven't noticed that we rolled our own maps (your question above notwithstanding).
I know you constantly get asked the question about scaling the site to other local areas, but here's an idea: say I'm an enterprising small town citizen who's willing to plug in data from my city by matching data to similar fields that you are using. Possible?
Yes, that's possible -- we've built the system in a way that would allow that to happen. Again, as in my response to your reader-generated content question, it's just a matter of implementing it. We had to launch with something, and if we'd included every one of our ideas in the launch version, we'd be on target for a launch in mid 2017. :-)
One of the obligations of the Knight grant is to make all the source code available. Does that affect how you think about the site as an asset?
The open-source requirement affects both our technology and business decisions. We've engineered the thing so that it can be replicated in any area, with any data. I suppose we would've done that anyway, even without the open-source requirement, because it's just the Right Way to do it, but the open-source requirement certainly influenced us.
I'll paraphrase something really smart that Wilson, our designer, said recently: We've created a machine that's capable of publishing address-specific news, and our initial launch is a demonstration of its potential. Now that we're live, it's time to improve the machine and improve the demonstration.
On the business side, clearly we'll have to figure out how the site is going to sustain itself after our grant money is spent. I have a feeling some solution will make itself apparent at some point over the next year and a half. But even before that, we'll find out whether our idea is something that catches on with our audience -- this whole thing is an experiment, after all! For all we know, EveryBlock might be a novelty that doesn't sustain an audience in the long term. Being honest Chicago people, happily far away from the Silicon Valley BS, we have no delusions of grandeur.
I liked your answer to whether EveryBlock constitutes journalism in the OJR interview ("People can define 'journalism' however they'd like"). I'm curious, do you have traffic goals for the site? Or let me ask it a different way: how are you evaluating success?
This is cheesy, but I aim to help people, or improve the world in some way. The tricky thing is that there aren't many concrete ways of measuring that, aside from anecdotes. I suppose we could look at traffic numbers, but, no, we haven't set any traffic goals.
Okay, last question. It's a weird one. Your interest in gypsy jazz is well known. (The last time I saw you, it was in a Toronto bar that supposedly had a jazz scene, but was actually a frat bar. We were both gravely disappointed.) Do you ever think about the relationships between your musical interest and your programming/information interests? Is there anything -- structural, cognitive, performative, whatever -- that makes EveryBlock similar to Django Reinhardt?
Wow, a weird question indeed! Hmm. I guess that, in both music and programming, I strive for subtlety, for elegance.
And EveryBlock cannot be compared to Django Reinhardt. That's sacrilege.