Ask Carl Malamud About Shedding Light On Government Data
If you've ever tried to look up public records online, you may have run into byzantine sign-up procedures, proprietary formats, charges just to view what are ostensibly public documents, and generally the sense that you're in a snooty library with closed stacks. Carl Malamud of Public.Resource.Org has for years been forging a path through the grey goo of U.S. government data, helping to publicize the need for accessible digital archives — not just awkward, fee-per-page access. (Mother Jones calls him a "badass.") Malamud has (with help) been making it easier to get to the huge swathes of data in government sources like PACER, EDGAR, and the U.S. Patent Office. He's got a new initiative now to establish a "Federal Scanning Commission," the task of which would be to assess the scope and outcomes of a large-scale effort to actually digitize and make available online as much as practical of the vast holdings of the U.S. government. ("If we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?") Ask Malamud below questions about his plans and challenges in disseminating public information. (But please, post unrelated questions separately, lest ye be modded down.)
Be careful ... (Score:4, Insightful)
Government Warning: Exposing the government to scrutiny can result in rape charges.
Re: (Score:2, Informative)
Right. Because power corrupts, and yet we keep putting people into power and expecting them to not get corrupted. Nothing will change until we open source it. [wikipedia.org]
Re: (Score:2)
Or revert back to our instincts...
Tribalism [wikipedia.org]
Re: (Score:2)
We already have international committees for resolving disputes between countries such as NATO. Sounds exactly the same with the best of intent by participation. While we're on the subject, we can take away a lot from "The Republic" by Plato as to what at least some aspects of a perfect government would be. Free read on google.
Re: (Score:3)
Hell, even suggesting a new world currency to replace the dominance of the dollar [guardian.co.uk] can get you that.
Re: (Score:3)
I know, right? The DNA evidence that showed DSK had sex with the hotel maid and her filing rape charges had nothing to do with him getting charged with rape. It was obvious that his suggesting a replacement for the dollar caused his semen to get inside that hotel maid.
Re: (Score:2)
If you've never heard of a Honey Trap Operation [slate.com], you would make a really shitty spy. It's one of the basic tactics of any good intelligence agency.
Re: (Score:2)
Have you seen the hotel maid? If you were setting up a "Honey Trap" is she the woman you'd pick?
Re: (Score:2)
Sometimes you work with the best maid you can get. And you can't argue with success.
Re: (Score:3)
You should really give this article here [nybooks.com] a good read. It's not long and it's fascinating. Does it prove Strauss-Kahn is innocent? Not conclusively. Does it show that there's a heck of a lot more going on than something as simple as a woman claiming rape? Yeah, I'd say it does.
Re: (Score:2)
I will. Thank you for the link.
Re: (Score:2)
Government Warning: Exposing the government to scrutiny can result in rape charges.
Julian Assange did this very thing and look what it got him.
Are you aware of the GPO "fdsys" project? (Score:3)
LOC (Score:3, Interesting)
So how many GB/TB is a library of congress? :)
Or more seriously how big are you estimating? Are you using raw scans or some sort of compression (jpeg, png, ...etc)? What resolution are you using? Do you vary the resolution depending on the document?
What sort of meta data are you putting in?
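For a rough sense of scale on the raw-scan question, here is a back-of-the-envelope sketch. It assumes US Letter pages at 300 dpi and a hypothetical one million pages; those figures are illustrative assumptions, not numbers from any actual scanning plan, and real compressed files (JPEG, PNG, etc.) would be considerably smaller than these uncompressed sizes.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size and resolution: US Letter (8.5 x 11 in) at 300 dpi.
PAGE = (8.5, 11, 300)

# Bits per pixel for three common scan modes.
MODES = {"24-bit color": 24, "8-bit grayscale": 8, "1-bit bitonal": 1}

# Hypothetical collection size, for illustration only.
HYPOTHETICAL_PAGES = 1_000_000

for label, bpp in MODES.items():
    per_page = raw_scan_bytes(*PAGE, bpp)
    total_tb = per_page * HYPOTHETICAL_PAGES / 1e12
    print(f"{label}: {per_page / 1e6:.1f} MB per page, "
          f"{total_tb:.2f} TB per million pages")
```

The spread between the color and bitonal figures is why the resolution and bit-depth choices asked about above matter so much for the total.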
Happened Top Down Already (Score:5, Interesting)
Re: (Score:2)
Re:Happened Top Down Already (Score:4, Interesting)
I scour publicly available records for fun stuff all the time. I not only find it online but I also request it from government agencies (not Federal usually but local/county/etc).
In Minnesota data must be, "easily accessible for convenient use." [mn.gov] While that has specific wording related to historical records, it basically means that on recent data it must be in some sort of electronic format or otherwise easily found and presented, free of charge as long as you do it in person, to anyone who asks--even anonymously. Now. This is great in theory. Unfortunately just because it's easy for the agency to use it doesn't mean it's easy for you to use or interpret.
Let's take, for instance, bus ridership data [lazylightning.org]. It's not well organized for outsiders to read, and due to the collection methodologies (not explained to the general person who had to pay $50 to get the data in the first place) it is basically useless.
They have the data. After months of fighting with them over how much they claimed it cost (they wanted to charge me more than $300, IIRC), I got it down to $50 and got what you see above. They had already pulled it (and summarized it) for the mass media, yet wouldn't release it in a raw format.
So. It's in a format which isn't standard. Its methodology is questionable and it's expensive. So no matter the mandates, the promises, etc., the data is not terribly useful across agencies or to the public without some intermediate steps, which cost the taxpayers more than doing it right the first time around.
LOL (Score:3)
Yes. Develop Plans.
Plan: Scan everything.
Cost: A lot!
Budget: Cut.
Action: None.
regulations.gov is a good model to follow (Score:5, Interesting)
Patent Data (Score:3)
Is there any chance that all patent data will be made available in a single format (other than HTML)? The structured information in the formats is very useful, but very difficult to get to with the current system (it also costs tens of thousands of dollars to get all of the data).
Why (Score:4, Interesting)
Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?
Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?
Re: (Score:2)
Sometimes the data is just in a bad format, say completely on paper or in a specialized computer system that doesn't lend itself to easy sharing.
And the last of the list I can think of, sometimes the data might contain information which has a legitimate re
Re: (Score:3, Informative)
Having worked for the government in the recent past, I can offer a few insights...
1 - A lot of government agencies, on receiving a request for information, will kick it over to the IT department, on the grounds that "they keep the data". Unfortunately, because of the way things are structured, while the people in IT may run the disks and servers, they don't actually deal with the data... which means they either have to fight an internal battle with the people who actually manage the data, or take the path
Re: (Score:1)
Ancestry.com (Score:3, Interesting)
What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?
Not a 501(c)(3) (Score:3)
I am all for open data, and I like what they do, but Ancestry.com should not be a 501(c)(3). It's for profit. Its purpose is to make money.
If they were dealing strictly with public data then I would have no bones to pick if the U.S. government moved into their business and started to offer the information for free. (well, small bones. We are running a deficit, not sure this ranks on my top 10 list for this decade, but that’s a different debate).
What they do is combine multiple government databa
Re: (Score:2)
Ancestry.com can do what it wants, but there's no obligation for the government to preserve its business model by failing to make the data easily accessible itself.
Re: (Score:2)
You misunderstood: I said there's no obligation for the government not to make the data easily accessible in order to prop up Ancestry.com's business model. If improved data access screws over Ancestry.com, too bad for them.
Who is the worst? (Score:5, Interesting)
Scanning ? (Score:3, Interesting)
How to get more attention to (Score:4, Interesting)
Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:
http://federalregister.gov/a/2011-28623 [federalregister.gov]
http://federalregister.gov/a/2011-28621 [federalregister.gov]
I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?
(note -- the second one on data access doesn't close 'til Jan 12th; NSF also has a similar RFC that closes Jan 18th [nsf.gov])
Idea (Score:4, Interesting)
The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.
After you free all of this information and make it available to the public (as it should be), then what? What do you expect for the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.
Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.
data.gov (Score:3)
The 2011 update to data.gov [data.gov] actually allows whoever is submitting the data to describe it such that people can make use of it, including via visualization (maps, graphs, etc.) or via API to make custom applications.
So my question for Carl would be : What can we do to get more government agencies to actually put their data in there? And if they won't do it, should resource.org or similar groups work to put up something similar, so that people who have gotten information through FOIA can share it back out to
Re: (Score:2)
Just do the gritty work, immediately (Score:3)
Make sure you don't get stuck in a standardization process where the aim is to bring different formats together before any data is entered.
Some formats are incompatible today and will be forever.
The big issue is that such a process will NEVER go anywhere, cost a ridiculous amount of both money and time, with no result in sight, ever.
Yes, I have seen those processes from a closer range than I wish to remember. Big in-house, between-house, between-block, between-county fights that led to no data ever being entered.
Just do the gritty work immediately. Don't insist on OCRing everything, just scan it as plain images, as much as you can. Then, if the money is there, consider OCR.
IGNORE anything that sounds like an untested high-tech solution. Use well established technology, like high performance scanners etc. if it gets the initial job done, entering those damn documents into the computers.
Look at Google! They did almost all books in the world in just a few years. Did they bother with converting 16th century typesetting into Times New Roman or something similar? Of course not.
Scan on!
Pacer Problems (Score:2)
How much difficulty do you anticipate in getting and publishing records in Pacer? If there's one system that should be free, it's the decisions that our courts make, and yet you are charged by the page just to view the results. Are you concerned about a court taking an unkind view of your archiving what is in Pacer?
Re: (Score:2)
You can't always make all of the documents that the decision was based on available to anyone. Originally the courts didn't have a good separation between private and public information. This means that you can't actually do heads-down scanning. To do it right, each document must be checked to verify that it contains no private information. In theory you can just redact that information, but it takes a huge amount of time, and there are not enough people or money to do it. You also have the problem that someone
Re: (Score:2)
I wish they taught that in Civics courses. The Library of Congress serves Congress, which in turn serves corporations.
Re: (Score:3)
The Library of Congress does serve Congress. First. Then it serves the broader US Government. Then it serves the public. [loc.gov]
Privately Owned, Copyrighted Law (Score:2)
Re: (Score:2)
The way it ought to work is that if a government body adopts your standard, then you should lose your copyright on it. The copyright was only granted at the whim of the government in the first place, after all.
Encouraging Governments? (Score:4, Interesting)
In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.
What concrete actions can citizens confronting such situations take to encourage accessibility and accountability?
Can the rare books collections be digitized? (Score:5, Interesting)
Three closely related questions about the rare books collections at the Library of Congress:
1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).
2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?
3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.
Re: (Score:3)
PDF/A.
If you work for the government, you should be asking NARA, not Slashdot. If you didn't know this, your records officer should.
What do you think of corporate partnerships? (Score:2)
I'd like to know what Malamud thinks about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases...?) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?
Question for Carl: Real time legislation drafting (Score:2)
Carl: would it be possible to implement a system that would allow real-time and continuous review of legislation while it's being drafted? Much has been made over the past three years about legislation being available for review before voting by the House or Senate. The final draft for review is usually a huge PDF that makes it nearly impossible for citizens, interest groups, and the media to analyze it thoroughly in time.
Could you improve on USDA pdf's back to 1925? (Score:1)
In the past 6 months, USDA has made available past agriculture censuses,
now back to 1925.
http://agcensus.mannlib.cornell.edu/AgCensus/homepage.do
However, while these are searchable pdf's,
there appears to be no quality control, so errors appear not in the image but in the underlying searchable data.
In some sense, the searchability is a mere bonus of the scanning software used;
although for such pdf's, your own OCR software could create this searchability.
Since you can't import these i