Government Politics Your Rights Online

Ask Carl Malamud About Shedding Light On Government Data 59

If you've ever tried to look up public records online, you may have run into byzantine sign-up procedures, proprietary formats, charges just to view what are ostensibly public documents, and generally the sense that you're in a snooty library with closed stacks. Carl Malamud of Public.Resource.Org has for years been forging a path through the grey goo of U.S. government data, helping to publicize the need for accessible digital archives — not just awkward, fee-per-page access. (Mother Jones calls him a "badass.") Malamud has (with help) been making it easier to get to the huge swathes of data in government sources like PACER, EDGAR, and the U.S. Patent Office. He's got a new initiative now to establish a "Federal Scanning Commission," the task of which would be to assess the scope and outcomes of a large-scale effort to actually digitize and make available online as much as practical of the vast holdings of the U.S. government. ("If we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?") Ask Malamud below questions about his plans and challenges in disseminating public information. (But please, post unrelated questions separately, lest ye be modded down.)

  • Be careful ... (Score:4, Insightful)

    by anagama ( 611277 ) <obamaisaneocon@nothingchanged.org> on Wednesday January 04, 2012 @02:38PM (#38588008) Homepage

    Government Warning: Exposing the government to scrutiny can result in rape charges.

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Right. Because power corrupts, and yet we keep putting people into power and expecting them to not get corrupted. Nothing will change until we open source it. [wikipedia.org]

      • Or revert back to our instincts...

        Tribalism [wikipedia.org]

    • by elrous0 ( 869638 ) *
      • Hell, even suggesting a new world currency to replace the dominance of the dollar [guardian.co.uk] can get you that.

        I know, right? The DNA evidence that showed DSK had sex with the hotel maid and her filing rape charges had nothing to do with him getting charged with rape. It was obvious that his suggesting a replacement for the dollar caused his semen to get inside that hotel maid.

           

        • by elrous0 ( 869638 ) *

          If you've never heard of a Honey Trap Operation [slate.com], you would make a really shitty spy. It's one of the basic tactics of any good intelligence agency.

          • If you've never heard of a Honey Trap Operation, you would make a really shitty spy. It's one of the basic tactics of any good intelligence agency.

            Have you seen the hotel maid? If you were setting up a "Honey Trap" is she the woman you'd pick?

        • You should really give this article here [nybooks.com] a good read. It's not long and it's fascinating. Does it prove Strauss-Kahn is innocent? Not conclusively. Does it show that there's a heck of a lot more going on than something as simple as a woman claiming rape? Yeah, I'd say it does.

          • You should really give this article here a good read. It's not long and it's fascinating.

            I will. Thank you for the link.

    • by slick7 ( 1703596 )

      Government Warning: Exposing the government to scrutiny can result in rape charges.

      Julian Assange did this very thing and look what it got him.

  • by Arrogant-Bastard ( 141720 ) on Wednesday January 04, 2012 @02:39PM (#38588020)
    (I'm guessing "yes") If you are, what do you think about the work they've done?
  • LOC (Score:3, Interesting)

    by Anonymous Coward on Wednesday January 04, 2012 @02:41PM (#38588048)

    So how many GB/TB is a library of congress? :)

    Or more seriously how big are you estimating? Are you using raw scans or some sort of compression (jpeg, png, ...etc)? What resolution are you using? Do you vary the resolution depending on the document?

    What sort of meta data are you putting in?
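    The size question above comes down to simple arithmetic once page count, resolution, bit depth, and compression are fixed. Below is a rough sketch of that calculation; every default (letter-size pages, 300 dpi, 8-bit grayscale) and the page counts in the usage note are illustrative assumptions, not figures from the project or the Library of Congress.

```python
def scan_bytes(pages, width_in=8.5, height_in=11.0, dpi=300,
               bits_per_pixel=8, compression_ratio=1.0):
    """Estimate storage for scanned pages, in bytes.

    All defaults are illustrative assumptions: letter-size pages,
    300 dpi, 8-bit grayscale, no compression.
    """
    # Pixels per page at the chosen resolution.
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Uncompressed bytes per page.
    raw_bytes_per_page = pixels * bits_per_pixel / 8
    # compression_ratio is raw size divided by stored size (10.0 means 10:1).
    return pages * raw_bytes_per_page / compression_ratio


if __name__ == "__main__":
    one_page = scan_bytes(1)
    million_pages = scan_bytes(1_000_000, compression_ratio=10.0)
    print(f"one raw page: {one_page / 1e6:.2f} MB")
    print(f"1M pages at an assumed 10:1 ratio: {million_pages / 1e9:.1f} GB")
```

    Under these assumptions one uncompressed page is about 8.4 MB, so the answer swings by orders of magnitude depending on resolution and compression, which is why those questions matter.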

  • by jimmerz28 ( 1928616 ) on Wednesday January 04, 2012 @02:44PM (#38588078)
    Didn't Obama already mandate that all government agencies must digitize their records and develop plans within 4 months? http://www.simplysecurity.com/2011/12/28/obama-administration-pushes-for-digital-records-management-overhaul/ [simplysecurity.com]
    • Is Guantanamo closed? He signed that order on his first day in office three years ago. Clearly, Obama's dictates do not carry much weight.
    • by garcia ( 6573 ) on Wednesday January 04, 2012 @03:12PM (#38588350)

      I scour publicly available records for fun stuff all the time. I not only find it online but I also request it from government agencies (not Federal usually but local/county/etc).

      In Minnesota data must be, "easily accessible for convenient use." [mn.gov] While that has specific wording related to historical records, it basically means that on recent data it must be in some sort of electronic format or otherwise easily found and presented, free of charge as long as you do it in person, to anyone who asks--even anonymously. Now. This is great in theory. Unfortunately just because it's easy for the agency to use it doesn't mean it's easy for you to use or interpret.

      Take, for instance, bus ridership data [lazylightning.org]. It's not well organized for outsiders to read and, due to collection methodologies (not explained to the general person who had to pay $50 to get the data in the first place), is basically useless.

      They have the data and after months of fighting with them for how much they claimed it cost (they wanted to charge me more than $300 IIRC) I got it down to $50 and got what you see above even though they already pulled it (and summarized it) for the mass media but wouldn't release it in a raw format.

      So. It's in a format which isn't standard. Its methodology is questionable and it's expensive. So no matter the mandates, the promises, etc, the data is not terribly useful across agencies or to the public without some intermediate steps which cost the taxpayers more than doing it right the first time around.

    • Yes. Develop Plans.

      Plan: Scan everything.
      Cost: A lot!
      Budget: Cut.
      Action: None.

  • by hyeprofile ( 1851598 ) on Wednesday January 04, 2012 @02:44PM (#38588082)
    The US actually does a good job with sharing data on regulations and rulemaking on regulations.gov. You can pretty much search any of the regulatory dockets from most departments, and even access public comments and supporting material. You can even take advantage of regulatory policy updates and eRulemaking Program activities on your Twitter stream. Wouldn't this be a good model to follow to systematically publish everything online? I'm thinking publishing everything online on a government website would make for a great summer job for students, and help boost the economy and employment stats, no?
  • by andymadigan ( 792996 ) <amadigan&gmail,com> on Wednesday January 04, 2012 @02:44PM (#38588090)
    Speaking from experience, the digitized (with text available, not just scanned images) USPTO patent data comes in 4 formats. The oldest format looks like it was based on 'cards', the second format was SGML, the third was a bizarre XML format based on the SGML format, and the current format is based on alterations to international standards. When my former employer wanted to analyze this data, I needed to write parsers for each one.

    Is there any chance that all patent data will be made available in a single format (other than HTML)? The structured information in the formats is very useful, but very difficult to get to with the current system (it also costs tens of thousands of dollars to get all of the data).
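    The "one parser per format" situation described above is commonly handled by detecting the format and dispatching to a parser that emits one shared record shape. Here is a minimal sketch of that pattern. The field codes (`PNO`, `TTL`), the XML element names, and the two-format split are all hypothetical stand-ins for illustration; they are not the actual USPTO layouts.

```python
import xml.etree.ElementTree as ET


def parse_tagged_text(raw):
    """Hypothetical card-style format: one 'CODE value' pair per line."""
    fields = {}
    for line in raw.splitlines():
        code, _, value = line.partition(" ")
        if code and value.strip():
            fields[code] = value.strip()
    return {"number": fields.get("PNO"), "title": fields.get("TTL")}


def parse_xml(raw):
    """Hypothetical XML format with <number> and <title> children."""
    root = ET.fromstring(raw.strip())
    return {"number": root.findtext("number"), "title": root.findtext("title")}


def detect_format(raw):
    """Crude sniffing: anything that starts with '<' is treated as XML."""
    return "xml" if raw.lstrip().startswith("<") else "tagged"


# One entry per source format; every parser returns the same dict shape.
PARSERS = {"xml": parse_xml, "tagged": parse_tagged_text}


def normalize(raw):
    """Route a raw record to the matching parser and return a common record."""
    return PARSERS[detect_format(raw)](raw)


if __name__ == "__main__":
    print(normalize("PNO 1234567\nTTL Widget"))
    print(normalize("<patent><number>1234567</number><title>Widget</title></patent>"))
```

    Supporting another format then means adding one parser and one `PARSERS` entry, while downstream analysis code only ever sees the common record.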
  • Why (Score:4, Interesting)

    by CanHasDIY ( 1672858 ) on Wednesday January 04, 2012 @02:46PM (#38588100) Homepage Journal

    Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?

    Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?

    • by khallow ( 566160 )
      I can think of several overlapping reasons off the top of my head. The least charitable cause is that someone doesn't want the information released. Sometimes the information is embarrassing or risky to someone in a position of power or a bureaucracy.

      Sometimes the data is just in a bad format, say completely on paper or in a specialized computer system that doesn't lend itself to easy sharing.

      And the last of the list I can think of, sometimes the data might contain information which has a legitimate re
    • Re: (Score:3, Informative)

      by Anonymous Coward

      Having worked for the government in the recent past, I can offer a few insights...

      1 - A lot of government agencies, on receiving a request for information, will kick it over to the IT department, on the grounds that "they keep the data". Unfortunately, because of the way things are structured, while the people in IT may run the disks and servers, they don't actually deal with the data... which means they either have to fight an internal battle with the people who actually manage the data, or take the path

    • I currently work in an IT capacity for a municipal agency dealing with building codes, health and safety inspections and planning, and I receive the bulk of my agency's public records requests, so I can offer my $0.02 on this. As another poster pointed out, I don't normally deal with the data, so when I receive a request, the first thing I have to do is research/discuss with people in the know what it is the request even says. Then dig through the database and determine if it exists and how to get it out. Fin
  • Ancestry.com (Score:3, Interesting)

    by Anonymous Coward on Wednesday January 04, 2012 @02:51PM (#38588158)

    What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?

    • I am all for open data, and I like what they do, but Ancestry.com should not be a 501(c)(3). It's for profit. Its purpose is to make money.

      If they were dealing strictly with public data then I would have no bones to pick if the U.S. government moved into their business and started to offer the information for free. (well, small bones. We are running a deficit, not sure this ranks on my top 10 list for this decade, but that’s a different debate).

      What they do is combine multiple government databa

    • Ancestry.com can do what it wants, but there's no obligation for the government to preserve its business model by failing to make the data easily accessible itself.

  • Who is the worst? (Score:5, Interesting)

    by TheBrez ( 1748 ) <brez@brezworks.com> on Wednesday January 04, 2012 @02:51PM (#38588160) Homepage
    Which government agency is the worst to get information from?
  • Scanning ? (Score:3, Interesting)

    by SoothingMist ( 1517119 ) on Wednesday January 04, 2012 @02:53PM (#38588184)
    By "scanning", what do you mean? Are we talking about searchable records or just a bunch of images? If searchable, what quality control is going to be provided? As someone who has re-published books that are out of copyright, it takes a lot of quality control to ensure a usable product. Unless high-quality searchable records in a solid database are the end result, the project is not worth funding, in my personal opinion.
  • by oneiros27 ( 46144 ) on Wednesday January 04, 2012 @02:56PM (#38588208) Homepage

    Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:

    http://federalregister.gov/a/2011-28623 [federalregister.gov]
    http://federalregister.gov/a/2011-28621 [federalregister.gov]

    I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?

    (note -- the second one on data access doesn't close 'til Jan 12th; NSF also has a similar RFC that closes Jan 18th [nsf.gov])

  • Idea (Score:4, Interesting)

    by hardwarejunkie9 ( 878942 ) on Wednesday January 04, 2012 @03:06PM (#38588298)
    Something has been rattling around my head in recent days on this topic and now I think it's a proper time to let it out.

    The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.

    After you free all of this information and make it available to the public (as it should be), then what? What do you expect the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.

    Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.

    • The 2011 update to data.gov [data.gov] actually allows whoever is submitting the data to describe it such that people can make use of it, including via visualization (maps, graphs, etc.) or via API to make custom applications.

      So my question for Carl would be : What can we do to get more government agencies to actually put their data in there? And if they won't do it, should resource.org or similar groups work to put up something similar, so that people who have gotten information through FOIA can share it back out to

    • This is a great idea--you should do it!
  • by G3ckoG33k ( 647276 ) on Wednesday January 04, 2012 @03:11PM (#38588334)

    Make sure you don't get stuck in a standardization process where the aim is to bring different formats together before data is entered.

    Some formats are incompatible today and will be forever.

    The big issue is that such a process will NEVER go anywhere, cost a ridiculous amount of both money and time, with no result in sight, ever.

    Yes, I have seen those processes from a closer range than I wish to remember. Big in-house, between-house, between-block, between-county fights that led to no data ever being entered.

    Just do the gritty work immediately. Don't insist on OCRing everything, just scan it as plain images, as much as you can. Then, if the money is there, consider OCR.

    IGNORE anything that sounds like an untested high-tech solution. Use well established technology, like high performance scanners etc. if it gets the initial job done, entering those damn documents into the computers.

    Look at Google! They did almost all books in the world in just a few years. Did they bother with converting 16th-century typesetting into Times New Roman or something similar? Of course not.

    Scan on!

  • How much difficulty do you anticipate in getting and publishing records in Pacer? If there's one system that should be free it is the decisions that our courts make, and yet you are charged by the page just to view the results. Are you concerned about a court taking an unkind view on your archiving what is in Pacer?

    • You can't always make all of the documents that the decision was based on available to anyone. Originally the courts didn't have a good separation between private and public information. This means that you can't actually do heads down scanning. To do it right, each document must be verified that it contains no private information. In theory you can just redact that information, but it takes a huge amount of time, and there are not enough people or money to do it. You also have the problem that someone

  • I think I have read that the law itself cannot be copyrighted and it should be possible to make it available to everyone. But as a techie who drafts standards and specifications, I was wondering about how far this goes--especially since Congress recently proposed enacting some of our standards into law. (They decided not to, but they read some parts into the committee records as they debated.) Can you still accomplish your project if a governmental body adopts (or considers adopting) a privatel
    • The way it ought to work is that if a government body adopts your standard, then you should lose your copyright on it. The copyright was only granted at the whim of the government in the first place, after all.

  • by theNAM666 ( 179776 ) on Wednesday January 04, 2012 @03:58PM (#38588790)

    In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.

    What concrete actions can citizens confronting such situations take to encourage accessibility and accountability?

  • by autophile ( 640621 ) on Wednesday January 04, 2012 @04:02PM (#38588842)

    Three closely related questions about the rare books collections at the Library of Congress:

    1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).

    2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?

    3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.

  • I'd like to know what Malamud thinks about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases...?) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?

  • Carl: would it be possible to implement a system that would allow real-time and continuous review of legislation while it's being drafted? Much has been made over the past three years about legislation being available for review before voting by the House or Senate. The final draft for review usually is a huge PDF that makes it near impossible for citizens, interest groups, and the media to thoroughly analyze in time.

  • In the past 6 months, USDA has made available past agriculture censuses,
    now back to 1925.
    http://agcensus.mannlib.cornell.edu/AgCensus/homepage.do
    However, while these are searchable PDFs,
    there appears to be no quality control, so errors appear not in the image but in the underlying searchable data.
    In some sense, the searchability is a mere bonus of the scanning software used;
    although for such PDFs, your own OCR software could create this searchability.
    Since you can't import these i
