I am doing a side project on investing and need financial statements: balance sheets, income statements, and cash flow statements. The SEC's EDGAR tool is terrible, and Google and Yahoo are missing this information for a lot of companies. I only need NYSE data from yearly and quarterly reports.<p>I've scraped Google Finance for this, but I ended up with a lot of invalid rows in my database.<p>Thanks all.
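On the invalid rows: one cheap sanity check for scraped balance sheets is the accounting identity, total assets = total liabilities + total equity. Here is a minimal sketch, assuming each row is a dict. The field names (`total_assets`, `total_liabilities`, `total_equity`) and the 1% tolerance are made up for illustration, so adjust them to whatever your schema actually stores.

```python
def invalid_reason(row, tolerance=0.01):
    """Return a short reason if a balance sheet row looks invalid, else None.

    The field names and the relative tolerance are illustrative assumptions,
    not something any particular data source guarantees.
    """
    required = ("total_assets", "total_liabilities", "total_equity")
    for field in required:
        if row.get(field) is None:
            return f"missing {field}"

    assets = row["total_assets"]
    expected = row["total_liabilities"] + row["total_equity"]
    # Allow a small relative difference for rounding in the source data.
    if abs(assets - expected) > tolerance * max(abs(assets), 1.0):
        return "assets != liabilities + equity"
    return None


def split_rows(rows):
    """Partition rows into (valid, rejected), keeping the reason for each reject."""
    valid, rejected = [], []
    for row in rows:
        reason = invalid_reason(row)
        if reason is None:
            valid.append(row)
        else:
            rejected.append((row, reason))
    return valid, rejected
```

Keeping the rejected rows together with a reason, rather than dropping them, makes it easier to tell whether the scraper or the source page is at fault.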
<a href="http://code.google.com/apis/finance/" rel="nofollow">http://code.google.com/apis/finance/</a>
<a href="http://kottke.org/09/06/online-financial-data-apis-and-resources" rel="nofollow">http://kottke.org/09/06/online-financial-data-apis-and-resou...</a>
<a href="http://stackoverflow.com/questions/417453/best-most-comprehensive-api-for-stocks-financial-data" rel="nofollow">http://stackoverflow.com/questions/417453/best-most-comprehe...</a><p>(edit: Added the link for Google's finance api because I'm not sure if you are referring to their api or finance site when you say 'scrape')
I've done a lot of scraping and parsing. Your best option is to fetch the SEC's RSS feed, then fetch the hard-to-parse XML or free-form filings it points to and parse those. XBRL is great in its own right, but it's very difficult to relate XBRL fields to non-XBRL filings. You would do well to keep the two sets of results separate.<p>SEC Form 4 filings are in XBRL dating back to Jan 1, 2004 for every company. There are well over 1,000,000 forms filed between then and now... I know, I have them all locally right now.<p>You can scrape Google's Financial pages, obviously, and you can even get 2-minute data from a JSON "_5d" variable.<p>You can get fundamentals data from NASDAQ pretty easily, too. Scraping it is a little difficult, but you can go 120 quarters back for many companies, and 5 years back for annual data.<p>I have a financial statements database populated with data scraped from NASDAQ right now. They update within a week or so after a report is filed with the SEC. You'll always be behind the curve, but the information is good, albeit incomplete (missing things like the number of shares outstanding).
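To make the "fetch the feed, then parse" step concrete, here is a rough sketch of the feed-parsing half using only Python's standard library. It assumes the feed is Atom-formatted and that each entry title starts with the form type followed by " - "; the sample feed below is invented for illustration and is not real SEC output. Treating a `.xml` link as "structured" is also just a crude stand-in for however you decide to separate XBRL from free-form filings.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample standing in for a downloaded feed (not real SEC data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example filings feed</title>
  <entry>
    <title>4 - EXAMPLE CORP (Issuer)</title>
    <link rel="alternate" href="https://example.com/filings/0001.xml"/>
    <updated>2010-01-15T10:00:00-05:00</updated>
  </entry>
  <entry>
    <title>10-Q - SAMPLE INC (Filer)</title>
    <link rel="alternate" href="https://example.com/filings/0002.htm"/>
    <updated>2010-01-15T10:05:00-05:00</updated>
  </entry>
</feed>
"""


def parse_feed(xml_text):
    """Extract form type, title, URL, and timestamp from each Atom entry."""
    root = ET.fromstring(xml_text)
    filings = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        filings.append({
            # Assumed title layout: "<form type> - <company name> ..."
            "form_type": title.split(" - ", 1)[0].strip(),
            "title": title,
            "url": link.get("href") if link is not None else None,
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return filings


def split_by_format(filings):
    """Separate structured (.xml) links from everything else (crude heuristic)."""
    structured = [f for f in filings if (f["url"] or "").lower().endswith(".xml")]
    free_form = [f for f in filings if f not in structured]
    return structured, free_form


if __name__ == "__main__":
    structured, free_form = split_by_format(parse_feed(SAMPLE_FEED))
    for filing in structured + free_form:
        print(filing["form_type"], filing["url"])
```

Downloading each linked filing and parsing its contents is the hard part and is left out here. Storing the structured and free-form results in separate tables keeps the two parsers from having to agree on field names.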
The SEC has a new XML-like data system for financial information called XBRL: <a href="http://xbrl.sec.gov/" rel="nofollow">http://xbrl.sec.gov/</a><p>Not all companies are required to report in this format at this time but I believe over the next year most Fortune 500 companies will be required to provide their data in this format.
I've been working on a closely related project for about a year now. It's not worth your time to attempt scraping. The best advice I can give you is that any financial data source worth consuming is going to cost money, so you might as well pay for it and focus your time and energy on building the product itself. The free data sources are unreliable and stale, and scraping legitimate sites is problematic because of throttling.
SECWatch has an API: <a href="http://secwatch.com/api.jsp" rel="nofollow">http://secwatch.com/api.jsp</a><p>10kWizard's cheapest plan works out to about $21/month.