科技回声

My company, Easy Vino (easyvino.com), is gearing up for a beta release and we need to populate our database with wine lists. The job consists of extracting information from wine lists (which we already have, usually as PDFs, HTML, or pictures) and entering it into our database.

We have a simple back office that connects to a wine API to search for wine info, and we need help inputting the data. I'd rather have the same person (or team) do all of it, as the learning curve is significant.

Does anyone know a cheap resource for this type of task? Any help or reference is appreciated.

Thanks a lot!

I'm not sure exactly what sort of answer you are expecting. Unless the data you want is in a standardized format (such as a standardized XML schema), any effort to extract it would require writing a custom parser for each set of data that has a different structure. I'm not sure whether you are asking for advice on which technology stack to use for writing these parsers or are looking for a pre-made tool that can extract the data for you. There may be some tools that can "attempt" to do this without requiring you to write custom code, but I am not sure how effective they would be.

The typical way of doing this is to use Mechanical Turk. There are some third-party services (their names escape me) built on top of MTurk to provide reliability.

The typical way they do this is to have two different people enter the same data and, when there's a mismatch, have a supervisor decide which entry is right.
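The double-entry check described above might look roughly like the sketch below. This is only an illustration, not how any particular MTurk-based service works: the field names (`name`, `vintage`, `price`), the `reconcile` and `normalize` helpers, and the whitespace and case normalization are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: field name -> value as typed by a data-entry worker.
Record = Dict[str, str]


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial typing differences match."""
    return " ".join(value.split()).casefold()


def reconcile(entry_a: Record, entry_b: Record) -> Tuple[Record, List[str]]:
    """Compare two independent entries of the same wine list item.

    Returns the fields both workers agree on, plus the names of the fields
    that differ and should be sent to a supervisor for a decision.
    """
    accepted: Record = {}
    disputed: List[str] = []
    for field in sorted(set(entry_a) | set(entry_b)):
        a = entry_a.get(field, "")
        b = entry_b.get(field, "")
        if normalize(a) == normalize(b):
            accepted[field] = a.strip()
        else:
            disputed.append(field)
    return accepted, disputed


if __name__ == "__main__":
    first = {"name": "Chateau Example", "vintage": "2005", "price": "45"}
    second = {"name": "chateau  example", "vintage": "2006", "price": "45"}
    accepted, disputed = reconcile(first, second)
    print(accepted)  # fields both workers agree on
    print(disputed)  # fields a supervisor needs to review
```

With the sample data, the two workers agree on the name and price but not on the vintage, so only the vintage would go to the supervisor.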

You might have good luck just hiring some cheap virtual assistants to do this work for you. oDesk or Elance are pretty good for these types of administrative tasks.

Ask HN: Do you know a good resource for large data scraping job?

3 comments
