I know, I know, some of you are probably sick of me idly speculating on what Google, Yahoo!, and Microsoft are going to do next, but I just had yet another vision that I wanted to share with you.
One of the search engines is going to build or buy a leading OCR and/or photo scanning software package.
Why?
Well, just follow the plotline in your head. Google just built a system (Google Base) which, if perhaps rather inelegantly, lets people add additional content in bulk for that search engine to slurp up.
Google and (separately) the Open Content Alliance are busy scanning the world’s books.
So we have Web pages, music, images, scholarly research, books, and more being indexed… but what about all those zillions of papers folks have lying around? Like the ones I just set about scanning this evening to reduce some of the clutter around my desk.
What have I been scanning? A list of waltz moves, an e-mail directory, a memorable schedule of a recent dance camp I attended, and a funny article I wrote for my high school newspaper.
How much of this would the world be interested in? How much would I really WANT to share? Not all of it, to be sure.
But from older academic papers to newspaper clippings to home photos and more… there’s a TON of information out there that’s not digitized.
Not digitized yet, that is.
And interestingly enough, decent scanners (albeit not slide scanners) are pretty darn cheap ($50 or less, especially used ones on eBay). But really good OCR software? At least $150, from what I’ve gathered. Students, families, home-office professionals… I bet most of them have scanners. But I doubt most of them have OCR software.
Then again, perhaps the search engines could simply piggyback onto non-OCR scanning software and do the OCR on their supercomputers in-house. That would give them greater ability to iterate, do A/B testing on scan quality, etc., without depending upon users to update software.
* * *
Benefit to engines:
- A huge database to improve NLP (natural language processing) algorithms… better understanding the interplay of text, graphs, photos, etc.
- Access to a ton of new content
- Further enticement to consumers to get onto their desktops (e.g., perhaps bundled in with Google Desktop or MSN Search or Yahoo-X1 search, etc.)
Benefit to consumers:
- Ability to archive documents and/or photos online with greater accuracy, and for less money (even free) for personal retrieval.
- Easier way to share not-yet-digitized documents with colleagues, using an OCR’d (much less bandwidth-intensive) format
- Probably other stuff I’m overlooking
* * *
What are your thoughts on this?
1) How feasible do you think it is that one of the search engines will buy/build such a service?
2) Which search engine would do this first?
3) How useful would it actually be to general consumers? Small business folks? Others?
What do you think?