I've been involved in this a couple of times and want to share a thought.
If you are just aggregating data, that's not a moat.
A lot of folks are buying data from a bunch of sources to give complete coverage - e.g. a dataset of all plane ticket prices, where before you could only get separate datasets from Amadeus and each of its rivals.
Someone, usually someone big, just goes around you to build their own infrastructure for some internal function - e.g. the Google Flights API - and then realises they can replace you as a secondary revenue stream.
Instead, I think you gotta somehow add value to your data, which is best done as a side effect of another business.
Reuters have a huge news dataset with amazing annotation because they got their editors to curate it as it was produced. That's an unassailable free-text training set that no one else is gonna match.
So build a dating app that causes users to create a curated dataset. Or a game. Or a business tool. Or an API. Or a really good AI secret sauce built on an expensive, privately curated training set that ate most of your funds.
Some of the more interesting AI research is being done in the area of developing very accurate models of the real world. This way the many varied iterations needed to develop the AI model can be done without the physical limitations of aggregating human data. Games are definitely in this same realm, although it's not always practical to make both a human-playable simulation game and an AI-playable simulation game given development constraints.
What I find fascinating is what AR might do for training AI models. To execute on AR we'll need to digitize a model of our physical surroundings so the software can interact with them. At that point we'll have a compelling pipeline of actionable data for machine learning - especially for robotics.
Why do you think companies like Google release so much AI-related work as open source (papers, models, frameworks)? Because they have the data to train them. In my company, 90% of my time is spent dealing with the dataset; having a good, big dataset is the first step to training any algorithm.
I'd never thought of this angle before, but maybe it's to establish prior art? They have several high-level competitors all working on the same stuff. The algorithms aren't that useful without the dataset anyway, so they release the algorithms in order to block any future attempts to patent the tech?
Licensing quality, trusted data is very expensive right now, but as IoT, device tracking, and other data-producing technologies come online there will be a race to the bottom in cost of quality. A lot of older data providers who haven't kept up with the times will be put out of business. I doubt it will take that long; I've had a few ideas in the space and they're not all that complicated...
Open/free data sources are likely to become very important. AI hasn't yet been super important in the open data world, but I'd expect it to gain a lot of prominence as time goes by.
Starting a dataset company would probably be a good idea. It necessarily involves some humans labeling the data, but you could probably build a lot of tools around it to make the process as smooth as possible. TaskRabbit and Amazon Mechanical Turk workers could be used as well.
Yep, open data and models with state-of-the-art performance are popping up more and more. I expect companies to appear which will sell data and models as a service, too.
Two additional points are (1) dataset collection is low variance relative to fundamental algorithmic advances, and (2) dataset collection relies less on having tip-top research talent (than fundamental algorithmic advances).
Definitely an echo of what I just commented (I didn't see your comment until after I posted mine): data sets and their collection are already seen as very valuable, and I'd imagine this will only become more true as time passes - if we don't hit another winter, that is...
It was the reason Blekko had value to IBM's Watson effort: the crawler was state of the art. With that and the 'web' you can create data sets that others can't. It is the not-so-secret advantage that both Google and Microsoft leverage (their search engine crawlers).
Could you describe at a high level (or whatever level you'd like) what goes into - or more specifically, what are the components of - a crawler that makes it superior for these purposes?
Doesn't die -- HTML being a specification in name only, there are a lot of really crazy web pages out there that render in browsers but are pathological edge cases.
Does a good job of distinguishing 'good' links from 'bad' links on a page -- lots of pages have links that should not be followed. Some are easy: they are rendered in the same color as the background (black-hat SEO link juice). Others refer to crawler traps.
Crawler traps come in many forms -- Rich Skrenta created a great example: a page that generated a random number and said "%d is an interesting number; here are two more interesting numbers: %d and %d", where each number linked to a new URL ending in that number. So if you tried to crawl that site exhaustively you would fill your entire crawler cache with random-number pages.
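To make the trap concrete, here's a toy version of it plus one crude defence: capping how many URLs you'll accept per URL "shape" (digit runs collapsed to a placeholder). The page format, URL scheme, and limit are all illustrative, not Blekko's actual mechanism:

```python
import random
import re
from collections import Counter

def trap_page(n: int) -> tuple[str, list[str]]:
    """A page in the style of the trap described above: every page
    links to two fresh random-number pages, forever."""
    a, b = random.randrange(10**6), random.randrange(10**6)
    html = f"{n} is an interesting number; here are two more: {a} and {b}"
    return html, [f"/number/{a}", f"/number/{b}"]

class ShapeLimiter:
    """Refuse to crawl more than `limit` URLs that share a shape,
    where a shape is the URL with digit runs replaced by <num>."""
    def __init__(self, per_shape_limit: int = 100):
        self.limit = per_shape_limit
        self.counts: Counter = Counter()

    def allow(self, url: str) -> bool:
        shape = re.sub(r"\d+", "<num>", url)
        self.counts[shape] += 1
        return self.counts[shape] <= self.limit
```

With a limit of 100, the limiter lets through the first hundred `/number/<num>` pages and then starts rejecting them, while URLs with other shapes are unaffected.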
Dynamic importance scaling -- you want to crawl the 'best' pages for a topic, so you need to figure out a way to measure which pages are important and which aren't. This was the secret sauce of Google's PageRank patent, but it's been gamed to death by SEO types. So now you need better heuristics to understand which are the more important links to follow.
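For reference, the core idea behind PageRank is short enough to sketch. This is the textbook power-iteration form over a tiny in-memory link graph, nothing close to a production ranker:

```python
def pagerank(links: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Power-iteration PageRank over {page: [outlinked pages]}.
    Assumes every outlink points at a page that is a key of `links`;
    pages with no outlinks spread their rank evenly (dangling fix)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}  # teleport term
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: distribute evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

On a graph where both "b" and "c" link to "a" but only "a" links back to "b", the iteration converges with "a" ranked highest and "c" (which nothing links to) lowest.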
Effective crawl frontier management -- for every billion pages you decide to crawl there are probably 20 to 50 billion pages you "know about". These URIs that are known but not yet crawled are referred to as the 'crawl frontier'. Picking where in the crawl frontier to go looking for useful new pages is half art and half good machine learning.
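At its simplest, a frontier is a priority queue keyed on whatever "promise" score you can compute for an uncrawled URL (inlink counts, host reputation, a learned model). The class and its scoring inputs here are placeholders to show the shape of the data structure, not a real implementation:

```python
import heapq

class Frontier:
    """Priority-ordered crawl frontier: pop the most promising
    known-but-uncrawled URL first."""
    def __init__(self):
        self.heap = []     # entries are (-score, url); heapq is a min-heap
        self.seen = set()  # every URL we've ever been told about

    def add(self, url: str, score: float) -> None:
        if url not in self.seen:      # don't re-enqueue known URLs
            self.seen.add(url)
            heapq.heappush(self.heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self.heap)[1]
```

The real art the comment describes lives in the `score` argument; the queue itself is the easy part.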
Good algorithmic de-packing -- many, many pages today are generated algorithmically from a set of rules, whether it's the product pages on Amazon or posts in a PHP forum. If you can recognize the algorithm early, you can effectively avoid crawling pages that are duplicates or not useful.
Good page de-duping -- there is a lot of repetition on the web, whether it is the 'how to sign up' page of every phpBB site ever or the same product with 10 different keywords in the URI.
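One standard way to catch near-duplicates like that (not necessarily what Blekko used) is word-shingle Jaccard similarity; real crawlers use hashed variants such as SimHash or MinHash so the comparison scales, but the underlying idea fits in a few lines:

```python
def shingles(text: str, k: int = 3) -> set:
    """The set of k-word sequences (shingles) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str) -> float:
    """Shingle-set overlap: 1.0 for identical texts, 0.0 for disjoint."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two boilerplate sign-up pages that differ by a word or two score close to 1.0, while unrelated pages score near 0.0; a crawler would threshold somewhere in between and keep only one copy.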
Selective JS interpretation -- sometimes the page exists in the JavaScript code, not in the HTML, so unless you want to store 'this page needs JavaScript enabled to run' in your crawler cache, you need to recognize this situation and get the page out of the JavaScript.
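A crude heuristic for spotting such pages before wasting a JS-capable fetch on every URL is to compare the visible text against the script payload. The threshold and regexes here are purely illustrative:

```python
import re

def looks_js_only(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a page renders its content in JavaScript:
    almost no visible text, but a substantial <script> payload."""
    scripts = re.findall(r"<script\b.*?</script>", html, re.S | re.I)
    stripped = re.sub(r"<script\b.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)      # drop remaining tags
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    script_len = sum(len(s) for s in scripts)
    return len(text) < min_text_chars and script_len > len(text)
```

Pages flagged by this check would be routed to a JavaScript-executing renderer; everything else goes through the cheap HTML-only path.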
When you say google and microsoft have this advantage in creating data sets, is it just the massive size of the web indices they are able to compile or do they use their crawlers in specific ways for compiling structured data that would be more useful for certain ML projects than a general web index?
Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI / ML project, rather than a general purpose web index?
> ... do they use their crawlers in specific ways for compiling structured data that would be more useful for certain ML projects than a general web index?
There are many uses for a large index. For example, they decode pages into structured data for many of the 'one box' results -- the small box on the search results page that answers your query directly, even though that answer came from a web page. This is good for the consumer, who gets the answer right away without clicking through to a web page, and it's good for Google, since it keeps the customer on the search results page with its advertising rather than having to go to some page on the web, potentially with someone else's advertising on it.
Google also post-processed crawl data to indicate the spread of flu in their experiment of extracting health data from query logs.
> Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI / ML project, rather than a general purpose web index?
Yes, there are many. Some of them made it into the Watson crawler. One of Blekko's claims to fame was its notion of 'slashtags', which were curated lists of known 'good' pages on a topic. Using such pre-validated URI lists can help you improve the fidelity of the datasets you collect. There are also clever ways to use existing data to validate the new data you are looking at. I'm on a couple of patent applications around that space which, if they ever issue, will make things a bit more obvious than they are today :-).
All the one-shot approaches I've seen are IMO transfer learning in disguise, which raises the question of "what are you transferring from?". So while these should hopefully reduce the need for truly gigantic datasets about everything, there is still a limit of how much info you can extract from a fixed amount of data.