I look at announcements like this, and past ones about Ng, and I always marvel at how things have gone since I took and completed his 2011 ML Class...
That was one helluva course: challenging, interesting, and fun all at the same time (and so much "concretely" - lol).
From what I understand, that course is still available through Coursera (which Ng booted up after the ML Class experiment; Udacity was Thrun's contribution after his and Norvig's AI Class, which ran at the same time in 2011).
Excellent review! I haven't taken Ng's original ML course and noticed that you mentioned it would be a prerequisite to taking any DL course (whether fast.ai or Ng's DL course). Care to elaborate on why?
Thanks for this! You just reignited my interest in ML, months after dipping my toes at the end of EDx's Intro to Computational Thinking and Data Science.
It's still available. I've been taking it for the past month or so.
After learning more about him, he's probably the only person in the world that I envy. I resonate with his ideas a lot, but I'm like 1% of what he is. It makes me a bit sad. I'll be taking his new course as well and hopefully one day I will be able to work in the same field as him.
You should focus on what you can do to make yourself a better person. Anyone on earth can compare themselves to another person and come away feeling inferior. Wherever You Go, There You Are.
I watched several of the course videos and was struck by how long it's been since college, when I last did anything with matrices. I kept pausing to go look up what things meant, and meanwhile some of his students were catching errors in real time in class.
Abu-Mostafa's Learning From Data[0] was much more rigorous, in my experience. It's a full-fledged Caltech course, sometimes taught on EdX concurrently with on-campus sessions.
The Ng class was a good introduction, but it was mostly applications, not the mathematics and theory behind them.
It is starting again on September 17, 2017 though you can access some of the material already.[1] In addition to the quality of the content, one of the most amazing things about previous offerings is just how involved he has been in the forums, directly helping students.
I took CS221 from Andrew in 2006 (or was it 2007?). Even more has changed since then ;-) It was my second ML course, after taking Daphne Koller's punishing CS229. Right then, though, I knew ML would sweep the world pretty soon.
Daphne has a PGM course on Coursera as well. Of the half dozen courses I have done on Coursera, that was by far the most difficult one, to the point where people were talking about making t-shirts stating "I survived week 5". Personally, I found it the most interesting and rewarding as well.
There are two versions of Ng's course, one is on Coursera, the other one taught at Stanford, both available online. Just wanted to point out that the Coursera one is much easier.
The Stanford cs classes on ML and deep learning were honestly surprisingly easy for a Stanford DL class. Or maybe that was because they were good teachers, who knows :)
Maybe not if you are someone lucky enough to have gone to Stanford, and are a CS major.
When I took the ML Class (I also took the AI Class at the same time, but had to drop out due to personal reasons - but I stayed in on the ML Class and finished it), I hadn't really touched linear algebra since high school.
I graduated high school in 1991; Ng's course was 20 years later.
I also didn't have any stats or probability experience under my belt. Nor anything about derivatives or integrals.
I basically had to pick all of this up on-the-fly (fortunately there are internet resources), and even to this day, I barely understand them (I understand matrix operations mostly, but I struggle with probabilities, and I have little-to-no idea on derivatives or integrals).
After high school I went on to get a one-year associate's degree (virtually worthless today) from a now-defunct voc-tech school here in Phoenix. Since then, I've been steadily employed as a software engineer here in the valley, and well compensated (I believe) for it. I own my own house, and I have zero debt except for a mortgage.
Given all of that, one should be able to see how such a course would be a challenge. There were a ton of people who signed up, but from what I understand, the majority dropped out after the first couple of weeks. This actually seems "par for the course" though for MOOCs.
I know it was a simplified intro to ML, but for me, it and what I took of the AI Class taught me more than what I ever was able to figure out on my own, especially on neural networks. The light really clicked on for me there. But I was really disappointed to have to drop out of the AI Class.
Later, in the spring of 2012, after Udacity had been established, they weren't able to offer the AI Class as one of their courses. So Thrun came up with another course, originally titled "CS373 - How to Build Your Own Self-Driving Vehicle" - and I jumped on that one and completed it as well. I found it fairly challenging too (but not as challenging as the AI Class). This course has since been renamed "AI for Robotics", which is more apt, I think.
It took a while, but eventually the AI Class was made into a course (I think there was some kind of licensing issue preventing it from being part of Udacity's offerings, but I don't know for sure). I have yet to retake it, but it is on my list (plus a ton of others).
Today, I'm in the home stretch of the 3rd term of Udacity's Self-Driving Car Engineer nanodegree. I'm struggling mightily to get my path planner project to work properly, but I almost have it done (it can make it around the track, but for some reason my behavior planner isn't costing things properly). Got an elective, and the integration project to do, all by mid-October or so.
I don't know if any of this will lead anywhere for me career-wise. I'm happy with my current employer, so I expect to stick around here for a while. I have hopes, dreams, ambitions to perhaps get a degree of some sort in CompSci. I want to really learn more mathematics. I've always been a lifelong learner, but this kind of stuff is really fascinating to me, even if it is (what seems to me at least) complex and not always intuitive. But if it were easy, it probably wouldn't be as fun (but I will say Keras and Tensorflow really make things much easier than when we had to implement a neural net in Octave and Python).
> I'm struggling mightily to get my path planner project to work properly
Having not completed an AI course, I tread lightly... However, I would guess that this project involves re-implementing an established solution. It is work like this that drives me away from such courses: I can't imagine how creative practices are promoted when "correct" techniques are repeatedly hammered in instead.
> I don't know if any of this will lead anywhere for me career-wise. I'm happy with my current employer, so I expect to stick around here for a while.
Professionals learning to program late in their career typically have a misconception that they're only eligible for entry-level positions in software engineering. Many fail to realize that their 10+ years of experience in their own field can be coupled with their newfound skill, giving them a background unlike that of many existing professional developers.
Whenever I see announcements like this, it's very unclear to me what is meant by "AI." Are they talking about basically getting the most out of the current ML/deep learning type systems? If so then I guess building data sets makes sense but it seems more like an uninteresting business strategy than what I think of as pushing AI forward.
If, on the other hand, they are talking about making progress on the more traditional dream of AI, the focus on building these data sets seems to be a sad way to lock us into a local maximum for a long time.
Massive curated data sets are crutches that lead us into narrow-minded hyper-specialized systems. Are any of these funds investing in people working on systems that try to make sense of raw sensory data streams? I don't think we're ever going to move away from data sets by creating more and better data sets.
I am of the opinion that `true AI` is the science/engineering of understanding and replicating human intelligence. Why are we able to come up with abstract concepts from the surrounding physical environment? Why do we look at the stars and wonder what they are (and why)? How are we able to communicate with one another through pictures, words, writings, Snapchat? Is it something special about our brains, our collective society, or something else, that enables such remarkably different behavior from any other animal on earth? I don't know which direction we can start down to answer these questions, but collecting good data sets is probably as good as anything. Maybe we'll get the `quantity` of smarter specialized systems first, and once we get the `quantity`, maybe the `quality` will follow?
I agree. I think the fields of "computational cognitive science" and developmental psychology are the ones to look into to make progress towards the "hard fundamental problems". Some of the leading labs working on this are MIT CBMM (https://cbmm.mit.edu/, they have a nice youtube channel) and Berkeley Cocosci (https://cocosci.berkeley.edu/index.php).
Google Brain/DeepMind are also pushing some of those ideas. They must be, since they aggressively poach all the top researchers from those labs...
Ng's approach is different: he wants a world powered by deep learning, so his goal is to make applied deep learning thrive. His strategy for doing that: give those data-hungry models even more data, which is completely reasonable.
Those two approaches - fundamental research and applied deep learning - are often referred to as AI, causing much confusion.
Don't have anything relevant to say except to make a fun note that we can change your second sentence, keeping it correct, while increasing its ambiguity. If we restate it as:
Are they talking about basically getting the most out of the current ML type systems?
Since Deep learning is a subset of machine learning, the sentence retains correctness but there are now two equally valid interpretations.
Okay, I do have something relevant to say. I don't see what advantage there is in raw sensory data streams, models can be trained off-line to operate on sensor streams just fine. What we actually want are systems that learn adaptively and on-line. To do that well, they'd need to also be data efficient.
> models can be trained off-line to operate on sensor streams just fine.
I mentioned this in another comment, but I don't know that we can train these models just fine, a human can get a lot more information out of an audiovisual stream than a rudimentary transcription of recognized speech, objects and/or text.
If we make these models more sophisticated to capture more information (e.g. body language, tone, context), we have to decide how that information is structured and communicated to the "higher level meaning interpretation" stage. No matter what, their output is going to be more rigidly structured and will contain less information than the raw stream. The extent to which the sensory processing model captures the human-recognizable information in the stream is the extent to which you have created an intelligent system.
There is some form of this structuring and reducing happening inside our brains, but we will never get machines to do that if we continue curating structured data sets. We, the researchers, are using our human intelligence to process raw sensory data and put it into a nice format for the AI. They need to be able to do that themselves.
> we have to decide how that information is structured and communicated to the "higher level meaning interpretation" stage.
Do we? I'm not convinced. I've joined different sensory inputs before and just glommed together the top of two nets, I didn't have to create any of my own representations.
> but we will never get machines to do that if we continue curating structured data sets
Curated datasets are important though. Want your machine to understand more about the world around it? Then you need high quality inputs in formats you can load in. These data also need to be licensed correctly.
> They need to be able to do that themselves.
We don't chuck kids out into the wilderness and expect them to come back as a useful member of society. We have a huge range of inputs specifically curated to help (from toys and shows to school curricula). Later on with specialisation we pay for extremely carefully selected data, presented in a specific order!
Creating high quality datasets is vital for really anything from niche specialist systems to large general ones.
What is the difference between a data set and a raw sensory data stream? More specifically, isn't a raw sensory data stream just a data set? I think you are getting hung up on semantics.
Or is it just the time correlation that interests you? Because some of these data sets are very likely to indeed be time correlated. Like a video/audio data set for example.
Yes in the most general sense, a "data set" could be anything, but I'm talking about the highly curated and labeled data sets that are used to train contemporary ML systems.
You can feed a million images that are labeled "cat" or "no cat" to one of these systems and it can achieve a human-level proficiency at identifying cats in images. But, it won't be able to do anything other than identify cats, it's far too narrow to be considered intelligent in any way.
If you can feed a series of timestamped photos to a system, basically an unlabeled arbitrary video stream, and you can demonstrate that it formed some notion of what a cat is, that would be very interesting indeed.
The data is not labelled in the same way. The child has still acquired their knowledge solely from sensory experience. How do they know to apply the spoken label "mom" to the recurring pattern in their visual data stream?
With modern data sets and labelling, so much of the problem domain is deeply hard-coded into the system. It doesn't have to learn what letters and words mean and how to identify them in a totally arbitrary* visual or audio stream. It just gets a relatively minute amount of structured data that it has an embedded understanding of what to do with.
Sure there are things like OCR and speech-to-text, but I don't think you could just run your streams through those, there's just so much subtle information loss. In order to make AI that really has a chance of reaching what humans would call intelligence, I firmly believe it has to make meaning out of some kind of raw sensory experience analogous to ours.
*Ok, not totally arbitrary, a human would not learn language from a video of a forest, and that's where the "labelling" comes in, from observing other people using language, but a child can still learn any language from just that, and the labels are often vague, inaccurate, contradictory, complex, abstract, etc. There's no master training set with the right answers, you have to decide for yourself. And humans were also capable of bootstrapping language from nothing. I just don't see modern supervised learning systems ever doing things like that.
Actually you can learn a lot of stuff unsupervised, and then just use supervised learning to extract the information from within the net. So you can just throw huge amounts of digit images at a deep net, then later look at the activations at the top and say "OK, so when neurons X, Y and Z are on, it's looking at a 1". I'm pretty sure you can just connect a couple at the top and have it say the number it sees.
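For what it's worth, that unsupervised-then-readout idea can be sketched in a few lines. In this toy, PCA stands in for the unsupervised net (a deep net would learn much richer features) and a simple threshold stands in for the supervised "extraction" step; the clusters, labels, and numbers are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled data: two clusters, but we never tell the model which is which.
a = rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(500, 2))
b = rng.normal(loc=[3.0, 0.0], scale=1.0, size=(500, 2))
unlabeled = np.vstack([a, b])
mean = unlabeled.mean(axis=0)

# Unsupervised stage: learn a 1-D feature (the top principal component)
# from the unlabeled data alone.
_, _, vt = np.linalg.svd(unlabeled - mean, full_matrices=False)
feature = vt[0]

def encode(x):
    return (x - mean) @ feature

# Supervised readout: a handful of labeled examples suffices to pick
# a threshold on the learned feature.
labeled_x = np.array([[-3.1, 0.2], [-2.8, -0.3], [3.2, 0.1], [2.9, -0.2]])
labeled_y = np.array([0, 0, 1, 1])
z = encode(labeled_x)
mid0, mid1 = z[labeled_y == 0].mean(), z[labeled_y == 1].mean()
threshold = (mid0 + mid1) / 2

def predict(x):
    side = encode(x) > threshold
    return side.astype(int) if mid1 > mid0 else (~side).astype(int)

# The readout generalizes to fresh samples from either cluster.
test = np.vstack([rng.normal([-3.0, 0.0], 1.0, (100, 2)),
                  rng.normal([3.0, 0.0], 1.0, (100, 2))])
truth = np.array([0] * 100 + [1] * 100)
accuracy = float((predict(test) == truth).mean())
print(accuracy)
```

The point is just the division of labor: all the structure is discovered without labels, and the labels only name what was already found.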
> In order to make AI that really has a chance of reaching what humans would call intelligence, I firmly believe it has to make meaning out of some kind of raw sensory experience analogous to ours.
And why must it be able to do this from scratch? Why must it be hampered with the same limitations we have?
I would disagree on this point: humans, unlike current AI systems, can learn from one or two data points, especially at easier tasks like identifying cats. Current AI algorithms need huge labeled data sets to solve narrow problems, so we need to build more generalization ability into our current AI systems.
That's because we are already VERY pretrained on very similar tasks. You should look into transfer learning and one-shot learning for examples of techniques that in practice do something very similar.
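A toy sketch of the one-shot flavor: classify by nearest prototype in a frozen "pretrained" embedding space. Here a fixed random projection stands in for the pretrained network (a big simplification; real transfer learning reuses learned features), and the three "unseen classes" are synthetic data invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained feature extractor: a frozen random
# projection plus ReLU. In real transfer learning this would be the
# penultimate layer of a net trained on a large source task.
W = rng.normal(size=(64, 8))

def embed(x):
    return np.maximum(x @ W, 0.0)

# One-shot learning on a new task: a single labeled example per class
# becomes that class's prototype in embedding space.
class_means = rng.normal(scale=2.0, size=(3, 64))  # 3 unseen classes
prototypes = {
    c: embed(class_means[c] + rng.normal(scale=0.3, size=64))
    for c in range(3)
}

def classify(x):
    z = embed(x)
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))

# Fresh samples from each class should land nearest their own prototype.
hits, total = 0, 0
for c in range(3):
    for _ in range(50):
        x = class_means[c] + rng.normal(scale=0.3, size=64)
        hits += int(classify(x) == c)
        total += 1
accuracy = hits / total
print(accuracy)
```

All the heavy lifting is in the (here fake) pretrained embedding; the "learning" on the new task is just storing one point per class, which is roughly the sense in which one-shot methods are transfer learning in disguise.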
Yes, I have heard of transfer learning and used it in practice. Very powerful, but still primitive.
One-shot learning techniques have yet to mature, but I agree with you that technologies like these will reduce our reliance on datasets and make AI algorithms learn in a more human-like way.
I'm personally most interested in approaches to AI not based as heavily on dataset-collecting. There's a now almost standard method of: 1) first curate a nice, large dataset (often explicitly labeled by humans), then 2) carefully engineer a model architecture that has good performance on the problem represented by that dataset. Example: dataset is ImageNet, problem is tagging images. Obviously there's a lot of practical value in that, and it's a perfectly reasonable thing to research and use, but that's only one specific kind of inference.
Quite possible. I do have a recent PhD in AI, but it's a big field and I can't claim to know more than a small percentage of it.
The problem with the dataset-first approach is that humans are still providing a large part of the intelligence: defining the domain, defining what good performance looks like, carefully designing model architectures, collecting and labeling large datasets, etc. This is fine for narrow task-specific problems, but it is not really the end-all of AI, and it does not even seem to work well on all well-defined tasks. As an example of another kind of inference, how about mathematical reasoning? I purposely pick one here that is seemingly very formal; it should be possible for a computer to do it. Mathematicians are somehow able to invent conjectures and prove theorems without first being exposed to terabytes of labeled mathematical facts. The scientific method is kind of an even messier version of this. Or to take something laypeople do: people can usually learn games to at least a passable level from just a handful of playthroughs, not AlphaGo-style millions of plays (imagine if you had to play even 1,000 games of MtG before you got the basic hang of it...).
All this kind of stuff is quite well-represented in the literature though, if you mean the scientific literature. Pop-press AI writing tends to cover a pretty specific subset of what's going on at AI conferences.
I don't think you're up to speed with the latest ML approaches. You have to be narrow before you can be broad, drawing cross-correlations between strong narrow inferences toward more generalized problem sets.
The building blocks of AI products are not stable enough for the industry to be respected outside of specialized cases. Too many crap models, not enough simple tools ensuring that models are not crap. I'm guessing these funds would go toward producing those tools. Data quality is a huge issue, so curated data sets are vital, at least until the standard toolchain is strong enough to reliably work on noisy information.
I've been involved in this a couple of times and want to share a thought.
If you are just aggregating data, that's not a moat.
A lot of folks are buying data from a bunch of sources to give complete coverage - e.g. a dataset of all plane ticket prices, where before you could only get separate datasets from each of Amadeus and its rivals.
Someone, usually someone big, just goes around you to build their own infrastructure for some internal function (e.g. the Google Flights API), and then realises they can replace you as a secondary revenue stream.
Instead, I think you've got to somehow add value to your data, which is best done as a side effect of another business.
Reuters have a huge news dataset with amazing annotation because they got their editors to curate it as it was produced. That's an unassailable free-text training set that no one else is gonna match.
So build a dating app that causes users to create a curated dataset. Or a game. Or a business tool. Or an API. Or a really good AI secret sauce built on an expensive, privately curated training set that ate most of your funds.
Some of the more interesting AI research is being done in the area of developing very accurate models of the real world. This way the many varied iterations needed to develop the AI model can be done without the physical limitations of aggregating human data. Games are definitely in this same realm, although it's not always practical to make both a human-playable simulation game and an AI-playable one given development constraints.
What I find fascinating is what AR might do for training AI models. To execute on AR we'll need to digitize a model of our physical surroundings so the software can interact. At that point we'll have a compelling pipeline of actionable data in regards to machine learning - especially for robotics.
Why do you think companies like Google release so many AI-related things as open source (papers, models, frameworks)? Because they have the data to train them. In my company, 90% of my time is spent dealing with the dataset; having a good, big dataset is the first step to training any algorithm.
I'd never thought of this angle before, but maybe to establish prior art? They have several high level competitors all working on the same stuff. The algorithms aren't that useful without the dataset anyway, so they release the algorithms in order to block any future attempts to patent the tech?
Licensing quality, trusted data is very expensive right now, but as IoT, device tracking, and other data-producing technologies come online there will be a race to the bottom in cost of quality. A lot of older data providers who haven't kept up with the times will be put out of business. I doubt it will take that long; I've had a few ideas in the space and they're not all that complicated...
Open/free data sources are likely to become very important. AI hasn't yet been super important in the open data world, but I'd expect it to gain a lot of prominence as time goes by.
Starting a data set company would probably be a good idea. It necessarily involves some humans doing the labeling, but you could probably build a lot of tools around it to make the process as smooth as possible. TaskRabbit and Amazon Mechanical Turk workers could also be used.
Yep, open data and models with state-of-the-art performance are popping up more and more. I expect companies to appear which will sell data and models as a service, too.
Two additional points are (1) dataset collection is low variance relative to fundamental algorithmic advances, and (2) dataset collection relies less on having tip-top research talent (than fundamental algorithmic advances).
Definitely an echo of what I just commented (didn't see your comment until after I posted mine); data sets and the collection of them already are seen as very valuable, and I would imagine this would only become more true as time passes, if we don't hit another winter, that is...
It was the reason Blekko had value to IBM's Watson effort, the crawler was state of the art. With that and the 'web' you can create data sets that others can't. It is the not-so-secret advantage that both Google and Microsoft leverage (their search engine crawlers).
Could you describe, at a high level (or whatever level you'd like), what goes into, or more specifically, what the components of a crawler are that make it superior for these purposes?
Doesn't die -- HTML being a specification in name only, there are a lot of really crazy web pages out there that render on browsers but are pathological edge cases.
Does a good job of distinguishing 'good' links from 'bad' links on a page. -- Lots of pages have links that should not be followed; some are easy to spot, as they are rendered in the same color as the background (black-hat SEO link juice), and others refer to crawler traps.
Crawler traps come in many forms -- Rich Skrenta created a great example: a page that generated a random number and said "%d is an interesting number; here are two more interesting numbers: %d and %d", where each link went to a new URL ending in that number. So if you tried to crawl that site exhaustively, you would fill your entire crawler cache with random-number pages.
Dynamic importance scaling -- you want to crawl the 'best' pages for a topic so you need to figure out a way to measure which pages are important and which aren't. This was the secret sauce of the PageRank patent Google had but it's been gamed to death by SEO types. So now you need better heuristics to understand which are the more important links to follow.
Effective crawl frontier management - for every billion pages you decide to crawl there are probably 20 to 50 billion pages you "know about". These URIs that are known but not yet crawled are referred to as the 'crawl frontier'. Picking where to go looking in the crawl frontier to find useful new pages is half art and half good machine learning.
Good algorithmic de-packing -- many many pages today are generated algorithmically from a set of rules, whether it is the product pages on Amazon or posts in a PHP forum. If you can recognize the algorithm early, you can effectively avoid crawling pages that are duplicates or not useful.
Good page de-duping -- There is a lot of repetition on the web, whether it is the 'how to sign up' page of every phpBB site ever or the same product with 10 different keywords in the URI.
Selective JS interpretation -- sometimes the page exists in the JavaScript, not in the HTML, so unless you want to store 'this page needs JavaScript enabled to run' in your crawler cache, you need to recognize this situation and get the page out of the JavaScript.
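To give a concrete flavor of the trap-avoidance and de-packing points above (my own toy sketch, not how Blekko or any production crawler actually does it): collapse each URL to a template by replacing its variable parts, then cap the crawl budget per template. Skrenta's random-number trap collapses to a single template and gets starved:

```python
import re
from collections import Counter

def url_template(url):
    """Collapse the variable parts of a URL so pages generated by the
    same rule share one key: digit runs become <N>, query values <V>."""
    url = re.sub(r"\d+", "<N>", url)
    url = re.sub(r"=[^&#]+", "=<V>", url)
    return url

MAX_PER_TEMPLATE = 3  # toy crawl budget per template

seen = Counter()

def should_crawl(url):
    key = url_template(url)
    seen[key] += 1
    return seen[key] <= MAX_PER_TEMPLATE

# The random-number trap generates endless URLs from one rule...
trap_urls = [f"http://trap.example/interesting/{n}" for n in range(1000)]
crawled = [u for u in trap_urls if should_crawl(u)]
print(len(crawled))  # ...but only the per-template budget gets crawled
```

Real crawlers use far more robust template mining than two regexes, but the shape is the same: recognize the generating rule, budget against the rule rather than the URL.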
When you say Google and Microsoft have this advantage in creating data sets, is it just the massive size of the web indices they are able to compile, or do they use their crawlers in specific ways to compile structured data that would be more useful for certain ML projects than a general web index?
Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI / ML project, rather than a general purpose web index?
> ... do they use their crawlers in specific ways for compiling structured data that would be more useful for certain ML projects than a general web index?
There are many uses for a large index. For example, they decode it into structured data for many of the 'one box' results: a small box that shows up on the search results page with the answer to your query, even though that answer came from a web page. This is good for the consumer, who gets their answer right away without clicking through to a web page, and it's good for Google, as it keeps the customer on the search results page with its advertising rather than having to go to some page on the web, potentially with someone else's advertising on it.
Google also post-processed crawl data to indicate the spread of flu in their experiment extracting health data from query logs.
> Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI / ML project, rather than a general purpose web index?
Yes there are many. Some of them made it into the Watson crawler. One of Blekko's claims to fame was their notion of 'slashtags' which were curated lists of known 'good' pages on a topic. Using such pre-validated URI lists can help you improve the fidelity of the datasets you collect. There are also clever ways to use existing data to validate the new data you are looking at. I'm on a couple of patent applications around that space which, if they ever issue, will make things a bit more obvious than they are today :-).
All the one-shot approaches I've seen are IMO transfer learning in disguise, which raises the question of "what are you transferring from?". So while these should hopefully reduce the need for truly gigantic datasets about everything, there is still a limit of how much info you can extract from a fixed amount of data.
Definitely - without the proper data and large amounts of it, things are pretty dead in the water for research, let alone building actual products.
More and more varied datasets are needed for this (but understandably they can be seen as valuable on their own, so reluctance to share is understandable - at least from a business perspective).
Makes sense. As one of the most public personas in AI, he probably gets pitched frequently by AI startups. Might as well let someone else bankroll his dealflow while collecting 2% annual management fees and participate in the upside via carry a decade later.
Regardless of our individual opinions, more than enough of the general public neither knows nor cares about securing their online privacy for an invasive AI startup to get what it needs to make a valuable product.
Do you, or anyone else, have any thoughts as to why Americans have, at least stereotypically, a more laissez faire attitude to personal privacy compared to those in Europe? Just the somewhat recent history of secret police in Europe? The American founders were very pro-privacy weren't they?
I couldn't say. I don't know enough about European data sharing habits to really draw a comparison. I would remind you, though, as Apple proved during the San Bernardino iPhone showdown, the U.S. government is heavily restrained by law from just storming in and taking company data that they claim is necessary for the interests of national security.
Personal privacy is a paper tiger, as Facebook and others have proven again and again. You only need to look at the consequence-free landscape of privacy breaches to understand that people have no meaningful right to avoid direct marketing and spillage of their contact info.
On the other hand, confidential information of corporations is pretty well guarded, and confidential contract terms, costs, pricing, and other information that doesn't get shared among peer companies or competitors is what will propel AI into the economic stratosphere. Finding ways to get confidential data into learning systems and provide actionable feedback is the killer app.
I am not sure I get it. If I am Ford and get access to GM's salary database or factory electrical meter, do I win much? It's almost certain I know the ballpark from running my own factory, and I probably hired three of GM's managers, with detailed knowledge in their heads, last week.
Or is it more hedge funds - like satellite images of Walmart car parks to estimate the revenue figures?
Either way, these don't seem like things we need AI to pick patterns out of?
It's about normalizing pricing and terms in the supply chain. (You chose bad examples, as the larger buyers implicate antitrust/competition rules and illegal collusion.)
If an AI has access to a broad spectrum of confidential information, it can reliably answer questions about the state of the market on an anonymized basis. This has the effect of normalizing sourcing behavior, which reduces purchasing and sales friction and improves overall efficiencies.
Well, why does it have to involve privacy? A camera on a marsh with thousands of hours of footage could be useful data (has bird, doesn't have bird? Water level? Cloud cover?)
A tollgate recording vehicle pass-through with a timestamp, a train station gate recording individual pass-through with a timestamp, etc, with huge volume could be useful data.
Indeed, it is very frightening what companies can do with users' data. While in the U.S. a user's data may reside in the clouds of multiple companies and need to be aggregated to be useful, in China companies like Tencent know almost everything about every user: your IM messages, your purchases, the movies/music you like, the news you read, your geolocation. I cannot imagine what they are capable of with all that data. Think Black Mirror.
Privacy activists have long ignored the benefits of data collection. As we continue to extract more and more value from data, this should become more evident, and we will be forced to start discussing concrete harms rather than people's general discomfort.
Of course, this is not true. Plenty of harms have been shown, particularly based on past history. But these are dismissed as things that couldn't possibly happen again. Which means that the objections will only be accepted when it's too late.
Marge: Do I have to be dead before you’ll help me?
Wiggum: Well, not dead – dying.
[Marge gets up to leave]
No, no, no, no. Don’t walk away. How about this: just show me the knife in your back. Not too deep, but it should be able to stand by itself.
Aside from running large scale analyses over large health data sets, what are some examples where the value derived from large aggregations of personal data is dispersed widely through a society rather than being captured mostly by a single corporation or organization?
Building large data sets doesn't necessarily mean from personal data. Look at open-data initiatives such as http://open.canada.ca/en/open-data . Lots of potential for useful tools to be created if the data is there, which won't happen if even benign data like that are kept under wraps/not collected.
He spent a couple of years in Singapore, so I'm going by the pronunciation there. (May be different in different regions, and I'm not sure about his preferred pronunciation now). It would do something like this:
Start with the word "urn". Now don't drag it out, make it short. Make the "n" and "ng" sound at the end (urng). Now take out the "r" sound (uhng).
It seems like Americans do tend to pronounce it "ehng" instead.
Slight side topic: Has anyone gone through the new deeplearning.ai track on Coursera yet? Wondering how difficult it is for someone that can write code, but never had any formal academic training in CS.
I see this question asked very often in the last year or two. I am not an expert on deep learning, nor have I taken the deeplearning.ai track, but am currently learning about the topic.
Some of the resources out there have nice primers. I think you need to be somewhat comfortable with the usual log/exp functions at a minimum, understand calculus (including partial derivatives), and be used to linear algebra and matrix operations. A good understanding of statistics is useful as well when learning about ML. I don't think a strong background in CS is necessary for this stuff; it has little to do with programming languages, operating systems, or Turing completeness. Maybe a good base in algorithms is useful for _implementing_ the libraries to make sure they are optimal.
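To make the prerequisites above concrete, here's a small made-up illustration (not from any particular course) of the kind of math you end up using: the gradient of a least-squares loss combines exactly the tools mentioned, i.e. matrix multiplication, transposes, and partial derivatives.

```python
import numpy as np

# Toy data: 3 samples, 2 features (values chosen arbitrarily for illustration)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)  # model weights, initialized to zero

# For the loss L(w) = ||Xw - y||^2, matrix calculus gives the gradient
#   dL/dw = 2 * X^T (Xw - y)
# This one line is a partial derivative, a matrix transpose, and two
# matrix products -- the "linear algebra and calculus" comfort in practice.
grad = 2 * X.T @ (X @ w - y)
```

If that line reads naturally, you likely have enough math background to start on deep learning material.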
I do wonder why people ask this question (or ask about resources for learning this topic in general) when answers are so easily found online.
I'm on course two after completing the assignments in course one. If you can write code, you can do the course; so far there's nothing where a CS degree is needed. It's mainly array manipulation for now.
The coding exercises are set up very nicely for you, with a lot of code comments and hints. Most of the time, if not all the time, you're simply completing the right-hand side of an assignment.
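For anyone who hasn't seen this exercise style, here's a hypothetical sketch of what it looks like (not actual course material): the notebook provides the function skeleton and comments, and you fill in the right-hand side of the assignment.

```python
import numpy as np

def sigmoid(z):
    """Compute the sigmoid of z (scalar or numpy array)."""
    # GRADED FUNCTION -- the comment tells you what to write,
    # and you fill in the right-hand side (approx. 1 line):
    s = 1 / (1 + np.exp(-z))
    return s

# Sanity check, in the style the notebooks provide:
print(sigmoid(np.array([0.0])))  # expected: [0.5]
```

That's roughly the level of difficulty: if you can translate a formula into a NumPy expression, you can do the assignments.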
> Many of these funds are putting time and resources into securing data sets, technical mentors and advanced simulation tools to support the unique needs of AI startups
I find this really frustrating about tech investing. Many of the early investment opportunities are only available to the richest. It's an insiders' game.
It's not unique to tech, but it's necessary for funds of this size. Let's say Andrew's entire fund is made up of $5m investors. That means he has to close at least 30 investors, which means he probably has to pitch 60-90.
Not only that, but actively maintaining 30 relationships and making sure they are effectively informed of the fund's progress is difficult and time-consuming as well. Andrew's time is probably best spent working with portfolio companies on their tech, so the less time he has to spend courting and showing love to LPs, the better.
That's why they limit investment size. It's not to be purposefully exclusionary; they just have to be mindful of efficient use of time.
These restrictions exist to protect the little guy, given the high risk that comes with the high reward.
Otherwise, you will get people that have negative net worth, maybe $20k in credit card debt + student loans that they pay minimums on, and a few kids dumping a year of savings into Snapchat and losing it all.
While it sucks for small players that are ok with high risk investing (and would be ok losing it), there just isn't a way to stop the flood of stupid that would come with it.
Probably the closest thing I've seen to being able to invest in something like this is cryptocurrency.
Andrew Ng is mostly talking about society powered by current techniques, which Musk also likes. Musk's worry is over the potential once/if we've advanced the technology enough (which, given current progress and open avenues for research, isn't too far-fetched an expectation).
Offtopic, but this amp page is the cleanest page I've ever seen.
I would love to just use the amp version for all TechCrunch pages. Anyone in the mood to make a chrome extension? (I'm on desktop, and the results are still clean without adblocker)