Demystifying the business of AI, machine learning, big data…
Andrew Wu discusses more emerging technologies that make the investment process more efficient and more accessible. Wu talks about misconceptions surrounding AI, big data and machine learning within investment technology.
Transcript
The tech in the investment tech sector
really comes in two flavors. There are a lot of new technologies that
make the investment process more efficient and more accessible. We've seen some of that in
the first module of this course. Automation in asset allocation, financial
planning and security settlement have really started to drive down cost
across many investment sectors and enable new developments like robo-advisers,
automated financial planners, and smart index funds to make the average
investor's portfolio more customized and more diversified and,
at the same time, less costly. The other category is the technologies
that make the investment process more precise, and these are broadly
labeled, sometimes incorrectly, as AI, big data, and
machine learning technologies, which are simply data analytics tools
that enhance our ability to interpret the historical financial data and make
more accurate predictions for the future. We've talked about some of
these, such as neural networks, primarily from a technical
point of view. In this module, we'll take a broader, business-focused
point of view on the application of AI and machine learning techniques
in the investment industry. We'll take a primarily non-technical
approach in this module, with the goal of providing an industry overview,
demystifying the key buzzwords, clarifying some common misconceptions,
providing some practical advice, and directing you to the more
in-depth research and analysis in the relevant subject
fields, such as computer science and statistics, should you be
interested in pursuing further. First, let's take a look at what AI and/or machine learning really
are, as these concepts are probably on par with crypto as some
of the most overhyped terms in finance. First of all, there are two types of
AI systems. One is called narrow AI, which is defined by Kurzweil as
a computer system that matches or exceeds human intelligence, but
in a narrowly defined area. Under this definition, most computer programs can
be called narrow AI to some degree, as most of them would exceed humans'
capabilities in some area. Next, general AI is probably the one
that you're more familiar with, and that's essentially
Skynet. It's software, but without the pre-programmed rules:
an artificial neural network that is truly able to code itself and
rewire itself with new data patterns. Essentially, general AI is a true
imitation of the human brain. So far, the current computing
capabilities of our hardware are far from being able to even remotely
construct a general AI network, so most of the research on this
type has been theoretical. Therefore, the current application of all of
the so-called AI systems in finance has been, at best, narrow AI. A more
appropriate label for AI in finance, though, is probably machine learning. You see,
the financial industry, particularly the investment sector, has been
all about processing data and information, extracting new insights and
signals from current and historical data. In these settings, the use of AI has
essentially been the use of more and more advanced models of machine learning, which itself is a fancy buzzword for
statistics. Essentially, using machine learning or quote-unquote AI in investments
is using advanced statistical models on the same financial data to
better interpret patterns in the data, conduct better statistical inferences, or
make better predictions. Now, let's put the tools and
the data together to have a more complete picture of
the types of data out there in the investment landscape and the types of
tools that you can use on them. First, there are all kinds of data that you can
use to guide your investment decisions out there: numbers, reports,
text, news, you name it. Now, these data will belong
to one of two types. Let's call the first type
quote-unquote small data. These are data small
enough in size, like files and Excel spreadsheets, that you can store on your computer and
search on your computer. The counterpart to that is big data. And what's big data? Well, it's a data set that's large enough that you
can't store it on your own computer, but instead have to put it on
a dedicated storage server and have to use some specialized
database tools like Hadoop and MapReduce to efficiently process and
manage the data. That's it. That's literally the buzz-free meaning of
big data. Now, within both the small data
and big data categories, data can
further be classified into two types based on their structure. The first type is called structured data. These data are essentially numbers,
numbers that you can organize neatly in rows and columns, like in
an Excel spreadsheet. Now, unsurprisingly, it's a little easier to perform
analysis on structured data because they are already quite organized.
In contrast to that are unstructured data. These are natural language,
text, images, videos, etc. Because these data are not neatly
stored as numbers in rows and columns, they will take a little
bit more effort to analyze. First, we'll have to use some statistical
technique, like natural language processing, or NLP, for text data and
signal analysis for audio and video data, to convert these data into
numbers before we can analyze them. But the important thing to note here is that,
contrary to what you hear in the hype, there's no mutual exclusivity
between the data types. For example, not all unstructured
data are big data. Things like text or image files could be very small in size, so you wouldn't need MapReduce
to work with them, but they still require the extra processing
step to be converted into numbers. There could be small unstructured data, and there could be big structured data.
Now, with all that data available, here are all the tools that you can use
to mine these data for insights and signals. Starting with the simplest, we have what I call the simple analytics,
like basic summary statistics: for example, averages, standard
deviations, correlations, etc. Now, basically everything else can
be labeled as machine learning, which again is just a fancy buzzword for
the more advanced statistical models, beyond simple statistics like averages,
that you can use on the data to analyze patterns, conduct inferences, or
make predictions. Next, there are two types of machine
learning tools. The easier type, let's call that quote-unquote shallow
learning. A classic example of shallow machine learning is a linear regression.
Shallow learning is essentially a statistical model where we, the users, have
to specify most of the model parameters. For example,
when you're estimating a regression, you have to choose what
the independent variables are, the regressors that you want
to include in that regression. The counterpart to that
is deep learning, and the buzz-free meaning of deep learning is
essentially a statistical model with more parameters that can be directly determined
by the data. An example of deep learning is regularized linear regressions,
like the lasso or the elastic net. These are regression
models where, instead of you choosing the independent variables, you load
the entire variable list into the model and, based on some basic criteria, the model will choose the most
quote-unquote important regressors to include for
you. That's essentially deep learning: letting the data determine the model
parameters. From the last module, we already saw that neural networks are
another classic example of deep learning, and this class of models is the closest to
quote-unquote AI that we can use for financial data analytics.
Further, within each machine learning type, we also have two subcategories. The first category is called
supervised learning and, as its name suggests, supervised learning
models have to be trained with data before we can use them for
tasks like predictions. Again, a linear regression is a classic
example of supervised machine learning. Take our lending regression examples
from the credit tech class: any regression model there has
to be fitted with data first. We need to get the parameter estimates for
these alphas and betas, and only then can we plug in new data and get the predicted values, like the default
rates. By this token, you can view other, much more advanced supervised learning
models as super souped-up regressions. The other class of machine
learning models is unsupervised. This means that we don't need to use
existing data to estimate the model before we can use it. We can just directly use it. A classic
example of unsupervised learning models is clustering. Given
a bunch of data points, you can directly take them to
a clustering model, and the model will cluster them into groups where
points within a group are close enough. We've seen a model like that in
action in the platform lending module of the credit tech course. So
basically, both shallow learning and deep learning can be either supervised or unsupervised. Finally, to put the data and
the analytics together, well, you have complete freedom to
choose whatever tool you see fit to analyze the type of data at hand. You can use a supervised shallow
learning tool like a regression to analyze structured big data, and there
are unsupervised deep learning tools for unstructured small data. When it comes
to using AI or machine learning for investments, you're really limited by, number one, your computing power and, number two,
choosing the most appropriate model that works best for
your data type and your questions. Let's now take a deeper look at the usage
of unstructured data in investments because, with the competition for
alpha getting more and more intense, investors are increasingly turning to the
analysis of unstructured financial data and using ever more
sophisticated models to mine them. If you think about it, in finance the bulk
of the growth in data that we have available has been in unstructured data. This is from Apple's SEC 10-K
filing, its annual report for 2019. And this income statement here is
an example of structured data, with numbers neatly stored in columns for
tabulation and analysis. And when released, these data are obviously
the first focus of analysts and traders alike. However, think about how long
this annual statement is. The tables are only
a few pages at best, and the rest of the statement, those hundreds of
pages of text, is all unstructured data. Far fewer people pay attention to
these, not to mention finishing them, as they're not very exciting reading. But this language might
contain a lot of information. What words and what sentences are used and how they're put together can tell
you a lot about what the management might be thinking beyond just the income
numbers. The same goes for social media data. This is a random Twitter page with posts
about Apple stock at the end of 2019. As you can see, this social
media language, how positive or uncertain it is, for example, can also
tell us a lot about investor sentiment and psychology around the stock. Of course, there's way too much
of this unstructured data, text and images that are relevant to the stock,
out there for us to read it all, and that's precisely where computers can help. For example,
we can use natural language processing tools to first convert the text
into some form of numbers that we can fit our existing machine
learning models on. At the simplest level, for example, the whole 10-K filing could
be turned into a word vector with entries denoting the frequencies of each
English word that's used in the document. We can then stack different document
vectors together into a matrix and run all sorts of shallow and deep machine learning
algorithms on it to get an estimate for both content and context, like how
positive the language is, how complex the sentence structure is, and how many topics
are talked about, among many other things. So I encourage you to explore a bit more
about natural language processing and other methods for
unstructured data analysis, because when used appropriately they can
really extract information and signals from these data that are truly incremental
to what you get from purely the numbers.
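
The word-vector idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the course: the two text snippets are made up rather than taken from any filing, the function names are hypothetical, and the short positive and negative word lists are hand-picked stand-ins for a real sentiment lexicon. Each document becomes a vector of word counts, the vectors are stacked into a matrix, and a crude tone score is read off each row.

```python
import re
from collections import Counter

# Hypothetical word lists for a crude tone score (assumptions, not a real lexicon).
POSITIVE_WORDS = {"strong", "improved", "growth"}
NEGATIVE_WORDS = {"declined", "risks", "uncertain"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the count of word j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


def tone_score(row, vocabulary):
    """(positive count - negative count) / total words, for one document vector."""
    positive = sum(n for word, n in zip(vocabulary, row) if word in POSITIVE_WORDS)
    negative = sum(n for word, n in zip(vocabulary, row) if word in NEGATIVE_WORDS)
    total = sum(row)
    return (positive - negative) / total if total else 0.0


# Made-up snippets standing in for the text of two filings.
documents = [
    "Revenue growth was strong and margins improved.",
    "Revenue declined and risks remain uncertain.",
]

vocabulary, matrix = document_term_matrix(documents)
scores = [tone_score(row, vocabulary) for row in matrix]

for doc, score in zip(documents, scores):
    print(f"{score:+.2f}  {doc}")
```

Each row of `matrix` is one document vector, and the stacked rows are the kind of matrix that a regression or a clustering model could then be run on. A real analysis would likely replace the hand-picked word lists with an established lexicon or a fitted model.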