

History of data science: the modern computer

Gary Mulder

In last week’s blog post, ‘History of data science: pre-20th century’, I told the story of data science before you probably thought it even existed – looking 5,000 years back and beyond.

In this part 2 (of 3), I will explore the modern data science revolution, which took place in the 20th century and takes us up to the present day and the future (to be discussed in the final part of this series).


Computer prediction

In the 20th century, the major revolution in data science occurred shortly after World War II, when one of the inventors of the modern computer, John von Neumann, created the first computer-based weather forecast model.

The forecast covered just one day, yet it took about a day to compute, so it was not particularly useful. However, within a year the compute time for the same one-day forecast was reduced to minutes through better algorithm design.

This pattern of improved computing hardware and improved algorithms, resulting in faster and better predictions, is a fundamental driving force behind the development of data science.

ENIAC was one of the first computers and was used for one of the first weather forecasts


Expert systems

In the 1980s, the state of the art in data science was the 'Expert System'. Expert Systems were a product of artificial intelligence (AI) research in the 1960s and 1970s.

Expert Systems codified a subject matter expert's knowledge in a collection of 'if-then' rules. If the 'if-then' rules were a good representation of the expert's knowledge, then the Expert System could be used by non-experts to make predictions.

In an Expert System, non-expert users are provided with a simple user interface to access expert-derived knowledge
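To make the idea concrete, here is a minimal sketch (not from the original post) of how a collection of 'if-then' rules might be represented in Python; the rules and the example applicant are hypothetical, chosen purely for illustration:

# A toy Expert System: each rule pairs an 'if' condition with a 'then' conclusion.
# The rules and the example applicant below are hypothetical, for illustration only.

rules = [
    (lambda a: a["age"] < 18, "reject: applicant is a minor"),
    (lambda a: a["income"] < 2 * a["requested_credit"], "refer to a human underwriter"),
    (lambda a: a["missed_payments"] == 0, "eligible for standard credit terms"),
]

def run_expert_system(applicant):
    """Fire every rule whose 'if' part matches and collect its 'then' conclusion."""
    return [conclusion for condition, conclusion in rules if condition(applicant)]

# A non-expert user only supplies the facts; the expert knowledge lives in the rules.
applicant = {"age": 34, "income": 45000, "requested_credit": 30000, "missed_payments": 0}
print(run_expert_system(applicant))
# ['refer to a human underwriter', 'eligible for standard credit terms']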


Expert Systems as rules-based fraud detection

Fraud detection systems based on rules are a derivative of Expert Systems. They are very effective at identifying simple relationships in data that might indicate the presence of fraud.

An example of a fraud detection rule is as follows:

If the current income of the applicant is two standard deviations higher than the average income for the applicant’s occupation, flag as possibly suspicious.

A lot of expert knowledge is encapsulated in this rule. We have a model of incomes for different occupations, including the distribution of incomes within each occupation. A threshold of two standard deviations is a statistical measure of how unusual a particular applicant's income is, relative to the average and the variation of incomes for the applicant's occupation.
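As a rough sketch of how such a rule might look in code (the occupation statistics below are invented, purely for illustration):

# Rules-based fraud check: flag applicants whose stated income is more than
# two standard deviations above the mean for their occupation.
# The occupation statistics below are invented, for illustration only.

OCCUPATION_INCOME_STATS = {
    "teacher": {"mean": 32000, "std": 6000},
    "plumber": {"mean": 38000, "std": 9000},
}

def is_income_suspicious(occupation, stated_income, threshold=2.0):
    """Return True if stated_income exceeds mean + threshold * std for the occupation."""
    stats = OCCUPATION_INCOME_STATS[occupation]
    return stated_income > stats["mean"] + threshold * stats["std"]

print(is_income_suspicious("teacher", 50000))   # True: 50000 > 32000 + 2 * 6000 = 44000
print(is_income_suspicious("plumber", 50000))   # False: 50000 <= 38000 + 2 * 9000 = 56000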

However, rules are only as good as the expert who codified them. If average incomes for an occupation change over time, the above rule may never match, or it may match many applicants who have legitimately reported their incomes. In the former case the rule fails to detect possible fraud, which is called a 'false negative'. In the latter case the rule generates a 'false positive', seeing indications of possible fraud when there is actually no fraud.


False negatives and false positives

The generation of false negatives and false positives is the bane of rules-based fraud detection systems. False negatives result in fraudulent losses for a credit-issuing organisation. False positives can cause an organisation's customers to be erroneously accused of fraud, which creates a very poor customer experience.

To avoid either type of error, human fraud experts need to manually review every positive match and determine which are 'true positives' (actual fraud) and which are 'false positives' (legitimate applications misclassified as fraud). Running and managing a large workforce trained in manual fraud detection is a major cost for an organisation.
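As a minimal illustration of these four outcomes (with invented example data, not figures from the post), a rule's flags can be compared against the results of manual review:

# Compare a rule's flags against the outcome of manual review to count the four cases.
# The two lists below are invented example data, purely for illustration.

flagged_by_rule = [True,  True,  False, False, True,  False]
actually_fraud  = [True,  False, False, True,  False, False]

counts = {"true_positive": 0, "false_positive": 0, "true_negative": 0, "false_negative": 0}
for flagged, fraud in zip(flagged_by_rule, actually_fraud):
    if flagged and fraud:
        counts["true_positive"] += 1       # correctly flagged fraud
    elif flagged and not fraud:
        counts["false_positive"] += 1      # legitimate applicant wrongly flagged
    elif not flagged and fraud:
        counts["false_negative"] += 1      # fraud the rule missed
    else:
        counts["true_negative"] += 1       # legitimate applicant correctly passed

print(counts)  # {'true_positive': 1, 'false_positive': 2, 'true_negative': 2, 'false_negative': 1}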

Therefore, in order to reduce false positives and false negatives, experts are always required to review, correct, and add rules to identify new patterns of fraud. These rule sets then grow in number and complexity without bound, and become a challenge to manage in themselves. At the same time, the volume of credit applications has increased as banks and other credit-offering organisations provide online credit applications. Organisations need new technologies to manage the volume of applications they receive and to carefully review them for possible fraud.

Rules-based fraud detection, while very useful, is not a cost-effective or scalable approach for modern fraud detection. Fraudsters continually change their approach in order to circumvent organisations' fraud controls. There is a clear need for smarter, better and more powerful methods for detecting and eliminating fraud. Careful, considered use of machine learning coupled with traditional rules-based detection allows TruNarrative to tell exactly who, what and when novel fraud has occurred.
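Purely as an illustrative sketch, and not a description of TruNarrative's actual system, one simple way to couple a hand-written rule with a machine learning classifier is to escalate an application when either flags it; the features, training data and threshold here are hypothetical:

# Hybrid fraud screening: combine a hand-written rule with a simple ML classifier.
# The features, training data and threshold below are hypothetical, for illustration only.
from sklearn.linear_model import LogisticRegression

# Features per application: [income above occupation mean (in std devs), applications in last 30 days]
X_train = [[0.1, 1], [0.5, 2], [2.5, 6], [3.0, 8], [-0.4, 1], [2.8, 7]]
y_train = [0, 0, 1, 1, 0, 1]  # 1 = confirmed fraud from past manual reviews
model = LogisticRegression().fit(X_train, y_train)

def income_rule(income_z_score):
    """Rules-based check: income more than two standard deviations above the occupation mean."""
    return income_z_score > 2.0

def screen(application):
    """Escalate for manual review if either the rule or the model raises a flag."""
    rule_flag = income_rule(application[0])
    model_flag = model.predict_proba([application])[0][1] > 0.5
    return rule_flag or model_flag

print(screen([2.6, 5]))   # likely True: both the rule and the model flag it
print(screen([0.2, 1]))   # likely False: neither flags it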

Keep a lookout for the final blog of this 3-part series, ‘History of data science: present day and the future’, coming soon. Read the first blog, ‘History of data science: pre-20th century’, here.