Architecting a Machine Learning System for Risk

Online risk mitigation

At Airbnb, we want to build the world’s most trusted community. Guests trust Airbnb to connect them with world-class hosts for unique and memorable travel experiences. Airbnb hosts trust that guests will treat their home with the same care and respect that they would their own. The Airbnb review system helps users find community members who earn this trust through positive interactions with others, and the ecosystem as a whole prospers.

Bad actors can attempt several types of attacks, and we can mitigate each in a different way:

1) Product changes

Many risks can be mitigated through user-facing changes to the product that require additional verification from the user.

2) Anomaly detection

Scripted attacks are often associated with a noticeable increase in some measurable metric over a short period of time.
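As a minimal sketch of this idea (the metric, window size, and threshold below are hypothetical, not values from any real production system), one can flag a time bucket whose count jumps several standard deviations above its recent rolling baseline:

```python
from collections import deque

def spike_detector(window_size=60, threshold=3.0):
    """Flag a value that exceeds the rolling mean by `threshold` stddevs."""
    history = deque(maxlen=window_size)

    def observe(value):
        if len(history) < window_size:
            history.append(value)      # still warming up, cannot judge yet
            return False
        mean = sum(history) / len(history)
        std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5
        is_spike = std > 0 and (value - mean) / std > threshold
        history.append(value)
        return is_spike

    return observe

# Example: a scripted attack shows up as a sudden burst of signups.
detect = spike_detector(window_size=10, threshold=3.0)
for count in [20, 22, 19, 21, 20, 23, 18, 22, 21, 20]:   # normal traffic
    detect(count)
print(detect(95))   # True: far above the recent baseline
```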

3) Simple heuristics or machine learning models based on multiple variables

Fraudulent actors often exhibit repetitive patterns.
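A toy sketch of the heuristic variant follows; every feature name, rule, and weight is invented for illustration. In practice the same feature vector would more likely feed a model trained on labeled fraud cases, with learned rather than hand-tuned weights.

```python
def risk_score(event):
    """Toy heuristic score; every rule and weight here is invented."""
    score = 0.0
    if event["account_age_days"] < 1:
        score += 0.4    # brand-new accounts are riskier
    if event["messages_sent_last_hour"] > 20:
        score += 0.3    # repetitive, high-volume messaging
    if event["distinct_listings_contacted"] > 10:
        score += 0.2    # fanning out across many listings
    if event["ip_country"] != event["profile_country"]:
        score += 0.1    # geo mismatch is a weaker signal
    return min(score, 1.0)

event = {
    "account_age_days": 0,
    "messages_sent_last_hour": 35,
    "distinct_listings_contacted": 12,
    "ip_country": "RO",
    "profile_country": "US",
}
print(risk_score(event))   # 1.0: route to manual review or block
```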

Machine Learning Architecture

Different risk vectors can require different architectures. For example, some risk vectors are not time critical, but require computationally intensive techniques to detect. An offline architecture is best suited for this kind of detection. For the purposes of this post, we are focusing on risks requiring realtime or near-realtime action. From a broad perspective, a machine learning pipeline for these kinds of risks must balance two important goals:

1) The framework must be fast and robust.

That is, we should experience essentially zero downtime and the model scoring framework should provide instant feedback.

2) The framework must be agile.

Since fraud vectors constantly morph, new models and features must be tested and pushed into production quickly.
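To make these two goals concrete, here is a minimal sketch (not Airbnb's actual framework) of a scorer that serves predictions from memory for goal 1 and hot-swaps a newly trained model from disk for goal 2; the pickle file and the sklearn-style predict_proba interface are assumptions:

```python
import os
import pickle
import threading

class ModelServer:
    """Minimal sketch of a low-latency scorer with hot model reloads."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.lock = threading.Lock()
        self.mtime = None
        self.reload_if_changed()

    def reload_if_changed(self):
        """Swap in a newly trained model without restarting the service."""
        mtime = os.path.getmtime(self.model_path)
        if mtime != self.mtime:
            with open(self.model_path, "rb") as f:
                new_model = pickle.load(f)
            with self.lock:
                self.model = new_model   # atomic swap; in-flight requests
                self.mtime = mtime       # finish on the old model

    def score(self, features):
        """Synchronous scoring on the hot path: no I/O, just compute."""
        with self.lock:
            model = self.model
        return model.predict_proba([features])[0][1]
```

A background thread (or a cheap check on each request) would call reload_if_changed periodically, so a retrained model can reach production in seconds with essentially zero downtime.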


Hadoop and Data Warehouses

I see a lot of confusion when it comes to Hadoop and its role in a data warehouse solution. Hadoop should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse. Hadoop and a data warehouse will often work together in a single information supply chain: Hadoop excels at handling raw, unstructured, and complex data with vast programming flexibility; data warehouses, on the other hand, manage structured data, integrating subject areas and providing interactive performance through BI tools.

There are three main use cases for Hadoop with a data warehouse:

1. Archiving data warehouse data to Hadoop (move)
Hadoop as cold storage/long-term raw data archiving:
– Avoids having to buy a bigger PDW (Parallel Data Warehouse), SAN, or tape library
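As a rough sketch of the "move" with Spark running on the Hadoop cluster (the JDBC URL, table name, and date cutoff are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch of use case 1: move cold warehouse data into Hadoop.
# The JDBC URL, table name, and date cutoff are placeholders.
spark = SparkSession.builder.appName("dw-archive").getOrCreate()

cold_sales = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://pdw-host:17001;database=EDW")
    .option("dbtable",
            "(SELECT * FROM FactSales WHERE OrderDate < '2012-01-01') AS cold")
    .load()
)

# Parquet on HDFS keeps the archive cheap but still queryable from
# Hive or Spark, unlike tape.
cold_sales.write.mode("overwrite").parquet("hdfs:///archive/FactSales/pre2012")

# After verifying the archive, the matching rows would be deleted from
# PDW on the warehouse side; that deletion is what makes this a "move".
```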

2. Exporting relational data to Hadoop (copy)
Hadoop as backup/DR, analysis, cloud use:
– Export conformed dimensions to compare incoming raw data with what is already in PDW
– Can use the dimensions against older fact tables
– Send validated relational data to Hadoop
– Land Hadoop data in WASB (Azure blob storage) so it can be used by other tools/products (e.g., Cloud ML Studio)
– Incremental Hadoop load / report
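A similar sketch for the "copy" direction, again with placeholder connection strings, storage accounts, and column names:

```python
from pyspark.sql import SparkSession

# Sketch of use case 2: copy a conformed dimension out of PDW so
# Hadoop-side jobs and cloud tools can reuse it. The JDBC URL,
# storage account, and column names are placeholders.
spark = SparkSession.builder.appName("dim-export").getOrCreate()

dim_customer = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://pdw-host:17001;database=EDW")
    .option("dbtable", "dbo.DimCustomer")
    .load()
)

# Writing to WASB makes the validated dimension available to other
# tools reading from the same blob container (e.g., an ML workspace).
dim_customer.write.mode("overwrite").parquet(
    "wasb://exports@mystorageacct.blob.core.windows.net/DimCustomer"
)

# The copy can also be joined against incoming raw files to spot
# customers that are new or changed relative to what PDW already holds.
raw_signups = spark.read.json("hdfs:///landing/signups/")
unknown = raw_signups.join(dim_customer, on="CustomerKey", how="left_anti")
```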

3. Importing Hadoop data into data warehouse (copy)
Hadoop as staging area:
– Great for real-time data, social networks, sensor data, log data, automated data, RFID data (ambient data)
– Capture all the data in Hadoop and pass only the relevant data to PDW
– Can process the data while it sits in Hadoop (clean it, aggregate it, transform it)
– Some processing is better done in Hadoop instead of SSIS
– A way to keep staging data
– Long-term raw data archiving on cheap storage that is online all the time (instead of tape) – great if you need to keep the data for legal reasons
– Others can analyze the data in Hadoop and later pull it into the data warehouse if they find something useful
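And a sketch of the staging-area pattern: clean and aggregate raw logs in Spark, then load only the reduced result into the warehouse (paths, schema, and table names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

# Sketch of use case 3: Hadoop as the staging area. Raw, high-volume
# log data is cleaned and aggregated in Spark, and only the reduced
# result is loaded into PDW. Paths and table names are placeholders.
spark = SparkSession.builder.appName("staging-load").getOrCreate()

raw_logs = spark.read.json("hdfs:///landing/clickstream/2016-05-01/")

daily_summary = (
    raw_logs
    .filter(F.col("status") == 200)                 # clean: drop failed requests
    .groupBy("user_id", F.to_date("ts").alias("day"))
    .agg(F.count("*").alias("page_views"))          # aggregate before loading
)

# Only the relevant, summarized data lands in the warehouse; the raw
# files stay in Hadoop as the cheap, always-online archive.
(daily_summary.write.format("jdbc")
    .option("url", "jdbc:sqlserver://pdw-host:17001;database=EDW")
    .option("dbtable", "stage.DailyPageViews")
    .mode("append")
    .save())
```

In a real PDW environment this last hop might go through PolyBase instead of JDBC, but the division of labor is the same: do the heavy lifting in Hadoop, and move only the relevant data into the warehouse.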
