Machine Learning: Association Rule - Apriori (Part 21)
Assume that in a super shop, the data scientist found that folks who buy diapers also buy cold drinks (no alcohol).
So, the super shop placed these two far enough apart that you have to go looking for the second one, and in the meantime folks may buy more things than they planned.
So, we can describe Apriori as
You can also take another example.
You can see that people who watched movie 1 also watched movie 2, and those who watched movie 2 also watched movie 4.
But as business people, we need to know how likely it is that a person will choose one thing alongside another.
Another example:
You can see that folks who bought burgers also bought french fries, and so on.
So, as a shopkeeper: "If you see someone buying a burger, you can recommend french fries."
That's how we do business!
Let's learn the algorithm now:
We need to learn support and confidence here.
Assume we have 100 people, and among them
let's see how many watched "3 idiots".
Say 10 of them did. So, our support is 10/100 = 10%
Now confidence.
Now we are looking at how many of the people who watched 'Interstellar' also watched '3 idiots'.
Here, M1 refers to 'Interstellar' and M2 refers to '3 idiots'.
Assume this green portion of people watched 'Interstellar' (M1) = 40 people,
and out of those, 7 people also watched '3 idiots'.
So, the confidence is 7/40 = 17.5%
So, if we want to recommend food or movies, this is how it will work.
For example,
these red square people watched '3 idiots'
and the green people watched 'Interstellar'.
So, what is the likelihood that a randomly chosen person will like to watch '3 idiots'? 10%, as we saw earlier.
But if we only ask those who watched 'Interstellar', how many might like to watch '3 idiots'? It's 17.5%, as we got earlier.
So, the lift (the improvement in your prediction) is 17.5% / 10% = 1.75. Note that lift is a ratio, not a percentage.
If you first ask the question, "Have you seen and liked 'Interstellar'?", and they say yes, and then you recommend '3 idiots', the likelihood of a successful recommendation is 17.5%.
So the lift is by definition 1.75.
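To double-check the arithmetic, here is a minimal sketch that recomputes the three numbers from the counts used in the example. The variable names are only illustrative.

```python
# Counts taken from the example above; variable names are illustrative.
total_people = 100    # everyone in the sample
watched_m2 = 10       # watched '3 idiots' (M2)
watched_m1 = 40       # watched 'Interstellar' (M1)
watched_both = 7      # watched M1 and also M2

support_m2 = watched_m2 / total_people   # share of all people who watched M2
confidence = watched_both / watched_m1   # share of M1 watchers who also watched M2
lift = confidence / support_m2           # improvement over recommending M2 blindly

print(f"support(M2) = {support_m2:.1%}")
print(f"confidence  = {confidence:.1%}")
print(f"lift        = {lift:.2f}")
```

A lift above 1 means that asking about 'Interstellar' first makes the '3 idiots' recommendation more likely to succeed than recommending it to a random person.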
Now what is the algorithm?
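Roughly, the idea is: set a minimum support and confidence, keep only the items and item sets that clear the minimum support, keep only the rules that clear the minimum confidence, and sort what is left by lift. Below is a minimal sketch of that idea in plain Python, limited to rules between two single items. The transactions, the function names (`support`, `pair_rules`) and the thresholds are all made up for illustration, and this is not how apyori implements it internally.

```python
from itertools import combinations

# Toy transactions, invented for illustration.
transactions = [
    {"burger", "fries", "cola"},
    {"burger", "fries"},
    {"burger", "fries"},
    {"milk", "bread"},
    {"milk", "bread", "cola"},
    {"cola", "salad"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(min_support, min_confidence):
    """Rules of the form A -> B between single items, sorted by lift."""
    # Step 1: keep only items that are frequent on their own.
    items = sorted({item for t in transactions for item in t})
    frequent = [i for i in items if support({i}) >= min_support]
    rules = []
    # Step 2: build pairs only from frequent items (the pruning idea).
    for a, b in combinations(frequent, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Step 3: check the confidence in both directions.
        for base, add in ((a, b), (b, a)):
            confidence = pair_support / support({base})
            if confidence >= min_confidence:
                lift = confidence / support({add})
                rules.append((base, add, pair_support, confidence, lift))
    # Step 4: strongest rules first.
    return sorted(rules, key=lambda r: r[4], reverse=True)

for rule in pair_rules(min_support=0.3, min_confidence=0.7):
    print(rule)
```

Here 'salad' is dropped in step 1 because it appears only once, so no pair containing it is ever counted.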
Let's code it:
Problem statement: We want to know, if a customer buys one product, which other product has a high probability of being bought with it.
Each row represents a different customer.
So, the first customer bought these,
and the same goes for the others...
- Import apyori
!pip install apyori
Import the other libraries
import pandas as pd
Data preprocessing
From the data we saw that the first row has no column names; it is itself a transaction. So, when importing the dataset, we need to set header=None.
If we don't do that, pandas will treat the first row as column names and we will lose it as a transaction.
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
Later we will use the apriori function, which requires a specific input format. We haven't set that up yet. Let's do that.
The format is basically a list of transactions.
transactions = []
We start with an empty list. As we have 7501 transactions, let's add them all.
for i in range(0, 7501):
    transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])
Here, transactions.append([...]) adds one list of products per transaction. Each row has at most 20 products. So, the outer loop goes over all rows and the inner one goes over all 20 product columns.
for j in range (0,20)
So, we ensured that we loop over all transactions and all products. Now what are we expecting?
dataset.values[i,j] will hold one product of a transaction.
Finally, everything needs to be a string.
So, str(dataset.values[i,j])
Here you can see that the transactions list has every transaction listed in this format:
[[transaction 1],[transaction 2],...................[transaction 7501]]
We can confirm the length with len(transactions), which should be 7501.
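One thing to be aware of: rows with fewer than 20 products are padded with empty cells, and str() turns those into the string 'nan'. Here is a small sketch of the same loop on made-up rows, assuming the empty cells behave like Python's float NaN. The `rows` list is only a stand-in for dataset.values.

```python
# Stand-in for dataset.values: a short row padded with NaN (illustrative data).
nan = float("nan")
rows = [
    ["burger", "fries", "cola"],
    ["milk", nan, nan],
]

# Same pattern as the loop above, written against the toy rows (3 columns, not 20).
transactions = []
for i in range(0, len(rows)):
    transactions.append([str(rows[i][j]) for j in range(0, 3)])

print(len(transactions))   # number of transactions collected
print(transactions[1])     # the padded cells have become the string 'nan'
```

So every short transaction carries some 'nan' entries. If a rule involving 'nan' ever shows up in your results, this is where it comes from.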
Training the Apriori model on the dataset
import the apriori function
from apyori import apriori #importing apriori function
Then call it
rules = apriori()
It takes the transactions list, so let's pass it in
rules = apriori(transactions=transactions)
Now, we learned about support, and we need to set a minimum support for the rules. For example, when looking for folks who watched both 'Interstellar' and '3 idiots', we want a minimum on how often people watched both.
rules = apriori(transactions=transactions, min_support=...)  # value still to be filled in
Say we pick 3: we only care about products that show up in at least 3 transactions a day.
So, the support will be
Here we had 7501 transactions in 1 week (7 days), and we assume 3 per day as the minimum.
3*7/7501 = 0.0028, which we round to 0.003
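The same calculation as a quick sketch, reading the 3 as "3 purchases per day" (that reading is my assumption from the 7-day factor). The variable names are illustrative.

```python
# Figures from the text: 3 purchases a day, 7 days, 7501 transactions in total.
purchases_per_day = 3
days = 7
total_transactions = 7501

min_support = purchases_per_day * days / total_transactions
print(round(min_support, 4))  # about 0.0028, rounded to 0.003 in the text
```

If your data covers a different period, only `days` and `total_transactions` need to change.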
rules = apriori(transactions=transactions, min_support=0.003)
Now we have to set the minimum confidence.
We can take any value. Let's take 0.2
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2)
Now lift: in practice, a minimum lift of 3 often works well as a starting point. So, let's set it
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3)
Now, we planned to relate one product to one other product. So, we are expecting rules with 2 products.
So, set max_length=2 and min_length=2 (2 products in total, one on the left and one on the right).
Note: We can give away 1 product with another. For example, assume 1000 people buy milkshake with rice. Once we find that connection, we can offer one free with the other, so buyers are tempted to buy even at a higher price.
So, for buy 1, get 1, set max_length=2 and min_length=2.
For buy 2, get 1, set max_length=3 and min_length=3.
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)
Visualising the results
We put our rules into a list
results = list(rules)
You can see all of the rules that pass our minimum support, minimum confidence, minimum lift, etc.
All of the rules have values equal to or bigger than the minimums we set.
We have found 9 rules.
Let's analyze the first rule:
RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])
Here
support=0.004532728969470737 > 0.003
confidence=0.29059829059829057 > 0.2
lift=4.84395061728395 > 3
Also,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'})
means that someone who buys light cream is likely to buy chicken too (here, with about 29% confidence).
Things seem pretty messy, right?
Let's organize them using pandas dataframe
Putting the results well organised into a Pandas DataFrame
def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
Let's understand what we get with each index.
result[2] gives us
[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]
Now, we want the left side, which is light cream.
result[2][0] gives us the OrderedStatistic itself, and
result[2][0][0] gives us items_base=frozenset({'light cream'})
A frozenset can't be indexed directly, so we turn it into a tuple first:
tuple(result[2][0][0])[0] gives us 'light cream'
So, we used it for lhs (left hand side).
result[2][0][1] gives us items_add=frozenset({'chicken'})
and
tuple(result[2][0][1])[0] gives us 'chicken'
So, we used it for rhs (right hand side).
result[2][0][2] gives us confidence=0.29059829059829057
result[2][0][3] gives us lift=4.84395061728395
result[1] gives us support=0.004532728969470737
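If you want to try the indexing without running apriori, here is a small sketch. It rebuilds the first rule with namedtuple stand-ins. I am assuming apyori's RelationRecord and OrderedStatistic are namedtuples with these field names, as the printed output suggests. The classes defined here are not imported from apyori.

```python
from collections import namedtuple

# Stand-ins for apyori's result types (assumed field names, taken from the printed output).
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)

# The first rule from the results above, rebuilt by hand.
result = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.004532728969470737,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"light cream"}),
            items_add=frozenset({"chicken"}),
            confidence=0.29059829059829057,
            lift=4.84395061728395,
        )
    ],
)

print(result[1])                   # support
print(tuple(result[2][0][0])[0])   # left hand side
print(tuple(result[2][0][1])[0])   # right hand side
print(result[2][0][2])             # confidence
print(result[2][0][3])             # lift
```

Taking element [0] of the tuple only works cleanly because each side holds a single product, which is what max_length=2 gives us.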
Now we have a good table showing the connections between goods.
We can also view it in descending order (for example, by lift).
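A minimal sketch of that ordering in plain Python, assuming we sort by lift. The rows have the same shape as the output of inspect(). The first row is the rule analysed above and the other two are made up.

```python
# Rows shaped like the output of inspect(): (lhs, rhs, support, confidence, lift).
# The first row is the rule analysed above; the other two are invented.
rows = [
    ("light cream", "chicken", 0.004532728969470737, 0.29059829059829057, 4.84395061728395),
    ("item A", "item B", 0.005, 0.30, 3.2),
    ("item C", "item D", 0.004, 0.25, 5.1),
]

# Sort by the last field (lift), largest first.
by_lift = sorted(rows, key=lambda row: row[4], reverse=True)
for lhs, rhs, support, confidence, lift in by_lift:
    print(f"{lhs} -> {rhs}: lift={lift:.2f}")
```

On the DataFrame itself, pandas can do the same job with resultsinDataFrame.nlargest(n = 10, columns = 'Lift').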
Done!!
Code from here : Repository