Machine Learning: Association Rule - Apriori (Part 21)

Assume that in a supermarket, a data scientist found that people who buy diapers also tend to buy cold (non-alcoholic) drinks.

So the supermarket placed these two items far apart. Shoppers have to walk past other products to find them, and in the meantime they may buy more than they planned.

So, we can describe Apriori as finding rules of the form "people who bought this also bought that".

Here is another example:

People who watched movie 1 also watched movie 2, and people who watched movie 2 also watched movie 4.

But as business people, we need to know how likely a person is to choose one thing alongside another.

Another example:

People who bought burgers also bought french fries, and so on.

So, as a shopkeeper, if you see someone buying a burger, you can recommend french fries.

That's how we do business!

Let's learn the algo now:

We need to learn three measures here: support, confidence, and lift.

Assume we have 100 people, and among them

let's see how many watched "3 idiots". Say 10 of them did.

So, our support is 10/100 = 10%

Now confidence:

here we look at the people who watched "Interstellar" and ask how many of them also watched '3 idiots'.

Here, M1 refers to 'Interstellar' and M2 refers to '3 idiots'.

Assume this green portion of people watched Interstellar (M1) = 40 people

and out of those, 7 people also watched '3 idiots'

so, the confidence is 7/40 = 17.5%

So, if we want to recommend food or a movie, this is how it works.

For example,

the people in the red square watched '3 idiots'

and the people in green watched 'Interstellar'

So, what is the likelihood that a randomly chosen person will want to watch '3 idiots'? 10%, the support we found earlier.

But if we only look at those who watched 'Interstellar', how many might want to watch '3 idiots'? It's 17.5%, the confidence we found earlier.

So the lift, which is the improvement in your prediction, is 17.5% / 10% = 1.75 (a ratio, not a percentage).

Suppose you first ask, "Have you seen and liked Interstellar?"

If they say yes and you then recommend "3 idiots", the likelihood of a successful recommendation is 17.5%, compared with 10% for a random person.

So the lift is by definition 17.5% / 10% = 1.75.
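The three measures can be reproduced with a few lines of arithmetic. This is only a sketch using the counts from the example above (100 people, 10 who watched '3 idiots', 40 who watched 'Interstellar', 7 who watched both); the variable names are made up for illustration.

```python
# Counts taken from the movie example above
total_people = 100
watched_m2 = 10      # watched '3 idiots'
watched_m1 = 40      # watched 'Interstellar'
watched_both = 7     # watched 'Interstellar' and '3 idiots'

# Support of M2: share of all people who watched '3 idiots'
support_m2 = watched_m2 / total_people

# Confidence of the rule M1 -> M2: share of M1 watchers who also watched M2
confidence = watched_both / watched_m1

# Lift: how much better the rule does than recommending M2 to a random person
lift = confidence / support_m2

print(f"support={support_m2:.1%} confidence={confidence:.1%} lift={lift:.2f}")
```

A lift above 1 means that knowing someone watched 'Interstellar' makes '3 idiots' a better recommendation than it would be for a random person.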

Now what is the algorithm?
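As a rough idea of what happens under the hood, here is a simplified sketch for rules between two products only. It counts items and pairs, keeps the pairs that pass a minimum support, keeps the rules that pass a minimum confidence and lift, and sorts them by decreasing lift. It is a brute-force illustration, not the apyori implementation, and it skips the candidate pruning that makes the real Apriori algorithm efficient. The function name `pair_rules` and the toy baskets are made up for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence, min_lift):
    """Return (lhs, rhs, support, confidence, lift) tuples, highest lift first."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]

    # How many baskets contain each item, and each pair of items
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        # Each pair gives two candidate rules: a -> b and b -> a
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            if confidence >= min_confidence and lift >= min_lift:
                rules.append((lhs, rhs, support, confidence, lift))

    return sorted(rules, key=lambda r: r[4], reverse=True)


# Toy baskets, invented for illustration only
toy = [
    ["burger", "fries"],
    ["burger", "fries", "cola"],
    ["burger", "cola"],
    ["milk"],
    ["milk", "eggs"],
]

rules = pair_rules(toy, min_support=0.2, min_confidence=0.5, min_lift=1.0)
for rule in rules:
    print(rule)
```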

Let's code it:

Problem statement: when a customer buys one product, we want to know which other product has a high probability of being bought with it.

Each row represents a different customer.

So, the first customer bought these

and same goes for others...

  1. Import apyori

!pip install apyori

  2. Import the other libraries

    import pandas as pd

  3. Data preprocessing

    In the data, the first row is not a row of column names. It is itself a transaction. So, when importing the dataset, we need to set header=None

    If we don't do that, pandas treats the first row as column names and we lose it as a transaction

    dataset = pd.read_csv('Market_Basket_Optimisation.csv', header=None)

    Later we will use the apriori function, which expects its input in a particular format. We haven't prepared that yet, so let's do it.

    The format is a list of transactions, where each transaction is itself a list of product names

    transactions = []  # start with an empty list

    As we have 7501 transactions, let's add them all

    for i in range(0, 7501):
        transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

    Here, transactions.append([...]) adds one list of products per row. A transaction has at most 20 products, so the outer loop goes over all rows and the inner loop goes over all 20 product columns:

    for j in range(0, 20)

    So, we make sure we loop over all transactions and all products. Now what are we collecting?

    dataset.values[i, j] holds product j of transaction i

    Finally, every item needs to be a string,

    so we use str(dataset.values[i, j]). Note that empty cells (rows with fewer than 20 products) become the string 'nan'.

    The transactions list now holds every transaction in this format:

    [[transaction 1], [transaction 2], ..., [transaction 7501]]

    We can confirm that it has 7501 entries by checking its length with len(transactions)
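To see the conversion and the length check end to end without the real CSV file, here is a small sketch on a stand-in DataFrame. The three rows are invented for illustration. Unlike the walkthrough, which hard-codes 7501 rows and 20 columns, this version reads both sizes from `dataset.shape`.

```python
import pandas as pd

# Stand-in for the real file (made-up rows); short rows are padded with NaN,
# which is what read_csv does for baskets with fewer products
nan = float("nan")
dataset = pd.DataFrame([
    ["burger", "french fries", "cola"],
    ["milk", "eggs", nan],
    ["burger", "french fries", nan],
])

n_rows, n_cols = dataset.shape

# One list of strings per row, same idea as the loop above
transactions = [
    [str(dataset.values[i, j]) for j in range(n_cols)]
    for i in range(n_rows)
]

print(len(transactions))   # number of transactions
print(transactions[1])     # empty cells show up as the string 'nan'
```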

    1. Training the Apriori model on the dataset

      Import the apriori function

      from apyori import apriori  # importing the apriori function

      Then call it and keep what it returns

      rules = apriori()

      It needs the transactions list, so we pass that in

      rules = apriori(transactions=transactions)

      Now, we learned about support, and we need to set a minimum support for a rule to be considered. For example, when looking at people who watched 'Interstellar' and '3 idiots', we want a minimum number of people who watched both.

      rules = apriori(transactions=transactions, min_support=)  # value worked out next

      Let's say a combination of products has to appear in at least 3 transactions per day.

      So, the support will be

      Here we have 7501 transactions recorded over 1 week (7 days), so 3 per day means 3*7 = 21 transactions in total

      3*7/7501 ≈ 0.0028, which we round to 0.003

      rules = apriori(transactions=transactions, min_support=0.003)
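The minimum support above is just a count turned into a fraction of all transactions. A quick sketch of that arithmetic, with illustrative variable names:

```python
# Numbers from the walkthrough: 3 transactions per day, 7 days, 7501 transactions
transactions_per_day = 3
days_recorded = 7
total_transactions = 7501

# Minimum number of transactions over the whole week, as a share of all transactions
min_count = transactions_per_day * days_recorded
min_support = min_count / total_transactions

print(min_count)               # 21
print(round(min_support, 4))   # 0.0028, rounded to 0.003 in the walkthrough
```

A different threshold (say 5 transactions per day) would change only `transactions_per_day`.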

      Now we have to set the minimum confidence.

      This is a judgment call, and any value between 0 and 1 is allowed. Let's take 0.2

      rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2)

      Now lift: in practice, a minimum lift of 3 tends to work well. So, let's set it

      rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3)

      We want rules that link one product to one other product, so each rule should involve exactly 2 products.

      So, set min_length=2 and max_length=2 (2 products in total: one on the left-hand side and one on the right-hand side)

      Note: [We can give away one product with another. For example, assume 1000 people buy a milkshake with rice. Once we find that connection, we can offer one free with the other, so buyers are tempted to buy one even at a higher price.

      So, for buy 1, get 1, set max_length=2 & min_length=2

      For buy 2, get 1, set max_length=3 & min_length=3]

      rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)

    2. Visualising the results

      We put our rules into a list

      results = list(rules)

      You can now see all of the rules that meet our minimum support, minimum confidence, and minimum lift.

      Every rule has values equal to or greater than the minimums we set.

      We have found 9 rules.

      Let's analyze the first rule:

      RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])

      Here support=0.004532728969470737 > 0.003

      confidence=0.29059829059829057 > 0.2

      lift=4.84395061728395 > 3

      Also, items_base=frozenset({'light cream'}) and items_add=frozenset({'chicken'}) mean the rule reads light cream -> chicken: a customer who buys light cream also buys chicken in about 29% of cases (the confidence), which is about 4.84 times the baseline rate (the lift)

    3. Things seem pretty messy, right?

      Let's organize them using a pandas DataFrame

      # Putting the results well organised into a Pandas DataFrame
      def inspect(results):
          lhs = [tuple(result[2][0][0])[0] for result in results]
          rhs = [tuple(result[2][0][1])[0] for result in results]
          supports = [result[1] for result in results]
          confidences = [result[2][0][2] for result in results]
          lifts = [result[2][0][3] for result in results]
          return list(zip(lhs, rhs, supports, confidences, lifts))

      resultsinDataFrame = pd.DataFrame(inspect(results), columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

      Let's understand the indexing, taking the first rule as result.

      result[2] gives us the ordered_statistics list:

      [OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]

      result[2][0] is the one OrderedStatistic in that list. Now, we want its left side, which is light cream

      So, result[2][0][0] gives us items_base=frozenset({'light cream'}). A frozenset cannot be indexed directly, so we convert it to a tuple first: tuple(result[2][0][0])[0] gives us 'light cream'

      So, we used it for lhs (left-hand side)

      In the same way, result[2][0][1] gives us items_add=frozenset({'chicken'})

      and tuple(result[2][0][1])[0] gives us 'chicken', which we used for rhs (right-hand side)

      result[2][0][2] gives us confidence=0.29059829059829057

      result[2][0][3] gives us lift=4.84395061728395

      result[1] gives us support=0.004532728969470737

      Now we have a tidy table showing the connections between products
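The indexing described above can be tried without running apyori. The sketch below rebuilds the first rule with stand-in namedtuples that follow the field order of the printed record. They are defined here only for illustration and are not imported from the library.

```python
from collections import namedtuple

# Stand-ins with the same field order as the printed record (illustrative only)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)

# The first rule from the results shown earlier
result = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.004532728969470737,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"light cream"}),
            items_add=frozenset({"chicken"}),
            confidence=0.29059829059829057,
            lift=4.84395061728395,
        )
    ],
)

stat = result[2][0]          # the single OrderedStatistic
lhs = tuple(stat[0])[0]      # frozenset -> tuple -> first (only) item
rhs = tuple(stat[1])[0]
support = result[1]
confidence = stat[2]
lift = stat[3]

print(lhs, "->", rhs, support, confidence, lift)
```

Taking element `[0]` of the tuple only works cleanly because each side holds a single product (max_length=2). With longer rules, a side can contain several items and you would want to keep the whole tuple.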

    4. We can also display the rules sorted in descending order, for example by lift
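One way to do the sorting, assuming it is by lift, is pandas' nlargest. The sketch below uses the same column names as resultsinDataFrame. Only the first row comes from the results above. The other two rows are hypothetical placeholders so that the ordering is visible.

```python
import pandas as pd

rows = [
    # Real rule from the results above
    ("light cream", "chicken", 0.004532728969470737, 0.29059829059829057, 4.84395061728395),
    # Hypothetical rows, invented only to demonstrate the sorting
    ("item A", "item B", 0.005, 0.30, 3.2),
    ("item C", "item D", 0.004, 0.25, 4.1),
]
resultsinDataFrame = pd.DataFrame(
    rows,
    columns=["Left Hand Side", "Right Hand Side", "Support", "Confidence", "Lift"],
)

# Keep up to 10 rules with the highest lift, in descending order
top = resultsinDataFrame.nlargest(n=10, columns="Lift")
print(top)
```

`resultsinDataFrame.sort_values("Lift", ascending=False)` gives the same ordering when you want to keep every row.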

      Done!!

Code from here : Repository