2/22/15

Frequent ItemSets : Apriori Algorithm, Support and Confidence Part II

Part I of this tutorial covered the Apriori algorithm and briefly mentioned association rules. Today we will look at association rules, confidence and support.

Association Rule

In our previous post we defined learning association rules as finding the items which were bought together most often, i.e. single items, pairs, triples etc.

In technical terms, association rules are if-then rules about the contents of a basket. An example is below:

The rule {i1, i2, i3, ..., iN} -> j means: "if a basket contains all of i1, ..., iN, then it is likely to contain item j".

Confidence

The confidence of an association rule is the probability of j given i1, ..., iN. In simple terms, it is the ratio of the support for I U {j} to the support for I, where I = {i1, ..., iN}. The support of I is the number of baskets/transactions containing all the items in I.
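As a minimal sketch of the two definitions above: the function names `support` and `confidence` and the grocery baskets are made up for illustration and are not the transaction table used in this post.

```python
def support(baskets, itemset):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket))


def confidence(baskets, lhs, j):
    """Confidence of the rule lhs -> j: support(lhs U {j}) / support(lhs)."""
    lhs_support = support(baskets, lhs)
    if lhs_support == 0:
        raise ValueError("the left-hand side never occurs in the baskets")
    return support(baskets, set(lhs) | {j}) / lhs_support


# Hypothetical baskets, invented for this sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

bread_support = support(baskets, {"bread"})           # 3 baskets contain bread
bread_to_milk = confidence(baskets, {"bread"}, "milk")  # 2 of those 3 also contain milk
print(bread_support, bread_to_milk)
```

Here bread appears in three baskets and two of them also contain milk, so the rule {bread} -> milk has a confidence of 2/3.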

Example

Our Transactions/Baskets
Now suppose we want to check the association rule {2, 4} -> 5.

The confidence is the ratio of the support of {2, 4} U {5} to the support of {2, 4}. Therefore,

Confidence = 3 / 3 = 1

We can say that {2, 4} -> {5} has a confidence of 1. But we also want to know how interesting the rule is. For this, we have a new parameter called Interest.

The interest of an association rule is the difference between its confidence and the fraction of baskets which contain item j.

I ({2, 4} -> 5) =  conf( {2, 4} -> 5) - Fr(5)
                       = 1 - (3/4)
                       = 1 - .75
                       = .25

Therefore, the interest is just 0.25 (25%), so it is not a very interesting rule.

Interesting rules are those with high positive or negative interest values: a high positive value means the presence of I encourages the presence of j, and a high negative value means it discourages it.
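The interest calculation can be sketched in a few lines of Python. The function name `interest` is an assumption, and the four baskets below are invented so that their counts match the numbers used above (support of {2, 4} and of {2, 4, 5} is 3, and item 5 is in 3 of 4 baskets). They are not the original transaction table.

```python
def interest(baskets, lhs, j):
    """Interest of lhs -> j: confidence of the rule minus the fraction of baskets with j."""
    lhs = set(lhs)
    with_lhs = [basket for basket in baskets if lhs <= set(basket)]
    if not with_lhs:
        raise ValueError("the left-hand side never occurs in the baskets")
    conf = sum(1 for basket in with_lhs if j in basket) / len(with_lhs)
    fraction_j = sum(1 for basket in baskets if j in basket) / len(baskets)
    return conf - fraction_j


# Hypothetical baskets chosen to reproduce the counts in the worked example.
baskets = [{2, 4, 5}, {2, 4, 5}, {2, 4, 5}, {1, 6}]

positive = interest(baskets, {2, 4}, 5)  # 1 - 0.75 = 0.25
negative = interest(baskets, {1}, 5)     # 0 - 0.75 = -0.75
print(positive, negative)
```

The second call shows a negative value: in these made-up baskets, item 1 never appears together with item 5, so the presence of 1 discourages the presence of 5.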

Frequent ItemSets : Apriori Algorithm and Example Part I

This is the start of our new tutorial topic, "Data Mining". The Apriori algorithm is one of the classic algorithms used in data mining to find association rules. On a first reading Apriori might look complex, but it's not. Let me give an example and try to explain it:

Suppose we have transactions of a shopping centre as below:


Learning association rules means finding the items which were bought together most often, i.e. single items, pairs, triples etc.

As I mentioned earlier, Apriori is a classic and the most basic algorithm when it comes to finding association rules. A lot of resources are available on the internet, but here I will try to make it intuitive and easy.

Algorithm:

- A two-pass algorithm which limits the need for main memory.
- One of the key ideas behind Apriori is monotonicity: if a set of items I appears at least s times, so does every subset J of I.

Pass 1: Read the baskets and count in main memory the occurrence/frequency of each item.

After Pass 1 is completed, check the count for each item. If the count of an item is greater than or equal to s, i.e. Count(i) >= s, then item i is frequent. Save this for the next pass.

Pass 2: Read the baskets again and count in main memory the occurrence/frequency of each pair of items formed using the frequent items (which we got from Pass 1).

After Pass 2 ends, check the count of each pair of items. If it is greater than or equal to s, i.e. Count(i, j) >= s, the pair is considered to be frequent.
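The two passes described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `frequent_pairs` and the five baskets are assumptions made up for this example, and everything is kept in memory.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, s):
    """Return (frequent items, {frequent pair: count}) for support threshold s."""
    # Pass 1: count every single item and keep those with Count(i) >= s.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, count in item_counts.items() if count >= s}

    # Pass 2: count only the pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    pairs = {pair: count for pair, count in pair_counts.items() if count >= s}
    return frequent_items, pairs


# Hypothetical baskets, invented for this sketch.
baskets = [[1, 2, 3], [1, 2, 4], [1, 2], [2, 3], [1, 3, 5]]
items, pairs = frequent_pairs(baskets, s=3)
print(items)  # items 1, 2 and 3 reach the threshold
print(pairs)  # only the pair (1, 2) appears at least 3 times
```

Because of monotonicity, a pair containing an infrequent item can never be frequent, which is why Pass 2 can safely ignore items 4 and 5 here.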



Example: 

We will assume one thing:

- Our Support or threshold is 3.

Our Transaction Table: 


Step 1: Count the occurrence of each item.



Step 2: Remember, the algorithm says an item is considered frequent if it is bought at least as many times as the Support/Threshold, i.e. 3. Therefore, below is the list of Frequent Singletons.



Step 3: We start making pairs out of the frequent itemsets we got in the above step.


Step 4: After getting the candidate Item Pairs, we start counting the occurrence of these pairs in the Transaction Set.


Step 5: Now again, follow the Golden Rule, and discard the non-frequent pairs.



Now we have a table of frequent item pairs. Suppose we want to find frequent triples. We take the above table and make all the possible combinations.

Step 6: Make combinations of triples using the frequent Item pairs.

To make triples, the rule is: if 12 and 13 are frequent, then the triple would be 123. Similarly, if 24 and 26 are frequent, then the triple would be 246.

So, using the above logic and our Frequent ItemPairs table, we get the below triples:


Step 7: Get the count of the above triples (Candidates).


After this, if we can form quartets, we generate them and count their occurrence/frequency.

If we had 123, 124, 134, 135, 234 and we wanted to generate quartets, they would be 1234 and 1345. After generating the quartets we would again count their occurrence/frequency, and repeat the same steps until no frequent itemsets remain.
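The joining rule used for triples and quartets can be written as one small function. This is a sketch under the assumption that itemsets are joined when they agree on everything except their last item; the name `join_candidates` is made up for illustration.

```python
def join_candidates(frequent_itemsets):
    """Join k-itemsets that share their first k-1 items into (k+1)-item candidates."""
    sorted_sets = sorted(tuple(sorted(itemset)) for itemset in frequent_itemsets)
    candidates = []
    for i, a in enumerate(sorted_sets):
        for b in sorted_sets[i + 1:]:
            # Same prefix, different last item: merge them into one larger itemset.
            if a[:-1] == b[:-1]:
                candidates.append(a + (b[-1],))
    return candidates


triples = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
quartets = join_candidates(triples)
print(quartets)  # [(1, 2, 3, 4), (1, 3, 4, 5)]
```

The output matches the quartets 1234 and 1345 from the text. These are only candidates: each one still has to be counted against the baskets, and a fuller Apriori implementation would also drop any candidate that has an infrequent subset before counting.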

Thus, the frequent ItemSets are:

- Frequent Itemsets of Size 1: 1, 2, 4, 5, 6
- Frequent Itemsets of Size 2: 14, 24, 25, 45, 46
- Frequent Itemsets of Size 3: 245

To learn how good an association rule is, i.e. how to calculate confidence and what support means, see Part II of this tutorial.



1/20/15

Python Tutorial: Strings Datatype

Data stored in memory can be of different types, and Python, like other languages, has several standard data types. Some time back we did a post on Python Numbers. Today we will be covering another standard datatype, i.e. Strings.

Note: All examples shown in the post are based on python3.


As in other languages, a string in Python is a contiguous sequence of characters enclosed within single or double quotation marks.

#!/usr/bin/python
strA = "Hello "
strB = 'World!'

#Printing the above variables on screen

#Python 2 syntax (a SyntaxError in Python 3, so it is commented out here):
#print strA
#print strB

print( strA ) #This will work in Python3
print( strB ) #This will work in Python3

Result of above Python3 Code

String Slicing:

Strings can be sliced, i.e. we can take a substring of a string, using the slice operator ([] for a single character, [:] for a range). The index starts from 0 and the end index of a range is not included.

#!/usr/bin/python

strA = "Hello "

print( strA[0] )   #prints the first character of strA: H
print( strA[1:3] ) #prints the characters at index 1 and 2 (index 3 is excluded): el
print( strA[3:] )  #prints everything from index 3 to the end: "lo "


String Concatenation:

Like other languages, Python also provides the functionality to concatenate strings. It is done with the + operator.

#!/usr/bin/python

strA = "Hello "
strB = 'World!'

print( "Print Concatenated Output: " + strA + strB )

Code Output
If you try to concatenate another datatype using the + operator, you get a TypeError such as "Can't convert 'int' object to str implicitly" (the exact wording varies between Python versions). To get around that we have two ways:

1. We can pass the values to print() separated by a comma.
2. We can use the built-in function str(). It converts a value of another datatype to a string, thus allowing us to use the + operator.

#!/usr/bin/python

strA = "Hello "

#print( strA + 4 ) #Uncommenting this line raises the TypeError mentioned above

#Correct Way to Concat String and another Datatype

print( strA, 4 ) #Method 1
print( strA + str(1234) ) #Method 2


Escape Characters:

By definition, an escape character is a character which invokes an alternative interpretation of the subsequent characters in a character sequence. In Python the escape character is the backslash (\), and it is interpreted in single-quoted as well as double-quoted strings.
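A few common escape sequences in action, as a small sketch (the variable names are made up for illustration):

```python
two_lines = "Line one\nLine two"   # \n starts a new line
tabbed = "Name:\tCode 2 Learn"     # \t inserts a tab
quoted = 'It\'s a single quote'    # \' keeps the quote inside '...'
backslash = "A backslash: \\"      # \\ stands for one backslash

print(two_lines)
print(tabbed)
print(quoted)
print(backslash)

# An escape sequence counts as a single character.
print(len("a\nb"))
```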

Below is the list of escape characters with their description:


Some of the special operators

We only saw the + operator, but apart from this there are many others. Below is the list of all operators:
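As a quick sketch of some commonly used string operators (the variable `greeting` is made up for illustration):

```python
greeting = "Hello "

repeated = greeting * 2           # repetition
has_h = "H" in greeting           # membership test
no_z = "z" not in greeting        # negated membership test
middle = greeting[1:4]            # range slice
raw_path = r"C:\new"              # raw string: the backslash is kept as-is

print(repeated)
print(has_h, no_z)
print(middle)
print(raw_path)
```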


Formatting Operator

The formatting operator % is one of the features which reminded me of the time when I used to write code in C. In Python it works in a similar way:

Below is a list of formatting operators:


Example:

#!/usr/bin/python

"""You can have multiple formatting operators, but remember that the values must follow % inside brackets (), separated by commas, in the same order as the placeholders"""

num = 2
post_num = 129

print( "Code %s Learn" % num )

print( "Code %s Learn's post number: %s" % (num, post_num) )

String Formatting Example Output

You must have noticed that I have used triple quotes in the above example. A triple-quoted string is really a string literal that can span several lines; when it is not assigned to anything, Python evaluates and discards it, so it is often used like a multi-line comment, whereas # is used for writing a single-line comment.

Python also provides many built-in methods for string manipulation. Below is the gist of some of them:
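A few of the most commonly used string methods, shown as a small sketch (the sample text is made up for illustration):

```python
text = "  Code 2 Learn  "

stripped = text.strip()                 # remove leading/trailing whitespace
print(stripped.upper())                 # CODE 2 LEARN
print(stripped.lower())                 # code 2 learn
print(stripped.replace("2", "to"))      # Code to Learn
print(stripped.split())                 # ['Code', '2', 'Learn']
print(stripped.find("Learn"))           # 7, the index where "Learn" starts
print(len(stripped))                    # 12
```

Strings are immutable, so each of these methods returns a new string (or list) and leaves the original untouched.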



Check Python Docs for detailed reference.


1/13/15

Install Python on Mac

Mac by default comes with Python 2.7 installed, but if you want to install the latest version of Python, it can be done easily.

We will need to follow the below steps:

Step 1: Go to the Python website and download the latest stable release.

Step 2: Open the installer and install Python on your Mac.

Step 3: Open Terminal and check the version of the Python that came pre-installed by typing python --version. If this gives you a result of Python 2.7.6 (2.7.*), don't worry, we are not done yet.

Step 4: If you type python in your terminal, it will run the pre-installed version, i.e. 2.7.*. To run our newly installed Python we have to type python3. We can make the terminal start Python 3 when we type python just by creating an alias.

Step 5: In your terminal type open ~/.bash_profile. If this gives a "Not Found" error, just create the file by typing: touch ~/.bash_profile.

Step 6: Once the file exists, add the following line to it and save:
 alias python="python3"

Step 7: We are not done yet. To make the changes take effect, type the following in your terminal: source ~/.bash_profile. This will apply the changes you made in the file to the current terminal session.

Step 8: You are done!! Type python on terminal and let us know :)
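To confirm which interpreter the python command now starts, a small check like the one below can be typed at the Python prompt. It only uses the standard sys module; the variable names are made up for illustration.

```python
import sys

interpreter_path = sys.executable         # full path of the running interpreter
major = sys.version_info.major            # 3 if the alias points at Python 3
minor = sys.version_info.minor

print(interpreter_path)
print("Python %s.%s" % (major, minor))
```

If the major version printed is still 2, the alias has not been picked up yet, for example because source ~/.bash_profile was not run in that terminal window.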



1/10/15

Informatics : The Future

Informatics is a new term/concept for many people. Many people who live in Europe know the term "Informatics" as a synonym for Computer Science. When I was thinking of doing a Master's in Europe, I found that the universities I was looking at didn't offer Computer Science; instead they offered a Master's in Informatics.

This term is very new for parents all over the world, and they are very sceptical about sending their son/daughter for a degree in Informatics, or even getting one themselves. Hopefully this post will make the picture clearer and help students/professionals look at it from a different angle.

What does Informatics mean?

The traditional definition that we see on search engines is "the Science of Information". But for me:

Informatics stands for DTP, where D is Data, T is Technology and P is People, i.e. a combination of all three. It's where computing (which we learn from computer science) is done with respect to another domain.

Put together, data which is generated by us is transformed by developers/analysts using technology in such a way that it can help people solve a problem or make the world a better place to live.

Informatics offered by:

There are a lot of universities offering degrees in Informatics in the United States of America. Below are a few:

1. University of Southern California
2. Indiana University
3. University of Michigan
4. University of Washington
5. UC Irvine
6. Carnegie Mellon University
7. Georgia Tech
8. Rutgers
9. Penn State

Informatics is not restricted to the computing field; it has applications in Medical, Retail, Social Networking, Health, Ocean, Sports etc.

Roles offered after degree:

As mentioned above, Informatics has applications in various fields, so the roles vary depending on the course and field you choose.

But for Informatics (General) the common roles are:

- Data Scientist
- Data Analyst
- Analyst
- Information Architects
- Software Engineers
- Hadoop Expert
- Interaction Designers, to name a few.

Future of Informatics:

The future of Informatics is very bright. As the data being generated in every field increases every second, the roles and jobs are increasing with it and becoming more diverse and niche.

Below is an image which explains how and why data is being generated:

The infographic is from Domo, a data visualization firm.

Finally, I would like to say that we might not be aware of the power and meaning hidden in data. But I feel lucky to have come across some real-life examples where data, combined with technology and people, is making a difference. For example, Sam Allardyce, manager of West Ham (a football club in the famous English Premier League), uses footballing data to decide which players to buy, and judging by West Ham's current season, the knowledge offered to him on recruiting is turning out to be a success.

10/19/13

Primary Index Choice Criteria

We have already covered Primary Indexes in an earlier tutorial, where we saw how a Primary Index works and how it helps us maximize performance.

Today we will learn how to choose the Primary Index for a given table. But before we move ahead to the criteria for choosing a Primary Index, here is a tip:

TIP

If you don't define any index on a table, then Teradata decides on its own. It makes the decision based on the points below:

1. If you have a column with a Primary Key constraint in the table definition, Teradata will make that column the Unique Primary Index.

2. If you have one or more columns with a Unique constraint, then Teradata will choose only the 1st such column in the table definition as the Unique Primary Index and the others as Unique Secondary Indexes.

3. If you don't have any column defined either as Primary Key or with a Unique constraint, then Teradata makes the 1st column in the table definition the Non Unique Primary Index.
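The three rules above can be restated as a tiny decision function. This is only a model of the rules as described in this post, written in Python for illustration; it is not Teradata code or any Teradata API, and the function name and the column descriptions are made up.

```python
def default_primary_index(columns):
    """Pick the default PI from (name, constraint) pairs given in definition order.

    constraint is "PRIMARY KEY", "UNIQUE" or None.
    """
    if not columns:
        raise ValueError("a table needs at least one column")
    # Rule 1: a Primary Key column becomes the Unique Primary Index.
    for name, constraint in columns:
        if constraint == "PRIMARY KEY":
            return name, "UNIQUE PRIMARY INDEX"
    # Rule 2: otherwise the first Unique column becomes the Unique Primary Index.
    for name, constraint in columns:
        if constraint == "UNIQUE":
            return name, "UNIQUE PRIMARY INDEX"
    # Rule 3: otherwise the first column becomes a Non Unique Primary Index.
    return columns[0][0], "NON UNIQUE PRIMARY INDEX"


# Hypothetical table definitions.
with_pk = [("name", None), ("emp_id", "PRIMARY KEY"), ("email", "UNIQUE")]
with_unique = [("name", None), ("email", "UNIQUE"), ("phone", "UNIQUE")]
plain = [("name", None), ("city", None)]

print(default_primary_index(with_pk))
print(default_primary_index(with_unique))
print(default_primary_index(plain))
```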

Primary Index Choice Criteria

Now let's start with the criteria for choosing a good Primary Index. Basically, there are three Primary Index choice criteria: Access Demographics, Distribution Demographics, and Volatility.

Access Demographics

By Access Demographics we mean the columns which users use to access the table, i.e. the columns used in the WHERE clause of the SQL statement. So choose the column(s) most frequently used for access, to maximize the number of one-AMP operations. We need to consider both value access and join access.

Distribution Demographics

As we know from our previous post on Primary Indexes, the more unique the index, the better the distribution. Optimizing distribution optimizes parallel processing.

Volatility

You will understand this point if you know the meaning of the word volatile. It means that we have to choose a column with a low change rate. The Primary Index should not be volatile at all, because any change to a PI value results in heavy I/O overhead: when the value changes, the row has to be moved from one AMP to another. Therefore, we have to choose a column with stable values.
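The effect of uniqueness on distribution can be illustrated with a small simulation. This is only a sketch: SHA-256 is used here as a stand-in hash and is an assumption, not Teradata's actual row-hash algorithm, and the function names, the 4 AMPs and the sample values are all made up.

```python
import hashlib
from collections import Counter


def amp_for(value, amp_count):
    """Map an index value to an AMP number using a stand-in hash (not Teradata's)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % amp_count


def rows_per_amp(values, amp_count):
    """Count how many rows land on each AMP for the given index values."""
    counts = Counter(amp_for(value, amp_count) for value in values)
    return [counts.get(amp, 0) for amp in range(amp_count)]


unique_ids = range(1000)        # a unique column, e.g. an id
genders = ["M", "F"] * 500      # a column with only two distinct values

unique_spread = rows_per_amp(unique_ids, 4)
skewed_spread = rows_per_amp(genders, 4)
print(unique_spread)  # rows spread over all 4 AMPs
print(skewed_spread)  # at most 2 AMPs receive any rows
```

With the two-valued column at least two of the four AMPs sit idle, so queries cannot use the full parallelism of the system, whereas the unique column spreads the rows across every AMP.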

**Note**

There is a trade-off between access and distribution demographics. The most desirable situation is to find a column which has both good access demographics and good distribution demographics.