1/20/15

Python Tutorial: Strings Datatype

Data stored in memory can of different types and Python like other languages have different standard data types. Sometime back we did a post on Python Numbers. Today we will be covering other standard datatypes i.e. Strings.

Note: All examples shown in the post are based on python3.


Like other languages, python also has the same meaning/definition for Strings. They are a contiguous set of characters enclosed within single/double quotation marks

#!/usr/bin/python
strA = "Hello "
strB = 'World!'

#Printing the above variables on screen

print strA #This will work in Python2
print strB #This will work in Python2

print( strA ) #This will work in Python3
print( strB ) #This will work in Python3

Result of above Python3 Code

String Slicing:

Strings can be sliced i.e. subsets of a string, using the slice operator ([:] or []). The index starts from 0.

#!/usr/bin/python

print( strA[0] )   #prints the first character of variable strA
print( strA[1:3] ) #prints characters from first index to third
print( strA[3:] )  #prints characters from third index


String Concatenation:

Like other languages python also provides the functionality to concatenate strings. It is done its the + operator

#!/usr/bin/python

print( "Print Concatenated Output: " + strA + strB )

Code Output
If you try to concat another datatype using + operator, you would get an error "cannot convert 'int' object to str implicitly". So to achieve that we have two ways:

1. We can do by putting values using a comma inside print()
2. Other way, we can use an inbuilt function str(). This will convert any datatype to string thus, allowing us to use + operator

#!/usr/bin/python

print( strA + 4 ) #This will give an error as mentioned above

#Correct Way to Concat String and another Datatype

print( strA, 4 ) #Method 1
print( strA + str(1234) ) #Method 2


Escape Characters:

The definition an escape character is a character which invokes an alternative interpretation on subsequent characters in a character sequence. It can be interpreted in a single as well as double quoted string.

Below is the list of escape characters with their description:


Some of the special operators

We only saw the + operator, but apart from this there are many others. Below is the list of all operators:


Formatting Operator

Formatting Operator %, is one of the features which reminded me of the time when I used to write code in C. Here in python it functions the same way:

Below is a list of formatting operators:


Example:

#!/usr/bin/python

"""You can have multiple formatting operators, but remember the sequence of variables must be followed after % inside a bracket () separated by comma"""

num = 2
post_num = 129

print( "Code %s Learn" %num) 

print( "Code %s Learn\'s post number: %s" %(num,post_num))

String Formatting Example Output

You must have noticed that I have used Triple quotes in the above example. Triple quote is used for writing multi-line comments, whereas # is used for writing a single line comment.

Python also provides multiple built-in functions for String manipulations. Below is the gist of some functions:



Check Python Docs for detailed reference.


1/13/15

Install Python on Mac

Mac by default comes with a Python 2.7 installed, but if you want to install latest version of Python, it can be down easily.

We will need to follow the below steps:

Step 1: You can either go to Python website and download the latest stable release. (Or just click here).

Step 2: Open the installer and instal Python on your Mac.

Step 3: Open Terminal and check the versions of the python that is came pre-installed by python --version. If this gives you a result of python 2.7.6 (2.7.*) then don't worry, we are not done yet.

Step 4: In your terminal if you type python, it will run the pre-installed version i.e. 2.7.*. To make our newly installed python we will have to type python3. We can make the terminal start Python 3, just by doing aliasing.

Step 5: On your terminal type open ~/.bash_profile. If this gives a "Not Found" error, just make the file by typing: touch ~/.bash_profile. This will create the file.

Step 6: After the file is created, copy the text
 alias python="python3"

Step 7: We are not done yet. To make the changes take effect, type the following in your terminal source ~/.bash_profile. This will apply the changes you made in your file.

Step 8: You are done!! Type python on terminal and let us know :)



1/10/15

Informatics : The Future

Informatics, a new term/concept for people. Many people who stay in Europe know this term "Informatics" as a synonym to Computer Science. Like myself when I was thinking to do a Master's from Europe, I found that universities where I was looking to go didn't have Computer Science instead they Master's in Informatics.

This term for parents all over the world is very new and parents are very sceptical about sending their son/daughter or even getting themselves a degree in Informatics. With this post hopefully the picture will be cleared and will help students/professional to look at it in a different aspect.

What does Informatics means?

If we go by the traditional definition that we see on the search engines is the Science of Information. But for me:

Informatics stands as the term DTP where D is Data, T is Technology and P is People i.e. combination of all three. Its where computing (which we learn from computer science) is done with respect to another domain. 

With the above diagram, it seems clear that, Data which is generated by us is transformed by developers/analysts using technology in such a way that it can help people solve a problem or make a the world a better place to live.

Informatics offered by:

There are a lot of Universities offering degree in Informatics in the United States of America, below are a few:

1. University of Southern California
2. Indiana University
3. University of Michigan
4. University of Washington
5. UC Irvine
6. Carnegie Mellon University
7. Georgia Tech
8. Rutgers
9. Penn State

Informatics is not just restricted to Computers field, it has application in Medical, Retail, Social Networking, Health, Ocean, Sports etc.

Roles offered after degree:

Informatics as mentioned above has its application in various field so depending on the course and field you undertake the roles vary.

But for Informatics (General) the common roles are:

- Data Scientist
- Data Analyst
- Analyst
- Information Architects
- Software Engineers
- Hadoop Expert
- Interaction Designers, to name a few.

Future of Informatics:

The future of Informatics is very bright as the data being generated in every field is increasing every second, the roles and jobs are proportionally increasing and getting more diverse and niche.

Below is an image which explains how and why of data being generated:

Infographic is from Domo, a data visualizing firm.
At last I would like to say,  in our world we might not be aware of what power and meaning data has hidden in it. But I feel lucky to come across some real-life examples where data with use of technology and people is making a difference. Just for an example : West Ham Football Club (a Football club in the famous English Premier League) Manager Sam Allardyce uses footballing data and then buys players accordingly and so far if we see West Ham current season, it's turning out that the knowledge offered to him on recruiting is turning out to be a success.

10/19/13

Primary Index Choice Criteria

We have already seen Tutorial Primary Index and we have understood how Primary Index work and how they us in maximizing performance.

Today we will be learning on how to choose Primary Index/Indexes in a given table. But before we move ahead in defining the criteria for choice of Primary Index, here is a Tip:

TIP

If you don't define any Index on a Table, then Teradata decides on its own. Its makes decision on the below points:

1. If you have any column with Primary Key Constraint in the table definition, Teradata will make column as Unique Primary Index.

2. If you have a/many columns with Unique Constraint, then Teradata will choose only the 1st column in the table definition as Unique Primary Index & others as Unique Secondary Index.

3. If you don't have column with defined as either Primary key or with Unique Constraint, then Teradata makes the 1st column in the table definition as Non Unique Primary Index.

Primary Index Choice Criteria

Now lets start with the Criteria's of Choosing a good Primary Index. Basically, there are three Primary Index Choice Criteria: Access Demographics, Distribution Demographics,  and Volatility.

Access Demographics

By Access Demographics we mean, those columns which were used by the user to access the table, i.e. columns used in the WHERE clause of the SQL Statement. So choose the column(s) which were most frequently used for access to maximize the number of one-AMP operation. We need to consider both value as well as join access

Distribution Demographics

So as we know from our previous post on Primary Indexes, More unique the index, the better the distribution. Optimizing distribution optimizes parallel-processing.

Volatilty

You must understood if you know the meaning of the word Volatile. This point means that we have to choose a column which will have a low change rate. The Primary Index should not at all be volatile because any changes in the PI will result in heavy I/O overhead i.e. as a result in change the PI has to be moved from one AMP to another. Therefore, we have to choose a column which will have stable values.

**Note**

There is a trade-off between the access & distribution demographics. The most desirable situation is to find a column which has a good access and good distribution demographics.


5/27/13

Tutorial on Primary Index in Teradata

Last tutorial was on Indexes in Teradata in which we covered Primary Index, Secondary Index and Join Index. What we saw was just a summary of all the Indexes.

This tutorial will help you in understanding Primary Index in a much clearer way. As we saw last time a Primary Index determines the distribution of the rows  on the disks controlled AMPs. A primary index in Teradata in required for row distribution and storage. 

Below image shows how the distribution takes place using the Primary Indexes: 



The PI for a table should represent the data values used by SQL to access the data for the table. Therefore, a PI for a table should be selected very carefully as it can improve or degrade the performance of SQL used for accessing that table. 

Rules for Defining a PRIMARY INDEX



Below are some rules that I have come across while identifying on which column to be defined as PI:

* The PI selected should be such that it has unique index as more the unique index more evenly the rows will be distributed by the AMPs and better will be the space utilization.
* Define the index on as few columns as you can. 
* The PI defined can have unique or non-unique values. If defined as unique then you will have to make sure that the each value you pass into the column is unique and never repeated whereas if defined non-unique you can pass duplicate values into that column.
* Flags and Junk Dimensions are never to be defined as PI as it leads to more skewness. Therefore that data in skewed and when you perform a join it can hamper the performance and lead to spool space error.

NOTE: 

If you forget to define PRIMARY INDEX while doing a CREATE TABLE, the default will be use i.e. on the following:

* PRIMARY key
* First UNIQUE constraint
* First column

Creating PRIMARY INDEX


Unique primary index in created using the (UNIQUE) PRIMARY INDEX clause in the CREATE TABLE statement. Non-unique primary index are also created in the same way just removing the (UNIQUE) clause.

NOTE : Once a PRIMARY INDEX in created on TABLE it cannot be dropped or modified, the index must be changed by recreating the TABLE.

Example: Creating a UNIQUE PRIMARY INDEX


CREATE TABLE EMPLOYEE
(EMP_ID INTEGER
, EMP_NAME CHAR(15)
, EMP_GRADE CHAR(1))
 UNIQUE PRIMARY INDEX (EMP_ID);

Example: Creating a NON-UNIQUE PRIMARY INDEX


CREATE TABLE STUDENT
(STUDENT_ID CHAR(15)
, STUDENT_NAME CHAR(20)
, DEPARTMENT CHAR(10)) 
PRIMARY INDEX (STUDENT_ID);


Data Access using PI




When a query contains a WHERE clause, and it has primary index value(s), the request in then processed by hashing the values to find the AMP where the row is stored and then the row which contains the same hash value in the RowID part of it is returned.


Primary Key vs Primary Index



I have seen many people who are working on Teradata still confuse between Primary Key and Primary Index. So here I have done a small comparison between both as both differ conceptually:


Term
Primary Key
Primary Index
Requirement Not required, unless referential integrity checks are to be performed Required
Defining Define by CREATE TABLE statement Defined by CREATE TABLE statement
Uniqueness Unique Unique or non-unique
Function Identifies a row uniquely Distributes rows
Values can be changed? No Yes
Can be null? No Yes
Related to access path? No Yes

Hope this tutorial and the above short comparison helps everyone to understand what exactly PRIMARY INDEX is and how is should be chosen and defined. Do let me know your views about the post and help me in improving it.


4/20/13

Indexes in Teradata

Teradata as we all know is a Relational Database Management System (RDBMS) for the world's largest commercial databases. It is the market leader in  Data Warehousing. Its architecture is so designed that it takes advantage of the concept of parallelism.

Almost every Database has what we know as KEY i.e. Primary Key, Foreign Key etc defined while implementing the DDL of a table. But in Teradata we don't have any concept of Keys instead what we have is INDEX. Indexes are conceptually different from keys. PRIMARY KEY is relational modelling term that defines, in the logical model whereas the PRIMARY INDEX is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

What is an Index?


In the Teradata RDBMS, an index is used to define row uniqueness and retrieve data rows, it also can be used to enforce the primary key and unique constraint for a table. The Teradata RDBMS support five types of indexes:

* Unique Primary Index (UPI)
* Unique Secondary Index (USI)
* Non-Unique Primary Index (NUPI)
* Non-Unique Secondary Index (NUSI)
* Join Index

A typical index contains two fields:

* A Value
* A Pointer to instances of the value in a data table

Teradata uses the index and hashing to distribute rows across the AMPs, the value is condensed into an entity called a row hash.

The following rules apply to the indexes:

* As mentioned above index help in distribution and retrieval of rows of a table. It can be made up of one or more than one column.
* A table can have a number of indexes, including 1 Primary Index and more than 32 Secondary Index.
* An index defined for a table can be primary or secondary, and can be unique or non-unique. Each type of index affects the performance of the system and also the data integrity.
* An index is generally defined on columns of a table which are generally used in WHERE or join condition.
* Index helps in enforcing PRIMARY KEY and UNIQUE constraint.


Primary Index



Primary Index determines the distribution of the rows on the disks controlled AMPs. A primary index in Teradata RDBMS is required for row distribution and storage. When a row is inserted, its hash code is calculated using a hashing algorithm and depending on the whether the index is UNIQUE or Non-UNIQUE it is checked for duplication of those index. Rows having the same hash value are stored on the AMP.




Secondary Index



In addition to a primary index, up to 32 unique and non-unique secondary indexes can be defined to a table. Comparing to primary indexes, Secondary indexes allow an alternate path to access the rows in the table which is less frequently used. Basically a Secondary Index is a subtable that is stored in all AMPs separately from the primary table. The subtable made and maintained by the system contains the following information:

* RowIDs of the subtable rows
* Base table index column values
* RowIDs of the base table rows

Below diagram shows how the secondary index subtable on each AMP is associated with the base table by the RowIDs:


Join Index



A join index is nothing but an indexing structure which contains columns from multiple tables. Rather than having to join individual tables each time the join  operation is needed, the query can be resolved via a join index and in most cases dramatically improves performance.

SYNTAX:

CREATE JOIN INDEX [<database-name>.]<index_name>
[[NO] FALLBACK]
AS SELECT
[<columns>]
[SUM numeric-expression]
[COUNT coulmn-expression]
[EXTRACT year|month from date-expressino]
FROM [<database-name(s).]<table-names>]
[WHERE <search-condition>]
[GROUP BY <coulmn-name>]
[ORDER BY <coulmn-name>]
PRIMARY INDEX(<column-name>)
[index <column-name> ORDER by HASH|VALUES]
;