Data Analysis with Python

Monte Lunacek

pandas

Resources

Data Analysis

Raw data

Processing

Exploratory data Analysis

Analysis

Visualization

Why pandas?

80% of the effort in data analysis is spent cleaning data. Hadley Wickham

Efficency

Raw data is often in the wrong format

Storage may be best in a different format

Outline

Simple example

Movies

Simple example

Based on Data Wrangling Kung Fu with Pandas by Wes McKinney

In [1]:
import os
import pandas as  pd
import numpy as np
In [2]:
dates = ['2014-02-16', '2014-02-17', '2014-02-18', '2014-02-19']
algs = ['Model-A','Model-B','Model-C']

filename = os.path.join('data','example.csv')

with open(filename,'w') as outfile:
    outfile.write('date,type,value\n')
    for d in dates:
        for a in algs:
            v = np.random.randint(10, size=1)[0]
            tmp = '{0},{1},{2}\n'.format(d,a,v)
            outfile.write(tmp)
In [3]:
with open(filename, 'r') as infile:
    print infile.read()
date,type,value
2014-02-16,Model-A,1
2014-02-16,Model-B,3
2014-02-16,Model-C,4
2014-02-17,Model-A,8
2014-02-17,Model-B,5
2014-02-17,Model-C,8
2014-02-18,Model-A,8
2014-02-18,Model-B,8
2014-02-18,Model-C,0
2014-02-19,Model-A,5
2014-02-19,Model-B,1
2014-02-19,Model-C,6


Creating a DataFrame

In [5]:
df = pd.read_csv(filename)
df
Out[5]:
date type value
0 2014-02-16 Model-A 1
1 2014-02-16 Model-B 3
2 2014-02-16 Model-C 4
3 2014-02-17 Model-A 8
4 2014-02-17 Model-B 5
5 2014-02-17 Model-C 8
6 2014-02-18 Model-A 8
7 2014-02-18 Model-B 8
8 2014-02-18 Model-C 0
9 2014-02-19 Model-A 5
10 2014-02-19 Model-B 1
11 2014-02-19 Model-C 6

12 rows × 3 columns

Why store it this way?

Reshape with pivot

In [6]:
results = df.pivot('date', 'type', 'value') #row, column, values (optional)
print results
type        Model-A  Model-B  Model-C
date                                 
2014-02-16        1        3        4
2014-02-17        8        5        8
2014-02-18        8        8        0
2014-02-19        5        1        6

[4 rows x 3 columns]

In [7]:
results.columns
Out[7]:
Index([u'Model-A', u'Model-B', u'Model-C'], dtype='object')
In [8]:
results.index
Out[8]:
Index([u'2014-02-16', u'2014-02-17', u'2014-02-18', u'2014-02-19'], dtype='object')

Columns access

In [9]:
results['Model-A']
Out[9]:
date
2014-02-16    1
2014-02-17    8
2014-02-18    8
2014-02-19    5
Name: Model-A, dtype: int64
In [10]:
results['Model-A'].values
Out[10]:
array([1, 8, 8, 5])

Row access

In [11]:
results.ix[0]
Out[11]:
type
Model-A    1
Model-B    3
Model-C    4
Name: 2014-02-16, dtype: int64
In [12]:
results.ix['2014-02-16']
Out[12]:
type
Model-A    1
Model-B    3
Model-C    4
Name: 2014-02-16, dtype: int64

Range access

In [13]:
print results.ix[2:4,1:]
type        Model-B  Model-C
date                        
2014-02-18        8        0
2014-02-19        1        6

[2 rows x 2 columns]

Summarize rows and columns

Question: What is the average value for each date?

In [14]:
results.mean(axis=1)
Out[14]:
date
2014-02-16    2.666667
2014-02-17    7.000000
2014-02-18    5.333333
2014-02-19    4.000000
dtype: float64

How many observations do I have for each model?

In [15]:
results.count(axis=0)
Out[15]:
type
Model-A    4
Model-B    4
Model-C    4
dtype: int64

Add some data with pd.concat

In [16]:
df = pd.read_csv(filename)
tmp = {'date': ['2014-02-16','2014-02-18'],
       'type': ['Model-D', 'Model-D'],
       'value': [11, 7]}

pd.DataFrame(tmp)
Out[16]:
date type value
0 2014-02-16 Model-D 11
1 2014-02-18 Model-D 7

2 rows × 3 columns

In [17]:
df = pd.concat([df,pd.DataFrame(tmp)], ignore_index=True)
df.shape
Out[17]:
(14, 3)

Delete a row

In [18]:
df.drop(2, axis=0).head()
Out[18]:
date type value
0 2014-02-16 Model-A 1
1 2014-02-16 Model-B 3
3 2014-02-17 Model-A 8
4 2014-02-17 Model-B 5
5 2014-02-17 Model-C 8

5 rows × 3 columns

In [19]:
df.drop(2, inplace=True)
In [20]:
df.drop('type', axis=1).head()
Out[20]:
date value
0 2014-02-16 1
1 2014-02-16 3
3 2014-02-17 8
4 2014-02-17 5
5 2014-02-17 8

5 rows × 2 columns

Let's reshape again...

In [21]:
results = df.pivot('date','type', 'value')
print results
type        Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3      NaN       11
2014-02-17        8        5        8      NaN
2014-02-18        8        8        0        7
2014-02-19        5        1        6      NaN

[4 rows x 4 columns]

In [22]:
results.mean(axis=1)
Out[22]:
date
2014-02-16    5.00
2014-02-17    7.00
2014-02-18    5.75
2014-02-19    4.00
dtype: float64
In [23]:
results.count(axis=0)
Out[23]:
type
Model-A    4
Model-B    4
Model-C    3
Model-D    2
dtype: int64
In [24]:
results.count(axis=1)
Out[24]:
date
2014-02-16    3
2014-02-17    3
2014-02-18    4
2014-02-19    3
dtype: int64

Missing vales: isnull() and fillna()

In [27]:
print results.isnull()
type       Model-A Model-B Model-C Model-D
date                                      
2014-02-16   False   False    True   False
2014-02-17   False   False   False    True
2014-02-18   False   False   False   False
2014-02-19   False   False   False    True

[4 rows x 4 columns]

In [28]:
print results.fillna(0)
type        Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3        0       11
2014-02-17        8        5        8        0
2014-02-18        8        8        0        7
2014-02-19        5        1        6        0

[4 rows x 4 columns]

In [29]:
print results
type        Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3      NaN       11
2014-02-17        8        5        8      NaN
2014-02-18        8        8        0        7
2014-02-19        5        1        6      NaN

[4 rows x 4 columns]

In [30]:
tmp = results.copy()
In [31]:
tmp.fillna(0, inplace=True)
print tmp
type        Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3        0       11
2014-02-17        8        5        8        0
2014-02-18        8        8        0        7
2014-02-19        5        1        6        0

[4 rows x 4 columns]

reset_index

In [32]:
tmp.reset_index(inplace=True)
tmp.columns
Out[32]:
Index([u'date', u'Model-A', u'Model-B', u'Model-C', u'Model-D'], dtype='object')
In [33]:
print tmp
type        date  Model-A  Model-B  Model-C  Model-D
0     2014-02-16        1        3        0       11
1     2014-02-17        8        5        8        0
2     2014-02-18        8        8        0        7
3     2014-02-19        5        1        6        0

[4 rows x 5 columns]

Convert to a numpy array

In [34]:
print tmp.set_index('date')
            Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3        0       11
2014-02-17        8        5        8        0
2014-02-18        8        8        0        7
2014-02-19        5        1        6        0

[4 rows x 4 columns]

In [35]:
X = tmp.set_index('date').as_matrix()
X
Out[35]:
array([[  1.,   3.,   0.,  11.],
       [  8.,   5.,   8.,   0.],
       [  8.,   8.,   0.,   7.],
       [  5.,   1.,   6.,   0.]])

Reshape with melt

In [36]:
results = df.pivot('date','type', 'value')
print results
type        Model-A  Model-B  Model-C  Model-D
date                                          
2014-02-16        1        3      NaN       11
2014-02-17        8        5        8      NaN
2014-02-18        8        8        0        7
2014-02-19        5        1        6      NaN

[4 rows x 4 columns]

In [37]:
results.reset_index(inplace=True)
print results
type        date  Model-A  Model-B  Model-C  Model-D
0     2014-02-16        1        3      NaN       11
1     2014-02-17        8        5        8      NaN
2     2014-02-18        8        8        0        7
3     2014-02-19        5        1        6      NaN

[4 rows x 5 columns]

In [38]:
back = pd.melt(results, id_vars=['date'])
print back
          date     type  value
0   2014-02-16  Model-A      1
1   2014-02-17  Model-A      8
2   2014-02-18  Model-A      8
3   2014-02-19  Model-A      5
4   2014-02-16  Model-B      3
5   2014-02-17  Model-B      5
6   2014-02-18  Model-B      8
7   2014-02-19  Model-B      1
8   2014-02-16  Model-C    NaN
9   2014-02-17  Model-C      8
10  2014-02-18  Model-C      0
11  2014-02-19  Model-C      6
12  2014-02-16  Model-D     11
13  2014-02-17  Model-D    NaN
14  2014-02-18  Model-D      7
15  2014-02-19  Model-D    NaN

[16 rows x 3 columns]

dropna()

In [40]:
back.dropna(axis=0)
Out[40]:
date type value
0 2014-02-16 Model-A 1
1 2014-02-17 Model-A 8
2 2014-02-18 Model-A 8
3 2014-02-19 Model-A 5
4 2014-02-16 Model-B 3
5 2014-02-17 Model-B 5
6 2014-02-18 Model-B 8
7 2014-02-19 Model-B 1
9 2014-02-17 Model-C 8
10 2014-02-18 Model-C 0
11 2014-02-19 Model-C 6
12 2014-02-16 Model-D 11
14 2014-02-18 Model-D 7

13 rows × 3 columns

In [41]:
back.dropna(axis=1).head()
Out[41]:
date type
0 2014-02-16 Model-A
1 2014-02-17 Model-A
2 2014-02-18 Model-A
3 2014-02-19 Model-A
4 2014-02-16 Model-B

5 rows × 2 columns

Write to file to_csv

In [42]:
back.dropna(axis=0, inplace=True)
In [43]:
back.to_csv('back.csv', index=False)
In [44]:
print open('back.csv').read()
date,type,value
2014-02-16,Model-A,1.0
2014-02-17,Model-A,8.0
2014-02-18,Model-A,8.0
2014-02-19,Model-A,5.0
2014-02-16,Model-B,3.0
2014-02-17,Model-B,5.0
2014-02-18,Model-B,8.0
2014-02-19,Model-B,1.0
2014-02-17,Model-C,8.0
2014-02-18,Model-C,0.0
2014-02-19,Model-C,6.0
2014-02-16,Model-D,11.0
2014-02-18,Model-D,7.0


Movies

This is adapted from Python for Data Analysis by Wes McKinney

Ultimate question: What's the best date-night movie?

In [1]:
import os
import pandas as pd

Read in the movie data: pd.read_table

In [2]:
def get_movie_data():
    
    unames = ['user_id','gender','age','occupation','zip']
    users = pd.read_table(os.path.join('data','movies','users.dat'), 
                          sep='::', header=None, names=unames)
    
    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table(os.path.join('data','movies','ratings.dat'), 
                            sep='::', header=None, names=rnames)
    
    mnames = ['movie_id', 'title','genres']
    movies = pd.read_table(os.path.join('data','movies','movies.dat'), 
                           sep='::', header=None, names=mnames)

    return users, ratings, movies
In [3]:
users, ratings, movies = get_movie_data()
In [4]:
print users.head()
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

[5 rows x 5 columns]

In [5]:
print ratings.head()
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

[5 rows x 4 columns]

In [6]:
print movies.head()
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

[5 rows x 3 columns]

Clean up the movies

This is dense code. Skip.

In [7]:
tmp = movies.title.str.match('(.*) \(([0-9]+)\)')
movies['year'] = tmp.map(lambda x: x[1] if len(x) > 0 else None)
movies['short_title'] = tmp.map(lambda x: x[0][:40] if len(x) > 0 else None)

Join the tables with pd.merge

In [8]:
data = pd.merge(pd.merge(ratings, users), movies)
In [9]:
for c in data.columns:
    print c
user_id
movie_id
rating
timestamp
gender
age
occupation
zip
title
genres
year
short_title

What's the highest rated movie?

In [14]:
tmp = data[['short_title','rating']]
print tmp.head()
print len(tmp)
                       short_title  rating
0  One Flew Over the Cuckoo's Nest       5
1  One Flew Over the Cuckoo's Nest       5
2  One Flew Over the Cuckoo's Nest       4
3  One Flew Over the Cuckoo's Nest       4
4  One Flew Over the Cuckoo's Nest       5

[5 rows x 2 columns]
1000209

Summary operations with groupby

In [15]:
grp = tmp.groupby('short_title')
In [16]:
print type(grp)
<class 'pandas.core.groupby.DataFrameGroupBy'>

Summary with describe()

In [17]:
mean_rating = grp.mean()
print mean_rating.describe()
            rating
count  3664.000000
mean      3.237347
std       0.674236
min       1.000000
25%       2.820216
50%       3.329545
75%       3.740150
max       5.000000

[8 rows x 1 columns]

In [18]:
print mean_rating.head()
                          rating
short_title                     
$1,000,000 Duck         3.027027
'Night Mother           3.371429
'Til There Was You      2.692308
'burbs, The             2.910891
...And Justice for All  3.713568

[5 rows x 1 columns]

What's the highest rated movie?

Sort by ratings using the sort() method.

In [19]:
print mean_rating.sort('rating', ascending=False).head(10)
                                    rating
short_title                               
Ulysses (Ulisse)                         5
Schlafes Bruder (Brother of Sleep)       5
Smashing Time                            5
Song of Freedom                          5
Gate of Heavenly Peace, The              5
Lured                                    5
Baby, The                                5
Bittersweet Motel                        5
Follow the Bitch                         5
One Little Indian                        5

[10 rows x 1 columns]

Apply more than one function to the group with the agg() method.

In [20]:
mean_rating = grp['rating'].agg(['mean','count'])
print mean_rating.sort('mean', ascending=False).head(10)
                                    mean  count
short_title                                    
Ulysses (Ulisse)                       5      1
Schlafes Bruder (Brother of Sleep)     5      1
Smashing Time                          5      2
Song of Freedom                        5      1
Gate of Heavenly Peace, The            5      3
Lured                                  5      1
Baby, The                              5      1
Bittersweet Motel                      5      1
Follow the Bitch                       5      1
One Little Indian                      5      1

[10 rows x 2 columns]

Threshold on the number of ratings

In [21]:
mask = mean_rating['count'] > 1000

print type(mask)
print sum(mask)
print mask.head()
<class 'pandas.core.series.Series'>
210
short_title
$1,000,000 Duck           False
'Night Mother             False
'Til There Was You        False
'burbs, The               False
...And Justice for All    False
Name: count, dtype: bool

In [22]:
print mean_rating.ix[mask].head()
                           mean  count
short_title                           
2001: A Space Odyssey  4.068765   1716
Abyss, The             3.683965   1715
African Queen, The     4.251656   1057
Air Force One          3.588290   1076
Airplane!              3.971115   1731

[5 rows x 2 columns]

In [23]:
mean_rating.ix[mask]['count'].min()
Out[23]:
1001

Highest rated movie with at least 1000 votes?

In [24]:
print mean_rating.ix[mask].sort('mean', ascending=False).head(10)
                                              mean  count
short_title                                              
Shawshank Redemption, The                 4.554558   2227
Godfather, The                            4.524966   2223
Usual Suspects, The                       4.517106   1783
Schindler's List                          4.510417   2304
Raiders of the Lost Ark                   4.477725   2514
Rear Window                               4.476190   1050
Star Wars: Episode IV - A New Hope        4.453694   2991
Dr. Strangelove or: How I Learned to Sto  4.449890   1367
Casablanca                                4.412822   1669
Sixth Sense, The                          4.406263   2459

[10 rows x 2 columns]

What about gender?

In [25]:
data.head(2)
Out[25]:
user_id movie_id rating timestamp gender age occupation zip title genres year short_title
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama 1975 One Flew Over the Cuckoo's Nest
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama 1975 One Flew Over the Cuckoo's Nest

2 rows × 12 columns

Summary pivot with pd.pivot_table

Like pivot, but will summarize and group.

In [26]:
mean_ratings = pd.pivot_table(data, 'rating', rows='short_title', 
                              cols='gender', aggfunc='mean')
print mean_ratings.head(10)
gender                             F         M
short_title                                   
$1,000,000 Duck             3.375000  2.761905
'Night Mother               3.388889  3.352941
'Til There Was You          2.675676  2.733333
'burbs, The                 2.793478  2.962085
...And Justice for All      3.828571  3.689024
1-900                       2.000000  3.000000
10 Things I Hate About You  3.646552  3.311966
101 Dalmatians              3.545994  3.287162
12 Angry Men                4.184397  4.328421
13th Warrior, The           3.112000  3.168000

[10 rows x 2 columns]

Only those that have at least 1000 votes.

In [27]:
pd.pivot_table?
In [29]:
mean_ratings = mean_ratings.ix[mask]

Favorites for M

In [30]:
print mean_ratings.sort('M', ascending=False).head(10)
gender                                           F         M
short_title                                                 
Godfather, The                            4.314700  4.583333
Shawshank Redemption, The                 4.539075  4.560625
Raiders of the Lost Ark                   4.332168  4.520597
Usual Suspects, The                       4.513317  4.518248
Star Wars: Episode IV - A New Hope        4.302937  4.495307
Schindler's List                          4.562602  4.491415
Rear Window                               4.484536  4.472991
Dr. Strangelove or: How I Learned to Sto  4.376623  4.464789
Casablanca                                4.300990  4.461340
Godfather: Part II, The                   4.040936  4.437778

[10 rows x 2 columns]

Favorites for F

In [31]:
print mean_ratings.sort('F', ascending=False).head(10)
gender                                           F         M
short_title                                                 
Schindler's List                          4.562602  4.491415
Shawshank Redemption, The                 4.539075  4.560625
Usual Suspects, The                       4.513317  4.518248
Rear Window                               4.484536  4.472991
Sixth Sense, The                          4.477410  4.379944
Life Is Beautiful (La Vita � bella)       4.422343  4.286624
Dr. Strangelove or: How I Learned to Sto  4.376623  4.464789
North by Northwest                        4.364458  4.390641
Wizard of Oz, The                         4.355030  4.203138
Amadeus                                   4.346734  4.213415

[10 rows x 2 columns]

Which movies do differ the most in gender ratings?

In [32]:
mean_ratings['diff'] = abs(mean_ratings['M'] - mean_ratings['F'])
In [33]:
mean_ratings.sort('diff', ascending=False).head(10)
Out[33]:
gender F M diff
short_title
Animal House 3.628906 4.167192 0.538286
Rocky Horror Picture Show, The 3.673016 3.160131 0.512885
Mary Poppins 4.197740 3.730594 0.467147
Reservoir Dogs 3.769231 4.213873 0.444642
Gone with the Wind 4.269841 3.829371 0.440471
South Park: Bigger, Longer and Uncut 3.422481 3.846686 0.424206
Airplane! 3.656566 4.064419 0.407854
Predator 3.299401 3.706195 0.406793
Godfather: Part II, The 4.040936 4.437778 0.396842
Clockwork Orange, A 3.757009 4.145813 0.388803

10 rows × 3 columns

In [34]:
mean_ratings.sort('diff', ascending=True).head(10)
Out[34]:
gender F M diff
short_title
Jerry Maguire 3.758315 3.759424 0.001109
Indiana Jones and the Temple of Doom 3.674312 3.676568 0.002256
Good Will Hunting 4.174672 4.177064 0.002392
Fugitive, The 4.100457 4.104046 0.003590
Batman Returns 2.980100 2.975904 0.004196
Usual Suspects, The 4.513317 4.518248 0.004931
Green Mile, The 4.159722 4.153105 0.006617
Boogie Nights 3.763838 3.771295 0.007458
Chicken Run 3.885559 3.877339 0.008220
Blair Witch Project, The 3.038732 3.029381 0.009351

10 rows × 3 columns

Date-night pick?

What's the highest rated movies where the difference is minimal?

Let's pick out minimal.

In [35]:
mean_ratings['diff'].hist(alpha=0.5)
show()
In [38]:
mean_ratings.hist(alpha=0.5) #blue
#mean_ratings['M'].hist(alpha=0.5)
show()

How about diff < 0.05 and rating > 4.25?

In [39]:
diff_mask = mean_ratings['diff'] < 0.05 
m_mask = mean_ratings['M'] > 4.25
f_mask = mean_ratings['F'] > 4.25
In [40]:
mask = diff_mask & m_mask & f_mask
In [41]:
tmp = mean_ratings[mask]
In [42]:
print tmp
gender                            F         M      diff
short_title                                            
North by Northwest         4.364458  4.390641  0.026183
Rear Window                4.484536  4.472991  0.011545
Shawshank Redemption, The  4.539075  4.560625  0.021550
Usual Suspects, The        4.513317  4.518248  0.004931

[4 rows x 3 columns]

In [43]:
tmp['mean'] = tmp['M'] + tmp['F']
In [44]:
print tmp.sort('mean', ascending=True)
gender                            F         M      diff      mean
short_title                                                      
North by Northwest         4.364458  4.390641  0.026183  8.755099
Rear Window                4.484536  4.472991  0.011545  8.957527
Usual Suspects, The        4.513317  4.518248  0.004931  9.031565
Shawshank Redemption, The  4.539075  4.560625  0.021550  9.099700

[4 rows x 4 columns]

Hierarchical Indexing

Based on Data Wrangling Kung Fu with Pandas by Wes McKinney

In [1]:
import os
import pandas as  pd
import numpy as np
In [2]:
filename = os.path.join('data','example.csv')
df = pd.read_csv(filename)
print df
          date     type  value
0   2014-02-16  Model-A      1
1   2014-02-16  Model-B      3
2   2014-02-16  Model-C      4
3   2014-02-17  Model-A      8
4   2014-02-17  Model-B      5
5   2014-02-17  Model-C      8
6   2014-02-18  Model-A      8
7   2014-02-18  Model-B      8
8   2014-02-18  Model-C      0
9   2014-02-19  Model-A      5
10  2014-02-19  Model-B      1
11  2014-02-19  Model-C      6

[12 rows x 3 columns]

Add another column of data

In [3]:
df.shape
Out[3]:
(12, 3)
In [4]:
df['score'] = np.random.rand(len(df))
df.shape
Out[4]:
(12, 4)
In [5]:
print df.head()
         date     type  value     score
0  2014-02-16  Model-A      1  0.202855
1  2014-02-16  Model-B      3  0.287901
2  2014-02-16  Model-C      4  0.539970
3  2014-02-17  Model-A      8  0.142716
4  2014-02-17  Model-B      5  0.252482

[5 rows x 4 columns]

Hierarchical columns

In [6]:
results = df.pivot('date', 'type') #row, column, values (optional)
print results
              value                       score                    
type        Model-A  Model-B  Model-C   Model-A   Model-B   Model-C
date                                                               
2014-02-16        1        3        4  0.202855  0.287901  0.539970
2014-02-17        8        5        8  0.142716  0.252482  0.801581
2014-02-18        8        8        0  0.510448  0.752879  0.038923
2014-02-19        5        1        6  0.742021  0.561749  0.210681

[4 rows x 6 columns]

I have a hierarchical index on the columns:

In [8]:
results.columns
Out[8]:
MultiIndex(levels=[[u'value', u'score'], [u'Model-A', u'Model-B', u'Model-C']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[None, u'type'])
In [9]:
results.count(axis=1)
Out[9]:
date
2014-02-16    6
2014-02-17    6
2014-02-18    6
2014-02-19    6
dtype: int64
In [10]:
results['value'].count(axis=1)
Out[10]:
date
2014-02-16    3
2014-02-17    3
2014-02-18    3
2014-02-19    3
dtype: int64

I can access each component of the index.

In [12]:
print results['score']['Model-A']
date
2014-02-16    0.202855
2014-02-17    0.142716
2014-02-18    0.510448
2014-02-19    0.742021
Name: Model-A, dtype: float64

Swap the order of the index.

In [13]:
tmp = results.swaplevel(0,1, axis=1)
print tmp
type        Model-A  Model-B  Model-C   Model-A   Model-B   Model-C
              value    value    value     score     score     score
date                                                               
2014-02-16        1        3        4  0.202855  0.287901  0.539970
2014-02-17        8        5        8  0.142716  0.252482  0.801581
2014-02-18        8        8        0  0.510448  0.752879  0.038923
2014-02-19        5        1        6  0.742021  0.561749  0.210681

[4 rows x 6 columns]

In [14]:
print tmp['Model-A']
            value     score
date                       
2014-02-16      1  0.202855
2014-02-17      8  0.142716
2014-02-18      8  0.510448
2014-02-19      5  0.742021

[4 rows x 2 columns]

stack and unstack

In [15]:
print results
              value                       score                    
type        Model-A  Model-B  Model-C   Model-A   Model-B   Model-C
date                                                               
2014-02-16        1        3        4  0.202855  0.287901  0.539970
2014-02-17        8        5        8  0.142716  0.252482  0.801581
2014-02-18        8        8        0  0.510448  0.752879  0.038923
2014-02-19        5        1        6  0.742021  0.561749  0.210681

[4 rows x 6 columns]

In [16]:
print results.stack() #Defaults to highest level, eg. 1 in this case
                    value     score
date       type                    
2014-02-16 Model-A      1  0.202855
           Model-B      3  0.287901
           Model-C      4  0.539970
2014-02-17 Model-A      8  0.142716
           Model-B      5  0.252482
           Model-C      8  0.801581
2014-02-18 Model-A      8  0.510448
           Model-B      8  0.752879
           Model-C      0  0.038923
2014-02-19 Model-A      5  0.742021
           Model-B      1  0.561749
           Model-C      6  0.210681

[12 rows x 2 columns]

Now we have a hierarchical index on the rows.

In [17]:
print results.stack().index
date        type   
2014-02-16  Model-A
            Model-B
            Model-C
2014-02-17  Model-A
            Model-B
            Model-C
2014-02-18  Model-A
            Model-B
            Model-C
2014-02-19  Model-A
            Model-B
            Model-C

In [18]:
print results.stack(0)
type               Model-A   Model-B   Model-C
date                                          
2014-02-16 value  1.000000  3.000000  4.000000
           score  0.202855  0.287901  0.539970
2014-02-17 value  8.000000  5.000000  8.000000
           score  0.142716  0.252482  0.801581
2014-02-18 value  8.000000  8.000000  0.000000
           score  0.510448  0.752879  0.038923
2014-02-19 value  5.000000  1.000000  6.000000
           score  0.742021  0.561749  0.210681

[8 rows x 3 columns]

In [19]:
print results.stack(0).unstack()
type        Model-A            Model-B            Model-C          
              value     score    value     score    value     score
date                                                               
2014-02-16        1  0.202855        3  0.287901        4  0.539970
2014-02-17        8  0.142716        5  0.252482        8  0.801581
2014-02-18        8  0.510448        8  0.752879        0  0.038923
2014-02-19        5  0.742021        1  0.561749        6  0.210681

[4 rows x 6 columns]

Hierarchical Rows

In [20]:
df.head()
Out[20]:
date type value score
0 2014-02-16 Model-A 1 0.202855
1 2014-02-16 Model-B 3 0.287901
2 2014-02-16 Model-C 4 0.539970
3 2014-02-17 Model-A 8 0.142716
4 2014-02-17 Model-B 5 0.252482

5 rows × 4 columns

In [21]:
df.set_index(['date','type'], inplace=True)
df.head()
Out[21]:
value score
date type
2014-02-16 Model-A 1 0.202855
Model-B 3 0.287901
Model-C 4 0.539970
2014-02-17 Model-A 8 0.142716
Model-B 5 0.252482

5 rows × 2 columns

Accessing index by name

In [24]:
df.ix['2014-02-16']
Out[24]:
value score
type
Model-A 1 0.202855
Model-B 3 0.287901
Model-C 4 0.539970

3 rows × 2 columns

In [25]:
df.swaplevel(0,1, axis=0).ix['Model-A']
Out[25]:
value score
date
2014-02-16 1 0.202855
2014-02-17 8 0.142716
2014-02-18 8 0.510448
2014-02-19 5 0.742021

4 rows × 2 columns

In [26]:
df.unstack()
Out[26]:
value score
type Model-A Model-B Model-C Model-A Model-B Model-C
date
2014-02-16 1 3 4 0.202855 0.287901 0.539970
2014-02-17 8 5 8 0.142716 0.252482 0.801581
2014-02-18 8 8 0 0.510448 0.752879 0.038923
2014-02-19 5 1 6 0.742021 0.561749 0.210681

4 rows × 6 columns