A great introduction to data analysis in Python with pandas

Pandas is fantastic for doing data analysis in Python. In this tutorial, Brandon Rhodes walks through a comprehensive set of computations you can apply to data sets using pandas.

For those new to data analysis with Python, following along with this tutorial will definitely increase your data-fu a couple of levels.

Associated documentation for this tutorial can be found on Brandon’s GitHub page, which has all the instructions to get the environment set up as well as the IPython (now Jupyter) notebooks you can follow along with.

Brandon does a great job of walking through many of the common data manipulations and how you can do them using pandas.

There are lots of nuggets of gold in here for wrangling data you have in a table format. What would be arduous work in Excel can be achieved with a couple of lines of Python.
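To give a flavour of the "couple of lines" claim, here is a minimal sketch of a pivot-table style summary. The `sales` table and its column names are invented purely for illustration and are not taken from Brandon's tutorial.

```python
import pandas as pd

# Hypothetical sales table, made up for this example.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "product": ["apple", "apple", "pear", "pear", "apple"],
    "units": [10, 4, 7, 3, 5],
})

# Total units per region, roughly what a spreadsheet pivot table would give.
units_per_region = sales.groupby("region")["units"].sum()

# Region-by-product grid of totals: group on two columns, then unstack
# the inner level ("product") into columns.
grid = sales.groupby(["region", "product"])["units"].sum().unstack()

print(units_per_region)
print(grid)
```

The grouping and the reshaping are one line each, which is about as close to "a couple of lines" as it gets.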

To get the most out of this, it is well worth following along and trying to complete the associated exercises to solidify your understanding of the range of capabilities pandas offers.

Below is a cheat sheet provided by Brandon of the key functions that will let you do practically anything you would want to with tabular data (assuming you know how to use them!):

  len(df)       series + value        df[df.c == value]
  df.head()     series + series2      df[(df.c >= value) & (df.d < value)]
  df.tail()     series.notnull()      df[(df.c < value) | (df.d != value)]
  df.COLUMN     series.isnull()       df.sort_values('column')
  df['COLUMN']  series.sort_values()  df.sort_values(['column1', 'column2'])

  s.str.len()        s.value_counts()
  s.str.contains()   s.sort_index()    df[['column1', 'column2']]
  s.str.startswith() s.plot(...)       df.plot(x='a', y='b', kind='bar')

  df.set_index('a').sort_index()        df.loc['value']
  df.set_index(['a', 'b']).sort_index() df.loc[('v','u')]
  df.groupby('column')                  .size() .mean() .min() .max()
  df.groupby(['column1', 'column2'])    .agg(['min', 'max'])

  df.unstack()      s.dt.year       df.merge(df2, how='outer', ...)
  df.stack()        s.dt.month      df.rename(columns={'a': 'y', 'b': 'z'})
  df.fillna(value)  s.dt.day        pd.concat([df1, df2])
  s.fillna(value)   s.dt.dayofweek
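As a rough illustration of how a few of these pieces fit together, here is a small sketch. The film table, the `cast` table and all their values are hypothetical, made up for this example rather than taken from the tutorial's data sets.

```python
import pandas as pd

# Hypothetical film table; titles, years and ratings are invented.
df = pd.DataFrame({
    "title": ["Alpha", "Beta Story", "Gamma", "Alpha Returns"],
    "year": [1999, 2005, 2005, 2012],
    "rating": [7.1, None, 6.4, 8.0],
})

# Boolean filtering: each condition needs its own parentheses around it.
recent_rated = df[(df.year >= 2005) & (df.rating.notnull())]

# String methods live under the .str accessor.
alpha_titles = df[df.title.str.startswith("Alpha")]

# Count films per year, ordered by year rather than by count.
per_year = df.year.value_counts().sort_index()

# Group by a column and aggregate.
rating_range = df.groupby("year")["rating"].agg(["min", "max"])

# Set an index, sort it, then look rows up by label with .loc
by_title = df.set_index("title").sort_index()
gamma_year = by_title.loc["Gamma", "year"]

# Date parts live under the .dt accessor.
dates = pd.Series(pd.to_datetime(["2015-03-14", "2016-12-25"]))
years = dates.dt.year

# Outer merge keeps titles that appear in either table.
cast = pd.DataFrame({
    "title": ["Alpha", "Gamma", "Delta"],
    "lead": ["A. Actor", "B. Actor", "C. Actor"],
})
merged = df.merge(cast, how="outer", on="title")

print(recent_rated)
print(per_year)
print(merged)
```

Working through Brandon's exercises is still the better way to learn these, but a toy table like this is handy for checking what each call returns.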
