{"id":47338,"date":"2024-11-09T16:21:15","date_gmt":"2024-11-09T16:21:15","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/?p=47338"},"modified":"2024-11-10T09:41:57","modified_gmt":"2024-11-10T09:41:57","slug":"jupyter-notebook-lab-session-1-exploring-dataset-with-pandas-and-numpy","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/jupyter-notebook-lab-session-1-exploring-dataset-with-pandas-and-numpy\/","title":{"rendered":"Jupyter notebook &#8211; Lab Session &#8211; 1 &#8211; Exploring Dataset with Pandas and NumPy"},"content":{"rendered":"\n<p><strong>Importing Libraries<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>import pandas as pd<br>import numpy as np<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Import Pandas (as pd) and NumPy (as np), the two libraries used throughout this session.<\/p>\n\n\n\n<p><strong>Loading the Dataset<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df = pd.read_csv('\/path_to_your_dataset.csv')<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Load the dataset from a CSV file into a Pandas DataFrame.<\/p>\n\n\n\n<p><strong>Display First Few Rows<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.head()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Display the first five rows to understand the structure.<\/p>\n\n\n\n<p><strong>Display Last Few Rows<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.tail()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Display the last five rows of the dataset.<\/p>\n\n\n\n<p><strong>Dataset Information<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.info()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Get an overview of the dataset, including each column's data type and non-null count.<\/p>\n\n\n\n<p><strong>Descriptive Statistics<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.describe()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Get summary statistics (count, mean, standard deviation, min, quartiles, and max) for each numeric 
column.<\/p>\n\n\n\n<p><strong>Column Names<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.columns<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> List all column names in the dataset.<\/p>\n\n\n\n<p><strong>Shape of the Dataset<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.shape<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Get the number of rows and columns as a (rows, columns) tuple.<\/p>\n\n\n\n<p><strong>Check for Null Values<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.isnull().sum()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Count null values in each column.<\/p>\n\n\n\n<p><strong>Drop Rows with Null Values<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df_cleaned = df.dropna()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Remove every row that contains at least one null value and store the result in df_cleaned; df itself is left unchanged.<\/p>\n\n\n\n<p><strong>Fill Null Values<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.fillna(value='Unknown', inplace=True)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Replace null values in place with the placeholder string 'Unknown'.<\/p>\n\n\n\n<p><strong>Unique Values in a Column<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['column_name'].unique()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Display unique values in a specific column.<\/p>\n\n\n\n<p><strong>Value Counts<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['column_name'].value_counts()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Count the occurrences of each unique value in a column.<\/p>\n\n\n\n<p><strong>Filter Rows by Condition<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df_filtered = df[df['column_name'] > some_value]<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Keep only the rows where column_name is greater than some_value.<\/p>\n\n\n\n<p><strong>Selecting Multiple Columns<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df[['column1', 
'column2']]<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Select and display specific columns.<\/p>\n\n\n\n<p><strong>Add a New Column<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['new_column'] = df['column1'] + df['column2']<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Add a new column by combining values from other columns.<\/p>\n\n\n\n<p><strong>Rename Columns<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.rename(columns={'old_name': 'new_name'}, inplace=True)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Rename columns for better readability.<\/p>\n\n\n\n<p><strong>Sorting Values<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.sort_values(by='column_name', ascending=False)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Sort the dataset by a specific column.<\/p>\n\n\n\n<p><strong>Drop a Column<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.drop('column_name', axis=1, inplace=True)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Remove a specific column.<\/p>\n\n\n\n<p><strong>Group By and Aggregate<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.groupby('column_name').sum()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Group by a column and apply an aggregate function like sum.<\/p>\n\n\n\n<p><strong>Calculate Mean of a Column<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['column_name'].mean()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Calculate the mean of a specific column.<\/p>\n\n\n\n<p><strong>Calculate Median of a Column<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['column_name'].median()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Calculate the median of a specific column.<\/p>\n\n\n\n<p><strong>Standard Deviation of a Column<\/strong><\/p>\n\n\n\n<pre 
class=\"wp-block-preformatted\"><code>df['column_name'].std()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Calculate the standard deviation of a specific column.<\/p>\n\n\n\n<p><strong>Detecting Outliers<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df[(df['column_name'] > upper_limit) | (df['column_name'] &lt; lower_limit)]<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Select the rows whose value lies above upper_limit or below lower_limit, two thresholds you define beforehand.<\/p>\n\n\n\n<p><strong>Apply Custom Function<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['new_column'] = df['column_name'].apply(lambda x: x * 2)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Apply a custom function to each value in a column.<\/p>\n\n\n\n<p><strong>Pivot Table<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.pivot_table(values='value_column', index='index_column', columns='column_name')<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Create a pivot table that summarizes value_column (using the mean, the default aggregation) for each combination of index_column and column_name.<\/p>\n\n\n\n<p><strong>Correlation Matrix<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.corr(numeric_only=True)<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Calculate the correlation matrix for numeric columns. Passing numeric_only=True skips non-numeric columns, which pandas 2.0 and later no longer does by default.<\/p>\n\n\n\n<p><strong>Visualizing with Histograms<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df['column_name'].hist()<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Plot a histogram for a column to view the distribution.<\/p>\n\n\n\n<p><strong>Scatter Plot<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.plot.scatter(x='column_x', y='column_y')<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Create a scatter plot to see relationships between two columns.<\/p>\n\n\n\n<p><strong>Box Plot<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>df.boxplot(column='column_name')<br><\/code><\/pre>\n\n\n\n<p><em>Explanation:<\/em> Generate a box plot to identify the spread and outliers.<\/p>\n\n\n\n<h2 
class=\"wp-block-heading\">Live Example with the Attached Dataset<\/h2>\n\n\n\n<p>Download the datasets from <a href=\"https:\/\/github.com\/devops-school\/MLOps-Certified-Professional\/tree\/main\/datasets\" target=\"_blank\" rel=\"noopener\">the MLOps-Certified-Professional datasets folder on GitHub<\/a>.<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/devops-school\/88744d3438eb8d7b75d6bdad8b86c0b8.js\"><\/script>\n","protected":false},"excerpt":{"rendered":"<p>Importing Libraries import pandas as pd import numpy as np Explanation: Import the essential libraries. Loading the Dataset df = pd.read_csv(&#8216;\/path_to_your_dataset.csv&#8217;) Explanation: Load the dataset into a Pandas DataFrame. Display First&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_joinchat":[],"footnotes":""},"categories":[8225],"tags":[],"class_list":["post-47338","post","type-post","status-publish","format-standard","hentry","category-jupyter-notebook"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/47338","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=47338"}],"version-history":[{"count":4,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/47338\/revisions"}],"predecessor-version":[{"id":47343,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/47338\/revisions\/47343"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=47338"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\
/wp\/v2\/categories?post=47338"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=47338"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}