Exploratory Data Analysis on Iris Dataset in Python

Discover the essentials of Exploratory Data Analysis on the Iris dataset using Python, covering visualization, correlation, and outlier handling.

Exploratory Data Analysis (EDA) on the Iris Dataset in Python involves a thorough examination of data through visualization and statistical techniques. Using Python libraries like Pandas, Seaborn, and Matplotlib, EDA includes creating histograms, box plots, and scatter plots to understand distributions and relationships among variables like sepal length and petal width. It also entails examining correlations and detecting outliers, providing insights essential for further data modeling. This process is pivotal in revealing underlying patterns and trends within the Iris dataset, a cornerstone example in data science.

Before starting the analysis, let's first define Exploratory Data Analysis (EDA) and the Iris dataset.

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a critical step in understanding and interpreting data, specifically applied here to the Iris Dataset in Python. It involves techniques for visualizing, summarizing, and interpreting the underlying patterns in data. In the context of the Iris Dataset, EDA typically includes plotting histograms, scatter plots, and box plots to understand the distribution and relationships of sepal and petal dimensions across different iris species. Python's libraries, such as Pandas for data manipulation and Matplotlib for visualization, play a key role in this process. EDA's primary goal is to explore data characteristics, identify anomalies, and formulate hypotheses for further statistical analysis or machine learning modeling.

What Is the Iris Dataset?

The Iris Dataset is renowned in the field of machine learning and statistics and is frequently used for pattern recognition and exploratory data analysis. The Iris Dataset consists of 150 records of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width, measured in centimetres. This dataset encompasses three species of iris flowers: Setosa, Versicolor, and Virginica, with 50 samples for each species. The Iris Dataset is often utilized in Python for demonstrating data analysis and various machine learning model techniques, serving as a foundation for classification algorithms. Its simplicity and clarity make it an ideal dataset for beginners to practice exploratory data analysis, data visualization, and machine learning modelling.

The Iris dataset is often considered the "Hello World" of data science. It contains five columns: petal length, petal width, sepal length, sepal width, and species. Iris is a genus of flowering plants; researchers measured these features on different iris flowers and recorded them digitally.

Note: This dataset can be downloaded from here.

You can download the Iris.csv file from the above link. Now we will use the Pandas library to load this CSV file and convert it into a DataFrame. The read_csv() method is used to read CSV files.

Example.

import pandas as pd
 
# Reading the CSV file
df = pd.read_csv("https://datahub.io/machine-learning/iris/r/iris.csv")
 
# Printing top 5 rows
df.head()

Output.

[Image: the first five rows of the DataFrame]

Getting Information about the Dataset

Use the shape attribute to get the number of rows and columns in the dataset.

df.shape

Output.

(150, 6)

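This shows that the DataFrame contains 150 rows and 6 columns: an Id column, the four measurements, and the species label. The copy hosted on datahub.io that is read above uses the column names sepallength, sepalwidth, petallength, petalwidth, and class and has no Id column, so loading from that URL should give (150, 5) instead.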
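To see which columns stand behind a shape, it can help to list the column names and data types as well. The sketch below uses a tiny hand-made DataFrame (the values and the name toy are illustrative assumptions, not rows read from the real file) so that it runs without a download.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for the example
toy = pd.DataFrame({
    "Id": [1, 2],
    "sepallength": [5.1, 4.9],
    "class": ["Iris-setosa", "Iris-setosa"],
})

# (rows, columns)
print(toy.shape)

# Column names, useful for checking whether an Id column is present
print(toy.columns.tolist())

# Data type of each column
print(toy.dtypes)

# Names of the numeric columns only
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Running the same calls on the loaded DataFrame shows at a glance whether the file you read contains an Id column.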

Let’s get a quick statistical summary of the dataset using the describe() method. For each numeric column, describe() computes basic statistics such as the count, mean, standard deviation, minimum, maximum, and quartiles. Missing (NaN) values are skipped automatically. The result gives a good first picture of how the data is distributed.

Example.

df.describe()

Output.

[Image: summary statistics table produced by describe()]
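describe() summarizes the dataset as a whole. A natural next step, sketched here under the assumption that the species label is stored in a column named class, is to compute the same kind of statistics per species with groupby(). The sample values are hand-made for illustration and are not real Iris rows.

```python
import pandas as pd

# Tiny hand-made sample (illustrative values, not real Iris rows)
sample = pd.DataFrame({
    "petallength": [1.4, 1.5, 4.5, 4.7, 6.0, 5.8],
    "petalwidth": [0.2, 0.3, 1.5, 1.4, 2.5, 2.2],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

# Mean and standard deviation of each numeric column, per species
per_species = sample.groupby("class").agg(["mean", "std"])
print(per_species)

# Look up a single value: mean petal length of the setosa rows
setosa_mean = per_species.loc["Iris-setosa", ("petallength", "mean")]
print(round(setosa_mean, 2))
```

The result has one row per species and a two-level column index (feature, statistic), which makes differences between species easy to compare.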

How to Check the Missing Values?

Detecting and handling missing data is a crucial part of Exploratory Data Analysis on the Iris Dataset. Python's Pandas library provides tools to identify missing values. You can use DataFrame.isnull() combined with sum() to count missing values in each column. This process helps in deciding whether to impute missing data, remove rows, or make other adjustments. Ensuring data completeness is essential for accurate analysis and modeling. Here's a snippet demonstrating how to identify missing data in the Iris Dataset.

import pandas as pd

# Load the dataset
iris_df = pd.read_csv('https://datahub.io/machine-learning/iris/r/iris.csv')

# Check for missing values
missing_values = iris_df.isnull().sum()
print(missing_values)

The output for the above code is:

[Image: number of missing values in each column]

This code helps in identifying any gaps in the dataset, a critical step in data preprocessing.
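Once missing values are counted, the two usual options mentioned above are removing the affected rows or imputing them. The following is a minimal sketch of both, using a small made-up DataFrame with gaps inserted on purpose; imputing with the column median is one possible choice among several, not a recommendation specific to Iris.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberate gaps (illustrative only)
toy = pd.DataFrame({
    "sepallength": [5.1, np.nan, 6.3, 5.8],
    "sepalwidth": [3.5, 3.0, np.nan, 2.7],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-virginica", "Iris-virginica"],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill gaps in numeric columns with that column's median
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(len(dropped), "rows left after dropping")
print(imputed)
```

Dropping is simplest but discards data; imputation keeps every row at the cost of introducing estimated values.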

How to Check and Drop Duplicates?

In Exploratory Data Analysis on the Iris Dataset, identifying and removing duplicate entries is a vital step. Using the Pandas library, we can check for duplicates with DataFrame.duplicated() and drop them with DataFrame.drop_duplicates(). This keeps repeated records from skewing the results. Below is a code example for detecting and removing duplicates.

import pandas as pd

# Load the dataset
iris_df = pd.read_csv('https://datahub.io/machine-learning/iris/r/iris.csv')

# Identify duplicate rows
duplicate_rows = iris_df.duplicated().sum()
print("Number of duplicate rows:", duplicate_rows)

# Drop duplicates
iris_df = iris_df.drop_duplicates()

This approach helps maintain a clean and reliable dataset, crucial for effective analysis.
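Before dropping anything, it is often worth looking at the duplicated rows themselves. A small sketch, again on made-up rows: passing keep=False to duplicated() marks every member of a duplicated group rather than only the later copies.

```python
import pandas as pd

# Made-up rows; the first and third are identical on purpose
toy = pd.DataFrame({
    "sepallength": [4.9, 5.8, 4.9, 6.4],
    "sepalwidth": [3.1, 2.7, 3.1, 3.2],
    "class": ["Iris-setosa", "Iris-virginica",
              "Iris-setosa", "Iris-versicolor"],
})

# keep=False flags all copies, so the full duplicated group is visible
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# Drop the repeats (the first occurrence is kept) and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduplicated))
```

In a measurement dataset, two different flowers can legitimately share identical values, so whether to drop such rows is a judgment call rather than an automatic step.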

Data Visualization

1. Visualizing the Target Column

Visualizing the target column of the Iris dataset, which categorizes iris flowers into various species, is a crucial step in exploratory data analysis (EDA). This process involves using graphical representations to understand the distribution and relationship of the species within the dataset.

Python, with libraries like Matplotlib and Seaborn, offers robust tools for visualization. These libraries enable the creation of clear and informative plots, crucial for interpreting the Iris dataset effectively.

Here is a Python code example to visualize the target column.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
url = "https://datahub.io/machine-learning/iris/r/iris.csv"
iris_data = pd.read_csv(url)

# Create a count plot for the target column 'class'
sns.countplot(x='class', data=iris_data)
plt.title('Distribution of Iris Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

This code first imports the necessary libraries, then loads the Iris dataset from the provided URL. It uses Seaborn's countplot to create a bar chart, which shows the frequency of each species in the dataset. The plot is titled and labeled for clarity, aiding in the quick assessment of the species distribution. This visualization is an integral part of EDA, providing insights into the balance of classes within the dataset.
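The same class balance can be read as numbers instead of bars. The sketch below uses value_counts() on a short hand-made label column (deliberately unbalanced, unlike the real dataset) to show the counts, the shares, and a simple balance check; the variable names are illustrative.

```python
import pandas as pd

# Hand-made labels, deliberately unbalanced for the example
labels = pd.Series(
    ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 2,
    name="class",
)

# Absolute number of rows per species
counts = labels.value_counts()
print(counts)

# Share of each species (fractions that sum to 1)
shares = labels.value_counts(normalize=True)
print(shares)

# The classes are balanced when every species has the same count
is_balanced = counts.nunique() == 1
print("Balanced:", is_balanced)
```

On the loaded data, the equivalent call would be iris_data['class'].value_counts().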

2. Relation between Variables

Exploring the relationship between variables in the Iris dataset is a fundamental aspect of Exploratory Data Analysis (EDA) in Python. This involves examining how different features, such as sepal length, sepal width, petal length, and petal width, interact with each other and influence the classification of iris species.

Python's Seaborn library is particularly useful for visualizing these relationships through scatter plots and pair plots. These visual tools help in identifying patterns, correlations, and potential clusters within the dataset.

Here's a Python code snippet to visualize the relationships between variables.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
url = "https://datahub.io/machine-learning/iris/r/iris.csv"
iris_data = pd.read_csv(url)

# Create a pair plot
sns.pairplot(iris_data, hue='class')
plt.suptitle('Pair Plot of Iris Dataset Variables')
plt.show()

This code snippet employs Seaborn's pairplot function to create a matrix of scatter plots. Each plot in the matrix shows the relationship between two variables, with data points colored based on iris species. This visual representation allows for easy observation of how variables correlate with each other and with the target variable, thus providing valuable insights for data analysis and modeling.
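When one pair of variables deserves a closer look than a cell of the pair plot allows, a single scatter plot can be drawn directly. This is a rough sketch using plain Matplotlib and a few made-up points; the column names follow the ones used in the code above, and the values are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few made-up points per species (illustrative only)
toy = pd.DataFrame({
    "petallength": [1.4, 1.5, 4.5, 4.7, 6.0, 5.8],
    "petalwidth": [0.2, 0.3, 1.5, 1.4, 2.5, 2.2],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per species so each gets its own color and legend entry
for species, group in toy.groupby("class"):
    ax.scatter(group["petallength"], group["petalwidth"], label=species)

ax.set_xlabel("petallength")
ax.set_ylabel("petalwidth")
ax.set_title("Petal length vs. petal width")
ax.legend(title="class")
plt.show()
```

Each species forms its own group of points, which is the kind of clustering the pair plot reveals across all feature pairs at once.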

Histograms

Creating a histogram for the Iris dataset is a vital step in the exploratory data analysis process in Python. A histogram provides a visual representation of the distribution of a dataset's numerical variables, such as sepal length, sepal width, petal length, and petal width. This visualization helps in understanding the frequency distribution of these measurements across the Iris dataset.

Python offers efficient tools for this purpose, notably Matplotlib and Seaborn. These libraries allow for the easy generation of histograms, offering insights into the shape, spread, and central tendency of the data.

The following Python code creates a histogram for each of the four features.

# importing packages
import matplotlib.pyplot as plt

# df is the DataFrame loaded earlier with pd.read_csv(); the datahub.io
# copy names its columns sepallength, sepalwidth, petallength, petalwidth
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

axes[0, 0].set_title("Sepal Length")
axes[0, 0].hist(df['sepallength'], bins=7)

axes[0, 1].set_title("Sepal Width")
axes[0, 1].hist(df['sepalwidth'], bins=5)

axes[1, 0].set_title("Petal Length")
axes[1, 0].hist(df['petallength'], bins=6)

axes[1, 1].set_title("Petal Width")
axes[1, 1].hist(df['petalwidth'], bins=6)

plt.show()

Output.

[Image: 2×2 grid of histograms for sepal length, sepal width, petal length, and petal width]

From the above plot, we can see that:

  • Sepal length: the tallest bar has a frequency between 30 and 35 and covers values between 5.5 and 6.
  • Sepal width: the tallest bar has a frequency of around 70 and covers values between 3.0 and 3.5.
  • Petal length: the tallest bar has a frequency of around 50 and covers values between 1 and 2.
  • Petal width: the tallest bar has a frequency between 40 and 50 and covers values between 0.0 and 0.5.
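Readings like the ones above can also be obtained numerically instead of by eye. A minimal sketch with NumPy's histogram function, which returns the bar heights and the bin edges; the values and the chosen range are illustrative assumptions, not real Iris measurements.

```python
import numpy as np

# Illustrative values only (not real Iris measurements)
values = np.array([4.6, 4.9, 5.1, 5.2, 5.4, 5.6,
                   5.7, 5.8, 5.9, 6.3, 6.7, 6.9])

# Five equal-width bins between 4.5 and 7.0
counts, edges = np.histogram(values, bins=5, range=(4.5, 7.0))
print(counts)
print(edges)

# Index of the tallest bar and the interval it covers
top = int(np.argmax(counts))
top_range = (edges[top], edges[top + 1])
print("Tallest bin:", top_range, "with", counts[top], "values")
```

Matplotlib's hist() also returns the counts and bin edges as its first two return values, so the same lookup works on the plotted histograms.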

Handling Correlation

Handling correlation in the Iris dataset is a critical component of Exploratory Data Analysis (EDA) in Python. Correlation analysis helps in understanding the linear relationship between different numerical features, like sepal length, sepal width, petal length, and petal width. Identifying these relationships is essential for feature selection and predictive modeling.

In Python, the Pandas and Seaborn libraries are commonly used to calculate and visualize correlations. The Pandas corr() method quickly computes pairwise correlation of columns, while Seaborn's heatmap provides a visual representation of these correlations.

The following code computes and visualizes the correlations in Python.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
url = "https://datahub.io/machine-learning/iris/r/iris.csv"
iris_data = pd.read_csv(url)

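# Calculate correlations between the numeric columns only
# (corr() raises an error on string columns such as 'class' in pandas 2.0 and later)
correlation_matrix = iris_data.select_dtypes(include='number').corr()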

# Visualize the correlation matrix using a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Iris Dataset Features')
plt.show()

This code first computes the correlation matrix using the corr() method. Then, it uses Seaborn's heatmap to create a color-coded representation of the matrix, with annotations to show the actual correlation values. This visual aid provides a clear and immediate understanding of how the features in the Iris dataset are correlated, guiding further analysis and feature engineering steps.

Heatmaps

Creating heatmaps for the Iris dataset is a valuable technique in Exploratory Data Analysis (EDA) using Python. Heatmaps provide a visual representation of data through variations in coloring. They are particularly useful for displaying the correlations between features in the dataset, such as sepal length, sepal width, petal length, and petal width. In Python, the Seaborn library is commonly used for creating heatmaps due to its simplicity and effectiveness.

The following code creates a heatmap in Python.

import seaborn as sns
import matplotlib.pyplot as plt
 
 
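# Keep numeric columns only and drop the Id column if it is present,
# so the same call works whether or not the file has an Id column
numeric_df = df.select_dtypes(include='number').drop(columns=['Id'], errors='ignore')
sns.heatmap(numeric_df.corr(method='pearson'), annot=True)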
 
plt.show()

Output.

[Image: annotated heatmap of the Pearson correlation matrix]

From the above graph, we can see that:

  • Petal width and petal length are very highly correlated.
  • Petal length and sepal length have a good correlation (sepal width, by contrast, is negatively correlated with petal length).
  • Petal width and sepal length have a good correlation.
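With more than a handful of features, reading pairs off a heatmap becomes tedious. One possible approach, sketched here on made-up values, is to flatten the upper triangle of the correlation matrix and sort the pairs by absolute strength; the names toy, pairs, and ranked are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up measurements (illustrative only)
toy = pd.DataFrame({
    "sepalwidth": [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
    "petallength": [1.4, 1.5, 4.5, 4.7, 6.0, 5.8],
    "petalwidth": [0.2, 0.3, 1.5, 1.4, 2.5, 2.2],
})

corr = toy.corr(method="pearson")

# Boolean mask for the cells strictly above the diagonal,
# so each pair of features appears exactly once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order pairs by absolute correlation while keeping the sign
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top as well, which a quick glance at a color scale can miss.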

Handling Outliers

Handling outliers in the Iris dataset is an essential step in Exploratory Data Analysis (EDA) in Python. Outliers are data points that differ significantly from other observations and can affect the results of the analysis. In the context of the Iris dataset, outliers may be present in features like sepal length, sepal width, petal length, and petal width.

Python provides effective tools for identifying and managing outliers, notably through libraries like Pandas and Seaborn. Boxplots are a common method for visual outlier detection: they summarize numerical data through quartiles and, with Seaborn's default settings, draw as individual points any values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

The following code visualizes outliers in Python.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
url = "https://datahub.io/machine-learning/iris/r/iris.csv"
iris_data = pd.read_csv(url)

# Create a boxplot for each feature to identify outliers
features = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
for feature in features:
    sns.boxplot(x=iris_data[feature])
    plt.title(f'Boxplot of {feature}')
    plt.show()

In this code, the Iris dataset is loaded, and boxplots are created for each feature using Seaborn's boxplot function. These plots allow for the visual identification of outliers, as they appear as points outside the whiskers of the boxplot. Handling these outliers appropriately, either by removing or adjusting them, ensures a more accurate and reliable analysis of the Iris dataset.
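The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below applies the common 1.5 × IQR rule to a made-up series and shows one possible way of adjusting the flagged values, clipping them to the fences; both the data and the choice of clipping are assumptions for illustration.

```python
import pandas as pd

# Illustrative values (not real Iris measurements)
widths = pd.Series([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4],
                   name="sepalwidth")

# Quartiles and interquartile range
q1 = widths.quantile(0.25)
q3 = widths.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = widths[(widths < lower) | (widths > upper)]
print(outliers)

# One way to adjust them: clip to the fences instead of removing rows
clipped = widths.clip(lower=lower, upper=upper)
print(clipped.min(), clipped.max())
```

Removing the flagged rows is the alternative; which option fits depends on whether the extreme values are measurement errors or genuine observations.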

In conclusion, the Exploratory Data Analysis (EDA) of the Iris dataset in Python provides invaluable insights into the dataset's structure, distribution, and relationships. The EDA process involves various techniques, including visualizing the target column, understanding relationships between variables, creating histograms, handling correlation, and managing outliers.

These techniques collectively offer a comprehensive understanding of the Iris dataset. They highlight key features such as sepal length, sepal width, petal length, and petal width, and their interactions across different Iris species. Through EDA, we gain a deeper appreciation of the dataset's nuances, which is crucial for any subsequent predictive modeling or data analysis.

The use of Python and its libraries like Pandas, Seaborn, and Matplotlib, makes EDA an efficient and insightful process. This process is not only applicable to the Iris dataset but also serves as a blueprint for analyzing other datasets in various fields. EDA remains a cornerstone in data science, providing the groundwork for informed decision-making and advanced analytical studies.
