Documentation

Complete guide to data analysis with Python, from basics to advanced techniques

Module 1: Introduction to Data Analysis
What is Data Analysis?

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It's a crucial skill in today's data-driven world, with applications in business, science, healthcare, and many other fields.

Python has become the leading language for data analysis due to its powerful libraries, ease of use, and strong community support. Python's data analysis ecosystem includes libraries for data manipulation (Pandas), numerical computing (NumPy), visualization (Matplotlib, Seaborn), and machine learning (Scikit-learn).

Why Python for Data Analysis?
  • Rich Ecosystem: Python offers a comprehensive set of libraries specifically designed for data analysis
  • Easy to Learn: Python's simple syntax makes it accessible to beginners
  • Versatile: Can handle everything from data cleaning to machine learning
  • Integration: Works well with other languages and tools
  • Community: Large, active community provides extensive documentation and support
The Data Analysis Workflow

A typical data analysis project follows these steps (a minimal end-to-end code sketch follows the list):

  1. Data Collection: Gathering data from various sources
  2. Data Cleaning: Handling missing values, correcting errors, and ensuring consistency
  3. Exploratory Analysis: Understanding the data through visualization and summary statistics
  4. Feature Engineering: Creating new variables from existing ones
  5. Modeling: Applying statistical or machine learning models
  6. Communication: Presenting findings through visualizations and reports
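
The sketch below walks through the same six steps with the libraries covered in this guide; the file name and column names (data.csv, price, rooms) are hypothetical placeholders.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data.csv')                                # 1. Data Collection
df = df.dropna().drop_duplicates()                          # 2. Data Cleaning
print(df.describe())                                        # 3. Exploratory Analysis
df['price_per_room'] = df['price'] / df['rooms']            # 4. Feature Engineering
model = LinearRegression().fit(df[['rooms']], df['price'])  # 5. Modeling
print(f"R² of rooms vs price: {model.score(df[['rooms']], df['price']):.2f}")  # 6. Communication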
Key Python Libraries for Data Analysis
NumPy

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides support for arrays, matrices, and mathematical functions.

Pandas

Pandas provides data structures and functions needed to manipulate structured data. It's built on top of NumPy and is particularly well-suited for tabular data with columns of different types.

Matplotlib & Seaborn

These libraries are used for data visualization. Matplotlib is highly customizable, while Seaborn provides a high-level interface for drawing attractive statistical graphics.

Scikit-learn

Scikit-learn is a machine learning library that provides simple and efficient tools for data mining and data analysis.

SciPy

SciPy provides algorithms for scientific and technical computing, including optimization, integration, interpolation, signal processing, linear algebra, and statistics.
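
As a quick taste of SciPy before the hands-on modules, the sketch below runs a one-sample t-test and a simple minimization on randomly generated data.

import numpy as np
from scipy import stats, optimize

# One-sample t-test: is the sample mean significantly different from 100?
sample = np.random.normal(102, 10, 50)
t_stat, p_value = stats.ttest_1samp(sample, 100)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Minimize a simple quadratic function
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(f"Minimum found at x = {result.x:.2f}")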


Module 2: Setting Up the Environment
Python Installation

Before starting with data analysis, you need to have Python installed on your system. The recommended approach is to use the Anaconda distribution, which comes with most data analysis libraries pre-installed.

Installing Anaconda

# Download Anaconda from https://www.anaconda.com/products/distribution
# Follow the installation instructions for your operating system

# Verify installation
python --version
conda --version
Creating a Virtual Environment

Virtual environments allow you to manage dependencies for different projects separately. This is a best practice to avoid conflicts between project requirements.


# Create a new environment
conda create -n data-analysis python=3.9

# Activate the environment
conda activate data-analysis

# Install additional packages
conda install pandas numpy matplotlib seaborn scikit-learn jupyter

# Save the environment
conda env export > environment.yml
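
To recreate the saved environment on another machine, build it from the exported file:

# Recreate the environment from environment.yml
conda env create -f environment.yml
conda activate data-analysis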
Using pip

If you prefer to use pip instead of conda, you can install packages individually:


# Install essential data analysis libraries
pip install pandas numpy matplotlib seaborn scipy scikit-learn jupyter

# For advanced data analysis
pip install plotly dash bokeh statsmodels

# For working with specific data formats
pip install openpyxl xlrd h5py tables sqlalchemy

# For web scraping
pip install beautifulsoup4 requests scrapy

# For big data processing
pip install dask pyspark
Jupyter Notebooks

Jupyter Notebooks provide an interactive computing environment that allows you to combine code, text, and visualizations in a single document. They are ideal for data analysis and exploration.


# Start Jupyter Notebook
jupyter notebook

# Or start Jupyter Lab (more modern interface)
jupyter lab
IDEs for Data Analysis

While Jupyter Notebooks are great for exploration, you might prefer an IDE for larger projects:

  • PyCharm Professional: Excellent for data science projects with built-in support for Jupyter
  • VS Code: Free, lightweight option with excellent Python extensions
  • Spyder: Scientific Python IDE with built-in data exploration tools
  • JupyterLab: Web-based interface for Jupyter
Environment Verification

Verify your Python environment setup:
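
A minimal check that the core libraries are installed and report their versions:

import sys
import numpy, pandas, matplotlib, seaborn, sklearn

print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {numpy.__version__}")
print(f"Pandas: {pandas.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Seaborn: {seaborn.__version__}")
print(f"scikit-learn: {sklearn.__version__}")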

Module 3: NumPy Fundamentals
Introduction to NumPy

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides support for arrays, matrices, and mathematical functions. NumPy arrays are more memory-efficient and faster than Python lists for numerical operations.
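
The speed and memory difference is easy to check yourself; the rough comparison below illustrates it (exact numbers depend on your machine).

import sys
import timeit
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n)

# Memory: a list stores references to Python objects, an array stores raw numbers
print(f"List (approx.): {sys.getsizeof(py_list) + n * sys.getsizeof(0)} bytes")
print(f"NumPy array:    {np_array.nbytes} bytes")

# Speed: a Python-level loop vs. a vectorized operation
print(timeit.timeit(lambda: [x * 2 for x in py_list], number=10))
print(timeit.timeit(lambda: np_array * 2, number=10))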

Creating NumPy Arrays

import numpy as np

# Creating arrays from Python lists
arr1 = np.array([1, 2, 3, 4, 5])  # 1D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array

# Creating arrays with built-in functions
arr3 = np.zeros((3, 3))  # Array of zeros
arr4 = np.ones((2, 4))  # Array of ones
arr5 = np.full((2, 3), 7)  # Array filled with a specific value
arr6 = np.random.rand(3, 3)  # Random array with values between 0 and 1
arr7 = np.random.randn(3, 3)  # Random array with normal distribution
arr8 = np.arange(0, 10, 2)  # Array with step
arr9 = np.linspace(0, 10, 5)  # Linearly spaced values
arr10 = np.eye(3)  # Identity matrix
Array Properties and Attributes

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Array properties
print(arr.shape)  # (2, 3) - dimensions of the array
print(arr.ndim)   # 2 - number of dimensions
print(arr.dtype)  # dtype('int64') - data type of elements
print(arr.size)   # 6 - total number of elements
print(arr.itemsize)  # 8 - size in bytes of each element
print(arr.nbytes)  # 48 - total bytes consumed by the array

# Changing data type
arr_float = arr.astype(np.float32)
print(arr_float.dtype)  # dtype('float32')
Array Indexing and Slicing

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Indexing
print(arr[0, 1])  # 2 - element at row 0, column 1
print(arr[1])     # [4 5 6] - entire second row

# Slicing
print(arr[:, 1])    # [2 5 8] - second column
print(arr[1, :])    # [4 5 6] - second row
print(arr[:2, :2])  # [[1 2] [4 5]] - top-left 2x2 subarray

# Advanced slicing
print(arr[::2, ::2])  # [[1 3] [7 9]] - every other element starting from 0
print(arr[1, ::-1])  # [6 5 4] - second row reversed
Array Operations

arr = np.array([1, 2, 3, 4, 5])

# Arithmetic operations
print(arr + 10)  # [11 12 13 14 15] - broadcasting
print(arr * 2)   # [ 2 4 6 8 10]
print(arr - 3)   # [-2 -1 0 1 2]
print(arr / 2)   # [0.5 1. 1.5 2. 2.5]

# Comparison operations
print(arr > 3)    # [False False False  True  True]
print(arr == 3)   # [False False True False False]

# Mathematical functions
print(np.sqrt(arr))    # [1.         1.41421356 1.73205081 2.         2.23606798]
print(np.exp(arr))     # [  2.71828183  7.3890561  20.08553692 54.59815003 148.4131591]
print(np.log(arr))     # [0.         0.69314718 1.09861229 1.38629436 1.60943791]

# Statistical functions
print(np.mean(arr))   # 3.0
print(np.median(arr)) # 3.0
print(np.std(arr))    # 1.4142135623730951
print(np.var(arr))    # 2.0
print(np.min(arr))    # 1
print(np.max(arr))    # 5
Linear Algebra with NumPy

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
print(np.dot(A, B))  # [[19 22] [43 50]]

# Transpose
print(A.T)  # [[1 3] [2 4]]

# Inverse
print(np.linalg.inv(A))  # [[-2.   1. ] [ 1.5 -0.5]]

# Determinant
print(np.linalg.det(A))  # -2.0

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)

# Solving linear systems: A @ x = b
b = np.array([3, 7])
solution = np.linalg.solve(A, b)
print("Solution:", solution)  # [1. 1.]
Reshaping and Stacking Arrays

# Reshaping
arr = np.arange(12)
reshaped = arr.reshape(3, 4)
print(reshaped)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Flattening
flattened = reshaped.flatten()
print(flattened)  # [ 0 1 2 3 4 5 6 7 8 9 10 11]

# Stacking arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
stacked_v = np.vstack((arr1, arr2))  # Vertical stacking
stacked_h = np.hstack((arr1, arr2))  # Horizontal stacking
print("Vertical stack:", stacked_v)  # [[1 2 3] [4 5 6]]
print("Horizontal stack:", stacked_h)  # [1 2 3 4 5 6]

# Concatenating along different axes
arr3 = np.array([7, 8, 9])
concat_axis0 = np.concatenate((arr1.reshape(1, 3), arr3.reshape(1, 3)))
concat_axis1 = np.concatenate((arr1.reshape(3, 1), arr3.reshape(3, 1)), axis=1)
print("Concat along axis 0:", concat_axis0)  # [[1 2 3] [7 8 9]]
print("Concat along axis 1:", concat_axis1)  # [[1 7] [2 8] [3 9]]

Module 4: Pandas DataFrames
Introduction to Pandas

Pandas is a powerful library for data manipulation and analysis. It provides two main data structures: Series (1D labeled array) and DataFrame (2D labeled data structure with columns of potentially different types).
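
Before moving on to DataFrames, here is a brief look at a Series, the 1D labeled structure mentioned above (the names and salaries are made up for illustration).

import pandas as pd

# A Series is a 1D array with an index (labels)
s = pd.Series([70000, 80000, 90000], index=['Alice', 'Bob', 'Charlie'], name='Salary')
print(s)
print(s['Bob'])   # Access by label
print(s.mean())   # Built-in aggregations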

Creating DataFrames

import pandas as pd
import numpy as np

# From dictionary of lists
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 80000, 90000, 75000]
})

# From list of lists
df2 = pd.DataFrame([
    ['Alice', 25, 'New York', 70000],
    ['Bob', 30, 'Los Angeles', 80000],
    ['Charlie', 35, 'Chicago', 90000],
    ['Diana', 28, 'Houston', 75000]
], columns=['Name', 'Age', 'City', 'Salary'])

# From NumPy array
data = np.random.randn(5, 4)
df3 = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# From CSV file
# df4 = pd.read_csv('data.csv')

# From Excel file
# df5 = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From SQL database
# import sqlite3
# conn = sqlite3.connect('database.db')
# df6 = pd.read_sql_query('SELECT * FROM table_name', conn)
Exploring DataFrames

# Use the first DataFrame (df1) from above
df = df1

# Display first/last rows
print(df.head())  # First 5 rows
print(df.tail())  # Last 5 rows

# DataFrame information
df.info()  # Summary including data types and non-null values (prints directly)
print(df.shape)  # Dimensions (rows, columns)
print(df.columns)  # Column names
print(df.index)   # Index information
print(df.dtypes)  # Data types of columns

# Statistical summary
print(df.describe())  # Statistical summary for numeric columns
print(df.describe(include='all'))  # Include all columns

# Accessing columns
print(df['Name'])  # Select column (Series)
print(df[['Name', 'Age']])  # Select multiple columns (DataFrame)

# Accessing rows
print(df.loc[0])  # Select row by label
print(df.iloc[0])  # Select row by position
print(df.loc[0:2, ['Name', 'Age']])  # Select rows and columns by label
print(df.iloc[0:2, 0:2])  # Select rows and columns by position
Data Selection and Filtering

# Conditional filtering
print(df[df['Age'] > 30])  # Filter by condition
print(df[(df['Age'] > 25) & (df['Salary'] < 80000)])  # Multiple conditions
print(df[df['City'].isin(['New York', 'Chicago'])])  # Filter by list

# Using query method
print(df.query('Age > 30 and Salary < 80000'))

# Using isin method
print(df[df['City'].isin(['New York', 'Chicago'])])

# Using str methods for string columns
print(df[df['Name'].str.startswith('A')])  # Names starting with 'A'
print(df[df['Name'].str.contains('a')])  # Names containing 'a'

# Using between method
print(df[df['Age'].between(25, 30)])  # Ages between 25 and 30
Grouping and Aggregation

# Group by a single column
grouped = df.groupby('City')
print(grouped.mean(numeric_only=True))  # Mean of numeric columns by group
print(grouped['Salary'].agg(['mean', 'max', 'min', 'count']))  # Multiple aggregations

# Group by multiple keys (a column and a boolean condition)
multi_grouped = df.groupby(['City', df['Age'] > 30])
print(multi_grouped.mean(numeric_only=True))

# Transform operations
df['Salary_Rank'] = df.groupby('City')['Salary'].rank(ascending=False)
df['Salary_Pct'] = df.groupby('City')['Salary'].transform(lambda x: x / x.max())

# Filter groups
high_salary_cities = df.groupby('City').filter(lambda x: x['Salary'].mean() > 75000)
print(high_salary_cities)
Handling Missing Data

# Create DataFrame with missing values
df_with_nan = df.copy()
df_with_nan.loc[2, 'Salary'] = np.nan
df_with_nan.loc[3, 'City'] = np.nan

# Check for missing values
print(df_with_nan.isnull())  # Boolean mask of missing values
print(df_with_nan.isnull().sum())  # Count missing values per column
print(df_with_nan.isnull().sum().sum())  # Total missing values

# Drop missing values
df_dropped_rows = df_with_nan.dropna()  # Drop rows with any missing values
df_dropped_cols = df_with_nan.dropna(axis=1)  # Drop columns with any missing values
df_dropped_subset = df_with_nan.dropna(subset=['Age', 'Salary'])  # Drop if specific columns are missing

# Fill missing values
df_filled_zero = df_with_nan.fillna(0)  # Fill with 0
df_filled_mean = df_with_nan.fillna(df_with_nan.mean(numeric_only=True))  # Fill numeric columns with their mean
df_filled_median = df_with_nan.fillna(df_with_nan.median(numeric_only=True))  # Fill with median
df_filled_mode = df_with_nan.fillna(df_with_nan.mode().iloc[0])  # Fill with mode

# Forward fill and backward fill
df_ffill = df_with_nan.ffill()  # Forward fill
df_bfill = df_with_nan.bfill()  # Backward fill

# Interpolation (numeric columns only)
df_interpolated = df_with_nan[['Age', 'Salary']].interpolate()  # Linear interpolation

# Custom fill strategies
df_with_nan['Age'] = df_with_nan['Age'].fillna(df_with_nan['Age'].median())  # Fill Age with median
df_with_nan['City'] = df_with_nan['City'].fillna('Unknown')  # Fill City with 'Unknown'
df_with_nan['Salary'] = df_with_nan['Salary'].fillna(
    df_with_nan.groupby('City')['Salary'].transform('mean'))  # Fill with group mean
Merging and Joining DataFrames

# Create additional DataFrame
df2 = pd.DataFrame({
    'Name': ['Eve', 'Frank', 'Grace'],
    'Age': [22, 40, 33],
    'City': ['Boston', 'Seattle', 'Miami'],
    'Salary': [65000, 95000, 85000]
})

# Concatenate DataFrames
df_concat = pd.concat([df, df2])
print(df_concat)

# Merge DataFrames
df3 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['HR', 'IT', 'Finance', 'IT']
})

# Inner join (default)
merged_inner = pd.merge(df, df3, on='Name')
print(merged_inner)

# Left join
merged_left = pd.merge(df, df3, on='Name', how='left')
print(merged_left)

# Right join
merged_right = pd.merge(df, df3, on='Name', how='right')
print(merged_right)

# Outer join
merged_outer = pd.merge(df, df3, on='Name', how='outer')
print(merged_outer)

# Join on multiple columns
df4 = pd.DataFrame({
    'First_Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Last_Name': ['Smith', 'Johnson', 'Brown', 'Jones'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'diana@example.com']
})

merged_multi = pd.merge(df, df4, left_on=['Name'], right_on=['First_Name'])
print(merged_multi)
Reading and Writing Data

# Reading CSV files
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep=';', encoding='utf-8')  # Custom separator and encoding
df = pd.read_csv('data.csv', nrows=100)  # Read first 100 rows
df = pd.read_csv('data.csv', usecols=['Name', 'Age'])  # Read specific columns

# Writing CSV files
df.to_csv('output.csv', index=False)  # Don't write index
df.to_csv('output.csv', sep=';', encoding='utf-8')  # Custom separator and encoding

# Reading Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name=['Sheet1', 'Sheet2'])  # Multiple sheets

# Writing Excel files
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

# Reading JSON files
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')  # Different JSON formats

# Writing JSON files
df.to_json('output.json', orient='records')
df.to_json('output.json', orient='records', lines=True)  # JSON lines format

# Reading from databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
conn.close()

# Writing to databases
conn = sqlite3.connect('database.db')
df.to_sql('table_name', conn, if_exists='replace', index=False)
conn.close()

# Reading from web
url = 'https://example.com/data.csv'
df = pd.read_csv(url)

# Reading HTML tables
tables = pd.read_html('https://example.com/tables.html')
df = tables[0]  # First table

Module 5: Data Cleaning and Preprocessing
Introduction to Data Cleaning

Data cleaning is a crucial step in data analysis. It involves handling missing values, removing duplicates, correcting data types, and dealing with outliers. Clean data ensures accurate analysis and reliable results.

Handling Missing Values

import pandas as pd
import numpy as np

# Create sample data with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, 22],
    'Salary': [70000, 80000, np.nan, 75000, 65000],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Houston']
})

# Check for missing values
print(df.isnull())  # Boolean mask of missing values
print(df.isnull().sum())  # Count missing values per column
print(df.isnull().sum().sum())  # Total missing values

# Drop missing values
df_dropped_rows = df.dropna()  # Drop rows with any missing values
df_dropped_cols = df.dropna(axis=1)  # Drop columns with any missing values
df_dropped_subset = df.dropna(subset=['Age', 'Salary'])  # Drop if specific columns are missing

# Fill missing values
df_filled_zero = df.fillna(0)  # Fill with 0
df_filled_mean = df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their mean
df_filled_median = df.fillna(df.median(numeric_only=True))  # Fill with median
df_filled_mode = df.fillna(df.mode().iloc[0])  # Fill with mode

# Forward fill and backward fill
df_ffill = df.ffill()  # Forward fill
df_bfill = df.bfill()  # Backward fill

# Interpolation (numeric columns only)
df_interpolated = df[['Age', 'Salary']].interpolate()  # Linear interpolation

# Custom fill strategies
df['Age'] = df['Age'].fillna(df['Age'].median())  # Fill Age with median
df['City'] = df['City'].fillna('Unknown')  # Fill City with 'Unknown'
df['Salary'] = df['Salary'].fillna(df.groupby('City')['Salary'].transform('mean'))  # Fill with group mean

# Advanced: Using sklearn's SimpleImputer
from sklearn.impute import SimpleImputer

imputer_mean = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer_mean.fit_transform(df[['Age', 'Salary']])

imputer_mode = SimpleImputer(strategy='most_frequent')
df[['City']] = imputer_mode.fit_transform(df[['City']])
Handling Duplicates

# Create data with duplicates
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
})

# Check for duplicates
print(df.duplicated())  # Boolean mask of duplicates
print(df.duplicated().sum())  # Count duplicates

# Drop duplicates
df_no_duplicates = df.drop_duplicates()  # Drop all duplicate rows
df_subset_duplicates = df.drop_duplicates(subset=['Name'])  # Drop based on specific columns
df_keep_last = df.drop_duplicates(keep='last')  # Keep last occurrence
df_keep_false = df.drop_duplicates(keep=False)  # Drop all duplicates

# Find and analyze duplicates
duplicate_rows = df[df.duplicated(keep=False)]
print(duplicate_rows)
Data Type Conversion

# Create data with type issues
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '35'],
    'Salary': ['70000', '80000', '90000'],
    'Join_Date': ['2020-01-01', '2019-05-15', '2018-11-30'],
    'Is_Active': ['True', 'False', 'True']
})

# Check data types
print(df.dtypes)

# Convert data types
df['Age'] = df['Age'].astype(int)  # Convert to integer
df['Salary'] = pd.to_numeric(df['Salary'])  # Convert to numeric
df['Join_Date'] = pd.to_datetime(df['Join_Date'])  # Convert to datetime
df['Is_Active'] = df['Is_Active'].map({'True': True, 'False': False})  # Convert to boolean (astype(bool) would mark every non-empty string as True)

# Convert with error handling
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # Invalid values become NaN
df['Age'] = df['Age'].fillna(df['Age'].median()).astype(int)  # Fill NaN and convert

# Extract datetime components
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
df['Day'] = df['Join_Date'].dt.day
df['DayOfWeek'] = df['Join_Date'].dt.dayofweek
df['Quarter'] = df['Join_Date'].dt.quarter

# Categorical data
df['Department'] = ['HR', 'IT', 'Finance']
df['Department'] = df['Department'].astype('category')  # Convert to category
df['Department_Cat'] = df['Department'].cat.codes  # Get categorical codes
Handling Outliers

import numpy as np
import pandas as pd
from scipy import stats

# Create data with outliers
np.random.seed(42)
data = np.random.normal(100, 15, 1000)
data = np.append(data, [200, 250, 300])  # Add outliers
df = pd.DataFrame({'Value': data})

# Detect outliers using Z-score
df['Z_Score'] = np.abs(stats.zscore(df['Value']))
outliers_z = df[df['Z_Score'] > 3]
print(f"Outliers using Z-score: {len(outliers_z)}")

# Detect outliers using IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['Value'] < lower_bound) | (df['Value'] > upper_bound)]
print(f"Outliers using IQR: {len(outliers_iqr)}")

# Remove outliers
df_no_outliers = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]

# Cap outliers (winsorization)
df['Value_Capped'] = df['Value'].clip(lower_bound, upper_bound)

# Transform outliers
df['Value_Log'] = np.log(df['Value'])
df['Value_Sqrt'] = np.sqrt(df['Value'])

# Visualize outliers
import matplotlib.pyplot as plt
plt.boxplot(df['Value'])
plt.title('Boxplot to Identify Outliers')
plt.show()
Text Data Processing

# Create sample text data
df = pd.DataFrame({
    'Text': ['Hello World', 'Python is great', 'Data Analysis', 'Machine Learning', 'Artificial Intelligence'],
    'Category': ['Greeting', 'Statement', 'Topic', 'Topic', 'Topic']
})

# String operations
df['Text_Length'] = df['Text'].str.len()  # Length of strings
df['Word_Count'] = df['Text'].str.split().str.len()  # Number of words
df['Contains_Python'] = df['Text'].str.contains('Python')  # Boolean mask

# Case operations
df['Text_Upper'] = df['Text'].str.upper()
df['Text_Lower'] = df['Text'].str.lower()
df['Text_Title'] = df['Text'].str.title()

# Extract substrings
df['First_Word'] = df['Text'].str.split().str[0]
df['Last_Word'] = df['Text'].str.split().str[-1]

# Replace substrings
df['Text_Cleaned'] = df['Text'].str.replace('Machine Learning', 'ML')

# Regular expressions
df['Has_Numbers'] = df['Text'].str.contains(r'\d+', regex=True)
df['Emails'] = df['Text'].str.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

Module 6: Exploratory Data Analysis
Introduction to EDA

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its main characteristics, patterns, and relationships. EDA helps generate hypotheses, identify data quality issues, and guide further analysis.

Descriptive Statistics

import pandas as pd
import numpy as np

# Load sample data
df = pd.DataFrame({
    'Age': [25, 30, 35, 28, 22, 45, 38, 32, 29, 26],
    'Salary': [70000, 80000, 90000, 75000, 65000, 120000, 95000, 85000, 78000, 72000],
    'Experience': [2, 5, 10, 4, 1, 20, 12, 7, 5, 3],
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR', 'Finance', 'IT']
})

# Basic statistics
print(df.describe())  # Summary statistics for numeric columns
print(df.describe(include='all'))  # Include categorical columns

# Specific statistics
print(f"Mean Age: {df['Age'].mean():.2f}")
print(f"Median Salary: {df['Salary'].median()}")
print(f"Standard Deviation of Experience: {df['Experience'].std():.2f}")
print(f"Minimum Age: {df['Age'].min()}")
print(f"Maximum Salary: {df['Salary'].max()}")

# Percentiles
print(f"25th percentile of Age: {df['Age'].quantile(0.25)}")
print(f"75th percentile of Salary: {df['Salary'].quantile(0.75)}")

# Correlation (numeric columns only)
print(df.corr(numeric_only=True))  # Correlation matrix
print(df.corrwith(df['Salary'], numeric_only=True))  # Correlation with a specific column

# Covariance
print(df.cov(numeric_only=True))  # Covariance matrix

# Value counts for categorical data
print(df['Department'].value_counts())
print(df['Department'].value_counts(normalize=True))  # Proportions

# Cross-tabulation
print(pd.crosstab(df['Department'], df['Age'] > 30))
Data Visualization with Matplotlib

import matplotlib.pyplot as plt
import numpy as np

# Set style
plt.style.use('seaborn-v0_8')  # On older Matplotlib versions use 'seaborn' or another built-in style

# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True)
plt.show()

# Scatter plot
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

# Histogram
data = np.random.normal(100, 15, 1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

# Bar plot
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]
plt.figure(figsize=(10, 6))
plt.bar(categories, values, color=['red', 'green', 'blue', 'orange', 'purple'])
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Plot')
plt.show()

# Box plot
data1 = np.random.normal(100, 10, 100)
data2 = np.random.normal(110, 15, 100)
data3 = np.random.normal(90, 12, 100)
plt.figure(figsize=(10, 6))
plt.boxplot([data1, data2, data3], labels=['Group 1', 'Group 2', 'Group 3'])
plt.ylabel('Value')
plt.title('Box Plot')
plt.show()

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Line Plot')
axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Scatter Plot')
axes[1, 0].hist(data, bins=20)
axes[1, 0].set_title('Histogram')
axes[1, 1].bar(categories, values)
axes[1, 1].set_title('Bar Plot')
plt.tight_layout()
plt.show()
Data Visualization with Seaborn

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load sample dataset
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

# Set style
sns.set_style("whitegrid")
sns.set_palette("husl")

# Distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(tips['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.show()

# Box plot with categories
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Total Bill by Day')
plt.show()

# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', data=tips)
plt.title('Total Bill Distribution by Day')
plt.show()

# Scatter plot with regression
plt.figure(figsize=(10, 6))
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title('Tip vs Total Bill')
plt.show()

# Pair plot
sns.pairplot(iris, hue='species')
plt.suptitle('Pair Plot of Iris Dataset', y=1.02)
plt.show()

# Heatmap (correlation matrix)
plt.figure(figsize=(10, 8))
correlation_matrix = iris.drop(columns='species').corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# Count plot
plt.figure(figsize=(10, 6))
sns.countplot(x='day', data=tips)
plt.title('Count of Observations by Day')
plt.show()

# Joint plot
sns.jointplot(x='total_bill', y='tip', data=tips, kind='scatter')
plt.suptitle('Joint Plot of Total Bill and Tip', y=1.02)
plt.show()

# Facet grid
g = sns.FacetGrid(tips, col='time', row='sex')
g.map(sns.scatterplot, 'total_bill', 'tip')
plt.suptitle('Tips by Time and Sex', y=1.02)
plt.show()

Module 7: Data Visualization
Advanced Visualization Techniques

Data visualization is a crucial part of data analysis, helping to communicate insights effectively. Python offers a rich ecosystem of visualization libraries, from basic plots to complex interactive dashboards.

Statistical Visualizations

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load sample data
tips = sns.load_dataset('tips')

# Distribution plots
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(tips['total_bill'], kde=True, bins=20)
plt.title('Distribution of Total Bill')
plt.subplot(1, 2, 2)
sns.kdeplot(tips['total_bill'], fill=True)
plt.title('KDE Plot of Total Bill')
plt.tight_layout()
plt.show()

# Comparative distributions
plt.figure(figsize=(12, 6))
sns.kdeplot(data=tips, x='total_bill', hue='day', fill=True)
plt.title('Distribution of Total Bill by Day')
plt.show()

# Violin plots for comparing distributions
plt.figure(figsize=(12, 6))
sns.violinplot(x='day', y='total_bill', data=tips, inner='quartile')
plt.title('Violin Plot of Total Bill by Day')
plt.show()

# Bivariate KDE plot
plt.figure(figsize=(12, 6))
sns.kdeplot(data=tips, x='total_bill', y='tip', hue='day', fill=True)
plt.title('Bivariate KDE of Tip vs Total Bill by Day')
plt.show()
Advanced Plot Types

# Joint plot with regression
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')  # kind='reg' does not support hue
plt.suptitle('Joint Plot with Regression Line', y=1.02)
plt.show()

# Pair grid with regression
sns.pairplot(tips, kind='reg', hue='day')
plt.suptitle('Pair Plot Grid with Regression', y=1.02)
plt.show()

# Pair plot with kde
sns.pairplot(tips, kind='kde', hue='day')
plt.suptitle('Pair Plot Grid with KDE', y=1.02)
plt.show()

# Joint plot with hexagonal bins (pairplot does not support 'hex')
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')
plt.suptitle('Joint Plot with Hexbin', y=1.02)
plt.show()

# Categorical plots (catplot creates its own figure)
sns.catplot(x='day', y='total_bill', data=tips, kind='box', height=6, aspect=2)
plt.title('Box Plot of Total Bill by Day')
plt.show()

# Swarm plot
plt.figure(figsize=(12, 6))
sns.swarmplot(x='day', y='total_bill', data=tips)
plt.title('Swarm Plot of Total Bill by Day')
plt.show()

# Point plot
plt.figure(figsize=(12, 6))
sns.pointplot(x='day', y='total_bill', data=tips, hue='sex', join=False, dodge=True)
plt.title('Point Plot of Total Bill by Day and Sex')
plt.show()
Interactive Visualizations

import plotly.graph_objects as go
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
dates = pd.date_range('2022-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365) + 100)

# Interactive line chart
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=dates,
    y=values,
    mode='lines+markers',
    name='Value',
    line=dict(color='royalblue', width=2),
    marker=dict(color='royalblue', size=6)
))

fig.update_layout(
    title='Interactive Time Series',
    xaxis_title='Date',
    yaxis_title='Value',
    hovermode='x unified'
)

fig.show()

# Interactive scatter plot
fig = go.Figure()

# Add scatter plot
fig.add_trace(go.Scatter(
    x=np.random.randn(100),
    y=np.random.randn(100),
    mode='markers',
    marker=dict(
        size=10,
        color=np.random.choice(['red', 'blue', 'green', 'purple', 'orange'], size=100),
        opacity=0.7,
        line=dict(width=0)
    ),
    name='Random Points'
))

# Add buttons to switch the trace mode
fig.update_layout(
    title='Interactive Scatter Plot',
    xaxis_title='X Value',
    yaxis_title='Y Value',
    updatemenus=[dict(
        type='buttons',
        buttons=[
            dict(label='Markers', method='restyle', args=[{'mode': 'markers'}]),
            dict(label='Lines + Markers', method='restyle', args=[{'mode': 'lines+markers'}])
        ]
    )]
)

fig.show()
Geospatial Data Visualization

import geopandas as gpd
import matplotlib.pyplot as plt

# Load world map data (bundled with geopandas versions before 1.0)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Create a choropleth map
fig, ax = plt.subplots(figsize=(15, 10))
world.plot(ax=ax, color='lightgray', edgecolor='white')

# Add color-coded data: GDP per capita (gdp_md_est is GDP in millions of USD)
world['gdp_per_capita'] = world['gdp_md_est'] * 1e6 / world['pop_est']
world.plot(column='gdp_per_capita', cmap='viridis', legend=True, ax=ax,
           legend_kwds={'label': 'GDP per Capita ($)', 'orientation': 'horizontal'})

# Add annotations for the five largest economies
for i, row in world.nlargest(5, 'gdp_md_est').iterrows():
    point = row['geometry'].representative_point()
    ax.annotate(row['name'], xy=(point.x, point.y), ha='center', fontsize=8)

plt.title('World GDP per Capita')
plt.show()
Network Graphs

import networkx as nx
import matplotlib.pyplot as plt

# Create a random graph
G = nx.erdos_renyi_graph(10, 0.3, seed=42)

# Calculate centrality measures
centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Create a visualization, coloring nodes by degree centrality
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, k=0.15, iterations=50, seed=42)
nx.draw_networkx(G, pos, with_labels=True, node_size=300,
                 node_color=list(centrality.values()), cmap=plt.cm.viridis)
plt.title('Network Graph with Node Centrality')
plt.axis('off')
plt.show()

Module 8: Statistical Analysis
Statistical Concepts

Statistical analysis involves collecting, analyzing, interpreting, and presenting data to discover patterns and trends. It helps us make informed decisions based on data rather than intuition.

Descriptive Statistics

import numpy as np
import pandas as pd
from scipy import stats

# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)

# Calculate descriptive statistics
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True)  # keepdims=True keeps the array result used below (SciPy >= 1.9)
variance = np.var(data)
std_dev = np.std(data)
min_val = np.min(data)
max_val = np.max(data)
percentiles = np.percentile(data, [25, 50, 75])

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode.mode[0]:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Min: {min_val:.2f}")
print(f"Max: {max_val:.2f}")
print(f"25th Percentile: {percentiles[0]:.2f}")
print(f"50th Percentile: {percentiles[1]:.2f}")
print(f"75th Percentile: {percentiles[2]:.2f}")
Hypothesis Testing

from scipy import stats

# Generate sample data
np.random.seed(42)
group1 = np.random.normal(100, 15, 100)
group2 = np.random.normal(105, 15, 100)

# T-test (independent samples)
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject null hypothesis: There is a significant difference between groups")
else:
    print("Fail to reject null hypothesis: No significant difference between groups")

# Paired t-test
before = np.random.normal(100, 10, 50)
after = before + np.random.normal(5, 5, 50)
t_stat, p_value = stats.ttest_rel(before, after)
print(f"\nPaired T-test:")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# One-way ANOVA
group1 = np.random.normal(100, 10, 30)
group2 = np.random.normal(105, 10, 30)
group3 = np.random.normal(110, 10, 30)
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"\nANOVA:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Chi-square test
observed = np.array([[50, 30], [20, 40]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"\nChi-square test:")
print(f"Chi2 statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")

# Correlation test
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5
corr, p_value = stats.pearsonr(x, y)
print(f"\nCorrelation test:")
print(f"Correlation coefficient: {corr:.4f}")
print(f"P-value: {p_value:.4f}")
Regression Analysis

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Generate sample data
np.random.seed(42)
X = np.random.randn(100, 3)
y = 2 * X[:, 0] + 3 * X[:, 1] - 1 * X[:, 2] + np.random.randn(100) * 0.5

# Simple Linear Regression with sklearn
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

print("Linear Regression Results:")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {r2_score(y, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.4f}")

# Multiple Linear Regression with statsmodels
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const)
results = model.fit()
print("\nDetailed Regression Results:")
print(results.summary())

# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

X_poly = PolynomialFeatures(degree=2).fit_transform(X)
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)

print(f"\nPolynomial Regression R-squared: {r2_score(y, y_pred_poly):.4f}")

# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X_class, y_class = make_classification(n_samples=1000, n_features=2, n_redundant=0, 
                                       n_informative=2, random_state=42, n_clusters_per_class=1)
model_logistic = LogisticRegression()
model_logistic.fit(X_class, y_class)
y_pred_logistic = model_logistic.predict(X_class)

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print(f"\nLogistic Regression Accuracy: {accuracy_score(y_class, y_pred_logistic):.4f}")
print("\nClassification Report:")
print(classification_report(y_class, y_pred_logistic))
print("\nConfusion Matrix:")
print(confusion_matrix(y_class, y_pred_logistic))

Module 9: Machine Learning Basics
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. Python's scikit-learn library provides simple and efficient tools for data mining and data analysis.

Supervised Learning

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris, make_classification

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))

# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))

# Support Vector Machine
from sklearn.svm import SVC
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train_scaled, y_train)
svm_pred = svm_model.predict(X_test_scaled)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_pred))

# Neural Network
from sklearn.neural_network import MLPClassifier
nn_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
nn_model.fit(X_train_scaled, y_train)
nn_pred = nn_model.predict(X_test_scaled)
print("Neural Network Accuracy:", accuracy_score(y_test, nn_pred))
Unsupervised Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load high-dimensional data
digits = load_digits()
X = digits.data
y = digits.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_pca)
centers = kmeans.cluster_centers_

# Visualize clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, marker='X', label='Centroids')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('K-Means Clustering')
plt.legend()
plt.show()

# Elbow method and silhouette scores to find an optimal k
from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_pca)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_pca, kmeans.labels_))

# Plot elbow method
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), inertias, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Perform hierarchical clustering
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
hc_clusters = hc.fit_predict(X_pca)

# Create dendrogram
linkage_matrix = linkage(X_pca, method='ward')
plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Model Evaluation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Cross-validation
cv_scores = cross_val_score(rf_model, X, y, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f}")

# Different evaluation metrics (using the Random Forest predictions from above)
y_pred = rf_pred
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Feature Importance

# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importance:")
print(feature_importance)

# Get feature importance from Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
feature_importance_dt = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nDecision Tree Feature Importance:")
print(feature_importance_dt)

# Get feature importance from Gradient Boosting
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
feature_importance_gb = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nGradient Boosting Feature Importance:")
print(feature_importance_gb)

Module 10: Time Series Analysis
Introduction to Time Series

Time series analysis involves analyzing data points collected over time to identify patterns, trends, and seasonality. This is particularly important in finance, economics, weather forecasting, and any domain where understanding temporal patterns is crucial.

Time Series Components

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a time series
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = np.random.randn(365) + 100
ts = pd.Series(values, index=dates)

# Time series components
print(ts.head())
print(ts.index)  # DatetimeIndex
print(ts.index.freq)  # Frequency (D for daily)
print(ts.index.month)  # Month component
print(ts.index.day)  # Day component
print(ts.index.dayofweek)  # Day of week (0=Monday)
print(ts.index.quarter)  # Quarter (1-4)
Time Series Operations

# Resampling (aggregate to a coarser frequency)
ts_daily = ts.resample('D').mean()  # Daily
ts_weekly = ts.resample('W').mean()  # Weekly
ts_monthly = ts.resample('M').mean()  # Monthly

# Shifting
ts_shifted = ts.shift(1)  # Shift forward by 1 period
ts_shifted_back = ts.shift(-1)  # Shift backward by 1 period

# Rolling calculations
ts_7day_avg = ts.rolling(window=7).mean()
ts_30day_avg = ts.rolling(window=30).mean()
ts_std_30day = ts.rolling(window=30).std()

# Expanding and window calculations
ts_expanding = ts.expanding().mean()
ts_cumsum = ts.expanding().sum()

# Differences
ts_diff = ts.diff()  # First difference
ts_diff2 = ts.diff(2)  # Second difference
Decomposition

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose time series
decomposition = seasonal_decompose(ts, model='additive', period=30)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()

# Access components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
Stationarity Testing

from statsmodels.tsa.stattools import adfuller

# Check for stationarity
def check_stationarity(timeseries):
    result = adfuller(timeseries)
    print('ADF Statistic:', result[0])
    print('P-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value}')
    
    if result[1] <= 0.05:
        print("Time series is stationary")
    else:
        print("Time series is non-stationary")

# Test the original series, then difference it to make it stationary
check_stationarity(ts)

ts_diff = ts.diff().dropna()
check_stationarity(ts_diff)
ARIMA Models

from statsmodels.tsa.arima.model import ARIMA

# Split data
train_size = int(len(ts) * 0.8)
train, test = ts[:train_size], ts[train_size:]

# Fit ARIMA model
model = ARIMA(train, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())

# Make predictions
predictions = model_fit.forecast(steps=len(test))
print(f"ARIMA Model Predictions:")
print(predictions.head())

# Plot predictions
plt.figure(figsize=(12, 6))
plt.plot(train.index, train, label='Train')
plt.plot(test.index, test, label='Test')
plt.plot(test.index, predictions, label='Predictions', color='red')
plt.title('ARIMA Model Predictions')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Calculate forecast accuracy
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse:.2f}')

Module 11: Final Project
Project Overview

For the final project, you'll create a comprehensive data analysis project that demonstrates all the concepts learned throughout the course. This project will involve loading, cleaning, analyzing, and visualizing a real-world dataset to extract meaningful insights.

Project Structure

# project.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

class DataAnalysisProject:
    def __init__(self, data_path):
        self.data_path = data_path
        self.df = None
        self.X = None
        self.y = None
        self.model = None
        self.scaler = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.y_pred = None
        
    def load_data(self):
        """Load data from CSV file"""
        try:
            self.df = pd.read_csv(self.data_path)
            print(f"Data loaded successfully. Shape: {self.df.shape}")
            return True
        except FileNotFoundError:
            print(f"Error: File {self.data_path} not found")
            return False
    
    def clean_data(self):
        """Clean the dataset"""
        # Handle missing values
        numeric_columns = self.df.select_dtypes(include=[np.number]).columns
        self.df[numeric_columns] = self.df[numeric_columns].fillna(self.df[numeric_columns].mean())
        
        # Handle duplicates
        self.df = self.df.drop_duplicates()
        
        # Convert data types
        self.df['date'] = pd.to_datetime(self.df['date'])
        
        print(f"Data cleaned. Shape: {self.df.shape}")
        return True
    
    def explore_data(self):
        """Perform exploratory data analysis"""
        print("Data Overview:")
        print(self.df.describe())
        print("\nData Types:")
        print(self.df.dtypes)
        print("\nMissing Values:")
        print(self.df.isnull().sum())
        
        # Visualize distributions
        plt.figure(figsize=(12, 6))
        sns.histplot(self.df['value'], bins=20)
        plt.title('Distribution of Values')
        plt.show()
        
        # Visualize correlations
        if len(self.df.select_dtypes(include=[np.number]).columns) > 1:
            plt.figure(figsize=(10, 8))
            sns.heatmap(self.df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm', center=0)
            plt.title('Correlation Matrix')
            plt.show()
    
    def feature_engineering(self):
        """Create new features from existing ones"""
        # Date features
        self.df['year'] = self.df['date'].dt.year
        self.df['month'] = self.df['date'].dt.month
        self.df['day'] = self.df['date'].dt.day
        self.df['day_of_week'] = self.df['date'].dt.dayofweek
        
        # Interaction features
        self.df['value_log'] = np.log(self.df['value'])
        self.df['value_sqrt'] = np.sqrt(self.df['value'])
        
        return True
    
    def build_model(self, target_column):
        """Build a predictive model"""
        # Define features and target (use only date-derived features to avoid leaking the target)
        features = ['year', 'month', 'day', 'day_of_week']
        self.X = self.df[features]
        self.y = self.df[target_column]
        
        # Split data
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y, test_size=0.2, random_state=42
        )
        
        # Scale features
        self.scaler = StandardScaler()
        self.X_train = self.scaler.fit_transform(self.X_train)
        self.X_test = self.scaler.transform(self.X_test)
        
        # Train model
        self.model = LinearRegression()
        self.model.fit(self.X_train, self.y_train)
        
        # Evaluate model
        self.y_pred = self.model.predict(self.X_test)
        r2 = r2_score(self.y_test, self.y_pred)
        rmse = np.sqrt(mean_squared_error(self.y_test, self.y_pred))
        
        print(f"Model Performance:")
        print(f"R² Score: {r2:.4f}")
        print(f"RMSE: {rmse:.4f}")
        
        return True
    
    def visualize_results(self):
        """Visualize model results"""
        # Actual vs Predicted
        plt.figure(figsize=(12, 6))
        plt.scatter(self.y_test, self.y_pred, alpha=0.6)
        lims = [min(self.y_test.min(), self.y_pred.min()), max(self.y_test.max(), self.y_pred.max())]
        plt.plot(lims, lims, 'r--')
        plt.xlabel('Actual Values')
        plt.ylabel('Predicted Values')
        plt.title('Actual vs Predicted Values')
        plt.show()
        
        # Residuals
        residuals = self.y_test - self.y_pred
        plt.figure(figsize=(12, 4))
        plt.scatter(self.y_test, residuals, alpha=0.6)
        plt.axhline(y=0, color='red', linestyle='--', alpha=0.7)
        plt.title('Residuals Plot')
        plt.xlabel('Actual Values')
        plt.ylabel('Residuals')
        plt.show()
    
    def run_analysis(self):
        """Run the complete analysis pipeline"""
        if not self.load_data():
            print("Failed to load data. Please check the file path.")
            return
        
        if not self.clean_data():
            print("Failed to clean data.")
            return
            
        # explore_data() and visualize_results() display output and do not return a status
        self.explore_data()

        if not self.feature_engineering():
            print("Failed to engineer features.")
            return

        if not self.build_model('value'):
            print("Failed to build model.")
            return

        self.visualize_results()
            
        print("Analysis completed successfully!")
Project Features
  • Data Loading: Load data from various sources (CSV, Excel, databases)
  • Data Cleaning: Handle missing values, duplicates, and type issues
  • Exploratory Analysis: Generate descriptive statistics and visualizations
  • Feature Engineering: Create new features from existing ones
  • Model Building: Train and evaluate predictive models
  • Visualization: Create compelling visualizations to communicate findings
Project Extensions

Once you've implemented the basic analysis, consider these extensions:

  • Add more advanced statistical tests
  • Implement multiple models and compare performance
  • Create an interactive dashboard using Dash or Streamlit (a minimal sketch follows this list)
  • Deploy the analysis as a web application
  • Add automated reporting
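
If you choose the dashboard extension, a minimal Streamlit sketch might look like the following; the widgets and column handling are illustrative assumptions, not part of the project class above. Run it with streamlit run app.py.

# app.py - minimal Streamlit dashboard sketch
import streamlit as st
import pandas as pd

st.title('Data Analysis Dashboard')

uploaded = st.file_uploader('Upload a CSV file', type='csv')
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write('Data preview:', df.head())
    st.write('Summary statistics:', df.describe())

    column = st.selectbox('Column to plot', df.select_dtypes('number').columns)
    st.line_chart(df[column])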