In this study, we analyze the spatial and temporal distribution of polycyclic aromatic hydrocarbons (PAHs) in suspended particulate matter (PM10) using data from several monitoring stations. By examining PM10 and PAH concentrations over time, we aim to identify patterns, sources, and potential environmental impacts of these pollutants.
Polycyclic aromatic hydrocarbons (PAHs) are hazardous environmental pollutants known for their carcinogenic and mutagenic properties \citep{iarc2010iarc, haritash2009polycyclic}. They are mainly produced as a result of incomplete combustion and are prevalent in urban environments \citep{kim2013polycyclic}. Understanding the distribution and concentration levels of PAHs in suspended particulate matter, such as PM10, is crucial for assessing air quality and potential health risks \citep{yang2002sources, li2006urban}.
Data were collected from four monitoring stations (IDs 101, 102, 103, and 104) between January 1, 2023, and December 31, 2023. The pollutants analyzed were PM10 and PAHs. The dataset consists of 100 records, each containing the station identifier, pollutant type, date, and measured concentration.
Data analysis was performed using the Python programming language. Histogram plots depict the distribution of pollutant concentrations, and time series analysis allows observation of temporal trends.
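As an illustration of the two analyses just described, the following minimal sketch computes histogram bin counts for one pollutant and monthly mean concentrations per pollutant. The column names (\texttt{station}, \texttt{pollutant}, \texttt{date}, \texttt{concentration}) and the synthetic values are assumptions made for this example; they are not the study's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monitoring records (illustrative values only).
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "station": rng.choice([101, 102, 103, 104], size=n),
    "pollutant": rng.choice(["PM10", "PAH"], size=n),
    "date": pd.Timestamp("2023-01-01")
            + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    "concentration": rng.gamma(shape=2.0, scale=10.0, size=n),
})


def concentration_histogram(df, pollutant, bins=10):
    """Return (counts, bin_edges) for the concentrations of one pollutant."""
    mask = df["pollutant"] == pollutant
    values = df.loc[mask, "concentration"].to_numpy()
    return np.histogram(values, bins=bins)


def monthly_means(df):
    """Mean concentration per calendar month (rows) and pollutant (columns)."""
    month = df["date"].dt.month.rename("month")
    grouped = df.groupby([month, "pollutant"])["concentration"].mean()
    return grouped.unstack("pollutant")


counts, edges = concentration_histogram(data, "PM10")
means = monthly_means(data)
```

The bin counts can be passed to any plotting library to draw the histogram, and the monthly table gives one time series per pollutant.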
To effectively process and analyze the PAH data, we employed a mathematical matrix model utilizing linear algebra techniques. This model facilitates the quantification of relationships between pollutant concentrations and various factors such as monitoring stations, pollutant types, and temporal variables.
The data were structured into matrices to enable efficient mathematical operations. Since the dataset comprises measurements from multiple stations over time for different pollutants, we represented it so that each row corresponds to an observation and each column represents a variable.
\subsubsection{Feature Engineering}
We created features that capture the relevant information:
\begin{itemize}
\item\textbf{Station Indicators}: Station IDs were encoded using one-hot encoding, resulting in a matrix $\mathbf{S}$ of size $n \times m$, where $n$ is the number of observations and $m$ is the number of stations.
\item\textbf{Pollutant Indicators}: Pollutants were also one-hot encoded, forming a matrix $\mathbf{P}$ of size $n \times p$, with $p$ being the number of pollutants.
\item\textbf{Temporal Variables}: Temporal features such as the day of the year were extracted, resulting in a matrix $\mathbf{T}$ of size $n \times k$, where $k$ is the number of temporal features.
\item\textbf{Concentration Values}: The target variable is the vector $\mathbf{y}$ of size $n \times 1$, which contains the measured concentrations.
\end{itemize}
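The feature matrices listed above can be sketched as follows. This is a minimal example under stated assumptions: the column names, the six sample records, and the choice of day of the year as the only temporal feature ($k = 1$) are hypothetical and serve only to show the shapes of $\mathbf{S}$, $\mathbf{P}$, $\mathbf{T}$, and $\mathbf{y}$.

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataset; column names and values are assumptions.
data = pd.DataFrame({
    "station": [101, 102, 103, 104, 101, 103],
    "pollutant": ["PM10", "PAH", "PM10", "PAH", "PAH", "PM10"],
    "date": pd.to_datetime(["2023-01-15", "2023-03-02", "2023-06-20",
                            "2023-09-11", "2023-11-30", "2023-12-31"]),
    "concentration": [42.0, 1.3, 35.5, 0.9, 2.1, 51.2],
})

# S: one-hot station indicators, shape (n, m)
S = pd.get_dummies(data["station"], prefix="station").to_numpy(dtype=float)

# P: one-hot pollutant indicators, shape (n, p)
P = pd.get_dummies(data["pollutant"], prefix="pollutant").to_numpy(dtype=float)

# T: temporal features, shape (n, k); here k = 1 (day of the year)
T = data["date"].dt.dayofyear.to_numpy(dtype=float).reshape(-1, 1)

# y: measured concentrations, shape (n, 1)
y = data["concentration"].to_numpy(dtype=float).reshape(-1, 1)
```

Each row of $\mathbf{S}$ and of $\mathbf{P}$ contains exactly one entry equal to 1, which marks the station and the pollutant of that observation.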
\subsubsection{Design Matrix}
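The features were combined column-wise to form the design matrix $\mathbf{X}$:
\begin{equation}
\mathbf{X} = \left[\, \mathbf{S} \;\; \mathbf{P} \;\; \mathbf{T} \,\right],
\end{equation}
which has size $n \times (m + p + k)$.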
The estimated coefficients provide insights into the impact of each feature on pollutant concentrations:
\begin{itemize}
\item\textbf{Station Effects}: Differences in concentrations attributable to different monitoring stations.
\item\textbf{Pollutant Effects}: Variations between PAH and PM10 concentrations.
\item\textbf{Temporal Effects}: Seasonal trends and time-related variations.
\end{itemize}
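The coefficients discussed above result from fitting a linear model to the design matrix $\mathbf{X}$. The estimator is not stated explicitly in this section, so the following is an assumption: with ordinary least squares, the model is
\begin{equation}
\mathbf{y} = \mathbf{X}\beta + \varepsilon,
\end{equation}
where $\beta$ is the coefficient vector and $\varepsilon$ the error term, and the estimate is
\begin{equation}
\hat{\beta} = \arg\min_{\beta} \| \mathbf{y} - \mathbf{X}\beta \|_2^2 = (\mathbf{X}^{\top}\mathbf{X})^{+}\mathbf{X}^{\top}\mathbf{y}.
\end{equation}
Here $(\cdot)^{+}$ denotes the Moore--Penrose pseudoinverse. It is used instead of the ordinary inverse because, if every station and every pollutant receives its own indicator column, the columns of $\mathbf{S}$ and the columns of $\mathbf{P}$ each sum to the all-ones vector, so $\mathbf{X}^{\top}\mathbf{X}$ is singular. Dropping one indicator column per categorical variable makes the ordinary inverse applicable.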
\subsection{Implementation}
The model was implemented using Python with NumPy and Pandas libraries. Data preprocessing included handling missing values and encoding categorical variables. The linear regression model was fitted using matrix operations, and model performance was evaluated using metrics such as Root Mean Square Error (RMSE).
\subsubsection{Python Implementation Example}
\begin{verbatim}
import numpy as np
import pandas as pd
# Assume 'data' is a DataFrame containing the dataset
\end{verbatim}
Model performance was assessed using RMSE, providing a measure of the differences between predicted and observed concentrations. The low RMSE value indicates a good fit of the model to the data.
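The fitting and evaluation steps can be sketched as below. This is a hedged illustration, not the study's code: the data are synthetic, the coefficient values are invented for the example, and \texttt{numpy.linalg.lstsq} is assumed as the solver because it returns the minimum-norm least-squares solution even when the design matrix is rank deficient.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients and the numerical rank of X."""
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta, rank


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic example: 4 stations, 2 pollutants, 1 temporal feature.
rng = np.random.default_rng(42)
n = 100
stations = rng.integers(0, 4, size=n)
pollutants = rng.integers(0, 2, size=n)
day = rng.integers(1, 366, size=n)

S = np.eye(4)[stations]            # one-hot station indicators
P = np.eye(2)[pollutants]          # one-hot pollutant indicators
T = (day / 365.0).reshape(-1, 1)   # scaled day of the year
X = np.hstack([S, P, T])

# Invented coefficients used only to generate the synthetic target.
true_beta = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 0.0, -2.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

beta_hat, rank = fit_ols(X, y)
y_hat = X @ beta_hat
print(f"rank(X) = {rank} of {X.shape[1]} columns, RMSE = {rmse(y, y_hat):.3f}")
```

An RMSE should be read relative to the scale of the measured concentrations, and computing it on held-out records gives a more reliable picture of fit than computing it on the training data.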
\subsection{Benefits of the Matrix Model}
This matrix-based approach offers several advantages:
\begin{itemize}
\item\textbf{Efficiency}: Matrix operations are computationally efficient and suitable for large datasets.
\item\textbf{Clarity}: Provides a clear mathematical framework for understanding relationships between variables.
\item\textbf{Extendability}: Can be extended to more complex models or integrated with machine learning algorithms.
\end{itemize}
\subsection{Considerations}
While the linear regression model is effective, certain assumptions must be considered:
\begin{itemize}
\item\textbf{Linearity}: Assumes a linear relationship between predictors and the target variable.
\item\textbf{Independence}: Observations are assumed to be independent.
\item\textbf{Homoscedasticity}: Constant variance of errors is assumed.
\item\textbf{Normality}: Errors are assumed to be normally distributed.
\end{itemize}
Potential issues like multicollinearity among features were checked to ensure the stability of coefficient estimates. Regularization techniques can be employed if overfitting is a concern.
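One possible way to carry out such a check, and to apply regularization, is sketched below. The specific diagnostics (numerical rank and condition number) and the ridge penalty are assumptions chosen for illustration; this section does not state which checks or regularizers were used, and the data are synthetic.

```python
import numpy as np


def condition_number(X):
    """Ratio of largest to smallest singular value; large values signal collinearity."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(s[0] / s[-1]) if s[-1] > 0 else float("inf")


def fit_ridge(X, y, alpha=1.0):
    """Ridge estimate (X^T X + alpha I)^{-1} X^T y with an equal penalty on all coefficients."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)


rng = np.random.default_rng(1)
n = 100
S = np.eye(4)[rng.integers(0, 4, size=n)]   # station indicators
P = np.eye(2)[rng.integers(0, 2, size=n)]   # pollutant indicators
T = rng.uniform(0.0, 1.0, size=(n, 1))      # one scaled temporal feature

# All indicator columns: rows of S and rows of P both sum to 1, so X_full is collinear.
X_full = np.hstack([S, P, T])
# Dropping one pollutant column (reference level) removes the exact dependency.
X_reduced = np.hstack([S, P[:, 1:], T])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# Invented coefficients used only to generate a synthetic target.
y = X_full @ np.array([1.0, 2.0, 3.0, 4.0, 10.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=n)
beta_ridge = fit_ridge(X_full, y, alpha=1.0)
```

In this sketch the full matrix has one fewer independent column than it has columns, while the reduced matrix has full column rank. The ridge penalty keeps the linear system solvable even for the collinear matrix, at the cost of shrinking the coefficients toward zero.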
\section{Results}
This section presents the key results on the spatial and temporal distribution of PM10 and PAH concentrations, using visualizations.
\subsection{Concentration Histograms}
The histograms in Figures~\ref{fig:histogram_pm10} and~\ref{fig:histogram_pah} show the distribution of concentrations for PM10 and PAHs, respectively. These plots highlight the variability of pollutant concentrations across different measurements.
The histograms indicate that PM10 concentrations are more widely distributed than PAH concentrations, suggesting greater variability of PM10 levels at the monitoring stations.
Figure~\ref{fig:mean_concentration_over_time} presents the mean concentrations of pollutants over time. The time series analysis reveals seasonal patterns and potential temporal variability in pollutant levels.
The time series plot shows that both PM10 and PAH concentrations exhibit fluctuations throughout the year, with possible peaks in certain months, indicating potential seasonal effects influenced by environmental conditions and emission sources.
The visualizations indicate that pollutant concentrations vary substantially, likely under the influence of environmental conditions and emission sources. PM10 concentrations were more widely distributed than PAH concentrations, and the temporal trends suggest a possible seasonal effect. Further analysis is required to link these trends to specific environmental factors or emission events.
The observed variability in PM10 and PAH concentrations is consistent with previous studies emphasizing the impact of anthropogenic activities and environmental conditions on pollutant levels \citep{chen2007polycyclic, kim2013polycyclic}. Potential seasonal trends may be attributed to factors such as heating during winter months, increased emissions from transportation, or atmospheric conditions affecting pollutant dispersion.
This study shows substantial variability in PM10 and PAH concentrations across monitoring stations and over time. The results highlight the importance of continuous monitoring and analysis for understanding the factors that influence air pollutant levels. Future research should focus on identifying specific emission sources and assessing the health impacts associated with exposure to these pollutants.