The Degree of Best Downsampled Convolutional Neural Network Approximation in Terms of Orlicz Modulus of Smoothness

Amna Manaf Al-Janabi* Hawraa Abbas Almurieb

Department of Mathematics, College of Education for Pure Sciences, University of Babylon, Hillah 51001, Iraq

Corresponding Author Email: amnammn22@gmail.com

Pages: 2235-2242

DOI: https://doi.org/10.18280/mmep.110825

Received: 6 December 2023 | Revised: 28 April 2024 | Accepted: 15 May 2024 | Available online: 28 August 2024

© 2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Abstract: 

It is essential to study the approximation-theoretic foundations of deep convolutional neural networks, given their rapid development in vital application domains and their substantial dependence on approximation. The aim is to study the approximation abilities of deep convolutional neural networks built from downsampling operators in Orlicz spaces, where downsampling reduces the high dimensions that cause overfitting. The degree of best approximation of Orlicz functions is estimated in terms of a higher-order modulus of smoothness. Moreover, direct and inverse theorems are proved, so that the degree of approximation is bounded from above and below by the modulus of smoothness. The resulting degree of approximation of our CNN vanishes theoretically faster than the classical ones because of its dependence on the modulus of smoothness.

Keywords: 

approximation, Orlicz, modulus of smoothness, convolutional neural networks, downsampling

1. Introduction

Orlicz spaces are among the essential wide function spaces that are useful in practice. It is interesting to study approximation by convolutional neural networks (CNNs) in Orlicz spaces, in order to express the degree of approximation in terms of a modulus of smoothness. We work in an extended class of Orlicz spaces, called quasi-Orlicz spaces.

The introduction consists of two parts. The first part concerns the space of interest, the Orlicz space: it includes preliminaries and milestones in its development, and then defines the Orlicz space and the modulus of smoothness together with auxiliary results about them. The second part reviews the literature on CNNs and sampling methods. We then construct our CNN and obtain theoretical approximation results in terms of the Orlicz modulus of smoothness. Finally, an application is given in which the best approximation of a function from a generalized Orlicz space by CNNs is obtained.

1.1 Orlicz spaces

Orlicz spaces were first studied by the mathematician Orlicz [1] as a generalization of the Lebesgue integrable spaces ${{L}_{p}}$. Being complete, an Orlicz space forms an extended Banach space:

${{L}_{\Phi }}\left( \mu  \right)=\left\{ f:f~\text{is }\!\!~\!\!\text{ }\mu -\text{meas }\!\!~\!\!\text{ and }\!\!~\!\!\text{ }||{{f}||_{\Phi }}<\infty  \right\}$                 (1)

where the norm ${{||·||}_{\Phi }}$ has been defined in several ways, beginning with Orlicz himself and his norm

$||f||_{\Phi }^{0}=\sup \left\{ \underset{T}{\mathop \int }\,\left| f\left( t \right)y\left( t \right) \right|d\mu ~:y\in {{L}_{\Psi ,}}{{I}_{\Psi }}\left( y \right)\le 1 \right\}$                 (2)

where the Orlicz norm is defined in terms of the Young function $\Phi $ and its complementary function $\Psi $, the generator of the modular unit ball. The function $\Psi $ is defined by

$\Psi \left( u \right)=\text{sup}\left\{ \mid u\mid v-\Phi \left( v \right)\,\!:v\ge 0 \right\}$                   (3)

for all $u\in \mathbb{R}$.

At about the same time, in the fifties, Nakano [2], Morse and Transue [3] and Luxemburg [4] investigated the Luxemburg norm, which is defined using the Minkowski functional of the convex modular unit ball $\left\{ x:~I\left( x \right)\le 1 \right\}$, namely

${||{f}||_{\Phi }}=\underset{\lambda >0}{\mathop{\text{inf}}}\,\left\{ {{I}_{\Phi }}\left( \frac{f}{\lambda } \right)\le 1 \right\}$                    (4)

Another main norm was introduced by Amemiya at about the same time and is named after him, the Amemiya norm [2].

$||f||_{\Phi }^{A}=\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }\left( 1+{{I}_{\Phi }}\left( \lambda f \right) \right)$                  (5)

In separate papers, Krasnosel'skii and Rutickii [5], Nakano [2] and Luxemburg and Zaanen [6] proved, under additional conditions on $\Phi $, that the Amemiya norm $||·||_{\Phi }^{A}$ coincides with the Orlicz norm $||·||_{\Phi }^{0}$. Moreover, Cui et al. [7] gave basic results on the so-called $p$-Amemiya norms on Orlicz spaces. Wisła [8] gave an overview of the development of the spaces equipped with these norms, and later [9] introduced more general $p$-Amemiya type norms by relaxing the conditions on the outer function. He then introduced the concept of an outer function $s$ and obtained $s$-norms on Orlicz spaces.

${||{f}||_{\Phi ,s}}=\underset{\lambda >0~}{\mathop{\inf }}\,\frac{1}{\lambda }s\left( {{I}_{\Phi }}\left( \lambda f \right) \right)$                     (6)

Up to the present, many papers have further developed Orlicz norms [10-12].

All the above definitions of Orlicz norms require $p$ to lie within $\left[ 1,~\infty  \right]$. For the case $0<p<1$, AL-Janabi and Almurieb [13] made another contribution to Orlicz spaces via the outer function ${{S}_{p}}:\left[ 0,\infty  \right)\to \left[ 0,\infty  \right)$ in the following definition.

Definition 1. [13] For $0<p<1$ the family ${{S}_{p}}$ of functions is defined by

 ${{S}_{p}}~\left( f \right)={{\left( 1+{{f}^{2}} \right)}^{\frac{1}{p}}}$                 (7)

where, $f\in {{L}_{\Phi ,~{{S}_{p}}}}\left( {{I}^{d}} \right),$ the quasi-normed space.

${{L}_{\Phi ,{{S}_{p}}}}\left( {{I}^{d}} \right)=\left\{ f\in {{L}_{p}}\left( {{I}^{d}} \right)\left| {{||f||}_{\Phi ,~{{S}_{p}}}}<\infty  \right. \right\}$                 (8)

where,

${{||f||}_{\Phi ,{{S}_{p}}}}=\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }{{\left( I_{\text{ }\!\!\Phi\!\!\text{ }}^{2}~\left( \lambda f \right)+1 \right)}^{\frac{1}{p}}}$                     (9)

As mentioned by AL-Janabi and Almurieb [13], the functions ${{S}_{p}}$, $p<1$, are convex, strictly increasing, and all ${{S}_{p}}$ coincide at zero. As concluded there, the quasi-normed space ${{L}_{\Phi ,~{{S}_{p}}}}$, $p\le 1$, generalizes the $p$-Amemiya normed space defined for $p\ge 1$.
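For readers who wish to experiment numerically, the following is a minimal sketch of the quasi-norm of Eq. (9). It is our own illustration, not part of [13]: the concrete Young function $\Phi \left( u \right)={{u}^{2}}$ is assumed purely for the example, the modular $I_\Phi$ is approximated by a Riemann sum on $I=[0,1]$, and the infimum over $\lambda$ is taken over a finite grid; the helper names `modular` and `sp_quasi_norm` are hypothetical.

```python
import numpy as np

def modular(phi, f_vals, dx):
    """Approximate the modular I_Phi(f) = ∫ Phi(|f(t)|) dt by a Riemann sum."""
    return np.sum(phi(np.abs(f_vals))) * dx

def sp_quasi_norm(f_vals, dx, p=0.5, phi=lambda u: u ** 2,
                  lambdas=np.logspace(-3, 3, 2000)):
    """Estimate ||f||_{Phi,S_p} = inf_{lam>0} (1/lam) * (I_Phi(lam f)^2 + 1)^(1/p),
    following Eq. (9); the infimum is approximated over a finite grid of lambda."""
    vals = [(1.0 / lam) * (modular(phi, lam * f_vals, dx) ** 2 + 1.0) ** (1.0 / p)
            for lam in lambdas]
    return min(vals)

# Example on I = [0, 1] with f(t) = sin(pi t) and p = 1/2.
t = np.linspace(0.0, 1.0, 1001)
dx = t[1] - t[0]
print(sp_quasi_norm(np.sin(np.pi * t), dx, p=0.5))
```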

Moreover, the degree of best approximation of functions from ${{L}_{\Phi ,~~{{S}_{p}}}}$ by CNNs was studied in [13], so that

$||f-{{N}||_{\Phi ,~~{{S}_{p}}}}<\epsilon $                 (10)

More precisely, the following theorem is given by AL-Janabi and Almurieb [13].

Theorem A. Let $f\in {{L}_{\text{ }\!\!\Phi\!\!\text{ },{{S}_{p}}}}\left( {{I}^{d}} \right)$, $0<p<1$; then by (10) there exists $f_{J}^{w,b}$ such that $||f-f_{J}^{w,b}||_{\text{ }\!\!\Phi\!\!\text{ },{{S}_{p}}}\le {{C}_{\left( p,k \right)}}{{||f||}_{\text{ }\!\!\Phi\!\!\text{ },{{S}_{p}}}}$.

Our target here is to improve the degree of approximation by studying the modulus of smoothness in our Orlicz quasi-normed space in the following section.

1.2 Orlicz moduli of smoothness

Moduli of smoothness have been studied widely in Orlicz spaces. Their importance comes from the need to improve the degree of function approximation via direct approaches, which imply faster convergence to zero than general earlier estimates. On the other hand, converse theorems characterize the smoothness of a function in terms of its degree of approximation. The first results were given by Jackson [14] and Bernstein [15], for direct and inverse theorems respectively, for the space of continuous functions in terms of the modulus of continuity [16]. Later, second- and third-order moduli of smoothness were involved in many papers concerning wide generalizations of function spaces [17-22].

Although the birth of the modulus of smoothness lies much deeper in history [23-26], we are concerned here with moduli of smoothness as far as they relate to Orlicz spaces. It is difficult to trace the earliest works on this topic, since they were written in Russian; the oldest paper we have obtained is attributed to Tsyganok [27] in 1966, who proved a direct theorem in terms of a modulus of continuity for Orlicz spaces. Later, in the eighties of the last century, Ramazanov generalized the direct theorem in Orlicz spaces to moduli of smoothness of higher orders [28]. In 1985, Musielak [24] generalized the τ-modulus of smoothness, introduced by Popov [25], to generalized Orlicz spaces. In 1991, Garidi [29] extended the results of Ramazanov [28]. Many other results were obtained by Israfilov and Akgün [30], Akgün [31] and Chaichenko et al. [32] for weighted Orlicz spaces. Further extensions to Orlicz-type spaces were given by Shidlich and Chaichenko [33, 34] and combined by Chaichenko et al. [32].

1.3 Modulus of Smoothness in ${{L}_{\Phi ,~{{S}_{p}}}}\left( {{I}^{d}} \right)$

Let’s now present the main definitions of our modulus of smoothness.

First, from Ditzian and Totik [23], define the following N-th symmetric difference

$\Delta_{h}^{N}\left( f\left( x \right) \right)=\underset{i=1}{\overset{N}{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\left( \begin{matrix}   N  \\   i  \\ \end{matrix} \right)~f\left( x-ih \right)$                     (11)

The Orlicz Modulus of smoothness in terms of (11) is given by

${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}=\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}$                (12)
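A small numeric sketch of (11) and (12) may help fix ideas. It is our own illustration, not part of the paper's construction: the sup over $h$ and the integral are discretized, the Orlicz quasi-norm of (9) is replaced by a plain discrete $L_p$ quasi-norm purely as a stand-in, and the $i=0$ term of the difference is included (the standard convention, under which the difference of a constant vanishes).

```python
import numpy as np
from math import comb

def nth_difference(f, x, h, N):
    """N-th difference of f at x with step h, cf. Eq. (11).
    The i = 0 term is included here (standard convention)."""
    return sum((-1) ** i * comb(N, i) * f(x - i * h) for i in range(N + 1))

def modulus_of_smoothness(f, delta, N, x=None, p=0.5):
    """omega_N(f, delta) ~ sup_{|h| <= delta} ||Delta_h^N f||, cf. Eq. (12).
    A discrete L_p quasi-norm stands in for the Orlicz quasi-norm of Eq. (9)."""
    if x is None:
        x = np.linspace(0.0, 1.0, 400)
    best = 0.0
    for h in np.linspace(0.0, delta, 50):
        diff = nth_difference(f, x, h, N)
        best = max(best, float(np.mean(np.abs(diff) ** p) ** (1.0 / p)))
    return best

# A smooth function yields a much smaller second-order modulus than a jump.
print(modulus_of_smoothness(np.sin, 0.05, N=2))
print(modulus_of_smoothness(lambda t: np.sign(t - 0.5), 0.05, N=2))
```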

We study the main properties of the N-th symmetric difference (11) and the Orlicz modulus of smoothness (12) in the following section.

1.3.1 Modulus of smoothness properties (Auxiliary results)

In the following lemma, we prove that the $N$-th symmetric difference (11) satisfies these properties under our Orlicz quasi-norm (9):

Lemma 1

Let $f~$be any function from (8), then

  1. $||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}\le k\left( N \right){{||f||}_{\Phi ,~{{S}_{p}}}}$, where $k\left( N \right)=\underset{i=0}{\overset{\infty }{\mathop \sum }}\,\left| \left( \begin{matrix}   N  \\   i  \\\end{matrix} \right) \right|\le {{2}^{\left\{ N \right\}}},$ with $\left\{ N \right\}=\inf \left\{ k\in \mathbb{N}:k>N \right\}$.
  2. $\left( \Delta_{h}^{N}\left( \Delta_{h}^{M}\left( f \right) \right) \right)\left( x \right)=\Delta _{h}^{N+M}\left( f\left( x \right) \right)\left( a.~e. \right)$.
  3. $||\Delta_{h}^{N+M}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}\le {{2}^{\left\{ M \right\}}}||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}$.
  4. $\underset{h\to 0}{\mathop{\lim }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}=0$.

Proof

  1. Let $\lambda >0$, then by (7), (9) and (11), we have

$||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}} \\ =\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }~{{S}_{p}}\left( {{I}_{\Phi }}\left( \lambda \Delta_{h}^{N}\left( f \right) \right) \right) \\ =\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }~{{S}_{p}}\left( {{I}_{\Phi }}\left( \lambda \underset{i=1}{\overset{N}{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\binom{N}{i}f\left( x-ih \right) \right) \right) \\ \le C\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }~{{S}_{p}}\left( {{I}_{\Phi }}\left( \lambda \underset{i=1}{\overset{\infty }{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\binom{N}{i}f\left( x-ih \right) \right) \right) \\ \le C\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }~\underset{i=0}{\overset{\infty }{\mathop \sum }}\,\binom{N}{i}{{S}_{p}}\left( {{I}_{\Phi }}\left( \lambda {{\left( -1 \right)}^{i}}f\left( x-ih \right) \right) \right) \\ \le C\underset{i=0}{\overset{\infty }{\mathop \sum }}\,\binom{N}{i}\underset{\lambda >0}{\mathop{\inf }}\,\frac{1}{\lambda }~{{S}_{p}}\left( {{I}_{\Phi }}\left( \lambda f\left( x-ih \right) \right) \right) \\ \le Ck\left( N \right){{||f||}_{\Phi ,~{{S}_{p}}}}$

  2. By (7), (9) and (11), we have

$\left( \Delta_{h}^{N}\left( \Delta_{h}^{M}\left( f \right) \right) \right)\left( x \right) \\ =\underset{i=1}{\overset{N}{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\binom{N}{i}\left( \Delta_{h}^{M}\left( f \right) \right)\left( x-ih \right) \\ =\underset{i=1}{\overset{N}{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\binom{N}{i}\left( \underset{j=1}{\overset{M}{\mathop \sum }}\,{{\left( -1 \right)}^{j}}\binom{M}{j}f\left( x-ih-jh \right) \right) \\ =\underset{i=1}{\overset{N}{\mathop \sum }}\,\underset{j=1}{\overset{M}{\mathop \sum }}\,{{\left( -1 \right)}^{i+j}}\binom{N}{i}\binom{M}{j}f\left( x-\left( i+j \right)h \right) \\ =\underset{i=1}{\overset{N+M}{\mathop \sum }}\,{{\left( -1 \right)}^{i}}\binom{N+M}{i}f\left( x-ih \right) \\ =\Delta_{h}^{N+M}\left( f\left( x \right) \right).$

  3. By the above Lemma 1, we get

$||\Delta_{h}^{N+M}{{\left( f \right)}||_{\Phi ,\text{ }\!\!~\!\!\text{ }{{S}_{p}}}} \\ =||\Delta_{h}^{M}{{\left( \Delta_{h}^{N}\left( f \right) \right)}||_{\Phi ,\text{ }\!\!~\!\!\text{ }{{S}_{p}}}} \\ \le k\left( M \right)||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,\text{ }\!\!~\!\!\text{ }{{S}_{p}}}} \\ \le {{2}^{\left\{ M \right\}}}||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,\text{ }\!\!~\!\!\text{ }{{S}_{p}}}}.$

  4. Let $\epsilon >0$. By Theorem A (see (10)), choose $f_{J}^{w,b}$ with $||f-f_{J}^{w,b}||_{\Phi ,~{{S}_{p}}}<\frac{\epsilon }{{{2}^{N+1}}}$, and then choose $\delta =\delta \left( \epsilon ,N \right)$ such that, by (11), $||\Delta_{h}^{N}{{\left( f_{J}^{w,b} \right)}||_{\Phi ,~{{S}_{p}}}}<\frac{\epsilon }{2}$ for $\left| h \right|\le \delta $. Then

$||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}} \\ \le C\left[ ||\Delta_{h}^{N}{{\left( f-f_{J}^{w,b} \right)}||_{\Phi ,~{{S}_{p}}}}+||\Delta_{h}^{N}{{\left( f_{J}^{w,b} \right)}||_{\Phi ,~{{S}_{p}}}} \right] \\ \le C\left[ {{2}^{N}}||f-f_{J}^{w,b}||_{\Phi ,~{{S}_{p}}}+||\Delta_{h}^{N}{{\left( f_{J}^{w,b} \right)}||_{\Phi ,~{{S}_{p}}}} \right] \\ \le C\left[ {{2}^{N}}\cdot \frac{\epsilon }{{{2}^{N+1}}}+\frac{\epsilon }{2} \right]=C\epsilon $

In the following lemma, we give more general properties of our Orlicz modulus of smoothness.

Lemma 2

  1. ${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}$ is a positive nondecreasing continuous function of $\delta $ on $\left( 0,\infty  \right)$ and $\underset{\delta \to 0}{\mathop{\lim }}\,{{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{s}_{p}}}}=0.$
  2. ${{\omega }_{N}}{{\left( f+g,\delta  \right)}_{\Phi ,~{{S}_{p}}}}\le c\left( {{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}+{{\omega }_{N}}{{\left( g,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \right)$
  3. ${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}\le {{2}^{N-M}}{{\omega }_{M}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}$, for $M\le N$
  4. ${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}~\le {{2}^{\left\{ N \right\}}}~{{||f}||_{\Phi ,~{{S}_{p}}}}$
  5. ${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}\le ~{{\omega }_{N}}{{\left( f,{\delta }' \right)}_{\Phi ,~{{S}_{p}}}}$, for $\delta \le {\delta }'$
  6. ${{\omega }_{N}}{{\left( f,\gamma \delta  \right)}_{\Phi ,~{{S}_{p}}}}\le {{\left( 1+\gamma  \right)}^{k}}{{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}},$ for$~0<\gamma \le 1.$

Proof

  1. It is clear from Lemma 1
  2. From (11), (12) and Lemma 1, we have

${{\omega }_{N}}{{\left( f+g,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}{{\left( f+g \right)}||_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}\left( f \right)+\Delta_{h}^{N}{{\left( g \right)}||_{\Phi ,~{{S}_{p}}}} \\ \le c\left( \underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}+\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}{{\left( g \right)}||_{\Phi ,~{{S}_{p}}}} \right) \\ \le ~c\left( {{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}+{{\omega }_{N}}{{\left( g,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \right)$

  3. Again by (11), (12) and Lemma 1

${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{\sup }}\,||\Delta_{h}^{N-M}{{\left( \Delta_{h}^{M}\left( f \right) \right)}||_{\Phi ,~{{S}_{p}}}} \\ \le k\left( N-M \right)\underset{\left| h \right|\le \delta }{\mathop{\text{sup}}}\ ||\Delta_{h}^{M}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}} \\ \le {{2}^{N-M}}{{\omega }_{M}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}.$

  4. By Lemma 1

${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{\text{sup}}}\ ||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}\le {{2}^{\left\{ N \right\}}}\underset{\left| h \right|\le \delta }{\mathop{\text{sup}}}\,{{||f||}_{\Phi ,~{{S}_{p}}}} \\ \le {{2}^{\left\{ N \right\}}}~{{||f||}_{\Phi ,~{{S}_{p}}}}.$

  5. Let $\delta \le {\delta }'$, then by (12), it is clear that

${{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ =\underset{\left| h \right|\le \delta }{\mathop{~\sup }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}} \\ \le \underset{\left| h \right|\le {\delta }'}{\mathop{~\sup }}\,||\Delta_{h}^{N}{{\left( f \right)}||_{\Phi ,~{{S}_{p}}}}={{\omega }_{N}}{{\left( f,{\delta }' \right)}_{\Phi ,~{{S}_{p}}}}$

  6. Noting that $\gamma \delta \le \delta $, then

${{\omega }_{N}}{{\left( f,\gamma \delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ \le {{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ \le {{\left( 1+\gamma  \right)}^{k}}{{\omega }_{N}}{{\left( f,\delta  \right)}_{\Phi ,~{{S}_{p}}}}.$

2. Convolutional Neural Networks

Deep learning is very important because of its ability to cope with data that is large in both size and variety. One of the most common deep neural network architectures in deep learning is the CNN.

Since the 1950s, researchers have been working on ways to understand and process visual data, a field that came to be called computer vision.

CNNs were developed in the 1980s; at first they were used only to identify handwritten digits and to read zip and PIN codes, etc. In 2012 there was a qualitative leap in the development of convolutional neural networks, when Alex Krizhevsky's network was applied to a very large image dataset. Artificial neural networks have since grown into deep neural networks and deep convolutional neural networks [35-42].

They are especially designed for processing pixel data and are used in image recognition and processing; applications of CNNs include recognition of pictures and videos, image classification, medical image analysis and language processing [43]. In 2018, Zhou [44] provided a family of new deep structured neural networks, deep distributed convolutional neural networks, showed that they have the same order of computational complexity as deep convolutional neural networks, and demonstrated their universality of approximation. Fang et al. [39] considered a family of deep convolutional neural networks on the unit sphere ${{\text{S}}^{\text{d}-1}}$ of ${{\mathbb{R}}^{\text{d}}}$ and estimated their degree of approximation in comparison with classical methods.

Now, we introduce general preliminaries about CNNs, beginning with the most prominent characteristic of any CNN, namely the convolution that the network imposes. In mathematics, convolution is a linear operation on two functions, as follows:

 $\left( {{f}_{1}}\circ {{f}_{2}} \right)\left( x \right)=\sum {{f}_{1}}\left( t \right)~{{f}_{2}}\left( x+t \right)$                (13)
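As a concrete illustration of Eq. (13), the following sketch evaluates the sum literally for two finitely supported sequences and compares it with the ordinary discrete convolution (`np.convolve`), which uses ${{f}_{2}}\left( x-t \right)$ instead of ${{f}_{2}}\left( x+t \right)$ and is the operation realized by the convolutional matrices below; the helper name `conv13` is ours and purely illustrative.

```python
import numpy as np

def conv13(f1, f2):
    """Discrete form of Eq. (13): (f1 ∘ f2)(x) = sum_t f1(t) f2(x + t),
    for finitely supported sequences indexed from 0."""
    n1, n2 = len(f1), len(f2)
    out = np.zeros(n2)                      # x ranges over the support of f2
    for x in range(n2):
        for t in range(n1):
            if x + t < n2:
                out[x] += f1[t] * f2[x + t]
    return out

w = np.array([1.0, -2.0, 1.0])              # a small filter mask
v = np.sin(np.linspace(0, np.pi, 16))
print(conv13(w, v))
# Replacing f2(x + t) by f2(x - t) gives the ordinary convolution,
# which is what the convolutional matrix below realizes:
print(np.convolve(w, v))
```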

The following description shows the convolution procedure among the matrices that constitute a CNN. For any $J\in \mathbb{N}$, the depth of the network, a sequence ${w^{\left( j \right)}},~j=1,\ldots ,J$, of filter masks supported in $\left\{ 0,~1,\ldots ,~{{s}^{\left( j \right)}} \right\}$ is given, where ${{s}^{\left( j \right)}}\in \mathbb{N}$ is the filter length, so each mask has ${{s}^{\left( j \right)}}+1$ free parameters. The convolutional filter masks are thus

$\left\{ {w^{\left( j \right)}}:\mathbb{Z}\to \mathbb{R} \right\}_{j=1}^{J}$               (14)

Let $s\in \mathbb{N}$. Any filter mask $w=\left( {w_{k}} \right)_{k=-\infty }^{\infty }$ supported in $\left\{ 0,~1,~\cdots ,~s \right\}$ satisfies ${w_{k}}=0$ if $k\notin \left\{ 0,1,\cdots ,s \right\}$. It induces a convolutional matrix ${{\Im }^w}=\left( {w_{i-k}} \right)\in {{\mathbb{R}}^{\left( D+s \right)\times D}}$, for $i=1,\cdots ,D+s$, $k=1,\cdots ,D$, and $D\in \mathbb{N}$, that is given by

${{\Im }^w}=\left[ \begin{matrix}   {w_{0}} & 0 & 0 & 0 & \ldots  & 0  \\   {w_{1}} & {w_{0}} & 0 & 0 & \ldots  & 0  \\   \vdots  & \ddots  & \ddots  & \ddots  & \ddots  & \vdots   \\   {w_{s}} & {w_{s-1}} & \ldots  & {w_{0}} & 0 & 0  \\   0 & {w_{s}} & \ldots  & {w_{1}} & {w_{0}} & 0  \\   \vdots  & \ddots  & \ldots  & \ddots  & \ddots  & \vdots   \\   \ldots  & \ldots  & \ddots  & \ldots  & \ldots  & {w_{0}}  \\   \ldots  & \ldots  & \ldots  & \ddots  & \ldots  & {w_{1}}  \\   \vdots  & \ddots  & \ddots  & \ldots  & \ldots  & \vdots   \\   0 & \ldots  & \ldots  & 0 & {w_{s}} & {w_{s-1}}  \\   0 & \ldots  & \ldots  & \ldots  & 0 & {w_{s}}  \\ \end{matrix} \right]$
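A short sketch of how such a matrix can be generated from a filter mask (0-based indices; the helper name `conv_matrix` is ours): applying ${{\Im }^{w}}$ to a vector agrees with the full discrete convolution of the mask with that vector.

```python
import numpy as np

def conv_matrix(w, D):
    """Convolutional (Toeplitz) matrix of size (D + s) x D with entries
    w_{i-k}, where w = (w_0, ..., w_s) is the filter mask."""
    s = len(w) - 1
    T = np.zeros((D + s, D))
    for i in range(D + s):
        for k in range(D):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return T

w = np.array([1.0, 0.5, -1.0])   # filter mask with s = 2
v = np.random.randn(8)           # input of dimension D = 8
T = conv_matrix(w, D=8)
# Applying the matrix is exactly (full) convolution with the mask w.
print(np.allclose(T @ v, np.convolve(w, v)))   # True
```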

In their papers [39, 44, 45], Zhou and his team defined the layer-$j$ convolutional matrix as ${{\Im }^{\left( j \right)}}={{\Im }^{{w^{\left( j \right)}}}}$, with $D={{d}_{j-1}}$ and $s={{s}^{\left( j \right)}}$.

A deep CNN (DCNN) is given by compositions of the activation maps ${{\sigma }_{\Im ,b}}:{{\mathbb{R}}^{{{d}_{j-1}}}}\to {{\mathbb{R}}^{{{d}_{j-1}}+{{s}^{\left( j \right)}}}}$ determined by the matrix $\Im $ and the bias $b\in {{\mathbb{R}}^{{{d}_{j-1}}+{{s}^{\left( j \right)}}}}$.

The DCNN ${{h}^{\left( j \right)}}:~{{\mathbb{R}}^{d}}\to {{\mathbb{R}}^{{{d}_{j}}}}$, $j=1,\cdots ,J$, with ${{d}_{j}}=d+js$ and ${{s}^{\left( j \right)}}\equiv s$, $j=1,~\cdots ,~J$, takes the form

${{h}^{\left( j \right)}}={{\sigma }_{{{\Im }^{\left( j \right)}},{{b}^{\left( j \right)}}}}~\circ \cdots ~\circ {{\sigma }_{{{\Im }^{\left( 1 \right)}},{{b}^{\left( 1 \right)}}}}\left( x \right)$                     (15)

where, ${{\sigma }_{\Im ,b}}\left( u \right)=\sigma \left( \Im u-b \right),~u\in {{\mathbb{R}}^{~{{d}_{j-1}}}},b\in {{\mathbb{R}}^{{{d}_{j-1}}+{{s}^{\left( j \right)}}}}$.

The real valued activation function $\sigma ~$is given by

$\sigma \left( u \right)=\text{max}\left\{ u,~0 \right\},~~u\in \mathbb{R}$                    (16)
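The layer maps (15) with the ReLU activation (16) can be sketched as follows. This is an illustrative transcription only (no downsampling yet, a uniform filter length, and zero biases chosen for the example), not the construction used in the proofs.

```python
import numpy as np

def conv_matrix(w, D):
    """(D + s) x D Toeplitz matrix with entries w_{i-k} for a mask w of length s + 1."""
    s = len(w) - 1
    return np.array([[w[i - k] if 0 <= i - k <= s else 0.0
                      for k in range(D)] for i in range(D + s)])

def relu(u):
    """ReLU activation of Eq. (16), applied componentwise."""
    return np.maximum(u, 0.0)

def dcnn(x, masks, biases):
    """Iterate h^{(j)} = sigma(T^{(j)} h^{(j-1)} - b^{(j)}), h^{(0)} = x, cf. Eq. (15).
    Widths grow by the filter length s at every layer: d_j = d_{j-1} + s."""
    h = x
    for w, b in zip(masks, biases):
        h = relu(conv_matrix(w, len(h)) @ h - b)
    return h

# Depth J = 3, input dimension d = 5, uniform filter length s = 2.
x = np.random.randn(5)
masks = [np.array([1.0, -1.0, 0.5])] * 3
biases = [np.zeros(5 + 2 * (j + 1)) for j in range(3)]
print(dcnn(x, masks, biases).shape)   # (11,) = d + J*s
```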

3. Construction of CNNs

For any input $x=\left( {{x}_{i}} \right)_{i=1}^{d}$, define the operator ${{\mathcal{L}}_{t}}:{{L}_{\Phi ,{{S}_{p}}}}\left( {{I}^{d}} \right)\to {{\mathbb{R}}^{+}}$ as follows

${{\mathcal{L}}_{t}}\left( g\left( u \right) \right)=\mathop{\sum }_{i=1}^{d}g\left( {{x}_{i}} \right){{\delta }_{i}}\left( u \right)$      (17)

where, $g\in {{L}_{\Phi ,{{S}_{p}}}},u\in I$ and ${{\delta }_{i}}~$is given by

${{\delta }_{i,g}}\left( u \right)=\mathop{\sum }_{i=2}^{d}{{\left( -1 \right)}^{N-i}}\left( \begin{matrix}   N  \\   i  \\ \end{matrix} \right)g\left( {{t}_{i}} \right)\sigma \left( {{t}_{i}}-u \right)$         (18)

where, $\sigma :~I\to I~$is the ReLU activation function defined by (16).

In fact, Eq. (17) is a convolutional operator between the function $g$ and the activation function $\sigma $. Moreover, $x=\left( {{x}_{i}} \right)_{i=1}^{d}$ may be not only the input but also any neuron acting as an input to the network. This amount of convolution implies high dimensions of the hidden layers when the outputs of neuron clusters are combined. The purpose of pooling layers is to reduce the large widths that may cause overfitting. Another procedure that reduces the dimensions, called downsampling, is discussed in the following section.

4. Downsampling Deep Convolutional Neural Network

Zhou [38] introduced the downsampling operator for the purpose of reducing widths. The $\ell $ downsampling operations are placed at layers $\mathcal{J}=\left\{ {{J}_{k}} \right\}_{k=1}^{\ell }$ with $1<{{J}_{1}}\le \ldots \le {{J}_{\ell }}=J$; his concept of downsampling operators is induced from wavelets [46, 47]. In this section, we introduce the effect of the downsampling operation on the downsampling deep convolutional neural network (DDCNN) to avoid the large widths in (15) that occur with pooling layers. The downsampling operation is defined below.

Definition 2. Let $m$ be a scaling parameter with $m\le D$. The map ${{\mathfrak{O}}_{m}}:{{\mathbb{R}}^{D}}\to {{\mathbb{R}}^{\left[ D/m \right]}}$ is called the downsampling operator and is given by

${{\mathfrak{O}}_{m}}\left( u \right)=\left( {{u}_{im}} \right)_{i=1}^{\left[ D/m \right]},~u\in {{\mathbb{R}}^{D}}$               (19)

where $\left[ u \right]$ denotes the integer part of $u\in {{\mathbb{R}}^{+}}$.
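A one-line sketch of the downsampling operator (19): it keeps every $m$-th entry of the input (1-based indices, as in the text; the helper name `downsample` is ours).

```python
import numpy as np

def downsample(u, m):
    """Downsampling operator O_m of Eq. (19): O_m(u) = (u_{i m})_{i=1}^{[D/m]},
    i.e. entries m, 2m, ..., [D/m]*m of u are retained (1-based indexing)."""
    D = len(u)
    return np.array([u[i * m - 1] for i in range(1, D // m + 1)])

u = np.arange(1, 11, dtype=float)   # D = 10
print(downsample(u, m=3))           # entries u_3, u_6, u_9 -> [3. 6. 9.]
```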

Definition 3. A downsampling DCNN with $\ell $ downsamplings at layers $\mathcal{J}$, filter lengths $\left\{ {{s}^{\left( j \right)}} \right\}_{j=1}^{J}$ and widths $\left\{ {{d}_{j}} \right\}_{j=0}^{J}$ is defined iteratively for $k=1,\cdots ,\ell $, beginning with ${{d}_{0}}=d$, by

${{d}_{j}}=\left\{ \begin{matrix}  {{d}_{j-1}}+{{s}^{\left( j \right)}},~~~~~~~~~~~~~~~~~~\text{if}~{{J}_{k-1}}<j<{{J}_{k}}  \\   \left[ \left( {{d}_{j-1}}+{{s}^{\left( j \right)}} \right)/{{d}_{{{J}_{k-1}}}} \right],~~~~~~~~~~\text{ }\!\!~\!\!\text{ if}~j={{J}_{k}}~  \\\end{matrix} \right.$                 (20)

Define the sequence of function vectors $\left\{ {{h}^{\left( j \right)}}\left( x \right):~{{\mathbb{R}}^{d}}\to {{\mathbb{R}}^{{{d}_{j}}}} \right\}_{j=1}^{J}$ iteratively by ${{h}^{\left( 0 \right)}}\left( x \right)=x$ and, for $k=1,\cdots ,\ell $,

${{h}^{\left( j \right)}}\left( x \right)=\left\{ \begin{matrix}  {{\sigma }_{{{\Im }^{\left( j \right)}},{{b}^{\left( j \right)}}}}\left( {{h}^{\left( j-1 \right)}}\left( x \right) \right),~~~~~~~~\text{if }\!\!~\!\!\text{ }{{J}_{k-1}}<j<{{J}_{k}}  \\ {{\mathfrak{O}}_{{{d}_{{{J}_{k-1}}}}~}}\circ {{\sigma }_{{{\Im }^{\left( j \right)}},{{b}^{\left( j \right)}}}}~\left( {{h}^{\left( j-1 \right)}}\left( x \right) \right),~~\text{if}~j={{J}_{k}}  \\ \end{matrix} \right.$     (21)

We restrict the bias vectors ${{b}^{\left( j \right)}}\in {{\mathbb{R}}^{{{d}_{j-1}}+{{s}^{\left( j \right)}}}}$ to satisfy

 $b_{{{s}^{\left( j \right)}}+1}^{\left( j \right)}=b_{{{s}^{\left( j \right)}}~+2}^{\left( j \right)}=\cdots =b_{{{d}_{j-1}}}^{\left( j \right)}$                   (22)
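The width recursion (20) can be traced with a few lines of code. The sketch below is ours: it assumes a uniform filter length, implements the integer part by floor division, takes ${{J}_{0}}=0$ so that ${{d}_{{{J}_{0}}}}={{d}_{0}}$ (an assumption on our part), and uses arbitrary parameter values.

```python
def widths(d0, s, J, downsample_layers):
    """Width sequence d_j of Eq. (20): widths grow by s between downsampling
    layers and are divided (integer part) by d_{J_{k-1}} at each layer j = J_k."""
    d = [d0]
    last_ds_width = d0            # d_{J_{k-1}}, initialised with d_0 (J_0 = 0 assumed)
    for j in range(1, J + 1):
        grown = d[j - 1] + s
        if j in downsample_layers:
            d.append(grown // last_ds_width)
            last_ds_width = d[j]
        else:
            d.append(grown)
    return d

# d_0 = 4, uniform filter length s = 2, depth J = 6, downsampling at layers 4 and 6.
print(widths(4, 2, 6, {4, 6}))    # [4, 6, 8, 10, 3, 5, 2]
```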

Definition 4. A uniform DDCNN is defined with uniform filter lengths: for each $k\in \left\{ 1,\cdots ,~\ell  \right\}$,

$S=\left\{ {{s}^{\left[ k \right]}}\in \mathbb{N} \right\}~_{k=1}^{\ell }$ if ${{s}^{\left( {{J}_{k-1}}+1 \right)}}=\cdots ={{s}^{\left( {{J}_{k}} \right)}}={{s}^{\left[ k \right]}}$                    (23)

Define the degree of best approximation of a function $g\in {{L}_{\Phi ,{{S}_{p}}}}\left( {{I}^{d}} \right)$ by DCNNs of the form (17) as

${{E}_{m}}{{\left( g \right)}_{\Phi ,~{{S}_{p}}}}=\inf ||g-{{L}_{t}}{{\left( g \right)}||_{\Phi ,~{{S}_{p}}}}$                 (24)

5. Auxiliary Lemmas

The following lemma highlights the role of the convolutions; Zhou [44] concluded that ${{W}^{\left( k \right)}}$ equals ${{\Im }^{\left( {{J}_{k}} \right)}}~\cdots ~{{\Im }^{\left( {{J}_{k-1}}+1 \right)}}$.

Lemma 3

${{\Im }^{\left( {{J}_{k}},{{J}_{k-1}}+1 \right)}}={{\Im }^{\left( {{J}_{k}} \right)}}\cdots {{\Im }^{\left( {{J}_{k-1}}+2 \right)}}{{\Im }^{\left( {{J}_{k-1}}+1 \right)}}$ where $k=1,\cdots ,\ell $.

The construction of the filter masks was studied by Zhou [48] in the following lemma, in which the convolution ${w^{\left( J \right)}}*\cdots *{w^{\left( 1 \right)}}$ results from factorizing $W$ as follows.

Lemma 4

For $s\ge 2$ and $M\ge 0$, and any sequence $W=\left( {w_{k}} \right)_{k=-\infty }^{\infty }$ supported in $\left\{ 0,~\cdots ,~M \right\}$, there is a finite sequence of filter masks $\left\{ {w^{\left( j \right)}} \right\}_{j=1}^{J}$ supported in $\left\{ 0,~\cdots ,~s \right\}$, where $J<\frac{M}{s-1}+1$, that satisfies

$W={{w}^{\left( J \right)}}*~\cdots *{{w}^{\left( 2 \right)}}*{{w}^{\left( 1 \right)}}$                (25)
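Lemma 4 is a factorization result whose algorithmic details are in [48]. The sketch below does not reproduce that factorization; it only illustrates the easy direction of Eq. (25), namely that convolving $J$ short masks of length $s$ produces a mask supported in $\left\{ 0,\cdots ,Js \right\}$, consistent with the stated depth bound. The example masks are arbitrary.

```python
import numpy as np
from functools import reduce

s = 2
factors = [np.array([1.0, -0.5, 0.25]),
           np.array([0.5, 1.0, -1.0]),
           np.array([2.0, 0.0, 1.0])]          # J = 3 masks, each of length s = 2
W = reduce(np.convolve, factors)               # supported in {0, ..., J*s} = {0, ..., 6}
print(len(W) - 1)                              # M = 6, and J = 3 < M/(s-1) + 1 = 7
```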

To fit the data to the space under study, the bias vectors are selected in an appropriate form in accordance with the convolutional matrices ${{\Im }^w}$. For $h:~\Omega \to {{\mathbb{R}}^{D}}$, set

${{||h}||_{\Phi ,~{{S}_{p}}}}=\underset{\lambda >0}{\mathop{\text{inf}}}\,~\frac{1}{\lambda }{{\left( I_{\Phi }^{2}\left( \lambda h \right)+1 \right)}^{\frac{1}{p}}}$                     (26)

Also, we need to denote ${{||w||}_{\Phi ,~{{S}_{p}}}}=\underset{\lambda >0}{\mathop{\inf }}\,~\frac{1}{\lambda }{{\left( I_{\Phi }^{2}\left( \lambda {w_{k}} \right)+1 \right)}^{\frac{1}{p}}}$.

With the convolutional matrices ${{\Im }^{\left( j \right)}}={{\Im }^{{w^{\left( j \right)}}}}$ defined as above, we then have, for any $h:\Omega \to {{\mathbb{R}}^{{{d}_{j-1}}}}$,

${{||\Im }^{\left( j \right)}}{{h}||_{\Phi ,~{{S}_{p}}}}\le {{||w||}_{\Phi ,~{{S}_{p}}}}{{||h||}_{\Phi ,~{{S}_{p}}}}$                  (27)

Following our previous study, AL-Janabi and Almurieb [13], we use some ideas from DCNNs without downsampling and select biases small enough that the vectors ${{\Im }^{\left( j \right)}}{{h}^{\left( j-1 \right)}}\left( x \right)-{{b}^{\left( j \right)}}$ have non-negative entries.

Lemma 5

For any $k\in \left\{ 1,\cdots ,\ell  \right\}$ there exist $B\in {{\mathbb{R}}^{+}}$ and $\hat{B}\in \left[ -B,B \right]$ such that $||{{h}^{\left( {{J}_{k-1}} \right)}}-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}}||_{\Phi ,~{{S}_{p}}}\le B$.

Proof

Set

${{b}^{\left( {{J}_{k-1}}+1 \right)}}=\hat{B}{{\Im }^{\left( {{J}_{k-1}}+1 \right)}}{{1}_{{{d}_{{{J}_{k-1}}}}}}-B||{{w}^{\left( {{J}_{k-1}}+1 \right)}}||_{\Phi ,~{{S}_{p}}}{{1}_{{{d}_{{{J}_{k-1}}+1}}}}$.

To prove $b_{{{s}^{\left( j \right)}}+1}^{\left( j \right)}=b_{{{s}^{\left( j \right)}}~+2}^{\left( j \right)}=\cdots =b_{{{d}_{j-1}}}^{\left( j \right)}$, take

${{b}^{\left( j \right)}}=B\left( \Pi _{p={{J}_{k-1}}+1}^{j-1}||{{w}^{\left( p \right)}}||_{\Phi ,~{{S}_{p}}} \right){{\Im }^{\left( j \right)}}{{1}_{{{d}_{j-1}}}}-B\left( \Pi _{p={{J}_{k-1}}+1}^{j-1}||{{w}^{\left( p \right)}}||_{\Phi ,~{{S}_{p}}} \right){{1}_{{{d}_{j-1}}+{{s}^{\left( j \right)}}}}$

for $j={{J}_{k-1}}+2,\cdots ,{{J}_{k}}-1$. Then for ${{J}_{k-1}}<j<{{J}_{k}}$ and $i={{s}^{\left( j \right)}}+1,\cdots ,{{d}_{j-1}}$, we have ${{\left( {{\Im }^{\left( j \right)}}{{1}_{{{d}_{j-1}}}} \right)}_{i}}=\underset{p=1}{\overset{{{d}_{j-1}}}{\mathop \sum }}\,{{\left( {{\Im }^{\left( j \right)}} \right)}_{i,p}}=\underset{p=1}{\overset{{{d}_{j-1}}}{\mathop \sum }}\,{{\left( {{w}^{\left( j \right)}} \right)}_{i-p}}$.

Notice that ${{w}^{\left( j \right)}}$ is supported in $\left\{ 0,\cdots ~,~{{s}^{\left( j \right)}} \right\}$. For $p\in \left( -\infty ,0 \right]\cup \left[ {{d}_{j-1}}+1,\infty  \right)$ we have $i-p\in \left[ {{s}^{\left( j \right)}}+1,\infty  \right)$ or $i-p<0$, which implies that ${{w}^{\left( j \right)}}_{i-p}=0$. So

${{\left( {{\Im }^{\left( j \right)}}{{1}_{{{d}_{j-1}}}} \right)}_{i}}=\underset{p=-\infty }{\overset{\infty }{\mathop \sum }}\,{{w}^{\left( j \right)}}_{i-p}=\underset{p=-\infty }{\overset{\infty }{\mathop \sum }}\,{{w}_{p}}^{\left( j \right)}\forall i={{s}^{\left( j \right)}}+1,\cdots ,{{d}_{j-1}}$

Now we prove

  ${{h}^{\left( j \right)}}\left( x \right)={{\Im }^{\left( j \right)}}\ldots {{\Im }^{\left( {{J}_{k-1}}+1 \right)}}\left( {{h}^{{{J}_{k-1}}}}\left( x \right)-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}} \right)+ \ B \left( \underset{p={{J}_{k-1}}+1}{\overset{j}{\mathop \prod }}\,{{||w}^{\left( p \right)}}||_{\Phi ,~{{s}_{p}}} \right)~{{1}_{{{d}_{j}}}}$

By induction, for $j={{J}_{k-1}}+1$, we have

${{\Im }^{\left( {{J}_{k-1}}+1 \right)}}{{h}^{\left( {{J}_{k-1}} \right)}}\left( x \right)-{{b}^{\left( {{J}_{k-1}}+1 \right)}}={{\Im }^{\left( {{J}_{k-1}}+1 \right)}}\left( {{h}^{\left( {{J}_{k-1}} \right)}}\left( x \right)-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}} \right)+B||{{w}^{\left( {{J}_{k-1}}+1 \right)}}||_{\Phi ,~{{S}_{p}}}{{1}_{{{d}_{{{J}_{k-1}}+1}}}}$

Since the ReLU activation $\sigma $ of (16) coincides with the identity function on $\left[ 0,~\infty  \right)$, we obtain

${{h}^{\left( {{J}_{k-1}}+1 \right)}}\left( x \right)={{\Im }^{\left( {{J}_{k-1}}+1 \right)}}\left( {{h}^{\left( {{J}_{k-1}} \right)}}\left( x \right)-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}} \right)+B||{{w}^{\left( {{J}_{k-1}}+1 \right)}}||_{\Phi ,~{{S}_{p}}}{{1}_{{{d}_{{{J}_{k-1}}+1}}}}$

so the claim holds for $j={{J}_{k-1}}+1$.

Suppose that it holds for ${{J}_{k-1}}+1\le j\le {{J}_{k}}-1$; then ${{d}_{j-1}}+{{s}^{\left( j \right)}}={{d}_{j}}$.

Through the induction hypothesis and selection (*) of the bias vector, we have

${{h}^{\left( j \right)}}\left( x \right)=\sigma \left( {{\Im }^{\left( j \right)}}{{\Im }^{\left( j-1 \right)}}\cdots {{\Im }^{\left( {{J}_{k-1}}+1 \right)}}\left( {{h}^{\left( {{J}_{k-1}} \right)}}\left( x \right)-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}} \right)+B\left( \underset{p={{J}_{k-1}}+1}{\overset{j}{\mathop \prod }}\,||{{w}^{\left( p \right)}}||_{\Phi ,~{{S}_{p}}} \right){{1}_{{{d}_{j}}}} \right)$

Now to prove

$h^{\left(J_k\right)}(x)=M^{(k)} \Im^{\left(J_k J_{k-1}+1\right)}\left(h^{\left(J_{k-1}\right)}(x)-\widehat{B} \mathbf{1}_{d_{J_{k-1}}}\right) +B\left( \underset{p={{J}_{k-1}}+1}{\overset{{{J}_{k}}}{\mathop \prod }}\,{{||w}^{\left( p \right)}}||_{\Phi ,~{{s}_{p}}} \right){{1}_{{{d}_{{{J}_{k}}}}}}$

When

${{b}^{\left( j \right)}}=B\left( \Pi _{p={{J}_{k-1}}+1}^{j-1}{{||w}^{\left( p \right)}}||_{\Phi ,~{{s}_{p}}} \right){{\Im }^{\left( j \right)}}{{1}_{{{d}_{j-1}}}}-B$,

Then

${{\Im }^{\left( {{J}_{k}} \right)}}{{h}^{\left( {{J}_{k}}-1 \right)}}\left( x \right)-{{b}^{\left( {{J}_{k}} \right)}}={{\Im }^{\left( {{J}_{k}} \right)}}\cdots {{\Im }^{\left( {{J}_{k-1}}+1 \right)}}\left( {{h}^{\left( {{J}_{k-1}} \right)}}\left( x \right)-\hat{B}{{1}_{{{d}_{{{J}_{k-1}}}}}} \right)+B\left( \underset{p={{J}_{k-1}}+1}{\overset{{{J}_{k}}}{\mathop \prod }}\,||{{w}^{\left( p \right)}}||_{\Phi ,~{{S}_{p}}} \right)~{{1}_{{{d}_{{{J}_{k}}}}}}$

6. Approximation Abilities of CNNs in Orlicz Spaces

The main purpose of this paper is to study the degree of approximation of functions from Orlicz spaces by convolutional neural networks in terms of the Orlicz modulus of smoothness. The following theorem gives the direct estimate:

Theorem 1 (Direct theorem)

For any $g\in {{L}_{\Phi ,~{{S}_{p}}}}\left( {{I}^{d}} \right)$, there exists a CNN of the form ${{L}_{t}}\left( g \right)\left( u \right)=\underset{i=2}{\overset{2N+2}{\mathop \sum }}\,g\left( {{t}_{i}} \right){{\delta }_{i}}\left( u \right)$, $u\in \left[ {{t}_{i-1}},{{t}_{i}} \right]$, where $\left| {{t}_{i}}-{{t}_{i-1}} \right|\sim\frac{1}{n}$ and ${{\delta }_{i}}\left( u \right)=\underset{i=1}{\overset{N}{\mathop \sum }}\,\left( \begin{matrix}   N  \\   i  \\ \end{matrix} \right){{\left( -1 \right)}^{N-i}}\sigma \left( {{t}_{i}}-u \right)$, such that $||{{L}_{t}}\left( g \right)-g||_{\Phi ,~{{S}_{p}}}^{p}\le \frac{c}{n}{{\omega }_{N}}{{\left( g,\frac{1}{n} \right)}_{\Phi ,~{{S}_{p}}}}$.

Proof

By (17), (12) and Lemma 2, we get

$||{{L}_{t}}\left( g \right)-g||_{\Phi ,~{{S}_{p}}}^{p} \\ \le ||\underset{i=2}{\overset{2N+2}{\mathop \sum }}\,\left[ {{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( {{t}_{i}} \right)\sigma \left( {{t}_{i}}-u \right)-{{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( u-\frac{i}{n} \right)+{{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( u-\frac{i}{n} \right)-g\left( u \right) \right]||_{\Phi ,~{{S}_{p}}}^{p} \\ \le c\left( ||\underset{i=2}{\overset{2N+2}{\mathop \sum }}\,{{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( {{t}_{i}} \right)\sigma \left( {{t}_{i}}-u \right)-{{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( u-\frac{i}{n} \right)||_{\Phi ,~{{S}_{p}}}^{p}+||{{\left( -1 \right)}^{N-i}}\binom{N}{i}g\left( u-\frac{i}{n} \right)-g\left( u \right)||_{\Phi ,~{{S}_{p}}}^{p} \right) \\ \le c\underset{i=2}{\overset{2N+2}{\mathop \sum }}\,||{{\left( -1 \right)}^{N-i}}\binom{N}{i}\left( \frac{1}{n}g\left( {{t}_{i}} \right)-g\left( u-\frac{i}{n} \right) \right)||_{\Phi ,~{{S}_{p}}}^{p}+{{\omega }_{N}}{{\left( g,\frac{1}{n} \right)}_{\Phi ,~{{S}_{p}}}} \\ \le \frac{c}{n}{{\omega }_{N}}{{\left( g,\frac{1}{n} \right)}_{\Phi ,~{{S}_{p}}}}$

In the next, last theorem, we prove the inverse theorem, which gives a lower bound for the degree of approximation; together with the direct theorem, it restricts the degree of approximation by the modulus of smoothness from below and above.

Theorem 2 (Inverse Theorem)

For any $g\in {{L}_{\Phi ,~{{S}_{p}}}}\left( {{I}^{d}} \right)$, there exists ${{L}_{t}}\left( g \right)$ of the form (17), such that ${{\omega }_{N}}{{\left( g,\delta  \right)}_{\Phi ,~{{S}_{p}}}}\le c\left( p,N \right)\left( {{||g||}_{\Phi ,~{{S}_{p}}}}+{{E}_{N}}{{\left( g \right)}_{\Phi ,~{{S}_{p}}}} \right)$.

Proof

Let $b=\underset{1\le i\le 2N+3}{\mathop{\max }}\,\left\{ i:{{2}^{i}}<n \right\}$, and write $g\left( {{t}_{2N+3}} \right)-g\left( {{t}_{1}} \right)=\left( g\left( {{t}_{2N+3}} \right)-g\left( {{t}_{{{2}^{b}}}} \right) \right)+\left( g\left( {{t}_{{{2}^{b}}}} \right)-g\left( {{t}_{{{2}^{b-1}}}} \right) \right)+\ldots +\left( g\left( {{t}_{2}} \right)-g\left( {{t}_{1}} \right) \right)$, for any $m<2N+3$ and $\delta \sim\frac{1}{n}$.

Suppose that $||g\left( {{t}_{2N+3}} \right)-g\left( {{t}_{m}} \right)||_{\Phi ,~{{S}_{p}}}\le c{{E}_{m}}{{\left( g \right)}_{\Phi ,~{{S}_{p}}}}$; then by (24), Theorem 1 and Lemma 2 we get

${{\omega }_{N}}{{\left( g,\delta  \right)}_{\Phi ,~{{S}_{p}}}} \\ \le c\left( p \right)\left[ {{\omega }_{N}}{{\left( g-{{L}_{t}}\left( g \right),\delta  \right)}_{\Phi ,~{{S}_{p}}}}+{{\omega }_{N}}{{\left( {{L}_{t}}\left( g \right),\delta  \right)}_{\Phi ,~{{S}_{p}}}} \right] \\ \le c\left( p \right)\left[ c\left( N \right)||g-{{L}_{t}}\left( g \right)||_{\Phi ,~{{S}_{p}}}+{{\omega }_{N}}{{\left( {{L}_{t}}\left( g \right),\delta  \right)}_{\Phi ,~{{S}_{p}}}} \right] \\ \le c\left( p \right)\left[ c\left( N \right){{\omega }_{N}}{{\left( g,\delta  \right)}_{\Phi ,~{{S}_{p}}}}+c\left( N \right){{E}_{m}}{{\left( g \right)}_{\Phi ,~{{S}_{p}}}} \right] \\ \le c\left( p,N \right)\left[ ||g||_{\Phi ,~{{S}_{p}}}+{{E}_{m}}{{\left( g \right)}_{\Phi ,~{{S}_{p}}}} \right].$

7. Conclusions

A quasi-Orlicz space is a very large space of measurable functions, generated by a convex Young function, that generalizes the Orlicz space. In quasi-Orlicz spaces, we define a special class of CNNs with a downsampling operator to reduce dimensions and approximate any quasi-Orlicz function with a good degree of approximation. Some previous work focused on approximating continuous functions by CNNs, constructed in a way that guarantees universality with an ordinary degree of approximation depending on the parameters of the CNN. This paper establishes universality of CNNs in a more extended space. On the other hand, we define a modulus of smoothness in the quasi-Orlicz space with many properties that are useful for approximation purposes and that improve the estimated degree of approximation. The degree of approximation is then expressed through the modulus of smoothness, with upper and lower bounds. Hence, we prove universality of CNNs in quasi-Orlicz spaces with a degree of approximation that depends on the smoothness of the function. In future work, it would be interesting to implement such CNNs with downsampling operators in practice, in fields with complicated data.

  References

[1] Orlicz, W. (1932). Über eine gewisse Klasse von Räumen vom Typus B. Bulletin of the Polish Academy of Sciences Technical Sciences, 8(9): 207-220.

[2] Nakano, H. (1951). Topology and Linear Topological Spaces. Maruzen Company.

[3] Morse, M., Transue, W. (1950). Functionals F bilinear over the product A×B of two pseudo-normed vector spaces: II. Admissible spaces A. Annals of Mathematics, 51(3): 576-614.

[4] Luxemburg, W.A.J. (1955). Banach Function Spaces.

[5] Krasnosel'skii, M.A., Rutickii, Y.B. (1961). Convex Functions and Orlicz Spaces (translation). Noordhoff, Groningen.

[6] Luxemburg, W., Zaanen, A. (1956). Conjugate spaces of Orlicz spaces. In Indagationes Mathematicae (Proceedings), 59: 217-228.

[7] Cui, Y., Duan, L., Hudzik, H., Wisła, M. (2008). Basic theory of p-Amemiya norm in Orlicz spaces (1≤ p≤∞): Extreme points and rotundity in Orlicz spaces endowed with these norms. Nonlinear Analysis: Theory, Methods & Applications, 69(5-6): 1796-1816. https://doi.org/10.1016/j.na.2007.07.026

[8] Wisła, M. (2015). Geometric properties of Orlicz spaces equipped with p-Amemiya norms−results and open questions. Commentationes Mathematicae, 55(2): 183-209. http://doi.org/10.14708/cm.v55i2.1104

[9] Wisła, M. (2020). Orlicz spaces equipped with s-norms. Journal of Mathematical Analysis and Applications, 483(2): 123659. https://doi.org/10.1016/j.jmaa.2019.123659

[10] Shidlich, A.L., Chaichenko, S.O. (2015). Approximative properties of diagonal operators in Orlicz spaces. Numerical Functional Analysis and Optimization, 36(10): 1339-1352. https://doi.org/10.1080/01630563.2015.1066387

[11] Abdullayev, F., Chaichenko, S., Kyzy, M.I., Shidlich, A. (2020). Direct and inverse approximation theorems in the weighted Orlicz-type spaces with a variable exponent. Turkish Journal of Mathematics, 44(1): 284-299.

[12] Cui, Y., Wisła, M. (2021). Monotonicity properties and solvability of dominated best approximation problem in Orlicz spaces equipped with s-norms. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 115: 1-22. https://doi.org/10.1007/s13398-021-01139-8

[13] AL-Janabi, A.M., Almurieb, H.A. (2024). Convolutional neural networks approximation in quasi-Orlicz spaces on sphers. Journal of Kufa for Mathematics and Computer, 11(1): 1-5. https://doi.org/10.3906/mat-1911-3

[14] Jackson, D. (1911). Über die Genauigkeit der Annäherung stetiger Funktionen durch ganze rationale Funktionen gegebenen Grades und trigonometrische Summen gegebener Ordnung. Dieterich'schen Universität--Buchdruckerei.

[15] Bernstein, S.N. (1912). On the best approximation of continuous functions by polynomials of a given degree. Kharkov Mathematical Society, 2(13): 49-194.

[16] de la Valle-Poussin, C. (1919). Leçons sur l'approximation des fonctions d'une variable réelle.

[17] Zygmund, A. (1924). On the continuity module of the sum of the series conjugate to a Fourier series. Prace Matematyczno-Przyrodnicze. Matematyka. Fizyka, 33: 25-132.

[18] Butzer, P.L., Nessel, R.J. (1971). Fourier analysis and approximation, Vol. 1. Reviews in Group Representation Theory, Part A (Pure and Applied Mathematics Series, Vol. 7).

[19] DeVore, R.A., Lorentz, G.G. (1993). Constructive Approximation (vol. 303). Springer Science & Business Media.

[20] Dzyadyk, V.K., Shevchuk, I.A. (2008). Theory of uniform approximation of functions by polynomials. Walter de Gruyter. https://doi.org/10.1515/9783110208245.bm

[21] Timan, M.F. (2009). Approximation and Properties of Periodic Functions. Nauk. dumka, Kiev.

[22] Sterlin, M.D. (1972). Exact constants in inverse theorems in the theory of approximations. In Doklady Akademii Nauk, 202(3): 545-547.

[23] Ditzian, Z., Totik, V. (1987). Moduli of Smoothness. Springer Series in Computational Mathematics, Vol. 9. Springer-Verlag, New York.

[24] Musielak, H. (1985). On the τ-modulus of smoothness in generalized Orlicz spaces. Commentationes Mathematicae, 25(2): 285-293.

[25] Popov, V. (1980). On the one-sided approximation on functions. Constructive Function, 465-468.

[26] Popov, V.A., Khristov, V.K. (1983). Averaged moduli of smoothness for functions of several variables and function spaces generated by them. Trudy Matematicheskogo Instituta imeni VA Steklova, 164: 136-141.

[27] Tsyganok, I. (1966). A generalization of a theorem of Jackson. Matematicheskii Sbornik, 113(2): 257-260.

[28] Ramazanov, A.R. (1984). On approximation by polynomials and rational functions in Orlicz spaces. Analysis Mathematica, 10(2): 117-132.

[29] Garidi, W. (1991). On approximation by polynomials in Orlicz spaces. Approximation Theory and Its Applications, 7(3): 97-110. https://doi.org/10.1007/BF02836460

[30] Israfilov, D.M., Akgün, R. (2006). Approximation in weighted Smirnov-Orlicz classes. Journal of Mathematics of Kyoto University, 46(4): 755-770. https://doi.org/10.1215/kjm/1250281602

[31] Akgün, R. (2012). Approximating polynomials for functions of weighted Smirnov-Orlicz Spaces. Journal of Function Spaces, 2012(1): 982360. https://doi.org/10.1155/2012/982360

[32] Chaichenko, S., Shidlich, A., Abdullayev, F. (2019). Direct and inverse approximation theorems of functions in the Orlicz type spaces $\boldsymbol{S}_{\mathrm{M}}$. Mathematica Slovaca, 69(6): 1367-1380. https://doi.org/10.1515/ms-2017-0314

[33] Shidlich, A.L., Chaichenko, S.O. (2014). Some extremal problems in the Orlicz spaces. Matematychni Studii, 42(1): 21-32.

[34] Shidlich, A.L., Chaichenko, S.O. (2015). Approximative properties of diagonal operators in Orlicz spaces. Numerical Functional Analysis and Optimization, 36(10): 1339-1352. https://doi.org/10.1080/01630563.2015.1066387

[35] LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553): 436-444. https://doi.org/10.1038/nature14539

[36] Panwar, H., Gupta, P.K., Siddiqui, M.K., Morales-Menendez, R., Bhardwaj, P., Singh, V. (2020). A deep learning and grad-CAM based color visualization approach for fast detection of COVID-19 cases using chest X-ray and CT-Scan images. Chaos, Solitons & Fractals, 140: 110190. https://doi.org/10.1016/j.chaos.2020.110190

[37] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324. https://doi.org/10.1109/5.726791

[38] Zhou, D.X. (2020). Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124: 319-327. https://doi.org/10.1016/j.neunet.2020.01.018

[39] Fang, Z., Feng, H., Huang, S., Zhou, D.X. (2020). Theory of deep convolutional neural networks II: Spherical analysis. Neural Networks, 131: 154-162. https://doi.org/10.1016/j.neunet.2020.07.029

[40] Jadhav, S.B. (2019). Convolutional neural networks for leaf image-based plant disease classification. IAES International Journal of Artificial Intelligence, 8(4): 328.

[41] Dittmer, S., King, E.J., Maass, P. (2019). Singular values for ReLU layers. IEEE Transactions on Neural Networks and Learning Systems, 31(9): 3594-3605. https://doi.org/10.1109/TNNLS.2019.2945113

[42] Kadarina, T.M., Iklima, Z., Priambodo, R., Riandini, R., Wardhani, R.N. (2023). Dental caries classification using depthwise separable convolutional neural network for teledentistry system. Bulletin of Electrical Engineering and Informatics, 12(2): 940-949. https://doi.org/10.11591/eei.v12i2.4428

[43] Collobert, R., Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki Finland, pp. 160-167. https://doi.org/10.1145/1390156.1390177

[44] Zhou, D.X. (2018). Deep distributed convolutional neural networks: Universality. Analysis and Applications, 16(06): 895-919. https://doi.org/10.1142/S0219530518500124

[45] Mao, T., Shi, Z., Zhou, D.X. (2021). Theory of deep convolutional neural networks III: Approximating radial functions. Neural Networks, 144: 778-790. https://doi.org/10.1016/j.neunet.2021.09.027

[46] Daubechies, I. (1992). Ten lectures on wavelets. Society for Industrial and Applied Mathematics. SIAM.

[47] Mallat, S. (2016). Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065): 20150203. https://doi.org/10.1098/rsta.2015.0203

[48] Zhou, D.X. (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2): 787-794. https://doi.org/10.1016/j.acha.2019.06.004