
Apurv Notes - Foundations of Pytorch

The document provides an overview of foundational concepts in PyTorch, including the differences between machine learning and deep learning models, tensor operations, and the importance of optimizing neural networks using gradient descent. It explains tensor structures, operations, and how to convert between PyTorch tensors and NumPy arrays, as well as the use of CUDA for GPU acceleration. Additionally, it covers the mechanics of gradient descent optimization, including forward and backward passes, learning rates, and methods for calculating gradients.

Uploaded by

apurv78.magis
FOUNDATIONS OF PYTORCH (JANANI RAVI, PLURALSIGHT)

... relationship, mostly in the form of four prominent activation functions, namely ReLU, Sigmoid (Logit), Tanh and Step.
(4) Machine Learning vs Deep Learning Model: In a typical machine learning model, the relevant parameters are prescribed manually by the user. A deep learning model instead decides, based on patterns in the data (curve fitting), which variables are relevant, and this improves its predictions/decision-making.
... multiple neurons.
The first layer (visible to the user) is the Input Layer, and the last layer, also visible to the user, is the Output Layer. The middle layers are the hidden layers. The number of neurons per layer and the number of layers must be optimized, since too much processing may lead to over-fitting and lead the model astray.

Arrays, Vectors, Matrices and Tensors:
(1) Vectors are 1-D arrays, matrices are 2-D arrays and tensors are n-dimensional arrays.
(2) PyTorch processes tensors, not NumPy arrays, so all NumPy arrays built from input data have to be converted to tensors.
(3) By default, PyTorch creates the elements of a tensor as float32. You can manually choose another type, such as float64, int32 or another integer type.
(4) Arrays in NumPy are created and processed on the CPU, while tensors in PyTorch can also be created on a GPU (which is faster due to its parallel processing capabilities).
(5) The dimensions of a tensor are indexed from 0 (zero). So a tensor of shape [4, 3] has rows indexed 0, 1, 2, 3 and columns indexed 0, 1, 2.

Operations on Tensors: Tensor Initialization
* The default element type of a tensor is float32.
* A tensor can be created from a list of values with torch.tensor(...).
* torch.is_tensor(t) checks whether an object is a tensor (it returns True for a tensor).
* torch.numel(t) returns the number of elements in a tensor.
* An uninitialized tensor has memory allocated but no values set, e.g. torch.Tensor(2, 2).
* A randomly initialized tensor: torch.rand(2, 2).
* A tensor of a defined type (not float32): e.g. .type(torch.IntTensor), giving dtype=torch.int32.
* A 16-bit integer tensor: torch.ShortTensor([1.0, 2.0, 3.0]), giving dtype=torch.int16.
* A tensor filled with a single value: torch.full(..., fill_value=10).
* An identity matrix: torch.eye(...); torch.nonzero(...) returns the indices of its non-zero elements.

Foundations of PyTorch - Notes Part 2

Demo: Simple Operations on Tensors
(1A) Functions like "fill_", which have an underscore as a suffix, overwrite the original tensor by changing the values of its elements. Functions without the underscore suffix, like "add", create a new tensor with the given value added to the original tensor.
(1B) Not every in-place function (with an underscore suffix) has a corresponding function without the suffix. So, while there is "fill_", there is no corresponding method "fill". Some in-place functions like "add_" do have a corresponding out-of-place function "add".
(2A) Several functions similar to those in the NumPy library are also available in PyTorch, for example 'linspace', 'chunk', 'cat', etc.
(2B) tensor_chunk = torch.chunk(x, 3, 0) splits the tensor 'x' along dimension = 0 into a tuple of three tensors, tensor_chunk[0], tensor_chunk[1] and tensor_chunk[2], each with 5 elements.
(2C) torch.cat((tensor_chunk[0], tensor_chunk[1], tensor_chunk[2]), 0) concatenates the three tensors along dimension = 0.
(2C) Array slicing: random_tensor[1:, 1:] gives a tensor of all rows from row index 1 (second row onwards) and all columns from column index 1 (second column onwards).
(2D) random_tensor[3, 2] gives the element at index (3, 2) of "random_tensor" (a single value, or itself a tensor if random_tensor has more than two dimensions).
(2E) Add or delete a dimension with the 'unsqueeze' and 'squeeze' functions of PyTorch.
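The operations in notes (1A) to (2E) can be sketched in a few lines. This is a minimal illustration, not code from the course: the tensor `x`, its 3x5 shape and the fill value 7 are assumptions chosen so that each chunk holds 5 elements, as in note (2B).

```python
import torch

# A 3x5 tensor of evenly spaced values; the default dtype is float32.
x = torch.linspace(1, 15, steps=15).reshape(3, 5)

# Out-of-place: add() returns a new tensor and leaves x unchanged.
y = x.add(10)

# In-place: the trailing underscore overwrites the tensor itself.
z = torch.zeros(2, 2)
z.fill_(7)

# Split into three chunks along dimension 0, then join them back.
chunks = torch.chunk(x, 3, 0)
rejoined = torch.cat(chunks, 0)

# Slicing: rows from index 1 onwards, columns from index 1 onwards.
sub = x[1:, 1:]

# unsqueeze adds a size-1 dimension, squeeze removes it again.
expanded = x.unsqueeze(0)       # shape (1, 3, 5)
restored = expanded.squeeze(0)  # shape (3, 5)
```

Calling `x.add_(10)` instead of `x.add(10)` would modify `x` itself, which is the distinction note (1A) draws.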
(2F) Transpose a tensor from (x, y) to (y, x) dimensions.

Demo: Elementwise and Matrix Operations on Tensors
(2G) Sorting: the sorting is done along one dimension (for example row by row), and it also produces a tensor of the sorted indices.
(2H) Elementwise addition, subtraction, division and multiplication.
(3) Clamp (restrict) the values of a tensor to a minimum and a maximum.
(4) Dot product: torch.dot(t1, t2). Matrix product: torch.mm.
(4B) Multiply a matrix of shape (x, y) with a vector of length y to get a tensor whose elements are the dot products of each row of the matrix with the vector (e.g. with torch.mv).
(4C) torch.argmax and torch.argmin find the index of the largest and smallest element along a particular dimension.
(4D) torch.mul multiplies two tensors elementwise (for matrix multiplication use torch.mm or torch.matmul).

Demo: Converting between PyTorch Tensors and NumPy Arrays
(5) Convert a NumPy array to a torch tensor: "tensor = torch.from_numpy(np_array)". The resulting tensor has "dtype=float64" when the array holds NumPy's default float type, and it shares memory with the array.
(5B) "torch.as_tensor(np_array)" avoids making a copy of its input where possible (for example if the input is already a tensor of the right dtype and device, or a NumPy array), thereby avoiding memory wastage and keeping the data in the same computation graph, whether on CPU or GPU. On the other hand, the tensor initiated/converted from ...

PyTorch Support for CUDA Devices
(6) CUDA is NVIDIA's platform for GPU computing, and PyTorch supports CUDA devices.
(6B) Use CUDA to create different types of tensors (torch.cuda.FloatTensor) and to select the GPU on a multi-GPU machine (torch.cuda.device).
(6C) Cross-GPU operations are not allowed: running an operation on tensors stored on different GPUs causes an error (unless peer-to-peer memory access among the devices is enabled). However, tensors on one GPU can be copied onto other GPUs.
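The conversions in (5) and (5B), and the device placement described in (6B), can be sketched as follows. This is a hedged illustration rather than the course's code: the array values and variable names are assumptions, and the example falls back to the CPU when no CUDA device is present.

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])     # NumPy defaults to float64

shared = torch.from_numpy(np_array)      # shares memory with np_array
also_shared = torch.as_tensor(np_array)  # avoids a copy when it can
copied = torch.tensor(np_array)          # always makes a copy

# A change to the array is visible through the memory-sharing tensors only.
np_array[0] = 100.0

# Converting back to NumPy (CPU tensors only).
back_to_numpy = shared.numpy()

# Pick a CUDA device if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = copied.to(device)
```

Both operands of an operation must live on the same device, so a tensor moved with `.to(device)` should only be combined with other tensors on that device.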
(6D) Operations are queued on each GPU in FIFO order and run asynchronously, but synchronous processing can be forced by setting the environment variable CUDA_LAUNCH_BLOCKING=1 (say for error-handling, or when copying tensors from other GPUs that are needed on this GPU for an in-process execution).

Demo: Creating Tensors on CUDA-enabled Devices
(6E) You have to initialize CUDA, check the number of GPU devices and select one by index, and when you create a tensor you have to specify that it is to be created on a specific GPU, otherwise the tensor is created in CPU memory. PyTorch also keeps track of the current GPU, so CUDA tensors are created on that current GPU by default. To create a tensor on another GPU, first change the current GPU (for example with "torch.cuda.set_device(2)" or the context manager "torch.cuda.device(2)") and then specify the CUDA device while creating the tensor.
(6F) When you copy a tensor created on cuda:2 while the current device is cuda:1 (e.g. with .cuda()), the new tensor is created on the current device, cuda:1.
(6G) You can also copy a tensor from the current device cuda:1 to a different device cuda:2, using "b2 = tensor_random.to(device=cuda2)".
(7) You can also check the memory allocated to, and the memory cached for, your tensors on the current device with torch.cuda.memory_allocated() and torch.cuda.memory_cached() (renamed torch.cuda.memory_reserved() in newer PyTorch versions).

FOUNDATIONS OF PYTORCH - Notes (Part 3)

Gradient Descent Optimization
(8) Gradient Descent Optimization is a technique used to optimize the mathematical model (a neuron).
* It deals with optimizing the weights of the input parameters, and the bias, which leads to a best-fit curve (the curve closest to the true relationship between inputs and outputs).
* This is achieved by finding the model parameters (weights and bias) that lead to the least Mean Squared Error (MSE).
(8A) Here, for simplicity, we assume the network has only a single neuron with an affine function (for a linear relationship) and no activation function (that is, the activation is the identity function).
(8B) Each run involves two passes: a forward pass and a backward pass.
(i) Forward pass in the first run: we compute the predicted value of the output, based on assumed values of the bias and weights.
(ii) Backward pass in the first run: based on the resulting error (predicted value minus actual value from the training set), we work backward through the output layer, then the hidden layers, then the input layer, to adjust the weights matrix [W] and the bias b so as to reduce the error.
* The new values of W and b are computed by an optimizer function, which tweaks the assumed values of W and b by an amount called a "step", whose size is set by the "learning rate" of the neural network.
* The backward pass generates a gradient matrix, which is a tensor whose elements are:
  * the partial derivative of the error w.r.t. the weights (that is, the rate of change of the error due only to a change in the weights), and
  * the partial derivative of the error w.r.t. the bias (that is, the rate of change of the error due only to a change in the bias).
(8C) The second run
(i) Forward pass of the second run: this pass uses the updated values of the bias and the [W] matrix to arrive at another predicted value of the output, and therefore another error.
(ii) Backward pass of the second run: the weights are again adjusted in the backward pass to reduce the error (predicted value of output minus actual value of output). The optimizer again makes a change to the current [W] and b, to reduce the error still further.
(8D) Learning Rate:
(i) The trade-off: the size of the step (set by the "learning rate") may be large or small, and both have benefits and risks. Too large a step may swing the model output to the other extreme and actually increase the error. Too small a step may improve the error only slightly, and then require many more forward-backward passes.
(ii) The new values of the weights and bias use the learning rate as follows: New Value = Old Value - (Learning Rate * Gradient)
(8E) Target value of iterations: after every forward pass, to decide on a step, the optimizer function computes a Gradient Vector of the error for the particular weights matrix [W] and bias b used in that pass.
* In a three-dimensional plot with axes W, b and Mean Squared Error, the successive passes trace out an error surface.
* The MSE has a minimum at the point on this surface where the model performs best (has the least MSE).
* The corresponding values of [W] and b are the best/optimized values resulting from the training of the model. That is, the model is optimized where the gradient reaches zero; moving further in the same direction would only increase the error.
* The more iterations, the better the approximation.
(8F) The optimizer function:
* After every forward pass, to decide on a "step", the optimizer computes a Gradient Vector of the error for the weights matrix [W] and the bias b used in that pass.
* There is a trade-off between the learning rate and the number of passes required to arrive at the least MSE (the point where the gradient is zero).
* The Gradient Vector of the error for a particular W and b has as its elements (partial derivative w.r.t. W, partial derivative w.r.t. b), that is, [rate of change of error w.r.t. W, rate of change of error w.r.t. b].
* This is a matrix computation.

Calculating Gradients
(9) Three methods to compute gradients: (i) symbolic differentiation, (ii) numeric differentiation and (iii) automatic differentiation (the most widely used, e.g. in TensorFlow and PyTorch).
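The update rule in (8D) and the gradient vector in (8F) can be worked through by hand for a single neuron. The sketch below is plain Python with no framework. Everything in it is an assumption made for illustration: the training data follows y = 2x + 1, the starting values are zero, and the learning rate is 0.05. The partial derivatives are written out analytically, as symbolic differentiation would produce them: for MSE = (1/n) * sum((w*x + b - y)^2), the derivative w.r.t. w is (2/n) * sum((w*x + b - y) * x) and w.r.t. b is (2/n) * sum(w*x + b - y).

```python
# Training data for the assumed relationship y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def mse(w, b):
    """Mean squared error of the affine model w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Gradient vector [d(MSE)/dw, d(MSE)/db]."""
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


w, b = 0.0, 0.0          # assumed starting values
learning_rate = 0.05
initial_error = mse(w, b)

for _ in range(2000):
    # The forward pass happens inside gradients(); then one update step.
    grad_w, grad_b = gradients(w, b)
    w = w - learning_rate * grad_w   # new value = old value - lr * gradient
    b = b - learning_rate * grad_b

final_error = mse(w, b)
```

With this data, a much larger learning rate makes the error grow from pass to pass instead of shrinking, which is the risk described in (8D)(i).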
PyTorch uses the "autograd" package in the backward pass (known as "backpropagation"), using a technique called reverse-mode automatic differentiation.

Using Gradients to Update Model Parameters
(9A) In reverse auto-differentiation:
* The gradient matrix that is calculated corresponds to a particular time 't', and the resulting W and b are used for the next pass at time 't+1'.
* This gradient is applied to the previous run's parameters to obtain the parameters for the next run: "new parameter = old parameter - (gradient * learning rate)".
(9B) There is a trade-off between the learning rate and the number of passes required to arrive at the least MSE (the point where the gradient is zero). That point is where the model is optimized; moving further in the same direction would only increase the error.

Two Passes in Reverse Mode Automatic Differentiation
(9C) Symbolic differentiation:
* Requires calculating all elements of the gradient vector (i.e. the partial derivatives of the error w.r.t. each parameter), which is time-consuming.
* Also, in many cases, the output of the activation function (and therefore the resulting error function) for a particular pass may not even be differentiable.
(9D) Automatic differentiation: the course describes it as based on a "Taylor's series" expansion. In practice it applies the chain rule to each recorded operation to compute the gradient vector corresponding to each forward pass.

Demo: Introducing Autograd: Gradient-Matrix, Gradient-Function
(10) Reverse auto-differentiation in the autograd library: to decide the combination of [W, b, least error], you need to maintain a history of the weight tensors [W], the bias tensor 'b', and the gradient matrix after each forward pass.
(i) The gradient matrix is also a tensor, whose shape matches the tensor for which the gradient matrix is computed.
(ii) The gradient function is the transformation used to compute the gradient matrix during the backward pass.
(11) Gradient-Matrix:
(A) Checking status: whether a tensor has gradient tracking enabled can be checked using "output_tensor.requires_grad".
(B) Default value: the default is "requires_grad = False".
(C) Setting/enabling gradient tracking: you can specify at the time of creating a tensor that its gradient must be tracked in the backward phase.
* To enable gradient tracking on an existing tensor, the Python command is "tensor1.requires_grad_()".
* This updates the gradient-tracking flag "in-place": note that it is a function with an underscore as a suffix, and it therefore updates the tensor in place rather than creating a new tensor.
(D) Checking the gradient matrix: the gradient calculated w.r.t. any tensor can be checked using "print(tensor1.grad)", which prints "None" before the first backward pass.
(E) For user-created tensors:
(i) The gradient matrix is populated only after the backward pass is executed with "tensor_output.backward()", and only for tensors that have gradient tracking enabled, AND
(ii) The shape of the gradient matrix always matches the shape of the tensor itself.
(iii) The user-defined tensor1 and tensor2, even when gradient-enabled, have no gradient function (grad_fn is None), and their gradient stays None until a forward pass has been made and backward() has been called. This is because only after a forward pass do we have an error/cost/loss value with respect to which the partial derivatives of the tensor are calculated.
(F) For the output tensor:
(i) A tensor that is the output of an operation has gradient tracking enabled automatically if at least one of its inputs has it enabled (the other inputs need not), and it has a "gradient function" even when no backward pass has been made.
(ii) The output tensor has its gradient function set automatically, even though no backward pass has been run yet.
(12) Gradient-Function:
(i) The gradient function used to compute the gradient can be printed with "print(tensor1.grad_fn)".
(ii) The gradient function of the output tensor always references the last function in the computation statement (i.e. it references "mean" in tensor_output = (A * B).mean()).
(13) Directed Acyclic Graph: the record of the tensors and the functions applied to them, from which the gradient matrices are computed, is called a "Directed Acyclic Computation Graph - DAG".
* "Directed": it is called "directed" because the direction of flow of results/output is defined (forward).
* In a computation graph: the "node" is the actual tensor and the "function" represents the transformation performed along the edges of that node.

Demo: Working with Gradients
(14) Gradient tracking may be enabled or disabled. Python decorators can be used to control whether gradients are tracked for the tensors inside a function.
* "@torch.no_grad()" is typed before a function definition to create a "no-gradient zone" in the Python code. The resulting tensor does not track gradients, even though the input tensor itself may have had gradient tracking enabled.
* Through a nested "with torch.enable_grad():" block, you can enable gradient tracking even within the larger block of "no_grad".
* Alternatively, you can set requires_grad = True while instantiating a tensor (i.e. when you define the tensor's elements): "torch.tensor([[1.0], [2.0]], requires_grad=True)"
(15) Variables and Tensors:
(A) Earlier versions of PyTorch: the gradient vector and gradient function were stored in a separate Variable, which wrapped the base tensor together with its gradient matrix and gradient function.
(B) Current version of PyTorch:
(i) The Variable-wrapping API for base tensors (along with their gradient matrix and gradient function) has been deprecated. The gradient vector and gradient function, both of which were earlier stored in a separate Variable, are now attached to the tensor itself.
(ii) "Variable" can still be imported (from torch.autograd).
(iii) "var = Variable(torch.FloatTensor(9))"
* Returns a tensor with 9 elements, and not a Variable.
* Like any other tensor, the tensor named "var" has gradient tracking off by default, and while instantiating "var" the gradient tracking can be set with "requires_grad = True".

Demo: Training a Linear Model Using Autograd in PyTorch
(16) Now we build a simple neural network whose neuron has only a linear affine function and no activation function:
(Step-1) Define and plot two training tensors from two NumPy arrays: "x_train = torch.from_numpy(x_train)" and "y_train = torch.from_numpy(y_train)"
(Step-2) Set their gradient-tracking property (requires_grad=True)
(Step-3) Set the size (in number of neurons) of the input, hidden and output layers: input_size = 1, hidden_size = 1, output_size = 1
(Step-4) Set the weights tensor between the input and hidden layers: "w1 = torch.rand(input_size, hidden_size, requires_grad=True)"
(Step-5) Set the weights tensor between the hidden and output layers: "w2 = torch.rand(hidden_size, output_size, requires_grad=True)"
(Step-6) Set the learning rate (the step by which the optimizer may change the weights tensors in the backward pass)
(17) The Python code to implement the forward-backward pass:
(Step-1) Define the prediction function: Y_pred = X_train * w1 * w2
(Step-2) Define the loss function: Loss = sum of (Y_pred - Y_train)^2
(Step-3) Run a backward pass: "loss.backward()"
(Step-4) Adjust the weights for the next pass: w1(t+1) = w1(t) - learn_rate * w1.grad (and likewise for w2)
* The more iterations, the better the approximation.
(18) For visualisation, convert the tensor to a NumPy array with "Y_np = Y_tensor.detach().numpy()"
(19) Scatter the points on the graph and then plot the predicted line.
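Steps (16) to (18) can be put together as a short training loop. This is a minimal sketch under stated assumptions, not the course's exact code: the synthetic data (y = 3x on 20 points), the seed, the learning rate of 0.1, the 500 iterations and the use of the mean rather than the sum of squared errors are all choices made here for illustration. Resetting the gradients after each update is also added here, because `backward()` accumulates gradients across calls.

```python
import numpy as np
import torch

torch.manual_seed(0)  # assumed seed, only for repeatability

# Synthetic training data (an assumption): y = 3x on 20 points.
x_np = np.linspace(0.0, 1.0, 20, dtype=np.float32).reshape(-1, 1)
y_np = 3.0 * x_np

x_train = torch.from_numpy(x_np)
y_train = torch.from_numpy(y_np)

input_size, hidden_size, output_size = 1, 1, 1
w1 = torch.rand(input_size, hidden_size, requires_grad=True)
w2 = torch.rand(hidden_size, output_size, requires_grad=True)
learning_rate = 0.1

losses = []
for _ in range(500):
    # Forward pass: linear model with no activation function.
    y_pred = x_train.mm(w1).mm(w2)
    loss = (y_pred - y_train).pow(2).mean()
    losses.append(loss.item())

    # Backward pass: autograd fills w1.grad and w2.grad.
    loss.backward()

    # Update the weights without recording the update in the graph.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

# Detach from the graph before converting to NumPy for plotting.
y_np_pred = x_train.mm(w1).mm(w2).detach().numpy()
```

The array `y_np_pred` can then be drawn as the predicted line over a scatter plot of `x_np` and `y_np`, as in step (19).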
