diff --git a/_data/ecosystem/pted/2021/posters.yaml b/_data/ecosystem/pted/2021/posters.yaml index dd4f3bfc7412..5f5f22524a3a 100644 --- a/_data/ecosystem/pted/2021/posters.yaml +++ b/_data/ecosystem/pted/2021/posters.yaml @@ -10,7 +10,7 @@ are provided as a Torch tensor with a defined gradient. We highlight how this functionality can be used to explore new paradigms in machine learning, including the use of hybrid models for transfer learning. - link: http://www.pennylane.ai + link: http://pennylane.ai poster_link: https://s3.amazonaws.com/assets.pytorch.org/pted2021/posters/K1.png section: K1 thumbnail_link: https://s3.amazonaws.com/assets.pytorch.org/pted2021/posters/thumb-K1.png @@ -321,7 +321,7 @@ supports accelerated mixed precision training. AMD also provides hardware support for the PyTorch community build to help develop and maintain new features. This poster will highlight some of the work that has gone into enabling PyTorch support. - link: www.amd.com/rocm + link: https://www.amd.com/rocm poster_link: https://s3.amazonaws.com/assets.pytorch.org/pted2021/posters/K8.png section: K8 thumbnail_link: https://s3.amazonaws.com/assets.pytorch.org/pted2021/posters/thumb-K8.png diff --git a/_mobile/android.md b/_mobile/android.md index eb25100b8da6..f057a28806ae 100644 --- a/_mobile/android.md +++ b/_mobile/android.md @@ -94,7 +94,7 @@ Tensor inputTensor = TensorImageUtils.bitmapToFloat32Tensor(bitmap, TensorImageUtils.TORCHVISION_NORM_MEAN_RGB, TensorImageUtils.TORCHVISION_NORM_STD_RGB); ``` `org.pytorch.torchvision.TensorImageUtils` is part of `org.pytorch:pytorch_android_torchvision` library. -The `TensorImageUtils#bitmapToFloat32Tensor` method creates tensors in the [torchvision format](https://pytorch.org/docs/stable/torchvision/models.html) using `android.graphics.Bitmap` as a source. +The `TensorImageUtils#bitmapToFloat32Tensor` method creates tensors in the [torchvision format](https://pytorch.org/vision/stable/models.html) using `android.graphics.Bitmap` as a source. > All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. > The images have to be loaded in to a range of `[0, 1]` and then normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]` diff --git a/_mobile/ios.md b/_mobile/ios.md index fb00da2ebd40..585191cc764f 100644 --- a/_mobile/ios.md +++ b/_mobile/ios.md @@ -23,7 +23,7 @@ HelloWorld is a simple image classification application that demonstrates how to ### Model Preparation -Let's start with model preparation. If you are familiar with PyTorch, you probably should already know how to train and save your model. In case you don't, we are going to use a pre-trained image classification model - [MobileNet v2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/), which is already packaged in [TorchVision](https://pytorch.org/docs/stable/torchvision/index.html). To install it, run the command below. +Let's start with model preparation. If you are familiar with PyTorch, you probably should already know how to train and save your model. 
In case you don't, we are going to use a pre-trained image classification model - [MobileNet v2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/), which is already packaged in [TorchVision](https://pytorch.org/vision/stable/index.html). To install it, run the command below. > We highly recommend following the [Pytorch Github page](https://github.com/pytorch/pytorch) to set up the Python development environment on your local machine. diff --git a/_posts/2018-03-5-tensor-comprehensions.md b/_posts/2018-03-5-tensor-comprehensions.md index a777c076a432..df83ea75dccd 100644 --- a/_posts/2018-03-5-tensor-comprehensions.md +++ b/_posts/2018-03-5-tensor-comprehensions.md @@ -34,7 +34,7 @@ conda install -c pytorch -c tensorcomp tensor_comprehensions At this time we only provide Linux-64 binaries which have been tested on Ubuntu 16.04 and CentOS7. -TC depends on heavyweight C++ projects such as [Halide](http://halide-lang.org/), [Tapir-LLVM](https://github.com/wsmoses/Tapir-LLVM) and [ISL](http://isl.gforge.inria.fr/). Hence, we rely on Anaconda to distribute these dependencies reliably. For the same reason, TC is not available via PyPI. +TC depends on heavyweight C++ projects such as [Halide](http://halide-lang.org/), [Tapir-LLVM](https://github.com/wsmoses/Tapir-LLVM) and ISL. Hence, we rely on Anaconda to distribute these dependencies reliably. For the same reason, TC is not available via PyPI. #### 2. Import the python package @@ -74,8 +74,6 @@ The autotuner is your biggest friend. You generally do not want to use a `tc` fu When the autotuning is running, the current best performance is displayed. If you are satisfied with the current result or you are out of time, stop the tuning procedure by pressing `Ctrl+C`. -![tc-autotuner](https://pytorch.org/static/img/tc_autotuner.gif) - `cache` saves the results of the autotuned kernel search and saves it to the file `fcrelu_100_128_100.tc`. The next time you call the same line of code, it loads the results of the autotuning without recomputing it. The autotuner has a few hyperparameters (just like your ConvNet has learning rate, number of layers, etc.). We pick reasonable defaults, but you can read about using advanced options [here](https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/writing_layers.html#specifying-mapping-options). @@ -146,7 +144,7 @@ Note: the syntax for passing in scalars is subject to change in the next release ## torch.nn layers -We added some sugar-coating around the basic PyTorch integration of TC to make it easy to integrate TC into larger `torch.nn` models by defining the forward and backward TC expressions and taking `Variable` inputs / outputs. Here is an [example](https://github.com/facebookresearch/TensorComprehensions/blob/master/test_python/layers/test_convolution_train.py) of defining a convolution layer with TC. +We added some sugar-coating around the basic PyTorch integration of TC to make it easy to integrate TC into larger `torch.nn` models by defining the forward and backward TC expressions and taking `Variable` inputs / outputs. ## Some essentials that you will miss (we're working on them) @@ -183,12 +181,12 @@ You cannot write this operation in TC: `torch.matmul(...).view(...).mean(...)`. 
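For readers skimming the diff, the define, autotune and cache flow described above is easiest to see in code. Below is a minimal sketch of the `fcrelu` example the surrounding text refers to, assuming the original `tensor_comprehensions` 0.1.x Python API; treat the exact signatures as illustrative rather than authoritative.

```python
import torch
import tensor_comprehensions as tc

# TC language definition of a fused fully-connected + ReLU layer.
lang = """
def fcrelu(float(B, M) I, float(N, M) W1, float(N) B1) -> (O1) {
    O1(b, n) +=! I(b, m) * W1(n, m)
    O1(b, n) = O1(b, n) + B1(n)
    O1(b, n) = fmax(O1(b, n), 0)
}
"""

fcrelu = tc.define(lang, name="fcrelu")
B, M, N = 100, 128, 100
I, W1, B1 = torch.randn(B, M).cuda(), torch.randn(N, M).cuda(), torch.randn(N).cuda()

# Autotune once; the best mapping options are written to the cache file
# and reloaded on subsequent calls instead of re-running the search.
fcrelu.autotune(I, W1, B1, cache="fcrelu_100_128_100.tc")
out = fcrelu(I, W1, B1)
```

The `cache` argument corresponds to the `fcrelu_100_128_100.tc` file mentioned above: tuned options are saved on the first run and reused afterwards.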
## Getting Started - [Walk through Tutorial](https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/writing_layers.html) to quickly get started with understanding and using Tensor Comprehensions PyTorch package. -- [Over 20 examples](https://github.com/facebookresearch/TensorComprehensions/tree/master/test_python/layers) of various ML layers with TC, including `avgpool`, `maxpool`, `matmul`, matmul - give output buffers and `batch-matmul`, `convolution`, `strided-convolution`, `batchnorm`, `copy`, `cosine similarity`, `Linear`, `Linear + ReLU`, `group-convolutions`, strided `group-convolutions`, `indexing`, `Embedding` (lookup table), small-mobilenet, `softmax`, `tensordot`, `transpose` +- Over 20 examples of various ML layers with TC, including `avgpool`, `maxpool`, `matmul`, matmul - give output buffers and `batch-matmul`, `convolution`, `strided-convolution`, `batchnorm`, `copy`, `cosine similarity`, `Linear`, `Linear + ReLU`, `group-convolutions`, strided `group-convolutions`, `indexing`, `Embedding` (lookup table), small-mobilenet, `softmax`, `tensordot`, `transpose` - [Detailed docs](https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html) on Tensor Comprehensions and integration with PyTorch. ## Communication -- [Slack](https://tensorcomprehensions.herokuapp.com/): For discussion around framework integration, build support, collaboration, etc. join our slack channel. +- Slack: For discussion around framework integration, build support, collaboration, etc. join our slack channel. - Email: tensorcomp@fb.com - [GitHub](https://github.com/facebookresearch/TensorComprehensions): bug reports, feature requests, install issues, RFCs, thoughts, etc. diff --git a/_posts/2019-05-08-model-serving-in-pyorch.md b/_posts/2019-05-08-model-serving-in-pyorch.md index c25b1c89f7ab..512268e5f198 100644 --- a/_posts/2019-05-08-model-serving-in-pyorch.md +++ b/_posts/2019-05-08-model-serving-in-pyorch.md @@ -52,7 +52,7 @@ If you can't use the cloud or prefer to manage all services using the same techn If you want to manage multiple models within a non-cloud service solution, there are teams developing PyTorch support in model servers like [MLFlow](https://mlflow.org/), [Kubeflow](https://www.kubeflow.org/), and [RedisAI.](https://oss.redislabs.com/redisai/) We're excited to see innovation from multiple teams building OSS model servers, and we'll continue to highlight innovation in the PyTorch ecosystem in the future. -If you can use the cloud for your application, there are several great choices for working with models in the cloud. For AWS Sagemaker, you can start find a guide to [all of the resources from AWS for working with PyTorch](https://docs.aws.amazon.com/sagemaker/latest/dg/pytorch.html), including docs on how to use the [Sagemaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html). You can also see [some](https://youtu.be/5h1Ot2dPi2E) [talks](https://youtu.be/qc5ZikKw9_w) we've given on using PyTorch on Sagemaker. Finally, if you happen to be using PyTorch via FastAI, then they've written a [really simple guide](https://course.fast.ai/deployment_amzn_sagemaker.html) to getting up and running on Sagemaker. 
+If you can use the cloud for your application, there are several great choices for working with models in the cloud. For AWS Sagemaker, you can start find a guide to [all of the resources from AWS for working with PyTorch](https://docs.aws.amazon.com/sagemaker/latest/dg/pytorch.html), including docs on how to use the [Sagemaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html). You can also see [some](https://youtu.be/5h1Ot2dPi2E) [talks](https://youtu.be/qc5ZikKw9_w) we've given on using PyTorch on Sagemaker. Finally, if you happen to be using PyTorch via FastAI, then they've written a really simple guide to getting up and running on Sagemaker. The story is similar across other major clouds. On Google Cloud, you can follow [these instructions](https://cloud.google.com/deep-learning-vm/docs/pytorch_start_instance) to get access to a Deep Learning VM with PyTorch pre-installed. On Microsoft Azure, you have a number of ways to get started from [Azure Machine Learning Service](https://azure.microsoft.com/en-us/services/machine-learning-service/) to [Azure Notebooks](https://notebooks.azure.com/pytorch/projects/tutorials) showing how to use PyTorch. diff --git a/_posts/2019-06-10-towards-reproducible-research-with-pytorch-hub.md b/_posts/2019-06-10-towards-reproducible-research-with-pytorch-hub.md index 3bdd2db84dbe..35a4306d7557 100644 --- a/_posts/2019-06-10-towards-reproducible-research-with-pytorch-hub.md +++ b/_posts/2019-06-10-towards-reproducible-research-with-pytorch-hub.md @@ -106,7 +106,7 @@ Users can list all available entrypoints in a repo using the ```torch.hub.list() 'vgg19_bn'] ``` -Note that PyTorch Hub also allows auxillary entrypoints (other than pretrained models), e.g. ```bertTokenizer``` for preprocessing in the [BERT](https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/) models, to make the user workflow smoother. +Note that PyTorch Hub also allows auxillary entrypoints (other than pretrained models), e.g. ```bertTokenizer``` for preprocessing in the BERT models, to make the user workflow smoother. ### Load a model @@ -164,7 +164,7 @@ forward(input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=No ... ``` -Have a closer look at the [BERT](https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/) and [DeepLabV3](https://pytorch.org/hub/pytorch_vision_deeplabv3_resnet101/) pages, where you can see how these models can be used once loaded. +Have a closer look at the BERT and [DeepLabV3](https://pytorch.org/hub/pytorch_vision_deeplabv3_resnet101/) pages, where you can see how these models can be used once loaded. ### Other ways to explore diff --git a/_posts/2019-07-18-pytorch-ecosystem.md b/_posts/2019-07-18-pytorch-ecosystem.md index 3b87f2f10d0f..7351cbbd9d4f 100644 --- a/_posts/2019-07-18-pytorch-ecosystem.md +++ b/_posts/2019-07-18-pytorch-ecosystem.md @@ -45,7 +45,7 @@ If you would like to have your project included in the PyTorch ecosystem and fea ## PyTorch Hub for reproducible research | New models -Since [launching](https://pytorch.org/blog/towards-reproducible-research-with-pytorch-hub/) the PyTorch Hub in beta, we’ve received a lot of interest from the community including the contribution of many new models. 
Some of the latest include [U-Net for Brain MRI](https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/) contributed by researchers at Duke University, [Single Shot Detection](https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/) from NVIDIA and [Transformer-XL](https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_transformerXL/) from HuggingFace. +Since [launching](https://pytorch.org/blog/towards-reproducible-research-with-pytorch-hub/) the PyTorch Hub in beta, we’ve received a lot of interest from the community including the contribution of many new models. Some of the latest include [U-Net for Brain MRI](https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/) contributed by researchers at Duke University, [Single Shot Detection](https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/) from NVIDIA and Transformer-XL from HuggingFace. We’ve seen organic integration of the PyTorch Hub by folks like [paperswithcode](https://paperswithcode.com/), making it even easier for you to try out the state of the art in AI research. In addition, companies like [Seldon](https://github.com/axsaucedo/seldon-core/tree/pytorch_hub/examples/models/pytorchhub) provide production-level support for PyTorch Hub models on top of Kubernetes. diff --git a/_posts/2019-08-08-pytorch-1.2-and-domain-api-release.md b/_posts/2019-08-08-pytorch-1.2-and-domain-api-release.md index 5e8ce05d52f8..bcc30d86963a 100644 --- a/_posts/2019-08-08-pytorch-1.2-and-domain-api-release.md +++ b/_posts/2019-08-08-pytorch-1.2-and-domain-api-release.md @@ -115,9 +115,9 @@ We are excited to see an active community around torchaudio and eager to further ## Torchtext 0.4 with supervised learning datasets -A key focus area of torchtext is to provide the fundamental elements to help accelerate NLP research. This includes easy access to commonly used datasets and basic preprocessing pipelines for working on raw text based data. The torchtext 0.4.0 release includes several popular supervised learning baselines with "one-command" data loading. A [tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) is included to show how to use the new datasets for text classification analysis. We also added and improved on a few functions such as [get_tokenizer](https://pytorch.org/text/data.html?highlight=get_tokenizer#torchtext.data.get_tokenizer) and [build_vocab_from_iterator](https://pytorch.org/text/vocab.html#build-vocab-from-iterator) to make it easier to implement future datasets. Additional examples can be found [here](https://github.com/pytorch/text/tree/master/examples/text_classification). +A key focus area of torchtext is to provide the fundamental elements to help accelerate NLP research. This includes easy access to commonly used datasets and basic preprocessing pipelines for working on raw text based data. The torchtext 0.4.0 release includes several popular supervised learning baselines with "one-command" data loading. A [tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) is included to show how to use the new datasets for text classification analysis. 
We also added and improved on a few functions such as get_tokenizer and build_vocab_from_iterator to make it easier to implement future datasets. Additional examples can be found [here](https://github.com/pytorch/text/tree/master/examples/text_classification). -Text classification is an important task in Natural Language Processing with many applications, such as sentiment analysis. The new release includes several popular [text classification datasets](https://pytorch.org/text/datasets.html?highlight=textclassification#torchtext.datasets.TextClassificationDataset) for supervised learning including: +Text classification is an important task in Natural Language Processing with many applications, such as sentiment analysis. The new release includes several popular text classification datasets for supervised learning including: * AG_NEWS * SogouNews diff --git a/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md b/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md index 0a13979349de..d39d84959ce5 100644 --- a/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md +++ b/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md @@ -39,7 +39,7 @@ Image and video classification are at the core of content understanding. To that * Ease of use - This framework features a modular, flexible design that allows anyone to train machine learning models on top of PyTorch using very simple abstractions. The system also has out-of-the-box integration with AWS on PyTorch Elastic, facilitating research at scale and making it simple to move between research and production. * High performance - Researchers can use the framework to train models such as Resnet50 on ImageNet in as little as 15 minutes. -You can learn more at the [NeurIPS Expo workshop](https://nips.cc/ExpoConferences/2019/schedule?workshop_id=16) on Multi-Modal research to production or get started with the PyTorch Elastic Imagenet example [here](https://github.com/pytorch/elastic/blob/master/examples/imagenet/main.py). +You can learn more at the NeurIPS Expo workshop on Multi-Modal research to production or get started with the PyTorch Elastic Imagenet example [here](https://github.com/pytorch/elastic/blob/master/examples/imagenet/main.py). ## Come see us at NeurIPS @@ -47,13 +47,13 @@ The PyTorch team will be hosting workshops at NeurIPS during the industry expo o We’re also publishing a [paper that details the principles that drove the implementation of PyTorch](https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library) and how they’re reflected in its architecture. -*[Multi-modal Research to Production](https://nips.cc/ExpoConferences/2019/schedule?workshop_id=16)* - This workshop will dive into a number of modalities such as computer vision (large scale image classification and instance segmentation) and Translation and Speech (seq-to-seq Transformers) from the lens of taking cutting edge research to production. Lastly, we will also walk through how to use the latest APIs in PyTorch to take eager mode developed models into graph mode via Torchscript and quantize them for scale production deployment on servers or mobile devices. 
Libraries used include: +*Multi-modal Research to Production* - This workshop will dive into a number of modalities such as computer vision (large scale image classification and instance segmentation) and Translation and Speech (seq-to-seq Transformers) from the lens of taking cutting edge research to production. Lastly, we will also walk through how to use the latest APIs in PyTorch to take eager mode developed models into graph mode via Torchscript and quantize them for scale production deployment on servers or mobile devices. Libraries used include: * Classification Framework - a newly open sourced PyTorch framework developed by Facebook AI for research on large-scale image and video classification. It allows researchers to quickly prototype and iterate on large distributed training jobs. Models built on the framework can be seamlessly deployed to production. * Detectron2 - the recently released object detection library built by the Facebook AI Research computer vision team. We will articulate the improvements over the previous version including: 1) Support for latest models and new tasks; 2) Increased flexibility, to enable new computer vision research; 3) Maintainable and scalable, to support production use cases. * Fairseq - general purpose sequence-to-sequence library, can be used in many applications, including (unsupervised) translation, summarization, dialog and speech recognition. -*[Responsible and Reproducible AI](https://nips.cc/ExpoConferences/2019/schedule?workshop_id=14)* - This workshop on Responsible and Reproducible AI will dive into important areas that are shaping the future of how we interpret, reproduce research, and build AI with privacy in mind. We will cover major challenges, walk through solutions, and finish each talk with a hands-on tutorial. +*Responsible and Reproducible AI* - This workshop on Responsible and Reproducible AI will dive into important areas that are shaping the future of how we interpret, reproduce research, and build AI with privacy in mind. We will cover major challenges, walk through solutions, and finish each talk with a hands-on tutorial. * Reproducibility: As the number of research papers submitted to arXiv and conferences skyrockets, scaling reproducibility becomes difficult. We must address the following challenges: aid extensibility by standardizing code bases, democratize paper implementation by writing hardware agnostic code, facilitate results validation by documenting “tricks” authors use to make their complex systems function. To offer solutions, we will dive into tool like PyTorch Hub and PyTorch Lightning which are used by some of the top researchers in the world to reproduce the state of the art. * Interpretability: With the increase in model complexity and the resulting lack of transparency, model interpretability methods have become increasingly important. Model understanding is both an active area of research as well as an area of focus for practical applications across industries using machine learning. 
To get hands on, we will use the recently released Captum library that provides state-of-the-art algorithms to provide researchers and developers with an easy way to understand the importance of neurons/layers and the predictions made by our models.` diff --git a/_posts/2020-07-28-pytorch-feature-classification-changes.md b/_posts/2020-07-28-pytorch-feature-classification-changes.md index 9ff4291513aa..057867a68158 100644 --- a/_posts/2020-07-28-pytorch-feature-classification-changes.md +++ b/_posts/2020-07-28-pytorch-feature-classification-changes.md @@ -42,7 +42,7 @@ Additionally, the following features will be reclassified under this new rubric: 5. [Channels Last Memory Layout](https://pytorch.org/docs/stable/tensor_attributes.html#torch-memory-format): Beta (was Experimental) 6. [Custom C++ Classes](https://pytorch.org/docs/stable/jit.html?highlight=experimental): Beta (was Experimental) 7. [PyTorch Mobile](https://pytorch.org/mobile/home/): Beta (was Experimental) -8. [Java Bindings](https://pytorch.org/docs/stable/packages.html#): Beta (was Experimental) +8. [Java Bindings](https://pytorch.org/docs/stable/index.html): Beta (was Experimental) 9. [Torch.Sparse](https://pytorch.org/docs/stable/sparse.html?highlight=experimental#): Beta (was Experimental) diff --git a/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md b/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md index 2be782f18b47..e55070202d16 100644 --- a/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md +++ b/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md @@ -40,8 +40,6 @@ With the scale of models, such as RoBERTa, continuing to increase into the billi To learn more about the APIs and the design of this feature, see the links below: * [API documentation](https://pytorch.org/docs/stable/rpc.html) -* [Distributed Autograd design doc](https://pytorch.org/docs/stable/notes/distributed_autograd.html) -* [Remote Reference design doc](https://pytorch.org/docs/stable/notes/rref.html) For the full tutorials, see the links below: diff --git a/_posts/2020-10-1-announcing-the-winners-of-the-2020-global-pytorch-summer-hackathon.md b/_posts/2020-10-1-announcing-the-winners-of-the-2020-global-pytorch-summer-hackathon.md index 0bc542fe8484..76c7e05ef8f4 100644 --- a/_posts/2020-10-1-announcing-the-winners-of-the-2020-global-pytorch-summer-hackathon.md +++ b/_posts/2020-10-1-announcing-the-winners-of-the-2020-global-pytorch-summer-hackathon.md @@ -63,7 +63,7 @@ A PyTorch-based automated machine learning (AutoML) solution, carefree-learn pro **3rd Place** - [TorchExpo](https://devpost.com/software/torchexpo) -TorchExpo is a collection of models and extensions that simplifies taking PyTorch from research to production in mobile devices. This library is more than a web and mobile application, and also comes with a Python library. The Python library is available via pip install and it helps researchers convert a state-of-the-art model in TorchScript and ONNX format in just one line. Detailed docs are available [here](https://torchexpo.readthedocs.io/en/latest/). +TorchExpo is a collection of models and extensions that simplifies taking PyTorch from research to production in mobile devices. 
This library is more than a web and mobile application, and also comes with a Python library. The Python library is available via pip install and it helps researchers convert a state-of-the-art model in TorchScript and ONNX format in just one line. ## Web/Mobile Applications Powered by PyTorch @@ -91,10 +91,7 @@ FairTorch is a fairness library for PyTorch. It lets developers add constraints
- FairTorch + FairTorch
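The TorchExpo entry above describes converting a model to TorchScript and ONNX "in just one line". As a rough illustration of what such a conversion involves, here is a generic sketch using core PyTorch export APIs; this is not TorchExpo's own interface.

```python
import torch
import torchvision

model = torchvision.models.mobilenet_v2(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# TorchScript: trace the model into a serializable, Python-free artifact.
scripted = torch.jit.trace(model, example)
scripted.save("mobilenet_v2.pt")

# ONNX: export the same model for use with other runtimes.
torch.onnx.export(model, example, "mobilenet_v2.onnx", opset_version=11)
```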
diff --git a/_posts/2020-10-27-pytorch-1.7-released.md b/_posts/2020-10-27-pytorch-1.7-released.md index a3679b25e166..766ed9889ef6 100644 --- a/_posts/2020-10-27-pytorch-1.7-released.md +++ b/_posts/2020-10-27-pytorch-1.7-released.md @@ -62,7 +62,6 @@ Note that this is necessary, **but not sufficient**, for determinism **within a See the documentation for ```torch.set_deterministic(bool)``` for the list of affected operations. * [RFC](https://github.com/pytorch/pytorch/issues/15359) -* [Documentation](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html) # Performance & Profiling ## [Beta] Stack traces added to profiler diff --git a/_posts/2020-11-1-pytorch-developer-day-2020.md b/_posts/2020-11-1-pytorch-developer-day-2020.md index ef6cb6ab8144..c68bafaec0d1 100644 --- a/_posts/2020-11-1-pytorch-developer-day-2020.md +++ b/_posts/2020-11-1-pytorch-developer-day-2020.md @@ -16,9 +16,8 @@ For Developer Day, we have an online networking event limited to people composed All talks will be livestreamed and available to the public. * [Livestream event page](https://www.facebook.com/events/802177440559164/) -* [Apply for an invitation to the networking event](https://pytorchdeveloperday.fbreg.com/apply) -Visit the [event website](https://pytorchdeveloperday.fbreg.com/) to learn more. We look forward to welcoming you to PyTorch Developer Day on November 12th! +Visit the event website to learn more. We look forward to welcoming you to PyTorch Developer Day on November 12th! Thank you, diff --git a/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md b/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md index 69101b8abc09..af49e31a38f7 100644 --- a/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md +++ b/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md @@ -61,7 +61,7 @@ The torchvision 0.6 release includes updates to datasets, models and a significa * Added `aligned` flag to `RoIAlign` to match Detectron2. * Refactored abstractions for C++ video decoder -See the release full notes [here](https://github.com/pytorch/vision/releases) and full docs can be found [here](https://pytorch.org/docs/stable/torchvision/index.html). +See the release full notes [here](https://github.com/pytorch/vision/releases) and full docs can be found [here](https://pytorch.org/vision/stable/index.html). ### torchtext 0.6 The torchtext 0.6 release includes a number of bug fixes and improvements to documentation. Based on user's feedback, dataset abstractions are currently being redesigned also. Highlights for the release include: diff --git a/_posts/2020-7-28-pytorch-1.6-released.md b/_posts/2020-7-28-pytorch-1.6-released.md index eb07fe53867a..9d1f6442249a 100644 --- a/_posts/2020-7-28-pytorch-1.6-released.md +++ b/_posts/2020-7-28-pytorch-1.6-released.md @@ -101,7 +101,7 @@ torch.distributed.rpc.rpc_sync(...) 
``` * Design doc ([Link](https://github.com/pytorch/pytorch/issues/35251)) -* Documentation ([Link](https://pytorch.org/docs/stable/rpc/index.html)) +* Documentation ([Link](https://pytorch.org/docs/stable/)) ## [Beta] DDP+RPC @@ -123,7 +123,7 @@ for data in batch: ``` * DDP+RPC Tutorial ([Link](https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html)) -* Documentation ([Link](https://pytorch.org/docs/stable/rpc/index.html)) +* Documentation ([Link](https://pytorch.org/docs/stable/)) * Usage Examples ([Link](https://github.com/pytorch/examples/pull/800)) ## [Beta] RPC - Asynchronous User Functions @@ -147,7 +147,7 @@ ret = rpc.rpc_sync( print(ret) # prints tensor([3., 3.]) ``` -* Tutorial for performant batch RPC using Asynchronous User Functions ([Link](https://github.com/pytorch/tutorials/blob/release/1.6/intermediate_source/rpc_async_execution.rst)) +* Tutorial for performant batch RPC using Asynchronous User Functions * Documentation ([Link](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)) * Usage examples ([Link](https://github.com/pytorch/examples/tree/master/distributed/rpc/batch)) diff --git a/_posts/2021-08-23-pytorch-developer-day-2021.md b/_posts/2021-08-23-pytorch-developer-day-2021.md index 0e369ab6354d..6d121fb00cd0 100644 --- a/_posts/2021-08-23-pytorch-developer-day-2021.md +++ b/_posts/2021-08-23-pytorch-developer-day-2021.md @@ -22,12 +22,9 @@ Stay up to date by following us on our social channels: [Twitter](https://twitte On the second day, we’ll be hosting an online poster exhibition on Gather.Town. There will be opportunities to meet the authors and learn more about their PyTorch projects as well as network with the community. This poster and networking event is limited to people composed of PyTorch maintainers and contributors, long-time stakeholders and experts in areas relevant to PyTorch’s future. Conversations from the networking event will strongly shape the future of PyTorch. As such, invitations are required to attend the networking event. -Apply for an invitation to the networking event by clicking [here](https://pytorchdeveloperday.fbreg.com/). ## Call for Content Now Open Submit your poster abstracts today! Please send us the title and brief summary of your project, tools and libraries that could benefit PyTorch researchers in academia and industry, application developers, and ML engineers for consideration. The focus must be on academic papers, machine learning research, or open-source projects related to PyTorch development, Responsible AI or Mobile. Please no sales pitches. **Deadline for submission is September 24, 2021**. -You can submit your poster abstract during your application & registration process [here](https://pytorchdeveloperday.fbreg.com/apply). - -Visit the [event website](https://pytorchdeveloperday.fbreg.com/) for more information and we look forward to having you at PyTorch Developer Day. For any questions about the event, contact [pytorch@fbreg.com](mailto:pytorch@fbreg.com). +Visit the event website for more information and we look forward to having you at PyTorch Developer Day. 
\ No newline at end of file diff --git a/_posts/2021-10-21-pytorch-1.10-new-library-releases.md b/_posts/2021-10-21-pytorch-1.10-new-library-releases.md index 5d6413570500..8356cb1bc9cf 100644 --- a/_posts/2021-10-21-pytorch-1.10-new-library-releases.md +++ b/_posts/2021-10-21-pytorch-1.10-new-library-releases.md @@ -113,7 +113,7 @@ TorchAudio now adds support for differentiable Minimum Variance Distortionless R >>> # Get the enhanced waveform via iSTFT >>> waveform_enhanced = istft(specgram_enhanced, length=waveform.shape[-1]) ``` -Please refer to the [documentation](https://pytorch.org/audio/0.10.0/transforms.html#mvdr) for more details and try out this feature using the [MVDR tutorial](https://github.com/pytorch/audio/blob/main/examples/beamforming/MVDR_tutorial.ipynb). +Please refer to the [documentation](https://pytorch.org/audio/0.10.0/transforms.html#mvdr) for more details and try out this feature using the MVDR tutorial. ### (Beta) RNN Transducer Loss The RNN transducer (RNNT) loss is part of the RNN transducer pipeline, which is a popular architecture for speech recognition tasks. Recently it has gotten attention for being used in a streaming setting, and has also achieved state-of-the-art WER for the LibriSpeech benchmark. diff --git a/_posts/2021-12-22-introducing-torchvision-new-multi-weight-support-api.md b/_posts/2021-12-22-introducing-torchvision-new-multi-weight-support-api.md index 99280f4d45db..6086188e92e0 100644 --- a/_posts/2021-12-22-introducing-torchvision-new-multi-weight-support-api.md +++ b/_posts/2021-12-22-introducing-torchvision-new-multi-weight-support-api.md @@ -2,7 +2,7 @@ layout: blog_detail title: "Introducing TorchVision’s New Multi-Weight Support API" author: Vasilis Vryniotis -featured-img: "assets/images/torchvision_featured.png" +featured-img: "assets/images/torchvision_featured.jpg" --- TorchVision has a new backwards compatible API for building models with multi-weight support. The new API allows loading different pre-trained weights on the same model variant, keeps track of vital meta-data such as the classification labels and includes the preprocessing transforms necessary for using the models. In this blog post, we plan to review the prototype API, show-case its features and highlight key differences with the existing one. diff --git a/_posts/2021-4-16-ml-models-torchvision-v0.9.md b/_posts/2021-4-16-ml-models-torchvision-v0.9.md index 5a5b3bdf156e..ff4e22f2c7c6 100644 --- a/_posts/2021-4-16-ml-models-torchvision-v0.9.md +++ b/_posts/2021-4-16-ml-models-torchvision-v0.9.md @@ -7,7 +7,7 @@ author: Team PyTorch TorchVision v0.9 has been [released](https://github.com/pytorch/vision/releases) and it is packed with numerous new Machine Learning models and features, speed improvements and bug fixes. In this blog post, we provide a quick overview of the newly introduced ML models and discuss their key features and characteristics. ### Classification -* **MobileNetV3 Large & Small:** These two classification models are optimized for Mobile use-cases and are used as backbones on other Computer Vision tasks. The implementation of the new [MobileNetV3 architecture](https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenetv3.py) supports the Large & Small variants and the depth multiplier parameter as described in the [original paper](https://arxiv.org/pdf/1905.02244.pdf). 
We offer pre-trained weights on ImageNet for both Large and Small networks with depth multiplier 1.0 and resolution 224x224. Our previous [training recipes](https://github.com/pytorch/vision/tree/master/references/classification#mobilenetv3-large--small) have been updated and can be used to easily train the models from scratch (shoutout to Ross Wightman for inspiring some of our [training configuration](https://rwightman.github.io/pytorch-image-models/training_hparam_examples/#mobilenetv3-large-100-75766-top-1-92542-top-5)). The Large variant offers a [competitive accuracy](https://github.com/pytorch/vision/blob/master/docs/source/models.rst#classification) comparing to ResNet50 while being over 6x faster on CPU, meaning that it is a good candidate for applications where speed is important. For applications where speed is critical, one can sacrifice further accuracy for speed and use the Small variant which is 15x faster than ResNet50. +* **MobileNetV3 Large & Small:** These two classification models are optimized for Mobile use-cases and are used as backbones on other Computer Vision tasks. The implementation of the new [MobileNetV3 architecture](https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenetv3.py) supports the Large & Small variants and the depth multiplier parameter as described in the [original paper](https://arxiv.org/pdf/1905.02244.pdf). We offer pre-trained weights on ImageNet for both Large and Small networks with depth multiplier 1.0 and resolution 224x224. Our previous [training recipes](https://github.com/pytorch/vision/tree/master/references/classification#mobilenetv3-large--small) have been updated and can be used to easily train the models from scratch (shoutout to Ross Wightman for inspiring some of our training configuration). The Large variant offers a [competitive accuracy](https://github.com/pytorch/vision/blob/master/docs/source/models.rst#classification) comparing to ResNet50 while being over 6x faster on CPU, meaning that it is a good candidate for applications where speed is important. For applications where speed is critical, one can sacrifice further accuracy for speed and use the Small variant which is 15x faster than ResNet50. * **Quantized MobileNetV3 Large:** The quantized version of MobilNetV3 Large reduces the number of parameters by 45% and it is roughly 2.5x faster than the non-quantized version while remaining competitive in [terms of accuracy](https://github.com/pytorch/vision/blob/master/docs/source/models.rst#quantized-models). It was fitted on ImageNet using Quantization Aware Training by iterating on the non-quantized version and it can be trained from scratch using the existing [reference scripts](https://github.com/pytorch/vision/tree/master/references/classification#quantized). 
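Loading the pre-trained weights discussed above is a one-liner per variant. A small sketch against the torchvision 0.9 model builders (the quantized builder lives under `torchvision.models.quantization`):

```python
import torch
from torchvision.models import mobilenet_v3_large, mobilenet_v3_small
from torchvision.models.quantization import mobilenet_v3_large as quantized_mobilenet_v3_large

# Float models (ImageNet weights, depth multiplier 1.0, 224x224 inputs).
large = mobilenet_v3_large(pretrained=True).eval()
small = mobilenet_v3_small(pretrained=True).eval()

# Quantized MobileNetV3 Large, fitted with quantization-aware training.
q_large = quantized_mobilenet_v3_large(pretrained=True, quantize=True).eval()

x = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    print(large(x).argmax(1), q_large(x).argmax(1))
```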
diff --git a/_posts/2021-5-26-torchvision-mobilenet-v3-implementation.md b/_posts/2021-5-26-torchvision-mobilenet-v3-implementation.md index 6496c42a1806..2dfe3bba49c1 100644 --- a/_posts/2021-5-26-torchvision-mobilenet-v3-implementation.md +++ b/_posts/2021-5-26-torchvision-mobilenet-v3-implementation.md @@ -81,7 +81,7 @@ Another important detail is that though PyTorch’s and TensorFlow’s RMSProp i **Increasing our accuracy by tuning hyperparameters & improving our training recipe** -After configuring the optimizer to achieve fast and stable training, we turned into optimizing the accuracy of the model. There are a few techniques that helped us achieve this. First of all, to avoid overfitting we augmented out data using the AutoAugment algorithm, followed by RandomErasing. Additionally we tuned parameters such as the weight decay using cross validation. We also found beneficial to perform [weight averaging](https://github.com/pytorch/vision/blob/674e8140042c2a3cbb1eb9ebad1fa49501599130/references/classification/utils.py#L259) across different epoch checkpoints after the end of the training. Finally, though not used in our published training recipe, we found that using Label Smoothing, Stochastic Depth and LR noise injection improve the overall accuracy by over [1.5 points](https://rwightman.github.io/pytorch-image-models/training_hparam_examples/#mobilenetv3-large-100-75766-top-1-92542-top-5). +After configuring the optimizer to achieve fast and stable training, we turned into optimizing the accuracy of the model. There are a few techniques that helped us achieve this. First of all, to avoid overfitting we augmented out data using the AutoAugment algorithm, followed by RandomErasing. Additionally we tuned parameters such as the weight decay using cross validation. We also found beneficial to perform [weight averaging](https://github.com/pytorch/vision/blob/674e8140042c2a3cbb1eb9ebad1fa49501599130/references/classification/utils.py#L259) across different epoch checkpoints after the end of the training. Finally, though not used in our published training recipe, we found that using Label Smoothing, Stochastic Depth and LR noise injection improve the overall accuracy by over 1.5 points. The graph and table depict a simplified summary of the most important iterations for improving the accuracy of the MobileNetV3 Large variant. Note that the actual number of iterations done while training the model was significantly larger and that the progress in accuracy was not always monotonically increasing. Also note that the Y-axis of the graph starts from 70% instead from 0% to make the difference between iterations more visible: diff --git a/_posts/2021-6-15-pytorch-1.9-new-library-releases.md b/_posts/2021-6-15-pytorch-1.9-new-library-releases.md index eda8ac7b4318..6ed505185db8 100644 --- a/_posts/2021-6-15-pytorch-1.9-new-library-releases.md +++ b/_posts/2021-6-15-pytorch-1.9-new-library-releases.md @@ -150,7 +150,7 @@ We have: For more details, see [the documentation](https://pytorch.org/audio/0.9.0/transforms.html#resample). ### (Prototype) RNN Transducer Loss -The RNN transducer loss is used in training RNN transducer models, which is a popular architecture for speech recognition tasks. The prototype loss in torchaudio currently supports autograd, torchscript, float16 and float32, and can also be run on both CPU and CUDA. 
For more details, please refer to [the documentation](https://pytorch.org/audio/master/rnnt_loss.html). +The RNN transducer loss is used in training RNN transducer models, which is a popular architecture for speech recognition tasks. The prototype loss in torchaudio currently supports autograd, torchscript, float16 and float32, and can also be run on both CPU and CUDA. For more details, please refer to [the documentation](https://pytorch.org/audio/stable/index.html). # TorchText 0.10.0 diff --git a/_posts/2021-6-15-pytorch-1.9-released.md b/_posts/2021-6-15-pytorch-1.9-released.md index 71c3f80e5bc4..9394a036efca 100644 --- a/_posts/2021-6-15-pytorch-1.9-released.md +++ b/_posts/2021-6-15-pytorch-1.9-released.md @@ -91,7 +91,7 @@ We are releasing a new video app based on [PyTorch Video](https://pytorchvideo.o ### (Beta) TorchElastic is now part of core -[TorchElastic](https://github.com/pytorch/pytorch/issues/50621), which was open sourced over a year ago in the [pytorch/elastic](https://github.com/pytorch/elastic) github repository, is a runner and coordinator for PyTorch worker processes. Since then, it has been adopted by various distributed torch use-cases: 1) [deepspeech.pytorch](https://medium.com/pytorch/training-deepspeech-using-torchelastic-ad013539682) 2) [pytorch-lightning](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#torchelastic) 3) [Kubernetes CRD](https://github.com/pytorch/elastic/blob/master/kubernetes/README.md). Now, it is part of PyTorch core. +[TorchElastic](https://github.com/pytorch/pytorch/issues/50621), which was open sourced over a year ago in the [pytorch/elastic](https://github.com/pytorch/elastic) github repository, is a runner and coordinator for PyTorch worker processes. Since then, it has been adopted by various distributed torch use-cases: 1) [deepspeech.pytorch](https://medium.com/pytorch/training-deepspeech-using-torchelastic-ad013539682) 2) pytorch-lightning 3) [Kubernetes CRD](https://github.com/pytorch/elastic/blob/master/kubernetes/README.md). Now, it is part of PyTorch core. As its name suggests, the core function of TorcheElastic is to gracefully handle scaling events. A notable corollary of elasticity is that peer discovery and rank assignment are built into TorchElastic enabling users to run distributed training on preemptible instances without requiring a gang scheduler. As a side note, [etcd](https://etcd.io/) used to be a hard dependency of TorchElastic. With the upstream, this is no longer the case since we have added a “standalone” rendezvous based on c10d::Store. For more details, refer to the [documentation](https://pytorch.org/docs/1.9.0/distributed.elastic.html). 
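To make the TorchElastic upstreaming above concrete for script authors: the elastic launcher (for example `python -m torch.distributed.run --standalone --nproc_per_node=4 train.py` in 1.9) handles rendezvous and sets the rank environment variables, so the worker only reads them. The stub below is a generic sketch under that assumption, not code taken from the release itself.

```python
# train.py - minimal worker intended to be started by the elastic launcher,
# which supplies RANK / WORLD_SIZE / LOCAL_RANK and the rendezvous endpoint.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Each worker contributes one tensor; the all-reduce verifies the job is wired up.
    t = torch.ones(1) * rank
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size} (local {local_rank}) sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```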
diff --git a/_posts/2021-6-16-torchvision-ssd-implementation.md b/_posts/2021-6-16-torchvision-ssd-implementation.md index 40ce299b1ef0..3f55188b4847 100644 --- a/_posts/2021-6-16-torchvision-ssd-implementation.md +++ b/_posts/2021-6-16-torchvision-ssd-implementation.md @@ -2,7 +2,6 @@ layout: blog_detail title: 'Everything You Need To Know About Torchvision’s SSD Implementation' author: Vasilis Vryniotis -featured-img: 'assets/images/prediction-examples.png' --- In TorchVision v0.10, we’ve released two new Object Detection models based on the SSD architecture. Our plan is to cover the key implementation details of the algorithms along with information on how they were trained in a two-part article. diff --git a/_posts/2021-8-18-pipetransformer-automated-elastic-pipelining.md b/_posts/2021-8-18-pipetransformer-automated-elastic-pipelining.md index a682c3a13382..02c73d77541b 100644 --- a/_posts/2021-8-18-pipetransformer-automated-elastic-pipelining.md +++ b/_posts/2021-8-18-pipetransformer-automated-elastic-pipelining.md @@ -70,13 +70,13 @@ Finally, we have also developed open-source flexible APIs for PipeTransformer, w Suppose we aim to train a massive model in a distributed training system where the hybrid of pipelined model parallelism and data parallelism is used to target scenarios where either the memory of a single GPU device cannot hold the model, or if loaded, the batch size is small enough to avoid running out of memory. More specifically, we define our settings as follows: -Training task and model definition. We train Transformer models (e.g., Vision Transformer, BERT on large-scale image or text datasets. The Transformer model has layers, in which the th layer is composed of a forward computation function and a corresponding set of parameters. +Training task and model definition. We train Transformer models (e.g., Vision Transformer, BERT on large-scale image or text datasets. The Transformer model mathcal{F} has L layers, in which the i th layer is composed of a forward computation function f_i and a corresponding set of parameters. -Training infrastructure. Assume the training infrastructure contains a GPU cluster that has GPU servers (i.e. nodes). Each node has GPUs. Our cluster is homogeneous, meaning that each GPU and server have the same hardware configuration. Each GPU's memory capacity is . Servers are connected by a high bandwidth network interface such as InfiniBand interconnect. +Training infrastructure. Assume the training infrastructure contains a GPU cluster that has N GPU servers (i.e. nodes). Each node has I GPUs. Our cluster is homogeneous, meaning that each GPU and server have the same hardware configuration. Each GPU's memory capacity is M_\text{GPU}. Servers are connected by a high bandwidth network interface such as InfiniBand interconnect. -Pipeline parallelism. In each machine, we load a model into a pipeline which has partitions ( also represents the pipeline length). The th partition consists of consecutive layers. We assume each partition is handled by a single GPU device. , meaning that we can build multiple pipelines for multiple model replicas in a single machine. We assume all GPU devices in a pipeline belonging to the same machine. Our pipeline is a synchronous pipeline, which does not involve stale gradients, and the number of micro-batches is . In the Linux OS, each pipeline is handled by a single process. We refer the reader to GPipe [10] for more details. +Pipeline parallelism. 
In each machine, we load a model \mathcal{F} into a pipeline \mathcal{P} which has Kpartitions (K also represents the pipeline length). The kth partition p_k consists of consecutive layers. We assume each partition is handled by a single GPU device. 1 \leq K \leq I, meaning that we can build multiple pipelines for multiple model replicas in a single machine. We assume all GPU devices in a pipeline belonging to the same machine. Our pipeline is a synchronous pipeline, which does not involve stale gradients, and the number of micro-batches is M. In the Linux OS, each pipeline is handled by a single process. We refer the reader to GPipe [10] for more details. -Data parallelism. DDP is a cross-machine distributed data-parallel process group within parallel workers. Each worker is a pipeline replica (a single process). The th worker's index (ID) is rank . For any two pipelines in DDP, they can belong to either the same GPU server or different GPU servers, and they can exchange gradients with the AllReduce algorithm. +Data parallelism. DDP is a cross-machine distributed data-parallel process group within R parallel workers. Each worker is a pipeline replica (a single process). The rth worker's index (ID) is rank r. For any two pipelines in DDP, they can belong to either the same GPU server or different GPU servers, and they can exchange gradients with the AllReduce algorithm. Under these settings, our goal is to accelerate training by leveraging freeze training, which does not require all layers to be trained throughout the duration of the training. Additionally, it may help save computation, communication, memory cost, and potentially prevent overfitting by consecutively freezing layers. However, these benefits can only be achieved by overcoming the four challenges of designing an adaptive freezing algorithm, dynamical pipeline re-partitioning, efficient resource reallocation, and cross-process caching, as discussed in the introduction. @@ -131,9 +131,9 @@ In dynamic training system such as PipeTransformer, maintaining optimally balanc Figure 6. The partition boundary is in the middle of a skip connection

-1. Cross-partition communication overhead. Placing a partition boundary in the middle of a skip connection leads to additional communications since tensors in the skip connection must now be copied to a different GPU. For example, with BERT partitions in Figure 6, partition must take intermediate outputs from both partition and partition . In contrast, if the boundary is placed after the addition layer, the communication overhead between partition and is visibly smaller. Our measurements show that having cross-device communication is more expensive than having slightly imbalanced partitions (see the Appendix in our paper). Therefore, we do not consider breaking skip connections (highlighted separately as an entire attention layer and MLP layer in green color at line 7 in Algorithm 1. +1. Cross-partition communication overhead. Placing a partition boundary in the middle of a skip connection leads to additional communications since tensors in the skip connection must now be copied to a different GPU. For example, with BERT partitions in Figure 6, partition k must take intermediate outputs from both partition k-2 and partition k-1. In contrast, if the boundary is placed after the addition layer, the communication overhead between partition k-1 and k is visibly smaller. Our measurements show that having cross-device communication is more expensive than having slightly imbalanced partitions (see the Appendix in our paper). Therefore, we do not consider breaking skip connections (highlighted separately as an entire attention layer and MLP layer in green color at line 7 in Algorithm 1. -2. Frozen layer memory footprint. During training, AutoPipe must recompute partition boundaries several times to balance two distinct types of layers: frozen layers and active layers. The frozen layer's memory cost is a fraction of that inactive layer, given that the frozen layer does not need backward activation maps, optimizer states, and gradients. Instead of launching intrusive profilers to obtain thorough metrics on memory and computational cost, we define a tunable cost factor to estimate the memory footprint ratio of a frozen layer over the same active layer. Based on empirical measurements in our experimental hardware, we set it to . +2. Frozen layer memory footprint. During training, AutoPipe must recompute partition boundaries several times to balance two distinct types of layers: frozen layers and active layers. The frozen layer's memory cost is a fraction of that inactive layer, given that the frozen layer does not need backward activation maps, optimizer states, and gradients. Instead of launching intrusive profilers to obtain thorough metrics on memory and computational cost, we define a tunable cost factor lambda_{\text{frozen}} to estimate the memory footprint ratio of a frozen layer over the same active layer. Based on empirical measurements in our experimental hardware, we set it to \frac{1}{6}. @@ -142,33 +142,33 @@ Figure 6. The partition boundary is in the middle of a skip connection

-Based on the above two considerations, AutoPipe balances pipeline partitions based on parameter sizes. More specifically, AutoPipe uses a greedy algorithm to allocate all frozen and active layers to evenly distribute partitioned sublayers into GPU devices. Pseudocode is described as the `load\_balance()` function in Algorithm 1. The frozen layers are extracted from the original model and kept in a separate model instance in the first device of a pipeline. +Based on the above two considerations, AutoPipe balances pipeline partitions based on parameter sizes. More specifically, AutoPipe uses a greedy algorithm to allocate all frozen and active layers to evenly distribute partitioned sublayers into K GPU devices. Pseudocode is described as the `load\_balance()` function in Algorithm 1. The frozen layers are extracted from the original model and kept in a separate model instance \mathcal{F}_{\text{frozen}} in the first device of a pipeline. Note that the partition algorithm employed in this paper is not the only option; PipeTransformer is modularized to work with any alternatives. ### Pipeline Compression -Pipeline compression helps to free up GPUs to accommodate more pipeline replicas and reduce the number of cross-device communications between partitions. To determine the timing of compression, we can estimate the memory cost of the largest partition after compression, and then compare it with that of the largest partition of a pipeline at timestep . To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint. Based on this simplification, the criterion of pipeline compression is as follows: +Pipeline compression helps to free up GPUs to accommodate more pipeline replicas and reduce the number of cross-device communications between partitions. To determine the timing of compression, we can estimate the memory cost of the largest partition after compression, and then compare it with that of the largest partition of a pipeline at timestep T=0. To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint. Based on this simplification, the criterion of pipeline compression is as follows:


-Once the freeze notification is received, AutoPipe will always attempt to divide the pipeline length by 2 (e.g., from 8 to 4, then 2). By using as the input, the compression algorithm can verify if the result satisfies the criterion in Equation (1). Pseudocode is shown in lines 25-33 in Algorithm 1. Note that this compression makes the acceleration ratio exponentially increase during training, meaning that if a GPU server has a larger number of GPUs (e.g., more than 8), the acceleration ratio will be further amplified. +Once the freeze notification is received, AutoPipe will always attempt to divide the pipeline length K by 2 (e.g., from 8 to 4, then 2). By using \frac{K}{2} as the input, the compression algorithm can verify if the result satisfies the criterion in Equation (1). Pseudocode is shown in lines 25-33 in Algorithm 1. Note that this compression makes the acceleration ratio exponentially increase during training, meaning that if a GPU server has a larger number of GPUs (e.g., more than 8), the acceleration ratio will be further amplified.
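As a rough illustration of the compression check described above, the sketch below halves the pipeline length and accepts the result only if the criterion holds; `repartition` and `params_per_partition` are hypothetical helper names used for illustration, not PipeTransformer's actual API.

```python
# Hypothetical sketch of the compression criterion, using parameter counts as
# the memory proxy; repartition() and params_per_partition() are illustrative
# helpers, not functions from the PipeTransformer codebase.
def maybe_compress(pipeline_len, partitions, max_params_at_t0):
    """Halve the pipeline length K if the largest re-balanced partition
    would stay within the largest partition measured at timestep T=0."""
    candidate_len = max(pipeline_len // 2, 1)
    candidate = repartition(partitions, candidate_len)  # re-run load_balance over K/2 devices
    if max(params_per_partition(p) for p in candidate) <= max_params_at_t0:
        return candidate_len, candidate   # criterion in Equation (1) holds: compress
    return pipeline_len, partitions       # otherwise keep the current pipeline
```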


-Figure 7. Pipeline Bubble: , and denote forward, backward, and the optimizer update of micro-batch on device , respectively. The total bubble size in each iteration is times per micro-batch forward and backward cost. +Figure 7. Pipeline Bubble: F_{d,b}, B_{d,b}, and U_d denote forward, backward, and the optimizer update of micro-batch b on device d, respectively. The total bubble size in each iteration is (K-1) times per micro-batch forward and backward cost.

-Additionally, such a technique can also speed up training by shrinking the size of pipeline bubbles. To explain bubble sizes in a pipeline, Figure 7 depicts how 4 micro-batches run through a 4-device pipeline . In general, the total bubble size is times per micro-batch forward and backward cost. Therefore, it is clear that shorter pipelines have smaller bubble sizes. +Additionally, such a technique can also speed up training by shrinking the size of pipeline bubbles. To explain bubble sizes in a pipeline, Figure 7 depicts how 4 micro-batches run through a 4-device pipeline (K = 4). In general, the total bubble size is (K-1) times per micro-batch forward and backward cost. Therefore, it is clear that shorter pipelines have smaller bubble sizes. ### Dynamic Number of Micro-Batches -Prior pipeline parallel systems use a fixed number of micro-batches per mini-batch ( ). GPipe suggests , where is the number of partitions (pipeline length). However, given that PipeTransformer dynamically configures , we find it to be sub-optimal to maintain a static during training. Moreover, when integrated with DDP, the value of also has an impact on the efficiency of DDP gradient synchronizations. Since DDP must wait for the last micro-batch to finish its backward computation on a parameter before launching its gradient synchronization, finer micro-batches lead to a smaller overlap between computation and communication. Hence, instead of using a static value, PipeTransformer searches for optimal on the fly in the hybrid of DDP environment by enumerating values ranging from to . For a specific training environment, the profiling needs only to be done once (see Algorithm 1 line 35). +Prior pipeline parallel systems use a fixed number of micro-batches per mini-batch (M). GPipe suggests M \geq 4 \times K, where K is the number of partitions (pipeline length). However, given that PipeTransformer dynamically configures K, we find it to be sub-optimal to maintain a static M during training. Moreover, when integrated with DDP, the value of M also has an impact on the efficiency of DDP gradient synchronizations. Since DDP must wait for the last micro-batch to finish its backward computation on a parameter before launching its gradient synchronization, finer micro-batches lead to a smaller overlap between computation and communication. Hence, instead of using a static value, PipeTransformer searches for optimal M on the fly in the hybrid DDP environment by enumerating M values ranging from K to 6K. For a specific training environment, the profiling needs only to be done once (see Algorithm 1 line 35). For the complete source code, please refer to `https://github.com/Distributed-AI/PipeTransformer/blob/master/pipe_transformer/pipe/auto_pipe.py`. @@ -240,7 +240,7 @@ For the complete source code, please refer to `https://github.com/Distributed-AI This section first summarizes experiment setups and then evaluates PipeTransformer using computer vision and natural language processing tasks. -Hardware. Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory). GPU-to-GPU bandwidth within a machine (PCI 3.0, 16 lanes) is GB/s. +Hardware. Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory). 
GPU-to-GPU bandwidth within a machine (PCI 3.0, 16 lanes) is 15.754GB/s. Implementation. We used PyTorch Pipe as a building block. The BERT model definition, configuration, and related tokenizer are from HuggingFace 3.5.0. We implemented Vision Transformer using PyTorch by following its TensorFlow implementation. More details can be found in our source code. @@ -250,7 +250,7 @@ This section first summarizes experiment setups and then evaluates PipeTransform Baseline. Experiments in this section compare PipeTransformer to the state-of-the-art framework, a hybrid scheme of PyTorch Pipeline (PyTorch’s implementation of GPipe) and PyTorch DDP. Since this is the first paper that studies accelerating distributed training by freezing layers, there are no perfectly aligned counterpart solutions yet. -Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64, respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in Appendix. +Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16 \times 16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64, respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in Appendix. ## Overall Training Acceleration

@@ -258,7 +258,7 @@ This section first summarizes experiment setups and then evaluates PipeTransform

-We summarize the overall experimental results in the table above. Note that the speedup we report is based on a conservative value that can obtain comparable or even higher accuracy. A more aggressive (, ) can obtain a higher speedup but may lead to a slight loss in accuracy. Note that the model size of BERT (24 layers) is larger than ViT-B/16 (12 layers), thus it takes more time for communication. +We summarize the overall experimental results in the table above. Note that the speedup we report is based on a conservative \alpha value (\frac{1}{3}) that can obtain comparable or even higher accuracy. A more aggressive \alpha (\frac{2}{5}, \frac{1}{2}) can obtain a higher speedup but may lead to a slight loss in accuracy. Note that the model size of BERT (24 layers) is larger than ViT-B/16 (12 layers), thus it takes more time for communication. ## Performance Analysis @@ -278,15 +278,15 @@ To understand the efficacy of all four components and their impacts on training 2. AutoCache's contribution is amplified by AutoDP; 3. freeze training alone without system-wise adjustment even downgrades the training speed. -### Tuning in Freezing Algorithm +### Tuning \alpha in Freezing Algorithm


-Figure 10. Tuning in Freezing Algorithm +Figure 10. Tuning \alpha in Freezing Algorithm

-We ran experiments to show how the in the freeze algorithms influences training speed. The result clearly demonstrates that a larger (excessive freeze) leads to a greater speedup but suffers from a slight performance degradation. In the case shown in Figure 10, where , freeze training outperforms normal training and obtains a -fold speedup. We provide more results in the Appendix. +We ran experiments to show how the \alpha in the freeze algorithms influences training speed. The result clearly demonstrates that a larger \alpha (excessive freeze) leads to a greater speedup but suffers from a slight performance degradation. In the case shown in Figure 10, where \alpha=1/5, freeze training outperforms normal training and obtains a 2.04-fold speedup. We provide more results in the Appendix. ### Optimal Chunks in the elastic pipeline @@ -296,7 +296,7 @@ We ran experiments to show how the for different pipeline lengths . Results are summarized in Figure 11. As we can see, different values lead to different optimal , and the throughput gaps across different M values are large (as shown when ), which confirms the necessity of an anterior profiler in elastic pipelining. +We profiled the optimal number of micro-batches M for different pipeline lengths K. Results are summarized in Figure 11. As we can see, different K values lead to different optimal M, and the throughput gaps across different M values are large (as shown when K=8), which confirms the necessity of an anterior profiler in elastic pipelining. ### Understanding the Timing of Caching @@ -306,7 +306,7 @@ We profiled the optimal number of micro-batches (blue) with the training job without AutoCache (red). Figure 12 shows that enabling caching too early can slow down training, as caching can be more expensive than the forward propagation on a small number of frozen layers. After more layers are frozen, caching activations clearly outperform the corresponding forward propagation. As a result, AutoCache uses a profiler to determine the proper timing to enable caching. In our system, for ViT (12 layers), caching starts from 3 frozen layers, while for BERT (24 layers), caching starts from 5 frozen layers. +To evaluate AutoCache, we compared the sample throughput of training that activates AutoCache from epoch 0 (blue) with the training job without AutoCache (red). Figure 12 shows that enabling caching too early can slow down training, as caching can be more expensive than the forward propagation on a small number of frozen layers. After more layers are frozen, caching activations clearly outperform the corresponding forward propagation. As a result, AutoCache uses a profiler to determine the proper timing to enable caching. In our system, for ViT (12 layers), caching starts from 3 frozen layers, while for BERT (24 layers), caching starts from 5 frozen layers. For more detailed experimental analysis, please refer to our paper. diff --git a/_posts/2021-8-3-pytorch-profiler-1.9-released.md b/_posts/2021-8-3-pytorch-profiler-1.9-released.md index 25017a759208..9b820ff60416 100644 --- a/_posts/2021-8-3-pytorch-profiler-1.9-released.md +++ b/_posts/2021-8-3-pytorch-profiler-1.9-released.md @@ -202,7 +202,7 @@ Jump to source is ONLY available when Tensorboard is launched within VS Code. St

Gify: Jump to Source using Visual Studio Code Plug In UI

-For how to optimize batch size performance, check out the step-by-step tutorial [here](https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/). PyTorch Profiler is also integrated with [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/advanced/profiler.html#pytorch-profiling) and you can simply launch your lightning training jobs with --```trainer.profiler=pytorch``` flag to generate the traces. Check out an example [here](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py). +For how to optimize batch size performance, check out the step-by-step tutorial [here](https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/). PyTorch Profiler is also integrated with PyTorch Lightning and you can simply launch your lightning training jobs with the ```--trainer.profiler=pytorch``` flag to generate the traces. ## What’s Next for the PyTorch Profiler? You just saw how PyTorch Profiler can help optimize a model. You can now try the Profiler by ```pip install torch-tb-profiler``` to optimize your PyTorch model. diff --git a/_posts/2022-10-13-scaling-pytorch-models-on-cloud-tpus-with-fsdp.md b/_posts/2022-10-13-scaling-pytorch-models-on-cloud-tpus-with-fsdp.md index fe852ce4cf70..4f07564cbc86 100644 --- a/_posts/2022-10-13-scaling-pytorch-models-on-cloud-tpus-with-fsdp.md +++ b/_posts/2022-10-13-scaling-pytorch-models-on-cloud-tpus-with-fsdp.md @@ -36,7 +36,7 @@ Wrapping an `nn.Module` instance with `XlaFullyShardedDataParallel` enables the **Model checkpoint saving and loading** for models and optimizers can be done like before by saving and loading their `.state_dict()`. Meanwhile, each training process should save its own checkpoint file of the sharded model parameters and optimizer states, and load the checkpoint file for the corresponding rank when resuming (regardless of ZeRO-2 or ZeRO-3, i.e. nested wrapping or not). A command line tool and a Python interface are provided to consolidate the sharded model checkpoint files together into a full/unsharded model checkpoint file. -[**Gradient checkpointing**](https://spell.ml/blog/gradient-checkpointing-pytorch-YGypLBAAACEAefHs) (also referred to as "activation checkpointing" or "rematerialization") is another common technique for model scaling and can be used in conjunction with FSDP. We provide `checkpoint_module`, a wrapper function over a given `nn.Module` instance for gradient checkpointing (based on `torch_xla.utils.checkpoint.checkpoint`). +**Gradient checkpointing** (also referred to as "activation checkpointing" or "rematerialization") is another common technique for model scaling and can be used in conjunction with FSDP. We provide `checkpoint_module`, a wrapper function over a given `nn.Module` instance for gradient checkpointing (based on `torch_xla.utils.checkpoint.checkpoint`). The MNIST and ImageNet examples below provide illustrative usages of (plain or nested) FSDP, saving and consolidation of model checkpoints, as well as gradient checkpointing. 
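As a rough sketch of how the `checkpoint_module` wrapper described above might be combined with FSDP (the layer sizes are invented, and the import path is assumed to be the FSDP utilities in `torch_xla.distributed.fsdp`):

```python
# A minimal sketch, assuming XlaFullyShardedDataParallel and checkpoint_module
# are exposed by torch_xla.distributed.fsdp; the block itself is illustrative.
import torch.nn as nn
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP, checkpoint_module

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Recompute this block's activations in the backward pass instead of storing
# them, then shard its parameters and gradients with FSDP.
sharded_block = FSDP(checkpoint_module(block))
```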
diff --git a/_posts/2022-10-28-new-library-updates-in-pytorch-1.13.md b/_posts/2022-10-28-new-library-updates-in-pytorch-1.13.md index 3c28b3aa7449..4882d1538079 100644 --- a/_posts/2022-10-28-new-library-updates-in-pytorch-1.13.md +++ b/_posts/2022-10-28-new-library-updates-in-pytorch-1.13.md @@ -110,7 +110,7 @@ In this release, we further consolidated the API for `DataLoader2` and a [detail We extended our support to load data from additional cloud storage providers via DataPipes, now covering AWS, Google Cloud Storage, and Azure. A [tutorial is also available](https://pytorch.org/data/0.5/tutorial.html#working-with-cloud-storage-providers). We are open to feedback and feature requests. -We also performed a simple benchmark, comparing the performance of data loading from AWS S3 and attached volume on an AWS EC2 instance. The results are [visible here](https://github.com/pytorch/data/blob/gh/NivekT/100/head/benchmarks/cloud/aws_s3_results.md). +We also performed a simple benchmark, comparing the performance of data loading from AWS S3 and attached volume on an AWS EC2 instance. ### torch::deploy (Beta) @@ -154,7 +154,7 @@ torch::deploy now has basic support for aarch64 Linux systems. TorchEval is a library built for users who want highly performant implementations of common metrics to evaluate machine learning models. It also provides an easy to use interface for building custom metrics with the same toolkit. Building your metrics with TorchEval makes running distributed training loops with [torch.distributed](https://pytorch.org/docs/stable/distributed.html) a breeze. -Learn more with our [docs](https://pytorch.org/torcheval), see our [examples](https://pytorch.org/torcheval/metric_example.html), or check out our [GitHub repo](http://github.com/pytorch/torcheval). +Learn more with our [docs](https://pytorch.org/torcheval), see our [examples](https://pytorch.org/torcheval/stable/metric_example.html), or check out our [GitHub repo](http://github.com/pytorch/torcheval). ### TorchMultimodal Release (Beta) diff --git a/_posts/2022-11-22-effective-multi-objective-nueral-architecture.md b/_posts/2022-11-22-effective-multi-objective-nueral-architecture.md index 96d3ba38da25..ab3643a4873b 100644 --- a/_posts/2022-11-22-effective-multi-objective-nueral-architecture.md +++ b/_posts/2022-11-22-effective-multi-objective-nueral-architecture.md @@ -108,7 +108,7 @@ Ax has a number of other advanced capabilities that we did not discuss in our tu ### Early Stopping -When evaluating a new candidate configuration, partial learning curves are typically available while the NN training job is running. We can use the information contained in the partial curves to identify under-performing trials to stop early in order to free up computational resources for more promising candidates. While not demonstrated in the above tutorial, Ax supports early stopping out-of-the-box - see our [early stopping tutorial](https://ax.dev/versions/latest/tutorials/early_stopping/early_stopping.html) for more details. +When evaluating a new candidate configuration, partial learning curves are typically available while the NN training job is running. 
We can use the information contained in the partial curves to identify under-performing trials to stop early in order to free up computational resources for more promising candidates. While not demonstrated in the above tutorial, Ax supports early stopping out-of-the-box - see our [early stopping tutorial](https://ax.dev/versions/latest/tutorials/early_stopping/early_stopping.html) for more details. +When evaluating a new candidate configuration, partial learning curves are typically available while the NN training job is running. We can use the information contained in the partial curves to identify under-performing trials to stop early in order to free up computational resources for more promising candidates. While not demonstrated in the above tutorial, Ax supports early stopping out-of-the-box. ### High-dimensional search spaces diff --git a/_posts/2022-11-28-optimizing-production-pytorch-performance-with-graph-transformations.md b/_posts/2022-11-28-optimizing-production-pytorch-performance-with-graph-transformations.md index ad7593b253c4..93fda1037bd4 100644 --- a/_posts/2022-11-28-optimizing-production-pytorch-performance-with-graph-transformations.md +++ b/_posts/2022-11-28-optimizing-production-pytorch-performance-with-graph-transformations.md @@ -2,7 +2,6 @@ layout: blog_detail title: "Optimizing Production PyTorch Models’ Performance with Graph Transformations" author: Jade Nie, CK Luk, Xiaodong Wang, Jackie (Jiaqi) Xu -featured-img: "assets/images/blog1-3b.png" --- ## 1. Introduction diff --git a/_posts/2022-2-8-quantization-in-practice.md b/_posts/2022-2-8-quantization-in-practice.md index b95b6f4f7608..43c9aeb1f73f 100644 --- a/_posts/2022-2-8-quantization-in-practice.md +++ b/_posts/2022-2-8-quantization-in-practice.md @@ -45,9 +45,9 @@ where [] is the cl ### Calibration -The process of choosing the input clipping range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running mininmum and maximum values and assign them to and . [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range. +The process of choosing the input clipping range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to \alpha and \beta. TensorRT also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range. -In PyTorch, `Observer` modules ([docs](https://PyTorch.org/docs/stable/torch.quantization.html?highlight=observer#observers), [code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams . Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later). +In PyTorch, `Observer` modules ([code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later). 
```python from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver @@ -166,7 +166,7 @@ torch.backends.quantized.engine = backend ### QConfig -The `QConfig` ([code](https://github.com/PyTorch/PyTorch/blob/d6b15bfcbdaff8eb73fa750ee47cef4ccee1cd92/torch/ao/quantization/qconfig.py#L165), [docs](https://pytorch.org/docs/stable/torch.quantization.html?highlight=qconfig#torch.quantization.QConfig)) NamedTuple stores the Observers and the quantization schemes used to quantize activations and weights. +The `QConfig` ([code](https://github.com/PyTorch/PyTorch/blob/d6b15bfcbdaff8eb73fa750ee47cef4ccee1cd92/torch/ao/quantization/qconfig.py#L165)) NamedTuple stores the Observers and the quantization schemes used to quantize activations and weights. Be sure to pass the Observer class (not the instance), or a callable that can return Observer instances. Use `with_args()` to override the default arguments. diff --git a/_posts/2022-3-10-pytorch-1.11-new-library-releases.md b/_posts/2022-3-10-pytorch-1.11-new-library-releases.md index a7b08f2312f0..4c5f9c328f71 100644 --- a/_posts/2022-3-10-pytorch-1.11-new-library-releases.md +++ b/_posts/2022-3-10-pytorch-1.11-new-library-releases.md @@ -61,7 +61,7 @@ Special thanks to Yangyang Shi, Jay Mahadeokar, and Gil Keren for their code con #### (Beta) HuBERT Pretrain Model -The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds [HuBERTPretrainModel](https://github.com/pytorch/audio/blob/main/torchaudio/models/wav2vec2/model.py#L120-L205) and corresponding factory functions ([hubert_pretrain_base](https://github.com/pytorch/audio/blob/main/torchaudio/models/wav2vec2/model.py#L964-L1027), [hubert_pretrain_large](https://github.com/pytorch/audio/blob/main/torchaudio/models/wav2vec2/model.py#L1030-L1090), and [hubert_pretrain_xlarge](https://github.com/pytorch/audio/blob/main/torchaudio/models/wav2vec2/model.py#L1093-L1153)) to enable training from scratch. +The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch. #### (Prototype) CTC Beam Search Decoder @@ -69,7 +69,7 @@ In recent releases, TorchAudio has added support for ASR models fine-tuned on CT The CTC decoder in TorchAudio supports customizable beam search decoding with lexicon constraint. It also has optional KenLM language model support. -For more details, please check out the [API tutorial](https://pytorch.org/audio/main/tutorials/asr_inference_with_ctc_decoder_tutorial.html) and [documentation](https://pytorch.org/audio/main/prototype.ctc_decoder.html). This prototype feature is available through nightly builds. +For more details, please check out the [API tutorial](https://pytorch.org/audio/main/tutorials/asr_inference_with_ctc_decoder_tutorial.html). This prototype feature is available through nightly builds. 
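Circling back to the `QConfig` paragraph above, a minimal sketch of building one with `with_args()` could look like this; the observer and scheme choices are purely illustrative, not a recommendation:

```python
# Illustrative QConfig: pass Observer classes configured via with_args(), not instances.
import torch
from torch.quantization import QConfig
from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver

my_qconfig = QConfig(
    activation=MovingAverageMinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
    weight=MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)
```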
#### (Prototype) Streaming API @@ -77,7 +77,7 @@ TorchAudio started as simple audio I/O APIs that supplement PyTorch. With the re Streaming API makes it easy to develop and test the model in online inference. It utilizes ffmpeg under the hood, and enables reading media from online services and hardware devices, decoding media in an incremental manner, and applying filters and preprocessing. -Please checkout the [API tutorial](https://pytorch.org/audio/main/tutorials/streaming_api_tutorial.html) and [the documentation](https://pytorch.org/audio/main/prototype.io.html). There are also the [streaming ASR](https://pytorch.org/audio/main/tutorials/online_asr_tutorial.html) tutorial and the [device streaming ASR tutorial](https://pytorch.org/audio/main/tutorials/device_asr.html). This feature is available from nightly releases. Please refer to [pytorch.org](https://pytorch.org/get-started/locally/) for how to install nightly builds. +Please check out the [API tutorial](https://pytorch.org/audio/main/) and [the documentation](https://pytorch.org/audio/main/). There are also the [streaming ASR](https://pytorch.org/audio/main/tutorials/online_asr_tutorial.html) tutorial and the [device streaming ASR tutorial](https://pytorch.org/audio/main/tutorials/device_asr.html). This feature is available from nightly releases. Please refer to [pytorch.org](https://pytorch.org/get-started/locally/) for how to install nightly builds. ## TorchText 0.12 diff --git a/_posts/2022-3-10-pytorch-1.11-released.md b/_posts/2022-3-10-pytorch-1.11-released.md index 3da56acac66e..179cdffc350a 100644 --- a/_posts/2022-3-10-pytorch-1.11-released.md +++ b/_posts/2022-3-10-pytorch-1.11-released.md @@ -19,11 +19,11 @@ We are delighted to present the Beta release of [TorchData](https://github.com/p A `DataPipe` takes in some access function over Python data structures, `__iter__` for `IterDataPipe` and `__getitem__` for `MapDataPipe`, and returns a new access function with a slight transformation applied. You can chain multiple DataPipes together to form a data pipeline that performs all the necessary data transformation. -We have implemented over 50 DataPipes that provide different core functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the [fsspec](https://pytorch.org/data/0.3.0/torchdata.datapipes.iter.html#io-datapipes) and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each [IterDataPipe](https://pytorch.org/data/0.3.0/torchdata.datapipes.iter.html) and [MapDataPipe](https://pytorch.org/data/0.3.0/torchdata.datapipes.map.html). +We have implemented over 50 DataPipes that provide different core functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe and MapDataPipe. 
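To make the chaining idea above concrete, here is a small hedged sketch; the file pattern, delimiter, and batch size are made up for illustration:

```python
# Hedged sketch of chaining IterDataPipes; assumes some local CSV files exist.
from torchdata.datapipes.iter import FileLister, FileOpener

dp = FileLister(root="data/", masks="*.csv")   # list matching files
dp = FileOpener(dp, mode="rt")                 # open each file as text
dp = dp.parse_csv(delimiter=",")               # functional form of the CSV-parsing DataPipe
dp = dp.shuffle().batch(8)                     # shuffle samples, then batch them

for batch in dp:
    pass  # hand batches to DataLoader or a training loop
```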
-In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the [popular datasets provided by the library](https://github.com/pytorch/text/tree/release/0.12/torchtext/datasets) are implemented using DataPipes and a [section of its SST-2 binary text classification tutorial](https://pytorch.org/text/0.12.0/tutorials/sst2_classification_non_distributed.html#dataset) demonstrates how you can use DataPipes to preprocess data for your model. There also are other prototype implementations of datasets with DataPipes in [TorchVision (available in nightly releases)](https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin) and in [TorchRec](https://pytorch.org/torchrec/torchrec.datasets.html). You can find more [specific examples here](https://pytorch.org/data/0.3.0/examples.html). +In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the [popular datasets provided by the library](https://github.com/pytorch/text/tree/release/0.12/torchtext/datasets) are implemented using DataPipes and a [section of its SST-2 binary text classification tutorial](https://pytorch.org/text/0.12.0/tutorials/sst2_classification_non_distributed.html#dataset) demonstrates how you can use DataPipes to preprocess data for your model. There also are other prototype implementations of datasets with DataPipes in [TorchVision (available in nightly releases)](https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin) and in [TorchRec](https://pytorch.org/torchrec/torchrec.datasets.html). -The [documentation for TorchData](https://pytorch.org/data) is now live. It contains a tutorial that covers [how to use DataPipes](https://pytorch.org/data/0.3.0/tutorial.html#using-datapipes), [use them with DataLoader](https://pytorch.org/data/0.3.0/tutorial.html#working-with-dataloader), and [implement custom ones](https://pytorch.org/data/0.3.0/tutorial.html#implementing-a-custom-datapipe). FAQs and future plans related to DataLoader are described in [our project’s README file](https://github.com/pytorch/data#readme). +The [documentation for TorchData](https://pytorch.org/data) is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones. FAQs and future plans related to DataLoader are described in [our project’s README file](https://github.com/pytorch/data#readme). ## Introducing functorch diff --git a/_posts/2022-6-28-pytorch-1.12-new-library-releases.md b/_posts/2022-6-28-pytorch-1.12-new-library-releases.md index e2155b7c8e12..0070c1372b15 100644 --- a/_posts/2022-6-28-pytorch-1.12-new-library-releases.md +++ b/_posts/2022-6-28-pytorch-1.12-new-library-releases.md @@ -372,15 +372,15 @@ We added optimized kernels to speed up [TorchRec JaggedTensor](https://pytorch.o

-We added ops for [converting jagged tensors from sparse to dense formats](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops_cpu.cpp#L982) [and back](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops_cpu.cpp#L968), performing [matrix multiplications with jagged tensors](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops_cpu.cpp#L996), and [elementwise ops](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops_cpu.cpp#L995). +We added ops for converting jagged tensors from sparse to dense formats and back, performing matrix multiplications with jagged tensors, and elementwise ops. ### Optimized permute102-baddbmm-permute102 -It is difficult to fuse various matrix multiplications where the batch size is not the batch size of the model, switching the batch dimension is a quick solution. We created the [permute102_baddbmm_permute102](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/sparse_ops_cpu.cpp#L2401) operation that switches the first and the second dimension, performs the batched matrix multiplication and then switches back. Currently we only support forward pass with FP16 data type and will support FP32 type and backward pass in the future. +It is difficult to fuse various matrix multiplications where the batch size is not the batch size of the model; switching the batch dimension is a quick solution. We created the permute102_baddbmm_permute102 operation that switches the first and the second dimension, performs the batched matrix multiplication and then switches back. Currently we only support forward pass with FP16 data type and will support FP32 type and backward pass in the future. ### Optimized index_select for dim 0 index selection -index_select is normally used as part of a sparse operation. While PyTorch supports a generic index_select for an arbitrary-dimension index selection, its performance for a special case like the dim 0 index selection is suboptimal. For this reason, we implement a [specialized index_select for dim 0](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/sparse_ops_cpu.cpp#L2421). In some cases, we have observed 1.4x performance gain from FBGEMM’s index_select compared to the one from PyTorch (using uniform index distribution). +index_select is normally used as part of a sparse operation. While PyTorch supports a generic index_select for an arbitrary-dimension index selection, its performance for a special case like the dim 0 index selection is suboptimal. For this reason, we implement a specialized index_select for dim 0. In some cases, we have observed a 1.4x performance gain from FBGEMM’s index_select compared to the one from PyTorch (using uniform index distribution). More about the implementation of influential instances can be found on our [GitHub](https://github.com/pytorch/captum/tree/master/captum/influence) page and [tutorials](https://captum.ai/tutorials/TracInCP_Tutorial). 
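For intuition about the permute102-baddbmm-permute102 fusion described above, the following is a rough unfused equivalent written with separate PyTorch ops; the shapes are invented, and the actual FBGEMM kernel fuses these steps (currently targeting FP16 in the forward pass):

```python
# Unfused reference for what a permute102_baddbmm_permute102-style op computes;
# the FBGEMM kernel performs these three steps in one fused call.
import torch

M, B, K, N = 8, 4, 32, 16          # illustrative sizes; the batch B sits in dim 1
x = torch.randn(M, B, K)
w = torch.randn(B, K, N)
bias = torch.randn(B, 1, N)

y = x.permute(1, 0, 2)              # "102": move the batch dim to the front -> (B, M, K)
y = torch.baddbmm(bias, y, w)       # batched matmul plus bias -> (B, M, N)
y = y.permute(1, 0, 2)              # switch back -> (M, B, N)
```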
diff --git a/_posts/2022-6-28-pytorch-1.12-released.md b/_posts/2022-6-28-pytorch-1.12-released.md index 8e3accb024e9..434984eac130 100644 --- a/_posts/2022-6-28-pytorch-1.12-released.md +++ b/_posts/2022-6-28-pytorch-1.12-released.md @@ -83,7 +83,7 @@ Forward-mode AD allows the computation of directional derivatives (or equivalent #### BC DataLoader + DataPipe -\`DataPipe\` from TorchData becomes fully backward compatible with the existing \`DataLoader\` regarding shuffle determinism and dynamic sharding in both multiprocessing and distributed environments. For more details, please check out the [tutorial](https://pytorch.org/data/0.4.0/tutorial.html#working-with-dataloader). +\`DataPipe\` from TorchData becomes fully backward compatible with the existing \`DataLoader\` regarding shuffle determinism and dynamic sharding in both multiprocessing and distributed environments. #### (Beta) AWS S3 Integration @@ -119,7 +119,7 @@ We significantly improved coverage for ``functorch.jvp`` (our forward-mode autod #### (Prototype) functorch.experimental.functionalize -Given a function f, ``functionalize(f)`` returns a new function without mutations (with caveats). This is useful for constructing traces of PyTorch functions without in-place operations. For example, you can use ``make_fx(functionalize(f))`` to construct a mutation-free trace of a pytorch function. To learn more, please see the [documentation](https://pytorch.org/functorch/stable/generated/functorch.experimental.functionalize.html#functorch.experimental.functionalize). +Given a function f, ``functionalize(f)`` returns a new function without mutations (with caveats). This is useful for constructing traces of PyTorch functions without in-place operations. For example, you can use ``make_fx(functionalize(f))`` to construct a mutation-free trace of a pytorch function. To learn more, please see the [documentation](https://pytorch.org/functorch/stable/). For more details, please see our [installation instructions](https://pytorch.org/functorch/stable/install.html), [documentation](https://pytorch.org/functorch/), [tutorials](https://pytorch.org/functorch), and [release notes](https://github.com/pytorch/functorch/releases). diff --git a/_posts/2023-03-15-new-library-updates-in-pytorch-2.0.md b/_posts/2023-03-15-new-library-updates-in-pytorch-2.0.md index 1b1f9046d606..d33d9af343b2 100644 --- a/_posts/2023-03-15-new-library-updates-in-pytorch-2.0.md +++ b/_posts/2023-03-15-new-library-updates-in-pytorch-2.0.md @@ -158,17 +158,17 @@ TensorDict is a new data carrier for PyTorch. ### [Beta] TensorDict: specialized dictionary for PyTorch -TensorDict allows you to execute many common operations across batches of tensors carried by a single container. TensorDict supports many shape and device or storage operations, and can readily be used in distributed settings. Check the [documentation](https://pytorch-labs.github.io/tensordict/) to know more. +TensorDict allows you to execute many common operations across batches of tensors carried by a single container. TensorDict supports many shape and device or storage operations, and can readily be used in distributed settings. Check the [documentation](https://pytorch.org/tensordict/) to know more. 
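To give a flavor of the TensorDict operations mentioned above, a minimal sketch (keys, shapes, and the device move are illustrative) might look like:

```python
# Hedged sketch of basic TensorDict usage; keys and shapes are made up.
import torch
from tensordict import TensorDict

td = TensorDict(
    {"obs": torch.randn(4, 3, 32, 32), "reward": torch.zeros(4, 1)},
    batch_size=[4],
)

first_two = td[:2]            # indexing applies to every entry along the batch dims
reshaped = td.reshape(2, 2)   # so do shape operations on the batch dimension
if torch.cuda.is_available():
    td = td.to("cuda")        # device operations move the whole container at once
```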
### [Beta] @tensorclass: a dataclass for PyTorch -Like TensorDict, [tensorclass](https://pytorch-labs.github.io/tensordict/reference/prototype.html) provides the opportunity to write dataclasses with built-in torch features such as shape or device operations. +Like TensorDict, [tensorclass](https://pytorch.org/tensordict/reference/prototype.html) provides the opportunity to write dataclasses with built-in torch features such as shape or device operations. ### [Beta] tensordict.nn: specialized modules for TensorDict -The [tensordict.nn module](https://pytorch-labs.github.io/tensordict/reference/nn.html) provides specialized nn.Module subclasses that make it easy to build arbitrarily complex graphs that can be executed with TensorDict inputs. It is compatible with the latest PyTorch features such as functorch, torch.fx and torch.compile. +The [tensordict.nn module](https://pytorch.org/tensordict/reference/nn.html) provides specialized nn.Module subclasses that make it easy to build arbitrarily complex graphs that can be executed with TensorDict inputs. It is compatible with the latest PyTorch features such as functorch, torch.fx and torch.compile. ## TorchRec diff --git a/_posts/2023-03-15-pytorch-2.0-release.md b/_posts/2023-03-15-pytorch-2.0-release.md index da790cb0ce58..bd6cdafc63f1 100644 --- a/_posts/2023-03-15-pytorch-2.0-release.md +++ b/_posts/2023-03-15-pytorch-2.0-release.md @@ -487,12 +487,12 @@ PyTorch [DistributedTensor](https://github.com/pytorch/pytorch/blob/master/torch #### [Prototype] TensorParallel -We now support DTensor based Tensor Parallel which users can distribute their model parameters across different GPU devices. We also support Pairwise Parallel which shards two concatenated linear layers in a col-wise and row-wise style separately so that only one collective(all-reduce/reduce-scatter) is needed in the end. More details can be found in this [example](https://github.com/pytorch/examples/blob/main/distributed/tensor_parallelism/example.py). +We now support DTensor-based Tensor Parallel, with which users can distribute their model parameters across different GPU devices. We also support Pairwise Parallel, which shards two concatenated linear layers in a col-wise and row-wise style separately so that only one collective (all-reduce/reduce-scatter) is needed in the end. #### [Prototype] 2D Parallel -We implemented the integration of the aforementioned TP with FullyShardedDataParallel(FSDP) as 2D parallel to further scale large model training. More details can be found in this [slide](https://docs.google.com/presentation/d/17g6WqrO00rP3MsxbRENsPpjrlSkwiA_QB4r93_eB5is/edit?usp=sharing) and [code example](https://github.com/pytorch/pytorch/blob/master/test/distributed/tensor/parallel/test_2d_parallel.py). +We implemented the integration of the aforementioned TP with FullyShardedDataParallel (FSDP) as 2D parallel to further scale large model training. More details can be found in this [slide](https://docs.google.com/presentation/d/17g6WqrO00rP3MsxbRENsPpjrlSkwiA_QB4r93_eB5is/edit?usp=sharing). 
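And for the `@tensorclass` decorator mentioned just above, a minimal sketch might be the following; the field names are invented, and the import path assumes the prototype module that the reference link points to (newer releases may expose it elsewhere):

```python
# Hedged sketch of @tensorclass; assumes the decorator is importable from
# tensordict.prototype, as suggested by the prototype reference above.
import torch
from tensordict.prototype import tensorclass

@tensorclass
class Batch:
    image: torch.Tensor
    label: torch.Tensor

batch = Batch(
    image=torch.randn(4, 3, 32, 32),
    label=torch.zeros(4, dtype=torch.long),
    batch_size=[4],
)

half = batch[:2]   # indexing and shape/device ops behave like TensorDict
```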
#### [Prototype] torch.compile(dynamic=True) diff --git a/_posts/2023-03-22-pytorch-2.0-xla.md b/_posts/2023-03-22-pytorch-2.0-xla.md index ae22dac155b7..2d8e680c7b41 100644 --- a/_posts/2023-03-22-pytorch-2.0-xla.md +++ b/_posts/2023-03-22-pytorch-2.0-xla.md @@ -215,7 +215,7 @@ The following graph highlights the memory efficiency benefits of PyTorch/XLA FSD ## Closing Thoughts… -We are excited to bring these features to the PyTorch community, and this is really just the beginning. Areas like dynamic shapes, deeper support for OpenXLA and many others are in development and we plan to put out more blogs to dive into the details. PyTorch/XLA is developed fully open source and we invite you to join the community of developers by filing issues, submitting pull requests, and sending RFCs on [GitHub](github.com/pytorch/xla). You can try PyTorch/XLA on a variety of XLA devices including TPUs and GPUs. [Here](https://colab.sandbox.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb) is how to get started. +We are excited to bring these features to the PyTorch community, and this is really just the beginning. Areas like dynamic shapes, deeper support for OpenXLA and many others are in development and we plan to put out more blogs to dive into the details. PyTorch/XLA is developed fully open source and we invite you to join the community of developers by filing issues, submitting pull requests, and sending RFCs on [GitHub](https://github.com/pytorch/xla). You can try PyTorch/XLA on a variety of XLA devices including TPUs and GPUs. [Here](https://colab.sandbox.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb) is how to get started. Congratulations again to the PyTorch community on this milestone! diff --git a/_posts/2023-05-02-accelerated-image-seg.md b/_posts/2023-05-02-accelerated-image-seg.md index 492833d32dab..bd585950f130 100644 --- a/_posts/2023-05-02-accelerated-image-seg.md +++ b/_posts/2023-05-02-accelerated-image-seg.md @@ -115,7 +115,7 @@ I used the Moscow satellite image dataset, which consists of 1,352 images of 1,3 **Figure 3**. Satellite image 3-channel RGB chips from Moscow (top row) and corresponding pixel segmentation masks with varying speed limits (bottom row) (image by author) -There is a JSON configuration file that must be updated for all remaining components: training and validation split, training, and inference. [An example configuration can be found here](http://github.com/avanetten/cresi/blob/main/cresi/configs/sn5_baseline_aws.json.). I perform an 80:20 training/validation split, making sure to point to the correct folder of satellite images and corresponding masks for training. The configuration parameters are explained in more in the [notebook under examples in GitHub for Intel Extension for PyTorch here](http://github.com/intel/intel-extension-for-pytorch/tree/master/examples/cpu/usecase_spacenet5). +There is a JSON configuration file that must be updated for all remaining components: training and validation split, training, and inference. [An example configuration can be found here](http://github.com/avanetten/cresi/blob/main/cresi/configs/sn5_baseline_aws.json). I perform an 80:20 training/validation split, making sure to point to the correct folder of satellite images and corresponding masks for training. 
The configuration parameters are explained in more detail in the [notebook under examples in GitHub for Intel Extension for PyTorch here](http://github.com/intel/intel-extension-for-pytorch/tree/master/examples/cpu/usecase_spacenet5). ### Training a ResNet34 + UNet Model diff --git a/_posts/2023-07-25-announcing-cpp.md b/_posts/2023-07-25-announcing-cpp.md index d7ffeaf1a6a4..dd1969f98909 100644 --- a/_posts/2023-07-25-announcing-cpp.md +++ b/_posts/2023-07-25-announcing-cpp.md @@ -34,10 +34,10 @@ Amazon S3 supports global buckets. However, a bucket is created within a Region. To read objects in a bucket that aren’t publicly accessible, you must provide AWS credentials through one of the following methods: -* [Install and configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) the [AWS Command Line Interface](aws.amazon.com/cli) (AWS CLI) with `AWS configure` +* [Install and configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) the [AWS Command Line Interface](https://aws.amazon.com/cli/) (AWS CLI) with `aws configure` * Set credentials in the AWS credentials profile file on the local system, located at `~/.aws/credentials` on Linux, macOS, or Unix * Set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables -* If you’re using this library on an [Amazon Elastic Compute Cloud](aws.amazon.com/ec2) (Amazon EC2) instance, specify an [AWS Identity and Access Management](aws.amazon.com/iam) (IAM) role and then give the EC2 instance access to that role +* If you’re using this library on an [Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2) (Amazon EC2) instance, specify an [AWS Identity and Access Management](https://aws.amazon.com/iam) (IAM) role and then give the EC2 instance access to that role ### Example code diff --git a/assets/hub/pytorch_vision_fcn_resnet101.ipynb b/assets/hub/pytorch_vision_fcn_resnet101.ipynb index 01bb16b88f96..c29506225a9b 100644 --- a/assets/hub/pytorch_vision_fcn_resnet101.ipynb +++ b/assets/hub/pytorch_vision_fcn_resnet101.ipynb @@ -42,7 +42,7 @@ "\n", "The model returns an `OrderedDict` with two Tensors that are of the same height and width as the input Tensor, but with 21 classes.\n", "`output['out']` contains the semantic masks, and `output['aux']` contains the auxillary loss values per-pixel. In inference mode, `output['aux']` is not useful.\n", - "So, `output['out']` is of shape `(N, 21, H, W)`. More documentation can be found [here](https://pytorch.org/docs/stable/torchvision/models.html#object-detection-instance-segmentation-and-person-keypoint-detection)." + "So, `output['out']` is of shape `(N, 21, H, W)`. More documentation can be found [here](https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection)." 
] }, { diff --git a/docs/1.4.0/notes/distributed_autograd.html b/docs/1.4.0/notes/distributed_autograd.html index 753310fccb19..0feb48a1875b 100644 --- a/docs/1.4.0/notes/distributed_autograd.html +++ b/docs/1.4.0/notes/distributed_autograd.html @@ -16,11 +16,6 @@ - - - - - diff --git a/docs/1.4.0/notes/rref.html b/docs/1.4.0/notes/rref.html index 94206245cc40..f7d0f4f31638 100644 --- a/docs/1.4.0/notes/rref.html +++ b/docs/1.4.0/notes/rref.html @@ -15,12 +15,6 @@ - - - - - - diff --git a/docs/1.5.0/rpc/index.html b/docs/1.5.0/rpc/index.html index fe130c15a911..472b32c4fc92 100644 --- a/docs/1.5.0/rpc/index.html +++ b/docs/1.5.0/rpc/index.html @@ -17,13 +17,6 @@ - - - - - - - diff --git a/docs/1.5.1/rpc/index.html b/docs/1.5.1/rpc/index.html index 4499f1fd9887..40f077749b4e 100644 --- a/docs/1.5.1/rpc/index.html +++ b/docs/1.5.1/rpc/index.html @@ -17,13 +17,6 @@ - - - - - - - diff --git a/javadoc/1.4.0/stylesheet.css b/javadoc/1.4.0/stylesheet.css index 9681235b9e5a..7373c149b4a8 100644 --- a/javadoc/1.4.0/stylesheet.css +++ b/javadoc/1.4.0/stylesheet.css @@ -2,8 +2,6 @@ * Javadoc style sheet */ -@import url('resources/fonts/dejavu.css'); - /* * Styles for individual HTML elements. * diff --git a/javadoc/1.9.0/stylesheet.css b/javadoc/1.9.0/stylesheet.css index 98055b22d6d5..d590d030cac7 100644 --- a/javadoc/1.9.0/stylesheet.css +++ b/javadoc/1.9.0/stylesheet.css @@ -3,8 +3,6 @@ Overall document style */ -@import url('resources/fonts/dejavu.css'); - body { background-color:#ffffff; color:#353833;