ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago (distributed domain, exempt) |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://github.com/openai/CLIP |
| Last Crawled | 2026-04-09 17:17:36 (4 days ago) |
| First Indexed | 2021-01-05 21:21:15 (5 years ago) |
| HTTP Status Code | 200 |
| Meta Title | GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image · GitHub |
| Meta Description | CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image - openai/CLIP |
| Meta Canonical | null |
| Boilerpipe Text | [Blog] [Paper] [Model Card] [Colab]
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
Approach
Usage
First, install PyTorch 1.7.1 (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine or cpuonly when installing on a machine without a GPU.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
API
The CLIP module `clip` provides the following methods:

clip.available_models()
Returns the names of the available CLIP models.

clip.load(name, device=..., jit=False)
Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint.
The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded.

clip.tokenize(text: Union[str, List[str]], context_length=77)
Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model.

The model returned by `clip.load()` supports the following methods:

model.encode_image(image: Tensor)
Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)
Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)
Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
More Examples
Zero-Shot Prediction
The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the CIFAR-100 dataset, and predicts the most likely labels among the 100 textual labels from the dataset.
import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
Top predictions:
snake: 65.31%
turtle: 12.29%
sweet_pepper: 3.83%
lizard: 1.88%
crocodile: 1.75%
Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs.
Linear-probe evaluation
The example below uses scikit-learn to perform logistic regression on image features.
import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))
            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
Note that the `C` value should be determined via a hyperparameter sweep using a validation split.
See Also
OpenCLIP: includes larger and independently trained CLIP models up to ViT-G/14
Hugging Face implementation of CLIP: for easier integration with the HF ecosystem |
| Markdown | [openai](https://github.com/openai) / **[CLIP](https://github.com/openai/CLIP)** Public
[**4** Branches](https://github.com/openai/CLIP/branches) [**0** Tags](https://github.com/openai/CLIP/tags) [58 Commits](https://github.com/openai/CLIP/commits/main/)
# CLIP
[\[Blog\]](https://openai.com/blog/clip/) [\[Paper\]](https://arxiv.org/abs/2103.00020) [\[Model Card\]](https://github.com/openai/CLIP/blob/main/model-card.md) [\[Colab\]](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb)
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
## Approach
[](https://github.com/openai/CLIP/blob/main/CLIP.png)
## Usage
First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
```
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
```
Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.
```
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
logits_per_image, logits_per_text = model(image, text)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]]
```
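The last two lines of the snippet above turn the per-image logits into a probability distribution over the three candidate captions via softmax. A minimal, dependency-free sketch of that step (the logit values here are hypothetical, chosen to roughly reproduce the label probabilities printed above):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability, then normalize the exponentials
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for ["a diagram", "a dog", "a cat"]
logits = [25.6, 20.1, 19.8]
probs = softmax(logits)
print(probs)
```

Because the logits are cosine similarities scaled by 100 (see the API section), even modest gaps in similarity become near-one-hot probabilities after softmax.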
## API
The CLIP module `clip` provides the following methods:
#### `clip.available_models()`
Returns the names of the available CLIP models.
#### `clip.load(name, device=..., jit=False)`
Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint.
The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded.
#### `clip.tokenize(text: Union[str, List[str]], context_length=77)`
Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model.
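The fixed `context_length` is what makes a batch of differently sized captions stackable into one tensor. A toy sketch of that padding/truncation behavior (the word-to-id tokenizer here is hypothetical; the real `clip.tokenize` uses a BPE vocabulary with special start/end tokens):

```python
def toy_tokenize(texts, context_length=8):
    # hypothetical toy tokenizer: assign each new word the next integer id,
    # then truncate/pad every sequence to exactly context_length entries
    vocab = {}
    rows = []
    for t in texts:
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in t.split()]
        ids = ids[:context_length]                 # truncate long inputs
        ids += [0] * (context_length - len(ids))   # pad short ones with 0
        rows.append(ids)
    return rows

batch = toy_tokenize(["a diagram", "a dog"], context_length=8)
# every row has the same fixed length, so the batch is rectangular
assert all(len(row) == 8 for row in batch)
```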
***
The model returned by `clip.load()` supports the following methods:
#### `model.encode_image(image: Tensor)`
Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.
#### `model.encode_text(text: Tensor)`
Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.
#### `model(image: Tensor, text: Tensor)`
Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
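The relationship between the encoded features and those logits can be sketched in NumPy (assuming NumPy is available; the random arrays below are stand-ins for real `encode_image`/`encode_text` outputs, and 512 is just an illustrative feature width):

```python
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.normal(size=(2, 512))   # stand-in for encode_image output
text_features = rng.normal(size=(3, 512))    # stand-in for encode_text output

# L2-normalize along the feature dimension so dot products become cosines
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# logits are cosine similarities scaled by 100
logits_per_image = 100.0 * image_features @ text_features.T   # shape (2, 3)
logits_per_text = logits_per_image.T                          # shape (3, 2)
```

Since cosines lie in [-1, 1], the logits are bounded by ±100, and `logits_per_text` is simply the transpose of `logits_per_image`.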
## More Examples
### Zero-Shot Prediction
The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the [CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), and predicts the most likely labels among the 100 textual labels from the dataset.
```
import os
import clip
import torch
from torchvision.datasets import CIFAR100
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# Calculate features
with torch.no_grad():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_inputs)
# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```
Top predictions:
snake: 65.31%
turtle: 12.29%
sweet_pepper: 3.83%
lizard: 1.88%
crocodile: 1.75%
```
Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs.
### Linear-probe evaluation
The example below uses [scikit-learn](https://scikit-learn.org/) to perform logistic regression on image features.
```
import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)
def get_features(dataset):
all_features = []
all_labels = []
with torch.no_grad():
for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
features = model.encode_image(images.to(device))
all_features.append(features)
all_labels.append(labels)
return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()
# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)
# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)
# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
```
Note that the `C` value should be determined via a hyperparameter sweep using a validation split.
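One common shape for such a sweep is a coarse log-spaced grid followed by a finer grid around the best coarse value. The sketch below only builds the candidate grids (the grid boundaries are assumptions, not from the source); in practice each `C` would be scored by fitting on the training split and evaluating on a held-out validation split:

```python
# coarse, log-spaced candidate grid: 1e-3 ... 1e3
candidate_cs = [10 ** e for e in range(-3, 4)]

def refine(best_c):
    # zoom in around the best coarse value with a finer log-spaced grid
    return [best_c * (10 ** (k / 4)) for k in range(-2, 3)]

print(candidate_cs)
print(refine(1.0))
```

Incidentally, `refine(1.0)` contains 10**-0.5 ≈ 0.316, the same order of value used as `C` in the example above.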
## See Also
- [OpenCLIP](https://github.com/mlfoundations/open_clip): includes larger and independently trained CLIP models up to ViT-G/14
- [Hugging Face implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip): for easier integration with the HF ecosystem
## About
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
### Topics
[machine-learning](https://github.com/topics/machine-learning "Topic: machine-learning") [deep-learning](https://github.com/topics/deep-learning "Topic: deep-learning")
### Resources
[Readme](https://github.com/openai/CLIP#readme-ov-file)
### License
[MIT license](https://github.com/openai/CLIP#MIT-1-ov-file)
[Activity](https://github.com/openai/CLIP/activity)
[Custom properties](https://github.com/openai/CLIP/custom-properties)
### Stars
[**33\.1k** stars](https://github.com/openai/CLIP/stargazers)
### Watchers
[**327** watching](https://github.com/openai/CLIP/watchers)
### Forks
[**4k** forks](https://github.com/openai/CLIP/forks)
[Report repository](https://github.com/contact/report-content?content_url=https%3A%2F%2Fgithub.com%2Fopenai%2FCLIP&report=openai+%28user%29)
## [Used by 3\.5k](https://github.com/openai/CLIP/network/dependents)
## [Contributors](https://github.com/openai/CLIP/graphs/contributors)
## Languages
- [Jupyter Notebook 99.1%](https://github.com/openai/CLIP/search?l=jupyter-notebook)
- [Python 0.9%](https://github.com/openai/CLIP/search?l=python) |
| Readable Markdown | [\[Blog\]](https://openai.com/blog/clip/) [\[Paper\]](https://arxiv.org/abs/2103.00020) [\[Model Card\]](https://github.com/openai/CLIP/blob/main/model-card.md) [\[Colab\]](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb)
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
Approach
[](https://github.com/openai/CLIP/blob/main/CLIP.png)
Usage
First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
```
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
```
Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.
```
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
logits_per_image, logits_per_text = model(image, text)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]]
```
API
The CLIP module `clip` provides the following methods:
`clip.available_models()`
Returns the names of the available CLIP models.
`clip.load(name, device=..., jit=False)`
Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint.
The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded.
`clip.tokenize(text: Union[str, List[str]], context_length=77)`
Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model.
***
The model returned by `clip.load()` supports the following methods:
`model.encode_image(image: Tensor)`
Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.
`model.encode_text(text: Tensor)`
Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.
`model(image: Tensor, text: Tensor)`
Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
More Examples
Zero-Shot Prediction
The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the [CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), and predicts the most likely labels among the 100 textual labels from the dataset.
```
import os
import clip
import torch
from torchvision.datasets import CIFAR100
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# Calculate features
with torch.no_grad():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_inputs)
# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```
Top predictions:
snake: 65.31%
turtle: 12.29%
sweet_pepper: 3.83%
lizard: 1.88%
crocodile: 1.75%
```
Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs.
Linear-probe evaluation
The example below uses [scikit-learn](https://scikit-learn.org/) to perform logistic regression on image features.
```
import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)
def get_features(dataset):
all_features = []
all_labels = []
with torch.no_grad():
for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
features = model.encode_image(images.to(device))
all_features.append(features)
all_labels.append(labels)
return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()
# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)
# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)
# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
```
Note that the `C` value should be determined via a hyperparameter sweep using a validation split.
See Also
- [OpenCLIP](https://github.com/mlfoundations/open_clip): includes larger and independently trained CLIP models up to ViT-G/14
- [Hugging Face implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip): for easier integration with the HF ecosystem |
| Shard | 174 (laksa) |
| Root Hash | 6325672905007345774 |
| Unparsed URL | com,github!/openai/CLIP s443 |