Posit AI Blog: Hugging Face Integrations

July 15, 2023

We’re happy to announce that the first releases of hfhub and tok are now on CRAN.
hfhub is an R interface to Hugging Face Hub, allowing users to download and cache files
from Hugging Face Hub, while tok implements R bindings for the Hugging Face tokenizers
library.

Hugging Face rapidly became the platform to build, share and collaborate on
deep learning applications, and we hope these integrations will help R users
get started with Hugging Face tools as well as build novel applications.

We have also previously announced the safetensors
package, which allows reading and writing files in the safetensors format.

hfhub

hfhub is an R interface to the Hugging Face Hub. hfhub currently implements a single
functionality: downloading files from Hub repositories. Model Hub repositories are
mainly used to store pre-trained model weights together with any other metadata
necessary to load the model, such as the hyperparameter configurations and the
tokenizer vocabulary.

Downloaded files are cached using the same layout as the Python library, so cached
files can be shared between the R and Python implementations, for easier and quicker
switching between languages.
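As a rough illustration of that shared layout, the sketch below rebuilds the path of a cached file from a repository id, a commit hash and a file name. The helper `snapshot_path()` is hypothetical and not part of hfhub, and the default cache directory is an assumption based on the usual location of the Hugging Face cache; the commit hash is used purely as an example value.

```r
# Hypothetical helper (not part of hfhub): rebuild the snapshot path of a cached file.
# Assumes the usual cache root, ~/.cache/huggingface/hub.
snapshot_path <- function(repo_id, commit, filename,
                          cache_dir = file.path("~", ".cache", "huggingface", "hub")) {
  # "org/name" becomes "models--org--name"; a bare id like "gpt2" becomes "models--gpt2".
  repo_folder <- paste0("models--", gsub("/", "--", repo_id, fixed = TRUE))
  file.path(cache_dir, repo_folder, "snapshots", commit, filename)
}

p <- snapshot_path("gpt2", "11c5a3d5811f50298f278a704980280950aedb10", "model.safetensors")
p
#> [1] "~/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors"
```

Because the path depends only on the repository, the commit and the file name, a file fetched from one language can be found by the other.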

We already use hfhub in the minhub package and
in the ‘GPT-2 from scratch with torch’ blog post to
download pre-trained weights from Hugging Face Hub.

You can use hub_download() to download any file from a Hugging Face Hub repository
by specifying the repository id and the path to the file that you want to download.
If the file is already in the cache, then the function returns the file path immediately;
otherwise the file is downloaded, cached, and then the access path is returned.

path <- hfhub::hub_download("gpt2", "model.safetensors")
path
#> /Users/dfalbel/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors
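The return-from-cache-or-download behaviour can be sketched in a few lines of base R. This is only an illustration of the idea under simplifying assumptions, not hfhub's actual implementation: `cached_fetch()` and `fake_fetch()` are hypothetical names, and the "download" is a stand-in that writes a small text file.

```r
# Hypothetical sketch of cache-or-download logic; not hfhub's actual code.
cached_fetch <- function(key, cache_dir, fetch) {
  dest <- file.path(cache_dir, key)
  if (file.exists(dest)) {
    return(dest)  # cache hit: return the path without downloading
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  fetch(dest)     # cache miss: `fetch` writes the file to `dest`
  dest
}

# Stand-in for a real download that counts how often it runs.
calls <- 0
fake_fetch <- function(dest) {
  calls <<- calls + 1
  writeLines("weights", dest)
}

cache_dir <- tempfile("hub-cache-")
first  <- cached_fetch("model.safetensors", cache_dir, fake_fetch)
second <- cached_fetch("model.safetensors", cache_dir, fake_fetch)
```

Both calls return the same path, and the stand-in download runs only once; the second call is served from the cache.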

tok

For more context, see the blog post ‘What are Large Language Models? What are they not?’.

When using a pre-trained model (whether for inference or for fine-tuning) it’s very
important that you use the exact same tokenization process that was used during
training, and the Hugging Face team has done an amazing job making sure that its algorithms
match the tokenization strategies used by most LLMs.
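To see why the tokenization has to match, consider a toy example in base R. It uses a whitespace tokenizer with two made-up vocabularies, which is far simpler than the algorithms in the tokenizers library; `encode()`, `vocab_train` and `vocab_other` are hypothetical names chosen for illustration.

```r
# Toy whitespace tokenizer: look each word up in a named integer vocabulary.
encode <- function(text, vocab) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  unname(vocab[words])
}

# Two vocabularies with the same words but different id assignments.
vocab_train <- c(hello = 1L, world = 2L, from = 3L, R = 4L)
vocab_other <- c(R = 1L, from = 2L, hello = 3L, world = 4L)

ids_train <- encode("hello world from R", vocab_train)
ids_other <- encode("hello world from R", vocab_other)
ids_train
#> [1] 1 2 3 4
ids_other
#> [1] 3 4 2 1
```

A model trained on ids from `vocab_train` would receive the second sequence as entirely different words, even though the input text is identical.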

tok provides R bindings to the 🤗 tokenizers library. The tokenizers library is itself
implemented in Rust for performance, and our bindings use the extendr project
to help interfacing with R. Using tok we can tokenize text the exact same way most
NLP models do, making it easier to load pre-trained models in R as well as to share
our models with the broader NLP community.

tok can be installed from CRAN, and currently its usage is restricted to loading
tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT2
model with:

tokenizer <- tok::tokenizer$from_pretrained("gpt2")
ids <- tokenizer$encode("Hello world! You can use tokenizers from R")$ids
ids
#> [1] 15496   995     0   921   460   779 11241 11341   422   371
tokenizer$decode(ids)
#> [1] "Hello world! You can use tokenizers from R"

Spaces

Remember that you can already host
Shiny (for R and Python) on Hugging Face Spaces. As an example, we have built a Shiny
app that uses:

torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)
hfhub to download and cache pre-trained weights from the StableLM repository
tok to tokenize and pre-process text as input for the torch model. tok also uses hfhub to download the tokenizer’s vocabulary.

The app is hosted in this Space.
It currently runs on CPU, but you can easily change the Docker image if you want
to run it on a GPU for faster inference.

The app source code is also open-source and can be found in the Space's file tab.

Looking forward

It’s the very early days of hfhub and tok and there’s still a lot of work to do
and functionality to implement. We hope to get community help to prioritize work,
so if there’s a feature that you are missing, please open an issue in the
GitHub repositories.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/

BibTeX citation

@misc{hugging-face-integrations,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: Hugging Face Integrations},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/},
  year = {2023}
}
