Top Open Source Large Language Models (LLMs) Available For Commercial Use


The top open source Large Language Models available for commercial use are as follows.

  1. Llama 2

Meta released Llama 2, a family of pretrained and fine-tuned LLMs, along with Llama 2-Chat, a version of Llama 2 fine-tuned for dialogue. The models scale up to 70 billion parameters. Extensive testing on safety- and helpfulness-focused benchmarks showed that Llama 2-Chat models outperform existing open-source models in most cases, and human evaluations indicate they are competitive with several closed-source models.

The researchers also took several steps to improve the safety of these models, including safety-specific data annotation, red-teaming exercises, fine-tuning with an emphasis on safety, and iterative, ongoing evaluation.

Llama 2 has been released in variants with 7 billion, 13 billion, and 70 billion parameters, and Llama 2-Chat, optimized for dialogue use cases, is available at the same parameter scales.

Project: https://huggingface.co/meta-llama

Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
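As a minimal sketch (assuming access to the gated meta-llama repository has been granted and a recent transformers release with chat-template support is installed), the 7B chat variant can be queried roughly as follows; the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # 7B chat variant; 13B/70B follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# The chat template wraps the conversation in Llama 2's [INST] ... [/INST] format.
messages = [{"role": "user", "content": "Explain what a large language model is in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```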

  2. Falcon

Researchers from the Technology Innovation Institute, Abu Dhabi introduced the Falcon series, which includes models with 7 billion, 40 billion, and 180 billion parameters. These causal decoder-only models were trained on a high-quality, diverse corpus assembled mostly from web data. Falcon-180B, the largest model in the series, was trained on more than 3.5 trillion text tokens, the largest openly documented pretraining run to date.

The researchers found that Falcon-180B significantly outperforms models such as PaLM and Chinchilla, as well as concurrently developed models such as LLaMA 2 and Inflection-1. Falcon-180B approaches the performance of PaLM-2-Large despite lower pretraining and inference costs, which places it among the leading publicly known language models alongside GPT-4 and PaLM-2-Large.

Project: https://huggingface.co/tiiuae/falcon-180B

Paper: https://arxiv.org/pdf/2311.16867.pdf
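A minimal sketch with the smaller Falcon-7B checkpoint (the 40B and 180B models expose the same interface but need far more memory); the prompt is illustrative.

```python
import torch
from transformers import pipeline

# Text-generation pipeline for Falcon-7B; bfloat16 and device_map spread the weights across available GPUs.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator("Open-source language models are useful because", max_new_tokens=60)
print(result[0]["generated_text"])
```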

  3. Dolly 2.0

Researchers from Databricks created Dolly-v2-12b, an LLM designed for commercial use and built on the Databricks Machine Learning platform. Based on pythia-12b, it was fine-tuned on roughly 15,000 instruction/response pairs (databricks-dolly-15k) written by Databricks employees. These pairs cover the capability areas described in the InstructGPT paper: brainstorming, classification, closed question answering, generation, information extraction, open question answering, and summarization.

Dolly-v2 is also available in smaller sizes for different use cases: Dolly-v2-7b has 6.9 billion parameters and is based on pythia-6.9b, while Dolly-v2-3b has 2.8 billion parameters and is based on pythia-2.8b.

HF Project: https://huggingface.co/databricks/dolly-v2-12b

Github: https://github.com/databrickslabs/dolly#getting-started-with-response-generation
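A minimal sketch of instruction-following inference, loosely following the repository's getting-started instructions; Dolly ships a custom pipeline, so trust_remote_code is required, and the exact output format may vary by version.

```python
import torch
from transformers import pipeline

# Dolly's custom instruction-following pipeline is loaded from the model repo,
# hence trust_remote_code=True.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

print(generate_text("Explain the difference between supervised and unsupervised learning."))
```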

  4. MPT

MosaicML’s MPT-7B marks notable progress for open transformer-based language models. It was trained from scratch on a massive corpus of 1 trillion tokens of text and code.

MPT-7B was trained remarkably efficiently: the full training run finished in just 9.5 days with no human intervention and, using MosaicML’s infrastructure, cost roughly $200,000, an exceptionally low price given the scale of the task.

HF Project: https://huggingface.co/mosaicml/mpt-7b

Github: https://github.com/mosaicml/llm-foundry/

  5. FLAN-T5

Google introduced FLAN-T5, an enhanced version of T5 fine-tuned on a mixture of tasks. Flan-T5 checkpoints demonstrate strong few-shot performance even compared to significantly larger models such as PaLM 62B. With FLAN-T5, the team presented instruction fine-tuning as a general approach for improving language model performance across a wide range of tasks and evaluation metrics.

HF Project: https://huggingface.co/google/flan-t5-base

Paper: https://arxiv.org/pdf/2210.11416.pdf
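A minimal sketch of instruction-style inference with the flan-t5-base checkpoint; the prompt is illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Flan-T5 is a sequence-to-sequence model, so it uses the Seq2SeqLM class rather than CausalLM.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Translate English to German: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```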

  6. GPT-NeoX-20B

EleutherAI presented GPT-NeoX-20B, a large autoregressive language model with 20 billion parameters. Its performance is assessed on a variety of tasks covering factual knowledge, mathematical reasoning, and language understanding.

A key conclusion of the evaluation is that GPT-NeoX-20B is a strong few-shot reasoner, even when given very few examples. In five-shot evaluations in particular, it performs noticeably better than similarly sized GPT-3 and FairSeq models.

HF Project: https://huggingface.co/EleutherAI/gpt-neox-20b

Paper: https://arxiv.org/pdf/2204.06745.pdf
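A sketch of the kind of few-shot prompting the evaluation refers to: a handful of worked examples followed by the query. The examples are illustrative, and the 20B checkpoint needs roughly 40 GB of GPU memory in half precision.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Few-shot prompt: solved examples first, then the question to be answered.
prompt = (
    "Q: What is 17 + 25?\nA: 42\n"
    "Q: What is 9 * 6?\nA: 54\n"
    "Q: What is 13 + 29?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```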

  7. Open Pre-trained Transformers (OPT)

Since LLMs are frequently trained over hundreds of thousands of compute-days, they usually need substantial computing resources, which makes replication extremely difficult for researchers who lack substantial funding. Even when these models are made available through APIs, full access to the model weights is often restricted, preventing in-depth study and analysis.

To address these issues, Meta researchers presented Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers covering a broad range of sizes, from 125 million to 175 billion parameters. OPT’s main goal is to democratize access to cutting-edge language models by making them fully and responsibly available to researchers.

OPT-175B, the flagship model of the suite, is shown by the researchers to perform comparably to GPT-3. What really distinguishes OPT-175B, however, is that its development required only about 1/7th of the carbon footprint of conventional large-scale language model training efforts.

HF Project: https://huggingface.co/facebook/opt-350m

Paper: https://arxiv.org/pdf/2205.01068.pdf

  8. BLOOM

Researchers from BigScience developed BLOOM, a 176 billion-parameter open-access language model. As a decoder-only Transformer language model, BLOOM is particularly good at generating text in response to prompts. It was trained on the ROOTS corpus, an extensive dataset drawing on hundreds of sources and covering 46 natural languages and 13 programming languages (59 languages in total). This breadth of training data allows BLOOM to understand and produce text across many linguistic contexts.

Paper: https://arxiv.org/pdf/2211.05100.pdf

HF Project: https://huggingface.co/bigscience/bloom
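A minimal multilingual sketch using the much smaller bloom-560m checkpoint so it runs on modest hardware; the full 176B model exposes the same interface. The prompt is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# BLOOM was trained on 46 natural languages, so prompts need not be in English.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```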

  9. Baichuan

Baichuan 2 is the latest generation of large-scale open-source language models from Baichuan Intelligence Inc. Trained on a carefully curated corpus of 2.6 trillion tokens, it captures a wide range of linguistic nuances and patterns. Notably, Baichuan 2 sets a new standard for models of its size, showing strong performance on public benchmarks in both Chinese and English.

Baichuan 2 is released in several versions, each aimed at a specific use case. The Base model comes in 7-billion- and 13-billion-parameter variants, and matching Chat models of the same sizes are tailored for dialogue settings. In addition, a 4-bit quantized version of the Chat model is offered for greater efficiency, lowering compute requirements without sacrificing much performance.

HF Project: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat#Introduction

  10. BERT

Google introduced BERT (Bidirectional Encoder Representations from Transformers). Unlike earlier language models, BERT is designed to pre-train deep bidirectional representations from unlabeled text, conditioning on both left and right context in every layer of its architecture, which lets it capture a more thorough grasp of linguistic nuance.

Two of BERT’s main strengths are its conceptual simplicity and its empirical power. Extensive pretraining on text data gives it rich contextual embeddings that can be fine-tuned with little effort into highly effective models for a wide range of natural language processing applications. This fine-tuning usually requires adding just one extra output layer, which makes BERT flexible and adaptable without significant task-specific architecture changes.

BERT performs well on eleven distinct natural language processing tasks. It shows notable gains in SQuAD question-answering performance, MultiNLI accuracy, and GLUE score. As an example, BERT increases the GLUE score to 80.5%, which is a significant 7.7% absolute improvement.

Github: https://github.com/google-research/bert

Paper: https://arxiv.org/pdf/1810.04805.pdf

HF Project: https://huggingface.co/google-bert/bert-base-cased
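A minimal sketch of the "one extra output layer" idea: AutoModelForSequenceClassification places a freshly initialised classification head on top of the pre-trained encoder, which would then be fine-tuned on labelled data. The label count and input sentence are illustrative.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
# num_labels adds a randomly initialised classification layer on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=2
)

inputs = tokenizer("Open-source models are easy to fine-tune.", return_tensors="pt")
logits = model(**inputs).logits  # meaningful only after the head has been fine-tuned
print(logits.shape)  # torch.Size([1, 2])
```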

  11. Vicuna

LMSYS presented Vicuna-13B, an open-source chatbot created by fine-tuning the LLaMA model on user-shared conversations collected from ShareGPT. Vicuna-13B offers strong conversational capabilities and represents a significant step forward in open chatbot technology.

In an initial assessment that used GPT-4 as the judge, Vicuna-13B achieved more than 90% of the quality of well-known chatbots such as OpenAI ChatGPT and Google Bard, and it produced better responses than other models, such as LLaMA and Stanford Alpaca, in more than 90% of cases. Vicuna-13B is also highly cost-effective: training it costs around $300.

HF Project: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1

  12. Mistral

Mistral 7B v0.1 is a 7-billion-parameter language model engineered for efficiency and performance. Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks and surpasses the larger Llama 1 34B in areas such as reasoning, mathematics, and code.

The model uses grouped-query attention (GQA) to accelerate inference and sliding window attention (SWA) to handle sequences of varying length efficiently while reducing computational overhead. A fine-tuned version, Mistral 7B Instruct, is also provided and is optimized for instruction-following tasks.

HF Project: https://huggingface.co/mistralai/Mistral-7B-v0.1

Paper: https://arxiv.org/pdf/2310.06825.pdf
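A small sketch that reads the architectural details mentioned above straight from the model configuration; the attribute names follow the MistralConfig class in transformers.

```python
from transformers import AutoConfig

# Inspect the released configuration without downloading the model weights.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(config.num_attention_heads)   # query heads
print(config.num_key_value_heads)   # fewer KV heads than query heads -> grouped-query attention
print(config.sliding_window)        # window size used by sliding window attention
```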

  13. Gemma

Gemma is a family of state-of-the-art open models that Google built using the same research and technology as its Gemini models. These English-language, decoder-only large language models are intended for text-to-text applications, and their weights are openly available in both pre-trained and instruction-tuned variants. Gemma models perform well across a variety of text generation tasks, including summarization, reasoning, and question answering.

Gemma is unique in that it is lightweight, which makes it ideal for deployment in contexts with limited resources, like desktops, laptops, or personal cloud infrastructure. 

HF Project: https://huggingface.co/google/gemma-2b-it
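A minimal sketch for the instruction-tuned 2B checkpoint; the repository is gated, so accepting the licence on Hugging Face is assumed, and the prompt is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# The tokenizer's chat template produces Gemma's <start_of_turn>/<end_of_turn> format.
messages = [{"role": "user", "content": "Summarise why lightweight models matter, in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```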

  14. Phi-2

Microsoft introduced Phi-2, a Transformer model with 2.7 billion parameters. It was trained on a mix of data sources similar to Phi-1.5, augmented with a new source consisting of synthetic NLP texts and filtered web pages judged to be educational and safe. On benchmarks measuring reasoning, language understanding, and common sense, Phi-2 performs close to state of the art among models with fewer than 13 billion parameters.

HF Project: https://huggingface.co/microsoft/phi-2

  15. StarCoder2

StarCoder2 was introduced by the BigCode project, a collaborative effort focused on the responsible development of Large Language Models for Code (Code LLMs). Its training data, The Stack v2, is built on the Software Heritage (SWH) source code archive and spans 619 programming languages. Carefully selected additional high-quality sources, such as code documentation, Kaggle notebooks, and GitHub pull requests, make the training set four times larger than the original StarCoder dataset.

StarCoder2 models with 3B, 7B, and 15B parameters, trained on 3.3 to 4.3 trillion tokens, are evaluated on a comprehensive suite of Code LLM benchmarks. The results show that StarCoder2-3B outperforms similarly sized Code LLMs on most benchmarks and even surpasses StarCoderBase-15B. StarCoder2-15B matches or exceeds CodeLlama-34B, a model twice its size, and significantly outperforms models of similar size.

Paper: https://arxiv.org/abs/2402.19173

HF Project: https://huggingface.co/bigcode
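A minimal code-completion sketch with the 3B checkpoint; the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Give the model the start of a function and let it complete the body.
prompt = "def fibonacci(n):\n    \"\"\"Return the n-th Fibonacci number.\"\"\"\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```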

  16. Mixtral

Mistral AI released Mixtral 8x7B, a sparse mixture-of-experts (SMoE) model with open weights under an Apache 2.0 license. Mixtral delivers roughly six times faster inference than Llama 2 70B while outperforming it on most benchmarks, offering one of the best cost/performance trade-offs among open-weight models with a permissive license. It also matches or outperforms GPT-3.5 on most standard benchmarks.

Mixtral supports English, French, Italian, German, and Spanish, handles contexts of up to 32k tokens, and shows strong performance on code generation. It can also be fine-tuned into an instruction-following model, as demonstrated by the instruction-tuned version's score of 8.3 on MT-Bench.

HF Project: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1

Blog: https://mistral.ai/news/mixtral-of-experts/
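A sketch of loading Mixtral 8x7B with 4-bit quantization via bitsandbytes so the SMoE weights fit on a single high-memory GPU; this assumes the bitsandbytes and accelerate packages are installed, and the prompt is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"
# 4-bit weight quantization greatly reduces the memory needed to host the experts.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config, device_map="auto")

inputs = tokenizer("Open-weight models matter because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```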


Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

