Maty's Notes

The Race for Ultra Ethernet in Data Centers by Juniper Networks

Sun, 16 Jun 2024 00:00:00 +0200

This is my first post from the analyst week I was invited by The Futurum Group to Tech Field Day. Juniper Networks was our first day and this post will be focusing on Juniper’s vision and approach to AI Datacenter Networking.

There are more and more opinions that AI is at an inflection point. It seems that Generative AI has caused the public to wake up to AI technology, even though AI/ML has been around for at least the last 10 years. Whether this is true or not, I will leave for you to decide. What’s undoubtedly true is that this has started a race to build new infrastructure like we haven’t seen since the ’90s, and like in the ’90s, everything is about more and MORE bandwidth. Do you think 400G is enough? Forget it. 800G is now the minimum, and 1.6T is around the corner. And still, the network is the bottleneck.

So, first of all, why do you need an 800G network? If you run your single Dell server with 72 Blackwell GPUs and 400G uplinks, you have nothing to worry about. Your AI/ML compute is happening within the chassis, between GPUs connected via NVLink. But it seems like one chassis is not enough, and if your 72 Blackwell GPUs need to talk to another 72 Blackwell GPUs in another chassis, you have to start using RDMA - Remote Direct Memory Access, which allows writing to memory between two different hosts. Why do you need write into GPU memory? Because for instance you are building huge LLM models which can’t fit into single GPU memory. However, you will now face things which you haven’t faced in your data network DC:

1) Huge, and I mean HUGE, UDP single flows (elephant flows) = goodbye ECMP and goodbye any other load balancing because this is your east/west traffic. 2) Because your UDP flows are huge, any congestion control will become very tricky and will probably cause more harm than good (and Juniper folks confirmed this).

So, those are your challenges; what are your options?

1) Insert a smart NIC between the application and the network to run a full lossless network - This is what Nvidia is currently using for their InfiniBand architecture, where they are offloading lots of traffic control and management onto endpoint smart NICs. Simply put, if the receiver can’t keep up, it will ask the source to slow down - we are talking about RDMA, where hosts exchange memory data between hosts. 2) Use the current standard and get the most out of it (Ethernet) and, in the meantime, develop a new networking standard that will start addressing it (Ultra Ethernet) - Yes, you guessed right; that’s where Juniper (and Cisco, Dell, and many more vendors) have decided to go. How do they achieve RDMA over Ethernet? Using RoCEv2. In the future, we will see that the Ultra Ethernet Transport (UET) protocol will be a new RDMA protocol to replace RoCEv2.

Right, this is the terminology, and I hope it all makes more sense. So, what is Juniper’s view on the current state of AI/ML DC infrastructure?

Juniper’s opinion is that InfiniBand, with its complexity and approach to the Request/Grant mechanism, is not worth the effort, and they can prove that with the features and protocols they can deliver on top of Ethernet, it will make their solution superior to Nvidia’s InfiniBand.

Every data center network (as a matter of fact, any network in general) needs congestion avoidance and congestion control. We want to absolutely avoid any congestion, and to address challenges around elephant flow load balancing, Juniper will use DLB either via Flowlet mode, where a single flow will be split into multiple flows and load balanced, or using packet spraying.

Results? Evenly load balanced traffic across multiple uplinks:

So now, for congestion control, we have the industry-wide used DCQCN as Congestion Control. However, an interesting statement from Juniper was that we don’t need a lossless network for AI/ML workloads and it can be actually beneficial to let the application breath rather than chase 100% lossless.

Results with and without congestion control:

Now, when we “see some packet drops,” do not expect we are talking about 10%+ packet loss. We are talking about less than 0.1% packet loss. However, this goes against the general statement that you need a lossless network to have a reliable and high-performing AI/ML network.

Add to not needing lossless network the much lower TCO of Ethernet, and you have a compelling reason to keep your AI/ML workloads running on Ethernet or Ultra Ethernet in the future and not needing InfiniBand. At least, that’s Juniper’s view.

Edit: Corrected statement about not needing congestion control. Corrected statement about packet loss. Corrected typo with lossless network.

Dell Technologies World 2024 - Day 1

Tue, 21 May 2024 00:00:00 +0200

Dell Technologies is celebrating its 40th anniversary this year, and I’ve been invited by Dell Technologies and Arrow Electronics to Las Vegas to attend their annual Dell Technologies World 2024. Michael Dell immediately announced that this year’s Dell Technologies World conference is the “AI Edition,” which seemed to be the anticipated theme. He began his keynote by declaring that we are moving from the “age of computation” to the “age of cognition,” a statement I found quite profound. After experiencing the keynotes, it’s clear to me that few vendors are as well-prepared for this shift as Dell is.

Dell’s strategy involves AI Factories, a collaboration with NVIDIA announced over 12 months ago under the project name Helix, which had its official launch at the GTC 2024 a few months back. This Dell-validated architecture introduces end to end AI infrastructure for enterprises. The Dell AI Factory architecture is unique because it’s comprehensive, encompassing compute, networking, storage, and more importantly endpoints. Currently, I believe no other vendor offers such an architecture—perhaps HPE/Juniper might in the future, but integrating these two portfolios will require time.

The keynotes featured Silicon Valley legends, starting with Bill McDermott, the current CEO of ServiceNow (and former CEO of SAP), who discussed AI in his characteristic manner, describing it as “really big, really fast.” He highlighted how ServiceNow uses Dell’s infrastructure to develop their own LLM models. McDermott made an excellent point about the disarray of enterprise data and ServiceNow’s role in addressing it.

Following that, Michael Dell introduced none other than Jensen Huang. Listening to them discuss the AI Factory and how NVIDIA relies on Dell products for a complete AI delivery vision showcased a synergy rarely seen between two major vendors. This makes sense when considering Dell’s performance over the past year. Dell’s recent earnings reports show that their stock performance has almost matched Nvidia’s year-over-year increase of +100%, which is remarkable. This growth is unparalleled in the field, especially compared to Cisco or Hewlett Packard Enterprise, both of which have remained relatively stagnant over the past year.

Some notable announcements included:

The new PowerSwitch Z9864F, designed with Nvidia and built on the Broadcom Tomahawk 5 platform, is an AI-ready switch with low latency.
The new PowerScale F910, a storage solution for unstructured data, is a scale-out system focused on AI and the GenAI pipeline.
The new PowerStore Prime, an all-flash storage system for structured data.
The new PowerEdge XE9680, a blade solution with direct liquid cooling and eight NVIDIA Blackwell Tensor Core GPUs per 1RU, totaling 72 GPUs per chassis.
Five new Dell AI PCs powered by Qualcomm Snapdragon X Elite and Snapdragon X Plus, capable of running local models like Llama 3, etc.
Dell is the first infrastructure on-prem vendor partnering with Hugging Face offering optimized deployment for GenAI

Other highlights included a panel discussion with Dell and Nvidia leadership about the AI Factory announcement and how they were building this architecture in the last 12 months. However, the session that stood out to me the most was led by Ananta Nair from Dell Technologies and Hamid Shojanazeri from Meta , discussing the implementation of Llama3 on Dell’s infrastructure AI platform. It offered a deep dive into Llama3’s architecture and insights into how Dell uses Llama3 internally and documents the process.

From what I’ve observed at the keynotes, Dell appears to be exceptionally well-positioned. They have a clear and distinct vision and proposition that sets them apart from competitors. Dell and Nvidia are at the forefront, shaping the industry with innovative concepts like AI Factories and AI PCs—the best position to be in. The potential for endpoints, such as PCs that can run models locally, to unlock entirely new LLM use cases is immense. Now, it remains to be seen how effectively Dell will execute their vision. Like always it is all about execution.

KubeCon 2024 Day 1: A Decade of Kubernetes

Wed, 20 Mar 2024 00:00:00 +0100

As Kubernetes enters its second decade, the first day of KubeCon 2024 was marked by a series of great discussions and presentations.

The opening remarks were delivered by a panel of experts including Priyanka Sharma, Timothée Lacroix, co-founder of Mistral AI, Paige Bailey from Google Gen AI, and Jeffrey Morgan, the creator of Ollama. They discussed the gap between Development and AI Research, similar to the previous gap between Development and Operations. Machine learning teams often overlook containers, a point underscored by Paige Bailey who noted that Google’s AI infrastructure engineering team is struggling to keep up with the demand for training larger models.

A presentation on accelerating AI workloads with GPUs in Kubernetes was delivered by Kevin Klues and Sanjay from Nvidia. They discussed the current focus on GPU sharing and the various ways to achieve it, from multiple workloads on a single GPU to partitioning, virtualization, and CUDA abstraction. They also introduced Nvidia Picasso, a Kubernetes-driven solution for AI training.

Jorge Palma from Microsoft discussed building an open-source platform for AI/ML. He introduced KAITO, an OSS Operator to deploy LLM models, which provides an API endpoint to access models and supports Llama and Mistral.

One of the highlights of the day was a round table conversation about RAG, fine-tuning, and running LLMs on Kubernetes in both cloud and on-prem environments. In an unexpected turn of events, I found myself moderating the discussion after the scheduled speaker didn’t turn up :)

The day concluded with a presentation by Christian Posta from solo.io comparing Istio and Cilium - sidecar vs non-sidecar service mesh. The talk was particularly interesting for its focus on the benefits of eBPF and how it provides other features like observability and security.

In summary, the first day of KubeCon 2024 celebrated the achievements of Kubernetes over the past decade and looked ahead to the future of AI and LLMs. With its growing adoption in the AI community, Kubernetes is becoming the de facto platform for AI, setting the stage for an exciting future.

What did I Learn About AI in the Last 12 Months?

Tue, 19 Mar 2024 00:00:00 +0100

Have you ever wondered what the future of AI will look like? How will it change our lives, our work, our society? If you are like me, you are probably fascinated by the rapid developments and innovations in this field. In this post, I want to share with you some of my thoughts and observations on the current state and trends of AI, based on my personal experience and research.

The last 18 months have been absolutely fascinating. We have witnessed the release of ChatGPT3.5, a groundbreaking natural language generation model that can produce coherent and diverse texts on almost any topic. This model has sparked a lot of interest and excitement among both casual and technical users. For example, some people use ChatGPT to prepare a meal plan for the whole week, while others use it to generate code, poetry, or music. ChatGPT3.5 has also inspired an avalanche of new open source tools and models, as well as new vendors that offer various AI solutions and services. In Natilik, especially in the last 6 months, we have been trying to get our head around the use case for products like Co-Pilot, which is probably the first widely used ChatGPT product. We have also been exploring the infrastructure side of AI, such as open source LLMs, RAG, storage, network infrastructure, Azure OpenAI, etc.

It is 2024 and we are, in my humble opinion, in the beginning of building a completely new infrastructure, similar to what we saw in the late ’90s. Nvidia stocks are reaching extreme heights, comparable to Cisco stocks in the early 2000s. We see Nvidia creating and selling vital tools, much like Cisco did nearly 30 years ago. As we grappled with what to do with all the performance and bandwidth then, we are now pondering what to do with all these GPUs. However, there are some major differences between now and the late 90s. In the late 90s, Cisco had multiple competitors, such as Juniper, Nortel, or Lucent, which Nvidia, except for very recently AMD, really doesn’t have. I think the dominance of Nvidia will get only stronger in the next few years. We are also seeing a difference in who the major spending driver is. It isn’t telecommunications like it was in the 90s, as was the case with Cisco, but public cloud providers. But is it going to be the same for the future? With latest Nvidia announcement to form AI-RAN alliance bring 5G and AI together to open up platform for completely new workloads we may will see some shift.

While Nvidia, AWS, Azure, and Google don’t share specific figures, it’s estimated that 80-90% of Nvidia’s datacenter revenue this year is driven by cloud providers. It is not usually enterprises that are building new AI infrastructure, but cloud providers that have lots of free cash and don’t want to miss the opportunity to be part of this next generation of the infrastructure. As an example, Nvidia and Dell backlogs are so massive that it is expected that this will continue for another 2 or 3 years.

To see the bigger picture, we have to understand that Nvidia is only the tip of the iceberg. We are seeing a coming wave of new vendors in storage and compute, such as VAST, Weka, Groq, and others. We are exploring and reshaping approaches to the things we have started doing only 12 months ago. One of the examples is AI inference. Based on the Nvidia CEO, around 40% of their shipments are being used for this purpose. We have been building these models for the last 18 months and now we are coming into the stage where we want to actually run them - efficiently and fast. Because of this, we are already starting to see the rise of new vendors in the LPU market. Sam Altman’s plan to raise trillions of dollars for chip development only confirms that we are only at the start of something much bigger. Only time will tell how AI adoption will enable these new segments to grow. There are huge opportunities in this space. Nvidia H100 consumes up to 700W and Nvidia declered that their next gen B200 will consume up to 1000W at full utilization. As the scale and usage will only get higher we are throwing virtually endless amount of power and water into running AI infrastructure. This is not sustainable.

However, don’t get me wrong, most of the AI applications in the enterprise world are either in PoC or dev stage. We are still waiting for an AI use case or application that will cause massive AI adoption, like we were waiting after the internet bubble for the applications that would use all the new bandwidth we had built in the 90s, and we eventually got with YouTube, MySpace, and Netflix. As AI infrastructure grows and AI becomes integral to areas like healthcare, education, and environmental conservation, it will prepare the platform to create these “killer” applications.

For me this is month number 18 and we are only at the beginning. The Total Addressable Market (TAM) is massive (and it is actually still growing) and the attack on Nvidia revenues and margins is only about to start, which for the enterprises will push the cost down and drive the innovation, which will then enable much wider adoption among normal users. We will see lots of new vendors who will challenge Nvidia and try to take a slice of their revenues. I still have a lot to learn, I know I’m only at the beginning. I’m currently deep with open source LLMs and really do believe open sourcing LLMs is the right way to do things. The progress which open source LLMs have done in the last 12 months is staggering. I was testing one of the first LLaMAs models 12 months ago and when you compare it to the latest models like Gemma or Mixtral, it is amazing to see the progress. Open source LLMs will be the key driver for innovations in the infrastructure space, which I’m really excited about. I’m currently experimenting with local LLMs running on my M1 laptop, using Co-pilot, and testing Azure OpenAI, so I will try to drop more info from time to time about what I’m working on or what I’m testing.

Building AI Data platforms with WEKA

Wed, 25 Oct 2023 00:00:00 +0200

In the realm of AI and Machine Learning (AI/ML) as well as image processing, the necessity for high-performance storage cannot be stressed enough. This involves high I/O capabilities with minimal latency, ensuring that operations are swift and seamless.

Traditionally, many setups, whether in cloud environments like AWS, Azure, GCP, or Oracle, or on-premises, have been plagued by a primary challenge: storage that struggles to feed compute adequately. This issue is further intensified when when you are for instance company who is focusing on AI building LLM. These companies make significant investments in GPUs in public cloud, but often, the storage bottlenecks, characterized by prolonged load/write times, prevent them from achieving optimal performance.

To address these challenges, a more streamlined AI data pipeline is crucial. A typical workflow might look something like this: Data Ingestion => Metadata lookups => High I/O operations => Data storage. The ultimate goal? To feed GPUs faster, potentially reducing the number of GPUs required or do more with the same amount of GPUs.

One innovative solution to address exactly these challenges is WEKA. WEKA leverages the NVMe storage in AWS, Azure, GCP and Oracle instances and than aggregates 80-90% of the data of storage to i.e. S3 or Azure Blob. This pool is called WEKA namespace which creates object storage available to endpoints outside of WEKA namespace. It has been benchmarked to achieve an impressive 5M IOPs in AWS with 40X faster model deployment. Pretty impressive! There are significant cost savings tied to this approach as well. Organizations can sidestep the need to maintain multiple instances, thanks to features like autoscaling and data reduction. Furthermore, the flexibility of such a system allows for hybrid environments, functioning both on-premises and in the cloud or even spanning multiple cloud providers. Whether you’re looking at brownfield (existing) or greenfield (new) deployments, the system is designed to support both.

Auto-scaling is made even more efficient through defined groups in the cloud, and the data tiering in S3 offers added advantages. By chopping files into manageable pieces, parallelism is achieved. Meanwhile, smaller files can be merged into larger objects, optimizing storage. Pretty neat!

The support extends to multi-AZ deployments, with the only current limitation being the non-movement of data closer to instances but apparently this is something which is being explored. With robust solutions like AWS’s P4d and P5d instances, the core of these operations is fortified.

In case the cost is your priority you can always spin down the whole WEKA cluster with storing the data to your S3 or Azure Blob. Later if you spin up the cluster again it will re-hydrate the data from S3 and you are good to go. In conclusion, as AI/ML operations grow in complexity and demand, it’s vital for organizations to evaluate their storage solutions critically. Optimal performance isn’t just about computational power but ensuring that storage infrastructures can keep pace.

Multi-Cloud Networking Solution by Prosimo

Thu, 19 Oct 2023 00:00:00 +0200

There has been lots of good sessions and good conversations yesterday with vendors like Juniper or AMD but one company really stand out for me personally and that was Prosimo.io .

I was following Prosimo.io last few years for multiple reasons. I was always big believer in multi-cloud, the team who have founded Prosimo.io is full of impressive people who for instance co-founded Viptela (Acquired by Cisco) or senior personas behind Cisco ACI BU and VCs behind this startup. Combination of these factors very often means that this is going to be interesting company. And sure it is!

As lots of enterprises and Natilik clients embark on multi-cloud journeys, one of the key challenges they face is managing diverse cloud environments. The solution often lies in putting an abstraction layer on top of these clouds, ensuring they all function in a consistent manner. This abstraction helps in keeping operations uniform, simplifying the management of various clouds. This is achieved through the effective management of cloud connectivity, employing strategies such as simplifying Virtual Private Clouds (VPCs) and Virtual Networks (VNets), using cloud-native gateways, advanced Layer 3 networking, and Cloud Connect solutions.And this is exactly what Prosimo is trying to address - and obviously much more. Conversations moved away from “how to connect to cloud” into “how to operate and connect our multiple clouds environments”.

Use cases are usually migrate applications into the cloud or between clouds and not thinking about networking. One of the use cases are companies who acquires other companies and need to quickly connect their cloud environments into their. Prosimo deploys an “Elastic Edge” , which allows for dynamic scaling based on required throughput into every cloud region where you run your any of your workloads - it can be any PaaS or SaaS. In the core of this edge solutions is Kubernetes which plays a pivotal role in this strategy, making it possible to scale resources up or down as needed, in real-time.

In high level deploying Prosimo looks pretty much like this: 1) Discovery - connect into you each cloud environment from Prismo and do discovery to see what objects and what applications are in each cloud environment. 2) Deploy Prosimo Fabric across all environments. This is achieved by Prismo Edge - Kubernetes, deployed in Azure, AWS and GCP 3) Network - Build L3 network - onprem, AWS, Azure and 4) Secure - create segmentation so called “namespaces”. 5) Attach Services - attaching services (SQL, API management etc) to networking 6) Deploy Prosimo Policy - zero trust, not trusting anything

After this the world is your oyster. You can deploy policies can defined by multiple criteria like address, services, tags, etc. called “envelope” in Prosimo language which will define each application and how can communicate across multiple clouds.

Great feature was around observability where you can see have insight into your cloud environment from data and performance perspective but also cost control where you can view your cost via “Cost 360” and see immediately your multi-cloud cost in one single dashboard.

Overall very impressive product. I think it really address lots of challenges I’m seeing with our clients around managing and migrating multiple cloud environments which is becoming more and more adopted by our clients. Great job Prosimo.io !

Cloud Tech Field Day - Day 0

Wed, 18 Oct 2023 00:00:00 +0200

Before Cloud Tech Field Day kicks off tomorrow, I decided to make the most of my Day 0 in Silicon Valley by meeting with some of Natilik key Multi-Cloud vendors. The day was packed with insightful conversations, and the first meeting I had was to have a breakfast with our newest vendor in our portfolio Spectro Cloud with their CTO, Saad Malik and Tenry Fu , Spectro Cloud CEO.

We talked about inherent challenges faced by startups in securing funding. Saad, with his background as a co-founder, shared valuable insights into the tenacity required to navigate these hurdles. One of the most interesting parts of our conversation revolved around the ongoing debate between ARM and RISC-V. This isn’t just a technological competition; it’s a geopolitical shift too. We have discussed how this rivalry is shaping the tech industry and even impacting national interests.

SpectroCloud’s innovative products and solutions were a highlight. Saad spoke about their integration with K8sGPT and the introduction of LocalAI, demonstrating their commitment to staying at the cutting edge of technology. It was fascinating to hear how they’re shaping the future of Kubernetes management and edge computing.

We also discussed SpectroCloud’s crowning achievement – their successful endeavors with Kubernetes running on the edge, particularly with one of the leading coffee chains. This showcased how much their products are influenced by the unique needs of their clients, a trait commonly associated with startups.

After a morning filled with insights, I rushed to PureStorage’s office, where I was warmly hosted by Niki Armstrong, their Chief Legal Officer. Our discussion centered on the exciting announcements made at Pure Accelerate in London, with a particular focus on financing and DRaaS. Niki’s mention of PureStorage registering over 2000 patents in their relatively short existence was a testament to their commitment to innovation. It was also intriguing to learn that Pure CEO Charlie Giancarlo has registered some of these patents.

I was lucky enough to meet and spend some time with PureStorage CEO Charles Giancarlo . It was short but super interesting conversations about the current challenges and opportunities in various regions.

My day concluded with a dinner gathering, led by Stephen Foskett , and attended by our group of Tech Field Day delegates. The evening was marked by the sharing of Tech Field Day secrets and a sense of anticipation for the upcoming Tech Field Day :) It’s set to be a busy, action-packed day, starting early at 8 am and wrapping up after 5 pm. On to Tech Field Day 1!

Quick summary on Pure//Accelerate 2023

Fri, 16 Jun 2023 00:00:00 +0200

Before I will leave Pure Accelerate I wanted to quickly put together few thoughts on what I have seen this week in Las Vegas. This is quick dump which I have done in an hour so apologies for any typos. You could say that the only takeaway from Pure Accelerate should be that flash is displacing HDD which after many years of repeating the same message got little bit….flat but let’s not be cynical because if you look little bit closer there are some exciting innovations which are happening across PureStorage portfolio which will shape their portfolio for many following years.

I have listed ones which I think are the most impactful I have seen this week in Las Vegas:

75 TB DFMs

75 TBs on a single DFM and 150TBs on single DFMs will be announce this year - yeap! This will enable PureStorage to increase the density of the current FAs and squeeze the price per GB even more. Performance increase is impressive - 40% increase in performance for //X series and 60% increase in performance for //E series. Series //X will now have up to 1.5 PB (!!!!) raw capacity in 3RU physical factor. Controllers have been upgraded so you get Sapphire Rapids, DDR5 and support for PCIe5. With higher density it will obviously won’t make sense to maintain //X10s so say goodbye to good old //X10s. Higher density goes hand in hand with better stats on Ws/TBs - //X series improvement approx. 97% and //C series approx. 48%. Will this be enough especially when this year NVMe EDSFF should be able to close the gap from a capacity standpoint? I guess we will see!

Block storage in cloud

It feels like Cloud Block Store (CBS) maybe undergoes little bit of renaissance. Maybe it was the timing which wasn’t right for the product when it was introduced around 2018 or 2019 (can’t remember the year right now) but it looks like there are some new interesting use cases probably due to maturity of the market and clients. SQL databases replication between cloud and on-prem, data reduction and therefore improving cloud spent or to simplify and improve your block store in Azure and AWS which decrease your cloud spent and improves the operation side of your compute in public cloud. Probably the time is right now?

Kubernetes Object Storage

It is inventible that object storage will enter into Kubernetes world this or next year. You want to move your cloud native databases between cloud providers or your prem environment often using S3 storage as target but right now that’s difficult if not impossible as each cloud provider uses different standards for objects. It feels like Portworx quietly entering object storage arena with Portworx Object Service challenging vendors like MinIO who are defacto leaders in cloud native object storage. Portworx announced Portworx Object Service which feels like a stop gap before COSI (Container Object Storage Interface) will be finalised. Very excited where this is going.

The way we see and store data is changing

I’m not AI expert so I can’t tell how much AI will change things in our daily life what I’m pretty sure AI is changing is the cloud and datacenter infrastructure. The amount of data which are needed to build LLMs or use generative AIs requires huge amount of compute power and fast and large storage for the data. It will be interesting how these needs will be reflected into the products like Portworx where most of the infrastructure for AI run. Silos between cold, warm or any other types of storage are disappearing - if you are training model your all data has to be available immediately and in very high speed. PureStorage is perfectly positioned however Vast is challenging them mainly in AI/ML space with different architecture for instance garbage collection etc. Time will tell….

Cloud operating model

Last but not least is the cloud operating model. We in Natilik are always big believers in cloud operating model. We have onboarded HashiCorp who are the pioneers in cloud operating model for many years and helped lots of our clients on this journey. PureStorage always felt little bit behind - not so much from commercial side where Evergreen pay as you go model exists for a while or Portworx acquisition but mainly from operational side. PureStorage was working really hard last year on Fusion which I would call their orchestration tool. Imagine that your developers will need to provision new volume - you can now provide your developers set of Terraform modules which they can integrate into their current Terraform modules and deploy their application without asking “storage guys” to deploy new volume(s). It should also be able to rebalance the workloads dynamically. All this will run as part of Pure1. In the future I hope we will see the support for Portworx.

Overall great event. It has been great to meet people I haven’t seen since my last Pure Accelerate 2019 in Austin before the world went mental. Thank you PureStorage for having me and being so opened talk to me about your new products, ideas and innovations. Looking forward to the next one!

CCW Estimate API Part 2 - Acquire Estimate

Mon, 13 Apr 2020 00:00:00 +0200

Using CCW Estimate API

If you have have finished reading the previous post I’m must congratulate you. I know that it was really boring and yes it must have been difficult to understand sometime but you have made it so now let’s do that fun part - programming it and testing it. For the purpose of this series I will be using Python but I know about engineers I’m currently talking they are using Go for this. I’m hoping these posts will help other engineers and we will see using this for other languages!

Acquire Estimate with CCW Estimate APIs

As I have explained in the previous post we will have multiple requests which we can use for CCW Estimate API.

Available requets are:

List Estimate request - https://api.cisco.com/commerce/EST/v2/async/listEstimate
Update Estimate request - https://api.cisco.com/commerce/EST/v2/async/updateEstimate
Create Estimate request - https://api.cisco.com/commerce/EST/v2/async/createEstimate
Acquire Estimate request - https://api.cisco.com/commerce/EST/v2/async/acquireEstimate

I have decided to start with Acquire Estimate in this post and will focus on next requests in the future posts. I will explain in the following paragraph why I have decided to start with Acquire Estimate Request.

CCW Estimate API ‘blog project’ workflow

I wanted to create the workflow which will allow me to show you as many features as possible. I will start with acquireEstimate which will read the Estimate. The script will export the whole estimate into the text file and with all the details we will need for using any items from exported estimate for any new estimate.

Exported text file will contain all the items we will want to add into the new estimate. In our case it will be default built of Catalyst 9300 48 ports, Nexus 9300 and HyperFlex including all necessary licenses. Afterwards we will split these items into separate files so we can use them in the future as the reference points for adding these items to new estimates.

Let’s assume we have segregated file for Catalyst 9300, Nexus 9300 and HyperFlex. We will now have createEstimate script which will import these text files into the single XML request and added into CCW Estimate Create request and create new CCW estimate with the valid configuration.

Let’s start with Postman

General rule when you are automating stuff is to start with small steps and in my experience that always starts with testing the APIs and what they can offer. I have done the same thing with CCW Estimate API and I would recommend to do the same thing.

You can find how to set up Postman and where to download Postman files in the previous post.

Creating estimate

For purposes of this post let’s create new standard CCW estimate. I have created estimate with single C9300-48P-A. It will look like this:

CCW Estimate API Acquire Estimate - Authentication

I will start my script with importing all necessary variables: client_id client_secret username password from pass.txt in the root folder. Each item defined in the new line.

pass_file = open("./pass.txt","r") #opens file with name of "pass.txt"
fileList = pass_file.read().splitlines()
client_id = fileList[0]
client_secret = fileList[1]
username= fileList[2]
password = fileList[3]

We will need these variables to authenticate so now let’s add OAuth2 part which I have xplained in the previous post:

oauth = OAuth2Session(client=LegacyApplicationClient(client_id=client_id))
token = oauth.fetch_token(token_url='https://cloudsso.cisco.com/as/token.oauth2',
        username=username, password=password, client_id=client_id,
        client_secret=client_secret)

As you can see this is nothing new and it is using variables which we have already defined in the text file pass.txt.

CCW Estimate API Acquire Estimate - Head and Body

Now the funny part - head and body. I don’t want to tell you how much time I have spent on this as the CCW documentation didn’t really explain any of these details. Everything I will show you in this paragraph is the minimum amount of attributes and variable you will need in Head and Body of your request. If you will change any of these variables it will most likely fail with the series of totally unhelpful messages for instance ‘Invalid Solution Version ID’. Go figure what you have done wrong….

Anyway let’s start with the header. In the header we will use our token which have generated in the previous part of our script:

headers = {"Content-Type": "text/xml", "Authorization": "Bearer "+str(token['access_token'])}

We have the header now so let’s focus on the body of our Acquire Estimate request. We will use standard soapenv tags so the final template for Acquire Estimate will look like this:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:hs="https://api.cisco.com/commerce/EST/v2/async/acquireEstimate">
 <soapenv:Body>
  <ProcessQuote releaseID="2014" versionID="1.0" systemEnvironmentCode="Production" languageCode="en-US" xmlns="http://www.openapplications.org/oagis/10">
   <ApplicationArea>
    <Sender>
     <ComponentID schemeAgencyID="Cisco">B2B-3.0</ComponentID>
    </Sender>
    <Extension>
     <Code typeCode="Estimate">Estimate</Code>
    </Extension>
   </ApplicationArea>
   <DataArea>
    <Quote>
     <QuoteHeader>
      <ID typeCode="Estimate ID">$estimate_id</ID>
      <Extension>
      </Extension>
     </QuoteHeader>
    </Quote>
   </DataArea>
  </ProcessQuote>
 </soapenv:Body>
</soapenv:Envelope>

There are two important parts which you should be aware and that is ComponentID and QuoteHeader. ComponentID has to have variable schemeAgencyID="Cisco" defined as B2B-3.0. There is now way you will use any other scheme for your request. Any deviation from this will cause the error. Second important thing is QuoteHeader where as you can see we have a bit <ID typeCode="Estimate ID">$estimate_id</ID>. Here we will use our estimate ID (for instance PF92406645GK) which we will want to export.

When this is all done we can submit our request to CCW Estimate API:

out = requests.post("https://api.cisco.com/commerce/EST/v2/async/acquireEstimate", data=body, headers=headers)

CCW Estimate API Acquire Estimate - Response

If everything went well we should have the response stored in out variable. This is cool but working with response in XML format will not be much helpful. I have decided to pull the data from XML file with BeautifulSoup libraries as variables and use these variable when we will be creating the text file.

BeautifulSoup is an amazing Python library which essentially convert XML into the table with columns and rows and each row can contain multiple variables (columns).

First things first so let’s import xml response and use txt as the output and convert this into the table.

soup = BeautifulSoup(out.text, 'xml')
table = soup.Quote

Now when we have a table with x rows which is equal of x amount of items we will want to go through them item by item. We will achieve this by searching for all QuoteLine attributes in all items which will give us total number of times. Every item we have exported from the estimate has its unique QuoteLine.

table_rows = table.find_all ('QuoteLine')

We have total number of lines in our table therefore we can loop through every single item and get all the attributes we will need. However you may ask which attributes you would need to export (there are dozens of attributes for each item) and after long testing and reverse engineering I can tell you which attributes exactly are required:

Attribute	Description
CCWLineNumber	Item’s line number you see in CCW
PartNumber	Item’s part number
LineNumberID	Unique line number for each item within specific CCW estimate
Quantity	Quantity of items
ConfigurationPath	Cisco CCW internal attribute
ConfigurationSelectCode	Cisco CCW internal attribute
ItemType	Cisco CCW internal attribute
ParentID	Parent ID - linking sub-items with parent items
Duration	Important for licensing

Now when we know which attributes are required we will extract them from XML:

NameValue = tr.find_all('NameValue', {"name": "CCWLineNumber"})
id = tr.find_all('ID', {"typeCode": "PartNumber"})
lineId = tr.find_all('LineNumberID')
qt = tr.find_all('Quantity', {"unitCode": "each"})
configPath = tr.find_all('ValueText', {"typeCode": "ConfigurationPath"})
configSelectCode = tr.find_all('ValueText', {"typeCode": "ConfigurationSelectCode"})
itemType = tr.find_all('ValueText', {"typeCode": "ItemType"})
parentID = tr.find_all('ParentID')
duration = tr.find_all('Duration')

This can be tricky as some of the items has attribute duration but some of them don’t. For these reasons we have to make sure that if the attribute is not defined for particular item we will make it zero:

if not duration:
    durationLn = "0"
else:
    durationLn = [i.text for i in duration]
if not configPath:
    configPathLn = "0"
else:
    configPathLn = [i.text for i in configPath]
if not configSelectCode:
    configSelectCodeLn = "0"
else:
    configSelectCodeLn = [i.text for i in configSelectCode]
if not itemType:
    itemTypeLn = "0"
else:
    itemTypeLn = [i.text for i in itemType]

For purposes of this blog post we will now print out items line by line:

print (idLn[0] + "," + lineIdLn[0] + "," + NameValueLn[0] + "," + configPathLn[0] + "," + parentIDLn[0] + ","+ qtLn[0] + "," + durationLn[0] + "," + itemTypeLn[0]+ "," + configSelectCodeLn[0])

The output of our script will look something like this:

0,77003256742,C9300-48P-A,C9300-MODEL:PRODUCTNAME|C9300-48P-A,0,1,P0Y0M21DT0H0M,CONFIGURABLE,USER
1,77003257432,C9300-NW-A-48,C9300-MODEL:NETWORK_LICENSE|C9300-NW-A-48,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,SYSTEM
2,77003257433,S9300UK9-1612,C9300-MODEL:SOFTWARE|S9300UK9-1612,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
3,77003257434,PWR-C1-715WAC-P,C9300-MODEL:PRIMARY_POWER_SUPPLY|PWR-C1-715WAC-P,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
4,77003257435,PWR-C1-715WAC-P/2,C9300-MODEL:SECOND_POWER_SUPPLY|PWR-C1-715WAC-P/2,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
5,77003257436,CAB-TA-UK,C9300-MODEL:POWER_CABLES|CAB-TA-UK,77003256742,2,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
6,77003257437,SSD-120G,C9300-MODEL:SSD_MODULE|SSD-120G,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
7,77003257438,STACK-T1-50CM,C9300-MODEL:STACKWISE_CABLES|STACK-T1-50CM,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
8,77003257439,CAB-SPWR-30CM,C9300-MODEL:STACK_POWER_CABLES|CAB-SPWR-30CM,77003256742,1,P0Y0M14DT0H0M,NONCONFIGURABLE,USER
9,77003257440,C9300-DNA-P-48,C9300-MODEL:SUBSCRIPTION_SW_CLAS:2:CISCO_ONE_AND_DNA|C9300-DNA-P-48,77003256742,1,P0Y0M14DT0H0M,CONFIGURABLE,USER
9.0.1,77003257441,C9300-DNA-P-48-5Y,0,77003257440,1,P0Y60M0DT0H0M,SUBSCRIPTION,0
10,77003257442,ISE-BASE-T,C9300-MODEL:SUBSCRIPTION_SW_CLAS:2:CISCO_ONE_EXPANSIONS|ISE-BASE-T,77003256742,25,P0Y0M3DT0H0M,NONCONFIGURABLE,SYSTEM
10.0.1,77003257443,ISE-BASE-TRK-5Y,0,77003257442,25,P0Y60M0DT0H0M,SUBSCRIPTION,0

Summary

In this post were exploring how to use read the CCW Estimate, read the items from estimate and build Python script with Python libraries BeautifulSoup on top of it. You should be now able to build your own script also in other languages than Python. We can now export items from the CCW estimate which we will be use for our next step - creating new estimates but I will explain this in the another post.

Intro into CCW Estimate API

Tue, 07 Apr 2020 00:00:00 +0200

Intro

I work in pre-sales so working with CCW takes up a big part of my working day (joy!) and as a network automation enthusiast I’ve decided to automate the heck out of CCW. After a few people asked me to help them with CCW Estimate APIs, I decided to document the process so it can, hopefully, help others.

What is CCW?

CCW stands for Cisco Commerce Workspace. If you work in pre-sales or sales, you probably spend a significant amount of your time in CCW. CCW allows you to create BoMs, subscriptions, quotes, deal IDs and orders of Cisco equipment. As I mentioned above, I work in pre-sales so I use CCW Estimate mainly for building BoMs, and that will be the focus of a few of my upcoming blog posts.

Where to start?

You won’t find a CCW sandbox on DevNet or any community forum where you could ask questions concerning CCW. A lot of the stuff that I’ll be talking about here is therefore stuff I learnt through trial and error.

Before we actually start talking about CCW Estimate APIs, the first thing you should do is to set up a ClientID and Client Secret as these will be necessary for authentication.

Requesting your ClientID and Client Secret

Let’s start. You will need to login with your Cisco login at https://apiconsole.cisco.com/. Click on ‘My Apps & Keys’ and ‘Register a New App’.

When you are in the ‘Register a New App’ section, you will need to fill in some of the details relating to your new application. There are two important settings: ‘OAuth2.0 Credentials’, which needs to be configured as ‘Resource Owner Credentials’, and ‘Select APIs’ where you need to choose ‘CCW Estimate API’. This will generate a ClientID and Client Secret specific to this API.

Shortly after this you should receive an email with your ClientID and Client Secret.

CCW API Requests

CCW API Requests are very specific and a bit different from typical API requests, which you issue with respect to Cisco hardware or software. Firstly, CCW API requests use OAuth2.0 for authentication. For those who are not familiar with OAuth2.0 for this type of authentication, you will need a username, password and also a ClientID and Client Secret. I described how you can obtain a ClientID and Client Secret in the previous section.

Secondly, CCW API requests are all POST requests. Even if you query CCW API you are still carrying out a POST request.

Last but not least, if you need to send the body you use XML. For me, the body was the most challenging part in the entire process. CCW API is very sensitive about the use of the correct syntax and, as the relevant documentation can be a bit tricky, I would like to focus mainly on this in this post. There will be lots of XML parsing so wait for it! This is going to be so much fun:)

CCW API response is in the XML format so it can be a little problematic as well but I will help you to understand how this can be done. I’m of course not saying that my way is the only, or best, way. I would love to hear anyone’s suggestions for parsing data from an XML response, though I intend to focus more on this in my next post.

Authentication

You normally login to CCW with your CCO user account and you will use the same credentials for CCW Estimate APIs. CCW APIs use OAuth2.0 for authentication; you will therefore need a ClientID and a Client Secret on top of your username and password. I explained how you can get these details from the CCW API portal in the previous section.

Let’s say we have all the details we need: a CCO username and password, Client ID and Client Secret. We will now therefore need to obtain the token, which we’ll use for CCW API requests. For testing purposes, I would suggest starting with Postman to test that you have the right credentials and privileges. You can find exported Postman collections in the git repo in directory postman. Import the JSON file into your Postman and you should have 5 basic CCW API requests: AcqEstimate, createEstimate, getProduct, listEstimate and updateEstimate.

URLs which we will be using for three basic API call:

List Estimate request - https://api.cisco.com/commerce/EST/v2/async/listEstimate
Update Estimate request - https://api.cisco.com/commerce/EST/v2/async/updateEstimate
Create Estimate request - https://api.cisco.com/commerce/EST/v2/async/createEstimate

I would suggest that you start with listEstimate as it is the easiest one. listEstimate request will query the address https://api.cisco.com/commerce/EST/v2/async/listEstimate and it will list all the estimates available.

ListEstimate will run a POST API request to the URL https://api.cisco.com/commerce/EST/v2/async/listEstimate but let’s check the authorization first. Choose OAuth2.0 in authorization of the request and click on ‘Get New Access Token’. The below window should open:

When you click ‘Request Token’ you should obtain the token. Click on ‘Use Token’. Authorization will be exactly the same with the other CCW API requests as well.

Skeleton body of CCW API Request

Before we will click on ‘Send’, we should check what it is we’re actually sending in the body of the request. The body is always in the XML format wrapped in something called soapenv. soapenv stands for SOAP Envelope and it helps to indicate the start and the end of the message so that the receiver knows when an entire message has been received. Thanks to SOAP Envelope you know when you are done receiving a message and this is ready to be processed. SOAP Envelope is therefore basically a packaging mechanism.

When you click on body in Postman, you will see this:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:hs="https://api.cisco.com/commerce/EST/v2/async/listEstimate">
 <soapenv:Body>
 <GetQuote releaseID="2014" versionID="1.0" systemEnvironmentCode="Production" languageCode="en-US" xmlns="http://www.openapplications.org/oagis/10">
 	 <ApplicationArea>
 	  <Sender>
       <ComponentID schemeAgencyID="Cisco">B2B-3.0</ComponentID>
    </Sender>
 	 </ApplicationArea>
     <DataArea>
		    <Get maxItems="25">
		      <Expression expressionLanguage="SortBy">LAST_MODIFIED</Expression>
		    </Get>
	 </DataArea>
 </GetQuote>
 </soapenv:Body>
</soapenv:Envelope>

You will see in further sections that the body of the request always starts and ends with <soapenv:Envelope> and </soapenv:Envelope>. <ComponentID schemeAgencyID="Cisco">B2B-3.0</ComponentID> is also an important part. If you use a different schemeAgencyID value, the request will not be valid.

$xmlns:hs needs to be defined based on the type of the request:

List Estimate request - https://api.cisco.com/commerce/EST/v2/async/listEstimate
Update Estimate request - https://api.cisco.com/commerce/EST/v2/async/updateEstimate
Create Estimate request - https://api.cisco.com/commerce/EST/v2/async/createEstimate

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:hs="$xmlns">
 <soapenv:Body>
 <GetQuote releaseID="2014" versionID="1.0" systemEnvironmentCode="Production" languageCode="en-US" xmlns="http://www.openapplications.org/oagis/10">
 	 <ApplicationArea>
 	  <Sender>
       <ComponentID schemeAgencyID="Cisco">B2B-3.0</ComponentID>
    </Sender>
 	 </ApplicationArea>
     <DataArea>

     ........

	 </DataArea>
 </GetQuote>
 </soapenv:Body>
</soapenv:Envelope>

Skeleton body of CCW API Response

Now that we know what the authorization and the request body look like, we will have a look at the structure of the response. As I’ve mentioned before, the response is structured in the XML format.

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
    <soapenv:Body>
        <AcknowledgeQuote xmlns="http://www.openapplications.org/oagis/10">
            <ApplicationArea>
                <CreationDateTime>2020-03-19T08:17:37Z</CreationDateTime>
                <BODID schemeAgencyID="Cisco" schemeVersionID="1.0">3b89bde2-348f-4630-96db-22fd9c10af02</BODID>
            </ApplicationArea>
            <DataArea>
                ........
            </DataArea>
        </AcknowledgeQuote>
    </soapenv:Body>
</soapenv:Envelope>

You can see that the body of the response is very similar to the request, however there are some minor differences. For instance soapenv:Body contains AcknowledgeQuote and not the GetQuote attribute.

Summary

Let’s summarise what we have learned so far. CCW API requests and responses are encoded in the XML format. We’ve gone through, step by step, how to get a ClientID and Client Secret, which are necessary for CCW API authentication. We have showed that each request is wrapped in SOAP Envelope and explained some of its benefits. CCW requests are always sent using the POST method via REST API.

In the next post I will focus on specific examples of how to create, list and read CCW estimates. Stay tuned :)

Sources

https://apiconsole.cisco.com
https://apiconsole.cisco.com/docs/read/external_apis/CCW_Estimate_API
https://git.matyasprokop.com/mprokop/ccw-tools-blog-repo